Codon usage in individual genes has been calculated using the nucleotide sequence data obtained from the GenBank Genetic Sequence Database. The compilation of codon usage is synchronized with each major release of GenBank.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Codon similarity data in ATTED-II ver 8.0
The gene-to-gene codon similarity data is organized in the form of tables, each named according to the Entrez Gene ID of a particular query gene. Each table encompasses three columns, specifying: the Entrez Gene ID of a corresponding gene, an MR (Mutual Rank) value (where a smaller number signifies a stronger relationship), and a Pearson correlation coefficient (where a larger number suggests a stronger association).
Protein-coding sequences utilized in this study were retrieved from NCBI's RefSeq database. For each gene, a 61-dimensional vector was derived from the count of codons in the protein-coding sequence. In instances where multiple RefSeq sequences were associated with a single gene, the longest sequence was selected for the codon usage calculation. Pearson correlation coefficients (PCCs) were determined between the vectors of any two given genes. These PCCs were subsequently converted into MRs, employed as an index to evaluate the similarity in codon usage between the genes.
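The pipeline described above (codon-count vectors, pairwise PCCs, conversion to MR) can be sketched as follows. This is a minimal illustration, not the ATTED-II code: the function names are hypothetical, the standard genetic code is assumed when choosing the 61 sense codons, and MR is assumed to be the geometric mean of the two directional ranks, which the description above does not spell out.

```python
from itertools import product
import math

BASES = "TCAG"
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
# The 61 sense codons, in a fixed order shared by every vector.
SENSE_CODONS = ["".join(c) for c in product(BASES, repeat=3)
                if "".join(c) not in STOPS]


def codon_vector(cds):
    """Return the 61-dimensional codon-count vector of one coding sequence."""
    counts = dict.fromkeys(SENSE_CODONS, 0)
    cds = cds.upper().replace("U", "T")
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # stop codons and ambiguous codons are skipped
            counts[codon] += 1
    return [counts[c] for c in SENSE_CODONS]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def mutual_ranks(vectors):
    """Map each ordered gene pair (a, b) to its mutual rank.

    Assumption: MR(a, b) = sqrt(rank of b among a's partners *
    rank of a among b's partners), with rank 1 = highest PCC.
    """
    genes = list(vectors)
    pcc = {(a, b): pearson(vectors[a], vectors[b])
           for a in genes for b in genes if a != b}
    rank = {}
    for a in genes:
        ordered = sorted((b for b in genes if b != a),
                         key=lambda b: -pcc[(a, b)])
        for r, b in enumerate(ordered, start=1):
            rank[(a, b)] = r
    return {(a, b): math.sqrt(rank[(a, b)] * rank[(b, a)])
            for a in genes for b in genes if a != b}
```

With the longest RefSeq coding sequence chosen per gene, `mutual_ranks({gene_id: codon_vector(cds), ...})` would give one MR per gene pair, where a smaller MR means more similar codon usage.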
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We examined codon usage frequencies in the genomic coding DNA of a large sample of diverse organisms from different taxa tabulated in the CUTG database. We further manually curated and harmonized these existing entries by re-classifying CUTG's bacteria ('bct') class into archaea ('arc'), plasmids ('plm'), and bacteria proper (keeping the original label 'bct'). The reclassification within the original 'bct' domain was simplified by extracting the different genus names of the entries from the files 'qbxxx.spsum.txt' (where xxx = bct (bacteria), inv (invertebrates), mam (mammals), pln (plants), pri (primates), rod (rodents), vrt (vertebrates)) and classifying by genus. There were 514 different genus names. The genus categories were checked and relabeled as 'arc' where appropriate. Among the eubacterial entries, bacterial genomes proper (keeping the original label 'bct') were distinguished from bacterial plasmids (now labeled 'plm').
Column 1: Kingdom
Column 2: DNAtype
Column 3: SpeciesID
Column 4: Ncodons
Column 5: SpeciesName
Columns 6-69: codon (header: nucleotide bases; entries: frequency of usage (5 digit floating point number))
The 'Kingdom' is a 3-letter code corresponding to 'xxx' in the CUTG database name: 'arc' (archaea), 'bct' (bacteria), 'phg' (bacteriophage), 'plm' (plasmid), 'pln' (plant), 'inv' (invertebrate), 'vrt' (vertebrate), 'mam' (mammal), 'rod' (rodent), 'pri' (primate), and 'vrl' (virus) sequence entries. Note that the CUTG database does not contain 'arc' and 'plm'; we curated these two classes manually.
The 'DNAtype' is denoted as an integer for the genomic composition in the species: 0-genomic, 1-mitochondrial, 2-chloroplast, 3-cyanelle, 4-plastid, 5-nucleomorph, 6-secondary_endosymbiont, 7-chromoplast, 8-leucoplast, 9-NA, 10-proplastid, 11-apicoplast, and 12-kinetoplast.
The species identifier ('SpeciesID') is an integer that uniquely identifies the entries of an organism. It is an accession identifier for each different species in the original CUTG database, followed by the first item listed in each genome.
The number of codons ('Ncodons') is the algebraic sum of the numbers listed for the different codons in an entry of CUTG. Codon frequencies are normalized to the total codon count: each frequency listed in the data file is the number of occurrences of that codon divided by 'Ncodons'.
The species' name ('SpeciesName') is a string purged of commas (which are replaced by spaces). It is a descriptive label of the species for data interpretation.
Lastly, the codon frequencies ('codon'), including 'UUU', 'UUA', 'UUG', 'CUU', etc., are recorded as floats with five decimal digits.
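The normalization described for 'Ncodons' and the codon columns can be illustrated with a short sketch. The function name and the toy counts are hypothetical, and rounding to five decimals is an assumption based on the stated "5 digit" format.

```python
def codon_frequencies(counts):
    """Return (Ncodons, frequencies) for a dict of raw codon counts.

    Ncodons is the sum of all codon counts in the entry; each frequency
    is that codon's count divided by Ncodons, rounded to five decimals
    (assumed to match the data file's precision).
    """
    ncodons = sum(counts.values())
    if ncodons == 0:
        raise ValueError("entry contains no codons")
    freqs = {codon: round(n / ncodons, 5) for codon, n in counts.items()}
    return ncodons, freqs


# Toy entry with only three codons, for illustration.
ncodons, freqs = codon_frequencies({"UUU": 3, "UUA": 1, "UUG": 4})
```

Going the other way, multiplying a listed frequency by 'Ncodons' recovers the raw count only approximately, because of the rounding.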
Khomtchouk BB: 'Codon usage bias levels predict taxonomic identity and genetic composition'. bioRxiv, 2020, doi: 10.1101/2020.10.26.356295.
Nakamura Y, Gojobori T, Ikemura T: 'Codon usage tabulated from international DNA sequence databases: status for the year 2000'. Nucleic Acids Research, 2000, 28:292.
Extend Biology Research.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transfer RNAs (tRNAs) play critical roles in human cancer. Currently, no database provides the expression landscape and clinical relevance of tRNAs across a variety of human cancers. Utilizing miRNA-seq data from The Cancer Genome Atlas, we quantified the relative expression of tRNA genes and merged them into the codon level and amino acid level across 31 cancer types. The expression of tRNAs is associated with clinical features such as patient smoking history and overall survival, as well as disease stage, subtype, and grade. We further analysed codon frequency and amino acid frequency for each protein coding gene and linked alterations of tRNA expression with protein translational efficiency. We include these data resources in a user-friendly data portal, tRic (tRNA in cancer, https://hanlab.uth.edu/tRic/ or http://bioinfo.life.hust.edu.cn/tRic/), which can be of significant interest to the research community.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Haplotype designations 1A-16A are by Sahagun-Ruiz et al. [2]. B, C and D show haplotypes in which the SNP does not change the amino acid compared to A [4]. The table includes the FPR1 SNPs in the following order: c.32C>T/p.T11I, c.140T>C/p.V47A, c.301G>C/p.V101L, c.306T>C/p.F102F, c.348C>T/p.I116I, c.546C>A/p.P182P, c.568A>T/p.R190W, c.576T>G>C/p.N192K, c.993C>T/p.T331T, c.1037C>A/p.A356E. The codon bias results show the differences between the various haplotypes based on the total of each SNP codon usage score, as obtained from the GenBank Homo sapiens Codon Usage Database (http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606). The codon pair bias results show the differences between the various haplotypes based on the total of each SNP codon pair score, as calculated from the Supplemental Material by Coleman et al. (www.sciencemag.org/cgi/content/full/320/5884/1784/DC1) [6]. Amino acids are shown in single letter code. The nucleotide in the 3rd position of the synonymous codons is as shown.
We present a comprehensive analysis of stop codon usage in bacteria by analyzing over eight million coding sequences of 4684 bacterial sequences. Using a newly developed program called "stop codon counter," the frequencies of the three classical stop codons TAA, TAG, and TGA were analyzed, and a publicly available stop codon database was built.
Datafiles contain: 1) Complete data set of stop codon usage of all analyzed sequences as described in the publication "Comprehensive Analysis of Stop Codon Usage in Bacteria and Its Correlation with Release Factor Abundance" by Korkmaz et al (2014).
2) The Java program that was used for the analysis of the coding sequences. Execute the file in Program\ProjectStopCodonCounter\dist
The dataset was originally published in DiVA and moved to SND in 2024.
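The kind of tally described for this dataset, counting TAA, TAG, and TGA as terminal codons of coding sequences, can be sketched in a few lines. This is only an illustration in Python with hypothetical function names; it is not the Java "stop codon counter" program distributed with the dataset, and it assumes each input sequence is in frame and includes its stop codon.

```python
from collections import Counter

STOP_CODONS = ("TAA", "TAG", "TGA")


def count_stop_codons(coding_sequences):
    """Tally the terminal stop codon of each coding sequence.

    Sequences whose last three bases are not one of the three
    classical stop codons are ignored.
    """
    tally = Counter({stop: 0 for stop in STOP_CODONS})
    for cds in coding_sequences:
        last = cds.upper()[-3:]
        if last in tally:
            tally[last] += 1
    return tally


def stop_codon_frequencies(tally):
    """Convert a stop codon tally into relative frequencies."""
    total = sum(tally.values())
    return {stop: (tally[stop] / total if total else 0.0)
            for stop in STOP_CODONS}


tally = count_stop_codons(["ATGAAATAA", "ATGTGA", "ATGCCCTAA", "ATGGGG"])
freqs = stop_codon_frequencies(tally)
```

In this toy input, the last sequence has no stop codon and is skipped, so the frequencies are computed over three sequences.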
Background: Correlations between genome composition (in terms of GC content) and usage of particular codons and amino acids have been widely reported, but poorly explained. We show here that a simple model of processes acting at the nucleotide level explains codon usage across a large sample of species (311 bacteria, 28 archaea and 257 eukaryotes). The model quantitatively predicts responses (slope and intercept of the regression line on genome GC content) of individual codons and amino acids to genome composition. Results: Codons respond to genome composition on the basis of their GC content relative to their synonyms (explaining 71-87% of the variance in response among the different codons, depending on measure). Amino-acid responses are determined by the mean GC content of their codons (explaining 71-79% of the variance). Similar trends hold for genes within a genome. Position-dependent selection for error minimization explains why individual bases respond differently to directional mutation pressure. Conclusions: Our model suggests that GC content drives codon usage (rather than the converse). It unifies a large body of empirical evidence concerning relationships between GC content and amino-acid or codon usage in disparate systems. The relationship between GC content and codon and amino-acid usage is ahistorical; it is replicated independently in the three domains of living organisms, reinforcing the idea that genes and genomes at mutation/selection equilibrium reproduce a unique relationship between nucleic acid and protein composition. Thus, the model may be useful in predicting amino-acid or nucleotide sequences in poorly characterized taxa.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relative synonymous codon usage (RSCU) was calculated for all genomes in the Y1000+ database. The RSCU values for orders with a CTG codon reassignment were computed taking the reassignment into account. The script used to generate the RSCU values can be found here: https://github.com/The-Lab-LaBella/RSCU_Calculation_Analysis. The file contains the names of the assemblies, as listed in supplemental data 1 of Opulente et al. 2024. The columns contain the clade/order and the codons analyzed.
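As a rough illustration of how a CTG reassignment changes RSCU, the sketch below uses the usual definition (a codon's observed count divided by the mean count of all codons encoding the same amino acid). It is not the script from the repository linked above; the function names and the `ctg_as` parameter are hypothetical, and which amino acid CTG is reassigned to in a given order must be supplied by the caller.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (NCBI translation table 1).
STANDARD_AAS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")


def genetic_code(ctg_as=None):
    """Return a codon -> amino acid map, optionally reassigning CTG."""
    codons = ["".join(c) for c in product(BASES, repeat=3)]
    code = dict(zip(codons, STANDARD_AAS))
    if ctg_as is not None:
        code["CTG"] = ctg_as  # e.g. "S" where CTG is read as serine
    return code


def rscu(counts, code):
    """Compute RSCU for every sense codon from raw codon counts.

    RSCU = count * (family size) / (total count of the family),
    i.e. the count divided by the family's mean count. Families with
    no observations get 0.0.
    """
    families = {}
    for codon, aa in code.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for family in families.values():
        total = sum(counts.get(c, 0) for c in family)
        for c in family:
            values[c] = counts.get(c, 0) * len(family) / total if total else 0.0
    return values


counts = {"CTG": 6, "TCT": 6}
standard = rscu(counts, genetic_code())
reassigned = rscu(counts, genetic_code(ctg_as="S"))
```

Under the standard code CTG is compared with the other five leucine codons; with `ctg_as="S"` it joins the serine family instead, which then has seven members, so the same counts give a different RSCU.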
Phylogenomic analyses of ancient relationships are usually performed using amino acid data, but it is unclear whether amino acids or nucleotides should be preferred. With the 2-fold aim of addressing this problem and clarifying pancrustacean relationships, we explored the signals in the 62 protein-coding genes carefully assembled by Regier et al. in 2010. With reference to the pancrustaceans, this data set infers a highly supported nucleotide tree that is substantially different to the corresponding, but poorly supported, amino acid one. We show that the discrepancy between the nucleotide-based and the amino acids-based trees is caused by substitutions within synonymous codon families (especially those of serine: TCN and AGY). We show that different arthropod lineages are differentially biased in their usage of serine, arginine, and leucine synonymous codons, and that the serine bias is correlated with the topology derived from the nucleotides, but not the amino acids. We suggest that a ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Snapshot of the data used to run downstream analysis. A detailed methods description can be found in the manuscript and the associated analysis code.
Database includes genomic codon-pair and dinucleotide statistics of all organisms with sequenced genome. Facilitates genetic variation analyses and recombinant gene design. Derived from all available GenBank and RefSeq data.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Protein-Protein, Genetic, and Chemical Interactions for Fu Y (2022):Host adaptation of codon usage in SARS-CoV-2 from mammals indicates potential natural selection and viral fitness. curated by BioGRID (https://thebiogrid.org); ABSTRACT: SARS-CoV-2 infection, which is the cause of the COVID-19 pandemic, has expanded across various animal hosts, and the virus can be transmitted particularly efficiently in minks. It is still not clear how SARS-CoV-2 is selected and evolves in its hosts, or how mutations affect viral fitness. In this report, sequences of SARS-CoV-2 isolated from human and animal hosts were analyzed, and the binding energy and capacity of the spike protein to bind human ACE2 and the mink receptor were compared. Codon adaptation index (CAI) analysis indicated the optimization of viral codons in some animals such as bats and minks, and a neutrality plot demonstrated that natural selection had a greater influence on some SARS-CoV-2 sequences than mutational pressure. Molecular dynamics simulation results showed that the mutations Y453F and N501T in mink SARS-CoV-2 could enhance the binding of the viral spike to the mink receptor, indicating the involvement of these mutations in natural selection and viral fitness. Receptor binding analysis revealed that the mink SARS-CoV-2 spike interacted more strongly with the mink receptor than the human receptor. Tracking the variations and codon bias of SARS-CoV-2 is helpful for understanding the fitness of the virus in virus transmission, pathogenesis, and immune evasion.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data generated by CUBseq on Escherichia coli RNA-seq data.
ExpressionData: The expression data from Tribolium castaneum whole body and reproductive tract samples provided here is the output of ArrayStar gene expression software, which was used to process and normalize NimbleGen-generated raw expression data (Prince, Kirkland and Demuth 2010, Genome Biol Evol 2:336-346).
The genetic code in mRNA is redundant, with 61 sense codons translated into 20 different amino acids. Individual amino acids are encoded by up to six different codons but within codon families some are used more frequently than others. This phenomenon is referred to as synonymous codon usage bias. The genomes of free-living unicellular organisms such as bacteria have an extreme codon usage bias and the degree of bias differs between genes within the same genome. The strong positive correlation between codon usage bias and gene expression levels in many microorganisms is attributed to selection for translational efficiency. However, this putative selective advantage has never been measured in bacteria and theoretical estimates vary widely. By systematically exchanging optimal codons for synonymous codons in the tuf genes we quantified the selective advantage of biased codon usage in highly expressed genes to be in the range 0.2–4.2 × 10^-4 per codon per generation. These data quantify for the first time the potential for selection on synonymous codon choice to drive genome-wide sequence evolution in bacteria, and in particular to optimize the sequences of highly expressed genes. This quantification may have predictive applications in the design of synthetic genes and for heterologous gene expression in biotechnology.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Protein-Protein, Genetic, and Chemical Interactions for Fu J (2016):Codon usage affects the structure and function of the Drosophila circadian clock protein PERIOD. curated by BioGRID (https://thebiogrid.org); ABSTRACT: Codon usage bias is a universal feature of all genomes, but its in vivo biological functions in animal systems are not clear. To investigate the in vivo role of codon usage in animals, we took advantage of the sensitivity and robustness of the Drosophila circadian system. By codon-optimizing parts of Drosophila period (dper), a core clock gene that encodes a critical component of the circadian oscillator, we showed that dper codon usage is important for circadian clock function. Codon optimization of dper resulted in conformational changes of the dPER protein, altered dPER phosphorylation profile and stability, and impaired dPER function in the circadian negative feedback loop, which manifests into changes in molecular rhythmicity and abnormal circadian behavioral output. This study provides an in vivo example that demonstrates the role of codon usage in determining protein structure and function in an animal system. These results suggest a universal mechanism in eukaryotes that uses a codon usage "code" within genetic codons to regulate cotranslational protein folding.
Variation in synonymous codon usage is abundant across multiple levels of organization: between codons of an amino acid, between genes in a genome, and between genomes of different species. It is now well understood that variation in synonymous codon usage is influenced by mutational bias coupled with both natural selection for translational efficiency and genetic drift, but how these processes shape patterns of codon usage bias across entire lineages remains unexplored. To address this question, we used a rich genomic data set of 327 species that covers nearly one third of the known biodiversity of the budding yeast subphylum Saccharomycotina. We found that, while genome-wide relative synonymous codon usage (RSCU) for all codons was highly correlated with the GC content of the third codon position (GC3), the usage of codons for the amino acids proline, arginine, and glycine was inconsistent with the neutral expectation where mutational bias coupled with genetic drift drive codon usage. Examination between genes’ effective numbers of codons and their GC3 contents in individual genomes revealed that nearly a quarter of genes (381,174/1,683,203; 23%), as well as most genomes (308/327; 94%), significantly deviate from the neutral expectation. Finally, by evaluating the imprint of translational selection on codon usage, measured as the degree to which genes’ adaptiveness to the tRNA pool were correlated with selective pressure, we show that translational selection is widespread in budding yeast genomes (264/327; 81%). These results suggest that the contribution of translational selection and drift to patterns of synonymous codon usage across budding yeasts varies across codons, genes, and genomes; whereas drift is the primary driver of global codon usage across the subphylum, the codon bias of large numbers of genes in the majority of genomes is influenced by translational selection.
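The comparison of effective numbers of codons (ENC) with GC3 mentioned in this abstract can be illustrated with a small sketch. It assumes the neutral expectation is Wright's (1990) curve, ENC = 2 + s + 29 / (s^2 + (1 - s)^2) with s the GC3 fraction, which the abstract does not state explicitly. The function names are hypothetical, and this GC3 helper counts every in-frame third position, including that of a terminal stop codon if one is present.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame sequence."""
    cds = cds.upper().replace("U", "T")
    thirds = [cds[i + 2] for i in range(0, len(cds) - 2, 3)]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


def expected_enc(gc3_fraction):
    """Expected ENC under mutational bias alone (Wright's curve, assumed)."""
    s = gc3_fraction
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


s = gc3("ATGGCCAAG")       # third positions are G, C, G
neutral = expected_enc(s)  # expected ENC for that GC3 value
```

A gene whose observed ENC lies well below `expected_enc(gc3(cds))` shows more codon bias than GC3 alone would predict, which is the kind of deviation the abstract counts.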
Phylogenetic analyses using concatenation of genomic-scale data have been seen as the panacea to resolving the incongruences among inferences from few or single genes. However, phylogenomics may also suffer from systematic errors, due to the, perhaps cumulative, effects of saturation, among-taxa compositional (GC content) heterogeneity, or codon-usage bias plaguing the individual nucleotide loci that are concatenated. Here we provide an example of how these factors affect the inferences of the phylogeny of early land plants based on mitochondrial genomic data. Mitochondrial sequences evolve slowly in plants and hence are thought to be suitable for resolving deep relationships. We newly assembled mitochondrial genomes from 20 bryophytes, complemented these with 40 other streptophytes (land plants plus algal outgroups), compiling a data matrix of 60 taxa and 41 mitochondrial genes. Homogeneous analyses of the concatenated nucleotide data resolve mosses as sister-group to the remaining lan...
Synonymous codon usage (SCU) patterns are shaped by a balance between mutation, drift, and natural selection. To date, detection of translational selection in vertebrates has proven to be a challenging task, obscured by small long-term effective population sizes in larger animals and the existence of isochores in some species. The consensus is that, in such species, natural selection is either completely ineffective at overcoming mutational pressures and genetic drift or perhaps is effective but so weak that it is not detectable. The aim of this research is to understand the interplay between mutation, selection, and genetic drift in vertebrates. We observe that although variation in mutational bias is undoubtedly the dominant force influencing codon usage, translational selection acts as a weak additional factor influencing synonymous codon usage. These observations indicate that translational selection is a widespread phenomenon in vertebrates and is not limited to a few species.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These codons are significantly enriched at the estimated A-site in the top 10% of normalized footprint counts using a Bonferroni-corrected p-value of 8.2 × 10^-4 (0.05/61). These codons are also analyzed with respect to each bias measure, such that a larger negative number indicates a stronger correspondence with the model. Note that there are only four bias measures listed (compared to the five codon usage models analyzed earlier) as the High-Phi %MinMax and High-Phi CAI models use the same underlying CUB measure.
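The corrected threshold quoted above follows from dividing the significance level by the number of tests, here one per sense codon. A minimal check, with a hypothetical helper name:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# 0.05 spread over the 61 sense codons tested.
threshold = bonferroni_threshold(0.05, 61)
print(f"{threshold:.1e}")  # prints 8.2e-04
```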