54 datasets found

The GenBank Non-Redundant Protein Sequence Database (NRDB)
fungidb.org
piroplasmadb.org
Updated Aug 16, 2019
Cite
(2019). The GenBank Non-Redundant Protein Sequence Database (NRDB) [Dataset]. https://fungidb.org/fungidb/app/record/dataset/DS_a7163a9f0d
Dataset updated
Aug 16, 2019
Description
The GenBank non-redundant protein sequence database (NRDB) is a component of the NCBI BLAST databases and contains entries from GenPept, Swiss-Prot, PIR, PRF, PDB and NCBI RefSeq.
Darwin: an amino acid sequence collection of complete proteomes from...
zenodo.org
explore.openaire.eu
bin, pdf, xls
Updated Jul 22, 2024
Cite
Joe Win; Sophien Kamoun (2024). Darwin: an amino acid sequence collection of complete proteomes from eukaryotes with different phylogenetic affinities (v. 03_2020_137) [Dataset]. http://doi.org/10.5281/zenodo.3699564
Available download formats: xls, pdf, bin
Unique identifier
https://doi.org/10.5281/zenodo.3699564
Dataset updated
Jul 22, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Joe Win; Sophien Kamoun
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background

Every time we find an interesting gene in an organism of interest, the first question is often “how widely is this gene distributed in the eukaryotic kingdom?”. Naturally, one could use an NCBI BLAST search against the non-redundant sequence database provided by GenBank to answer this question. However, it can be cumbersome to parse the results and assign them to taxonomic units. It is also not straightforward to get an overview of which eukaryotic groups are represented in the results. Top BLAST hits can be crowded with sequences from closely related organisms, making it difficult to gain an overview of the overall distribution across eukaryotes. To streamline this process, we developed an in-house database of complete eukaryotic proteomes. We tagged each sequence with a eukaryotic group handle (two-character symbol) and combined them into a single data set searchable by standalone BLAST on one’s own computer. We named this data set “Darwin” to reflect the diverse nature of the sequences it contains.

Methods

We downloaded predicted proteomes in FASTA format from different sources such as GenBank, the Joint Genome Institute (Department of Energy, USA), the Broad Institute (Massachusetts Institute of Technology, USA), Phytozome and a number of other specialized websites catering for a specific organism, such as the Arabidopsis Information Resource (TAIR) or the Saccharomyces Genome Database (SGD). All the organisms we included in Darwin are listed in Table 1. To reduce redundancy, we took care not to include the same species more than once unless subspecies were known to show wide diversity. Each sequence header was tagged with a eukaryotic group handle composed of two-character symbols (based on Keeling et al., 2005). These handles appear clearly in BLAST output and can be parsed easily. We combined sequences from all proteomes into a single data set and named it “Darwin”.

Results

The current version of Darwin (v. 03_2020_137) contains 2,601,132 amino acid sequences from 137 eukaryotes (Table 1, Data file 1). The sizes of the proteomes were diverse, ranging from ~4,000 sequences in some alveolates to 60,000-76,000 in plants. Darwin represents most of the supergroups of the eukaryotic kingdom described in Keeling et al. (2005), except those in Rhizaria, whose genomes were not available at the time of data set construction. The data set contains larger numbers of proteomes from fungi and plants, reflecting areas of interest in our group.

Conclusions

Darwin is provided as a plain-text FASTA file that can be formatted for BLAST searches on standalone computers. The results from the BLAST searches can be parsed to determine how widely a gene of interest is distributed among different eukaryotes. Simply counting the eukaryotic group handles would also yield an overview of the distribution across taxa. Darwin is also useful for rapidly finding out whether a gene is missing in particular taxa.
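
The handle counting described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' own script: it assumes BLAST tabular output (`-outfmt 6` with the default columns) and assumes that each subject ID starts with the two-character group handle. The actual header layout in Darwin may differ, and the handles and sequence names in the example are made up.

```python
from collections import Counter

def count_group_handles(blast_lines, max_evalue=1e-5):
    """Count BLAST hits per two-character group handle.

    Assumes default tabular columns (-outfmt 6): subject ID in column 2 and
    E-value in column 11. Also assumes (hypothetically) that each subject ID
    starts with the handle, e.g. "Pl_proteinA".
    """
    counts = Counter()
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            counts[subject_id[:2]] += 1
    return counts

# Made-up hits; "Pl", "Fu" and "Al" are placeholders, not Darwin's real symbols.
example_hits = [
    "query1\tPl_proteinA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "query1\tPl_proteinB\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t150",
    "query1\tFu_proteinC\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t80",
    "query1\tAl_proteinD\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
]
print(count_group_handles(example_hits))
```

A handle that is absent from the resulting counts hints that the gene may be missing in that group, subject to the E-value cutoff chosen.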

Reference

Keeling PJ, Burger G, Durnford DG, Lang BF, Lee RW, Pearlman RE, Roger AJ, Gray MW (2005) The tree of eukaryotes. Trends Ecol. Evol. 20: 670-676
Data from: COInr a comprehensive, non-redundant COI database from NCBI-nt...
data.niaid.nih.gov
explore.openaire.eu
+1 more
Updated May 6, 2024
Cite
Meglecz, Emese (2024). COInr a comprehensive, non-redundant COI database from NCBI-nt and BOLD [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6555984
Dataset updated
May 6, 2024
Dataset authored and provided by
Meglecz, Emese
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
COInr is a non-redundant, comprehensive database of COI sequences extracted from NCBI-nt and BOLD. It is not limited to a taxon, a gene region, or a taxonomic resolution. Sequences are dereplicated between databases and within taxa.

Each taxon has a unique taxonomic identifier (taxID), which is fundamental to avoiding ambiguous associations of homonyms and synonyms in the source database. TaxIDs form a coherent hierarchical system that is fully compatible with the NCBI taxIDs, so full or ranked lineages can be created from them.
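
To illustrate how a hierarchical taxID system yields full or ranked lineages, here is a minimal sketch. It is not taken from mkCOInr: the function names, the toy taxIDs and the rank names are all hypothetical, and the only assumption borrowed from NCBI-style taxonomies is that every taxID points to a parent taxID and that the root is its own parent.

```python
def full_lineage(taxid, parent):
    """Return the list of taxIDs from the root down to `taxid`.

    `parent` maps each taxID to its parent taxID; the root maps to itself.
    """
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path[::-1]

def ranked_lineage(taxid, parent, rank, wanted_ranks):
    """Return one taxID (or None) per wanted rank, in the order given."""
    by_rank = {rank[t]: t for t in full_lineage(taxid, parent)}
    return [by_rank.get(r) for r in wanted_ranks]

# Toy taxonomy with invented taxIDs: 1 (root) > 10 (phylum) > 20 (family) > 30 (species).
parent = {1: 1, 10: 1, 20: 10, 30: 20}
rank = {1: "no rank", 10: "phylum", 20: "family", 30: "species"}

print(full_lineage(30, parent))
print(ranked_lineage(30, parent, rank, ["phylum", "class", "family", "species"]))
```

Ranks that are missing from a lineage (here "class") come back as `None`, which makes gaps in taxonomic resolution explicit.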

COInr is a good starting point to create custom databases according to the users’ needs using mkCOInr scripts available at https://github.com/meglecz/mkCOInr
It is possible to select/eliminate sequences for a list of taxa, select a specific gene region, select for minimum taxonomic resolution, add new custom sequences, and format the database for BLAST, QIIME, RDP classifiers.
Antibiotic Resistance Genes Database
rrid.site
neuinfo.org
+2 more
Updated Jun 15, 2025
Cite
(2025). Antibiotic Resistance Genes Database [Dataset]. http://identifiers.org/RRID:SCR_007040
Unique identifier
https://identifiers.org/RRID:SCR_007040
Dataset updated
Jun 15, 2025
Description
The goals of the Antibiotic Resistance Genes Database (ARDB) are to provide a centralized compendium of information on antibiotic resistance, to facilitate the consistent annotation of resistance information in newly sequenced organisms, and to facilitate the identification and characterization of new genes. ARDB contains six types of database groups:
- Resistance Type: This database contains information such as resistance profile, mechanism, requirement and epidemiology for each type.
- Resistance Gene: This database contains information such as resistance profile, resistance type, requirement, and protein and DNA sequence for each gene. This database only includes NON-REDUNDANT, NON-VECTOR, COMPLETE genes.
- Antibiotic: This database contains information such as producer, action mechanism and resistance type for each antibiotic.
- Resistance Gene(NonRD): This database contains the same information as Resistance Gene. It does NOT include REDUNDANT or VECTOR genes, but includes INCOMPLETE genes.
- Resistance Gene(ALL): This database contains the same information as Resistance Gene. It includes all REDUNDANT, VECTOR AND INCOMPLETE genes.
- Resistance Species: This database contains the resistance profile and corresponding resistance genes for each species.
Furthermore, ARDB also contains three types of BLAST database:
- Resistance Genes Complete: Contains only NON-REDUNDANT, NON-VECTOR, COMPLETE gene sequences.
- Resistance Genes Non-redundant: Contains NON-REDUNDANT, NON-VECTOR, COMPLETE and INCOMPLETE gene sequences.
- Resistance Genes All: Contains all REDUNDANT, VECTOR, COMPLETE and INCOMPLETE gene sequences.
Lastly, ARDB provides four types of analytical tools:
- Normal BLAST: This function allows a user to input a DNA or protein sequence and find similar DNA (Nucleotide BLAST) or protein (Protein BLAST) sequences using blastn, blastp, blastx, tblastn or tblastx.
- RPS BLAST: A web RPSBLAST (RPS BLAST) interface is provided to align a query sequence against the Position Specific Scoring Matrix (PSSM) for each type. Normally, this will give the same annotation information as the regular BLAST mentioned above.
- Multiple Sequences BLAST (Genome Annotation): This function allows a user to annotate multiple (fewer than 5000) query sequences in FASTA format.
- Mutation Resistance Identification: This function allows a user to identify mutations that will cause potential antibiotic resistance, for 12 genes (16S rRNA, 23S rRNA, gyrA, gyrB, parC, parE, rpoB, katG, pncA, embB, folP, dfr).
Sponsors: ARDB is funded by the Uniformed Services University of the Health Sciences, administered by the Henry Jackson Foundation.
3. Ecological genomics of the Northern krill: Genome assembly annotations...
figshare.scilifelab.se
researchdata.se
application/x-gzip
Updated Jan 15, 2025
Cite
Andreas Wallberg; Per Unneberg (2025). 3. Ecological genomics of the Northern krill: Genome assembly annotations (genes and repeats) [Dataset]. http://doi.org/10.17044/scilifelab.22786925.v1
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.17044/scilifelab.22786925.v1
Dataset updated
Jan 15, 2025
Dataset provided by
Uppsala University
Authors
Andreas Wallberg; Per Unneberg
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This item holds multiple gene and repeat model and annotation files, including coordinates in GFF/GTF formats, TXT/TSV tables and sequences in FASTA format. It also contains some accessory RNA-seq gene resources, such as Trinity-assembled transcripts and Nanopore cDNA sequences that were used at various stages of assembly and annotation. Coordinates refer to the main genome assembly reference sequence (1.m_norvegica.main_w_mito.fasta) but focus on the nuclear genome assembly and rarely include features of the mitochondrial assembly. Mitochondrial annotations are provided separately (see below). Contents:

- trinity_transcripts.tar.gz, an archive with n=573,869 RNA transcripts in FASTA format that have been assembled with Trinity using Illumina RNA-seq data.
- trinity_transcript.16509_single_isoforms.cds.fasta.tar.gz, a subset of 16,509 single (longest) isoforms of putatively protein-coding transcripts used to assess genome assembly metrics such as duplication and base-level error. Sequences are in FASTA format.
- nanopore_cDNA.representative_sequences_vsearch.tar.gz, n=25,484 cDNA Nanopore sequence reads used to filter gene models and scaffold the genome.
- annotations.all_genes_and_isoforms.redundant.tar.gz, an archive with all (n=202,138) gene models and isoforms/alternative splice variants, including non-protein-coding genes.
- annotations.protein_coding_gene_models.non_redundant.gff3, a non-redundant (i.e. single-isoform) set of putative protein-coding gene bodies (n=42,227) in standard GFF3 format.
- annotations.protein_coding_gene_models.non_redundant.CDS.fasta, the matching set of putative protein-coding genes in FASTA format (CDS nucleotide sequences).
- annotations.protein_coding_gene_models.non_redundant.PEP.fasta, the matching set of putative protein sequences in FASTA format (PEP peptide sequences).
- annotations.protein_coding_gene_models.non_redundant.PEP.fasta.BLAST.DROSOPHILA.tsv.tar.gz, output from BLASTP analyses between Northern krill and Drosophila peptide sequences (BLAST outfmt 6).
- annotations.protein_coding_gene_models.non_redundant.PEP.fasta.EnTAP.final_annotations_lvl1.tsv, main output from EnTAP functional annotations of protein-coding genes.
- annotations.protein_coding_gene_models.non_redundant_added_stop_codons.gff, non-redundant protein-coding models as above, but missing stop codons have been added if detected in the reference genome assembly (GFF format).
- annotations.protein_coding_gene_models.non_redundant_added_stop_codons.CDS.fasta, the matching CDS sequences, in which missing stop codons have been added if detected in the reference genome assembly (FASTA format).
- mitochondrion.tar.gz, an archive with gene coordinates and sequences of tRNAs, rRNAs, protein-coding genes and repeat features on the mitochondrial chromosome, as inferred using MITOS2. Files are standard BED/GFF/TSV/TXT/FASTA files and more information about formats can be found on the site for the original tool: http://mitos2.bioinf.uni-leipzig.de/help.py
- annotations.repeat_library.fasta, a custom set of n=10,909 non-redundant repeat sequences in FASTA format that were used to annotate the genome for repeats using RepeatMasker.
- annotations.repeats_across_the_genome_repeatmasker.tbl, the standard RepeatMasker masking overview output table.
- annotations.repeats_across_the_genome_repeatmasker.out.tar.gz, the full set of masked repeats and their coordinates across the genome.

trinity_transcripts.tar.gz This archive contains the transcripts assembled from RNA-seq data produced from six RNA extractions/tissues of the reference specimen. There are three FASTA files:

- trinity_transcripts.all_genes_and_isoforms.fasta = all assembled transcripts (n=573,869)
- trinity_transcripts.metazoan_genes_and_isoforms.CDS.fasta = a subset of n=60,677 assembled and putatively coding transcripts with best hits against Metazoan sequences (CDS nucleotide sequences)
- trinity_transcripts.metazoan_genes_and_isoforms.PEP.fasta = the n=60,677 corresponding peptide sequences.

nanopore_cDNA.representative_sequences_vsearch.tar.gz This archive contains putatively full-length cDNA reads in three FASTA files:

- clusters.fa = VSEARCH cluster representatives (i.e. cluster centroids with low error rates) that retain the original Nanopore sequence headers (n=25,484)
- clusters.renamed.fa = as above, but renamed with simple incrementing headers.
- clusters.renamed.min_500bp.fa = as above, but only reads longer than 500 bp (n=24,632). These reads were used to scaffold the genome.

annotations.all_genes_and_isoforms.redundant.tar.gz This archive contains gene models in four files:

- annotations.all_genes_and_isoforms.redundant.gtf, coordinates in GTF format
- annotations.all_genes_and_isoforms.redundant.gff3, coordinates in GFF3 format
- annotations.all_genes_and_isoforms.redundant.fasta, sequences in FASTA format
- annotations.all_genes_and_isoforms.redundant.transcripts.tsv, a TSV table with three fields specifying: 1) the final name of the isoform/splice variant; 2) the name of the gene model it belongs to; 3) the original name of the isoform.

These models were consolidated into loci using GFFCOMPARE from multiple sources of data, including RNA and comparative data. The names of the original isoforms indicate source:

- STRG = HISAT/STRINGTIE RNA-seq gene model. Tagged "REF_STRG" in the final gene model.
- mRNA = Assembled Trinity transcript. Tagged "REF_TRIN" in the final gene model.
- COMPARATIVE_SPALN = Comparative model derived from other crustaceans.
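
As an illustration of how the three-field transcripts table and the source labels above might be combined, the following sketch tallies the data sources behind each gene model. It rests on assumptions that are not confirmed by the description: that the table has no header row, and that the original isoform name starts with one of the labels (STRG, mRNA, COMPARATIVE_SPALN). The example rows are invented.

```python
import csv
import io
from collections import Counter, defaultdict

# Source labels as listed in the description; matching by prefix is an assumption.
SOURCE_LABELS = ("STRG", "mRNA", "COMPARATIVE_SPALN")

def sources_per_gene(tsv_handle):
    """Count isoform sources per gene model from a 3-column TSV.

    Columns: final isoform name, gene model name, original isoform name.
    """
    per_gene = defaultdict(Counter)
    for isoform, gene, original in csv.reader(tsv_handle, delimiter="\t"):
        source = next(
            (label for label in SOURCE_LABELS if original.startswith(label)),
            "unknown",
        )
        per_gene[gene][source] += 1
    return per_gene

# Invented rows standing in for the real transcripts table.
example_tsv = io.StringIO(
    "iso1\tgene1\tSTRG.1.1\n"
    "iso2\tgene1\tmRNA_0001\n"
    "iso3\tgene2\tCOMPARATIVE_SPALN_7\n"
    "iso4\tgene2\tSTRG.2.1\n"
    "iso5\tgene2\tSTRG.2.2\n"
)
summary = sources_per_gene(example_tsv)
for gene, counts in summary.items():
    print(gene, dict(counts))
```

For the real file, `open(path, newline="")` would replace the `io.StringIO` object.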

GFF and GTF format specifications are available here:
https://www.ensembl.org/info/website/upload/gff.html
https://www.ensembl.org/info/website/upload/gff3.html

annotations.protein_coding_gene_models.non_redundant.(gff3|CDS.fasta|PEP.fasta) These files contain a filtered set of the "best" model isoform of each locus (n=42,227 in total), which were determined by comparison to NCBI RefSeq. These models were used to annotate SNPs and to infer homology/orthology, gene family evolution and molecular evolution.

annotations.repeat_library.fasta This FASTA file contains the representative and non-redundant template repeat sequences that were used to annotate the Northern krill genome for interspersed repeats. The sequence headers indicate several aspects of each repeat. Example: "seq_c_98391_5186_12351_FIN_ReC99C#LTR/Pao" This indicates that the template is:

- located on sequence seq_c_98391 with start/stop coordinates 5186/12351
- originally detected using LTR_Finder ("FIN")
- classified as "LTR/Pao" using RepeatClassifier ("ReC")
- 99% identical between the 5' and 3' LTRs ("99") and considered complete, with respect to the expected protein domains detected along the repeat.

Additional tags and nomenclature are described in the paper methods.
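
The example header above can be split into its parts with a regular expression. This is a sketch built from that single example only: the pattern, the field names and the reading of the trailing "C" as the "complete" flag are assumptions, and headers that carry the additional tags described in the paper may not match.

```python
import re

# Pattern inferred from one example header; treat it as a starting point only.
HEADER_RE = re.compile(
    r"^(?P<sequence>.+)_(?P<start>\d+)_(?P<stop>\d+)"
    r"_(?P<detector>[A-Za-z]+)"
    r"_(?P<classifier>[A-Za-z]+)(?P<ltr_identity>\d+)(?P<complete>C?)"
    r"#(?P<repeat_class>.+)$"
)

def parse_repeat_header(header):
    """Split a repeat-library header into named fields, or return None."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["stop"] = int(fields["stop"])
    fields["ltr_identity"] = int(fields["ltr_identity"])
    # Assumption: a trailing "C" after the identity marks a complete element.
    fields["complete"] = fields["complete"] == "C"
    return fields

parsed = parse_repeat_header("seq_c_98391_5186_12351_FIN_ReC99C#LTR/Pao")
print(parsed)
```

Returning `None` for non-matching headers lets a caller collect and inspect the headers that use other tags instead of failing on them.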
SNPs predicted between the two Ae. sharonensis accessions from two selected...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Costas Bouyioukos; Matthew J. Moscou; Nicolas Champouret; Inmaculada Hernández-Pinzón; Eric R. Ward; Brande B. H. Wulff (2023). SNPs predicted between the two Ae. sharonensis accessions from two selected assemblies. [Dataset]. https://plos.figshare.com/articles/dataset/SNPs_predicted_between_the_two_i_Ae_i_i_sharonensis_i_accessions_from_two_selected_assemblies_/4129896
Available download formats: xls
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Costas Bouyioukos; Matthew J. Moscou; Nicolas Champouret; Inmaculada Hernández-Pinzón; Eric R. Ward; Brande B. H. Wulff
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SNPs predicted between the two Ae. sharonensis accessions from two selected assemblies.
Predicted NB-LRR proteins from Ae. sharonensis assemblies and BRBH against...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Costas Bouyioukos; Matthew J. Moscou; Nicolas Champouret; Inmaculada Hernández-Pinzón; Eric R. Ward; Brande B. H. Wulff (2023). Predicted NB-LRR proteins from Ae. sharonensis assemblies and BRBH against grass NB-LRRs. [Dataset]. https://plos.figshare.com/articles/dataset/Predicted_NB-LRR_proteins_from_i_Ae_i_i_sharonensis_i_assemblies_and_BRBH_against_grass_NB-LRRs_/4129893
Available download formats: xls
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Costas Bouyioukos; Matthew J. Moscou; Nicolas Champouret; Inmaculada Hernández-Pinzón; Eric R. Ward; Brande B. H. Wulff
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Predicted NB-LRR proteins from Ae. sharonensis assemblies and BRBH against grass NB-LRRs.
Summary of 454 sequencing data generated for sheepgrass transcriptome and...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Shuangyan Chen; Xin Huang; Xueqing Yan; Ye Liang; Yuezhu Wang; Xiaofeng Li; Xianjun Peng; Xingyong Ma; Lexin Zhang; Yueyue Cai; Tian Ma; Liqin Cheng; Dongmei Qi; Huajun Zheng; Xiaohan Yang; Xiaoxia Li; Gongshe Liu (2023). Summary of 454 sequencing data generated for sheepgrass transcriptome and quality filtering. [Dataset]. http://doi.org/10.1371/journal.pone.0067974.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0067974.t001
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Shuangyan Chen; Xin Huang; Xueqing Yan; Ye Liang; Yuezhu Wang; Xiaofeng Li; Xianjun Peng; Xingyong Ma; Lexin Zhang; Yueyue Cai; Tian Ma; Liqin Cheng; Dongmei Qi; Huajun Zheng; Xiaohan Yang; Xiaoxia Li; Gongshe Liu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary of 454 sequencing data generated for sheepgrass transcriptome and quality filtering.
Table S4 BLAST results of SNP-associated sequences from masson pine...
figshare.com
txt
Updated Feb 13, 2023
Cite
Mo-Hua Yang (2023). Table S4 BLAST results of SNP-associated sequences from masson pine accessions compared with the non-redundant (nr) protein database. [Dataset]. http://doi.org/10.6084/m9.figshare.22085762.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.22085762.v1
Dataset updated
Feb 13, 2023
Dataset provided by
figshare
Authors
Mo-Hua Yang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Table S4 BLAST results of SNP-associated sequences from masson pine accessions compared with the non-redundant (nr) protein database.
Data from: RefSeq
rrid.site
scicrunch.org
+2 more
Updated Jun 21, 2025
Cite
(2025). RefSeq [Dataset]. http://identifiers.org/RRID:SCR_003496
Unique identifier
https://identifiers.org/RRID:SCR_003496
Dataset updated
Jun 21, 2025
Description
Collection of curated, non-redundant genomic DNA, transcript RNA, and protein sequences produced by NCBI. Provides a reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. Accessed through the Nucleotide and Protein databases.
Data from: Fitness effects of mutations: An assessment of PROVEAN...
data.niaid.nih.gov
search.dataone.org
+1 more
zip
Updated Feb 7, 2022
Cite
Linnea Sandell; Nathaniel Sharp (2022). Fitness effects of mutations: An assessment of PROVEAN predictions using mutation accumulation data [Dataset]. http://doi.org/10.5061/dryad.j0zpc86ct
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.j0zpc86ct
Dataset updated
Feb 7, 2022
Dataset provided by
University of Wisconsin–Madison
University of British Columbia
Authors
Linnea Sandell; Nathaniel Sharp
License
https://spdx.org/licenses/CC0-1.0.html
Description
Predicting fitness in natural populations is a major challenge in biology. It may be possible to leverage fast-accumulating genomic datasets to infer the fitness effects of mutant alleles, allowing evolutionary questions to be addressed in any organism. In this paper, we investigate the utility of one such tool, called PROVEAN. This program compares a query sequence with existing data to provide an alignment-based score for any protein variant, with scores categorized as neutral or deleterious based on a preset threshold. PROVEAN has been used widely in evolutionary studies, e.g., to estimate mutation load in natural populations, but has not been formally tested as a predictor of aggregate mutational effects on fitness. Using three large, published datasets on the genome sequences of laboratory mutation accumulation lines, we assessed how well PROVEAN predicted the actual fitness patterns observed, relative to other metrics. In most cases, we find that a simple count of the total number of mutant proteins is a better predictor of fitness than the number of variants scored as deleterious by PROVEAN. We also find that the sum of all mutant protein scores explains variation in fitness better than the number of mutant proteins in one of the datasets. We discuss the implications of these results for studies of populations in the wild.

Methods

We used previously published datasets of growth rates of, and mutations in, mutation accumulation lines in Saccharomyces cerevisiae and Chlamydomonas reinhardtii. We computed the mutated proteins and ran each protein variant, as compared to the laboratory ancestor, through PROVEAN.

We ran PROVEAN on the ComputeCanada cluster. As the program failed to run with the recent BLAST software (version 2.9.0), we configured PROVEAN to run with PSI-BLAST and BLASTDBCMD (Altschul et al. 1997) from BLAST version 2.4.0. We used version 4.8.1 of CD-HIT. We ran our variants with the NCBI nr database from 12/11/2019, which holds 142 GB of non-redundant sequences (229,636,095 sequences). We also ran a subset of variants using the 2012 database on which PROVEAN was developed (the first 5 GB), which did not radically change the PROVEAN scores of variants. The supporting sequence sets used to compute the alignment scores for all proteins were saved.

Sc1

We used the mutations reported in Sharp et al. (2018; Dataset_S2.xlsx). There were 1474 genic mutations in the dataset, occurring in 1219 unique genes across 218 MA lines. We extracted the nucleotide and protein sequences of the affected genes using YeastMine (Balakrishnan et al. 2012). From the same database, we downloaded the location of introns in these genes. The reference nucleotide sequence was then mutated in silico to represent the mutant sequence, which was then transcribed and translated using the seqinr package (Charif and Lobry 2007) in R (R Core Team 2019). Additionally, we analyzed VCF files to obtain a table of mutations in the ancestral line as compared to the yeast reference genome (version R64-2-1). In cases where the ancestor and reference strain differed for a mutated gene (126 genes), we separately computed the ancestral protein and used it for comparison to the MA lines. We wrote a script to produce protein variants in the format PROVEAN requires. From 1474 genic mutations, 1126 protein variants were computed (in 961 unique proteins). Two samples (lines 113 and 206) had no nonsynonymous mutations. When an MA line had more than one nonsynonymous mutation in a particular gene, both mutations were considered when altering the protein, and the mutant protein was counted once. Out of 961 altered proteins, 126 already differed between the S288C reference genome and the laboratory ancestor, in which case the latter was used as the query sequence.

Sc2

We used the mutations reported in Liu and Zhang (2019; Data_S1.xlsx). Additionally, the authors supplied us with a table of mutations in their ancestral line relative to the S288C reference genome. We used the same method as described above for dataset Sc1. There were 1147 genic mutations, occurring in 968 unique genes, across 165 MA lines. From 1147 genic mutations, 877 protein variants were computed (in 754 unique proteins). Out of 754 altered proteins, 16 already differed between the S288C reference genome and the laboratory ancestor, in which case the latter was used as the query sequence.

Cr

We received an annotated table of the mutations reported in Ness et al. (2015) as well as VCF files containing the mutations in their six ancestral lines compared to the reference genome. We downloaded an annotated table for all transcripts in the Chlamydomonas reference genome from Dicots PLAZA 4.0 (version 5.5, Van Bel et al., 2018) to identify mutations in coding sequences. Out of the original 6843 mutations, 3889 affected protein sequence, representing 1439 mutated proteins after combining mutations. We found that the majority of transcripts that were mutated during mutation accumulation already had existing variants in the ancestral strain, relative to the reference (table 1). 1397 out of the originally predicted 1439 protein variants remained once ancestral variation had been considered (table 1). As in the other datasets, we use the ancestral protein as the query protein. We found 2 cases in the C. reinhardtii dataset where the reported reference nucleotide deviated from that found in the Dicots PLAZA 4.0 sequence; in each case, the differences between the two reference sequences were synonymous. This discrepancy was likely due to the two different reference genomes used (Ness et al. used v5.3; Van Bel et al. used v5.5). To test the accuracy of our sequence-mutating code, we mutated the coding sequence to the reference nucleotide given by the C. reinhardtii dataset and verified that this produced the reference transcript. We converted the protein variants into the format PROVEAN requires. In cases with alternative transcripts, we treat these as separate proteins in PROVEAN and then report the minimum score given to any protein variant of a gene. This occurred in 42 unique cases, involving all genetic backgrounds. While the difference in scores between transcripts was generally small, we found two cases where the score for one affected transcript was below the default threshold of -2.5 while the other was above it, and six cases where the scores fell above and below zero. Six out of the total 1397 protein variants failed to receive a score from PROVEAN, likely because the changes to the protein were too large to compute alignment scores between the clusters gathered and the mutant protein. These variants were ignored in the analysis (they occurred in six different samples across five ancestral backgrounds).
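
The per-line summaries discussed above (number of mutant proteins, number scored as deleterious, and sum of scores, taking the minimum score across alternative transcripts of a gene) can be sketched as follows. This is an illustrative sketch, not the authors' script: the function name, the input layout and the example scores are hypothetical, and treating a score exactly equal to the threshold as deleterious is an assumption.

```python
DEFAULT_THRESHOLD = -2.5  # default PROVEAN cutoff mentioned in the description

def summarize_line(transcript_scores, threshold=DEFAULT_THRESHOLD):
    """Summarize PROVEAN scores for one mutation accumulation line.

    `transcript_scores` is an iterable of (gene, transcript, score) tuples.
    For genes with several transcripts, the minimum score is kept.
    """
    gene_min = {}
    for gene, _transcript, score in transcript_scores:
        gene_min[gene] = min(score, gene_min.get(gene, score))
    return {
        "mutant_proteins": len(gene_min),
        # Assumption: scores at or below the threshold count as deleterious.
        "deleterious": sum(s <= threshold for s in gene_min.values()),
        "score_sum": sum(gene_min.values()),
    }

# Invented scores: geneA has two transcripts that fall on either side of the cutoff.
example_scores = [
    ("geneA", "geneA.t1", -1.0),
    ("geneA", "geneA.t2", -3.0),
    ("geneB", "geneB.t1", 0.5),
]
print(summarize_line(example_scores))
```

Keeping the minimum per gene means a gene is called deleterious if any of its transcripts is, which mirrors the reporting rule described for the Cr dataset.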
EXProt- database for EXPerimentally verified Protein functions
scicrunch.org
Updated Feb 1, 2002
Cite
(2002). EXProt- database for EXPerimentally verified Protein functions [Dataset]. http://identifiers.org/RRID:SCR_007652
Unique identifier
https://identifiers.org/RRID:SCR_007652
Dataset updated
Feb 1, 2002
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. EXProt (database for EXPerimentally verified Protein functions) is a non-redundant database containing protein sequences for which the function has been experimentally verified. EXProt is a selection of 6491 entries that are described as having an experimentally verified function. The entries in EXProt all have a unique ID number and provide information about organism, protein sequence, functional annotation, a link to the entry in the original database and, if known, the gene name and a link to references in PubMed. The EXProt database can be searched with BLAST or FASTA using an amino acid or nucleotide sequence as the query. Note that only the sequence goes into the field. The EXProt database is also searchable in SRS6 at CMBI. In the near future, entries from the genome project of Lactobacillus plantarum by the Wageningen Centre for Food Sciences (WCFS) will be added to EXProt.
Extra-cellular proteins from Lactic Acid Bacteria
ebi.ac.uk
data.niaid.nih.gov
Cite
Eile Butler, Extra-cellular proteins from Lactic Acid Bacteria [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD000187
Authors
Eile Butler
Variables measured
Proteomics
Description
This dataset presents both known and unknown extracellular proteins from 13 species of lactic acid bacteria (LAB) found in the honey crop of the honeybee Apis mellifera mellifera. The tryptic peptides from the secreted proteins were run on an Agilent HPLC on a C18 reverse-phase column (75 µm x 150 mm, particle size 3 µm). Total run time was 90 min and the flow rate was 300 nl/min. The buffers used for the gradient were 0.1% formic acid in water (buffer A) and 0.1% formic acid in acetonitrile (buffer B). The gradient program was 5 min at 5% buffer B, followed by a linear gradient of 5%-45% buffer B over 50 min, followed by a linear gradient of 45%-80% buffer B over 5 min. Buffer B was then held at 80% for 15 min and rapidly returned to 5% for the final 15 min. The fractions from HPLC were loaded on an LCQ Deca XP Plus ion trap mass spectrometer (ThermoScientific).

Genomic DNA was prepared from all 13 LAB strains described earlier and sequenced at MWG Eurofins Operon (Ebersberg, Germany) using Roche GS FLX Titanium technology from Roche (Basel, Switzerland). For each genome, a shotgun library was constructed with up to 700,000 reads per segment and was generated by sequencing in 2x half segments of a full FLX+ run. Each genome also had an 8 kbp long paired-end library constructed. Approximately 300,000 true paired-end reads, sequence tags, and scaffolds were generated with GS FLX+ chemistry using 2x half segments of a full run. Clonal amplification was performed by emPCR for both library types. Sequencing was continued until 15-20 fold coverage was reached. The obtained reads were assembled with Newbler 2.6 from Roche (Basel, Switzerland). ORF prediction and automated annotation were performed at Integrated Genomics Assets Inc. (Mount Prospect, Illinois, USA). Three different programs were used for ORF prediction: GLIMMER, Critica, and Prokpeg. Automated annotation was performed with the ERGO™ algorithms (Integrated Genomics Assets Inc., Mount Prospect, Illinois, USA).

The mass spectra files obtained from the mass spectrometry analysis were searched using MASCOT against a local database containing the predicted proteome of the 13 LAB. We used a cut-off ion score of 38 to determine that a protein was identified. Individual ion scores greater than 38 indicated identity or extensive homology (p<0.05) of the protein. Protein sequence similarity searches were performed with BLASTP in the software package BLAST 2.27+ against a non-redundant protein database at NCBI, Pfam (default database), and InterProScan (default databases). Expressed proteins identified by peptide mass fingerprinting were manually re-annotated.
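The HPLC gradient program described above can be written down as a small table and checked against the stated 90 min run time. The following is only an illustrative Python sketch: the names `SEGMENTS`, `total_run_time`, and `percent_b` are hypothetical, and it assumes that the final 15 min are held at a constant 5% buffer B after the rapid drop.

```python
# Gradient program as (duration in min, start %B, end %B), taken from the description.
# Assumption: the final re-equilibration step is a flat hold at 5% B.
SEGMENTS = [
    (5, 5, 5),     # initial hold at 5% B
    (50, 5, 45),   # linear gradient 5% -> 45% B
    (5, 45, 80),   # linear gradient 45% -> 80% B
    (15, 80, 80),  # hold at 80% B
    (15, 5, 5),    # back at 5% B for the final 15 min
]


def total_run_time(segments):
    """Sum the segment durations (minutes)."""
    return sum(duration for duration, _, _ in segments)


def percent_b(t, segments):
    """Return %B at time t (minutes), interpolating linearly within a segment."""
    start = 0.0
    for duration, b_start, b_end in segments:
        end = start + duration
        if start <= t < end:
            return b_start + (b_end - b_start) * (t - start) / duration
        start = end
    if t == start:
        # Exactly at the end of the run: report the final segment's end value.
        return float(segments[-1][2])
    raise ValueError("time outside the gradient program")
```

Summing the durations gives 5 + 50 + 5 + 15 + 15 = 90 min, which agrees with the total run time stated in the description.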
Data from: Indexed reference databases for KMA and CCMetagen
researchdata.edu.au
Updated Apr 30, 2019
Dr Vanessa Rossetto Marcelino; Dr Jan Buchmann; Clausen Philip (2019). Indexed reference databases for KMA and CCMetagen [Dataset]. http://doi.org/10.25910/5CC7CD40FCA8E
Unique identifier
https://doi.org/10.25910/5CC7CD40FCA8E
Dataset updated
Apr 30, 2019
Dataset provided by
The University of Sydney
Authors
Dr Vanessa Rossetto Marcelino; Dr Jan Buchmann; Clausen Philip
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Time period covered
Apr 9, 2019 - Apr 30, 2019
Description
This database was built to identify taxa in metagenome samples using the CCMetagen pipeline. Using the whole NCBI nt collection allows a complete taxonomic overview, including of microbial eukaryotes that may be present in the dataset. The database is already indexed and ready to use with KMA and CCMetagen.

A manual describing how to use this dataset can be found at: https://github.com/vrmarcelino/CCMetagen

Additionally, a tutorial on the whole analysis of a set of metatranscriptome samples can be found at: https://github.com/vrmarcelino/CCMetagen/tree/master/tutorial

The database was built as follows:

The partially non-redundant nucleotide database was downloaded from the NCBI website (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz) in January 2018. This database was formatted to include taxids in sequence headers.
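The step of formatting the database "to include taxids in sequence headers" could look roughly like the Python sketch below. It is not the script the authors used. The function name `add_taxids`, the accession-to-taxid dictionary, and the `>taxid|original header` layout are all assumptions made for illustration; the CCMetagen manual linked above documents the header format the pipeline actually expects.

```python
def add_taxids(fasta_lines, acc2taxid):
    """Yield FASTA lines with the taxid prepended to each header.

    Assumptions (hypothetical, for illustration only):
    - the accession is the first whitespace-delimited token of the header;
    - the output header layout is '>taxid|original header';
    - records whose accession has no known taxid are dropped.
    """
    keep = False
    for raw in fasta_lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            taxid = acc2taxid.get(accession)
            keep = taxid is not None
            if keep:
                yield f">{taxid}|{line[1:]}"
        elif keep and line:
            # Sequence line belonging to a record that was kept.
            yield line


records = [
    ">AB000001.1 example record A",
    "ACGTACGT",
    ">XX999999.1 record without a taxid",
    "GGCCGGCC",
]
tagged = list(add_taxids(records, {"AB000001.1": 9606}))
```

In this toy run only the first record survives, with its header rewritten to start with the taxid. For the full nt collection the mapping would have to come from a taxonomy lookup table rather than an in-memory dictionary literal.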

Indexing was then performed with KMA using the command:

kma_index -i nt_taxid.fas -o ncbi_nt -NI -Sparse TG

Three indexed databases are provided:

NCBI nucleotide collection

RefSeq database of bacterial and fungal genomes
Dataset for article: Co-evolutionary landscape at the interface and...
data.niaid.nih.gov
datadryad.org
zip
Updated Jul 26, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ishita Mukherjee; Saikat Chakrabarti (2021). Dataset for article: Co-evolutionary landscape at the interface and non-interface regions of protein-protein interaction complexes [Dataset]. http://doi.org/10.5061/dryad.zgmsbcc8g
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.zgmsbcc8g
Dataset updated
Jul 26, 2021
Dataset provided by
Indian Institute of Chemical Biology
Authors
Ishita Mukherjee; Saikat Chakrabarti
License
https://spdx.org/licenses/CC0-1.0.html
Description
Proteins involved in interactions throughout the course of evolution tend to co-evolve and compensatory changes may occur in interacting proteins to maintain or refine such interactions. However, certain residue pair alterations may prove to be detrimental for functional interactions. Hence, determining co-evolutionary pairings that could be structurally or functionally relevant for maintaining the conservation of an inter-protein interaction is important. Inter-protein co-evolution analysis in several complexes utilizing multiple existing methodologies suggested that co-evolutionary pairings can occur in spatially proximal and distant regions in inter-protein interactions. Subsequently, the Co-Var (Correlated Variation) method based on mutual information and Bhattacharyya coefficient was developed, validated, and found to perform relatively better than CAPS and EV-complex. Interestingly, while applying the Co-Var measure and EV-complex program on a set of protein-protein interaction complexes, co-evolutionary pairings were obtained in interface and non-interface regions in protein complexes. The Co-Var approach involves determining high degree co-evolutionary pairings that include multiple co-evolutionary connections between particular co-evolved residue positions in one protein with multiple residue positions in the binding partner. Detailed analyses of high degree co-evolutionary pairings in protein-protein complexes involved in cancer metastasis suggested that most of the residue positions forming such co-evolutionary connections mainly occurred within functional domains of constituent proteins and substitution mutations were also common among these positions. The physiological relevance of these predictions suggests that Co-Var can predict residues that could be crucial for preserving functional protein-protein interactions. 
Finally, the Co-Var web server (http://www.hpppi.iicb.res.in/ishi/covar/index.html), which implements this methodology, identifies co-evolutionary pairings in intra- and inter-protein interactions.
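The description says Co-Var is based on mutual information and the Bhattacharyya coefficient, but it does not say how the two are combined. The Python sketch below therefore shows only the textbook definitions of the two quantities, applied to a pair of alignment columns. It is not the Co-Var scoring function, and the function names and toy columns are hypothetical.

```python
from collections import Counter
from math import log2, sqrt


def frequencies(column):
    """Relative frequency of each symbol in an alignment column."""
    n = len(column)
    return {symbol: count / n for symbol, count in Counter(column).items()}


def mutual_information(col_a, col_b):
    """Mutual information (bits) between two equally long alignment columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    # Sum over observed pairs of p(a,b) * log2(p(a,b) / (p(a) * p(b))).
    return sum(
        (c / n) * log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )


def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient between two discrete distributions (dicts)."""
    return sum(sqrt(p.get(k, 0.0) * q.get(k, 0.0)) for k in set(p) | set(q))


# Toy columns: each string position is one sequence of the paired alignment.
mi_coupled = mutual_information("AAKK", "DDEE")      # perfectly co-varying
mi_independent = mutual_information("AAKK", "DEDE")  # no covariation
bc_same = bhattacharyya_coefficient(frequencies("AAKK"), frequencies("KKAA"))
```

For the toy columns, the perfectly co-varying pair carries one bit of mutual information and the independent pair carries none. The Bhattacharyya coefficient ranges from 0 for distributions with no shared symbols to 1 for identical distributions.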

Methods

A number of protein-protein interaction complexes [100] were identified from previously published data (1-3), and complexes involving proteins with a sufficient number of homologs and an available crystal structure were selected. Around 50 protein complexes were considered as the “positive set”. Additionally, non-interacting proteins from the Negatome database (4) were considered as the “negative set”. Close orthologs or similar sequences were determined using DELTA-BLAST (Domain enhanced lookup time accelerated BLAST) (5), and taxonomy-filtered non-redundant sequences having E-value <= 1E-04, query coverage >= 70%, and sequence identity >= 45% were used to prepare multiple sequence alignments (MSA) representative of each sequence family in MAFFT (6). Alignments for homologous sequences of the representative interacting and non-interacting proteins in the “positive set” and the “negative set” were prepared in this manner.
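The three hit-selection thresholds quoted above (E-value <= 1E-04, query coverage >= 70%, sequence identity >= 45%) can be expressed as a simple filter. This is a minimal Python sketch, not the authors' pipeline: the `Hit` record and its field names are hypothetical, and it assumes coverage and identity are given as percentages.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One homology-search hit (hypothetical record; field names are assumptions)."""
    subject_id: str
    evalue: float
    query_coverage: float  # percent of the query covered by the alignment
    identity: float        # percent sequence identity


def passes_thresholds(hit, max_evalue=1e-4, min_coverage=70.0, min_identity=45.0):
    """Apply the three cut-offs stated in the Methods text."""
    return (
        hit.evalue <= max_evalue
        and hit.query_coverage >= min_coverage
        and hit.identity >= min_identity
    )


hits = [
    Hit("seqA", 1e-30, 95.0, 62.0),  # passes all three cut-offs
    Hit("seqB", 1e-3, 90.0, 80.0),   # fails the E-value cut-off
    Hit("seqC", 1e-10, 55.0, 70.0),  # fails the coverage cut-off
    Hit("seqD", 1e-10, 80.0, 30.0),  # fails the identity cut-off
]
selected = [h.subject_id for h in hits if passes_thresholds(h)]
```

The taxonomy filtering and redundancy removal mentioned in the text are separate steps and are not covered by this sketch.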

References

1. Mintseris, J. and Weng, Z. (2003). Atomic contact vectors in protein‐protein recognition. Proteins, 53: 629-639. doi:10.1002/prot.10432
2. Sowmya, G., Breen, E. J., & Ranganathan, S. (2015). Linking structural features of protein complexes and biological function. Protein Science: a publication of the Protein Society, 24(9), 1486-94.
3. Rodriguez-Rivas, J., Marsili, S., Juan, D., & Valencia, A. (2016). Conservation of coevolving protein interfaces bridges prokaryote-eukaryote homologies in the twilight zone. Proceedings of the National Academy of Sciences of the United States of America, 113(52), 15018–1502. doi:10.1073/pnas.1611861114
4. Smialowski, P., Pagel, P., Wong, P., Brauner, B., Dunger, I., Fobo, G., Frishman, G., Montrone, C., Rattei, T., Frishman, D., et al. (2009). The Negatome database: a reference set of non-interacting protein pairs. Nucleic Acids Research, 38(Database issue), D540-4.
5. Boratyn, G. M., Schäffer, A. A., Agarwala, R., Altschul, S. F., Lipman, D. J., & Madden, T. L. (2012). Domain enhanced lookup time accelerated BLAST. Biology Direct, 7, 12. doi:10.1186/1745-6150-7-12
6. Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059–66.
Data from: Expressed Sequence Tags from the Ciliate Protozoan Parasite...
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Apr 21, 2025
Agricultural Research Service (2025). Expressed Sequence Tags from the Ciliate Protozoan Parasite Ichthyophthirius Multifiliis [Dataset]. https://catalog.data.gov/dataset/expressed-sequence-tags-from-the-ciliate-protozoan-parasite-ichthyophthirius-multifiliis-b99f0
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
Researchers sequenced 10,368 expressed sequence tag (EST) clones using a normalized cDNA library made from pooled samples of the trophont, tomont, and theront life-cycle stages, and generated 9,769 sequences (94.2% success rate). Post-sequencing processing led to 8,432 high-quality sequences. Clustering analysis of these ESTs allowed identification of 4,706 unique sequences containing 976 contigs and 3,730 singletons. The ciliate protozoan Ichthyophthirius multifiliis (Ich) is an important parasite of freshwater fish that causes 'white spot disease', leading to significant losses. A genomic resource for large-scale studies of this parasite has been lacking. To study gene expression involved in Ich pathogenesis and virulence, our goal was to generate ESTs for the development of a powerful microarray platform for the analysis of global gene expression in this species. Here, we initiated a project to sequence and analyze over 10,000 ESTs.

Resources in this dataset:

Resource Title: Data Dictionary - Supplemental Tables 1, 2, and 3.
File Name: IchthyophthiriusESTs_DataDictionary.csv
Resource Description: Machine-readable comma-separated values (CSV) definitions for data elements of Supplemental Tables 1-3 concerning I. multifiliis unique EST sequences, BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes, and gene ontology (GO) profile.

Resource Title: Table 3. Table of gene ontology (GO) profiles.
File Name: 12864_2006_889_MOESM3_ESM.xls
Resource Description: Supplemental Table 3, Excel spreadsheet; Table of gene ontology (GO) profiles. Provided information includes unique EST name, accession numbers, BLASTX top hit, GO identification numbers and enzyme commission (EC) numbers.
Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176
Direct download for this resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM3_ESM.xls

Resource Title: Table 1. I. multifiliis unique EST sequences.
File Name: 12864_2006_889_MOESM1_ESM.xls
Resource Description: Supplemental Table 1 for article, "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis." Excel spreadsheet; Table of I. multifiliis unique EST sequences. Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name and accession numbers. Also included are significant protein domain comparisons to the Swiss-Prot database. Putative secretory proteins are highlighted.
Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176
Direct download for this resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM1_ESM.xls

Resource Title: Table 2. Excel spreadsheet; Summary of BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes.
File Name: 12864_2006_889_MOESM2_ESM.xls
Resource Description: Table 2 from "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis." Excel spreadsheet; Summary of BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes. Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name, tBLASTx top hits to the T. thermophila genome, and BLASTX top hits to the P. falciparum genome sequences. This table correlates with the Venn diagram in figure 1.
Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176
Direct download link for this data resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM2_ESM.xls
Data from: Comparative description of ten transcriptomes of newly sequenced...
dataverse.harvard.edu
Updated Mar 21, 2014
A. Riesgo; S.C.S. Andrade; P.P. Sharma; M. Novo; A.R. Perez-Porro; V. Vahtera; V.L. Gonzalez; G.Y. Kawauchi; G. Giribet (2014). Comparative description of ten transcriptomes of newly sequenced invertebrates and efficiency estimation of genomic sampling in non-model taxa [Dataset]. http://doi.org/10.7910/DVN/25071
Unique identifier
https://doi.org/10.7910/DVN/25071
Dataset updated
Mar 21, 2014
Dataset provided by
Harvard Dataverse
Authors
A. Riesgo; S.C.S. Andrade; P.P. Sharma; M. Novo; A.R. Perez-Porro; V. Vahtera; V.L. Gonzalez; G.Y. Kawauchi; G. Giribet
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
2012
Description
cDNA libraries of ten species belonging to five animal phyla (2 Annelida [including Sipuncula], 2 Arthropoda, 2 Mollusca, 2 Nemertea, and 2 Porifera) were sequenced in different batches with an Illumina Genome Analyzer II (read length 100 or 150 bp), rendering between ca. 25 and 52 million reads per species. Read thinning, trimming, and de novo assembly were performed under different parameters to optimize output. Between 67,423 and 207,559 contigs were obtained across the ten species, post-optimization. Of those, 9,069 to 25,681 contigs retrieved BLAST hits against the NCBI non-redundant database, and approximately 50% of these were assigned Gene Ontology terms, covering all major categories, and with similar percentages in all species. Local BLAST searches against our datasets, using selected genes from major signaling pathways and housekeeping genes, revealed high efficiency in gene recovery compared to available genomes of closely related species. Intriguingly, our transcriptomic datasets detected multiple paralogues in all phyla and in nearly all gene pathways, including housekeeping genes that are traditionally used in phylogenetic applications for their purported single-copy nature. We generated the first study of comparative transcriptomics across multiple animal phyla (comparing two species per phylum in most cases), established the first Illumina-based transcriptomic datasets for sponge, nemertean, and sipunculan species, and generated a tractable catalogue of annotated genes (or gene fragments) and protein families for ten newly sequenced non-model organisms, some of commercial importance (i.e., Octopus vulgaris). These comprehensive sets of genes can be readily used for phylogenetic analysis, gene expression profiling, and developmental analysis, and can also be a powerful resource for gene discovery.
The characterization of the transcriptomes of such a diverse array of animal species permitted the comparison of sequencing depth, functional annotation, and efficiency of genomic sampling using the same pipelines, which proved to be similar for all considered species. In addition, the datasets revealed their potential as a resource for paralogue detection, a recurrent concern in various aspects of biological inquiry, including phylogenetics, molecular evolution, development, and cellular biochemistry.
Newbler 2.5 assembly statistics of sheepgrass transcripts.
plos.figshare.com
xls
Updated Jun 1, 2023
Shuangyan Chen; Xin Huang; Xueqing Yan; Ye Liang; Yuezhu Wang; Xiaofeng Li; Xianjun Peng; Xingyong Ma; Lexin Zhang; Yueyue Cai; Tian Ma; Liqin Cheng; Dongmei Qi; Huajun Zheng; Xiaohan Yang; Xiaoxia Li; Gongshe Liu (2023). Newbler 2.5 assembly statistics of sheepgrass transcripts. [Dataset]. http://doi.org/10.1371/journal.pone.0067974.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0067974.t002
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Shuangyan Chen; Xin Huang; Xueqing Yan; Ye Liang; Yuezhu Wang; Xiaofeng Li; Xianjun Peng; Xingyong Ma; Lexin Zhang; Yueyue Cai; Tian Ma; Liqin Cheng; Dongmei Qi; Huajun Zheng; Xiaohan Yang; Xiaoxia Li; Gongshe Liu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Newbler 2.5 assembly statistics of sheepgrass transcripts.
AceView
scicrunch.org
dknet.org
AceView [Dataset]. http://identifiers.org/RRID:SCR_002277
Unique identifier
https://identifiers.org/RRID:SCR_002277
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented May 10, 2017. A pilot effort that has developed a centralized, web-based biospecimen locator that presents biospecimens collected and stored at participating Arizona hospitals and biospecimen banks, which are available for acquisition and use by researchers. Researchers may use this site to browse, search and request biospecimens to use in qualified studies. The development of the ABL was guided by the Arizona Biospecimen Consortium (ABC), a consortium of hospitals and medical centers in the Phoenix area, and is now being piloted by this Consortium under the direction of ABRC. You may browse by type (cells, fluid, molecular, tissue) or disease. Common data elements decided by the ABC Standards Committee, based on data elements in the National Cancer Institute's (NCI's) Common Biorepository Model (CBM), are displayed. These describe the minimum set of data elements that the NCI determined were most important for a researcher to see about a biospecimen. The ABL currently does not display information on whether or not clinical data is available to accompany the biospecimens. However, a requester has the ability to solicit clinical data in the request. Once a request is approved, the biospecimen provider will contact the requester to discuss the request (and the requester's questions) before finalizing the invoice and shipment. The ABL is available to the public to browse. In order to request biospecimens from the ABL, the researcher will be required to submit the requested required information. Upon submission of the information, shipment of the requested biospecimen(s) will be dependent on the scientific and institutional review approval. Account required. Registration is open to everyone. Documented August 29, 2016.

AceView offers an integrated view of the human, nematode and Arabidopsis genes reconstructed by co-alignment of all publicly available mRNAs and ESTs on the genome sequence. Our goals are to offer a reliable, up-to-date resource on the genes and their functions and to stimulate further validating experiments at the bench. AceView provides a curated, comprehensive and non-redundant sequence representation of all public mRNA sequences (mRNAs from GenBank or RefSeq, and single-pass cDNA sequences from dbEST and Trace). These experimental cDNA sequences are first co-aligned on the genome, then clustered into a minimal number of alternative transcript variants and grouped into genes. Using the available cDNA sequences exhaustively and with high quality standards reveals the beauty and complexity of the mammalian transcriptome, and the relative simplicity of the nematode and plant transcriptomes. Genes are classified according to their inferred coding potential; many presumably non-coding genes are discovered. Genes are named by Entrez Gene names when available, else by AceView gene names, which are stable from release to release. Alternative features (promoters, introns and exons, polyadenylation signals) and coding potential, including motifs, domains, and homologies, are annotated in depth; tissues where expression has been observed are listed in order of representation; diseases, phenotypes, pathways, functions, localization or interactions are annotated by mining selected sources, in particular PubMed, GAD and Entrez Gene, and also by performing manual annotation, especially in the worm. In this way, both the anatomy and physiology of the experimentally cDNA-supported human, mouse and nematode genes are thoroughly annotated. Our goals are to offer an up-to-date resource on the genes, in the hope of stimulating further experiments at the bench or helping medical research. AceView can be queried by meaningful words or groups of words as well as by most standard identifiers, such as gene names, Entrez Gene ID, UniGene ID, and GenBank accessions.
Data from: ActDES – a Curated Actinobacterial Database for Evolutionary...
figshare.com
explore.openaire.eu
txt
Updated Apr 21, 2020
Jana Schniete; Nelly Selem; Anna Birke; Pablo Cruz-Morales; Iain S. Hunter; Francisco Barona-Gómez; Paul A Hoskisson (2020). ActDES – a Curated Actinobacterial Database for Evolutionary Studies [Dataset]. http://doi.org/10.6084/m9.figshare.12167529.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12167529.v1
Dataset updated
Apr 21, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Jana Schniete; Nelly Selem; Anna Birke; Pablo Cruz-Morales; Iain S. Hunter; Francisco Barona-Gómez; Paul A Hoskisson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Actinobacteria are a large, diverse phylum of bacteria, often with large genomes with a high G+C content. There is great variation in the sequence quality, equivalence of annotation and phylogenetic representation in the sequence databases, meaning that evolutionary and phylogenetic studies may be challenging. To address this, we have assembled a curated, high-level, taxa-specific, non-redundant database to aid detailed comparative analysis of Actinobacteria. ActDES constitutes a novel resource for the community of Actinobacterial researchers that will be useful primarily for two types of analyses: (i) comparative genomic studies, facilitated by reliable ortholog identification across a set of defined, phylogenetically representative genomes, and (ii) phylogenomic studies, which will be improved by identification of gene subsets at a specified taxonomic level. These studies can then act as a springboard for the study of the evolution of virulence genes, the evolution of metabolism, and metabolic engineering target identification.

Data summary

All genome sequences used in this study can be found in the NCBI taxonomy browser https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi and are summarised along with Accession numbers in Table S11. All other data is available on Figshare.

a. Perl script files
b. List of genomes from NCBI (Actinobacteria database.xlsx) Table S1
c. CVS genome annotation files including the FASTA files of nucleotide and amino acids sequences (612 individual .cvs files – folder cvs)
d. BLAST nucleotide database (.fasta file)
e. BLAST protein database (.fasta file)
f. Table S2 Expansion table genus level (Expansion table.xlsx Tab Genus level)
g. Table S2 Expansion table species level (Expansion table.xlsx Tab species level)
h. All data for GlcP and Glk data – blast hits from ActDES database, MUSCLE Alignment files and .nwk tree files