100+ datasets found

NCBI Genome Survey Sequences Database
neuinfo.org
Updated Sep 15, 2024
Cite
(2024). NCBI Genome Survey Sequences Database [Dataset]. http://identifiers.org/RRID:SCR_002146
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002146
Dataset updated
Sep 15, 2024
Description
Database of unannotated, short, single-read, primarily genomic sequences from GenBank, including random survey sequences, clone-end sequences, and exon-trapped sequences. The GSS division of GenBank is similar to the EST division, with the exception that most of the sequences are genomic in origin, rather than cDNA (mRNA). It should be noted that two classes (exon trapped products and gene trapped products) may be derived via a cDNA intermediate. Care should be taken when analyzing sequences from either of these classes, as a splicing event could have occurred and the sequence represented in the record may be interrupted when compared to genomic sequence. The GSS division contains (but is not limited to) the following types of data:
* random single pass read genome survey sequences
* cosmid/BAC/YAC end sequences
* exon trapped genomic sequences
* Alu PCR sequences
* transposon-tagged sequences
Although dbGSS sequences are incorporated into the GSS Division of GenBank, annotation in dbGSS is more comprehensive and includes detailed information about the contributors, experimental conditions, and genetic map locations.
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics
datasetcatalog.nlm.nih.gov
Updated Sep 12, 2016
Cite
Sun, Zhi; Omenn, Gilbert S.; Moritz, Robert L.; Shteynberg, David; Campbell, David S.; Binz, Pierre-Alain; Mendoza, Luis; Deutsch, Eric W.; Farrah, Terry (2016). Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001553063
Explore at:
Dataset updated
Sep 12, 2016
Authors
Sun, Zhi; Omenn, Gilbert S.; Moritz, Robert L.; Shteynberg, David; Campbell, David S.; Binz, Pierre-Alain; Mendoza, Luis; Deutsch, Eric W.; Farrah, Terry
Description
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances, a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted genes and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ∼20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. 
We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/.
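The two-step workflow described above (search against a simpler database, then check peptide uniqueness against a more complex one) might be sketched as follows. This is a minimal illustration only: the function names, the dictionary-based databases, and the toy sequences are all assumptions made for the example, and plain substring matching stands in for whatever peptide-to-protein mapping the authors actually use.

```python
from typing import Dict, List


def proteins_containing(peptide: str, database: Dict[str, str]) -> List[str]:
    """Return the sorted IDs of all proteins whose sequence contains the peptide."""
    return sorted(pid for pid, seq in database.items() if peptide in seq)


def uniqueness_report(peptides: List[str], larger_db: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each peptide found with a small tier to its matches in a larger tier."""
    return {pep: proteins_containing(pep, larger_db) for pep in peptides}


# Toy stand-ins (hypothetical sequences, not taken from the real databases).
tier1 = {"P1": "MKTAYIAKQRQISFVK", "P2": "MSHFSRQLEERLGLIEVQ"}
tier4 = dict(tier1, P1_variant="MKTAYIAKQRQISFVR", P3="GGSAYIAKQRGG")

identified = ["AYIAKQR", "LEERLGLIEVQ"]  # peptides assumed identified against tier1
report = uniqueness_report(identified, tier4)
for pep, hits in report.items():
    status = "unique" if len(hits) == 1 else "shared"
    print(pep, status, hits)
```

In this toy case the first peptide maps to one protein in the small tier but to three in the larger one, which is the kind of ambiguity the uniqueness check is meant to surface.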
Data_Sheet_1_Contamination in Reference Sequence Databases: Time for...
datasetcatalog.nlm.nih.gov
Updated Oct 22, 2021
Cite
Vanderschuren, Hervé; Cornet, Luc; Baurain, Denis; Kerff, Frédéric; Lupo, Valérian; Van Vlierberghe, Mick (2021). Data_Sheet_1_Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000866183
Explore at:
Dataset updated
Oct 22, 2021
Authors
Vanderschuren, Hervé; Cornet, Luc; Baurain, Denis; Kerff, Frédéric; Lupo, Valérian; Van Vlierberghe, Mick
Description
Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach by a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.
Iterative Genome Correction Largely Improves Proteomic Analysis of Nonmodel...
acs.figshare.com
xls
Updated Jun 9, 2023
Cite
Xiaohui Wu; Lina Xu; Wei Gu; Qian Xu; Qing-Yu He; Xuesong Sun; Gong Zhang (2023). Iterative Genome Correction Largely Improves Proteomic Analysis of Nonmodel Organisms [Dataset]. http://doi.org/10.1021/pr500369b.s003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1021/pr500369b.s003
Dataset updated
Jun 9, 2023
Dataset provided by
ACS Publications
Authors
Xiaohui Wu; Lina Xu; Wei Gu; Qian Xu; Qing-Yu He; Xuesong Sun; Gong Zhang
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The current application and development of proteomic studies typically depend on the availability of sequenced genomes. Protein identification based on the detected peptides with liquid chromatography tandem mass spectrometry is limited by the absence of sequenced genomes in many nonmodel organisms. In this study, we demonstrated a new strategy based on our stable, accurate, and error-tolerant FANSe (Fast and Accurate mapping tool for Nucleotide Sequencing datasets) mapping algorithm to correct genome sequences in an iterative manner. To evaluate the efficiency of the corrected genome databases in proteomic study, MS/MS spectra of whole proteome extracted from a Bacillus pumilus strain without complete genome sequence were searched against the protein sequence databases derived from the complete reference genome sequence of a homologous bacterium and from the corrected genome sequence. The results indicated that the corrected protein sequence database could significantly facilitate peptide/protein identification. Importantly, this strategy can help to detect novel peptide variants. This strategy of genome correction will promote the development of functional proteomics in nonmodel organisms.
Data from: SoyBase and the Soybean Breeder's Toolbox
agdatacommons.nal.usda.gov
bin
Updated Feb 8, 2024
Cite
David M. Grant (2024). SoyBase and the Soybean Breeder's Toolbox [Dataset]. http://doi.org/10.15482/USDA.ADC/1212265
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.15482/USDA.ADC/1212265
Dataset updated
Feb 8, 2024
Dataset provided by
Ag Data Commons
Authors
David M. Grant
License
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Description
SoyBase is a repository for genetics, genomics and related data resources for soybean. It contains current genetic, physical and genomic sequence maps integrated with qualitative and quantitative traits. The SoyBase database was established in the 1990s as the USDA Soybean Genetics Database. Originally, it contained only genetic information about soybeans such as genetic maps and information about the Mendelian genetics of soybean. In time SoyBase was expanded to include molecular data regarding soybean genes and sequences as they became available. In 2010, the soybean genome sequence was published, and it and supporting gene sequences have been integrated into the SoyBase sequence browser. SoyBase genetic maps were used in the assembly of both the Williams 82 2010 assembly (Wm82.a1.v1) and the newest genome assembly (Wm82.a2.v1). SoyBase also incorporates information about mutant and other soybean genetic stocks and serves as a contact point for ordering strains from those populations. As association analyses continue due to various re-sequencing efforts, SoyBase will also incorporate those data into the soybean genome browser as they become available. Gene expression patterns are also available at SoyBase through the SoyBase expression pages and the Soybean Gene Atlas. Other expression/transcriptome/methylomic data sets also have been and continue to be incorporated into the SoyBase genome browser.
Project No: 3625-21000-062-00D
Accession No: 0425040
Resources in this dataset: Resource Title: SoyBase, the USDA-ARS soybean genetics and genomics database web site. File Name: Web Page, url: https://soybase.org
High Throughput Genomic Sequences Division
dknet.org
Updated Jan 29, 2022
Cite
(2022). High Throughput Genomic Sequences Division [Dataset]. http://identifiers.org/RRID:SCR_002150/resolver
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002150 https://identifiers.org/RRID:SCR_002150/resolver
Dataset updated
Jan 29, 2022
Description
Database of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences. It was created to accommodate a growing need to make unfinished genomic sequence data rapidly available to the scientific community in a coordinated effort among the International Nucleotide Sequence databases, DDBJ, EMBL, and GenBank. Sequences are prepared for submission by using NCBI's software tools Sequin or tbl2asn. Each center has an FTP directory into which new or updated sequence files are placed. Sequence data in this division are available for BLAST homology searches against either the htgs database or the month database, which includes all new submissions for the prior month. Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first-pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone, which together make up more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences, and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data are unfinished and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank.
High Throughput Genomic Sequences Division
rrid.site
Cite
High Throughput Genomic Sequences Division [Dataset]. http://identifiers.org/RRID:SCR_002150
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002150
Description
Database of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences. It was created to accommodate a growing need to make unfinished genomic sequence data rapidly available to the scientific community in a coordinated effort among the International Nucleotide Sequence databases, DDBJ, EMBL, and GenBank. Sequences are prepared for submission by using NCBI's software tools Sequin or tbl2asn. Each center has an FTP directory into which new or updated sequence files are placed. Sequence data in this division are available for BLAST homology searches against either the htgs database or the month database, which includes all new submissions for the prior month. Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first-pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone, which together make up more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences, and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data are unfinished and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank.
Human Gene and Protein Database (HGPD)
scicrunch.org
Updated Nov 23, 2008
Cite
(2008). Human Gene and Protein Database (HGPD) [Dataset]. http://identifiers.org/RRID:SCR_002889
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002889
Dataset updated
Nov 23, 2008
Description
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on January 4, 2023. The Human Gene and Protein Database presents SDS-PAGE patterns and other information on human genes and proteins. The HGPD was constructed from full-length cDNAs. For conversion to Gateway entry clones, we first determined an open reading frame (ORF) region in each cDNA meeting the criteria. Those ORF regions were PCR-amplified utilizing selected resource cDNAs as templates. All the details of the construction and utilization of entry clones will be published elsewhere. Amino acid and nucleotide sequences of an ORF for each cDNA and sequence differences of Gateway entry clones from source cDNAs are presented in the GW: Gateway Summary window. Utilizing those clones with a very efficient cell-free protein synthesis system featuring wheat germ, we have produced a large number of human proteins in vitro. Expressed proteins were detected in almost all cases. Proteins in both total and supernatant fractions are shown in the PE: Protein Expression window. In addition, we have also successfully expressed proteins in HeLa cells and determined subcellular localizations of human proteins. These biological data are presented on the frame of cDNA clusters in the Human Gene and Protein Database. To build the basic frame of HGPD, sequences of FLJ full-length cDNAs and others deposited in public databases (Human ESTs, RefSeq, Ensembl, MGC, etc.) are assembled onto the genome sequences (NCBI Build 35 (UCSC hg17)). The majority of analysis data for cDNA sequences in HGPD are shared with the FLJ Human cDNA Database (http://flj.hinv.jp/) constructed as a human cDNA sequence analysis database focusing on mRNA varieties caused by variations in transcription start site (TSS) and splicing.
Human Gene and Protein Database (HGPD)
neuinfo.org
Updated Nov 23, 2008
Cite
(2008). Human Gene and Protein Database (HGPD) [Dataset]. http://identifiers.org/RRID:SCR_002889
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002889
Dataset updated
Nov 23, 2008
Description
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on January 4, 2023. The Human Gene and Protein Database presents SDS-PAGE patterns and other information on human genes and proteins. The HGPD was constructed from full-length cDNAs. For conversion to Gateway entry clones, we first determined an open reading frame (ORF) region in each cDNA meeting the criteria. Those ORF regions were PCR-amplified utilizing selected resource cDNAs as templates. All the details of the construction and utilization of entry clones will be published elsewhere. Amino acid and nucleotide sequences of an ORF for each cDNA and sequence differences of Gateway entry clones from source cDNAs are presented in the GW: Gateway Summary window. Utilizing those clones with a very efficient cell-free protein synthesis system featuring wheat germ, we have produced a large number of human proteins in vitro. Expressed proteins were detected in almost all cases. Proteins in both total and supernatant fractions are shown in the PE: Protein Expression window. In addition, we have also successfully expressed proteins in HeLa cells and determined subcellular localizations of human proteins. These biological data are presented on the frame of cDNA clusters in the Human Gene and Protein Database. To build the basic frame of HGPD, sequences of FLJ full-length cDNAs and others deposited in public databases (Human ESTs, RefSeq, Ensembl, MGC, etc.) are assembled onto the genome sequences (NCBI Build 35 (UCSC hg17)). The majority of analysis data for cDNA sequences in HGPD are shared with the FLJ Human cDNA Database (http://flj.hinv.jp/) constructed as a human cDNA sequence analysis database focusing on mRNA varieties caused by variations in transcription start site (TSS) and splicing.
Genome Reviews
dknet.org
Updated Jun 15, 2025
Cite
(2025). Genome Reviews [Dataset]. http://identifiers.org/RRID:SCR_007685
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007685
Dataset updated
Jun 15, 2025
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented April 24, 2017. The Genome Reviews database provides an up-to-date, standardized and comprehensively annotated view of the genomic sequence of organisms with completely deciphered genomes. Currently, Genome Reviews contains the genomes of archaea, bacteria, bacteriophages and selected eukaryota. Genome Reviews is available as a MySQL relational database, or a flat file format derived from that in the EMBL Nucleotide Sequence Database. An Ensembl-style browser is now available for Genome Reviews, providing a zoomable graphical view of all chromosomes and plasmids represented in the database. The location and structure of all genes is shown and the distribution of features throughout the sequence is displayed.
ARS Microbial Genomic Sequence Database Server
agdatacommons.nal.usda.gov
bin
Updated Feb 9, 2024
Cite
USDA Agricultural Research Service (2024). ARS Microbial Genomic Sequence Database Server [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/ARS_Microbial_Genomic_Sequence_Database_Server/24661200
Explore at:
Available download formats: bin
Dataset updated
Feb 9, 2024
Dataset provided by
United States Department of Agriculture http://usda.gov/
Agricultural Research Service https://www.ars.usda.gov/
Authors
USDA Agricultural Research Service
License
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Description
This database server is supported in fulfilment of the research mission of the Mycotoxin Prevention and Applied Microbiology Research Unit at the National Center for Agricultural Utilization Research in Peoria, Illinois. The linked website provides access to gene sequence databases for various groups of microorganisms, such as Streptomyces species or Aspergillus species and their relatives, that are the product of ARS research programs. The sequence databases are organized in the BIGSdb (Bacterial Isolate Genomic Sequence Database) software package developed by Keith Jolley and Martin Maiden at Oxford University.
Resources in this dataset: Resource Title: ARS Microbial Genomic Sequence Database Server. File Name: Web Page, url: http://199.133.98.43
GenBank
neuinfo.org
Updated Sep 17, 2024
Cite
(2024). GenBank [Dataset]. http://identifiers.org/RRID:SCR_002760
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002760
Dataset updated
Sep 17, 2024
Description
NIH genetic sequence database that provides an annotated collection of all publicly available DNA sequences for almost 280,000 formally described species (Jan 2014). These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. It is part of the International Nucleotide Sequence Database Collaboration, and daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP.
SpBase - Strongylocentrotus purpuratus: the Sea Urchin Genome Database
rrid.site
Updated Jan 20, 2026
Cite
(2026). SpBase - Strongylocentrotus purpuratus: the Sea Urchin Genome Database [Dataset]. http://identifiers.org/RRID:SCR_007441
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007441
Dataset updated
Jan 20, 2026
Description
SpBase is designed to present the results of the genome sequencing project for the purple sea urchin. The sequences and annotations emerging from this effort are organized in a database that provides the research community access to those data not normally presented through the National Center for Biotechnology Information and other large databases. Additionally, the unique information that links gene identities and sequences to the plate and well location to the library filters from the Sea Urchin Genome Resource will also be presented. The software used to organize and present the sea urchin genome comes from GMOD, a collection of open source software tools for creating and managing genome-scale biological databases. That sea urchin eggs and embryos have long remained a popular research subject for cell and developmental biologists is one rationale for sequencing the genome. In addition, studies of embryonic development in the California purple sea urchin, Strongylocentrotus purpuratus, have paralleled the emergence of molecular techniques ranging from the characterization of genomic repeat sequences in the 1970s to the elucidation of gene regulatory networks in recent times. The parent of this site, SUGP, was meant to provide a focal point for the exchange of genomic information as the genome of the purple sea urchin was being sequenced. Over these past years it has served as a repository for small sequencing projects and a source of sequence information useful for gene discovery projects. Here one could find information on macro-array libraries of cDNAs from the purple sea urchin and genomic DNA from several species. In addition, a Sequence Tag Connector (STC) collection has been assembled from 5% of the genome sequence and a very extensive repeat sequence catalog prepared. All of the sequence data that we maintained at SUGP was incorporated into the new SpBase. Of course, it is all in public sequence databases such as the National Center for Biotechnology Information as well. Some additional sequence information is available at the Resource Center of the German Human Genome Project. With the publication of The Genome of the Sea Urchin Strongylocentrotus purpuratus by The Sea Urchin Genome Sequencing Consortium, a link to the first 9941 gene annotations is now publicly available. The effort to sequence the whole purple sea urchin genome was a cooperative one that included contributions from the Sea Urchin Genome Facility here at the Center for Computational Regulatory Genomics, Beckman Institute, Caltech, and support from the Human Genome Research Institute of the National Institutes of Health. The sequencing was done at the Baylor College of Medicine, Human Genome Sequencing Center, Houston, Texas. Funding was approved based on an initiative submitted by the Sea Urchin Genome Advisory Committee.
Data from: Cacao Genome Database
agdatacommons.nal.usda.gov
bin
Updated Feb 9, 2024
Cite
Raymond J. Schnell; Alan W. Meerow; Tomas Ayala-Silva; Osman Gutierrez; David Kuhn; Cecile L. Tondo; Juan Carlos Motamayor (2024). Cacao Genome Database [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Cacao_Genome_Database/24852516
Explore at:
Available download formats: bin
Dataset updated
Feb 9, 2024
Dataset provided by
Agricultural Research Service https://www.ars.usda.gov/
Authors
Raymond J. Schnell; Alan W. Meerow; Tomas Ayala-Silva; Osman Gutierrez; David Kuhn; Cecile L. Tondo; Juan Carlos Motamayor
License
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Description
Not only is cacao the basic ingredient in the world’s favorite confection, chocolate, but it provides a livelihood for over 6.5 million farmers in Africa, South America and Asia and ranks as one of the top ten agriculture commodities in the world. Historically, cocoa production has been plagued by serious losses due to pests and diseases. The release of the cacao genome sequence will provide researchers with access to the latest genomic tools, enabling more efficient research and accelerating the breeding process, thereby expediting the release of superior cacao cultivars. The sequenced genotype, Matina 1-6, is representative of the genetic background most commonly found in the cacao producing countries, enabling results to be applied immediately and broadly to current commercial cultivars. Matina 1-6 is highly homozygous, which greatly reduces the complexity of the sequence assembly process. While the sequence provided is a preliminary release, it already covers 92% of the genome, with approximately 35,000 genes. We will continue to refine the assembly and annotation, working toward a complete finished sequence. Updates will be made available via the main project website.
Resources in this dataset: Resource Title: Cacao Genome Database. File Name: Web Page, url: http://www.cacaogenomedb.org/
Genome Database for Rosaceae
dknet.org
Updated Jan 29, 2022
Cite
(2022). Genome Database for Rosaceae [Dataset]. http://identifiers.org/RRID:SCR_012756
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_012756
Dataset updated
Jan 29, 2022
Description
GDR is a curated and integrated web-based relational database. GDR contains comprehensive data of the genetically anchored peach physical map, annotated EST databases of apple, peach, almond, cherry, rose, raspberry and strawberry, Rosaceae maps and markers and all publicly available Rosaceae sequences. Annotations of ESTs include contig assembly, putative function, simple sequence repeats, ORFs, Gene Ontology and anchored position to the peach physical map where applicable. Our integrated map viewer provides graphical interface to the genetic, transcriptome and physical mapping information. We continue to add Rosaceae map data to CMap, a web-based tool that allows users to view comparisons of genetic and physical maps. ESTs, BACs and markers can be queried by various categories and the search result sites are linked to the integrated map viewer or to the WebFPC physical map sites. In addition to browsing and querying the database, users can compare their sequences with the annotated GDR sequences via a dedicated sequence similarity server running either the BLAST or FASTA algorithm, search their sequences for microsatellites using the SSR server or assemble their ESTs using the CAP3 Server.
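As a rough illustration of what a microsatellite (SSR) search involves, the sketch below scans a DNA sequence for short tandem repeats with a regular expression. It is not the algorithm used by the GDR SSR server, which the description does not specify; the function name and the default thresholds (unit length 2 to 6, at least 4 copies) are assumptions chosen for the example.

```python
import re
from typing import List, Tuple


def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 6,
              min_repeats: int = 4) -> List[Tuple[int, str, int]]:
    """Return (start, unit, copies) for each tandem repeat found in seq.

    The lazy quantifier makes the regex prefer the shortest repeating unit,
    so ACACACAC is reported with unit AC rather than ACAC.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


# Hypothetical example sequence: six copies of AC flanked by unrelated bases.
example = "GGT" + "AC" * 6 + "TTG"
ssrs = find_ssrs(example)
print(ssrs)
```

Dedicated SSR tools typically apply different minimum copy numbers per unit length and also handle mononucleotide and imperfect repeats, which this sketch ignores.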
Interrupted CoDing Sequence Database
rrid.site
Updated Jan 21, 2026
Cite
(2026). Interrupted CoDing Sequence Database [Dataset]. http://identifiers.org/RRID:SCR_002949/resolver?q=&i=rrid
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002949 https://identifiers.org/RRID:SCR_002949/resolver?q=&i=rrid
Dataset updated
Jan 21, 2026
Description
Database of interrupted coding sequences detected by a similarity-based approach in complete prokaryotic genomes. The definition of each interrupted gene is provided as well as the ICDS genomic localization with the surrounding sequence. To facilitate the experimental characterization of ICDS, optimized primers are proposed for re-sequencing purposes. The database is accessible by BLAST search or by genome. 118 Genomes are available in the database.
MARMICRODB database for taxonomic classification of (marine) metagenomes
zenodo.org
application/gzip, bin +3
Updated Mar 20, 2020
Cite
Shane L Hogle (2020). MARMICRODB database for taxonomic classification of (marine) metagenomes [Dataset]. http://doi.org/10.5281/zenodo.3520509
Explore at:
Available download formats: bin, application/gzip, tsv, html, bz2
Unique identifier
https://doi.org/10.5281/zenodo.3520509
Dataset updated
Mar 20, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Shane L Hogle
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Introduction:
This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.

Motivation:
We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju's sensitivity and classification accuracy depend on the composition of the reference database, and the highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from an environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.

Results/Description:
MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju-formatted Burrows-Wheeler index, translated genes, the custom taxonomy hierarchy, an interactive kronaplot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.

Methods:
The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes were performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using checkM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a checkM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 for further analysis and confirmed the taxonomic assignment using BLAST matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with checkM. For Synechococcus we used the CheckM taxonomic-specific workflow with the genus Synechococcus. After the custom CheckM quality control, we excluded from downstream analysis any genome bins that had an estimated quality < 30, defined as %completeness − 5 × %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22], excluded protein sequences shorter than 20 or longer than 20000 amino acids, removed non-standard amino acid residues, and condensed redundant protein sequences to a single representative sequence to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23].
The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
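The bin-quality cutoff and protein filtering described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' scripts: the function names are hypothetical, "non-standard residues" is assumed to mean anything outside the 20 canonical amino acid letters, the length filter is assumed to run before residue removal, and the LCA assignment itself is not implemented (the sketch only collects the source bins that share each sequence).

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumption: "standard" residues are the 20 canonical amino acid letters.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def bin_quality(completeness: float, contamination: float) -> float:
    """Quality as defined in the text: %completeness - 5 x %contamination."""
    return completeness - 5.0 * contamination


def passes_quality(completeness: float, contamination: float,
                   min_quality: float = 30.0) -> bool:
    # Bins with quality < 30 are excluded, so a score of exactly 30 is kept.
    return bin_quality(completeness, contamination) >= min_quality


def clean_protein(seq: str, min_len: int = 20,
                  max_len: int = 20000) -> Optional[str]:
    """Apply the length bounds, then drop non-standard residues.

    Returns None when the sequence is rejected by the length filter.
    The order of the two steps is an assumption of this sketch.
    """
    if len(seq) < min_len or len(seq) > max_len:
        return None
    return "".join(c for c in seq.upper() if c in STANDARD_AA)


def condense(proteins: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collapse identical sequences to one representative.

    `proteins` yields (bin_id, sequence) pairs. The result maps each unique
    sequence to the set of bins it came from; an LCA taxonomy identifier
    would then be computed over that set (not shown here).
    """
    unique: Dict[str, Set[str]] = {}
    for bin_id, seq in proteins:
        unique.setdefault(seq, set()).add(bin_id)
    return unique
```

For example, a bin estimated at 40% complete with 2% contamination scores exactly 30 and is retained, while the same bin at 2.5% contamination scores 27.5 and is dropped.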

The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high-quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes for the fixed reference phylogeny are estimated to be > 90% complete, and SAR11 genomes are estimated to be > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the hmmer software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to only allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce the phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL within the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree.
We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and we added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
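The alignment column filter mentioned above (drop columns covered by fewer than 50% of taxa, or with no residue above 25% frequency) can be illustrated with a short Python sketch. It is not the GTDB-Tk implementation: the function names are hypothetical, gaps are assumed to be written as "-" or ".", and residue frequency is assumed to be measured among the non-gap entries of a column.

```python
from collections import Counter
from typing import List

# Assumption: these characters denote gaps in the alignment.
GAP_CHARS = set("-.")


def keep_column(column: str, min_taxa_frac: float = 0.5,
                min_residue_freq: float = 0.25) -> bool:
    """Return True if an alignment column passes both filters."""
    residues = [c for c in column if c not in GAP_CHARS]
    # Drop columns represented by fewer than 50% of all taxa.
    if not residues or len(residues) < min_taxa_frac * len(column):
        return False
    # Drop columns where no single residue exceeds 25% frequency
    # (frequency computed over non-gap entries; an assumption).
    top_count = Counter(residues).most_common(1)[0][1]
    return top_count / len(residues) > min_residue_freq


def filter_alignment(seqs: List[str]) -> List[str]:
    """Remove failing columns from equal-length aligned sequences."""
    if not seqs:
        return []
    kept = [i for i, col in enumerate(zip(*seqs)) if keep_column("".join(col))]
    return ["".join(s[i] for i in kept) for s in seqs]
```

With the four toy sequences `MKA-`, `MRC-`, `MKD-` and `M-EA`, the third column is dropped because every residue occurs at exactly 25%, and the fourth because only one of four taxa is represented.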

Software/databases used:
checkM v1.0.11 [16]
HMMER v3.1b2 (http://hmmer.org/)
prodigal v2.6.3 [22]
trimAl v1.4.rev22 [24]
AliView v1.18.1 [33] [34]
Phyx v0.1 [35]
RAxML v8.2.12 [36]
Pplacer v1.1alpha [28]
GTDB-Tk v0.1.3 [19]
Kaiju v1.6.0 [34]
GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
NCBI Taxonomy (accessed 2018-07-02) [23]
TIGRFAM v14.0 [37]
PFAM v31.0 [38]

Discussion/Caveats:
MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans Expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user's goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e., checking for contamination, dereplicating, building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes or reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified, as well as the fasta file of protein sequences. However, the Kaiju index will need to be rebuilt, and the user will require a high
Genome sequence assembly and annotation of MATA and MATB strains of Yarrowia...
agdatacommons.nal.usda.gov
bin
Updated Jan 22, 2026
Cite
Narges Zali; Osama E. Demerdash; Kapeel Chougule; Zhenyuan Lu; Doreen Ware; Bruce Stillman (2026). Genome sequence assembly and annotation of MATA and MATB strains of Yarrowia lipolytica [Dataset]. http://doi.org/10.5061/dryad.v15dv427f
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5061/dryad.v15dv427f
Dataset updated
Jan 22, 2026
Dataset provided by
Dryad
Authors
Narges Zali; Osama E. Demerdash; Kapeel Chougule; Zhenyuan Lu; Doreen Ware; Bruce Stillman
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Yeast is commonly used in molecular and cell biology research, and Yarrowia lipolytica is favored by bio-engineers due to its ability to produce copious amounts of lipids, chemicals, and enzymes for industrial applications. Y. lipolytica is a dimorphic yeast that can proliferate in aerobic and hydrophobic environments conducive to industrial use. However, there is limited knowledge about the basic molecular biology of this yeast, including how the genome is duplicated and how gene silencing occurs. Genome sequences of Y. lipolytica strains have offered insights into the genetic basis of this yeast species and have facilitated the development of new industrial applications. Although previous studies have reported the genome sequences of a few Y. lipolytica strains, it is of value to have more precise sequences and annotation, particularly for studies of the biology of this yeast. To further study and characterize the molecular biology of this microorganism, a high-quality reference genome assembly and annotation have been produced for two related Y. lipolytica strains of opposite mating type, MATA (E122) and MATB (22301-5). The combination of short-read and long-read sequencing of genomic DNA and short-read and long-read sequencing of transcript cDNAs allowed the genome assembly and a comparison with a distantly related Yarrowia strain.
Chloroplast Genome Database
dknet.org
Updated Jan 29, 2022
Cite
(2022). Chloroplast Genome Database [Dataset]. http://identifiers.org/RRID:SCR_013421
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_013421
Dataset updated
Jan 29, 2022
Description
The Chloroplast Genome Database contains annotated chloroplast/plastid genomes from the NCBI Organelle Genomes section. Users can search for genes by their annotated names, conduct flexible BLAST searches, download protein and nucleotide sequences extracted from a selected chloroplast genome, and browse the putative protein families (tribes) created using TribeMCL.
MicrobesOnline Comparative Genomics Database
datacatalog.mskcc.org
Updated Nov 13, 2019
Cite
Virtual Institute for Microbial Stress and Survival (2019). MicrobesOnline Comparative Genomics Database [Dataset]. https://datacatalog.mskcc.org/dataset/10391
Explore at:
Dataset updated
Nov 13, 2019
Dataset provided by
Virtual Institute for Microbial Stress and Survival
Description
The MicrobesOnline genome database contains over 1000 prokaryotic genomes. Genomes were last updated in late 2011 and no further database updates are planned.

All genomes are analyzed through the VIMSS genome pipeline. We use publicly available sequence analysis tools and databases to search for homologs (NCBI BLAST, UCSC Blat, SwissProt, COG) and protein domains (HMMer, InterPro), to assign gene ontologies (Gene Ontology Consortium) and EC numbers and to map the metabolic pathways (KEGG). We then link the orthology relationships between genes and predict operon structures.

Most genome data is downloaded from RefSeq. When an incomplete genome is directly downloaded from a sequencing center, we submit the genome sequence to RAST for automated annotation. For all genomes, we also search for CRISPR regions using PILER-CR and CRT.
