100+ datasets found

NCBI Genome Survey Sequences Database
dknet.org
rrid.site
+2more
Updated Aug 15, 2024
Cite
(2024). NCBI Genome Survey Sequences Database [Dataset]. http://identifiers.org/RRID:SCR_002146
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002146
Dataset updated
Aug 15, 2024
Description
Database of unannotated, short, single-read, primarily genomic sequences from GenBank, including random survey sequences, clone-end sequences and exon-trapped sequences. The GSS division of GenBank is similar to the EST division, except that most of the sequences are genomic in origin rather than cDNA (mRNA). Two classes (exon-trapped products and gene-trapped products) may be derived via a cDNA intermediate. Care should be taken when analyzing sequences from either of these classes, as a splicing event could have occurred and the sequence represented in the record may be interrupted when compared to genomic sequence. The GSS division contains (but is not limited to) the following types of data:
* random single-pass read genome survey sequences
* cosmid/BAC/YAC end sequences
* exon-trapped genomic sequences
* Alu PCR sequences
* transposon-tagged sequences
Although dbGSS sequences are incorporated into the GSS division of GenBank, annotation in dbGSS is more comprehensive and includes detailed information about the contributors, experimental conditions, and genetic map locations.
Genome Sequence Data Set01
catalog.data.gov
data.amerigeoss.org
Updated Nov 12, 2020
+ more versions
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Genome Sequence Data Set01 [Dataset]. https://catalog.data.gov/dataset/genome-sequence-data-set01-d2862
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency: http://www.epa.gov/
Description
The fasta files (Genome_Set01.zip) contain the reference-assisted de novo assemblies (as contigs) of three Escherichia coli isolates. The table contains rows as isolates (yellow) and columns as attributes (green) for each individual genome. This dataset is associated with the following publication: Gomez-Alvarez, V., and J. Hoelle-Schwalbach. Draft Genome Sequences of Antibiotic-Resistant Escherichia coli Isolates from U.S. Wastewater Treatment Plants. Microbiology Resource Announcements. American Society for Microbiology, Washington, DC, USA, 8(23): e00351-19, (2019).
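Because the description says the archive holds assemblies as contigs in FASTA format, a reader may want a quick way to summarize such a file. The following is a minimal sketch, not the authors' pipeline: the function names (`read_fasta`, `n50`, `summarize`) and the example records are hypothetical, and it assumes plain-text FASTA with `>` header lines. The N50 statistic is a common assembly summary and is not mentioned in the dataset record itself.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {contig_id: sequence}."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            current = tokens[0] if tokens else f"unnamed_{len(records) + 1}"
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def n50(lengths: List[int]) -> int:
    """Largest length L such that contigs of length >= L cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(contigs: Dict[str, str]) -> Dict[str, int]:
    """Return contig count, total assembled bases, and N50."""
    lengths = [len(seq) for seq in contigs.values()]
    return {"contigs": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}


# Hypothetical three-contig example standing in for one isolate's FASTA file.
example = """>contig_1 hypothetical
ACGTACGTAC
GT
>contig_2
ACGT
>contig_3
AC
""".splitlines()

stats = summarize(read_fasta(example))
print(stats)
```

To apply this to a real file, the same functions can be given an open file handle, for example `summarize(read_fasta(open(path)))`, where `path` points to one of the unzipped FASTA files.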
Genome Sequence Data Set02
catalog.data.gov
s.cnmilf.com
Updated Mar 15, 2021
Cite
U.S. EPA Office of Research and Development (ORD) (2021). Genome Sequence Data Set02 [Dataset]. https://catalog.data.gov/dataset/genome-sequence-data-set02
Explore at:
Dataset updated
Mar 15, 2021
Dataset provided by
United States Environmental Protection Agency: http://www.epa.gov/
Description
The Whole Genome Shotgun project has been deposited in DDBJ/ENA/GenBank under the BioProject PRJNA487286 with the following accession numbers CP061840 (chromosome) and CP061841 (plasmid). The raw sequence reads have been submitted to the NCBI SRA under the accession numbers SRR13076822 and SRR13076823. This dataset is associated with the following publication: Gomez-Alvarez, V., L. Boczek, I. Raffenberg, and R. Revetta. Closed Genome and Plasmid Sequences of Legionella pneumophila AW-13-4, Isolated from a Hot Water Loop System of a Large Occupational Building. Microbiology Resource Announcements. American Society for Microbiology, Washington, DC, USA, 10(1): e01276-20, (2021).
Genome Reviews
neuinfo.org
dknet.org
+2more
Updated Oct 31, 2005
Cite
(2005). Genome Reviews [Dataset]. http://identifiers.org/RRID:SCR_007685
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007685
Dataset updated
Oct 31, 2005
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented April 24, 2017. The Genome Reviews database provides an up-to-date, standardized and comprehensively annotated view of the genomic sequence of organisms with completely deciphered genomes. Currently, Genome Reviews contains the genomes of archaea, bacteria, bacteriophages and selected eukaryota. Genome Reviews is available as a MySQL relational database, or a flat file format derived from that in the EMBL Nucleotide Sequence Database. An Ensembl-style browser is now available for Genome Reviews, providing a zoomable graphical view of all chromosomes and plasmids represented in the database. The location and structure of all genes is shown and the distribution of features throughout the sequence is displayed.
Data from: SoyBase and the Soybean Breeder's Toolbox
agdatacommons.nal.usda.gov
s.cnmilf.com
+3more
bin
Updated Feb 8, 2024
Cite
David M. Grant (2024). SoyBase and the Soybean Breeder's Toolbox [Dataset]. http://doi.org/10.15482/USDA.ADC/1212265
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.15482/USDA.ADC/1212265
Dataset updated
Feb 8, 2024
Dataset provided by
Ag Data Commons
Authors
David M. Grant
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
SoyBase is a repository for genetics, genomics and related data resources for soybean. It contains current genetic, physical and genomic sequence maps integrated with qualitative and quantitative traits. The SoyBase database was established in the 1990s as the USDA Soybean Genetics Database. Originally, it contained only genetic information about soybeans, such as genetic maps and information about the Mendelian genetics of soybean. In time, SoyBase was expanded to include molecular data regarding soybean genes and sequences as they became available. In 2010, the soybean genome sequence was published, and it and supporting gene sequences have been integrated into the SoyBase sequence browser. SoyBase genetic maps were used in the assembly of both the Williams 82 2010 assembly (Wm82.a1.v1) and the newest genome assembly (Wm82.a2.v1). SoyBase also incorporates information about mutant and other soybean genetic stocks and serves as a contact point for ordering strains from those populations. As association analyses continue due to various re-sequencing efforts, SoyBase will also incorporate those data into the soybean genome browser as they become available. Gene expression patterns are also available at SoyBase through the SoyBase expression pages and the Soybean Gene Atlas. Other expression/transcriptome/methylomic data sets also have been and continue to be incorporated into the SoyBase genome browser.
Project No: 3625-21000-062-00D
Accession No: 0425040
Resources in this dataset:
Resource Title: SoyBase, the USDA-ARS soybean genetics and genomics database web site. File Name: Web Page, url: https://soybase.org
T4-like genome database
neuinfo.org
rrid.site
+2more
Updated Nov 1, 2025
Cite
(2025). T4-like genome database [Dataset]. http://identifiers.org/RRID:SCR_005367
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005367
Dataset updated
Nov 1, 2025
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 22, 2016. A database of information on bacterial phages. It contains multiple phage genomes, which users can search with BLAST and MegaBLAST, and also hosts a Phage Forum in which users can discuss phage data. Interactive browsing of completed phage genomes is available using the program. The browser allows users to scan the genome for particular features and to download sequence information plus analyses of those features. Views of the genome are generated showing named genes, BLAST similarities to other phages, predicted tRNAs and other sequence features.
ARS Microbial Genomic Sequence Database Server
agdatacommons.nal.usda.gov
catalog.data.gov
bin
Updated Feb 9, 2024
Cite
USDA Agricultural Research Service (2024). ARS Microbial Genomic Sequence Database Server [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/ARS_Microbial_Genomic_Sequence_Database_Server/24661200
Explore at:
Available download formats: bin
Dataset updated
Feb 9, 2024
Dataset provided by
Agricultural Research Service: https://www.ars.usda.gov/
Authors
USDA Agricultural Research Service
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
This database server is supported in fulfilment of the research mission of the Mycotoxin Prevention and Applied Microbiology Research Unit at the National Center for Agricultural Utilization Research in Peoria, Illinois. The linked website provides access to gene sequence databases for various groups of microorganisms, such as Streptomyces species or Aspergillus species and their relatives, that are the product of ARS research programs. The sequence databases are organized in the BIGSdb (Bacterial Isolate Genomic Sequence Database) software package developed by Keith Jolley and Martin Maiden at Oxford University.
Resources in this dataset:
Resource Title: ARS Microbial Genomic Sequence Database Server. File Name: Web Page, url: http://199.133.98.43
3D-Genomics Database
dknet.org
scicrunch.org
+2more
Updated Jan 29, 2022
Cite
(2022). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007430
Dataset updated
Jan 29, 2022
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C. briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E. cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G. theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N. crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P. falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome.
The results of whole genome sequence database (the TrueBac™ ID-Genome...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Oh Joo Kweon; Yong Kwan Lim; Hye Ryoun Kim; Tae-Hyoung Kim; Sung-min Ha; Mi-Kyung Lee (2023). The results of whole genome sequence database (the TrueBac™ ID-Genome system) matching for the novel Cupriavidus species. [Dataset]. http://doi.org/10.1371/journal.pone.0232850.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0232850.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS: http://plos.org/
Authors
Oh Joo Kweon; Yong Kwan Lim; Hye Ryoun Kim; Tae-Hyoung Kim; Sung-min Ha; Mi-Kyung Lee
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The results of whole genome sequence database (the TrueBac™ ID-Genome system) matching for the novel Cupriavidus species.
High Throughput Genomic Sequences Division
rrid.site
scicrunch.org
+1more
Cite
High Throughput Genomic Sequences Division [Dataset]. http://identifiers.org/RRID:SCR_002150
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002150
Description
Database of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences. It was created to accommodate a growing need to make unfinished genomic sequence data rapidly available to the scientific community in a coordinated effort among the International Nucleotide Sequence databases, DDBJ, EMBL, and GenBank. Sequences are prepared for submission by using NCBI's software tools Sequin or tbl2asn. Each center has an FTP directory into which new or updated sequence files are placed. Sequence data in this division are available for BLAST homology searches against either the htgs database or the month database, which includes all new submissions for the prior month. Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first-pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone, which together make up more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences, and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data are unfinished and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank.
Genome Database for Rosaceae
neuinfo.org
scicrunch.org
+2more
Updated Jun 20, 2008
+ more versions
Cite
(2008). Genome Database for Rosaceae [Dataset]. http://identifiers.org/RRID:SCR_012756
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_012756
Dataset updated
Jun 20, 2008
Description
GDR is a curated and integrated web-based relational database. GDR contains comprehensive data on the genetically anchored peach physical map, annotated EST databases of apple, peach, almond, cherry, rose, raspberry and strawberry, Rosaceae maps and markers, and all publicly available Rosaceae sequences. Annotations of ESTs include contig assembly, putative function, simple sequence repeats, ORFs, Gene Ontology and anchored position to the peach physical map where applicable. Our integrated map viewer provides a graphical interface to the genetic, transcriptome and physical mapping information. We continue to add Rosaceae map data to CMap, a web-based tool that allows users to view comparisons of genetic and physical maps. ESTs, BACs and markers can be queried by various categories, and the search result sites are linked to the integrated map viewer or to the WebFPC physical map sites. In addition to browsing and querying the database, users can compare their sequences with the annotated GDR sequences via a dedicated sequence similarity server running either the BLAST or FASTA algorithm, search their sequences for microsatellites using the SSR server, or assemble their ESTs using the CAP3 Server.
Data from: Cacao Genome Database
agdatacommons.nal.usda.gov
datasets.ai
+2more
bin
Updated Feb 9, 2024
+ more versions
Cite
Raymond J. Schnell; Alan W. Meerow; Tomas Ayala-Silva; Osman Gutierrez; David Kuhn; Cecile L. Tondo; Juan Carlos Motamayor (2024). Cacao Genome Database [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Cacao_Genome_Database/24852516
Explore at:
Available download formats: bin
Dataset updated
Feb 9, 2024
Dataset provided by
Agricultural Research Service: https://www.ars.usda.gov/
Authors
Raymond J. Schnell; Alan W. Meerow; Tomas Ayala-Silva; Osman Gutierrez; David Kuhn; Cecile L. Tondo; Juan Carlos Motamayor
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
Not only is cacao the basic ingredient in the world's favorite confection, chocolate, but it provides a livelihood for over 6.5 million farmers in Africa, South America and Asia and ranks as one of the top ten agricultural commodities in the world. Historically, cocoa production has been plagued by serious losses due to pests and diseases. The release of the cacao genome sequence will provide researchers with access to the latest genomic tools, enabling more efficient research and accelerating the breeding process, thereby expediting the release of superior cacao cultivars. The sequenced genotype, Matina 1-6, is representative of the genetic background most commonly found in the cacao-producing countries, enabling results to be applied immediately and broadly to current commercial cultivars. Matina 1-6 is highly homozygous, which greatly reduces the complexity of the sequence assembly process. While the sequence provided is a preliminary release, it already covers 92% of the genome, with approximately 35,000 genes. We will continue to refine the assembly and annotation, working toward a complete finished sequence. Updates will be made available via the main project website.
Resources in this dataset:
Resource Title: Cacao Genome Database. File Name: Web Page, url: http://www.cacaogenomedb.org/
Data from: Pinus taeda Genome sequencing
agdatacommons.nal.usda.gov
datasetcatalog.nlm.nih.gov
bin
Updated Mar 11, 2025
Cite
UC Davis; TreeGenes Database (2025). Pinus taeda Genome sequencing [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Pinus_taeda_Genome_sequencing/25079168
Explore at:
Available download formats: bin
Dataset updated
Mar 11, 2025
Dataset provided by
National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/
Authors
UC Davis; TreeGenes Database
License
https://rightsstatements.org/vocab/UND/1.0/
Description
Development of a high quality reference genome sequence for loblolly pine, Douglas-fir and sugar pine by means that can serve as a model approach for sequencing other large, complex genomes and empower the forest tree biology research community and the broader biological research community in the practical use and application of this resource.
Gene database of genes on different human chromosomes
figshare.com
zip
Updated Feb 12, 2021
Cite
Wenfa Ng (2021). Gene database of genes on different human chromosomes [Dataset]. http://doi.org/10.6084/m9.figshare.13932119.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.13932119.v1
Dataset updated
Feb 12, 2021
Dataset provided by
Figshare: http://figshare.com/
Authors
Wenfa Ng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Since the advent of the genomics age, which began in the 1990s with the sequencing of a few model bacterial and eukaryotic genomes, humans have been on a quest to sequence many species in our ecosystems to find commonalities and differences in sequence that help explain phenotypes. This led to the field of functional genomics, which gave us the capability to automatically annotate a genome using sequence homology as a probe. This work sought to provide a database of all genes in the human genome at a granular level by categorizing the genetic repertoire of humans at the chromosomal level. Specifically, in-house MATLAB genome analysis software was used to parse the annotated genome sequence files of the different chromosomes of the human genome. The variables output for each gene include gene name, gene function, promoter sequence and gene sequence. Such information, when aggregated at the level of chromosomes and the entire genome, should inform further studies seeking to unravel the mysteries that link gene sequence, gene expression, cell differentiation, and organismal developmental trajectories and phenotypes.
GenBank
rrid.site
dknet.org
Updated Jan 29, 2022
Cite
(2022). GenBank [Dataset]. http://identifiers.org/RRID:SCR_002760
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002760
Dataset updated
Jan 29, 2022
Description
NIH genetic sequence database that provides an annotated collection of all publicly available DNA sequences for almost 280,000 formally described species (Jan 2014). These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. It is part of the International Nucleotide Sequence Database Collaboration, and daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP.
Animal Genome Database
neuinfo.org
rrid.site
+1more
Updated Jan 29, 2022
Cite
(2022). Animal Genome Database [Dataset]. http://identifiers.org/RRID:SCR_008165
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_008165
Dataset updated
Jan 29, 2022
Description
Database of comparative gene mapping between species to assist the mapping of the genes related to phenotypic traits in livestock. The linkage maps, cytogenetic maps, and polymerase chain reaction primers of pig, cattle, mouse and human, and their references, have been included in the database, and the correspondence among species has been stipulated in the database. AGP is an animal genome database developed on a Unix workstation and maintained by a relational database management system. It is a joint project of the National Institute of Agrobiological Sciences (NIAS) and the Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries (STAFF-Institute), in cooperation with other related research institutes. AGP also contains the Pig Expression Data Explorer (PEDE), a database of porcine EST collections derived from full-length cDNA libraries and full-length sequences of the cDNA clones picked from the EST collection. The EST sequences have been clustered and assembled, and their similarity to sequences in RefSeq and UniGene determined. The PEDE database system was constructed to store sequences and similarity data of swine full-length cDNA libraries and to make them available to users. It provides interfaces for keyword and ID searches of BLAST results and enables users to obtain sequence data and names of clones of interest. Putative SNPs in EST assemblies have been classified according to breed specificity and their effect on coding amino acids, and the assemblies are equipped with an SNP search interface. The database contains porcine nucleotide sequences and cDNA clones that are ready for analyses such as expression in mammalian cells, because of their high likelihood of containing full-length CDS. PEDE will be useful for researchers who want to explore genes that may be responsible for traits such as disease susceptibility. The database also offers information regarding major and minor porcine-specific antigens, which might be investigated in regard to the use of pigs as models in various medical research applications.
9MM Gallus gallus protein BLAST (tabular).
plos.figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Jun 29, 2023
+ more versions
Cite
Matthys G. Potgieter; Andrew J. M. Nel; Suereta Fortuin; Shaun Garnett; Jerome M. Wendoh; David L. Tabb; Nicola J. Mulder; Jonathan M. Blackburn (2023). 9MM Gallus gallus protein BLAST (tabular). [Dataset]. http://doi.org/10.1371/journal.pcbi.1011163.s014
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pcbi.1011163.s014
Dataset updated
Jun 29, 2023
Dataset provided by
PLOS: http://plos.org/
Authors
Matthys G. Potgieter; Andrew J. M. Nel; Suereta Fortuin; Shaun Garnett; Jerome M. Wendoh; David L. Tabb; Nicola J. Mulder; Jonathan M. Blackburn
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Microbiome research is providing important new insights into the metabolic interactions of complex microbial ecosystems involved in fields as diverse as the pathogenesis of human diseases, agriculture and climate change. Poor correlations typically observed between RNA and protein expression datasets make it hard to accurately infer microbial protein synthesis from metagenomic data. Additionally, mass spectrometry-based metaproteomic analyses typically rely on focused search sequence databases based on prior knowledge for protein identification that may not represent all the proteins present in a set of samples. Metagenomic 16S rRNA sequencing only targets the bacterial component, while whole genome sequencing is at best an indirect measure of expressed proteomes. Here we describe a novel approach, MetaNovo, that combines existing open-source software tools to perform scalable de novo sequence tag matching with a novel algorithm for probabilistic optimization of the entire UniProt knowledgebase to create tailored sequence databases for target-decoy searches directly at the proteome level, enabling metaproteomic analyses without prior expectation of sample composition or metagenomic data generation and compatible with standard downstream analysis pipelines. Results: We compared MetaNovo to published results from the MetaPro-IQ pipeline on 8 human mucosal-luminal interface samples, with comparable numbers of peptide and protein identifications, many shared peptide sequences and a similar bacterial taxonomic distribution compared to that found using a matched metagenome sequence database, but simultaneously identified many more non-bacterial peptides than the previous approaches. MetaNovo was also benchmarked on samples of known microbial composition against matched metagenomic and whole genomic sequence database workflows, yielding many more MS/MS identifications for the expected taxa, with improved taxonomic representation, while also highlighting previously described genome sequencing quality concerns for one of the organisms, and identifying an experimental sample contaminant without prior expectation.
Conclusions: By estimating taxonomic and peptide level information directly on microbiome samples from tandem mass spectrometry data, MetaNovo enables the simultaneous identification of peptides from all domains of life in metaproteome samples, bypassing the need for curated sequence databases to search. We show that the MetaNovo approach to mass spectrometry metaproteomics is more accurate than current gold standard approaches of tailored or matched genomic sequence database searches, can identify sample contaminants without prior expectation and yields insights into previously unidentified metaproteomic signals, building on the potential for complex mass spectrometry metaproteomic data to speak for itself.
Data from: Towards understanding the first genome sequence of a crenarchaeon...
catalog.data.gov
odgavaprod.ogopendata.com
Updated Sep 7, 2025
Cite
National Institutes of Health (2025). Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs) [Dataset]. https://catalog.data.gov/dataset/towards-understanding-the-first-genome-sequence-of-a-crenarchaeon-by-genome-annotation-usi
Explore at:
Dataset updated
Sep 7, 2025
Dataset provided by
National Institutes of Health
Description
Background: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. Results: A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. 
Conclusions: Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange.
Sequencing of Idd regions in the NOD mouse genome
rrid.site
neuinfo.org
+2more
Updated Jan 29, 2022
Cite
(2022). Sequencing of Idd regions in the NOD mouse genome [Dataset]. http://identifiers.org/RRID:SCR_001483
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_001483
Dataset updated
Jan 29, 2022
Description
Genetic variations associated with type 1 diabetes, identified by sequencing regions of the non-obese diabetic (NOD) mouse genome and comparing them with the same regions of the diabetes-resistant C57BL/6J reference mouse. The comparison allows identification of single nucleotide polymorphisms (SNPs) and other genomic variations putatively associated with diabetes in mice. Finished clones from the targeted insulin-dependent diabetes (Idd) candidate regions are displayed in the NOD clone sequence section of the website, where they can be downloaded either as individual clone sequences or as larger contigs that make up the accession golden path (AGP). All sequences are publicly available via the International Nucleotide Sequence Database Collaboration. Two NOD mouse BAC libraries were constructed and the BAC ends sequenced. Clones from the DIL NOD BAC library, constructed by RIKEN Genomic Sciences Centre (Japan) in conjunction with the Diabetes and Inflammation Laboratory (DIL) (University of Cambridge) from the NOD/MrkTac mouse strain, are designated DIL. Clones from the CHORI-29 NOD BAC library, constructed by Pieter de Jong (Children's Hospital, Oakland, California, USA) from the NOD/ShiLtJ mouse strain, are designated CHORI-29. All NOD mouse BAC end-sequences have been submitted to the International Nucleotide Sequence Database Collaboration (INSDC) and deposited in the NCBI trace archive. A clone map was generated from these two libraries by mapping the BAC end-sequences to the latest assembly of the C57BL/6J mouse reference genome sequence. These BAC end-sequence alignments can be visualized in the Ensembl mouse genome browser, where the alignments of both NOD BAC libraries can be accessed through the Distributed Annotation System (DAS). The Mouse Genomes Project has used the Illumina platform to sequence the entire NOD/ShiLtJ genome, which should help to position unaligned BAC end-sequences in novel non-reference regions of the NOD genome.
Further information about the BAC end-sequences, such as their alignment, variation data and Ensembl gene coverage, can be obtained from the NOD mouse FTP site.
China National Center for Bioinformation Genome Sequence Archive for Human...
rrid.site
scicrunch.org
Updated Jul 14, 2025
Cite
(2025). China National Center for Bioinformation Genome Sequence Archive for Human database [Dataset]. http://identifiers.org/RRID:SCR_027207/resolver?q=*&i=rrid
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_027207 https://identifiers.org/RRID:SCR_027207/resolver?q=*&i=rrid
Dataset updated
Jul 14, 2025
Description
Data repository for archiving raw sequence data, providing data storage and sharing services for scientific communities worldwide. It specializes in human genetic data derived from biomedical research.


Cite

(2024). NCBI Genome Survey Sequences Database [Dataset]. http://identifiers.org/RRID:SCR_002146

NCBI Genome Survey Sequences Database

RRID:SCR_002146, SCR_015063, nif-0000-20938, NCBI Genome Survey Sequences Database (RRID:SCR_002146), GSS, Entrez GSS, NCBI dbGSS, dbGSS

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://identifiers.org/RRID:SCR_002146

Dataset updated

Aug 15, 2024



