100+ datasets found

GenBank
catalog.data.gov
healthdata.gov
Updated Jul 26, 2023
Cite
National Institutes of Health (NIH) (2023). GenBank [Dataset]. https://catalog.data.gov/dataset/genbank
Explore at:
Dataset updated
Jul 26, 2023
Dataset provided by
National Institutes of Health (NIH)
Description
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information.
NCBI Genome Survey Sequences Database
dknet.org
neuinfo.org
Updated Jun 24, 2025
Cite
(2025). NCBI Genome Survey Sequences Database [Dataset]. http://identifiers.org/RRID:SCR_002146
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002146
Dataset updated
Jun 24, 2025
Description
Database of unannotated, short, single-read, primarily genomic sequences from GenBank, including random survey sequences, clone-end sequences, and exon-trapped sequences. The GSS division of GenBank is similar to the EST division, except that most of the sequences are genomic in origin rather than cDNA (mRNA). Two classes (exon-trapped products and gene-trapped products) may be derived via a cDNA intermediate. Take care when analyzing sequences from either of these classes, because a splicing event could have occurred and the sequence in the record may be interrupted when compared to genomic sequence. The GSS division contains (but is not limited to) the following types of data:
* random single-pass read genome survey sequences
* cosmid/BAC/YAC end sequences
* exon-trapped genomic sequences
* Alu PCR sequences
* transposon-tagged sequences
Although dbGSS sequences are incorporated into the GSS division of GenBank, annotation in dbGSS is more comprehensive and includes detailed information about the contributors, experimental conditions, and genetic map locations.
GenBank
dknet.org
Updated Nov 10, 2024
Cite
(2024). GenBank [Dataset]. http://identifiers.org/RRID:SCR_002760
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002760
Dataset updated
Nov 10, 2024
Description
NIH genetic sequence database that provides an annotated collection of all publicly available DNA sequences for almost 280,000 formally described species (Jan 2014). These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. GenBank is part of the International Nucleotide Sequence Database Collaboration, and daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP.
GenBank Database
cmr.earthdata.nasa.gov
Updated Apr 20, 2017
Cite
(2017). GenBank Database [Dataset]. https://cmr.earthdata.nasa.gov/search/concepts/C1214138025-SCIOPS.html
Explore at:
Dataset updated
Apr 20, 2017
Time period covered
Jan 1, 1970 - Present
Description
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank (at NCBI), together with the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) comprise the International Nucleotide Sequence Database Collaboration. These three organizations exchange data on a daily basis.

GenBank grows at an exponential rate, with the number of nucleotide bases doubling approximately every 14 months. Currently, GenBank contains more than 13 billion bases from over 100,000 species.
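
As a rough illustration of what a 14-month doubling time implies, the sketch below projects a base count forward under steady exponential growth. The function name and the assumption that the rate stays constant are ours, not part of the dataset description; the starting figure of 13 billion bases is the one quoted above.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 14.0) -> float:
    """Project a base count forward, assuming a constant doubling time.

    This is a back-of-the-envelope model: real growth rates change over time.
    """
    return current_bases * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    start = 13e9  # "more than 13 billion bases", as quoted in the description
    for months in (14, 28, 42):
        print(f"after {months} months: {projected_bases(start, months):.2e} bases")
```

Under this model the count doubles after 14 months and quadruples after 28, which is all the description's growth claim amounts to arithmetically.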
Human Gene and Protein Database (HGPD)
scicrunch.org
neuinfo.org
Updated Nov 23, 2008
Cite
(2008). Human Gene and Protein Database (HGPD) [Dataset]. http://identifiers.org/RRID:SCR_002889
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002889
Dataset updated
Nov 23, 2008
Description
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on January 4, 2023. The Human Gene and Protein Database presents SDS-PAGE patterns and other information on human genes and proteins. The HGPD was constructed from full-length cDNAs. For conversion to Gateway entry clones, we first determined an open reading frame (ORF) region in each cDNA meeting the criteria. Those ORF regions were PCR-amplified utilizing selected resource cDNAs as templates. All the details of the construction and utilization of entry clones will be published elsewhere. Amino acid and nucleotide sequences of an ORF for each cDNA and sequence differences of Gateway entry clones from source cDNAs are presented in the GW: Gateway Summary window. Utilizing those clones with a very efficient cell-free protein synthesis system featuring wheat germ, we have produced a large number of human proteins in vitro. Expressed proteins were detected in almost all cases. Proteins in both total and supernatant fractions are shown in the PE: Protein Expression window. In addition, we have also successfully expressed proteins in HeLa cells and determined subcellular localizations of human proteins. These biological data are presented on the frame of cDNA clusters in the Human Gene and Protein Database. To build the basic frame of HGPD, sequences of FLJ full-length cDNAs and others deposited in public databases (Human ESTs, RefSeq, Ensembl, MGC, etc.) are assembled onto the genome sequences (NCBI Build 35 (UCSC hg17)). The majority of analysis data for cDNA sequences in HGPD are shared with the FLJ Human cDNA Database (http://flj.hinv.jp/) constructed as a human cDNA sequence analysis database focusing on mRNA varieties caused by variations in transcription start site (TSS) and splicing.
Data from: SoyBase and the Soybean Breeder's Toolbox
agdatacommons.nal.usda.gov
gimi9.com
Updated Feb 8, 2024
Cite
David M. Grant (2024). SoyBase and the Soybean Breeder's Toolbox [Dataset]. http://doi.org/10.15482/USDA.ADC/1212265
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.15482/USDA.ADC/1212265
Dataset updated
Feb 8, 2024
Dataset provided by
Ag Data Commons
Authors
David M. Grant
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
SoyBase is a repository for genetics, genomics and related data resources for soybean. It contains current genetic, physical and genomic sequence maps integrated with qualitative and quantitative traits. The SoyBase database was established in the 1990s as the USDA Soybean Genetics Database. Originally, it contained only genetic information about soybeans, such as genetic maps and information about the Mendelian genetics of soybean. In time, SoyBase was expanded to include molecular data regarding soybean genes and sequences as they became available. In 2010, the soybean genome sequence was published, and it and supporting gene sequences have been integrated into the SoyBase sequence browser. SoyBase genetic maps were used in the assembly of both the Williams 82 2010 assembly (Wm82.a1.v1) and the newest genome assembly (Wm82.a2.v1). SoyBase also incorporates information about mutant and other soybean genetic stocks and serves as a contact point for ordering strains from those populations. As association analyses continue due to various re-sequencing efforts, SoyBase will also incorporate those data into the soybean genome browser as they become available. Gene expression patterns are also available at SoyBase through the SoyBase expression pages and the Soybean Gene Atlas. Other expression/transcriptome/methylomic data sets also have been and continue to be incorporated into the SoyBase genome browser.

Project No: 3625-21000-062-00D
Accession No: 0425040

Resources in this dataset:
Resource Title: SoyBase, the USDA-ARS soybean genetics and genomics database web site. File Name: Web Page, url: https://soybase.org
The results of whole genome sequence database (the TrueBacTM ID-Genome...
plos.figshare.com
Updated Jun 1, 2023
Cite
Oh Joo Kweon; Yong Kwan Lim; Hye Ryoun Kim; Tae-Hyoung Kim; Sung-min Ha; Mi-Kyung Lee (2023). The results of whole genome sequence database (the TrueBacTM ID-Genome system) matching for the novel Cupriavidus species. [Dataset]. http://doi.org/10.1371/journal.pone.0232850.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0232850.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Oh Joo Kweon; Yong Kwan Lim; Hye Ryoun Kim; Tae-Hyoung Kim; Sung-min Ha; Mi-Kyung Lee
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The results of whole genome sequence database (the TrueBacTM ID-Genome system) matching for the novel Cupriavidus species.
Gene database for Pseudomonas aeruginosa PAO1
figshare.com
Updated Nov 8, 2019
Cite
Wenfa Ng (2019). Gene database for Pseudomonas aeruginosa PAO1 [Dataset]. http://doi.org/10.6084/m9.figshare.10271318.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.10271318.v1
Dataset updated
Nov 8, 2019
Dataset provided by
figshare
Authors
Wenfa Ng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Pseudomonas aeruginosa is a common soil bacterium with a versatile metabolism capable of degrading many organic compounds. More importantly, P. aeruginosa is recognized as an important opportunistic pathogen of humans. This contribution describes a gene database comprising the names and sequences of all genes in Pseudomonas aeruginosa PAO1 (GenBank accession number, NCBI Reference Sequence: NC_002516.2). The annotated genome file was downloaded from GenBank and parsed by in-house MATLAB software to yield a database of gene names and gene sequences. Overall, the gene database of P. aeruginosa PAO1 should find use in applications investigating the multi-faceted biology and biotechnology applications of this Gram-negative bacterium. Note that many genes in the species remain unannotated and lack a three-letter abbreviated gene name.
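
The authors' parser is in-house MATLAB software and is not shown here. As a hedged sketch of the same idea in Python, the code below extracts gene names and sequences from a GenBank-style flat file. It assumes only simple `start..end` and `complement(start..end)` locations, and it falls back to `/locus_tag` when a gene has no `/gene` name; the toy record, its coordinates, and its gene names are invented for illustration.

```python
import re

# Toy GenBank-style record: names, coordinates and bases are made up.
TOY_RECORD = """\
LOCUS       TOY000001                 30 bp    DNA     linear   BCT 01-JAN-2000
FEATURES             Location/Qualifiers
     gene            1..9
                     /gene="abcA"
     gene            complement(13..21)
                     /gene="xyzB"
     gene            25..30
                     /locus_tag="TOY_0003"
ORIGIN
        1 atggcctaaa ccttacatgg tcatggatcc
//
"""

GENE_RE = re.compile(r"^ {5}gene\s+(complement\()?(\d+)\.\.(\d+)\)?\s*$")
FEATURE_RE = re.compile(r"^ {5}\S")  # any feature key line (gene, CDS, ...)
QUAL_RE = re.compile(r'^\s+/(gene|locus_tag)="([^"]+)"')
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_genes(text: str) -> dict:
    """Map each gene name (or locus tag) to its sequence.

    Simplification: joined or fuzzy locations are not handled.
    """
    lines = text.splitlines()
    origin = next(i for i, line in enumerate(lines) if line.startswith("ORIGIN"))

    # Rebuild the full sequence from the ORIGIN block (drop numbers and spaces).
    sequence = ""
    for line in lines[origin + 1:]:
        if line.startswith("//"):
            break
        sequence += re.sub(r"[^a-zA-Z]", "", line)

    genes = {}
    pending = None  # (is_complement, start, end) of a gene still awaiting a name
    for line in lines[:origin]:
        match = GENE_RE.match(line)
        if match:
            pending = (match.group(1) is not None,
                       int(match.group(2)), int(match.group(3)))
            continue
        if FEATURE_RE.match(line):
            pending = None  # a different feature started; drop the unnamed gene
            continue
        qualifier = QUAL_RE.match(line)
        if qualifier and pending:
            is_complement, start, end = pending
            gene_seq = sequence[start - 1:end]  # GenBank coordinates are 1-based
            if is_complement:
                gene_seq = reverse_complement(gene_seq)
            genes[qualifier.group(2)] = gene_seq
            pending = None
    return genes


if __name__ == "__main__":
    for name, seq in parse_genes(TOY_RECORD).items():
        print(name, seq)
```

For a real genome file, a maintained parser such as Biopython's `SeqIO` handles joined locations and other cases this sketch skips.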
High Throughput Genomic Sequences Division
rrid.site
dknet.org
Updated Jun 26, 2025
Cite
(2025). High Throughput Genomic Sequences Division [Dataset]. http://identifiers.org/RRID:SCR_002150
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002150
Dataset updated
Jun 26, 2025
Description
Database of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences. It was created to accommodate a growing need to make unfinished genomic sequence data rapidly available to the scientific community in a coordinated effort among the International Nucleotide Sequence databases, DDBJ, EMBL, and GenBank. Sequences are prepared for submission by using NCBI's software tools Sequin or tbl2asn. Each center has an FTP directory into which new or updated sequence files are placed. Sequence data in this division are available for BLAST homology searches against either the htgs database or the month database, which includes all new submissions for the prior month. Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first-pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone, which together make up more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences, and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data are unfinished and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank.
ARS Microbial Genomic Sequence Database Server
s.cnmilf.com
datadiscoverystudio.org
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). ARS Microbial Genomic Sequence Database Server [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/ars-microbial-genomic-sequence-database-server-1b81c
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service: https://www.ars.usda.gov/
Description
This database server is supported in fulfilment of the research mission of the Mycotoxin Prevention and Applied Microbiology Research Unit at the National Center for Agricultural Utilization Research in Peoria, Illinois. The linked website provides access to gene sequence databases for various groups of microorganisms, such as Streptomyces species or Aspergillus species and their relatives, that are the product of ARS research programs. The sequence databases are organized in the BIGSdb (Bacterial Isolate Genomic Sequence Database) software package developed by Keith Jolley and Martin Maiden at Oxford University.

Resources in this dataset:
Resource Title: ARS Microbial Genomic Sequence Database Server. File Name: Web Page, url: http://199.133.98.43
Alternative Splicing Annotation Project II Database
rrid.site
neuinfo.org
Updated Jun 26, 2025
Cite
(2025). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_000322
Dataset updated
Jun 26, 2025
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis and splice variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and lists of tissue- and cancer (vs. normal)-specific alternatively spliced genes were recalculated and updated. The authors created novel orthologous exon and intron databases, with their splice variants, based on multiple alignment among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein-similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources for elucidating species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features, such as graph query and multi-genome alignment query. ASAP II can be searched by several different criteria, such as gene symbol, gene name and ID (UniGene, GenBank, etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rates for skipped exons are listed in separate tables.
Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.
NCBIFAM
ebi.ac.uk
Updated Dec 16, 2024
Cite
(2024). NCBIFAM [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Dec 16, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).
IMGT/GENE-DB
neuinfo.org
scicrunch.org
Updated Jan 29, 2022
Cite
(2022). IMGT/GENE-DB [Dataset]. http://identifiers.org/RRID:SCR_006964
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006964
Dataset updated
Jan 29, 2022
Description
IMGT/GENE-DB is the comprehensive IMGT genome database for immunoglobulin (IG) and T cell receptor (TR) genes from human and mouse, and, in development, from other vertebrates. IMGT/GENE-DB is the international reference for the IG and TR gene nomenclature and works in close collaboration with the HUGO Nomenclature Committee, Mouse Genome Database and genome committees for other species. IMGT/GENE-DB allows a search of IG and TR genes by locus, group and subgroup, which are CLASSIFICATION concepts of IMGT-ONTOLOGY. Shortcuts allow the retrieval of gene information by gene name or clone name. Direct links with configurable URLs give access to information usable by humans or programs. An IMGT/GENE-DB entry displays accurate gene data related to genome (gene localization), allelic polymorphisms (number of alleles, IMGT reference sequences, functionality, etc.), gene expression (known cDNAs), and proteins and structures (Protein displays, IMGT Colliers de Perles). It provides internal links to the IMGT sequence databases and to the IMGT Repertoire Web resources, and external links to genome and generalist sequence databases. IMGT/GENE-DB manages the IMGT reference directory used by the IMGT tools for IG and TR gene and allele comparison and assignment, and by the IMGT databases for gene data annotation.
The tpm metabarcoding DNA sequence database for taxonomic allocations using...
data.niaid.nih.gov
zenodo.org
Updated Oct 10, 2023
Cite
MARJOLET Laurence (2023). The tpm metabarcoding DNA sequence database for taxonomic allocations using RDP classifier implemented in DADA2. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4492210
Explore at:
Dataset updated
Oct 10, 2023
Dataset provided by
COURNOYER Benoît
POZZI Adrien C.M.
MARJOLET Laurence
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The tpm metabarcoding DNA sequence database for taxonomic allocations using the Mothur and DADA2 bio-informatic tools

A.C.M. Pozzi1, R. Bouchali1, L. Marjolet1, B. Cournoyer1

1 University of Lyon, UMR Ecologie Microbienne Lyon (LEM), CNRS 5557, INRAE 1418, Université Claude Bernard Lyon 1, VetAgro Sup, Research Team “Bacterial Opportunistic Pathogens and Environment” (BPOE), 69280 Marcy L’Etoile, France.

Corresponding authors:

A.C.M. Pozzi, UMR Microbial Ecology, CNRS 5557, CNRS 1418, VetAgro Sup, Main building, aisle 3, 1st floor, 69280 Marcy-L’Etoile, France. Tel. (+33) 478 87 39 47. Fax. (+33) 472 43 12 23. Email: adrien.meynier_pozzi@vetagro-sup.fr

B. Cournoyer, UMR Microbial Ecology, CNRS 5557, CNRS 1418, VetAgro Sup, Main building, aisle 3, 1st floor, 69280 Marcy-L’Etoile, France. Tel. (+33) 478 87 56 47. Fax. (+33) 472 43 12 23. Email: benoit.cournoyer@vetagro-sup.fr

Keywords:

BACtpm, Bacteria, tpm, thiopurine-S-methyltransferase EC:2.1.1.67, Nucleotide sequences, PCR products, Next-Generation-Sequencing, OTHU

Description:

The tpm gene codes for the thiopurine-S-methyltransferase (TPMT), an enzyme that can detoxify metalloid-containing oxyanions and xenobiotics (Cournoyer et al., 1998). Bacterial TPMTs radiated apart from human and animal TPMTs, and showed a vertical evolution in line with the 16S rRNA gene molecular phylogeny (Favre‐Bonté et al., 2005).

The tpm database, named BACtpm, was designed to apply the tpm-metabarcoding analytical scheme published in Aigle et al. (2021). It includes the full tpm identifiers, GenBank accession numbers, and complete taxonomic records (domain down to strain code) of tpm sequences about 215 nucleotides long from 840 unique taxa belonging to 139 genera.

Nucleotide sequences of tpm (range: 190-233 nucleotides) were either retrieved from public repositories (GenBank) or made available by B. Cournoyer’s research group. Colin et al. (2020) described the PCR and high-throughput Illumina MiSeq DNA sequencing procedures used to produce tpm sequences.

BACtpm v.2.0.1 (June 2021 release) is made available under the Creative Commons Attribution 4.0 International Licence. It can be used for the taxonomic allocation of tpm sequences down to the species and strain levels. Data are stored in CSV format, enabling future users to reformat them to fit their specific needs.

Acknowledgments:

We thank the worldwide community of microbiologists who made contributions to public databases in the past decades, and made possible the elaboration of the BACtpm database. We also thank the Field Observatory in Urban Hydrology (OTHU, www.graie.org/othu/), Labex IMU (Intelligence des Mondes Urbains), the Greater Lyon Urban Community, the School of Integrated Watershed Sciences H2O'LYON, and the Lyon Urban School for their support in the development of this database. This work was funded by the French national research program for environmental and occupational health of ANSES under the terms of project “Iouqmer” EST 2016/1/120, l'Agence Nationale de la Recherche through ANR-16-CE32-0006, ANR-17-CE04-0010, ANR-17-EURE-0018 and ANR-17-CONV-0004, by the MITI CNRS project named Urbamic, and the French water agency for the Rhône, Mediterranean and Corsica areas through the Desir and DOmic projects. We thank former BPOE lab members who contributed to start and expand the BACtpm database: Céline COLINON, Romain MARTI, Emilie BOURGEOIS, Sébastien RIBUN and Yannick COLIN.

References:

Aigle, A., Colin, Y., Bouchali, R., Bourgeois, E., Marti, R., Ribun, S., Marjolet, L., Pozzi, A.C.M., Misery, B., Colinon, C., Bernardin-Souibgui, C., Wiest, L., Blaha, D., Galia, W., Cournoyer, B., 2021. Spatio-temporal variations in chemical pollutants found among urban deposits match changes in thiopurine S-methyltransferase-harboring bacteria tracked by the tpm metabarcoding approach. Sci. Total Environ. 767, 145425. https://doi.org/10.1016/j.scitotenv.2021.145425

Colin, Y., Bouchali, R., Marjolet, L., Marti, R., Vautrin, F., Voisin, J., Bourgeois, E., Rodriguez-Nava, V., Blaha, D., Winiarski, T., Mermillod-Blondin, F., Cournoyer, B., 2020. Coalescence of bacterial groups originating from urban runoffs and artificial infiltration systems among aquifer microbiomes. Hydrol. Earth Syst. Sci. 24, 4257–4273. https://doi.org/10.5194/hess-24-4257-2020

Cournoyer, B., Watanabe, S., Vivian, A., 1998. A tellurite-resistance genetic determinant from phytopathogenic pseudomonads encodes a thiopurine methyltransferase: evidence of a widely-conserved family of methyltransferases. Biochim. Biophys. Acta BBA - Gene Struct. Expr. 1397, 161–168. https://doi.org/10.1016/S0167-4781(98)00020-7 (Title footnote: The International Collaboration (IC) accession number of the DNA sequence is L49178.1.)

Favre‐Bonté, S., Ranjard, L., Colinon, C., Prigent‐Combaret, C., Nazaret, S., Cournoyer, B., 2005. Freshwater selenium-methylating bacterial thiopurine methyltransferases: diversity and molecular phylogeny. Environ. Microbiol. 7, 153–164. https://doi.org/10.1111/j.1462-2920.2004.00670.x
NCBI BioProject
rrid.site
dknet.org
Updated Jun 23, 2025
Cite
(2025). NCBI BioProject [Dataset]. http://identifiers.org/RRID:SCR_004801
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_004801
Dataset updated
Jun 23, 2025
Description
Database of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. It is a searchable collection of complete and incomplete (in-progress) large-scale sequencing, assembly, annotation, and mapping projects for cellular organisms. Submissions are supported by a web-based Submission Portal. The database facilitates organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high-volume submissions to archival databases, ties together related data across multiple archives, and serves as a central portal by which to inform users of data availability. BioProject records link to corresponding data stored in archival repositories. The BioProject resource is a redesigned, expanded replacement of the NCBI Genome Project resource. The redesign adds tracking of several data elements, including more precise information about a project's scope, material, and objectives. Genome Project identifiers are retained in the BioProject as the ID value for a record, and an Accession number has been added. Database content is exchanged with other members of the International Nucleotide Sequence Database Collaboration (INSDC). BioProject is accessible via FTP.
COVID-19 Genome Sequence Dataset
registry.opendata.aws
catalog.midasnetwork.us
Updated Jul 9, 2020
Cite
National Library of Medicine (NLM) (2020). COVID-19 Genome Sequence Dataset [Dataset]. https://registry.opendata.aws/ncbi-covid-19/
Explore at:
Dataset updated
Jul 9, 2020
Dataset provided by
National Library of Medicine (NLM): http://nlm.nih.gov/
Description
This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six hours, with updates to the AWS ODP bucket occurring daily.
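
To show what a record in Variant Call Format looks like to a program, here is a minimal sketch that reads the eight fixed VCF columns from tab-separated lines. The `Variant` type, the `parse_vcf` function, and the toy lines (including the `DP` values and `TOY_FLAG`) are illustrative assumptions, not files or fields taken from this dataset.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: list
    info: dict


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping header lines that start with '#'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
        chrom, pos, _id, ref, alt, _qual, _filter, info = line.split("\t")[:8]
        info_dict = {}
        if info != ".":
            for item in info.split(";"):
                key, sep, value = item.partition("=")
                # Keys without "=" are flags; store them as True.
                info_dict[key] = value if sep else True
        yield Variant(chrom, int(pos), ref, alt.split(","), info_dict)


# Toy input: the values below are illustrative only.
TOY_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t23403\t.\tA\tG\t.\tPASS\tDP=100",
    "NC_045512.2\t14408\t.\tC\tT\t.\tPASS\tDP=80;TOY_FLAG",
]

if __name__ == "__main__":
    for variant in parse_vcf(TOY_VCF):
        print(variant)
```

This sketch ignores per-sample genotype columns and does no validation; a dedicated VCF library is the better choice for production work.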
Data from: DADA2 formatted 16S rRNA gene sequences for both bacteria &...
data.niaid.nih.gov
Updated Oct 24, 2024
Cite
Ali, Alishum (2024). DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2541238
Explore at:
Dataset updated
Oct 24, 2024
Dataset authored and provided by
Ali, Alishum
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This version keeps up with the improvements and the increase in 16S rRNA gene sequences (SSU) added in GTDB release 220. For statistics on the updates, see https://gtdb.ecogenomic.org/stats/r220.

There has been no change to the RDP-RefSeq reference database; please use previous versions.

If anyone has concerns about contamination in 16S rRNA gene sequences extracted from MAGs, I suggest contacting the GTDB curators directly, because that is outside my role with these resources, which are designed for DADA2 usage only.

Another concern raised was the orientation of the database sequences. To get past this problem, use the tryRC = TRUE argument of the assignTaxonomy command in DADA2, which also searches your ASVs in the reverse-complement orientation.

The bacterial and archaeal 16S rRNA gene sequence databases were collated from various sources and formatted for use with the "assignTaxonomy" command within the DADA2 pipeline. The data was converted to suit the DADA2 format by Alishum Ali.

Genome Taxonomy Database (GTDB): The new version of our DADA2-formatted GTDB reference sequences now contains 58102 bacterial and 3672 archaeal full 16S rRNA gene sequences. If you wonder why there are fewer species with 16S rRNA, that is because some metagenome-assembled genomes (MAGs) lack the 16S gene, so it cannot be extracted. The database was downloaded from https://data.ace.uq.edu.au/public/gtdb/data/releases/ on 24/10/2024. Please read the release notes and file descriptions.

The formatting to DADA2 was done using simple awk bash scripts. The script takes as input a FASTA file and a tab-delimited taxonomy file (slightly edited to remove special characters) and outputs a FASTA file with all 7 taxonomy ranks separated by ";", as required for DADA2 compatibility. Additionally, we have concatenated the unique GTDB sequence ID to the species entry (with the "." replaced by "_"). We see this as an important QC step to highlight the issues/confidence associated with short-read taxonomy assignment at the finer rank levels.
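
The original awk scripts are not included in this listing. The Python sketch below illustrates the transformation described above under explicit assumptions: the taxonomy file has two tab-separated columns (sequence ID, then 7 ranks separated by ";"), and the ID is attached to the species rank in parentheses. The function names, the toy records, and the exact way the ID is concatenated are hypothetical.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def format_for_dada2(fasta_lines, taxonomy_lines):
    """Return FASTA lines whose headers are the 7 ranks joined by ';'.

    Records without a complete 7-rank lineage are skipped.
    """
    taxonomy = {}
    for line in taxonomy_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        seq_id, lineage = line.split("\t", 1)
        taxonomy[seq_id] = lineage.split(";")

    out = []
    for seq_id, seq in read_fasta(fasta_lines):
        ranks = taxonomy.get(seq_id)
        if ranks is None or len(ranks) != 7:
            continue
        ranks = list(ranks)
        # Assumed convention: append the ID to the species rank, "." -> "_".
        ranks[6] = f"{ranks[6]}({seq_id.replace('.', '_')})"
        out.append(">" + ";".join(ranks))
        out.append(seq)
    return out


# Toy inputs: the identifiers and sequences are invented for illustration.
TOY_FASTA = [">TOY_000000001.1 example", "ACGTA", "CGTAC", ">TOY_000000002.1", "GGGG"]
TOY_TAXONOMY = [
    "TOY_000000001.1\tBacteria;Pseudomonadota;Gammaproteobacteria;"
    "Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa",
]

if __name__ == "__main__":
    print("\n".join(format_for_dada2(TOY_FASTA, TOY_TAXONOMY)))
```

The second toy record has no taxonomy entry, so it is dropped, which mirrors how sequences without a usable lineage cannot serve as training references.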

Also, this update includes two other files that you can use with the assignTaxonomy and addSpecies commands in DADA2.
d
Data from: BBGD: an Online Database for Blueberry Genomic Data.
datadiscoverystudio.org
Updated Feb 4, 2018
Cite
(2018). BBGD: an Online Database for Blueberry Genomic Data. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/acd7345e618947488264ca0d45bb2c76/html
Explore at:
Dataset updated
Feb 4, 2018
Description
BBGD (http://bioinformatics.towson.edu/BBGD/) was developed as a database for blueberry genomics. BBGD is both a sequence and gene expression database. It stores both EST and microarray data and allows scientists to correlate expression profiles with gene function. BBGD is a public online database. "Presently, the main focus of the database is the identification of genes in blueberry that are significantly induced or suppressed after low temperature exposure."

To gain a better understanding of changes in gene expression associated with cold acclimation in blueberry, the Rowland laboratory (USDA-ARS, Beltsville, MD) has undertaken a genomics approach based on the analysis of Expressed Sequence Tags (ESTs). Initially, two standard cDNA libraries were constructed using RNA from cold-acclimated and non-acclimated floral buds of the blueberry cultivar Bluecrop (Vaccinium corymbosum L.), and about 1200 5'-end ESTs were generated from each of the libraries. About 100 3'-end ESTs were generated from the cold-acclimated library as well.

The Blueberry EST database contains EST sequences from a number of blueberry libraries, including cold-acclimated and non-acclimated libraries. It also includes forward and reverse subtractive libraries.

You can query the sequence database by clone ID, accession number or gene (clone) name below. Alternatively, you can get a list (in tabular format) of all the clones in a particular library by clicking on the library name in the left-side navigation bar.

Attribution for photo: D2601-1 - Blueberry plant: Copyright free, public domain photo by Mark Ehlenfeldt
o
Sequence database for the single-copy, nuclear-encoded, core photosynthetic...
omicsdi.org
Updated Jul 10, 2023
Cite
(2023). Sequence database for the single-copy, nuclear-encoded, core photosynthetic gene psbO [Dataset]. https://www.omicsdi.org/dataset/biostudies/S-BSST659
Explore at:
Dataset updated
Jul 10, 2023
Variables measured
Unknown
Description
We have compiled a psbO sequence database for the use of psbO as a phytoplankton marker gene (see Pierella Karlusich et al. 2022, Molecular Ecology Resources, doi:10.1111/1755-0998.13592). psbO is nuclear-encoded and only present in photosynthetic organisms (both cyanobacteria and eukaryotic phototrophs), mainly in one copy per genome. The database contains >18,000 unique psbO sequences covering cyanobacteria, photosynthetic protists, macroalgae and land plants. It includes sequences retrieved from IMG, NCBI, MMETSP and other sequenced genomes and transcriptomes, as well as from the environmental sequence catalogs of Global Ocean Sampling and Tara Oceans.

The taxonomic assignment of environmental psbO sequences was determined by the placement of their translated sequences on a PsbO protein reference phylogeny. This reference phylogeny was built in the following way. Sequences were retrieved using HMMER version 3.2.1 (http://hmmer.org/) with the gathering threshold option, searching for the corresponding Pfam domain (MSP; PF01716) against the translated sequenced genomes and transcriptomes from the literature and from the PhycoCosm, MMETSP and IMG databases. The translated Pfam region of each sequence was retrieved, and the redundancy of the dataset was reduced using CD-HIT version 4.6.4 (W. Li & Godzik, 2006) at an 80% identity cut-off. These translated sequences were then aligned with MAFFT version 6 using the G-INS-I strategy (Katoh & Toh, 2008). The reference phylogenetic tree was generated with PhyML version 3.0 using the LG substitution model plus gamma-distributed rates and four substitution rate categories (Guindon et al., 2010). The starting tree was a BIONJ tree and the type of tree improvement was subtree pruning and regrafting. Branch support was calculated using the approximate likelihood ratio test (aLRT) with a Shimodaira–Hasegawa-like (SH-like) procedure. Contaminant sequences were carefully removed based on phylogenetic incongruence.

The corresponding curated final alignment was used as the reference. To parallelize the taxonomic annotation task, sets of 50 environmental sequences were translated and the PsbO-specific Pfam region (PF01716) was retrieved for the following analysis. First, the sequences were aligned against the reference alignment using the --add option of MAFFT version 6 with the G-INS-I strategy (Katoh and Toh 2008 Brief Bioinformatics 9:286-298). Second, the resulting alignment was used for building a phylogeny as described above. Finally, the sequences were classified according to their grouping in monophyletic branches of statistical support >0.7 with reference sequences of the same taxonomic group.
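As a rough illustration of the batching step mentioned above, the following Python sketch splits a collection of sequence identifiers into sets of 50, so that each set could be aligned and placed on the tree independently. The function name and the identifiers are hypothetical, and this is not the authors' pipeline code.

```python
from typing import Iterable, Iterator, List


def batches(items: Iterable[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items (the last may be shorter)."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly incomplete, batch.
        yield batch


# Hypothetical identifiers standing in for environmental psbO sequences.
env_ids = [f"env_seq_{i}" for i in range(120)]
for n, batch in enumerate(batches(env_ids), start=1):
    print(n, len(batch))
```

Each batch would then go through the alignment, tree-building and classification steps on its own, which is what makes the task easy to run in parallel.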
e
CATH-Gene3D
ebi.ac.uk
Updated Oct 21, 2020
Cite
(2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Oct 21, 2020
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
