100+ datasets found
  1. GenBank

    • scicrunch.org
    • dknet.org
    • +1more
    Cite
    GenBank [Dataset]. http://identifiers.org/RRID:SCR_002760
    Description

    NIH genetic sequence database that provides an annotated collection of all publicly available DNA sequences for almost 280,000 formally described species (Jan 2014). These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. GenBank is part of the International Nucleotide Sequence Database Collaboration, and daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP.
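
    As a rough illustration of programmatic Entrez access, the sketch below builds an NCBI E-utilities EFetch URL for a single nucleotide record. The helper name `genbank_efetch_url` is hypothetical, the accession U49845 is used only as an example, and the snippet only constructs the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities EFetch endpoint.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Build an EFetch URL for one record in the nucleotide (nuccore) database.

    rettype "gb" requests the GenBank flat-file format; "fasta" requests FASTA.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


# Example accession, chosen only for illustration.
url = genbank_efetch_url("U49845")
print(url)
```

    The resulting URL can be passed to any HTTP client; NCBI's usage guidelines on request rates and API keys apply to real requests.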

  2. High Throughput Genomic Sequences Division

    • rrid.site
    • dknet.org
    Updated Feb 12, 2025
    Cite
    (2025). High Throughput Genomic Sequences Division [Dataset]. http://identifiers.org/RRID:SCR_002150
    Dataset updated
    Feb 12, 2025
    Description

    Database of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences. It was created to accommodate a growing need to make unfinished genomic sequence data rapidly available to the scientific community in a coordinated effort among the International Nucleotide Sequence databases, DDBJ, EMBL, and GenBank. Sequences are prepared for submission by using NCBI's software tools Sequin or tbl2asn. Each center has an FTP directory into which new or updated sequence files are placed. Sequence data in this division are available for BLAST homology searches against either the htgs database or the month database, which includes all new submissions for the prior month. Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first-pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone, which together make up more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences, and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data are unfinished and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of an HTG record remains in GenBank.

  3. 3D-Genomics Database

    • neuinfo.org
    • scicrunch.org
    Updated Mar 9, 2025
    Cite
    (2025). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430/resolver?q=&i=rrid
    Dataset updated
    Mar 9, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies."

    The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG.

    The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C.briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E.cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G.theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N.crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P.falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome.

  4. 9MM Gallus gallus protein BLAST (tabular).

    • figshare.com
    xlsx
    Updated Jun 29, 2023
    Cite
    Matthys G. Potgieter; Andrew J. M. Nel; Suereta Fortuin; Shaun Garnett; Jerome M. Wendoh; David L. Tabb; Nicola J. Mulder; Jonathan M. Blackburn (2023). 9MM Gallus gallus protein BLAST (tabular). [Dataset]. http://doi.org/10.1371/journal.pcbi.1011163.s014
    Available download formats: xlsx
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Matthys G. Potgieter; Andrew J. M. Nel; Suereta Fortuin; Shaun Garnett; Jerome M. Wendoh; David L. Tabb; Nicola J. Mulder; Jonathan M. Blackburn
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background:
    Microbiome research is providing important new insights into the metabolic interactions of complex microbial ecosystems involved in fields as diverse as the pathogenesis of human diseases, agriculture and climate change. Poor correlations typically observed between RNA and protein expression datasets make it hard to accurately infer microbial protein synthesis from metagenomic data. Additionally, mass spectrometry-based metaproteomic analyses typically rely on focused search sequence databases based on prior knowledge for protein identification that may not represent all the proteins present in a set of samples. Metagenomic 16S rRNA sequencing only targets the bacterial component, while whole genome sequencing is at best an indirect measure of expressed proteomes. Here we describe a novel approach, MetaNovo, that combines existing open-source software tools to perform scalable de novo sequence tag matching with a novel algorithm for probabilistic optimization of the entire UniProt knowledgebase to create tailored sequence databases for target-decoy searches directly at the proteome level, enabling metaproteomic analyses without prior expectation of sample composition or metagenomic data generation and compatible with standard downstream analysis pipelines.

    Results:
    We compared MetaNovo to published results from the MetaPro-IQ pipeline on 8 human mucosal-luminal interface samples, with comparable numbers of peptide and protein identifications, many shared peptide sequences and a similar bacterial taxonomic distribution compared to that found using a matched metagenome sequence database, but simultaneously identified many more non-bacterial peptides than the previous approaches. MetaNovo was also benchmarked on samples of known microbial composition against matched metagenomic and whole genomic sequence database workflows, yielding many more MS/MS identifications for the expected taxa, with improved taxonomic representation, while also highlighting previously described genome sequencing quality concerns for one of the organisms, and identifying an experimental sample contaminant without prior expectation.

    Conclusions:
    By estimating taxonomic and peptide level information directly on microbiome samples from tandem mass spectrometry data, MetaNovo enables the simultaneous identification of peptides from all domains of life in metaproteome samples, bypassing the need for curated sequence databases to search. We show that the MetaNovo approach to mass spectrometry metaproteomics is more accurate than current gold standard approaches of tailored or matched genomic sequence database searches, can identify sample contaminants without prior expectation and yields insights into previously unidentified metaproteomic signals, building on the potential for complex mass spectrometry metaproteomic data to speak for itself.

  5. European Nucleotide Archive (ENA)

    • rrid.site
    • scicrunch.org
    • +2more
    Updated Feb 9, 2025
    Cite
    (2025). European Nucleotide Archive (ENA) [Dataset]. http://identifiers.org/RRID:SCR_006515
    Dataset updated
    Feb 9, 2025
    Description

    Public archive providing a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. All submitted data, once public, will be exchanged with the NCBI and DDBJ as part of the INSDC data exchange agreement. The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation). Data arrive at ENA from a variety of sources including submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centers and routine and comprehensive exchange with its partners in the International Nucleotide Sequence Database Collaboration (INSDC). Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and mandatory step in the dissemination of research findings to the scientific community. ENA works with publishers of scientific literature and funding bodies to ensure compliance with these principles and to provide optimal submission systems and data access tools that work seamlessly with the published literature. ENA is made up of a number of distinct databases that include the EMBL Nucleotide Sequence Database (EMBL-Bank), the newly established Sequence Read Archive (SRA) and the Trace Archive. The main tool for downloading ENA data is the ENA Browser, through which all ENA data are available and which offers REST URLs for easy programmatic use. Note: EMBL Nucleotide Sequence Database (EMBL-Bank) is entirely included within this resource.
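
    To make the REST-style access mentioned above more concrete, here is a minimal sketch that composes an ENA Browser record URL from an accession and a format. The base URL and the `/{format}/{accession}` layout reflect the ENA Browser API as commonly documented but should be treated as assumptions to verify against the current ENA documentation; the function name `ena_record_url` and the example accession are hypothetical. No network request is made.

```python
from urllib.parse import quote

# Assumed base URL of the ENA Browser API; verify before relying on it.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

# Record formats assumed to be served by the API.
SUPPORTED_FORMATS = {"fasta", "embl", "xml"}


def ena_record_url(accession: str, fmt: str = "fasta") -> str:
    """Compose a REST URL for one ENA record in the requested format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Percent-encode the accession so it is safe as a single path segment.
    return f"{ENA_BROWSER_API}/{fmt}/{quote(accession, safe='')}"


# Example accession, used purely for illustration.
ena_url = ena_record_url("A00145", "fasta")
print(ena_url)
```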

  6. ARS Microbial Genomic Sequence Database Server

    • catalog.data.gov
    • datadiscoverystudio.org
    • +1more
    Updated Mar 30, 2024
    Cite
    Agricultural Research Service (2024). ARS Microbial Genomic Sequence Database Server [Dataset]. https://catalog.data.gov/dataset/ars-microbial-genomic-sequence-database-server-1b81c
    Dataset updated
    Mar 30, 2024
    Dataset provided by
    Agricultural Research Service https://www.ars.usda.gov/
    Description

    This database server is supported in fulfilment of the research mission of the Mycotoxin Prevention and Applied Microbiology Research Unit at the National Center for Agricultural Utilization Research in Peoria, Illinois. The linked website provides access to gene sequence databases for various groups of microorganisms, such as Streptomyces species or Aspergillus species and their relatives, that are the product of ARS research programs. The sequence databases are organized in the BIGSdb (Bacterial Isolate Genomic Sequence Database) software package developed by Keith Jolley and Martin Maiden at Oxford University.

    Resources in this dataset: Resource Title: ARS Microbial Genomic Sequence Database Server. File Name: Web Page, url: http://199.133.98.43

  7. Bacterial strain panel used in this study.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Erin P. Price; Derek S. Sarovich; Jessica R. Webb; Jennifer L. Ginther; Mark Mayo; James M. Cook; Meagan L. Seymour; Mirjam Kaestli; Vanessa Theobald; Carina M. Hall; Joseph D. Busch; Jeffrey T. Foster; Paul Keim; David M. Wagner; Apichai Tuanyok; Talima Pearson; Bart J. Currie (2023). Bacterial strain panel used in this study. [Dataset]. http://doi.org/10.1371/journal.pone.0071647.t001
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Erin P. Price; Derek S. Sarovich; Jessica R. Webb; Jennifer L. Ginther; Mark Mayo; James M. Cook; Meagan L. Seymour; Mirjam Kaestli; Vanessa Theobald; Carina M. Hall; Joseph D. Busch; Jeffrey T. Foster; Paul Keim; David M. Wagner; Apichai Tuanyok; Talima Pearson; Bart J. Currie
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    a Numbers in parentheses indicate Thai strains; all other strains were isolated in the Northern Territory, Australia.
    b Species assignment based on [28].

  8. NCBI Genome Survey Sequences Database

    • neuinfo.org
    • scicrunch.org
    • +1more
    Updated Sep 15, 2024
    Cite
    (2025). NCBI Genome Survey Sequences Database [Dataset]. http://identifiers.org/RRID:SCR_015063
    Dataset updated
    Sep 15, 2024
    Description

    Database of sequences that are genomic in origin rather than cDNA (mRNA). Two classes (exon trapped products and gene trapped products) may be derived via a cDNA intermediate.

  9. T4-like genome database

    • dknet.org
    Updated Jan 29, 2022
    Cite
    (2022). T4-like genome database [Dataset]. http://identifiers.org/RRID:SCR_005367
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 22, 2016. A database of information on bacterial phages. It contains multiple phage genomes, which users can search with BLAST and MegaBLAST, and also hosts a Phage Forum in which users can discuss phage data. Interactive browsing of completed phage genomes is available using the program. The browser allows users to scan the genome for particular features and to download sequence information plus analyses of those features. Views of the genome are generated showing named genes, BLAST similarities to other phages, predicted tRNAs, and other sequence features.

  10. Human Gene and Protein Database (HGPD)

    • neuinfo.org
    • scicrunch.org
    Updated Nov 23, 2008
    Cite
    (2008). Human Gene and Protein Database (HGPD) [Dataset]. http://identifiers.org/RRID:SCR_002889
    Dataset updated
    Nov 23, 2008
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on January 4, 2023. The Human Gene and Protein Database presents SDS-PAGE patterns and other information on human genes and proteins. The HGPD was constructed from full-length cDNAs. For conversion to Gateway entry clones, we first determined an open reading frame (ORF) region in each cDNA meeting the criteria. Those ORF regions were PCR-amplified utilizing selected resource cDNAs as templates. All the details of the construction and utilization of entry clones will be published elsewhere. Amino acid and nucleotide sequences of an ORF for each cDNA and sequence differences of Gateway entry clones from source cDNAs are presented in the GW: Gateway Summary window. Utilizing those clones with a very efficient cell-free protein synthesis system featuring wheat germ, we have produced a large number of human proteins in vitro. Expressed proteins were detected in almost all cases. Proteins in both total and supernatant fractions are shown in the PE: Protein Expression window. In addition, we have also successfully expressed proteins in HeLa cells and determined subcellular localizations of human proteins. These biological data are presented on the frame of cDNA clusters in the Human Gene and Protein Database. To build the basic frame of HGPD, sequences of FLJ full-length cDNAs and others deposited in public databases (Human ESTs, RefSeq, Ensembl, MGC, etc.) are assembled onto the genome sequences (NCBI Build 35 (UCSC hg17)). The majority of analysis data for cDNA sequences in HGPD are shared with the FLJ Human cDNA Database (http://flj.hinv.jp/), constructed as a human cDNA sequence analysis database focusing on mRNA varieties caused by variations in transcription start site (TSS) and splicing.

  11. Data from: Cacao Genome Database

    • agdatacommons.nal.usda.gov
    • datasets.ai
    • +1more
    bin
    Updated Feb 9, 2024
    Cite
    Raymond J. Schnell; Alan W. Meerow; Tomas Ayala-Silva; Osman Gutierrez; David Kuhn; Cecile L. Tondo; Juan Carlos Motamayor (2024). Cacao Genome Database [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Cacao_Genome_Database/24852516
    Available download formats: bin
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Agricultural Research Service https://www.ars.usda.gov/
    Authors
    Raymond J. Schnell; Alan W. Meerow; Tomas Ayala-Silva; Osman Gutierrez; David Kuhn; Cecile L. Tondo; Juan Carlos Motamayor
    License

    U.S. Government Works https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Not only is cacao the basic ingredient in the world’s favorite confection, chocolate, but it provides a livelihood for over 6.5 million farmers in Africa, South America and Asia and ranks as one of the top ten agricultural commodities in the world. Historically, cocoa production has been plagued by serious losses due to pests and diseases. The release of the cacao genome sequence will provide researchers with access to the latest genomic tools, enabling more efficient research and accelerating the breeding process, thereby expediting the release of superior cacao cultivars. The sequenced genotype, Matina 1-6, is representative of the genetic background most commonly found in the cacao-producing countries, enabling results to be applied immediately and broadly to current commercial cultivars. Matina 1-6 is highly homozygous, which greatly reduces the complexity of the sequence assembly process. While the sequence provided is a preliminary release, it already covers 92% of the genome, with approximately 35,000 genes. We will continue to refine the assembly and annotation, working toward a complete finished sequence. Updates will be made available via the main project website.

    Resources in this dataset: Resource Title: Cacao Genome Database. File Name: Web Page, url: http://www.cacaogenomedb.org/

  12. Data from: SoyBase and the Soybean Breeder's Toolbox

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    • +1more
    bin
    Updated Feb 8, 2024
    Cite
    David M. Grant (2024). SoyBase and the Soybean Breeder's Toolbox [Dataset]. http://doi.org/10.15482/USDA.ADC/1212265
    Available download formats: bin
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Ag Data Commons
    Authors
    David M. Grant
    License

    U.S. Government Works https://www.usa.gov/government-works
    License information was derived automatically

    Description

    SoyBase is a repository for genetics, genomics and related data resources for soybean. It contains current genetic, physical and genomic sequence maps integrated with qualitative and quantitative traits. The SoyBase database was established in the 1990s as the USDA Soybean Genetics Database. Originally, it contained only genetic information about soybeans, such as genetic maps and information about the Mendelian genetics of soybean. In time, SoyBase was expanded to include molecular data regarding soybean genes and sequences as they became available. In 2010, the soybean genome sequence was published, and it and supporting gene sequences have been integrated into the SoyBase sequence browser. SoyBase genetic maps were used in the assembly of both the Williams 82 2010 assembly (Wm82.a1.v1) and the newest genome assembly (Wm82.a2.v1). SoyBase also incorporates information about mutant and other soybean genetic stocks and serves as a contact point for ordering strains from those populations. As association analyses continue due to various re-sequencing efforts, SoyBase will also incorporate those data into the soybean genome browser as they become available. Gene expression patterns are also available at SoyBase through the SoyBase expression pages and the Soybean Gene Atlas. Other expression/transcriptome/methylomic data sets also have been and continue to be incorporated into the SoyBase genome browser.

    Project No: 3625-21000-062-00D. Accession No: 0425040.

    Resources in this dataset: Resource Title: SoyBase, the USDA-ARS soybean genetics and genomics database web site. File Name: Web Page, url: https://soybase.org

  13. The results of whole genome sequence database (the TrueBacTM ID-Genome...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Oh Joo Kweon; Yong Kwan Lim; Hye Ryoun Kim; Tae-Hyoung Kim; Sung-min Ha; Mi-Kyung Lee (2023). The results of whole genome sequence database (the TrueBacTM ID-Genome system) matching for the novel Cupriavidus species. [Dataset]. http://doi.org/10.1371/journal.pone.0232850.t001
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Oh Joo Kweon; Yong Kwan Lim; Hye Ryoun Kim; Tae-Hyoung Kim; Sung-min Ha; Mi-Kyung Lee
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The results of whole genome sequence database (the TrueBacTM ID-Genome system) matching for the novel Cupriavidus species.

  14. Alternative Splicing Annotation Project II Database

    • dknet.org
    • scicrunch.org
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis along with the corresponding splice variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and lists of tissue and cancer (normal) specific alternatively spliced genes were re-calculated and updated. The authors created novel orthologous exon and intron databases, with their splice variants, based on multiple alignment among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources to elucidate species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features, such as graph queries and multi-genome alignment queries. ASAP II can be searched by several different criteria such as gene symbol, gene name and ID (UniGene, GenBank etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rate for skipped exons are listed in separate tables. Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. Tissue-specificity p-values are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.

  15. MARMICRODB database for taxonomic classification of (marine) metagenomes

    • zenodo.org
    application/gzip, bin +3
    Updated Mar 20, 2020
    Cite
    Shane L Hogle (2020). MARMICRODB database for taxonomic classification of (marine) metagenomes [Dataset]. http://doi.org/10.5281/zenodo.3520509
    Available download formats: bin, application/gzip, tsv, html, bz2
    Dataset updated
    Mar 20, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Shane L Hogle
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction:
    This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.

    Motivation:
    We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju’s sensitivity and classification accuracy depend on the composition of the reference database, and the highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from an environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.

    Results/Description:
    MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju-formatted Burrows-Wheeler index, translated genes, the custom taxonomy hierarchy, an interactive kronaplot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.

    Methods:
    The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes was performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using checkM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a checkM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 for further analysis and confirmed taxonomic assignment using BLAST matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with checkM. For Synechococcus we used the CheckM taxonomic-specific workflow with the genus Synechococcus. After the custom CheckM quality control, we excluded from downstream analysis any genome bins that had an estimated quality < 30, defined as %completeness – 5x %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22], excluded protein sequences shorter than 20 or longer than 20000 amino acids, removed non-standard amino acid residues, and condensed redundant protein sequences to a single representative sequence, to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23]. The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
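
    The filtering rules in the paragraph above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' scripts: the function names, the `parent` dictionary (a child-to-parent taxonomy map in which the root points to itself), and the choice to apply the length bounds before stripping non-standard residues are all assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Tuple

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LEN, MAX_LEN = 20, 20000   # length bounds stated in the text
MIN_QUALITY = 30.0             # bins with quality < 30 are excluded


def genome_quality(completeness: float, contamination: float) -> float:
    """Quality = %completeness - 5 x %contamination."""
    return completeness - 5.0 * contamination


def keep_genome(completeness: float, contamination: float) -> bool:
    return genome_quality(completeness, contamination) >= MIN_QUALITY


def clean_protein(seq: str) -> Optional[str]:
    """Apply the length bounds, then drop non-standard residues.

    The order of these two steps is an assumption taken from the
    order in which the text lists them.
    """
    seq = seq.upper()
    if not (MIN_LEN <= len(seq) <= MAX_LEN):
        return None
    return "".join(c for c in seq if c in STANDARD_AA)


def lineage(taxid: int, parent: Dict[int, int]) -> list:
    """Path from a taxon up to the root (the root is its own parent)."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(taxids: Iterable[int], parent: Dict[int, int]) -> int:
    """Deepest node shared by the lineages of all given taxa."""
    taxids = list(taxids)
    common = lineage(taxids[0], parent)
    for t in taxids[1:]:
        other = set(lineage(t, parent))
        common = [n for n in common if n in other]
    return common[0]


def condense(proteins: Iterable[Tuple[int, str]],
             parent: Dict[int, int]) -> Dict[str, int]:
    """Collapse identical sequences and label each with an LCA taxon."""
    owners: Dict[str, set] = {}
    for taxid, seq in proteins:
        cleaned = clean_protein(seq)
        if cleaned is not None:
            owners.setdefault(cleaned, set()).add(taxid)
    return {seq: lca(ids, parent) for seq, ids in owners.items()}
```

    Under these assumptions, a bin that is 80% complete and 10% contaminated scores exactly 30 and is retained, while a bin that is 50% complete and 5% contaminated scores 25 and is dropped.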

    The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high-quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes for the fixed reference phylogeny are estimated to be > 90% complete, and SAR11 genomes are estimated to be > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the hmmer software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. Each tree was then pruned to allow a maximum average distance to the closest leaf (ADCL) of 0.003, to reduce the phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL within the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree. We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
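
    The column-removal rule described above (drop columns present in fewer than 50% of taxa, or with no residue above 25% frequency) can be illustrated with a short Python sketch. It is not the GTDB-Tk implementation: the function names are hypothetical, and computing the residue frequency over all taxa (rather than over non-gap positions only) is an assumption.

```python
from collections import Counter
from typing import Dict

GAP_CHARS = set("-.")


def keep_column(column: str,
                min_occupancy: float = 0.5,
                min_consensus: float = 0.25) -> bool:
    """Return True if an alignment column passes both filters."""
    residues = [c for c in column if c not in GAP_CHARS]
    if not residues:
        return False
    # Fraction of taxa that have a residue (not a gap) in this column.
    if len(residues) / len(column) < min_occupancy:
        return False
    # Frequency of the most common residue, measured over all taxa
    # (assumption; the text does not state the denominator).
    top_count = Counter(residues).most_common(1)[0][1]
    return top_count / len(column) > min_consensus


def filter_alignment(aln: Dict[str, str]) -> Dict[str, str]:
    """Drop failing columns from a {taxon: aligned sequence} mapping."""
    names = list(aln)
    length = len(aln[names[0]])
    kept = [i for i in range(length)
            if keep_column("".join(aln[n][i] for n in names))]
    return {n: "".join(aln[n][i] for i in kept) for n in names}
```

    For a toy alignment of four taxa, a column that is all gaps fails the occupancy filter, and a column in which every residue is different (each at exactly 25%) fails the consensus filter because the frequency must be strictly greater than 25%.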

    Software/databases used:
    checkM v1.0.11 [16]
    HMMER v3.1b2 (http://hmmer.org/)
    prodigal v2.6.3 [22]
    trimAl v1.4.rev22 [24]
    AliView v1.18.1 [33] [34]
    Phyx v0.1 [35]
    RAxML v8.2.12 [36]
    Pplacer v1.1alpha [28]
    GTDB-Tk v0.1.3 [19]
    Kaiju v1.6.0 [34]
    GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
    NCBI Taxonomy (accessed 2018-07-02) [23]
    TIGRFAM v14.0 [37]
    PFAM v31.0 [38]

    Discussion/Caveats:
    MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans Expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user’s goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e., checking for contamination, dereplicating, building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes or reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified, as can the fasta file of protein sequences. However, the Kaiju index will need to be rebuilt, and the user will require a high

  16. Data from: Genetic diversity and spread dynamics of SARS-CoV-2 variants...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated May 31, 2024
    Cite
    Desire Mtetwa (2024). Genetic diversity and spread dynamics of SARS-CoV-2 variants present in African populations [Dataset]. http://doi.org/10.5061/dryad.1c59zw42d
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2024
    Dataset provided by
    Chinhoyi University of Technology
    Authors
    Desire Mtetwa
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The dynamics of coronavirus disease-19 (COVID-19) have been extensively researched in many settings around the world, but little is known about these patterns in Africa. 7540 complete nucleotide genomes from 51 African nations were obtained from the National Center for Biotechnology Information (NCBI) and Global Initiative on Sharing Influenza Data (GISAID) databases and analysed to examine the genetic diversity and spread dynamics of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) lineages circulating in Africa. Utilising a variety of clade and lineage nomenclature schemes, we looked at their diversity and used maximum parsimony inference methods to recreate their evolutionary divergence and history. According to this study, only 465 of the 2610 Pango lineages found to have existed in the world circulated in Africa after three years of the COVID-19 pandemic outbreak, with five different lineages dominating at various points during the outbreak. We identified South Africa, Kenya, and Nigeria as key sources of viral transmissions between Sub-Saharan African nations. These findings provide insight into the viral strains that are circulating in Africa and their evolutionary patterns.

    Methods:

    Dataset mining and workflow: SARS-CoV-2 genome sequences collected from Africa were obtained from the NCBI database and the GISAID database on February 26, 2023. 24415 African sequences were retrieved from both databases to examine the number of lineages circulating within Africa. The two databases had only 8044 complete genome sequences combined from Africa, and these sequences, excluding those flagged as low coverage by NextClade, were retrieved to determine spread dynamics. 5908 sequences from 23 African countries were available in the NCBI database and 2137 sequences from 41 African countries in the GISAID database. The sequences were aligned using the online version of the MAFFT multiple sequence alignment tool, with Wuhan-Hu-1 (MN908947.3) as the reference sequence, and sequences with more than 5.0% ambiguous letters were removed. Duplicates were removed using the goalign dedup software, and only high-quality complete African sequences remained (n=7540).

    Phylogenetic reconstruction: Using IQ-TREE multicore software version v1.6.12 and NextClade, phylogeny reconstruction on the dataset was performed numerous times.

    Lineage classification: PANGOLin, a web application, was used to classify sequences into their lineages. The objective was to determine the SARS-CoV-2 lineages circulating in Africa that are most important from an epidemiological perspective, as well as the lineage dynamics within and across the African continent, because this naming system integrates genetic and geographic data concerning SARS-CoV-2 dynamics.

    Phylogeographic reconstruction: Variants of concern (VOC), variants of interest (VOI), and variants under monitoring (VUM) were designated based on the WHO framework as of 20 January 2022. We included one lineage, namely A.23.1, and labelled it as VOI for the purposes of this analysis. This lineage was included because it demonstrated the continued evolution of African lineages into potentially more transmissible variants. VOI, VOC, and VUM that emerged on the African continent were marked. These were A.23.1 (VOI), B.1.351 and B.1.1.529 (VOC), and B.1.640 and B.1.525 (VUM). Genome sequences of these five lineages were extracted from the NCBI database for phylogeographic reconstruction. A similar approach to that described above (including alignment using online MAFFT) was employed. Phylogeographic reconstruction for all variants circulating in Africa and all VOI, VOC, and VUM was conducted using PASTML.
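
    The ambiguity filter and duplicate removal described in the methods above can be sketched as follows. This is a simplified Python stand-in, not the goalign tool or the authors' workflow: the function names are hypothetical, "ambiguous" is assumed to mean any character other than A, C, G, or T, and alignment gaps are assumed to be ignored when computing the fraction.

```python
from typing import Dict


def ambiguous_fraction(seq: str) -> float:
    """Fraction of non-gap characters that are not A, C, G, or T."""
    seq = seq.upper().replace("-", "")
    if not seq:
        return 1.0
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return (len(seq) - unambiguous) / len(seq)


def filter_and_dedup(records: Dict[str, str],
                     max_ambiguous: float = 0.05) -> Dict[str, str]:
    """Drop sequences above the ambiguity threshold, then keep only
    the first record for each distinct sequence."""
    seen = set()
    kept: Dict[str, str] = {}
    for name, seq in records.items():
        if ambiguous_fraction(seq) > max_ambiguous:
            continue  # more than 5.0% ambiguous letters
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        kept[name] = seq
    return kept
```

    With this sketch, a sequence with exactly 5% ambiguous letters is retained, because the methods state that only sequences with more than 5.0% are removed.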

  17. d

    GenBank

    • catalog.data.gov
    • healthdata.gov
    • +3more
    Updated Jul 26, 2023
    + more versions
    Cite
    National Institutes of Health (NIH) (2023). GenBank [Dataset]. https://catalog.data.gov/dataset/genbank
    Explore at:
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    National Institutes of Health (NIH)
    Description

    GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is designed to provide and encourage access within the scientific community to the most up to date and comprehensive DNA sequence information.

  18. d

    MBGD - Microbial Genome Database

    • dknet.org
    • scicrunch.org
    Updated Jan 29, 2022
    Cite
    (2022). MBGD - Microbial Genome Database [Dataset]. http://identifiers.org/RRID:SCR_012824
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    MBGD is a database for comparative analysis of completely sequenced microbial genomes, the number of which is now growing rapidly. The aim of MBGD is to facilitate comparative genomics from various points of view, such as ortholog identification, paralog clustering, motif analysis, and gene order comparison. The core function of MBGD is to create an orthologous or homologous gene cluster table. For this purpose, similarities between all genes are precomputed and stored in the database, in addition to gene annotations such as function categories assigned by the original authors and motifs found in the translated sequences. Using these homology data, MBGD dynamically creates an orthologous gene cluster table. Users can change the set of organisms or the cutoff parameters to create their own orthologous grouping. Based on this cluster table, users can further analyze multiple genomes from various points of view with functions such as global map comparison, local map comparison, multiple sequence alignment, and phylogenetic tree construction.

  19. r

    DNA DataBank of Japan (DDBJ)

    • rrid.site
    • dknet.org
    • +1more
    Updated Feb 25, 2025
    Cite
    (2025). DNA DataBank of Japan (DDBJ) [Dataset]. http://identifiers.org/RRID:SCR_002359
    Explore at:
    Dataset updated
    Feb 25, 2025
    Description

    Maintains and provides archival, retrieval, and analytical resources for biological information. The central DDBJ resource consists of public, open-access nucleotide sequence databases including raw sequence reads, assembly information, and functional annotation. Database content is exchanged with EBI and NCBI within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). In 2011, DDBJ launched two new resources: DDBJ Omics Archive (DOR) and BioProject. DOR is an archival database of functional genomics data generated by microarray and highly parallel new generation sequencers. Data are exchanged between ArrayExpress at EBI and DOR in the common MAGE-TAB format. BioProject provides an organizational framework to access metadata about research projects and data from projects that are deposited into different databases.

  20. r

    Chloroplast Genome Database

    • rrid.site
    • neuinfo.org
    • +1more
    Updated Jan 29, 2022
    Cite
    (2022). Chloroplast Genome Database [Dataset]. http://identifiers.org/RRID:SCR_013421
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    The Chloroplast Genome Database contains annotated chloroplast/plastid genomes from the NCBI Organelle Genomes section at NCBI. Users can search for genes by their annotated names, conduct flexible BLAST searches, download protein and nucleotide sequences extracted from a selected chloroplast genome, and browse the putative protein families (tribes) created using TribeMCL.

GenBank [Dataset]. http://identifiers.org/RRID:SCR_002760

GenBank

RRID:SCR_002760, nif-0000-02873, OMICS_01650, GenBank (RRID:SCR_002760), GB, Gen Bank, GenBank

Explore at:
50 scholarly articles cite this dataset (View in Google Scholar)
