100+ datasets found
  1. DBETH - Database for Bacterial ExoToxins for Humans

    • neuinfo.org
    • scicrunch.org
    • +2more
    Updated Jan 12, 2012
    Cite
    (2012). DBETH - Database for Bacterial ExoToxins for Humans [Dataset]. http://identifiers.org/RRID:SCR_005908
    Explore at:
    Dataset updated
    Jan 12, 2012
    Description

    The Database for Bacterial ExoToxins for Humans (DBETH) is a database of sequences, structures, interaction networks and analytical results for 229 exotoxins from 26 different human pathogenic bacterial genera. All toxins are classified into 24 different toxin classes. The aim of DBETH is to provide a comprehensive database for human pathogenic bacterial exotoxins. DBETH also provides a platform for users to identify potential exotoxin-like sequences through homology-based as well as non-homology-based methods. In the homology-based approach, users can identify potential exotoxin-like sequences either by running BLASTp against the toxin sequences or by running HMMER against toxin domains identified by DBETH from human pathogenic bacterial exotoxins. In the non-homology-based part, DBETH uses a machine learning approach to identify potential exotoxins (toxin prediction by a Support Vector Machine based approach).

  2. A tandem repeats database for bacterial genomes: application to the...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). A tandem repeats database for bacterial genomes: application to the genotyping of [Dataset]. https://catalog.data.gov/dataset/a-tandem-repeats-database-for-bacterial-genomes-application-to-the-genotyping-of
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background: Some pathogenic bacteria are genetically very homogeneous, making strain discrimination difficult. In the last few years, tandem repeats have been increasingly recognized as markers of choice for genotyping a number of pathogens. The rapid evolution of these structures appears to contribute to the phenotypic flexibility of pathogens. The availability of whole-genome sequences has opened the way to the systematic evaluation of tandem repeats diversity and application to epidemiological studies.

    Results: This report presents a database of tandem repeats from publicly available bacterial genomes which facilitates the identification and selection of tandem repeats. We illustrate the use of this database by the characterization of minisatellites from two important human pathogens, Yersinia pestis and Bacillus anthracis. In order to avoid simple sequence contingency loci which may be of limited value as epidemiological markers, and to provide genotyping tools amenable to ordinary agarose gel electrophoresis, only tandem repeats with repeat units at least 9 bp long were evaluated. Yersinia pestis contains 64 such minisatellites in which the unit is repeated at least 7 times. An additional collection of 12 loci with at least 6 units and a high internal conservation was also evaluated. Forty-nine are polymorphic among five Yersinia strains (twenty-five among three Y. pestis strains). Bacillus anthracis contains 30 comparable structures in which the unit is repeated at least 10 times. Half of these tandem repeats show polymorphism among the strains tested.

    Conclusions: Analysis of the currently available bacterial genome sequences classifies Bacillus anthracis and Yersinia pestis as having an average (approximately 30 per Mb) density of tandem repeat arrays longer than 100 bp when compared to the other bacterial genomes analysed to date. In both cases, testing a fraction of these sequences for polymorphism was sufficient to quickly develop a set of more than fifteen informative markers, some of which show a very high degree of polymorphism. In one instance, the polymorphism information content index reaches 0.82 with allele length covering a wide size range (600-1950 bp), and nine alleles resolved in the small number of independent Bacillus anthracis strains typed here.
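
The selection criteria described above (repeat units of at least 9 bp, repeated at least 7 times) can be illustrated with a minimal Python sketch. It is not the method used to build the database: it assumes perfect (exact) repeats only, whereas real minisatellites tolerate mismatches between units, and the function name, the `max_unit` cap, and the toy sequence are hypothetical choices made for illustration.

```python
import re

def find_tandem_repeats(seq, min_unit=9, min_copies=7, max_unit=60):
    """Return (start, unit, copies) for perfect tandem repeats in seq."""
    # Lazy unit match so the shortest qualifying unit is tried first;
    # the backreference requires at least min_copies adjacent copies in total.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # Skip units that are themselves built from a shorter motif
        # (e.g. "ACGACGACG"), since those are simple sequence repeats.
        if (unit + unit).find(unit, 1) != len(unit):
            continue
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

# Hypothetical toy sequence: a 9 bp unit repeated 8 times between short flanks.
toy = "TTT" + "ACGTTGCAA" * 8 + "GGG"
for start, unit, copies in find_tandem_repeats(toy):
    print(start, unit, copies)
```

A backtracking regular expression like this is only practical for short sequences; scanning whole genomes would call for a dedicated tandem repeat detection tool.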

  3. MARMICRODB database for taxonomic classification of (marine) metagenomes

    • zenodo.org
    application/gzip, bin +3
    Updated Mar 20, 2020
    Cite
    Shane L Hogle (2020). MARMICRODB database for taxonomic classification of (marine) metagenomes [Dataset]. http://doi.org/10.5281/zenodo.3520509
    Explore at:
    Available download formats: bin, application/gzip, tsv, html, bz2
    Dataset updated
    Mar 20, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shane L Hogle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction:
    This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.

    Motivation:
    We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju’s sensitivity and classification accuracy depend on the composition of the reference database, and the highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from an environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.

    Results/Description:
    MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju-formatted Burrows-Wheeler index, translated genes, the custom taxonomy hierarchy, an interactive kronaplot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.

    Methods:
    The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes was performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using CheckM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a CheckM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 for further analysis and confirmed taxonomic assignment using BLAST matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with CheckM. For Synechococcus we used the CheckM taxonomic-specific workflow with the genus Synechococcus. After the custom CheckM quality control, we excluded from downstream analysis any genome bins that had an estimated quality < 30, defined as %completeness - 5 × %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22], excluded protein sequences shorter than 20 or longer than 20000 amino acids, removed non-standard amino acid residues, and condensed redundant protein sequences to a single representative sequence to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23]. The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
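
The two numeric filters described above (the genome bin quality score and the protein length window) can be written out as a minimal Python sketch. The function names and the example values are hypothetical, and the sketch assumes that bins with a score of exactly 30 and proteins of exactly 20 or 20000 residues are kept, which is how the wording reads but is not confirmed by the source.

```python
def bin_quality(completeness, contamination):
    """Quality score from the text: %completeness - 5 x %contamination."""
    return completeness - 5.0 * contamination

def keep_bin(completeness, contamination, min_quality=30.0):
    # Bins with quality < 30 were excluded, so a score of exactly 30 is kept
    # (assumption about the boundary case).
    return bin_quality(completeness, contamination) >= min_quality

def keep_protein(sequence, min_len=20, max_len=20000):
    # Proteins shorter than 20 or longer than 20000 residues were excluded.
    return min_len <= len(sequence) <= max_len

# Hypothetical (completeness %, contamination %) estimates for three bins.
example_bins = {"bin_a": (90.0, 5.0), "bin_b": (50.0, 5.0), "bin_c": (40.0, 2.0)}
for name, (comp, cont) in example_bins.items():
    print(name, bin_quality(comp, cont), keep_bin(comp, cont))
```

Weighting contamination five times more heavily than completeness means a bin loses more from a few percent of foreign sequence than it gains from the same amount of additional coverage.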

    The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high-quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes for the fixed reference phylogeny are estimated to be > 90% complete, and SAR11 genomes are estimated to be > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the HMMER software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimAl [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to only allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce the phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL within the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree. We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
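
The column filter described above (drop columns present in fewer than 50% of taxa, or with no residue above 25% frequency) can be sketched in Python as follows. This is an illustration rather than the GTDB-Tk implementation: the function names are hypothetical, and the sketch assumes that residue frequency is computed among the non-gap characters of a column, which the text does not specify.

```python
from collections import Counter

GAP_CHARS = set("-.")

def keep_column(column, min_occupancy=0.5, min_top_freq=0.25):
    """Decide whether one alignment column passes both filters."""
    residues = [c for c in column if c not in GAP_CHARS]
    # Drop columns represented by fewer than 50% of taxa.
    if not residues or len(residues) < min_occupancy * len(column):
        return False
    # Drop columns where no single residue exceeds 25% frequency
    # (assumption: frequency is measured among non-gap residues).
    top_count = Counter(residues).most_common(1)[0][1]
    return top_count / len(residues) > min_top_freq

def filter_alignment(seqs):
    """seqs maps taxon name -> aligned sequence (all of equal length)."""
    names = list(seqs)
    columns = list(zip(*(seqs[name] for name in names)))
    kept = [i for i, col in enumerate(columns) if keep_column(col)]
    return {name: "".join(seqs[name][i] for i in kept) for name in names}

# Hypothetical four-taxon alignment with four columns.
toy = {"a": "MK-A", "b": "MR-C", "c": "MH-D", "d": "M--E"}
print(filter_alignment(toy))
```

In the toy alignment the all-gap third column fails the occupancy filter, and the fourth column fails the frequency filter because each of its four residues occurs at exactly 25%.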

    Software/databases used:
    checkM v1.0.11 [16]
    HMMER v3.1b2 (http://hmmer.org/)
    prodigal v2.6.3 [22]
    trimAl v1.4.rev22 [24]
    AliView v1.18.1 [33] [34]
    Phyx v0.1 [35]
    RAxML v8.2.12 [36]
    Pplacer v1.1alpha [28]
    GTDB-Tk v0.1.3 [19]
    Kaiju v1.6.0 [34]
    GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
    NCBI Taxonomy (accessed 2018-07-02) [23]
    TIGRFAM v14.0 [37]
    PFAM v31.0 [38]

    Discussion/Caveats:
    MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example, mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans Expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user’s goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e., checking for contamination, dereplicating, building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes, or reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified, as well as the fasta file of protein sequences. However, the Kaiju index will need to be rebuilt, and the user will require a high

  4. MiST - Microbial Signal Transduction database

    • neuinfo.org
    • dknet.org
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). MiST - Microbial Signal Transduction database [Dataset]. http://identifiers.org/RRID:SCR_003166
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    Database which contains the signal transduction proteins for complete and draft bacterial and archaeal genomes. The MiST2 database identifies and catalogs the repertoire of signal transduction proteins in microbial genomes.

  5. ARS Microbial Genomic Sequence Database Server

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). ARS Microbial Genomic Sequence Database Server [Dataset]. https://catalog.data.gov/dataset/ars-microbial-genomic-sequence-database-server-1b81c
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This database server is supported in fulfilment of the research mission of the Mycotoxin Prevention and Applied Microbiology Research Unit at the National Center for Agricultural Utilization Research in Peoria, Illinois. The linked website provides access to gene sequence databases for various groups of microorganisms, such as Streptomyces species or Aspergillus species and their relatives, that are the product of ARS research programs. The sequence databases are organized in the BIGSdb (Bacterial Isolate Genomic Sequence Database) software package developed by Keith Jolley and Martin Maiden at Oxford University.

    Resources in this dataset: Resource Title: ARS Microbial Genomic Sequence Database Server. File Name: Web Page, url: http://199.133.98.43

  6. DOLOP: A Database of Bacterial Lipoproteins

    • dknet.org
    • neuinfo.org
    Updated Jan 29, 2022
    Cite
    (2022). DOLOP: A Database of Bacterial Lipoproteins [Dataset]. http://identifiers.org/RRID:SCR_013487
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    DOLOP is a knowledge base dedicated to bacterial lipoproteins. It processes information from 510 entries to provide a list of 199 distinct lipoproteins with relevant links to molecular details. Features include functional classification, a predictive algorithm for query sequences, primary sequence analysis and lists of predicted lipoproteins from 43 completed bacterial genomes, along with an interactive information exchange facility. The website will also have additional information on the biosynthetic pathway, supplementary material and other related figures. DOLOP also contains information and links to molecular details for about 278 distinct lipoproteins and predicted lipoproteins from 234 completely sequenced bacterial genomes. Additionally, the website features a tool that applies a predictive algorithm to identify the presence or absence of the lipoprotein signal sequence in a user-given sequence. The experimentally verified lipoproteins have been classified into different functional classes and, more importantly, functional domain assignments using hidden Markov models from the SUPERFAMILY database have been provided for the predicted lipoproteins. Other features include primary sequence analysis, signal sequence analysis, a search facility, and an information exchange facility that allows researchers to exchange results on newly characterized lipoproteins.

  7. BacMap: Bacterial Genome Atlas

    • neuinfo.org
    • dknet.org
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). BacMap: Bacterial Genome Atlas [Dataset]. http://identifiers.org/RRID:SCR_006988
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    An interactive visual database containing hundreds of fully labeled, zoomable, and searchable maps of bacterial genomes. It uses a visualization tool (CGView) to generate high-resolution circular genome maps from sequence feature information. Each map includes an interface that allows the image to be expanded and rotated. In the default view, identified genes are drawn to scale and colored according to coding directions. When a region of interest is expanded, gene labels are displayed. Each label is hyperlinked to a custom "gene card" which provides several fields of information concerning the corresponding DNA and protein sequences. Each genome map is searchable via a local BLAST search and a gene name/synonym search. A complete listing of the species and strains in the BacMap database is available on the BacMap homepage. Below each species/strain name is a list of the sequenced chromosomes and plasmids that are available.

    Some features of BacMap include:
    * Maps are available for 2023 bacterial chromosomes.
    * Each map supports zooming and rotation.
    * Map gene labels are hyperlinked to detailed textual annotations.
    * Maps can be explored manually, or with the help of BacMap's built-in text search and BLAST search.
    * A written synopsis of each bacterial species is provided.
    * Several charts illustrating the proteomic and genomic characteristics of each chromosome are available.
    * Flat file versions of the BacMap gene annotations, gene sequences and protein sequences can be downloaded.

    BacMap can be used to:
    * Obtain basic genome statistics.
    * Visualize the genomic context of genes.
    * Search for orthologues and paralogues in a genome of interest.
    * Search for conserved operon structure.
    * Look for gene content differences between bacterial species.
    * Obtain pre-calculated annotations for bacterial genes of interest.

  8. Archaeal and Bacterial ABC Transporter Database

    • dknet.org
    • neuinfo.org
    • +1more
    Updated Jan 29, 2022
    Cite
    (2022). Archaeal and Bacterial ABC Transporter Database [Dataset]. http://identifiers.org/RRID:SCR_001692
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    ABCdb is a public resource devoted to the ATP-binding Cassette (ABC) transporters encoded by completely sequenced prokaryotic genomes. In order to establish, in a complete genome, the repertory of ABC systems, we have to: i) identify the different partners, ii) assemble the partners in putative systems, and iii) classify the system into the correct functional subfamily (Quentin et al., 2002). The main pitfalls were the identification of loosely conserved domains and the assembly of partners encoded by genes dispersed over the chromosome. In order to face the avalanche of newly sequenced genomes, we decided to also feed into the database the raw predictions issued by this automatic procedure, before time-consuming review by an expert occurs. Therefore, the database comprises two sections: CleanDb, for data checked by an expert, and AutoDb, for raw data.

    The ABC proteins are involved in a wide variety of physiological processes in Archaea, Bacteria and Eucaryota, where they are encoded by large families of paralogous genes. The majority of ABC domains energize the transport of compounds across membranes. In bacteria, ABC transporters are involved in the uptake of a wide variety of molecules, as well as in mechanisms of virulence and antibiotic resistance. In eukaryotes, most of them are involved in drug resistance, and in human cells many are associated with diseases. Sequence analysis reveals that members of the ABC superfamily can be organized into sub-families, and suggests that they have diverged from common ancestral forms. A typical ABC transporter system is composed of an assembly of protein domains that serve different functions: i) two Nucleotide Binding Domains (NBD) that energize transport via ATP hydrolysis, ii) two Membrane Spanning Domains (MSD) that act as a membrane channel for the substrate, and iii) for the importer, a Solute Binding Protein (SBP) that confers substrate specificity on the transporter. The different partners of an ABC system are generally encoded by neighboring genes.

    The database includes information on:
    * ABC transporters
    * Protein partners
    * Protein domains (NBD, MSD and SBP)
    * Classification of ABC transporters and their protein partners
    * Taxonomy of the species

    Each model Protein includes a link to the Peptide sequence, general information extracted from EMBL files, and specific tags to store results of predictions. The results of the annotation procedure are reachable through the class Prediction. The origin of the proteins is modeled as a path through the classes Chromosome, Strain, Species, and Taxon. Assembly and protein compilation tables are also provided for each of the chromosomes (Assembly and Protein).

  9. MBGD - Microbial Genome Database

    • neuinfo.org
    • scicrunch.org
    • +1more
    Updated Feb 1, 2001
    Cite
    (2001). MBGD - Microbial Genome Database [Dataset]. http://identifiers.org/RRID:SCR_012824
    Explore at:
    Dataset updated
    Feb 1, 2001
    Description

    MBGD is a database for comparative analysis of completely sequenced microbial genomes, the number of which is now growing rapidly. The aim of MBGD is to facilitate comparative genomics from various points of view such as ortholog identification, paralog clustering, motif analysis and gene order comparison. The core function of MBGD is to create an orthologous or homologous gene cluster table. For this purpose, similarities between all genes are precomputed and stored in the database, in addition to gene annotations such as function categories that were assigned by the original authors and motifs that were found in the translated sequence. Using these homology data, MBGD dynamically creates an orthologous gene cluster table. Users can change the set of organisms or cutoff parameters to create their own orthologous grouping. Based on this cluster table, users can further analyze multiple genomes from various points of view with functions such as global map comparison, local map comparison, multiple sequence alignment and phylogenetic tree construction.

  10. mOTUs database for MetaMeta pipeline - Archaea and Bacteria - version 1

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jan 24, 2020
    Cite
    Piro, Vitor C. (2020). mOTUs database for MetaMeta pipeline - Archaea and Bacteria - version 1 [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_819364
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Robert Koch-Institut - MF1 Bioinformatics
    Authors
    Piro, Vitor C.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    mOTUs database for MetaMeta pipeline version 1. The database was downloaded from http://www.bork.embl.de/software/mOTU/share/mOTUs.Linux64bits.tar.gz and it is based on marker genes from 1,753 bacterial reference genomes + marker genes from 263 metagenomes and 3,496 bacterial genomes dating from February 2012

  11. Data from: Fermented Foods Microbial Genomes Database

    • zenodo.org
    application/gzip, tsv
    Updated Jul 2, 2025
    Cite
    Elizabeth McDaniel; Rachel Dutton (2025). Fermented Foods Microbial Genomes Database [Dataset]. http://doi.org/10.5281/zenodo.15794524
    Explore at:
    Available download formats: application/gzip, tsv
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elizabeth McDaniel; Rachel Dutton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fermented Foods Microbial Genomes Database

    This database contains 13,850 microbial genomes assembled from various fermented foods and associated curated metadata.

    We have also clustered the database at 95% identity for creating a species-representative database, and at 99% identity for creating a "strain"-representative database, since we hypothesize that many bioactivities and phenotypes for fermented food microbes are important at the strain-level.

    This GitHub repository documents how publicly available genomes and metagenome-assembled genomes were sourced and curated. This GitHub repository documents how the associated metadata was curated.

    This database largely pulls from existing genome resources, and we curated this database specifically for fermented foods. If you use this database, please cite the following genome databases/resources:

    1. MiFoDB, a workflow for microbial food metagenomic characterization, enables high-resolution analysis of fermented food microbial dynamics. Elisa B. Caffrey, Matthew R. Olm, Caroline Isabel Kothe, Joshua Evans, Justin L. Sonnenburg . bioRxiv 2024.03.29.587370; doi: https://doi.org/10.1101/2024.03.29.587370
    2. Unexplored microbial diversity from 2,500 food metagenomes and links with the human microbiome. Carlino, Niccolo Alvarez-Ordonez, Avelino et al. Cell, Volume 187, Issue 20, 5775-5795.e15 AND the associated Zenodo release: Master Consortium. (2024). Unexplored microbial diversity from 2,500 food metagenomes and links with the human microbiome. Zenodo. https://doi.org/10.5281/zenodo.13285428

    We are incredibly grateful for these groups and countless others taking the time to make their data publicly available. Included in the metadata is the original DOI and study link from which the genome was generated, in addition to if they were collated into one of the above two larger databases. If you specifically use/analyze a subset of genomes, please cite those studies to credit those that generate data and make it publicly available.

    Subsetting the Database to Species/Strain-Resolved Representatives or a Custom Set

    We have provided the entire set of 13,850 microbial genomes in a single tar archive for download. We have also provided tar archives of the genomes clustered at 95% and 99% identity. If you wish to download the entire database and then only use a subset of the database, such as species-representative (clustered at 95% ANI) or "strain-representative" (clustered at 99% ANI) genomes after downloading the entire database, you can use our helper script for subsetting genomes that are representatives or a custom list that you provide.

    usage: subset_genomes.py [-h] [--rep-column {rep_95id,rep_99id}]
                             [--id-column ID_COLUMN] [--dry-run]
                             [--genome-list GENOME_LIST]
                             metadata_tsv all_genomes_dir output_dir

    Subset representative genomes (species/strain) from a genome set using
    metadata.

    positional arguments:
      metadata_tsv          Path to metadata TSV file
      all_genomes_dir       Directory containing all .fa genome files
      output_dir            Directory to copy representative genomes to

    optional arguments:
      -h, --help            show this help message and exit
      --rep-column {rep_95id,rep_99id}
                            Column in metadata to use for representatives (e.g.,
                            rep_95id or rep_99id)
      --id-column ID_COLUMN
                            Column in metadata with genome file IDs (default:
                            mag_id)
      --dry-run             Only print what would be copied, don't actually copy
      --genome-list GENOME_LIST
                            Optional: Path to file with list of genome IDs or
                            filenames to subset (one per line)
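
As an illustration of the usage text above, the following Python sketch assembles a dry-run invocation of the helper script for the species-representative set. The flags and argument order come from the usage message; the file and directory names are hypothetical placeholders, and the sketch only builds and prints the command rather than running it.

```python
import shlex

def build_subset_command(metadata_tsv, all_genomes_dir, output_dir,
                         rep_column="rep_95id", dry_run=True):
    """Build an argument list for subset_genomes.py from its documented options."""
    if rep_column not in ("rep_95id", "rep_99id"):
        raise ValueError("rep_column must be 'rep_95id' or 'rep_99id'")
    cmd = ["python", "subset_genomes.py", "--rep-column", rep_column]
    if dry_run:
        # --dry-run only prints what would be copied.
        cmd.append("--dry-run")
    # Positional arguments go last, in the order given by the usage text.
    cmd += [metadata_tsv, all_genomes_dir, output_dir]
    return cmd

# Hypothetical paths; replace with the actual metadata file and directories.
cmd = build_subset_command("metadata.tsv", "all_genomes", "species_reps")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it; starting with `--dry-run` is a cheap way to check the selection before any files are copied.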

    KBase Fermented Foods Microbial Genomes Database Narrative

    We have uploaded the "strain-representative" set of ~4300 genomes to KBase as a public narrative.

    KBase is a community-driven platform for facilitating open science research in systems biology. KBase allows you to run bioinformatics tools on large datasets using freely available Department of Energy high-performance computing resources, allowing for open sharing of research outputs and collaborative work. You can sign up for a KBase account with any email account. You are not required to be affiliated with an academic institution or government organization to use KBase, and you can reside outside of the United States.
    This platform not only allows additional access to the Fermented Foods Microbial Genomes Database, but access to open-source bioinformatics tools and high-performance computing resources through the DOE to run reproducible analyses. You can use this narrative for exploring the database, incorporating your own genomes to compare against genomes in the database, and/or using as a teaching resource.

  12. SBDI Sativa curated 16S GTDB database

    • figshare.scilifelab.se
    • researchdata.se
    • +2more
    txt
    Updated Nov 3, 2025
    Cite
    Daniel Lundin; Anders Andersson (2025). SBDI Sativa curated 16S GTDB database [Dataset]. http://doi.org/10.17044/scilifelab.14869077.v10
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 3, 2025
    Dataset provided by
    Linnéuniversitetet
    Authors
    Daniel Lundin; Anders Andersson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data in this repository is the result of vetting 16S sequences from the Genome Taxonomy Database (GTDB) release R10-RS226 (r226) (https://gtdb.ecogenomic.org/; Parks et al. 2018) with the Sativa program (Kozlov et al. 2016) using the sbdi-phylomarkercheck Nextflow pipeline version 1.0.2.

    Using Sativa [Kozlov et al. 2016], 16S sequences from GTDB were checked so that their phylogenetic signal is consistent with their taxonomy. Before calling Sativa, sequences longer than 2000 nucleotides or containing Ns were removed, and the reverse complement of each was calculated. Subsequently, sequences were aligned with HMMER [Eddy 2011] using the Barrnap [https://github.com/tseemann/barrnap] archaeal and bacterial 16S profiles respectively, and sequences containing more than 10% gaps were removed. From each genome the longest sequence was selected (three from species-representative genomes). Subsequently, 30 sequences were selected from each species, with a stronger weight for species-representative genomes and for sequences with a longer alignment to the Barrnap profile. Priority was also multiplied by degree of CheckM contamination so that sequences from more contaminated genomes had a lower chance of becoming part of the 30 selected. Furthermore, sequences which did not have the same GTDB order as Silva order in GTDB's metadata received a lower priority in the selection of the 30. The 30 selected sequences were analyzed with Sativa, and sequences that were not phylogenetically consistent with their taxonomy were removed.

    Files for the DADA2 (Callahan et al. 2016) methods assignTaxonomy and addSpecies are available, in three different versions each. The assignTaxonomy files contain taxonomy for domain, phylum, class, order, family, genus and species. (Note that it has been proposed that species assignment for short 16S sequences requires 100% identity (Edgar 2018), so use species assignments from assignTaxonomy with caution.) The versions differ in the maximum number of genomes that we included per species: 1, 5 or 20, indicated by "1genome", "5genomes" and "20genomes" in the file names respectively. Using the version with 20 genomes per species should increase the chances that the addSpecies algorithm identifies an exactly matching sequence, while using a file with many genomes per species could potentially bias the taxonomic annotations at higher levels by assignTaxonomy. Our recommendation is hence to use the "1genome" files for assignTaxonomy and the "20genomes" files for addSpecies.

    The fasta files are gzipped fasta files with 16S sequences. The assignTaxonomy files are associated with taxonomy hierarchies from domain to species, whereas the addSpecies files have sequence identities and species names. There is also a fasta file with the original GTDB sequence names: sbdi-gtdb-sativa.r09rs220.20genomes.fna.gz.

    Taxonomical annotation of 16S amplicons using this data is available as an optional argument to the nf-core/ampliseq Nextflow workflow: --dada_ref_taxonomy sbdi-gtdb (https://nf-co.re/ampliseq; Straub et al. 2020).

    In addition to the fasta files, the workflow outputs phylogenetic trees by optimizing branch lengths of the original phylogenomic GTDB trees based on a 16S sequence alignment. As not all species in GTDB will have correct 16S sequences, the GTDB trees are first subset to contain only species for which the species-representative genome has a correct 16S sequence. Subsequently, branch lengths for the tree are optimized based on the original alignment of 16S sequences using IQTREE [Nguyen et al. 2015] with a GTR+F+I+G4 model. The alignment files end with .alnfna, the taxonomy files with .taxonomy.tsv and the tree files (newick-formatted) end with .brlenopt.newick. They will be made available in nf-core/ampliseq for phylogenetic placement.

    The data will be updated roughly yearly, after the GTDB database is updated.

    Version history
    v11 (2025-10-31): Stricter filtering of sequences before Sativa, see description above.
    v10 (2025-04-30): Update versions in this text.
    v9 (2025-04-29): Update to GTDB R10-RS226.
    v8 (2025-02-18): Remove extra sequences from e.g. "1genome" files that appeared due to ties.
    v7 (2024-06-25): Update to GTDB R09-RS220 from R08-RS214.
    v6 (2024-04-24): Replace manual procedure with Nextflow pipeline. Update to GTDB R08-RS214 from R07-RS207.
    v5 (2022-10-07): Add missing fasta file with original GTDB names.
    v4 (2022-08-31): Update to GTDB R07-RS207 from R06-RS202.

    Acknowledgements
    The computations were enabled by resources in project [NAISS 2023/22-601, SNIC 2022/22-500 and SNIC 2021/22-263] provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at UPPMAX, funded by the Swedish Research Council through grant agreement no. 2022-06725. Computations were also enabled by resources provided by Dr. Maria Vila-Costa, Institute of Environmental Assessment and Water Research (IDAEA-CSIC), Barcelona.
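
The pre-Sativa filtering rules in the description (drop sequences longer than 2000 nucleotides or containing Ns, compute reverse complements, drop aligned sequences with more than 10% gaps) can be sketched in Python as below. This is an illustration only, not the sbdi-phylomarkercheck implementation. The function names are hypothetical, and the sketch assumes that the gap fraction is measured against the full aligned length and that gaps are written as `-` or `.`.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def passes_length_and_n_filter(seq, max_len=2000):
    """Keep sequences of at most max_len nucleotides that contain no N."""
    return len(seq) <= max_len and "N" not in seq.upper()


def gap_fraction(aligned_seq, gap_chars="-."):
    """Fraction of alignment columns that are gaps (assumed definition)."""
    if not aligned_seq:
        return 1.0
    gaps = sum(aligned_seq.count(char) for char in gap_chars)
    return gaps / len(aligned_seq)


def passes_gap_filter(aligned_seq, max_gap_fraction=0.10):
    """Keep aligned sequences with at most max_gap_fraction gaps."""
    return gap_fraction(aligned_seq) <= max_gap_fraction
```

The thresholds are exposed as parameters so that the stated cut-offs (2000 nt, 10% gaps) are visible and easy to change.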

  13. d

    Ensembl Bacteria

    • dknet.org
    • scicrunch.org
    • +2more
    Updated Nov 12, 2024
    Cite
    (2024). Ensembl Bacteria [Dataset]. http://identifiers.org/RRID:SCR_008679
    Explore at:
    Dataset updated
    Nov 12, 2024
    Description

    The Ensembl Genomes project produces genome databases for important species from across the taxonomic range, using the Ensembl software system. Five sites are now available, one of which is Ensembl Bacteria, which houses bacterial species. All bacterial collections in Ensembl Bacteria have been updated with the latest data from ENA and UniProtKB. New genomes have been added to Escherichia/Shigella (3 additional genomes) and Staphylococcus (3 additional genomes). The mapping of array probes has been expanded to all genomes in the Escherichia/Shigella and Staphylococcus collections. Ensembl Bacteria also now features improved interfaces for selecting regions of circular molecules and a new visualisation allowing the large-scale comparison of multiple genomes. In multi-synteny view, users can select multiple genomes and observe the syntenic relationships between them. Sponsors: Ensembl Bacteria is a project run by EMBL-EBI to maintain annotation on selected genomes, based on the software of the Ensembl project, which was developed jointly by the EBI and the Wellcome Trust Sanger Institute.

  14. u

    National Microbial Germplasm Program

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    bin
    Updated Nov 21, 2025
    Cite
    USDA ARS National Germplasm Resources Laboratory (2025). National Microbial Germplasm Program [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/National_Microbial_Germplasm_Program/24661746
    Explore at:
    Available download formats: bin
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    National Germplasm Resources Laboratory
    Authors
    USDA ARS National Germplasm Resources Laboratory
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The goal of the National Microbial Germplasm Program is to ensure that the genetic diversity of agriculturally important microorganisms is maintained to enhance and increase agricultural efficiency and profitability. The program collects, authenticates, and characterizes potentially useful microbial germplasm; preserves microbial genetic diversity; and facilitates distribution and utilization of microbial germplasm for research and industry.

    The Agricultural Research Service maintains several microbial germplasm collections including:
    • USDA ARS Culture Collection
    • USDA ARS Collection of Entomopathogenic Fungal Cultures (ARSEF)
    • Query or Download the Rhizobium Database
    • US National Fungus Collections

    Resources in this dataset:
    Resource Title: National Microbial Germplasm Program. File Name: Web Page, url: https://www.ars-grin.gov/Collections#microbial-germplasm
    Main web site for the National Microbial Germplasm Program with links to component databases/collections.

  15. Z

    NCBI Refseq database as of May 2023 part 2

    • data.niaid.nih.gov
    Updated Feb 27, 2024
    Cite
    Robert, Nichols (2024). NCBI Refseq database as of May 2023 part 2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10452278
    Explore at:
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    Pennsylvania State University
    Authors
    Robert, Nichols
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the second part of the NCBI RefSeq bacterial database originally downloaded in May 2023. It was used to create the bacterial 16S and gyrB databases used in gyrB primer development. The first part can be found at 10.5281/zenodo.10452184.

    To recombine the database parts, run:

    "cat Bacteria.refseq.tar.gz.part* > Bacteria.refseq.tar.gz"

    The total file size of the downloaded RefSeq database is 88 GB.

    The gyrB and 16S databases can be found at 10.5281/zenodo.10451935
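
For readers without a Unix shell, the `cat` recombination step above can be approximated in Python. This is a minimal sketch: the function name `recombine_parts` is hypothetical, and it assumes the part files sort lexicographically into the correct order, which is also what the shell glob relies on.

```python
import shutil
from pathlib import Path


def recombine_parts(directory, prefix="Bacteria.refseq.tar.gz",
                    chunk_size=1024 * 1024):
    """Concatenate <prefix>.part* files in sorted order into <prefix>."""
    directory = Path(directory)
    # Sorted order mirrors how the shell expands the .part* glob.
    parts = sorted(directory.glob(prefix + ".part*"))
    if not parts:
        raise FileNotFoundError(f"no {prefix}.part* files in {directory}")
    target = directory / prefix
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts never sit fully in memory.
                shutil.copyfileobj(src, out, chunk_size)
    return target
```

Streaming in chunks matters here because the recombined archive is reported to be 88 GB.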

  16. m

    In-house database specific to bacteria from horses for MALDI Biotyper

    • data.mendeley.com
    • narcis.nl
    Updated May 12, 2021
    Cite
    Eri Uchida-Fujii (2021). In-house database specific to bacteria from horses for MALDI Biotyper [Dataset]. http://doi.org/10.17632/m342p574wj.1
    Explore at:
    Dataset updated
    May 12, 2021
    Authors
    Eri Uchida-Fujii
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains the main spectral profiles (MSPs) for the MALDI Biotyper CA 3.2 System (Bruker Japan, Kanagawa, Japan), which constitute an in-house database specific to bacteria from horses for identification with matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS). MALDI-TOF MS is used for identification of bacterial species isolated from horses. However, some bacterial species isolated from horses are difficult to identify with MALDI-TOF MS because of insufficiencies in the reference database, and enriching the databases is expected to enhance the accuracy of MALDI-TOF MS identification. Here we created an in-house database including 271 bacterial isolates from horses. Bacterial isolates were subjected to ethanol/formic acid treatment, and spectra were acquired using flexControl 3.4 software (Bruker Japan). The spectra were checked with flexAnalysis 3.4 software (Bruker Japan) to delete spectra differing from the cohort spectra, and then imported for generating MSPs using MBT Compass Explorer 4.1.7.0 software (Bruker Japan). MSPs were exported in *.btmsp format for use with MALDI Biotyper systems.

  17. Additional file 7 of A large-scale genomically predicted protein mass...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Aug 16, 2024
    Cite
    Yuji Sekiguchi; Kanae Teramoto; Dieter M. Tourlousse; Akiko Ohashi; Mayu Hamajima; Daisuke Miura; Yoshihiro Yamada; Shinichi Iwamoto; Koichi Tanaka (2024). Additional file 7 of A large-scale genomically predicted protein mass database enables rapid and broad-spectrum identification of bacterial and archaeal isolates by mass spectrometry [Dataset]. http://doi.org/10.6084/m9.figshare.26637801.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Aug 16, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yuji Sekiguchi; Kanae Teramoto; Dieter M. Tourlousse; Akiko Ohashi; Mayu Hamajima; Daisuke Miura; Yoshihiro Yamada; Shinichi Iwamoto; Koichi Tanaka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 7: Table S6. Identification of new isolates from the same mice fecal samples.

  18. Z

    Draft genomes of vinasse bacteria

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Noriko A. Cassman; Kesia S. Lourenco; Janaina do Carmo; Heitor Cantarella; Eiko E. Kuramae (2020). Draft genomes of vinasse bacteria [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_1194339
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Federal University of Sao Carlos
    NIOO-KNAW
    IAC
    NIOO-KNAW and Agronomic Institute of Campinas (IAC)
    Netherlands Institute for Ecology (NIOO-KNAW)
    Authors
    Noriko A. Cassman; Kesia S. Lourenco; Janaina do Carmo; Heitor Cantarella; Eiko E. Kuramae
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Twenty-one draft bacterial genomes and their annotations, as described in the publication Cassman, N.A. et al. 2018, Biotech for Biofuels. The goal of the study was to characterize the microbial assemblage present in sugarcane vinasse, which is the major waste of bioethanol production from sugarcane and is generally used as an organic and/or K fertilizer. Briefly, the draft genomes were binned (maxbin2) and manually refined (anvi'o v.2.3.2) from a cross-assembly (metaSPADES v3.8.2) of 18 metagenomes sequenced from six batches of sugarcane vinasse with Illumina MiSeq technology. Annotations (prokka v1.12) were carried out against prokka databases (UniProtKB Bacterial and the HAMAP HMM database) along with the dbCAN HMM database (dbCAN-fam-HMMs.v6). The uploaded files are from 1) prokka output: bin nucleotide sequences (fa), protein sequences (faa), annotations (gff and tsv) and annotation info (txt); and 2) anvi'o output: percent recruitment of the bins, bin summary and samples summary.

  19. Resilience of Microbial Communities Sequence Data Set

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Resilience of Microbial Communities Sequence Data Set [Dataset]. https://catalog.data.gov/dataset/resilience-of-microbial-communities-sequence-data-set
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The EX_Genome_Assemblies.zip file contains the contig sequences (i.e. assemblies) of fifteen isolates used for genomic and antibiotic resistance gene (ARG) analysis. The EX_OTU.fasta file contains the sequences of the V4 region of the bacterial 16S rRNA-encoding gene (≈250 nt) for each Operational Taxonomic Unit (OTU). This dataset is associated with the following publication: Gomez-Alvarez, V., S. Pfaller, J. Pressman, D. Wahman, and R. Revetta. Resilience of microbial communities in a simulated drinking water distribution system subjected to disturbances: role of conditionally rare taxa and potential implications for antibiotic-resistant bacteria. Environmental Science: Water Research & Technology. Royal Society of Chemistry, Cambridge, UK, 2: 645-657, (2016).

  20. n

    Data from: Distinguishing potential bacteria-tumor associations from...

    • data-staging.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Jan 11, 2018
    Cite
    Kelly M. Robinson; Jonathan Crabtree; John S. A. Mattick; Kathleen E. Anderson; Julie C. Dunning Hotopp (2018). Distinguishing potential bacteria-tumor associations from contamination in a secondary data analysis of public cancer genome sequence data [Dataset]. http://doi.org/10.5061/dryad.96584
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 11, 2018
    Dataset provided by
    University of Maryland, Baltimore County
    Authors
    Kelly M. Robinson; Jonathan Crabtree; John S. A. Mattick; Kathleen E. Anderson; Julie C. Dunning Hotopp
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: A variety of bacteria are known to influence carcinogenesis. Therefore, we sought to investigate if publicly available whole genome and whole transcriptome sequencing data generated by large public cancer genome efforts, like The Cancer Genome Atlas (TCGA), could be used to identify bacteria associated with cancer. The Burrows-Wheeler aligner (BWA) was used to align a subset of Illumina paired-end sequencing data from TCGA to the human reference genome and all complete bacterial genomes in the RefSeq database in an effort to identify bacterial read pairs from the microbiome.

    Results: Through careful consideration of all of the bacterial taxa present in the cancer types investigated, their relative abundance, and batch effects, we were able to identify some read pairs from certain taxa as likely resulting from contamination. In particular, the presence of Mycobacterium tuberculosis complex in the ovarian serous cystadenocarcinoma (OV) and glioblastoma multiforme (GBM) samples was correlated with the sequencing center of the samples. Additionally, there was a correlation between the presence of Ralstonia spp. and two specific plates of acute myeloid leukemia (AML) samples. At the end, associations remained between Pseudomonas-like and Acinetobacter-like read pairs in AML, and Pseudomonas-like read pairs in stomach adenocarcinoma (STAD) that could not be explained through batch effects or systematic contamination as seen in other samples.

    Conclusions: This approach suggests that it is possible to identify bacteria that may be present in human tumor samples from public genome sequencing data that can be examined further experimentally. More weight should be given to this approach in the future when bacterial associations with diseases are suspected.

(2012). DBETH - Database for Bacterial ExoToxins for Humans [Dataset]. http://identifiers.org/RRID:SCR_005908

DBETH - Database for Bacterial ExoToxins for Humans

RRID:SCR_005908, nlx_149481, biotools:dbeth, DBETH - Database for Bacterial ExoToxins for Humans (RRID:SCR_005908), DBETH, Database for Bacterial ExoToxins for Humans

