100+ datasets found

A tandem repeats database for bacterial genomes: application to the...
catalog.data.gov
odgavaprod.ogopendata.com
Updated Sep 7, 2025
+ more versions
Cite
National Institutes of Health (2025). A tandem repeats database for bacterial genomes: application to the genotyping of [Dataset]. https://catalog.data.gov/dataset/a-tandem-repeats-database-for-bacterial-genomes-application-to-the-genotyping-of
Explore at:
Dataset updated
Sep 7, 2025
Dataset provided by
National Institutes of Health
Description
Background: Some pathogenic bacteria are genetically very homogeneous, making strain discrimination difficult. In the last few years, tandem repeats have been increasingly recognized as markers of choice for genotyping a number of pathogens. The rapid evolution of these structures appears to contribute to the phenotypic flexibility of pathogens. The availability of whole-genome sequences has opened the way to the systematic evaluation of tandem repeat diversity and its application to epidemiological studies.

Results: This report presents a database () of tandem repeats from publicly available bacterial genomes which facilitates the identification and selection of tandem repeats. We illustrate the use of this database by the characterization of minisatellites from two important human pathogens, Yersinia pestis and Bacillus anthracis. In order to avoid simple sequence contingency loci, which may be of limited value as epidemiological markers, and to provide genotyping tools amenable to ordinary agarose gel electrophoresis, only tandem repeats with repeat units at least 9 bp long were evaluated. Yersinia pestis contains 64 such minisatellites in which the unit is repeated at least 7 times. An additional collection of 12 loci with at least 6 units and high internal conservation was also evaluated. Forty-nine are polymorphic among five Yersinia strains (twenty-five among three Y. pestis strains). Bacillus anthracis contains 30 comparable structures in which the unit is repeated at least 10 times. Half of these tandem repeats show polymorphism among the strains tested.

Conclusions: Analysis of the currently available bacterial genome sequences classifies Bacillus anthracis and Yersinia pestis as having an average (approximately 30 per Mb) density of tandem repeat arrays longer than 100 bp when compared to the other bacterial genomes analysed to date. In both cases, testing a fraction of these sequences for polymorphism was sufficient to quickly develop a set of more than fifteen informative markers, some of which show a very high degree of polymorphism. In one instance, the polymorphism information content index reaches 0.82, with allele lengths covering a wide size range (600-1950 bp) and nine alleles resolved in the small number of independent Bacillus anthracis strains typed here.
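
The description reports a polymorphism information content index for one locus but does not say which formula was used. As a rough sketch, two common definitions can be computed from allele frequencies: gene diversity (1 minus the sum of squared allele frequencies) and the Botstein-style PIC, which subtracts an extra cross term. The function names and the allele sizes below are illustrative assumptions, not values from the dataset.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return the relative frequency of each distinct allele in a list of typed strains."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def gene_diversity(freqs):
    """Gene diversity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pic_botstein(freqs):
    """Botstein-style PIC: 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    squares = [p * p for p in freqs]
    cross = sum(
        2 * squares[i] * squares[j]
        for i in range(len(squares))
        for j in range(i + 1, len(squares))
    )
    return 1.0 - sum(squares) - cross


# Hypothetical allele sizes (bp) for eight strains typed at one locus.
typed_alleles = [600, 600, 750, 900, 900, 900, 1200, 1950]
freqs = allele_frequencies(typed_alleles)
diversity = gene_diversity(freqs)
pic = pic_botstein(freqs)
print(f"gene diversity = {diversity:.3f}, PIC = {pic:.3f}")
```

Both indices approach 1 as the number of equally frequent alleles grows, which is why a locus with nine alleles among a small strain set can score highly.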
MARMICRODB database for taxonomic classification of (marine) metagenomes
zenodo.org
application/gzip, bin +3
Updated Mar 20, 2020
Cite
Shane L Hogle (2020). MARMICRODB database for taxonomic classification of (marine) metagenomes [Dataset]. http://doi.org/10.5281/zenodo.3520509
Explore at:
Available download formats: bin, application/gzip, tsv, html, bz2
Unique identifier
https://doi.org/10.5281/zenodo.3520509
Dataset updated
Mar 20, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Shane L Hogle
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Introduction:
This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.

Motivation:
We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju’s sensitivity and classification accuracy depend on the composition of the reference database, and the highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from an environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.

Results/Description:
MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju formatted Burrows-Wheeler index, translated genes, the custom taxonomy hierarchy, an interactive kronaplot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.

Methods:
The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes were performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using checkM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a checkM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 for further analysis and confirmed taxonomic assignment using blast matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with checkM. For Synechococcus we used the CheckM taxonomic-specific workflow with the genus Synechococcus. After the custom CheckM quality control, we excluded any genome bins from downstream analysis that had an estimated quality < 30, defined as %completeness – 5 × %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22] and excluded protein sequences with lengths less than 20 and greater than 20000 amino acids, removed non-standard amino acid residues, and condensed redundant protein sequences to a single representative sequence to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23]. The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
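
The two numeric filters in the paragraph above (bin quality and protein length) can be sketched as follows. This is a minimal illustration, not the authors' scripts: the function names and the example bins are hypothetical, and only the thresholds (quality of at least 30, lengths of 20 to 20000 amino acids) come from the text.

```python
def quality_score(completeness, contamination):
    """Estimated quality as described above: %completeness - 5 x %contamination."""
    return completeness - 5.0 * contamination


def passes_quality(completeness, contamination, min_quality=30.0):
    """Bins with estimated quality < 30 are excluded, so keep those scoring >= 30."""
    return quality_score(completeness, contamination) >= min_quality


def keep_protein(sequence, min_len=20, max_len=20000):
    """Proteins shorter than 20 or longer than 20000 residues are excluded."""
    return min_len <= len(sequence) <= max_len


# Hypothetical bins: (name, % completeness, % contamination).
bins = [
    ("bin_A", 90.0, 2.0),   # quality 80, kept
    ("bin_B", 40.0, 3.0),   # quality 25, dropped
    ("bin_C", 35.0, 1.0),   # quality 30, kept (not below the cutoff)
]
kept_bins = [name for name, comp, cont in bins if passes_quality(comp, cont)]
print(kept_bins)
```

Because contamination is weighted five times as heavily as completeness, a nearly complete bin can still fail if it carries substantial contamination.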

The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes for the fixed reference phylogeny are estimated > 90% complete, and SAR11 genomes are estimated > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the hmmer software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to only allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce the phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL within the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree. We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and we added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
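
The column filter described above (drop columns present in fewer than 50% of taxa, or with no residue above 25% frequency) can be sketched in a few lines. This is an illustration under stated assumptions rather than the pipeline's own code: the function names and the toy alignment are hypothetical, and residue frequency is computed relative to all taxa, which the text does not specify.

```python
from collections import Counter

GAP_CHARS = {"-", "."}


def keep_column(column, min_occupancy=0.5, min_consensus=0.25):
    """Keep a column if enough taxa have a residue and one residue is common enough."""
    n_taxa = len(column)
    residues = [c for c in column if c not in GAP_CHARS]
    # Drop columns represented by fewer than 50% of taxa.
    if len(residues) / n_taxa < min_occupancy:
        return False
    # Drop columns where no single residue exceeds 25% frequency
    # (assumption: frequency is measured over all taxa, gaps included).
    top_count = Counter(residues).most_common(1)[0][1]
    return top_count / n_taxa > min_consensus


def filter_alignment(alignment):
    """Return a new name -> sequence mapping containing only the kept columns."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if keep_column(col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Toy alignment of four taxa (hypothetical sequences).
toy = {"t1": "MA-K", "t2": "MC-K", "t3": "MD-R", "t4": "ME-S"}
filtered = filter_alignment(toy)
print(filtered)
```

In the toy alignment the second column is dropped because every residue sits at exactly 25%, and the third is dropped because it is all gaps.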

Software/databases used:
checkM v1.0.11[16]
HMMERv3.1b2 (http://hmmer.org/)
prodigal v2.6.3 [22]
trimAl v1.4.rev22 [24]
AliView v1.18.1 [33] [34]
Phyx v0.1 [35]
RAxML v8.2.12 [36]
Pplacer v1.1alpha [28]
GTDB-Tk v0.1.3 [19]
Kaiju v1.6.0 [34]
GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
NCBI Taxonomy (accessed 2018-07-02) [23]
TIGRFAM v14.0 [37]
PFAM v31.0 [38]

Discussion/Caveats:
MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans Expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user’s goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e., checking for contamination, dereplicating, building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes or reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified, as well as the fasta file of protein sequences. However, the Kaiju index will need to be rebuilt, and the user will require a high
ARS Microbial Genomic Sequence Database Server
agdatacommons.nal.usda.gov
catalog.data.gov
bin
Updated Feb 9, 2024
Cite
USDA Agricultural Research Service (2024). ARS Microbial Genomic Sequence Database Server [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/ARS_Microbial_Genomic_Sequence_Database_Server/24661200
Explore at:
Available download formats: bin
Dataset updated
Feb 9, 2024
Dataset provided by
Agricultural Research Service (https://www.ars.usda.gov/)
United States Department of Agriculture (http://usda.gov/)
Authors
USDA Agricultural Research Service
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
This database server is supported in fulfilment of the research mission of the Mycotoxin Prevention and Applied Microbiology Research Unit at the National Center for Agricultural Utilization Research in Peoria, Illinois. The linked website provides access to gene sequence databases for various groups of microorganisms, such as Streptomyces species or Aspergillus species and their relatives, that are the product of ARS research programs. The sequence databases are organized in the BIGSdb (Bacterial Isolate Genomic Sequence Database) software package developed by Keith Jolley and Martin Maiden at Oxford University.
Resources in this dataset: Resource Title: ARS Microbial Genomic Sequence Database Server. File Name: Web Page, url: http://199.133.98.43
BacMap: Bacterial Genome Atlas
neuinfo.org
dknet.org
+2more
Updated Jan 29, 2022
Cite
(2022). BacMap: Bacterial Genome Atlas [Dataset]. http://identifiers.org/RRID:SCR_006988
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006988
Dataset updated
Jan 29, 2022
Description
An interactive visual database containing hundreds of fully labeled, zoomable, and searchable maps of bacterial genomes. It uses a visualization tool (CGView) to generate high-resolution circular genome maps from sequence feature information. Each map includes an interface that allows the image to be expanded and rotated. In the default view, identified genes are drawn to scale and colored according to coding directions. When a region of interest is expanded, gene labels are displayed. Each label is hyperlinked to a custom "gene card" which provides several fields of information concerning the corresponding DNA and protein sequences. Each genome map is searchable via a local BLAST search and a gene name/synonym search. A complete listing of the species and strains in the BacMap database is available on the BacMap homepage. Below each species/strain name is a list of the sequenced chromosomes and plasmids that are available.

Some features of BacMap include:
* Maps are available for 2023 bacterial chromosomes.
* Each map supports zooming and rotation.
* Map gene labels are hyperlinked to detailed textual annotations.
* Maps can be explored manually, or with the help of BacMap's built-in text search and BLAST search.
* A written synopsis of each bacterial species is provided.
* Several charts illustrating the proteomic and genomic characteristics of each chromosome are available.
* Flat file versions of the BacMap gene annotations, gene sequences and protein sequences can be downloaded.

BacMap can be used to:
* Obtain basic genome statistics.
* Visualize the genomic context of genes.
* Search for orthologues and paralogues in a genome of interest.
* Search for conserved operon structure.
* Look for gene content differences between bacterial species.
* Obtain pre-calculated annotations for bacterial genes of interest.
Genome Taxonomy Database r226.0
gbif.org
Updated Jan 28, 2026
Cite
Donovan Parks; Phil Hugenholtz (2026). Genome Taxonomy Database r226.0 [Dataset]. http://doi.org/10.15468/dpzg84
Explore at:
Unique identifier
https://doi.org/10.15468/dpzg84
Dataset updated
Jan 28, 2026
Dataset provided by
Global Biodiversity Information Facility (https://www.gbif.org/)
The University of Queensland
Authors
Donovan Parks; Phil Hugenholtz
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The Genome Taxonomy Database (GTDB) is an initiative to establish a standardised microbial taxonomy based on genome phylogeny, primarily funded by the Australian Research Council via a Laureate Fellowship (FL150100038) and Discovery Project (DP220100900), with the welcome assistance of strategic funding from The University of Queensland. The genomes used to construct the phylogeny are obtained from RefSeq and GenBank, and GTDB releases are indexed to RefSeq releases, starting with release 76. Importantly and increasingly, this dataset includes draft genomes of uncultured microorganisms obtained from metagenomes and single cells, ensuring improved genomic representation of the microbial world. All genomes are independently quality controlled using CheckM before inclusion in GTDB (see statistics here). The GTDB taxonomy is based on genome trees inferred using FastTree from an aligned concatenated set of 120 single copy marker proteins for Bacteria, and with IQ-TREE from a concatenated set of 53 (starting with R07-RS207) and 122 (prior to R07-RS207) marker proteins for Archaea (download page here). Additional marker sets are also used to cross-validate tree topologies, including concatenated ribosomal proteins and ribosomal RNA genes. NCBI taxonomy was initially used to decorate the genome tree via tax2tree and subsequently used as a reference source of new taxonomic opinions, including new names. The 16S rRNA-based Greengenes and SILVA taxonomies were initially used to supplement the taxonomy, particularly in regions of the tree with no cultured representatives; however, genome assembly identifiers are now used to create placeholder names for uncultured taxa. LPSN is used as the primary nomenclatural reference for establishing naming priorities and nomenclature types. All taxonomic ranks except species are normalised using PhyloRank, and the taxonomy is manually curated to remove polyphyletic groups. Polyphyly and rank evenness can be visualised in PhyloRank plots. Species were originally delineated based on phylogeny and rank normalization, but this was replaced with an ANI-based method (starting with R04-RS89) to enable scalable and automated assignment of genomes to species clusters. The GTDB taxonomy can be queried and downloaded through a number of tools at https://gtdb.ecogenomic.org/
Archaeal and Bacterial ABC Transporter Database
dknet.org
neuinfo.org
+1more
Updated Jan 29, 2022
Cite
(2022). Archaeal and Bacterial ABC Transporter Database [Dataset]. http://identifiers.org/RRID:SCR_001692
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_001692
Dataset updated
Jan 29, 2022
Description
ABCdb is a public resource devoted to the ATP-binding Cassette (ABC) transporters encoded by completely sequenced prokaryotic genomes. In order to establish, in a complete genome, the repertory of ABC systems, we have to: i) identify the different partners, ii) assemble the partners in putative systems, and iii) classify the system into the correct functional subfamily (Quentin et al., 2002). The main pitfalls were the identification of loosely conserved domains and the assembly of partners encoded by genes dispersed over the chromosome. In order to face the avalanche of newly sequenced genomes, we decided to also feed into the database the raw predictions issued by this automatic procedure, before time-consuming review by an expert occurs. Therefore, the database comprises two sections: CleanDb, for data checked by an expert, and AutoDb, for raw data. The ABC proteins are involved in a wide variety of physiological processes in Archaea, Bacteria and Eucaryota, where they are encoded by large families of paralogous genes. The majority of ABC domains energize the transport of compounds across membranes. In bacteria, ABC transporters are involved in the uptake of a wide variety of molecules, as well as in mechanisms of virulence and antibiotic resistance. In eukaryotes, most of them are involved in drug resistance, and in human cells, many are associated with diseases. Sequence analysis reveals that members of the ABC superfamily can be organized into sub-families, and suggests that they have diverged from common ancestral forms. A typical ABC transporter system is composed of an assembly of protein domains that serve different functions: i) two Nucleotide Binding Domains (NBD) that energize transport via ATP hydrolysis, ii) two Membrane Spanning Domains (MSD) that act as a membrane channel for the substrate, and iii) for the importer, a Solute Binding Protein (SBP) that confers substrate specificity on the transporter. The different partners of an ABC system are generally encoded by neighboring genes.

The database includes information on:
* ABC transporters
* Protein partners
* Protein domains (NBD, MSD and SBP)
* Classification of ABC transporters and their protein partners
* Taxonomy of the species

Each model Protein includes a link to the Peptide sequence, general information extracted from EMBL files, and specific tags to store results of predictions. The results of the annotation procedure are reachable through the class Prediction. The origin of the proteins is modeled as a path through the classes Chromosome, Strain, Species, and Taxon. Assembly and protein compilation tables are also provided for each of the chromosomes (Assembly and Protein).
MBGD - Microbial Genome Database
neuinfo.org
rrid.site
+2more
Updated Feb 1, 2001
+ more versions
Cite
(2001). MBGD - Microbial Genome Database [Dataset]. http://identifiers.org/RRID:SCR_012824
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_012824
Dataset updated
Feb 1, 2001
Description
MBGD is a database for comparative analysis of completely sequenced microbial genomes, the number of which is now growing rapidly. The aim of MBGD is to facilitate comparative genomics from various points of view such as ortholog identification, paralog clustering, motif analysis and gene order comparison. The core function of MBGD is to create an orthologous or homologous gene cluster table. For this purpose, similarities between all genes are precomputed and stored in the database, in addition to the annotations of genes, such as function categories that were assigned by the original authors and motifs that were found in the translated sequence. Using these homology data, MBGD dynamically creates an orthologous gene cluster table. Users can change the set of organisms or the cutoff parameters to create their own orthologous grouping. Based on this cluster table, users can further analyze multiple genomes from various points of view with functions such as global map comparison, local map comparison, multiple sequence alignment and phylogenetic tree construction.
BIGSdb
dknet.org
Updated Feb 3, 2026
Cite
(2026). BIGSdb [Dataset]. http://identifiers.org/RRID:SCR_023551/resolver?q=&i=rrid
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_023551 https://identifiers.org/RRID:SCR_023551/resolver?q=&i=rrid
Dataset updated
Feb 3, 2026
Description
Platform for gene-by-gene bacterial population annotation and analysis. Designed to store and analyse sequence data for bacterial isolates. Used for scalable analysis of bacterial genome variation at population level.
Bacterial and Archaeal Phenotypic Database (BAPdb)
figshare.com
xlsx
Updated Jun 16, 2025
Cite
Fouad El Baidouri; Chris Venditti; Andrew Meade; Sei Suzuki; Stuart Humphries (2025). Bacterial and Archaeal Phenotypic Database (BAPdb) [Dataset]. http://doi.org/10.6084/m9.figshare.12987509.v3
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.12987509.v3
Dataset updated
Jun 16, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Fouad El Baidouri; Chris Venditti; Andrew Meade; Sei Suzuki; Stuart Humphries
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Supplementary data for "Phenotypic reconstruction of the last universal common ancestor reveals a complex cell" doi: https://doi.org/10.1101/2020.08.20.260398
TWIW database dump
data.dtu.dk
txt
Updated Jul 10, 2023
Cite
Sidsel Nag; Gunhild Larsen; Judit Szarvas; Laura Elmlund Kohl Birkedahl; Gábor Máté Gulyás; Wojciech Jakub Ciok; Timmie M. R. Lagermann; Silva Tafaj; Susan Bradbury; Peter Collignon; Denise Daley; Victorien Dougnon; Kafayath Fabiyi; Boubacar Coulibaly; René Dembélé; Georgette Nikiema; Natama Magloire; Isidore Juste Ouindgueta; Zenat Zebin Hossain; Anowara Begum; Deyan Donchev; Mathew Diggle; LeeAnn Turnbull; Simon Lévesque; Livia Berlinger; Kirstine Kobberoe Søgaard; Paula Diaz Guevara; Carolina Duarte Valderrama; Panagiota Maikanti; Jana Amlerova; Pavel Drevinek; Jan Tkadlec; Milica Dilas; Achim J. Kaasch; HenrikTorkil Westh; Mohamed Azzedine Bachtarzi; Wahiba Amhis; Carolina Elizabeth Satán Salazar; José Eduardo Villacis; Mária Angeles Dominguez Lúzon; Dàmaris Berbel Palau; Claire Duployez; Maxime Paluch; Solomon Asante-Sefa; Mie Møller; Margaret Ip; Ivana Marecović; Agnes Pál-Sonnevend; Clementiza Elvezia Cocuzza; Asta Dambrauskiene; Alexandre Macanze; Anelsio Cossa; Inácio Mandomando; Philip Nwajiobi-Princewill; Iruka N. Okeke; Aderemi O. Kehinde; Ini Adebiyi; Ifeoluwa Akintayo; Oluwafemi Popoola; Anthony Onipede; Anita Blomfeldt; Nora Elisabeth Nyquist; Kiri Bocker; James Ussher; Amjad Ali; Nimat Ullah; Habibullah Khan; Natalie Weiler Gustafson; Ikhlas Jarrar; Arif Al-Hamad; Viravarn Luvira; Wantana Paveenkittiporn; Irmak Baran; James C. L. Mwansa; Linda Sikakwa; Kaunda Yamba; Rene Sjøgren Hendriksen; Frank Møller Aarestrup (2023). TWIW database dump [Dataset]. http://doi.org/10.11583/DTU.21758456.v2
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.11583/DTU.21758456.v2
Dataset updated
Jul 10, 2023
Dataset provided by
Technical University of Denmark
Authors
Sidsel Nag; Gunhild Larsen; Judit Szarvas; Laura Elmlund Kohl Birkedahl; Gábor Máté Gulyás; Wojciech Jakub Ciok; Timmie M. R. Lagermann; Silva Tafaj; Susan Bradbury; Peter Collignon; Denise Daley; Victorien Dougnon; Kafayath Fabiyi; Boubacar Coulibaly; René Dembélé; Georgette Nikiema; Natama Magloire; Isidore Juste Ouindgueta; Zenat Zebin Hossain; Anowara Begum; Deyan Donchev; Mathew Diggle; LeeAnn Turnbull; Simon Lévesque; Livia Berlinger; Kirstine Kobberoe Søgaard; Paula Diaz Guevara; Carolina Duarte Valderrama; Panagiota Maikanti; Jana Amlerova; Pavel Drevinek; Jan Tkadlec; Milica Dilas; Achim J. Kaasch; HenrikTorkil Westh; Mohamed Azzedine Bachtarzi; Wahiba Amhis; Carolina Elizabeth Satán Salazar; José Eduardo Villacis; Mária Angeles Dominguez Lúzon; Dàmaris Berbel Palau; Claire Duployez; Maxime Paluch; Solomon Asante-Sefa; Mie Møller; Margaret Ip; Ivana Marecović; Agnes Pál-Sonnevend; Clementiza Elvezia Cocuzza; Asta Dambrauskiene; Alexandre Macanze; Anelsio Cossa; Inácio Mandomando; Philip Nwajiobi-Princewill; Iruka N. Okeke; Aderemi O. Kehinde; Ini Adebiyi; Ifeoluwa Akintayo; Oluwafemi Popoola; Anthony Onipede; Anita Blomfeldt; Nora Elisabeth Nyquist; Kiri Bocker; James Ussher; Amjad Ali; Nimat Ullah; Habibullah Khan; Natalie Weiler Gustafson; Ikhlas Jarrar; Arif Al-Hamad; Viravarn Luvira; Wantana Paveenkittiporn; Irmak Baran; James C. L. Mwansa; Linda Sikakwa; Kaunda Yamba; Rene Sjøgren Hendriksen; Frank Møller Aarestrup
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Two Weeks in the World is a global research collaboration which seeks to shed light on various aspects of antimicrobial resistance. The research project has resulted in a dataset of 3100 clinically relevant bacterial genomes with pertaining metadata. “Clinically relevant” refers to the fact that the bacteria from which the genomes were obtained were all concluded to be a cause of clinical manifestations of infection. The metadata describes the infection from which the bacteria were obtained, such as geographic origin and approximate collection date. The bacteria were collected from 59 microbiological diagnostic units in 35 countries around the world during 2020. The data from the project consists of tabular data and genomic sequence data. The tabular data is available as a mysql dump (relational database) and as csv files. The tabular data includes the infection metadata, the results from bioinformatic analyses (species prediction, identification of acquired resistance genes and phylogenetic analysis), as well as the pertaining accession numbers of the individual genomic sequence data, which are available through the European Nucleotide Archive (ENA). At the time of submission, the project also has a dedicated web app, from which data can be browsed and downloaded: https://twiw.genomicepidemiology.org/ This complete dataset is created and shared according to the FAIR principles and has large reuse potential within the research fields of antimicrobial resistance, clinical microbiology and global health.

.v2: The author list and readme have been updated, and a file containing column descriptions for the database dump has been added: TWIW_dbcolumns_explained.csv.
GOBASE- The Organelle Genome Database
neuinfo.org
dknet.org
+2more
Updated Jun 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). GOBASE- The Organelle Genome Database [Dataset]. http://identifiers.org/RRID:SCR_007692
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007692
Dataset updated
Jun 27, 2024
Description
A taxonomically broad organelle genome database that organizes and integrates diverse data related to mitochondria and chloroplasts. GOBASE is currently expanding to include information on representative bacteria that are thought to be specifically related to the bacterial ancestors of mitochondria and chloroplasts. It contains single reference whole-genome sequences for each species from which we have complete mitochondrial or chloroplast data. A new release of this database also includes 42,000 new mitochondrial sequences and 39,000 new chloroplast sequences.
Bacteria Dataset
kaggle.com
zip
Updated Mar 27, 2024
Cite
Kanchana1990 (2024). Bacteria Dataset [Dataset]. https://www.kaggle.com/datasets/kanchana1990/bacteria-dataset/code
Explore at:
Available download formats: zip (4158 bytes)
Dataset updated
Mar 27, 2024
Authors
Kanchana1990
License
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Dataset Overview

This dataset provides a comprehensive overview of 200 unique bacterial species, highlighting their scientific classification, natural habitats, and potential impacts on human health. Designed for data scientists and researchers, this collection serves as a foundational resource for studies in microbiology, public health, and environmental science. Each entry has been meticulously compiled to offer insights into the diverse roles bacteria play in ecosystems and their interactions with humans.

Data Science Applications

With 200 carefully curated entries, this dataset is ideal for a variety of data science applications, including but not limited to:
- Predictive modeling to understand factors influencing bacterial habitats and human health implications.
- Clustering analyses to uncover patterns and relationships among bacterial families and their characteristics.
- Data visualization projects to illustrate the diversity of bacterial life and its relevance to ecosystems and health.

Column Descriptors

Name: The scientific name of the bacterial species.

Family: The taxonomic family to which the bacterium belongs.

Where Found: Natural habitats or common environments where the bacterium is typically found, including multiple locations if applicable.

Harmful to Humans: Indicates whether the bacterium is known to have harmful effects on human health ("Yes" or "No").
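
The column descriptors above map directly onto a CSV header. The following Python sketch shows one way to tally harmful species per family; the sample rows are placeholders invented for illustration (not taken from the dataset), and the exact header spelling is assumed to match the descriptor names.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; they are NOT taken from the dataset.
# The header spelling is assumed to match the column descriptors.
SAMPLE = """Name,Family,Where Found,Harmful to Humans
Species A,Family X,"Soil, water",No
Species B,Family X,Human gut,Yes
Species C,Family Y,Soil,Yes
"""


def harmful_by_family(handle) -> Counter:
    """Count species flagged as harmful to humans, grouped by family."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # The flag is documented as "Yes" or "No"; compare case-insensitively.
        if row["Harmful to Humans"].strip().lower() == "yes":
            counts[row["Family"].strip()] += 1
    return counts


counts = harmful_by_family(io.StringIO(SAMPLE))
print(counts.most_common())
```

To run this against the real download, pass an open file handle for the extracted CSV instead of the in-memory sample.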

Ethically Mined Data

The compilation of this dataset adheres to ethical data mining practices, ensuring respect for intellectual property rights and scientific integrity. No proprietary or confidential information has been included without appropriate permissions and acknowledgments.

Sources

The data within this dataset has been gathered and synthesized from a range of authoritative sources, ensuring reliability and accuracy:

Websites:
- CDC (Centers for Disease Control and Prevention): Offers extensive information on pathogenic bacteria and their impact on human health.
- WHO (World Health Organization): Provides global health-related data, including details on bacteria responsible for infectious diseases.

Scientific Journals:
- "Journal of Bacteriology": A peer-reviewed scientific journal that publishes research articles on the biology of bacteria.
- "Microbiology": Offers articles on microbiology, virology, and molecular biology, with a focus on novel bacterial species and their functions.

Textbooks:
- "Brock Biology of Microorganisms" by Michael T. Madigan et al.: A comprehensive textbook covering the principles of microbiology, including detailed information on bacteria.
- "Prescott's Microbiology" by Joanne Willey, Linda Sherwood, and Christopher J. Woolverton: Provides a thorough introduction to the field of microbiology, with an emphasis on bacterial species and their roles.

This dataset represents a synthesis of credible scientific knowledge aimed at fostering research and education in microbiology and related fields.
mOTUs database for MetaMeta pipeline - Archaea and Bacteria - version 1
data.niaid.nih.gov
Updated Jan 24, 2020
Cite
Piro, Vitor C. (2020). mOTUs database for MetaMeta pipeline - Archaea and Bacteria - version 1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_819364
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Robert Koch-Institut - MF1 Bioinformatics
Authors
Piro, Vitor C.
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
mOTUs database for MetaMeta pipeline version 1. The database was downloaded from http://www.bork.embl.de/software/mOTU/share/mOTUs.Linux64bits.tar.gz and it is based on marker genes from 1,753 bacterial reference genomes + marker genes from 263 metagenomes and 3,496 bacterial genomes dating from February 2012.
DOLOP: A Database of Bacterial Lipoproteins
neuinfo.org
dknet.org
Updated Jan 29, 2022
Cite
(2022). DOLOP: A Database of Bacterial Lipoproteins [Dataset]. http://identifiers.org/RRID:SCR_013487
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_013487
Dataset updated
Jan 29, 2022
Description
DOLOP is an exclusive knowledge base for bacterial lipoproteins, built by processing information from 510 entries to provide a list of 199 distinct lipoproteins with relevant links to molecular details. Features include functional classification, a predictive algorithm for query sequences, primary sequence analysis, and lists of predicted lipoproteins from 43 completed bacterial genomes, along with an interactive information exchange facility. This website will also have additional information on the biosynthetic pathway, supplementary material and other related figures. DOLOP also contains information and links to molecular details for about 278 distinct lipoproteins and predicted lipoproteins from 234 completely sequenced bacterial genomes. Additionally, the website features a tool that applies a predictive algorithm to identify the presence or absence of the lipoprotein signal sequence in a user-given sequence. The experimentally verified lipoproteins have been classified into different functional classes and, more importantly, functional domain assignments using hidden Markov models from the SUPERFAMILY database have been provided for the predicted lipoproteins. Other features include primary sequence analysis, signal sequence analysis, a search facility, and an information exchange facility that allows researchers to exchange results on newly characterized lipoproteins.
Bakta database
zenodo.org
application/gzip +1
Updated Feb 23, 2023
Cite
Oliver Schwengers; Oliver Schwengers (2023). Bakta database [Dataset]. http://doi.org/10.5281/zenodo.7025248
Explore at:
Available download formats: application/gzip, json
Unique identifier
https://doi.org/10.5281/zenodo.7025248
Dataset updated
Feb 23, 2023
Dataset provided by
Zenodo http://zenodo.org/
Authors
Oliver Schwengers; Oliver Schwengers
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data repository contains the mandatory DB for Bakta (db.tar.gz).

Bakta is a tool for the rapid & standardized local annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis: https://github.com/oschwengers/bakta

This db provides protein sequence hash digests and lengths of UniProt's UniRef100 clusters, UniParc and NCBI RefSeq sequences for ultra-fast identification & lookups. It has been pre-annotated with several specialized db and enriched with Dbxrefs. Furthermore, seed sequences of UniProt's UniRef90 clusters are stored for fallback homology searches via Diamond sequence alignments. All conducted pre-annotations are logged and provided in the db.log.gz file.
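
The idea of identifying sequences by hash digest and length, as described above, can be illustrated with a small Python sketch. This is not Bakta's code: the hash function (SHA-256 here), the normalization step, the function names, and the example records are all assumptions made for illustration.

```python
import hashlib


def sequence_key(seq: str) -> tuple[str, int]:
    """Return a (digest, length) key for exact-match lookups.

    SHA-256 is a stand-in; the description does not say which hash is used.
    """
    normalized = seq.strip().upper()
    digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
    return digest, len(normalized)


def build_index(records: dict[str, str]) -> dict[tuple[str, int], str]:
    """Map each sequence's (digest, length) key to its identifier."""
    return {sequence_key(seq): identifier for identifier, seq in records.items()}


def lookup(index: dict[tuple[str, int], str], seq: str):
    """Return the identifier of an identical sequence, or None if absent."""
    return index.get(sequence_key(seq))


# Hypothetical identifiers and toy peptide strings, for illustration only.
index = build_index({"cluster_1": "MKTAYIAK", "cluster_2": "MSLNE"})
print(lookup(index, "mktayiak"))
```

A dictionary lookup on the digest avoids any alignment for sequences that are already known; only sequences that miss the index would need a slower homology search.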

External DB versions:

NCBI AMRFinderPlus: 2022-08-09.1

COG: 2020

DoriC: 10

ISFinder: 2019-09-25

Mob-suite: 2.0

Pfam: 35

RefSeq: r213

Rfam: 14.8

UniProtKB/Swiss-Prot: 2022_03

VFDB: 2022-08-19
NCBI Refseq database as of May 2023 part 2
data.niaid.nih.gov
Updated Feb 27, 2024
+ more versions
Cite
Robert, Nichols (2024). NCBI Refseq database as of May 2023 part 2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10452278
Explore at:
Dataset updated
Feb 27, 2024
Dataset provided by
Pennsylvania State University
Authors
Robert, Nichols
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the second part of the NCBI RefSeq bacterial database originally downloaded in May 2023. It was used to create the bacterial 16S and gyrB databases used in gyrB primer development. The first part can be found at 10.5281/zenodo.10452184.

To recombine the database parts, use the command

"cat Bacteria.refseq.tar.gz.part* > Bacteria.refseq.tar.gz"

The total file size of the downloaded RefSeq database is 88 GB.

The gyrB and 16S databases can be found at 10.5281/zenodo.10451935.
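
Where a POSIX shell is not available, the same recombination can be sketched in Python. This is a minimal equivalent of the cat command in the description, assuming the part files sort correctly by name; the function name `recombine_parts` is illustrative and not part of the dataset.

```python
import glob
import shutil
from pathlib import Path


def recombine_parts(prefix: str, output: str) -> list[str]:
    """Concatenate every file matching `prefix*` into `output`, in sorted order."""
    out_path = Path(output).resolve()
    # Sorting by name approximates how a shell expands the `part*` glob.
    parts = sorted(
        p for p in glob.glob(glob.escape(prefix) + "*")
        if Path(p).resolve() != out_path  # never read the output file itself
    )
    if not parts:
        raise FileNotFoundError(f"no files match {prefix}*")
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so the 88 GB archive never has to fit in memory.
                shutil.copyfileobj(src, out)
    return parts


# Usage, with the file names from the description:
# recombine_parts("Bacteria.refseq.tar.gz.part", "Bacteria.refseq.tar.gz")
```

The function returns the list of parts in the order they were written, which makes it easy to confirm that parts from both Zenodo records were present before extraction.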
MiST - Microbial Signal Transduction database
rrid.site
dknet.org
+2more
Updated Feb 3, 2026
Cite
(2026). MiST - Microbial Signal Transduction database [Dataset]. http://identifiers.org/RRID:SCR_003166/resolver?q=&i=rrid
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_003166 https://identifiers.org/RRID:SCR_003166/resolver?q=&i=rrid
Dataset updated
Feb 3, 2026
Description
Database which contains the signal transduction proteins for complete and draft bacterial and archaeal genomes. The MiST2 database identifies and catalogs the repertoire of signal transduction proteins in microbial genomes.
National Microbial Germplasm Program
agdatacommons.nal.usda.gov
bin
Updated Nov 21, 2025
+ more versions
Cite
USDA ARS National Germplasm Resources Laboratory (2025). National Microbial Germplasm Program [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/National_Microbial_Germplasm_Program/24661746
Explore at:
Available download formats: bin
Dataset updated
Nov 21, 2025
Dataset provided by
National Germplasm Resources Laboratory
Authors
USDA ARS National Germplasm Resources Laboratory
License
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Description
The goal of the National Microbial Germplasm Program is to ensure that the genetic diversity of agriculturally important microorganisms is maintained to enhance and increase agricultural efficiency and profitability. The program collects, authenticates, and characterizes potentially useful microbial germplasm; preserves microbial genetic diversity; and facilitates distribution and utilization of microbial germplasm for research and industry.

The Agricultural Research Service maintains several microbial germplasm collections including:
- USDA ARS Culture Collection
- USDA ARS Collection of Entomopathogenic Fungal Cultures (ARSEF)
- Query or Download the Rhizobium Database
- US National Fungus Collections

Resources in this dataset:
Resource Title: National Microbial Germplasm Program. File Name: Web Page, url: https://www.ars-grin.gov/Collections#microbial-germplasm Main web site for the National Microbial Germplasm Program with links to component databases/collections.
Proteins and sequences from the MicrobesOnline database
figshare.com
application/gzip
Updated Feb 12, 2024
Cite
Morgan Price; Keith Keller (2024). Proteins and sequences from the MicrobesOnline database [Dataset]. http://doi.org/10.6084/m9.figshare.25207142.v1
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.25207142.v1
Dataset updated
Feb 12, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Morgan Price; Keith Keller
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This figshare includes scaffold sequences, genes, and protein sequences from MicrobesOnline, a database of (mostly) bacterial and archaeal genomes. The data is provided as a sqlite3 database (gzipped). The tables included are:
- Taxonomy -- 1 row per genome
- Scaffold -- scaffolds for each genome
- ScaffoldSeq -- sequences for each scaffold. (Sequences for Arabidopsis thaliana chromosomes were omitted due to their length.)
- Locus -- genes for each scaffold
- Position -- positions of genes on scaffolds
- AASeq -- protein sequences
- Synonym -- gene names
- Description -- gene descriptions
- TaxParentChild -- parent-child relationships between taxonomic groups

These are a subset of all the tables in MicrobesOnline. More detailed documentation of MicrobesOnline's schema is available at http://www.microbesonline.org/programmers.html
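
Since the description names the tables but not their columns, a reasonable first step after decompressing the file is to ask SQLite itself for the schema. The Python sketch below does that with the standard `sqlite3` module; the function name and the database file name in the usage comment are hypothetical, and no column names are assumed.

```python
import sqlite3


def summarize_tables(conn: sqlite3.Connection) -> dict[str, dict]:
    """Return column names and row counts for every table in the database."""
    summary = {}
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for name in names:
        # Quote the identifier so unusual table names cannot break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        columns = [col[1] for col in conn.execute(f"PRAGMA table_info({quoted})")]
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
        summary[name] = {"columns": columns, "rows": count}
    return summary


# Usage (hypothetical file name; decompress the .gz download first):
# conn = sqlite3.connect("microbesonline.db")
# for table, info in summarize_tables(conn).items():
#     print(table, info["rows"], info["columns"])
```

Counting rows in the sequence tables may take a while on a database of this size; dropping the `COUNT(*)` query leaves a schema-only listing.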
In-house database specific to bacteria from horses for MALDI Biotyper
data.mendeley.com
narcis.nl
Updated May 12, 2021
Cite
Eri Uchida-Fujii (2021). In-house database specific to bacteria from horses for MALDI Biotyper [Dataset]. http://doi.org/10.17632/m342p574wj.1
Explore at:
Unique identifier
https://doi.org/10.17632/m342p574wj.1
Dataset updated
May 12, 2021
Authors
Eri Uchida-Fujii
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains the main spectral profiles (MSPs) for the MALDI Biotyper CA 3.2 System (Bruker Japan, Kanagawa, Japan) that constitute an in-house database specific to bacteria from horses, for identification with matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS). MALDI-TOF MS is used for identification of bacterial species isolated from horses. However, some bacterial species isolated from horses are difficult to identify with MALDI-TOF MS because of insufficiencies in the reference database, and enriching the databases is expected to enhance the accuracy of MALDI-TOF MS identification. Here we created an in-house database including 271 bacterial isolates from horses. Bacterial isolates were subjected to ethanol / formic acid treatment, and spectra were acquired using flexControl 3.4 software (Bruker Japan). The spectra were checked with flexAnalysis 3.4 software (Bruker Japan) to delete spectra differing from the cohort spectra and imported for generating MSPs using MBT Compass Explorer 4.1.7.0 software (Bruker Japan). MSPs were exported in a *.btmsp format for use with MALDI Biotyper systems.

Cite

National Institutes of Health (2025). A tandem repeats database for bacterial genomes: application to the genotyping of [Dataset]. https://catalog.data.gov/dataset/a-tandem-repeats-database-for-bacterial-genomes-application-to-the-genotyping-of

A tandem repeats database for bacterial genomes: application to the genotyping of

Explore at:

250 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Sep 7, 2025

Dataset provided by

National Institutes of Health

Description

Background: Some pathogenic bacteria are genetically very homogeneous, making strain discrimination difficult. In the last few years, tandem repeats have been increasingly recognized as markers of choice for genotyping a number of pathogens. The rapid evolution of these structures appears to contribute to the phenotypic flexibility of pathogens. The availability of whole-genome sequences has opened the way to the systematic evaluation of tandem repeats diversity and application to epidemiological studies.

Results: This report presents a database () of tandem repeats from publicly available bacterial genomes which facilitates the identification and selection of tandem repeats. We illustrate the use of this database by the characterization of minisatellites from two important human pathogens, Yersinia pestis and Bacillus anthracis. In order to avoid simple sequence contingency loci which may be of limited value as epidemiological markers, and to provide genotyping tools amenable to ordinary agarose gel electrophoresis, only tandem repeats with repeat units at least 9 bp long were evaluated. Yersinia pestis contains 64 such minisatellites in which the unit is repeated at least 7 times. An additional collection of 12 loci with at least 6 units, and a high internal conservation were also evaluated. Forty-nine are polymorphic among five Yersinia strains (twenty-five among three Y. pestis strains). Bacillus anthracis contains 30 comparable structures in which the unit is repeated at least 10 times. Half of these tandem repeats show polymorphism among the strains tested.

Conclusions: Analysis of the currently available bacterial genome sequences classifies Bacillus anthracis and Yersinia pestis as having an average (approximately 30 per Mb) density of tandem repeat arrays longer than 100 bp when compared to the other bacterial genomes analysed to date. In both cases, testing a fraction of these sequences for polymorphism was sufficient to quickly develop a set of more than fifteen informative markers, some of which show a very high degree of polymorphism. In one instance, the polymorphism information content index reaches 0.82 with allele length covering a wide size range (600-1950 bp), and nine alleles resolved in the small number of independent Bacillus anthracis strains typed here.
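
The polymorphism information content index mentioned above summarizes how evenly alleles are spread across the typed strains. The description does not state which formula was used; as a rough sketch, the Python below assumes the simple diversity form 1 - sum(p_i^2) over allele frequencies p_i (other definitions of PIC subtract a further term). The function name and the allele sizes in the usage comment are made up for illustration and are not data from the study.

```python
from collections import Counter


def diversity_index(alleles) -> float:
    """Return 1 - sum(p_i^2), where p_i is the frequency of allele i.

    This is an ASSUMED form of the index, not one confirmed by the description.
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one allele observation is required")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Usage with hypothetical allele sizes (bp), one observation per typed strain:
# diversity_index([600, 600, 750, 900, 900, 1950])
```

Under this form the index is 0 when every strain carries the same allele and approaches 1 as more alleles occur at similar frequencies.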
