Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Overview
MicrobiomeHD is a standardized database of human gut microbiome studies in health and disease. This database includes publicly available 16S data from published case-control studies and their associated patient metadata. Raw sequencing data for each study was downloaded and processed through a standardized pipeline.
To be included in MicrobiomeHD, datasets have:
Currently, MicrobiomeHD is focused on stool samples. Additional samples may be included in certain datasets, as indicated in the metadata.
Files
Additional information about the datasets included in this MicrobiomeHD release is in the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, in the file db/dataset_info.yaml. Top-level identifiers correspond to the dataset IDs used in Duvallet et al. 2017. Sample sizes in the YAML file are those described in the papers, and may not exactly reflect the actual data (due to missing/extra data, samples which didn't pass quality control, etc.).
Each dataset was downloaded and processed through a standardized pipeline. The raw processing results are available in the *.tar.gz files here. Each file has the same directory structure and files, as described in the pipeline documentation: http://amplicon-sequencing-pipeline.readthedocs.io/en/latest/output.html.
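As a minimal sketch, assuming a results archive has already been downloaded locally, Python's standard tarfile module can list an archive's contents before anything is extracted. The helper names and the member names in the demo archive are hypothetical and are not taken from MicrobiomeHD; the demo builds a small throwaway archive so that it runs without any download.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the sorted names of regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def build_demo_archive(directory):
    """Create a tiny stand-in archive; the member names are invented for this demo."""
    archive_path = Path(directory) / "demo_results.tar.gz"
    demo_files = [
        ("demo_results/otu_table.txt", "otu\ts1\n"),
        ("demo_results/metadata.txt", "sample\tstate\n"),
    ]
    with tarfile.open(archive_path, mode="w:gz") as archive:
        for name, text in demo_files:
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    demo_members = list_archive_members(build_demo_archive(tmp))

print(demo_members)
```

For a real archive, pass the path of the downloaded *.tar.gz file to list_archive_members and compare the listing against the pipeline documentation.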
Specific files of interest include:
The raw data was acquired as described in the supplementary materials of Duvallet et al.'s "Meta analysis of microbiome studies identifies shared and disease-specific patterns".
Raw sequencing data was processed with the Alm lab's in-house 16S processing pipeline: https://github.com/thomasgurry/amplicon_sequencing_pipeline
Pipeline documentation is available at: http://amplicon-sequencing-pipeline.readthedocs.io/
Metadata was extracted from the original papers and/or data sources, and formatted manually.
Contributing
MicrobiomeHD is a resource that can be used to extract disease-specific microbiome signals in individual case-control studies. Many microbes respond non-specifically to health and disease, and the majority of bacterial associations within individual studies overlap with this "core" response. Researchers should cross-check their results with the data presented here to ensure that their identified microbial associations are specific to their disease under study.
We provide an updated list of "core" microbes here, as well as the raw OTU tables for anyone who wishes to reproduce and adapt this analysis to their study question.
If you would like to include your case-control dataset in MicrobiomeHD, please email duvallet[at]mit.edu.
For us to process your data through our standard pipeline, you will need to provide the following files and information about your data:
By using MicrobiomeHD in your own analyses, you agree to contribute your dataset to this database and to make your raw sequencing data (i.e. fastq files) publicly available.
Citing MicrobiomeHD
The MicrobiomeHD database and original publications for each of these datasets are described in Duvallet et al. (2017): http://biorxiv.org/content/early/2017/05/08/134031
If you use any of these datasets in your analysis, please cite both MicrobiomeHD (Duvallet et al. (2017)) and the original publication for each dataset that you use.
The code used to process and analyze this data in Duvallet et al. (2017) is available on github: https://github.com/cduvallet/microbiomeHD
Files
Core genera
file-S3.core_genera.txt: Supplemental Table 3 from Duvallet et al. (2017), listing the core health- and disease-associated microbes.
Datasets
Note that MicrobiomeHD contains all 28 datasets from Duvallet et al. (2017), as well as additional datasets which did not meet the inclusion criteria for the meta-analysis presented in the paper. Additional information about the datasets included in this MicrobiomeHD release is in the original publications and the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, in the file db/dataset_info.yaml.
The sample sizes listed here reflect what was reported in the original publications. Some may have discrepancies between what is reported and what is in the actual data due to missing data, quality issues, barcode mismatches, etc.
</li>
<li><strong>autism_kb_results.tar.gz</strong> (<em>asd_kang</em>): H: 20, ASD: 20
<ul>
<li>http://dx.doi.org/10.1371/journal.pone.0068322</li>
</ul>
</li>
<li><strong>cdi_schubert_results.tar.gz</strong> (<em>noncdi_schubert</em>): H: 155, nonCDI: 89, CDI: 94
<ul>
<li>http://dx.doi.org/10.1128/mBio.01021-14</li>
</ul>
</li>
<li><strong>cdi_vincent_v3v5_results.tar.gz</strong> (<em>cdi_vincent</em>): H: 25, CDI: 25
<ul>
<li>http://dx.doi.org/10.1186/2049-2618-1-18</li>
</ul>
</li>
<li><strong>cdi_youngster_results.tar.gz</strong> (<em>cdi_youngster</em>): H: 4, CDI: 19
<ul>
<li>http://dx.doi.org/10.1093/cid/ciu135</li>
</ul>
</li>
<li><strong>crc_baxter_results.tar.gz</strong> (<em>crc_baxter</em>): adenoma: 198, H: 172, CRC: 120
<ul>
<li>http://dx.doi.org/10.1186/s13073-016-0290-3</li>
</ul>
</li>
<li><strong>crc_xiang_results.tar.gz</strong> (<em>crc_chen</em>): H: 22, CRC: 21
<ul>
<li>http://dx.doi.org/10.1371/journal.pone.0039743</li>
</ul>
</li>
<li><strong>crc_zackular_results.tar.gz</strong> (<em>crc_zackular</em>): adenoma: 30, H: 30, CRC: 30
<ul>
<li>http://dx.doi.org/10.1158/1940-6207.CAPR-14-0129</li>
</ul>
</li>
<li><strong>crc_zeller_results.tar.gz</strong> (<em>crc_zeller</em>): H: 75, CRC: 41
<ul>
<li>http://dx.doi.org/10.15252/msb.20145645</li>
</ul>
</li>
<li><strong>crc_zhao_results.tar.gz</strong> (<em>crc_wang</em>): H: 56, CRC: 46
<ul>
<li>http://dx.doi.org/10.1038/ismej.2011.109</li>
</ul>
</li>
<li><strong>edd_singh_results.tar.gz</strong> (<em>edd_singh</em>): STEC: 28, CAMP: 71, SALM: 66, SHIG: 34, H: 75
<ul>
<li>http://dx.doi.org/10.1186/s40168-015-0109-2</li>
</ul>
</li>
<li><strong>hiv_dinh_results.tar.gz</strong> (<em>hiv_dinh</em>): H: 16, HIV: 21
<ul>
<li>http://dx.doi.org/10.1093/infdis/jiu409</li>
</ul>
</li>
<li><strong>hiv_lozupone_results.tar.gz</strong> (<em>hiv_lozupone</em>): H: 13, HIV: 25
<ul>
<li>http://dx.doi.org/10.1016/j.chom.2013.08.006</li>
</ul>
</li>
<li><strong>hiv_noguerajulian_results.tar.gz</strong> (<em>hiv_noguerajulian</em>): H: 34, HIV: 206
<ul>
<li>https://doi.org/10.1016/j.ebiom.2016.01.032</li>
</ul>
</li>
<li><strong>ibd_alm_results.tar.gz</strong> (<em>ibd_papa</em>): IBDundef: 1, nonIBD: 24, UC: 43, CD: 23
<ul>
<li>http://dx.doi.org/10.1371/journal.pone.0039242</li>
</ul>
</li>
<li><strong>ibd_engstrand_maxee_results.tar.gz</strong> (<em>ibd_willing</em>): CCD: 12, H: 35, ICD: 15, UC: 16, ICCD: 2
<ul>
<li>http://dx.doi.org/10.1053/j.gastro.2010.08.049</li>
</ul>
</li>
<li><strong>ibd_gevers_2014_results.tar.gz</strong> (<em>ibd_gevers</em>): H: 31, CD: 224
<ul>
<li>http://dx.doi.org/10.1016/j.chom.2014.02.005</li>
</ul>
</li>
<li><strong>ibd_huttenhower_results.tar.gz</strong> (<em>ibd_morgan</em>): H: 18, UC: 48, CD: 62
<ul>
<li>http://dx.doi.org/10.1186/gb-2012-13-9-r79</li>
</ul>
</li>
<li><strong>mhe_zhang_results.tar.gz</strong> (<em>liv_zhang</em>): CIRR: 25, H: 26, MHE: 26
<ul>
<li>http://dx.doi.org/10.1038/ajg.2013.221</li>
</ul>
</li>
<li><strong>nash_chan_results.tar.gz</strong> (<em>nash_wong</em>): H: 22, NASH: 16
<ul>
<li>http://dx.doi.org/10.1371/journal.pone.0062885</li>
</ul>
</li>
<li><strong>nash_ob_baker_results.tar.gz</strong> (<em>nash_zhu</em>): H: 16, NASH: 22, OB: 25
<ul>
<li>http://dx.doi.org/10.1002/hep.26093</li>
</ul>
</li>
<li><strong>ob_goodrich_results.tar.gz</strong> (<em>ob_goodrich</em>): OW: 322, H: 433, OB: 183
<ul>
<li>http://dx.doi.org/10.1016/j.cell.2014.09.053</li>
</ul>
</li>
<li><strong>ob_gordon_2008_v2_results.tar.gz</strong> (<em>ob_turnbaugh</em>): H: 61, OB:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Medicinal plants host abundant endophytic bacteria, fungi and actinomycetes. These microorganisms are closely tied to the growth, reproduction and metabolic activities of their host plants: they can affect the formation and content of the plants' medicinal components as well as the authenticity of Chinese medicinal materials. Angelicae Sinensis Radix, Astragali Radix, Codonopsis Radix, Glycyrrhizae Radix et Rhizoma and Rhei Radix et Rhizoma are traditional Chinese medicinal materials and important sources of clinical medicine. In recent years, a growing body of research has examined the microbiomes of these medicinal plants. To integrate the data and results of these studies and to promote comparative work, a literature review and information extraction analysis were carried out to construct a knowledge base of the medicinal plant microbiome that supports research on medicinal plant quality and authenticity. The database describes medicinal plant microorganisms by name, host plant, plant source in literature, classification, genus, family, order, class, phylum, function/biological role, technique, sequence length, NCBI reference serial number/GenBank, references and corresponding links. The interface supports queries on the microbiome content of the five medicinal plants above. The database is intended to provide a research basis for the development and utilization of the microbiome of medicinal plants, and a reference for new methods of quality control and authenticity evaluation of medicinal plants.
In Version 2, 11 additional pieces of information were incorporated for Codonopsis Radix.
In Version 3, the number of endophytes in the database was updated to 350.
In Version 4, the category 'plant source in literature' was added to distinguish the origin of host plants. The names of host plants were also unified as Latin names, and one duplicate record was removed.
In Version 5, we added a processing file for the data in MPMD, in which we counted the frequency of each endophyte and analyzed parameters such as the proportion of high-frequency endophytes occurring in the five traditional medicinal plants.
In Version 6, we corrected errors in the descriptions of the previous versions and provided a new version of MPMD.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the working version of the Fly Microbiome Diet Database, a compilation of published Drosophila diets used by laboratories in the field of fly microbiome research. In addition to listing dietary components as they are described in published research methods, we calculate the macronutrient content and protein-to-carbohydrate ratio of each diet so that dietary nutrient content can be compared across studies. The source files for each diet component's nutrition facts are accessible at https://doi.org/10.6084/m9.figshare.11920743.v2. The database is subject to change as new studies are added or nutritional information is updated to be more accurate. In the database, N.S. for Yeast or Cornmeal type means "not specified"; N/A means that the ingredient was not used.
06/22/2021: V5 published, includes fructose nutritional content and 7 additional diets.
12/07/2021: V6 published, includes additional diets from recently published microbiome work and self-submitted non-microbiome fly work (submitted via DDCC).
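The macronutrient and protein-to-carbohydrate calculation described for this database can be sketched roughly as follows. This is an illustration under stated assumptions, not the database authors' actual procedure: the nutrition values, the recipe, and the function names are all hypothetical, and the sketch assumes nutrition facts are expressed per 100 g of ingredient and recipes in grams per liter of food.

```python
# Hypothetical nutrition facts (grams per 100 g of ingredient); not database values.
NUTRITION_PER_100G = {
    "yeast": {"protein": 40.0, "carbohydrate": 40.0},
    "cornmeal": {"protein": 8.0, "carbohydrate": 76.0},
    "sucrose": {"protein": 0.0, "carbohydrate": 100.0},
}

# Hypothetical recipe: grams of each ingredient per liter of food.
EXAMPLE_RECIPE = {"yeast": 50.0, "cornmeal": 50.0, "sucrose": 40.0}


def macronutrient_totals(recipe_g_per_l):
    """Sum protein and carbohydrate (g/L) contributed by every ingredient."""
    totals = {"protein": 0.0, "carbohydrate": 0.0}
    for ingredient, grams in recipe_g_per_l.items():
        facts = NUTRITION_PER_100G[ingredient]
        for nutrient in totals:
            totals[nutrient] += grams * facts[nutrient] / 100.0
    return totals


def protein_to_carb_ratio(totals):
    """Return protein divided by carbohydrate; carbohydrate must be positive."""
    if totals["carbohydrate"] <= 0:
        raise ValueError("carbohydrate content must be positive")
    return totals["protein"] / totals["carbohydrate"]


totals = macronutrient_totals(EXAMPLE_RECIPE)
ratio = protein_to_carb_ratio(totals)
print(totals, round(ratio, 3))
```

Expressing every diet on the same per-liter basis is what makes ratios comparable across studies; an ingredient whose type is "not specified" would need an explicit assumption before it can be included.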
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on April 14, 2022. Database of comprehensive information on the approximately 600 prokaryote species that are present in the human oral cavity. The majority of these species are uncultivated and unnamed, recognized primarily by their 16S rRNA sequences. The HOMD presents a provisional naming scheme for the currently unnamed species so that strain, clone, and probe data from any laboratory can be directly linked to a stably named reference entity. The HOMD links sequence data with phenotypic, phylogenetic, clinical, and bibliographic information. Full and partial oral bacterial genome sequences determined as part of this project and the Human Microbiome Project are being added to the HOMD as they become available. HOMD offers easy-to-use tools for viewing all publicly available oral bacterial genomes. Data is also downloadable.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pbac (Panda Gut Microbiome Database): A Curated Resource for Giant Panda Gut Microbial Genomes. In this project, we conducted an in-depth analysis of giant panda metagenome-assembled genomes (MAGs), utilizing both Illumina and Nanopore sequencing technologies. Our extensive efforts resulted in the identification of 2,684 medium- to high-quality MAGs meeting specific criteria: completeness ≥ 50%, contamination < 10%, and length ≥ 500 kb. Remarkably, 960 MAGs surpassed the stringent high-quality thresholds of completeness ≥ 90% and contamination < 5%. Within this dataset, we identified 1,193 non-redundant MAGs through a 99% similarity threshold clustering, with 354 of them being of high quality. Taxonomic analysis revealed that 672 MAGs could be mapped to 219 known species, while 521 MAGs clustered into 228 unique groups, leading to the assignment of new genus or species identifiers.
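The quality tiers quoted above can be expressed as a small classifier. This is a sketch only: the function name and tier labels are hypothetical, and it assumes the 500 kb length minimum also applies to the high-quality tier, since the high-quality MAGs are counted as a subset of the medium- to high-quality set.

```python
MIN_LENGTH_BP = 500_000  # length >= 500 kb, as stated in the description


def mag_quality_tier(completeness, contamination, length_bp):
    """Classify a MAG from completeness (%), contamination (%) and length (bp)."""
    if length_bp < MIN_LENGTH_BP:
        return "below threshold"
    # High quality: completeness >= 90% and contamination < 5%.
    if completeness >= 90.0 and contamination < 5.0:
        return "high"
    # Medium quality: completeness >= 50% and contamination < 10%.
    if completeness >= 50.0 and contamination < 10.0:
        return "medium"
    return "below threshold"


# Illustrative values only; these are not genomes from the Pbac dataset.
for example in [(95.0, 2.0, 2_000_000), (60.0, 8.0, 1_000_000), (95.0, 2.0, 400_000)]:
    print(example, mag_quality_tier(*example))
```

Checking the high tier first matters because every high-quality MAG also satisfies the medium thresholds.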
The Human Oral Microbiome Database (HOMD) provides a site-specific comprehensive database for the more than 600 prokaryote species that are present in the human oral cavity. It contains genomic information based on a curated 16S rRNA gene-based provisional naming scheme, and taxonomic information. This datatype contains taxonomic information.
An abundance matrix (BM_OTU.xlsx) with OTUs as rows and samples as columns; each entry is the abundance of an OTU as a fraction of all sequences obtained for that sample. This dataset is associated with the following publication: Gomez-Alvarez, V., and R. Revetta. Monitoring of Nitrification in Chloraminated Drinking Water Distribution Systems With Microbiome Bioindicators Using Supervised Machine Learning. Frontiers in Microbiology. Frontiers, Lausanne, SWITZERLAND, 11: 2254-2267, (2020).
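A matrix of this shape can be derived from raw counts by dividing each entry by its sample's total. The sketch below illustrates that normalization with made-up counts and hypothetical OTU names; it does not read BM_OTU.xlsx and is not the authors' code.

```python
def to_relative_abundance(counts):
    """Convert {otu: [count per sample, ...]} into per-sample fractions."""
    n_samples = len(next(iter(counts.values())))
    # Total number of sequences observed in each sample (column sums).
    sample_totals = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    if any(total == 0 for total in sample_totals):
        raise ValueError("every sample needs at least one sequence")
    return {
        otu: [row[i] / sample_totals[i] for i in range(n_samples)]
        for otu, row in counts.items()
    }


# Made-up counts for two OTUs (rows) across two samples (columns).
raw_counts = {"OTU_1": [30, 5], "OTU_2": [10, 15]}
relative = to_relative_abundance(raw_counts)
print(relative)
```

After normalization each sample's column sums to 1, so samples sequenced to different depths can be compared directly.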
https://spdx.org/licenses/etalab-2.0.html
This dataset compiles all the data of the Seed Microbiota Database associated with the publication Simonin et al. 2021 (bioRxiv), "Seed microbiota revealed by a large-scale meta-analysis including 50 plant species". This database includes metabarcoding data from 63 seed microbiota studies on 50 plant species (3190 seed samples in total) based on 5 different molecular markers (16S rRNA gene V4 region, 16S rRNA gene V5-V6 region, gyrB gene, ITS1 region, ITS2 region). All the studies were re-processed from the fastq files (raw data) using DADA2 and Qiime2 and merged into 5 different datasets depending on the molecular marker targeted. The README file presents the structure of the database (Subsets) and the files available. This database can be queried online, without downloading it, on the Askomics instance: https://askomics-192-168-100-151.vm.openstack.genouest.org/ For full access to the results, you can log in to the Askomics instance with the following credentials: Username: consult Password: OcOU83D5
The inflammatory bowel diseases (IBD), which include Crohn’s disease (CD) and ulcerative colitis (UC), are multifactorial, chronic conditions of the gastrointestinal tract. While IBD has been associated with dramatic changes in the gut microbiota, changes in the gut metabolome, the molecular interface between host and microbiota, are less well understood. To address this gap, we performed untargeted LC-MS metabolomic and shotgun metagenomic profiling of cross-sectional stool samples from discovery (n=155) and validation (n=65) cohorts of CD, UC, and non-IBD control subjects. Metabolomic and metagenomic profiles were broadly correlated with fecal calprotectin levels (a measure of gut inflammation). Across >8,000 measured metabolite features, we identified chemicals and chemical classes that were differentially abundant (DA) in IBD, including enrichments for sphingolipids and bile acids, and depletions for triacylglycerols and tetrapyrroles. While >50% of DA metabolite features were uncharacterized, many could be assigned putative roles through metabolomic “guilt-by-association” (covariation with known metabolites). DA species and functions from the metagenomic profiles reflected adaptation to oxidative stress in the IBD gut, and were individually consistent with previous findings. Integrating these data, however, we identified 122 robust associations between DA species and well-characterized DA metabolites, indicating possible mechanistic relationships that are perturbed in IBD. Finally, we found that metabolome- and metagenome-based classifiers of IBD status were highly accurate and, like the vast majority of individual trends, generalized well to the independent validation cohort. Our findings thus provide an improved understanding of perturbations of the microbiome-metabolome interface in IBD, including identification of many potential diagnostic and therapeutic targets.
The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions.
A database which provides ribosome-related data services to the scientific community, including online data analysis, rRNA-derived phylogenetic trees, and aligned and annotated rRNA sequences. It specifically contains information on quality-controlled, aligned and annotated bacterial and archaeal 16S rRNA sequences, fungal 28S rRNA sequences, and a suite of analysis tools for the scientific community. Most of the RDP tools are now available as open source packages for users to incorporate in their local workflow.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Serofluid dish is one of the most popular probiotic fermented foods in northern China and has a long history. It has a unique flavor and is rich in nutrients that are beneficial to human health. Serofluid dish is rich in probiotic resources; Lactobacillus and Acetobacter are its dominant genera. The lactic acid bacteria it contains can promote gastrointestinal peristalsis, digestion and absorption after entering the human digestive tract; they can also reduce cholesterol and enhance the body's immunity. In Lanzhou, it is both a representative local specialty and a symbol of the local food culture, attracting a large number of domestic food lovers. Based on a systematic study of the microbial community structure of Serofluid dish, the isolation and identification results for naturally fermented Serofluid dish samples collected from different geographical locations were summarized and compiled into the database. The main contents of the database include the species and genera of culturable microorganisms isolated from Serofluid dish, their GenBank accessions, the corresponding media and other information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Custom Kraken2/Bracken database built using the representative genomes for 1,021 microbial species from the mouse gut microbiota. Genomes include isolates and MAGs, but all are near-complete (>90% completeness; <5% contamination; maximum genome size ≤ 8 Mb; maximum contig count ≤ 500; N50 ≥ 10 kb; mean contig length ≥ 5 kb). This database achieved a mean read classification rate of 87.7% when benchmarked on 1,785 independent (i.e. non-contributory) mouse gut shotgun metagenome samples. An equivalent human database (UHGG) only attained classification rates of 36.6%.
This database is a publicly available resource to facilitate more efficient/deeper analyses of mouse gut shotgun metagenomes.
Find out more about the Mouse Microbial Genome Collection at our GitHub repository.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was compiled for a Master's thesis project focused on investigating the gut microbiota response in fish exposed to microplastics. It contains cleaned and annotated metadata along with taxonomic abundance information and exposure features, prepared for predictive machine learning modeling.
Context: Microplastics (MPs) are emerging pollutants in aquatic ecosystems. Numerous studies have shown that MPs can impact the gut microbial composition of fish. This dataset integrates data from multiple studies through a meta-analysis approach, standardized using bioinformatics and machine learning pipelines.
Source: Sequences and metadata were extracted from public BioProject entries in the NCBI SRA database.
Data processing: QIIME2, Python (pandas, scikit-learn), Google Colab
Total size: ~648 FASTQ files → summarized into machine learning-ready tabular format
Applications:
Microbiome classification modeling
Environmental ecotoxicology analysis
Meta-analysis benchmarking
Feature importance and interpretability (SHAP, feature selection)
https://spdx.org/licenses/CC0-1.0.html
We use open source human gut microbiome data to learn a microbial “language” model by adapting techniques from Natural Language Processing (NLP). Our microbial “language” model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Inflammatory Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from independent studies, even in the presence of substantial distribution shifts. Through a variety of analyses, we further show that the pre-trained, context-sensitive embedding captures meaningful biological information, including taxonomic relationships, correlations with biological pathways, and relevance to IBD expression, despite the model never being explicitly exposed to such signals.
Methods: No additional raw data was collected for this project. All inputs are available publicly. American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively). We used the curated data produced by Tataru and David, 2020.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tumor gene expression is predictive of patient prognosis in some cancers. However, RNA-seq and whole genome sequencing data contain not only reads from host tumor and normal tissue, but also reads from the tumor microbiome, which can be used to infer the microbial abundances in each tumor. Here, we show that tumor microbial abundances, alone or in combination with tumor gene expression data, can predict cancer prognosis and drug response to some extent – microbial abundances are significantly less predictive of prognosis than gene expression, although remarkably, similarly as predictive of drug response, but in mostly different cancer-drug combinations. Thus, it appears possible to leverage existing sequencing technology, or develop new protocols, to obtain more non-redundant information about prognosis and drug response from RNA-seq and whole genome sequencing experiments than could be obtained from tumor gene expression or genomic data alone.
Database containing detailed information about small molecules produced by the human microbiome. Provides metabolite data including structure, names, descriptions, chemical taxonomy, chemical ontology, physico-chemical data, and spectra, and contains detailed information about the microbes that produce these chemicals, the enzymatic reactions responsible for their production, the bioactivity of the chemicals, and the anatomical location of these chemicals and microbes. Many data fields in the database are hyperlinked to other databases including FooDB, HMDB, KEGG, PubChem, MetaCyc, ChEBI, UniProt, and GenBank. Database is FAIR compliant. The data in MiMeDB are released under the Creative Commons (CC) 4.0 License.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database used in Sci Rep. 2021 Aug 2;11(1):15646. doi: 10.1038/s41598-021-95228-8. Bacteria list processed to contain only bacteria from the oral cavity. The original files (without processing) can be downloaded from the Human Oral Microbiome Database: HOMD (http://www.homd.org/)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Human Microbiome Compendium is an ongoing project to build a large collection of human microbiome sequencing data processed with a uniform pipeline. Currently, the compendium contains 16S rRNA amplicon sequencing data for human gut microbiome samples retrieved from the Sequence Read Archive. Our website at microbiomap.org has more information about the project and links to related resources.
This data is freely available under the Creative Commons Attribution 4.0 International license (CC BY 4.0). If you use it in your work, please cite our preprint:
Abdill, Richard J., Samantha P. Graham, Vincent Rubinetti, Frank W. Albert, Casey S. Greene, Sean Davis, and Ran Blekhman. “Integration of 168,000 Samples Reveals Global Patterns of the Human Gut Microbiome.” bioRxiv, October 11, 2023. https://doi.org/10.1101/2023.10.11.560955.
If you are using this dataset in combination with your own results, it's important to note that the taxonomic classifications may differ between releases, as documented in CHANGELOG.md. The most recent release (1.1.0) includes assignments made using SILVA 138.2 (SSU Ref NR 99) and Greengenes2 (2022.10 backbone).
The FASTA file (BM_OTU.fasta) contains the sequences of the V4 region of the bacterial 16S rRNA gene (≈250 nt) for each Operational Taxonomic Unit (OTU). This dataset is associated with the following publication: Gomez-Alvarez, V., and R. Revetta. Monitoring of Nitrification in Chloraminated Drinking Water Distribution Systems With Microbiome Bioindicators Using Supervised Machine Learning. Frontiers in Microbiology. Frontiers, Lausanne, SWITZERLAND, 11: 2254-2267, (2020).
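A file of per-OTU sequences like this one can be read with a few lines of standard-library Python. The sketch below assumes plain FASTA text in which each header line starts with ">" followed by an identifier; the record names and sequences are invented for the demo, and parse_fasta is a hypothetical helper rather than anything shipped with the dataset.

```python
def parse_fasta(lines):
    """Parse FASTA lines into {record_id: sequence}, joining wrapped sequence lines."""
    records = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token after ">" as the record ID.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {record_id: "".join(parts) for record_id, parts in records.items()}


# Invented demo records; real OTU sequences would be about 250 nt long.
demo_text = ">OTU_1 demo record\nACGT\nACG\n>OTU_2\nTTGCA\n"
sequences = parse_fasta(demo_text.splitlines())
lengths = {otu: len(seq) for otu, seq in sequences.items()}
print(lengths)
```

To read the actual file, open it in text mode and pass the file object to parse_fasta; comparing the resulting lengths against the stated ≈250 nt is a quick sanity check.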
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Overview
MicrobiomeHD is a standardized database of human gut microbiome studies in health and disease. This database includes publicly available 16S data from published case-control studies and their associated patient metadata. Raw sequencing data for each study was downloaded and processed through a standardized pipeline.
To be included in MicrobiomeHD, datasets have:
Currently, MicrobiomeHD is focused on stool samples. Additional samples may be included in certain datasets, as indicated in the metadata.
Files
Additional information about the datasets included in this MicrobiomeHD release is in the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, in the file db/dataset_info.yaml. Top-level identifiers correspond to the dataset IDs used in Duvallet et al. 2017. Sample sizes in the YAML file are those described in the papers, and may not exactly reflect the actual data (due to missing/extra data, samples which didn't pass quality control, etc.).
Each dataset was downloaded and processed through a standardized pipeline. The raw processing results are available in the *.tar.gz files here. Each file has the same directory structure and files, as described in the pipeline documentation: http://amplicon-sequencing-pipeline.readthedocs.io/en/latest/output.html.
Specific files of interest include:
The raw data was acquired as described in the supplementary materials of Duvallet et al.'s "Meta analysis of microbiome studies identifies shared and disease-specific patterns".
Raw sequencing data was processed with the Alm lab's in-house 16S processing pipeline: https://github.com/thomasgurry/amplicon_sequencing_pipeline
Pipeline documentation is available at: http://amplicon-sequencing-pipeline.readthedocs.io/
Metadata was extracted from the original papers and/or data sources, and formatted manually.
Contributing
MicrobiomeHD is a resource that can be used to extract disease-specific microbiome signals in individual case-control studies. Many microbes respond non-specifically to health and disease, and the majority of bacterial associations within individual studies overlap with this "core" response. Researchers should cross-check their results with the data presented here to ensure that their identified microbial associations are specific to their disease under study.
We provide an updated list of "core" microbes here, as well as the raw OTU tables for anyone who wishes to reproduce and adapt this analysis to their study question.
If you would like to include your case-control dataset in MicrobiomeHD, please email duvallet[at]mit.edu.
For us to process your data through our standard pipeline, you will need to provide the following files and information about your data:
By using MicrobiomeHD in your own analyses, you agree to contribute your dataset to this database and to make your raw sequencing data (i.e. fastq files) publicly available.
Citing MicrobiomeHD
The MicrobiomeHD database and original publications for each of these datasets are described in Duvallet et al. (2017): http://biorxiv.org/content/early/2017/05/08/134031
If you use any of these datasets in your analysis, please cite both MicrobiomeHD (Duvallet et al. (2017)) and the original publication for each dataset that you use.
The code used to process and analyze this data in Duvallet et al. (2017) is available on github: https://github.com/cduvallet/microbiomeHD
Files
Core genera
file-S3.core_genera.txt: Supplemental Table 3 from Duvallet et al. (2017), listing the core health- and disease-associated microbes.
Datasets
Note that MicrobiomeHD contains all 28 datasets from Duvallet et al. (2017), as well as additional datasets which did not meet the inclusion criteria for the meta-analysis presented in the paper. Additional information about the datasets included in this MicrobiomeHD release is in the original publications and the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, in the file db/dataset_info.yaml.
The sample sizes listed here are those reported in the original publications. They may differ from what is in the actual data because of missing data, quality issues, barcode mismatches, etc.
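Before unpacking a full archive, it can help to check what it contains. The following is a minimal Python sketch, not part of the MicrobiomeHD code, that lists the files inside one downloaded archive using only the standard library. The function name `list_archive_members`, its `suffixes` filter, and the local path are illustrative assumptions; the actual directory layout inside each archive is described in the pipeline documentation.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path, suffixes=None):
    """Return the sorted file paths inside a .tar.gz archive without extracting it.

    suffixes is an optional, hypothetical filter: a tuple of filename
    endings to keep. No particular ending is assumed to exist in the
    MicrobiomeHD archives.
    """
    names = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories, links, and other non-file entries.
            if not member.isfile():
                continue
            if suffixes is None or member.name.endswith(tuple(suffixes)):
                names.append(member.name)
    return sorted(names)


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was downloaded.
    archive = Path("crc_baxter_results.tar.gz")
    if archive.exists():
        for name in list_archive_members(archive):
            print(name)
```

Listing members first avoids extracting every archive just to find the OTU table or metadata file for one dataset.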
</li>
<li><strong>autism_kb_results.tar.gz</strong> (<em>asd_kang</em>): H: 20, ASD: 20
<ul>
<li>http://dx.doi.org/10.1371/journal.pone.0068322</li>
</ul>
</li>
<li><strong>cdi_schubert_results.tar.gz</strong> (<em>noncdi_schubert</em>): H: 155, nonCDI: 89, CDI: 94
<ul>
<li>http://dx.doi.org/10.1128/mBio.01021-14</li>
</ul>
</li>
<li><strong>cdi_vincent_v3v5_results.tar.gz</strong> (<em>cdi_vincent</em>): H: 25, CDI: 25
<ul>
<li>http://dx.doi.org/10.1186/2049-2618-1-18</li>
</ul>
</li>
<li><strong>cdi_youngster_results.tar.gz</strong> (<em>cdi_youngster</em>): H: 4, CDI: 19
<ul>
<li>http://dx.doi.org/10.1093/cid/ciu135</li>
</ul>
</li>
<li><strong>crc_baxter_results.tar.gz</strong> (<em>crc_baxter</em>): adenoma: 198, H: 172, CRC: 120
<ul>
<li>http://dx.doi.org/10.1186/s13073-016-0290-3</li>
</ul>
</li>
<li><strong>crc_xiang_results.tar.gz</strong> (<em>crc_chen</em>): H: 22, CRC: 21
<ul>
<li>http://dx.doi.org/10.1371/journal.pone.0039743</li>
</ul>
</li>
<li><strong>crc_zackular_results.tar.gz</strong> (<em>crc_zackular</em>): adenoma: 30, H: 30, CRC: 30
<ul>
<li>http://dx.doi.org/10.1158/1940-6207.CAPR-14-0129</li>
</ul>
</li>
<li><strong>crc_zeller_results.tar.gz</strong> (<em>crc_zeller</em>): H: 75, CRC: 41
<ul>
<li>http://dx.doi.org/10.15252/msb.20145645</li>
</ul>
</li>
<li><strong>crc_zhao_results.tar.gz</strong> (<em>crc_wang</em>): H: 56, CRC: 46
<ul>
<li>http://dx.doi.org/10.1038/ismej.2011.109</li>
</ul>
</li>
<li><strong>edd_singh_results.tar.gz</strong> (<em>edd_singh</em>): STEC: 28, CAMP: 71, SALM: 66, SHIG: 34, H: 75
<ul>
<li>http://dx.doi.org/10.1186/s40168-015-0109-2</li>
</ul>
</li>
<li><strong>hiv_dinh_results.tar.gz</strong> (<em>hiv_dinh</em>): H: 16, HIV: 21
<ul>
<li>http://dx.doi.org/10.1093/infdis/jiu409</li>
</ul>
</li>
<li><strong>hiv_lozupone_results.tar.gz</strong> (<em>hiv_lozupone</em>): H: 13, HIV: 25
<ul>
<li>http://dx.doi.org/10.1016/j.chom.2013.08.006</li>
</ul>
</li>
<li><strong>hiv_noguerajulian_results.tar.gz</strong> (<em>hiv_noguerajulian</em>): H: 34, HIV: 206
<ul>
<li>https://doi.org/10.1016/j.ebiom.2016.01.032</li>
</ul>
</li>
<li><strong>ibd_alm_results.tar.gz</strong> (<em>ibd_papa</em>): IBDundef: 1, nonIBD: 24, UC: 43, CD: 23
<ul>
<li>http://dx.doi.org/10.1371/journal.pone.0039242</li>
</ul>
</li>
<li><strong>ibd_engstrand_maxee_results.tar.gz</strong> (<em>ibd_willing</em>): CCD: 12, H: 35, ICD: 15, UC: 16, ICCD: 2
<ul>
<li>http://dx.doi.org/10.1053/j.gastro.2010.08.049</li>
</ul>
</li>
<li><strong>ibd_gevers_2014_results.tar.gz</strong> (<em>ibd_gevers</em>): H: 31, CD: 224
<ul>
<li>http://dx.doi.org/10.1016/j.chom.2014.02.005</li>
</ul>
</li>
<li><strong>ibd_huttenhower_results.tar.gz</strong> (<em>ibd_morgan</em>): H: 18, UC: 48, CD: 62
<ul>
<li>http://dx.doi.org/10.1186/gb-2012-13-9-r79</li>
</ul>
</li>
<li><strong>mhe_zhang_results.tar.gz</strong> (<em>liv_zhang</em>): CIRR: 25, H: 26, MHE: 26
<ul>
<li>http://dx.doi.org/10.1038/ajg.2013.221</li>
</ul>
</li>
<li><strong>nash_chan_results.tar.gz</strong> (<em>nash_wong</em>): H: 22, NASH: 16
<ul>
<li>http://dx.doi.org/10.1371/journal.pone.0062885</li>
</ul>
</li>
<li><strong>nash_ob_baker_results.tar.gz</strong> (<em>nash_zhu</em>): H: 16, NASH: 22, OB: 25
<ul>
<li>http://dx.doi.org/10.1002/hep.26093</li>
</ul>
</li>
<li><strong>ob_goodrich_results.tar.gz</strong> (<em>ob_goodrich</em>): OW: 322, H: 433, OB: 183
<ul>
<li>http://dx.doi.org/10.1016/j.cell.2014.09.053</li>
</ul>
</li>
<li><strong>ob_gordon_2008_v2_results.tar.gz</strong> (<em>ob_turnbaugh</em>): H: 61, OB: