100+ datasets found
  1. MicrobiomeHD: the human gut microbiome in health and disease

    • zenodo.org
    • search.datacite.org
    application/gzip
    Updated Jan 24, 2020
    Cite
    Claire Duvallet; Sean Gibbons; Thomas Gurry; Rafael Irizarry; Eric Alm (2020). MicrobiomeHD: the human gut microbiome in health and disease [Dataset]. http://doi.org/10.5281/zenodo.569601
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Claire Duvallet; Sean Gibbons; Thomas Gurry; Rafael Irizarry; Eric Alm
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Overview

    MicrobiomeHD is a standardized database of human gut microbiome studies in health and disease. This database includes publicly available 16S data from published case-control studies and their associated patient metadata. Raw sequencing data for each study was downloaded and processed through a standardized pipeline.

    To be included in MicrobiomeHD, datasets have:

    • publicly available raw sequencing data (fastq or fasta)
    • publicly available metadata with at least case and control labels for each patient
    • at least 15 case patients

    Currently, MicrobiomeHD is focused on stool samples. Additional samples may be included in certain datasets, as indicated in the metadata.

    Files

    Additional information about the datasets included in this MicrobiomeHD release is in the MicrobiomeHD GitHub repo (https://github.com/cduvallet/microbiomeHD), in the file db/dataset_info.yaml. Top-level identifiers correspond to the dataset IDs used in Duvallet et al. 2017. Sample sizes in the YAML file are those described in the papers and may not exactly reflect the actual data (due to missing or extra data, samples that did not pass quality control, etc.).

    Each dataset was downloaded and processed through a standardized pipeline. The raw processing results are available in the *.tar.gz files here. Each file has the same directory structure and files, as described in the pipeline documentation: http://amplicon-sequencing-pipeline.readthedocs.io/en/latest/output.html.

    Specific files of interest include:

    • summary_file.txt: this file contains a summary of all parameters used to process the data
    • datasetID.metadata.txt: the metadata associated with the samples. Note that some samples in the metadata may not have sequencing data, and vice versa.
    • RDP/datasetID.otu_table.100.denovo.rdp_assigned: the 100% OTU tables with Latin taxonomic names assigned using the RDP classifier.
    • datasetID.otu_seqs.100.fasta: representative sequences for each OTU in the 100% OTU table. OTU labels in the OTU table end with d_denovoID - these denovoIDs correspond to the sequences in this file.

    Processing

    The raw data was acquired as described in the supplementary materials of Duvallet et al.'s "Meta analysis of microbiome studies identifies shared and disease-specific patterns".

    Raw sequencing data was processed with the Alm lab's in-house 16S processing pipeline: https://github.com/thomasgurry/amplicon_sequencing_pipeline

    Pipeline documentation is available at: http://amplicon-sequencing-pipeline.readthedocs.io/

    Metadata was extracted from the original papers and/or data sources, and formatted manually.

    Contributing

    MicrobiomeHD is a resource that can be used to extract disease-specific microbiome signals in individual case-control studies. Many microbes respond non-specifically to health and disease, and the majority of bacterial associations within individual studies overlap with this "core" response. Researchers should cross-check their results with the data presented here to ensure that their identified microbial associations are specific to their disease under study.

    We provide an updated list of "core" microbes here, as well as the raw OTU tables for anyone who wishes to reproduce and adapt this analysis to their study question.

    If you would like to include your case-control dataset in MicrobiomeHD, please email duvallet[at]mit.edu.

    For us to process your data through our standard pipeline, you will need to provide the following files and information about your data:

    • raw sequencing data in fastq or fasta format (preferably fastq)
    • information about which processing steps will be required (e.g. removing primers or barcodes, merging paired-end reads, etc)
    • sample IDs associated with the sequencing data (either mapped to barcodes still in the sequences, or to each de-multiplexed sequencing file)
    • case/control metadata of each sample
    • other relevant metadata (e.g. sampling site, if not all samples are stool; sampling time point, if multiple samples per patient were taken; etc)

    By using MicrobiomeHD in your own analyses, you agree to contribute your dataset to this database and to make your raw sequencing data (i.e. fastq files) publicly available.

    Citing MicrobiomeHD

    The MicrobiomeHD database and original publications for each of these datasets are described in Duvallet et al. (2017): http://biorxiv.org/content/early/2017/05/08/134031

    If you use any of these datasets in your analysis, please cite both MicrobiomeHD (Duvallet et al. (2017)) and the original publication for each dataset that you use.

    The code used to process and analyze this data in Duvallet et al. (2017) is available on github: https://github.com/cduvallet/microbiomeHD

    Files

    Core genera

    file-S3.core_genera.txt: Supplemental Table 3 from Duvallet et al. (2017), listing the core health- and disease-associated microbes.

    Datasets

    Note that MicrobiomeHD contains all 28 datasets from Duvallet et al. (2017), as well as additional datasets which did not meet the inclusion criteria for the meta-analysis presented in the paper. Additional information about the datasets included in this MicrobiomeHD release is in the original publications and in the MicrobiomeHD GitHub repo (https://github.com/cduvallet/microbiomeHD), in the file db/dataset_info.yaml.

    The sample sizes listed here reflect what was reported in the original publications. Some may have discrepancies between what is reported and what is in the actual data due to missing data, quality issues, barcode mismatches, etc.
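    Because samples in the metadata may lack sequencing data and vice versa, it can help to reconcile the two sample lists before any analysis. The following is a minimal sketch, not part of the MicrobiomeHD pipeline. It assumes tab-delimited text, sample IDs in the first metadata column, and an OTU table with OTUs as rows and samples as columns. The column names and toy values are hypothetical.

```python
import csv
import io


def sample_ids_from_metadata(tsv_text):
    """Return sample IDs from the first column of a tab-delimited metadata table."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header row
    return {row[0] for row in reader if row}


def sample_ids_from_otu_table(tsv_text):
    """Return sample IDs from the header row (assumes samples are columns)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    return set(header[1:])  # first cell labels the OTU column


def reconcile(metadata_ids, otu_ids):
    """Split sample IDs into shared, metadata-only and sequencing-only groups."""
    return {
        "shared": sorted(metadata_ids & otu_ids),
        "metadata_only": sorted(metadata_ids - otu_ids),
        "sequencing_only": sorted(otu_ids - metadata_ids),
    }


# Toy inputs standing in for datasetID.metadata.txt and an OTU table (hypothetical layout).
metadata_text = "sample_id\tDiseaseState\nS1\tcase\nS2\tcontrol\nS3\tcase\n"
otu_text = "otu\tS1\tS2\tS4\nk__Bacteria;d__denovo1\t10\t0\t3\n"

report = reconcile(
    sample_ids_from_metadata(metadata_text),
    sample_ids_from_otu_table(otu_text),
)
print(report)
```

    Only the "shared" samples have both a label and sequencing data. The other two groups account for part of the gap between reported and actual sample sizes.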

  2. Medicinal Plant Microbiome Database

    • scidb.cn
    Updated Mar 22, 2024
    Cite
    Niu Yuqing; Chen Peng (2024). Medicinal Plant Microbiome Database [Dataset]. http://doi.org/10.57760/sciencedb.17282
    Dataset provided by
    Science Data Bank
    Authors
    Niu Yuqing; Chen Peng
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Medicinal plants host abundant endophytic bacteria, fungi and actinomycetes. These microorganisms are closely tied to the growth, reproduction and metabolic activities of their host plants: they can affect the formation and content of the plants' medicinal components as well as the authenticity of Chinese medicinal materials. Angelicae Sinensis Radix, Astragali Radix, Codonopsis Radix, Glycyrrhizae Radix et Rhizoma and Rhei Radix et Rhizoma are traditional Chinese medicinal materials and important sources of clinical medicine. In recent years, a growing body of research has examined the microbiomes of these medicinal plants. To integrate the data and results of these studies and to promote comparative work, a literature review and information extraction analysis were carried out to construct a knowledge base of the medicinal plant microbiome that supports research on medicinal plant quality and authenticity. The database describes medicinal plant microorganisms by name, host plant, plant source in literature, classification, genus, family, order, class, phylum, function/biological role, technique, sequence length, NCBI reference serial number/GenBank, references and corresponding links. The interface supports queries on the microbiome content of the medicinal plants listed above. The database is intended to provide a research basis for developing and using the microbiome of medicinal plants, and a reference for creating new methods of quality control and authenticity evaluation.

    In Version 2, an additional 11 pieces of information were incorporated for Codonopsis Radix.

    In Version 3, the number of endophytes in the database was updated to 350.

    In Version 4, the category 'plant source in literature' was added to distinguish the origin of host plants. The names of host plants were also unified as Latin names, and one duplicate entry was removed.

    In Version 5, a processing file for the data in MPMD was added. It counts the frequency of each endophyte and analyzes parameters such as the proportion of high-frequency endophytes occurring in the five traditional medicinal plants.

    In Version 6, errors in the descriptions of the previous versions were corrected and a new version of MPMD was provided.

  3. Fly Microbiome Diet Database

    • figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Danielle Lesperance; Nichole Broderick (2023). Fly Microbiome Diet Database [Dataset]. http://doi.org/10.6084/m9.figshare.11920788.v4
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Danielle Lesperance; Nichole Broderick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the working version of the Fly Microbiome Diet Database, a compilation of published Drosophila diets used by laboratories in the field of fly microbiome research. In addition to listing dietary components as they are described in published research methods, we calculate the macronutrient content and protein-to-carbohydrate ratio of each diet so that dietary nutrient content can be compared across studies. The source files for each diet component's nutrition facts are accessible at https://doi.org/10.6084/m9.figshare.11920743.v2.

    The database is subject to change as new studies are added or nutritional information is updated to be more accurate. In the database, N.S. for Yeast or Cornmeal type means "not specified"; N/A means that the ingredient was not used.

    06/22/2021: V5 published, includes fructose nutritional content, 7 additional diets

    12/07/2021: V6 published, includes additional diets from recent published microbiome work and self-submitted non-microbiome fly work (submitted via DDCC)
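    The protein-to-carbohydrate ratio mentioned in the description can be illustrated with a short sketch. It is not the database authors' calculation. It assumes each ingredient contributes protein and carbohydrate in proportion to its mass, and the ingredient names, amounts and nutrient fractions below are made-up placeholders rather than values from the database.

```python
def macronutrients(recipe, nutrition):
    """Sum protein and carbohydrate grams over all ingredients in a recipe.

    recipe: ingredient name -> grams used
    nutrition: ingredient name -> (protein fraction, carbohydrate fraction) by mass
    """
    protein = sum(grams * nutrition[name][0] for name, grams in recipe.items())
    carbs = sum(grams * nutrition[name][1] for name, grams in recipe.items())
    return protein, carbs


def protein_to_carb_ratio(recipe, nutrition):
    """Return grams of protein per gram of carbohydrate for a diet."""
    protein, carbs = macronutrients(recipe, nutrition)
    if carbs == 0:
        raise ValueError("diet contains no carbohydrate; ratio is undefined")
    return protein / carbs


# Placeholder values for illustration only (not taken from the database).
nutrition = {
    "yeast": (0.5, 0.25),
    "cornmeal": (0.1, 0.75),
    "sucrose": (0.0, 1.0),
}
recipe = {"yeast": 20.0, "cornmeal": 40.0, "sucrose": 10.0}

protein_g, carb_g = macronutrients(recipe, nutrition)
ratio = protein_to_carb_ratio(recipe, nutrition)
print(f"protein={protein_g:.1f} g, carbohydrate={carb_g:.1f} g, P:C={ratio:.3f}")
```

    Expressing every diet as a single ratio is what makes recipes with different ingredient lists comparable across studies.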

  4. HOMD

    • neuinfo.org
    • rrid.site
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). HOMD [Dataset]. http://identifiers.org/RRID:SCR_012770
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on April 14, 2022. Database of comprehensive information on the approximately 600 prokaryote species that are present in the human oral cavity. The majority of these species are uncultivated and unnamed, recognized primarily by their 16S rRNA sequences. The HOMD presents a provisional naming scheme for the currently unnamed species so that strain, clone, and probe data from any laboratory can be directly linked to a stably named reference entity. The HOMD links sequence data with phenotypic, phylogenetic, clinical, and bibliographic information. Full and partial oral bacterial genome sequences determined as part of this project and the Human Microbiome Project are being added to the HOMD as they become available. HOMD offers easy-to-use tools for viewing all publicly available oral bacterial genomes. Data is also downloadable.

  5. Pbac v1 database - Panda Gut Microbiome Database

    • figshare.com
    application/x-gzip
    Updated Jul 18, 2025
    Cite
    FEILONG DENG (2025). Pbac v1 database - Panda Gut Microbiome Database [Dataset]. http://doi.org/10.6084/m9.figshare.29599118.v2
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    FEILONG DENG
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pbac (Panda Gut Microbiome Database): A Curated Resource for Giant Panda Gut Microbial Genomes. In this project, we conducted an in-depth analysis of giant panda metagenome-assembled genomes (MAGs), utilizing both Illumina and Nanopore sequencing technologies. Our extensive efforts resulted in the identification of 2,684 medium- to high-quality MAGs meeting specific criteria: completeness ≥ 50%, contamination < 10%, and length ≥ 500 kb. Remarkably, 960 MAGs surpassed the stringent high-quality thresholds of completeness ≥ 90% and contamination < 5%. Within this dataset, we identified 1,193 non-redundant MAGs through a 99% similarity threshold clustering, with 354 of them being of high quality. Taxonomic analysis revealed that 672 MAGs could be mapped to 219 known species, while 521 MAGs clustered into 228 unique groups, leading to the assignment of new genus or species identifiers.
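    The quality tiers quoted above can be written as a small classifier. This is a sketch of the stated thresholds only, not the authors' code. It assumes completeness and contamination are given in percent, length in base pairs, and that the 500 kb length floor also applies to the high-quality tier (the description presents high-quality MAGs as a subset of the 2,684). The function name is hypothetical.

```python
MIN_LENGTH_BP = 500_000  # 500 kb length floor from the description


def mag_quality(completeness, contamination, length_bp):
    """Classify a MAG as 'high', 'medium' or 'rejected' using the stated thresholds.

    completeness, contamination: percentages (0-100)
    length_bp: assembled genome length in base pairs
    """
    if length_bp < MIN_LENGTH_BP:
        return "rejected"
    if completeness >= 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "rejected"


# Toy MAG records for illustration: (completeness %, contamination %, length in bp).
examples = [
    (95.0, 2.0, 2_000_000),
    (60.0, 8.0, 600_000),
    (95.0, 2.0, 400_000),
    (90.0, 5.0, 1_000_000),
]
labels = [mag_quality(*mag) for mag in examples]
print(labels)
```

    Note the boundary behavior: contamination of exactly 5% fails the strict "< 5%" test for the high tier but still passes the "< 10%" test for the medium tier.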

  6. Human Oral Microbiome Database

    • bioregistry.io
    Updated Apr 29, 2021
    + more versions
    Cite
    (2021). Human Oral Microbiome Database [Dataset]. https://bioregistry.io/homd.taxon
    Description

    The Human Oral Microbiome Database (HOMD) provides a site-specific comprehensive database for the more than 600 prokaryote species that are present in the human oral cavity. It contains genomic information based on a curated 16S rRNA gene-based provisional naming scheme, and taxonomic information. This datatype contains taxonomic information.

  7. Drinking Water Microbiome OTU Abundance Data Set

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Drinking Water Microbiome OTU Abundance Data Set [Dataset]. https://catalog.data.gov/dataset/drinking-water-microbiome-otu-abundance-data-set
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    An abundance matrix (BM_OTU.xlsx) contains rows as OTU, columns as samples, and entries representing the abundance of each OTU as a ratio of all sequences obtained for each individual sample. This dataset is associated with the following publication: Gomez-Alvarez, V., and R. Revetta. Monitoring of Nitrification in Chloraminated Drinking Water Distribution Systems With Microbiome Bioindicators Using Supervised Machine Learning. Frontiers in Microbiology. Frontiers, Lausanne, SWITZERLAND, 11: 2254-2267, (2020).
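    The abundance values described above (each OTU as a fraction of all sequences in a sample) can be derived from raw counts as in the following sketch. It is an illustration with toy counts, not the EPA's processing code, and it assumes the same orientation as the matrix: OTUs as rows, samples as columns.

```python
def relative_abundance(count_rows):
    """Convert an OTU-by-sample count matrix into per-sample relative abundances.

    count_rows: list of rows, one per OTU, each holding one count per sample.
    Every column is divided by its own total; an all-zero sample stays all zero.
    """
    n_samples = len(count_rows[0])
    totals = [sum(row[j] for row in count_rows) for j in range(n_samples)]
    return [
        [row[j] / totals[j] if totals[j] else 0.0 for j in range(n_samples)]
        for row in count_rows
    ]


# Toy counts: 3 OTUs (rows) by 2 samples (columns).
counts = [
    [30, 0],
    [10, 5],
    [60, 15],
]
abund = relative_abundance(counts)
column_sums = [sum(row[j] for row in abund) for j in range(2)]
print(abund)
```

    After this transformation each sample's column sums to 1, so samples with different sequencing depths can be compared directly.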

  8. Seed Microbiota Database

    • entrepot.recherche.data.gouv.fr
    bin +3
    Updated Aug 22, 2022
    Cite
    Marie Simonin; Matthieu Barret (2022). Seed Microbiota Database [Dataset]. http://doi.org/10.15454/2ANNJM
    Dataset provided by
    Recherche Data Gouv
    Authors
    Marie Simonin; Matthieu Barret
    License

    https://spdx.org/licenses/etalab-2.0.html

    Description

    This dataset compiles all the data of the Seed Microbiota Database associated with the publication Simonin et al. 2021 (BioRxiv), "Seed microbiota revealed by a large-scale meta-analysis including 50 plant species". The database includes metabarcoding data from 63 seed microbiota studies on 50 plant species (a total of 3190 seed samples) based on 5 different molecular markers (16S rRNA gene - V4 region, 16S rRNA gene - V5-V6 region, gyrB gene, ITS1 region, ITS2 region). All the studies were re-processed from the fastq files (raw data) using DADA2 and Qiime2 and merged into 5 different datasets depending on the molecular marker targeted. The README file presents the structure of the database (Subsets) and the files available. The database can be queried online, without downloading it, on the Askomics instance: https://askomics-192-168-100-151.vm.openstack.genouest.org/ For full access to the results, you can log in to the Askomics instance with the following credentials: Username: consult Password: OcOU83D5

  9. Data from: Gut microbiome structure and metabolic activity in inflammatory...

    • metabolomicsworkbench.org
    zip
    Updated Aug 31, 2018
    + more versions
    Cite
    Julian Avila-Pacheco (2018). Gut microbiome structure and metabolic activity in inflammatory bowel disease [Dataset]. https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001000&StudyType=MS&ResultType=1
    Dataset provided by
    Broad Institute of MIT and Harvard
    Authors
    Julian Avila-Pacheco
    Description

    The inflammatory bowel diseases (IBD), which include Crohn’s disease (CD) and ulcerative colitis (UC), are multifactorial, chronic conditions of the gastrointestinal tract. While IBD has been associated with dramatic changes in the gut microbiota, changes in the gut metabolome -- the molecular interface between host and microbiota -- are less-well understood. To address this gap, we performed untargeted LC-MS metabolomic and shotgun metagenomic profiling of cross-sectional stool samples from discovery (n=155) and validation (n=65) cohorts of CD, UC, and non-IBD control subjects. Metabolomic and metagenomic profiles were broadly correlated with fecal calprotectin levels (a measure of gut inflammation). Across >8,000 measured metabolite features, we identified chemicals and chemical classes that were differentially abundant (DA) in IBD, including enrichments for sphingolipids and bile acids, and depletions for triacylglycerols and tetrapyrroles. While >50% of DA metabolite features were uncharacterized, many could be assigned putative roles through metabolomic “guilt-by-association” (covariation with known metabolites). DA species and functions from the metagenomic profiles reflected adaptation to oxidative stress in the IBD gut, and were individually consistent with previous findings. Integrating these data, however, we identified 122 robust associations between DA species and well-characterized DA metabolites, indicating possible mechanistic relationships that are perturbed in IBD. Finally, we found that metabolome- and metagenome-based classifiers of IBD status were highly accurate and, like the vast majority of individual trends, generalized well to the independent validation cohort. Our findings thus provide an improved understanding of perturbations of the microbiome-metabolome interface in IBD, including identification of many potential diagnostic and therapeutic targets.

  10. The Human Microbiome Project

    • registry.opendata.aws
    • kaggle.com
    Updated Apr 20, 2018
    Cite
    The National Institutes of Health Office of Strategic Coordination - The Common Fund (2018). The Human Microbiome Project [Dataset]. https://registry.opendata.aws/human-microbiome-project/
    Dataset provided by
    The National Institutes of Health Office of Strategic Coordination - The Common Fund (https://commonfund.nih.gov/hmp)
    Description

    The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions.

  11. Data from: Ribosomal Database Project

    • neuinfo.org
    • scicrunch.org
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). Ribosomal Database Project [Dataset]. http://identifiers.org/RRID:SCR_006633
    Description

    A database which provides ribosome-related data services to the scientific community, including online data analysis, rRNA-derived phylogenetic trees, and aligned and annotated rRNA sequences. It specifically contains information on quality-controlled, aligned and annotated bacterial and archaeal 16S rRNA sequences, fungal 28S rRNA sequences, and a suite of analysis tools for the scientific community. Most of the RDP tools are now available as open source packages for users to incorporate in their local workflow.

  12. Serofluid dish Microbiome Database (SMD)

    • scidb.cn
    Updated Mar 28, 2025
    Cite
    Chen Peng; Liu Yingjie; Zhang Rentao (2025). Serofluid dish Microbiome Database (SMD) [Dataset]. http://doi.org/10.57760/sciencedb.22778
    Dataset provided by
    Science Data Bank
    Authors
    Chen Peng; Liu Yingjie; Zhang Rentao
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Serofluid dish is one of the most popular probiotic fermented foods in northern China and has a long history. It has a unique flavor and is rich in nutrients, which is beneficial to human health. Serofluid dish contains many probiotic resources, with Lactobacillus and Acetobacter as its dominant species. The lactic acid bacteria it contains can promote gastrointestinal peristalsis, digestion and absorption once they enter the human digestive tract; they can also reduce cholesterol and enhance the body's immunity. In Lanzhou, Serofluid dish is both a representative local specialty and a symbol of the local food culture, attracting a large number of domestic food lovers. Based on a systematic study of the microbial community structure of Serofluid dish, the separation and identification results for naturally fermented Serofluid dish samples collected from different geographical locations were summarized and compiled into the database. The main contents of the database include the species and genera of culturable microorganisms isolated from Serofluid dish, their GenBank accessions, the corresponding media, and other information.

  13. MMGC: custom Kraken2/Bracken database for analysing the mouse gut microbiome...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Feb 9, 2021
    Cite
    Benjamin S. Beresford-Jones (2021). MMGC: custom Kraken2/Bracken database for analysing the mouse gut microbiome [Dataset]. http://doi.org/10.5281/zenodo.4300643
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Benjamin S. Beresford-Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Custom Kraken2/Bracken database built using the representative genomes for 1,021 microbial species from the mouse gut microbiota. Genomes include isolates and MAGs, but all are near-complete (>90% completeness; <5% contamination; maximum genome size ≤ 8 Mb; maximum contig count ≤ 500; N50 ≥ 10 kb; mean contig length ≥ 5 kb). This database achieved a mean read classification rate of 87.7% when benchmarked on 1,785 independent (i.e. non-contributory) mouse gut shotgun metagenome samples. An equivalent human database (UHGG) only attained classification rates of 36.6%.

    This database is a publicly available resource to facilitate more efficient/deeper analyses of mouse gut shotgun metagenomes.

    Find out more about the Mouse Microbial Genome Collection at our GitHub repository.

  14. Microplastics Fish Gut Microbiome Data For EDA/ML

    • kaggle.com
    zip
    Updated Jul 19, 2025
    Cite
    ISMAILDRISSI25 (2025). Microplastics Fish Gut Microbiome Data For EDA/ML [Dataset]. https://www.kaggle.com/datasets/ismaildrissi25/microplastics-fish-gut-microbiome-data-for-ml
    Available download formats: zip (252677 bytes)
    Authors
    ISMAILDRISSI25
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset was compiled for a Master's thesis project focused on investigating the gut microbiota response in fish exposed to microplastics. It contains cleaned and annotated metadata along with taxonomic abundance information and exposure features, prepared for predictive machine learning modeling.

    Context

    Microplastics (MPs) are emerging pollutants in aquatic ecosystems. Numerous studies have shown that MPs can impact the gut microbial composition of fish. This dataset integrates data from multiple studies through a meta-analysis approach, standardized using bioinformatics and machine learning pipelines.

    Source

    Sequences and metadata were extracted from public BioProject entries in the NCBI SRA database.

    Data processing: QIIME2, Python (pandas, scikit-learn), Google Colab

    Total size: ~648 FASTQ files → summarized into machine learning-ready tabular format

    Applications

    • Microbiome classification modeling
    • Environmental ecotoxicology analysis
    • Meta-analysis benchmarking
    • Feature importance and interpretability (SHAP, feature selection)

  15. Data and code from: Learning a deep language model for microbiomes: The...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Feb 20, 2025
    Cite
    Quintin Pope; Rohan Varma; Christine Tataru; Maude David; Xiaoli Fern (2025). Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data [Dataset]. http://doi.org/10.5061/dryad.tb2rbp08p
    Dataset provided by
    University of Michigan
    Oregon State University
    Authors
    Quintin Pope; Rohan Varma; Christine Tataru; Maude David; Xiaoli Fern
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    We use open source human gut microbiome data to learn a microbial “language” model by adapting techniques from Natural Language Processing (NLP). Our microbial “language” model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting inflammatory bowel disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from independent studies, even in the presence of substantial distribution shifts. Through a variety of analyses, we further show that the pre-trained, context-sensitive embedding captures meaningful biological information, including taxonomic relationships, correlations with biological pathways, and relevance to IBD expression, despite the model never being explicitly exposed to such signals.

    Methods

    No additional raw data was collected for this project. All inputs are available publicly. American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively). We used the curated data produced by Tataru and David, 2020.

  16. Data from: Predicting cancer prognosis and drug response from the tumor...

    • data.niaid.nih.gov
    Updated Jun 22, 2022
    + more versions
    Cite
    Hermida, Leandro Cruz; Gertz, E. Michael; Ruppin, Eytan (2022). Predicting cancer prognosis and drug response from the tumor microbiome [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5221525
    Dataset updated
    Jun 22, 2022
    Dataset provided by
    National Cancer Institute http://www.cancer.gov/
    Authors
    Hermida, Leandro Cruz; Gertz, E. Michael; Ruppin, Eytan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tumor gene expression is predictive of patient prognosis in some cancers. However, RNA-seq and whole genome sequencing data contain not only reads from host tumor and normal tissue, but also reads from the tumor microbiome, which can be used to infer the microbial abundances in each tumor. Here, we show that tumor microbial abundances, alone or in combination with tumor gene expression data, can predict cancer prognosis and drug response to some extent – microbial abundances are significantly less predictive of prognosis than gene expression, although remarkably, similarly as predictive of drug response, but in mostly different cancer-drug combinations. Thus, it appears possible to leverage existing sequencing technology, or develop new protocols, to obtain more non-redundant information about prognosis and drug response from RNA-seq and whole genome sequencing experiments than could be obtained from tumor gene expression or genomic data alone.

  17. MiMeDB

    • neuinfo.org
    • scicrunch.org
    • +2 more
    Updated Jun 20, 2024
    Cite
    (2024). MiMeDB [Dataset]. http://identifiers.org/RRID:SCR_025108
    Dataset updated
    Jun 20, 2024
    Description

    Database containing detailed information about small molecules produced by the human microbiome. Provides metabolite data including structure, names, descriptions, chemical taxonomy, chemical ontology, physico-chemical data, and spectra. It also contains detailed information about the microbes that produce these chemicals, the enzymatic reactions responsible for their production, the bioactivity of the chemicals, and the anatomical location of these chemicals and microbes. Many data fields in the database are hyperlinked to other databases including FooDB, HMDB, KEGG, PubChem, MetaCyc, ChEBI, UniProt, and GenBank. Database is FAIR compliant. The data in MiMeDB are released under the Creative Commons (CC) 4.0 License.

  18. The Human Oral Microbiome Database (December 2020)

    • figshare.com
    xlsx
    Updated Jun 9, 2023
    Cite
    César Rivera (2023). The Human Oral Microbiome Database (December 2020) [Dataset]. http://doi.org/10.6084/m9.figshare.16606310.v1
    Available download formats: xlsx
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    César Rivera
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database used in Sci Rep. 2021 Aug 2;11(1):15646. doi: 10.1038/s41598-021-95228-8. Bacteria list processed to contain only bacteria from the oral cavity. The original files (without processing) can be downloaded from the Human Oral Microbiome Database: HOMD (http://www.homd.org/)

  19. Human Microbiome Compendium dataset

    • zenodo.org
    application/gzip, bin +3
    Updated Sep 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Richard J. Abdill; Samantha P. Graham; Vincent Rubinetti; Ahmadian Mansooreh; Parker Hicks; Ashwin Chetty; Daniel McDonald; Pamela Ferretti; Elizabeth Gibbons; Marco Rossi; Arjun Krishnan; Frank W. Albert; Casey S. Greene; Sean Davis; Ran Blekhman (2024). Human Microbiome Compendium dataset [Dataset]. http://doi.org/10.5281/zenodo.13733642
    Available download formats: application/gzip, bin, txt, csv, tsv
    Dataset updated
    Sep 19, 2024
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Richard J. Abdill; Samantha P. Graham; Vincent Rubinetti; Ahmadian Mansooreh; Parker Hicks; Ashwin Chetty; Daniel McDonald; Pamela Ferretti; Elizabeth Gibbons; Marco Rossi; Arjun Krishnan; Frank W. Albert; Casey S. Greene; Sean Davis; Ran Blekhman
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Human Microbiome Compendium is an ongoing project to build a large collection of human microbiome sequencing data processed with a uniform pipeline. Currently, the compendium contains 16S rRNA amplicon sequencing data for human gut microbiome samples retrieved from the Sequence Read Archive. Our website at microbiomap.org has more information about the project and links to related resources.

    This data is freely available under the Creative Commons Attribution 4.0 International license (CC BY 4.0). If you use it in your work, please cite our preprint:

    Abdill, Richard J., Samantha P. Graham, Vincent Rubinetti, Frank W. Albert, Casey S. Greene, Sean Davis, and Ran Blekhman. “Integration of 168,000 Samples Reveals Global Patterns of the Human Gut Microbiome.” bioRxiv, October 11, 2023. https://doi.org/10.1101/2023.10.11.560955.

    If you are using this dataset in combination with your own results, it's important to note that the taxonomic classifications may differ between releases, as documented in CHANGELOG.md. The most recent release (1.1.0) includes assignments made using SILVA 138.2 (SSU Ref NR 99) and Greengenes2 (2022.10 backbone).

  20. Drinking Water Microbiome Sequence Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Drinking Water Microbiome Sequence Data Set [Dataset]. https://catalog.data.gov/dataset/drinking-water-microbiome-sequence-data-set
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency http://www.epa.gov/
    Description

    The fasta file (BM_OTU.fasta) contains the sequence of the V4 region of the bacterial 16S rRNA gene (≈250 nt) for each Operational Taxonomic Unit (OTU). This dataset is associated with the following publication: Gomez-Alvarez, V., and R. Revetta. Monitoring of Nitrification in Chloraminated Drinking Water Distribution Systems With Microbiome Bioindicators Using Supervised Machine Learning. Frontiers in Microbiology. Frontiers, Lausanne, SWITZERLAND, 11: 2254-2267, (2020).
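
    A file of per-OTU sequences like this one can be read with a few lines of standard-library Python. The following is a minimal sketch, not code from the dataset authors: the function name `read_fasta` and the example record IDs and sequences are hypothetical, and it assumes plain FASTA with `>` header lines.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequences may be wrapped over several lines; join them.
            records[current_id] += line
    return records


# Hypothetical two-record example; real OTU IDs and sequences will differ.
example = [">OTU_1 example", "ACGTAC", "GTTA", ">OTU_2", "TTGACA"]
otus = read_fasta(example)
lengths = {otu_id: len(seq) for otu_id, seq in otus.items()}
```

    To use it on the downloaded file, pass an open file handle instead of a list, for example `with open("BM_OTU.fasta") as handle: otus = read_fasta(handle)`.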

Cite
Claire Duvallet; Sean Gibbons; Thomas Gurry; Rafael Irizarry; Eric Alm (2020). MicrobiomeHD: the human gut microbiome in health and disease [Dataset]. http://doi.org/10.5281/zenodo.569601

MicrobiomeHD: the human gut microbiome in health and disease

9 scholarly articles cite this dataset
Available download formats: application/gzip
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo http://zenodo.org/
Authors
Claire Duvallet; Sean Gibbons; Thomas Gurry; Rafael Irizarry; Eric Alm
License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Overview

MicrobiomeHD is a standardized database of human gut microbiome studies in health and disease. This database includes publicly available 16S data from published case-control studies and their associated patient metadata. Raw sequencing data for each study was downloaded and processed through a standardized pipeline.

To be included in MicrobiomeHD, datasets must have:

  • publicly available raw sequencing data (fastq or fasta)
  • publicly available metadata with at least case and control labels for each patient
  • at least 15 case patients

Currently, MicrobiomeHD is focused on stool samples. Additional samples may be included in certain datasets, as indicated in the metadata.

Files

Additional information about the datasets included in this MicrobiomeHD release is in the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, in the file db/dataset_info.yaml. Top-level identifiers correspond to the dataset IDs used in Duvallet et al. 2017. Sample sizes in the YAML file are those described in the papers, and may not exactly reflect the actual data (due to missing/extra data, samples which didn't pass quality control, etc).

Each dataset was downloaded and processed through a standardized pipeline. The raw processing results are available in the *.tar.gz files here. Each file has the same directory structure and files, as described in the pipeline documentation: http://amplicon-sequencing-pipeline.readthedocs.io/en/latest/output.html.
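
Because every archive is described as sharing one directory layout, a quick way to get oriented is to list an archive's contents before extracting anything. The sketch below uses Python's standard `tarfile` module. The helper names `list_members` and `build_demo_archive` are hypothetical, and the demo archive with its `demo/...` paths is a stand-in built in memory, not the real layout, which is documented at the pipeline link.

```python
import io
import tarfile
from typing import List


def list_members(archive: tarfile.TarFile, suffix: str = "") -> List[str]:
    """Return sorted names of regular files in the archive ending with `suffix`."""
    return sorted(
        member.name
        for member in archive.getmembers()
        if member.isfile() and member.name.endswith(suffix)
    )


def build_demo_archive() -> bytes:
    """Build a tiny in-memory .tar.gz with placeholder file names (assumed)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name in ("demo/summary_file.txt", "demo/demo.metadata.txt"):
            payload = b"placeholder\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Open the demo archive the same way a downloaded *.tar.gz would be opened.
with tarfile.open(fileobj=io.BytesIO(build_demo_archive()), mode="r:gz") as tar:
    all_files = list_members(tar)
    metadata_files = list_members(tar, suffix=".metadata.txt")
```

For a downloaded file, replace the `fileobj=` argument with the path to the archive, for example `tarfile.open("some_dataset.tar.gz", mode="r:gz")`, where the file name is whatever the archive is called locally.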

Specific files of interest include:

  • summary_file.txt: this file contains a summary of all parameters used to process the data
  • datasetID.metadata.txt: the metadata associated with the samples. Note that some samples in the metadata may not have sequencing data, and vice versa.
  • RDP/datasetID.otu_table.100.denovo.rdp_assigned: the 100% OTU tables with Latin taxonomic names assigned using the RDP classifier.
  • datasetID.otu_seqs.100.fasta: representative sequences for each OTU in the 100% OTU table. OTU labels in the OTU table end with d_denovoID - these denovoIDs correspond to the sequences in this file.

Processing

The raw data was acquired as described in the supplementary materials of Duvallet et al.'s "Meta analysis of microbiome studies identifies shared and disease-specific patterns".

Raw sequencing data was processed with the Alm lab's in-house 16S processing pipeline: https://github.com/thomasgurry/amplicon_sequencing_pipeline

Pipeline documentation is available at: http://amplicon-sequencing-pipeline.readthedocs.io/

Metadata was extracted from the original papers and/or data sources, and formatted manually.

Contributing

MicrobiomeHD is a resource that can be used to extract disease-specific microbiome signals in individual case-control studies. Many microbes respond non-specifically to health and disease, and the majority of bacterial associations within individual studies overlap with this "core" response. Researchers should cross-check their results with the data presented here to ensure that their identified microbial associations are specific to their disease under study.

We provide an updated list of "core" microbes here, as well as the raw OTU tables for anyone who wishes to reproduce and adapt this analysis to their study question.
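
The cross-check described above amounts to separating a study's significant genera into those that also appear in the "core" list and those that do not. The following is a minimal sketch under stated assumptions: the function name `split_by_core` is hypothetical, the genus names are placeholders rather than real results, and it assumes the core list has already been read into a collection of genus names (the layout of the core genera file is not described here).

```python
from typing import Dict, Iterable, List


def split_by_core(significant: Iterable[str], core: Iterable[str]) -> Dict[str, List[str]]:
    """Split significant genera into those in the core list and the rest."""
    # Compare case-insensitively and ignore surrounding whitespace.
    core_set = {genus.strip().lower() for genus in core}
    result: Dict[str, List[str]] = {"core": [], "specific": []}
    for genus in significant:
        key = "core" if genus.strip().lower() in core_set else "specific"
        result[key].append(genus)
    return result


# Placeholder names only; substitute the genera from your own analysis.
core_genera = ["GenusA", "GenusB"]
my_significant_genera = ["GenusA", "GenusC"]
report = split_by_core(my_significant_genera, core_genera)
```

Genera that land in the `"specific"` list are candidates for disease-specific associations; those in `"core"` overlap with the non-specific response and deserve a more cautious interpretation.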

If you would like to include your case-control dataset in MicrobiomeHD, please email duvallet[at]mit.edu.

For us to process your data through our standard pipeline, you will need to provide the following files and information about your data:

  • raw sequencing data in fastq or fasta format (preferably fastq)
  • information about which processing steps will be required (e.g. removing primers or barcodes, merging paired-end reads, etc)
  • sample IDs associated with the sequencing data (either mapped to barcodes still in the sequences, or to each de-multiplexed sequencing file)
  • case/control metadata of each sample
  • other relevant metadata (e.g. sampling site, if not all samples are stool; sampling time point, if multiple samples per patient were taken; etc)

By using MicrobiomeHD in your own analyses, you agree to contribute your dataset to this database and to make your raw sequencing data (i.e. fastq files) publicly available.

Citing MicrobiomeHD

The MicrobiomeHD database and original publications for each of these datasets are described in Duvallet et al. (2017): http://biorxiv.org/content/early/2017/05/08/134031

If you use any of these datasets in your analysis, please cite both MicrobiomeHD (Duvallet et al. (2017)) and the original publication for each dataset that you use.

The code used to process and analyze this data in Duvallet et al. (2017) is available on github: https://github.com/cduvallet/microbiomeHD

Files

Core genera

file-S3.core_genera.txt: Supplemental Table 3 from Duvallet et al. (2017), listing the core health- and disease-associated microbes.

Datasets

Note that MicrobiomeHD contains all 28 datasets from Duvallet et al. (2017), as well as additional datasets which did not meet the inclusion criteria for the meta-analysis presented in the paper. Additional information about the datasets included in this MicrobiomeHD release is in the original publications and the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, in the file db/dataset_info.yaml.

The sample sizes listed here reflect what was reported in the original publications. Some may have discrepancies between what is reported and what is in the actual data due to missing data, quality issues, barcode mismatches, etc.
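
One way to see how the data on disk differ from the reported sample sizes is to compare the sample IDs in a metadata file with those in the matching OTU table. The sketch below is illustrative only and rests on assumptions not stated on this page: both files are taken to be tab-delimited, the metadata is assumed to hold sample IDs in its first column under a header row, and the OTU table is assumed to have one column per sample. The function names and the inline example data are hypothetical.

```python
import csv
import io
from typing import Dict, Set, TextIO


def first_column_ids(handle: TextIO) -> Set[str]:
    """Collect the first field of each data row (assumes a header row)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0] for row in reader if row}


def header_ids(handle: TextIO) -> Set[str]:
    """Collect column names after the first (assumes one column per sample)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    return set(header[1:])


def compare_samples(metadata_ids: Set[str], sequenced_ids: Set[str]) -> Dict[str, Set[str]]:
    """Report which sample IDs appear in both sources or in only one."""
    return {
        "both": metadata_ids & sequenced_ids,
        "metadata_only": metadata_ids - sequenced_ids,
        "sequencing_only": sequenced_ids - metadata_ids,
    }


# Hypothetical in-memory stand-ins for a metadata file and an OTU table.
metadata_text = "sample_id\tDiseaseState\nS1\tcase\nS2\tcontrol\nS3\tcase\n"
otu_text = "otu\tS2\tS3\tS4\nOTU_1\t5\t0\t2\n"
report = compare_samples(
    first_column_ids(io.StringIO(metadata_text)),
    header_ids(io.StringIO(otu_text)),
)
```

If the real files are oriented differently (for example, samples as rows in the OTU table), swap which helper is applied to which file.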
