100+ datasets found
  1. Data from: MLOmics: Cancer Multi-Omics Database for Machine Learning

    • figshare.com
    bin
    Updated May 25, 2025
    Cite
    Rikuto Kotoge (2025). MLOmics: Cancer Multi-Omics Database for Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.28729127.v2
    Available download formats: bin
    Dataset updated
    May 25, 2025
    Dataset provided by
    figshare
    Authors
    Rikuto Kotoge
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. These successful machine learning models are powered by high-quality training datasets with sufficient data volume and adequate preprocessing. However, while several public data portals exist, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open databases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. We propose MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.

  2. Additional file 2 of Single Cell Atlas: a single-cell multi-omics human cell...

    • springernature.figshare.com
    xlsx
    Updated Aug 18, 2024
    Cite
    Lu Pan; Paolo Parini; Roman Tremmel; Joseph Loscalzo; Volker M. Lauschke; Bradley A. Maron; Paola Paci; Ingemar Ernberg; Nguan Soon Tan; Zehuan Liao; Weiyao Yin; Sundararaman Rengarajan; Xuexin Li (2024). Additional file 2 of Single Cell Atlas: a single-cell multi-omics human cell encyclopedia [Dataset]. http://doi.org/10.6084/m9.figshare.25656133.v1
    Available download formats: xlsx
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    figshare
    Authors
    Lu Pan; Paolo Parini; Roman Tremmel; Joseph Loscalzo; Volker M. Lauschke; Bradley A. Maron; Paola Paci; Ingemar Ernberg; Nguan Soon Tan; Zehuan Liao; Weiyao Yin; Sundararaman Rengarajan; Xuexin Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2: Table S1. Cell counts of the adult and fetal tissue groups at each omics level. Table S2. Filtered matrix raw read counts for scRNA-Seq across tissues in both fetal and adult groups. The Cell_Count_Filtered_Matrix column represents raw read counts initially obtained from published studies or after filtering to remove background noise. Table S3. Statistics of the upregulated genes from adult and fetal tissues, filtered by average Log2FoldChange > 0.25 and adjusted P of 0.05. Clusters represent cell types. Genes were ranked by average log2-fold-change. Table S4. Top receptor–ligand interaction profiles of the cell types in the 38 matching adult and fetal tissues. Interaction analysis was done separately for each tissue, and information on the interaction pairs can be viewed in the first column. Table S5. Top clonotypes (VDJ gene combinations) of each cell type present in the T and B cell repertoires. Table S6. Top TFs in the pseudotime transitions of adult and fetal colon cell types. Table S7. Top receptor–ligand pairs in spatial transcriptomics of adult colons (colon 1 and colon 2) as well as in scRNA-seq adult and fetal colons. The first column represents the data type to which the interactions belong. The table is ranked by decreasing interaction ratios. Table S8. Comparison of SCA with other single-cell omics databases. A green tick indicates a yes and a red cross indicates a no. Table S9. List of public resources included in the SCA database portal. SCA_PID refers to the SCA-designated project identity number (PID).
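
    The Table S3 filter described above can be sketched roughly as follows. This is only an illustration: the column names (`avg_log2FC`, `p_adj`), the gene labels, and the reading of "adjusted P of 0.05" as "adjusted P below 0.05" are assumptions, not details taken from the file.

```python
def filter_upregulated(rows, min_log2fc=0.25, max_adj_p=0.05):
    """Keep rows passing both thresholds, ranked by decreasing log2 fold change.

    Column names and the direction of the P-value cutoff are assumptions.
    """
    kept = [
        r for r in rows
        if r["avg_log2FC"] > min_log2fc and r["p_adj"] < max_adj_p
    ]
    return sorted(kept, key=lambda r: r["avg_log2FC"], reverse=True)


# Made-up rows for illustration only; not values from Table S3.
toy_rows = [
    {"gene": "G1", "avg_log2FC": 1.2, "p_adj": 0.001},
    {"gene": "G2", "avg_log2FC": 0.1, "p_adj": 0.001},  # fold change too small
    {"gene": "G3", "avg_log2FC": 2.0, "p_adj": 0.2},    # not significant
    {"gene": "G4", "avg_log2FC": 0.5, "p_adj": 0.01},
]
top = [r["gene"] for r in filter_upregulated(toy_rows)]
```

    Both thresholds are keyword arguments so a different cutoff can be tried without touching the function body.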

  3. Additional file 3: Table S3. of ODG: Omics database generator - a tool for...

    • figshare.com
    xlsx
    Updated May 30, 2023
    Cite
    Joseph Guhlin; Kevin Silverstein; Peng Zhou; Peter Tiffin; Nevin Young (2023). Additional file 3: Table S3. of ODG: Omics database generator - a tool for generating, querying, and analyzing multi-omics comparative databases to facilitate biological understanding [Dataset]. http://doi.org/10.6084/m9.figshare.c.3850801_D3.v1
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Joseph Guhlin; Kevin Silverstein; Peng Zhou; Peter Tiffin; Nevin Young
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gene annotations of top scoring BLAST+ hits for the predicted genes in the four rhizobia strains, as inferred from E. coli MG1655 and E. meliloti 1021. (XLSX 383 kb)

  4. Multi-omics for Understanding Climate Change (MUCC) database v2.0.0

    • zenodo.org
    bin, txt, zip
    Updated Dec 19, 2024
    Cite
    Emily Bechtold; Kelly Wrighton; Mike Wilkins (2024). Multi-omics for Understanding Climate Change (MUCC) database v2.0.0 [Dataset]. http://doi.org/10.5281/zenodo.14532347
    Available download formats: txt, bin, zip
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Emily Bechtold; Kelly Wrighton; Mike Wilkins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the Multi-omics for Understanding Climate Change (MUCC database) version 2.0.0. This current version is based on amplicon and metagenomic sequencing of Old Woman Creek (OWC), Prairie Pothole Region (PPR7 and PPR8), Jean Lafitte National Historical Park and Preserve (JLA), AmeriFlux site US-LA2 (LA2), Stordalen Mire (STM-fen and STM-bog), AmeriFlux site-ID US-Twt (TWI), and Spruce and Peatland Responses Under Changing Environments (SPRUCE) wetland soils. Additionally, this includes metatranscriptome sequencing from OWC. In the future, this will be expanded to include more data from these sites and from additional wetlands.

    OWC, PPR, JLA and LA2 data are deposited in NCBI BioProject PRJNA1007388.

    Stordalen Mire MAGs are deposited in BioProject PRJNA386538.

    AmeriFlux site-ID US-Twt data are deposited in SRA under SRP003022, SRP010671, SRP010730, SRP010738, SRP010741, SRP010747, SRP010748, SRP010751, SRP010862, SRP010870, and SRP011309.

    SPRUCE data are deposited in PRJNA638786 and PRJNA638601.

    Files and datasets included here:

    1. 16S.zip 16S amplicon sequencing data and site metadata for 1,112 samples (fastq files)
    2. MQ_HQ_MAGs.zip Database of 4745 Medium and High Quality MAGs (fasta files)
    3. MUCC_v2.0.0_HQMQ_genes.faa.zip MAG amino acid gene sequences derived from DRAM gene calls (fasta file)
    4. MUCC_v2.0.0_HQMQ_annotations.tsv MAG DRAM annotations
    5. owc_metat_table_methanoregula_genes.csv Metatranscriptomic expression per gene in Methanoregula across 133 metatranscriptomes (csv table)
    6. gtdbtk.ar53.decorated.tree Newick file for the GTDB de novo workflow Methanoregula MAG tree
    7. Newick_gene_trees.zip Trees used in BLAST identification of methylotrophic gene homologs to curate MR for methylotrophy
    8. fasta_reference_genes.zip FASTA reference files of genes used as BLAST query to mine Methanoregula MAGs for genes involved in detoxification of reactive oxygen species (ROS) and methanogenic metabolism of methylated compounds
    9. protpipeliner.py Python script, a modification of protpipeliner.rb, for building RAxML trees
    10. classification_w_outgroup.txt Taxonomy and corresponding MAG ID for Methanoregula used in the tree (Figure 5B)
    11. Methanoregula_metabolism_summary.xlsx The DRAM annotations of the Methanoregula MAGs from MUCC, GTDB, and JGI
    12. Methanoregula_physiology.txt Curation of Methanoregula MAGs for physiological functions of interest
    13. Methanoregula_MAGs_list.txt Comprehensive list of all Methanoregula MAGs used and what database they were sourced from
    14. Methanoregula_MAGs_DB.zip Database of 108 Methanoregula MAGs
  5. L1000 Database

    • bigomics.ch
    Updated Nov 8, 2024
    Cite
    NIH LINCS Program (2024). L1000 Database [Dataset]. https://bigomics.ch/blog/top-databases-for-drug-discovery/
    Dataset updated
    Nov 8, 2024
    Dataset authored and provided by
    NIH LINCS Program
    Description

    A large-scale gene expression database capturing cellular responses to thousands of perturbations.

  6. Dataset 1 - Protein Libraries Of Seven Databases From Cnidaria Omics Data...

    • data.mendeley.com
    Updated Dec 6, 2024
    Cite
    Alexandre Barroso (2024). Dataset 1 - Protein Libraries Of Seven Databases From Cnidaria Omics Data After Duplicates Removal [Dataset]. http://doi.org/10.17632/grwy638mtr.1
    Dataset updated
    Dec 6, 2024
    Authors
    Alexandre Barroso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Non-duplicated protein libraries from seven databases of Cnidaria:

    Db1 – 6 proteomes derived from sequenced genomes of Anthozoa
    Db2 – 2 proteomes derived from sequenced genomes of Medusozoa
    Db3 – 46 whole body/non-specific transcriptomes of Anthozoa
    Db4 – 24 whole body/non-specific transcriptomes of Medusozoa
    Db5 – 25 transcriptomes specific to the tentacles of Anthozoa
    Db6 – 7 transcriptomes specific to the tentacles of Medusozoa
    Db7 – 2 transcriptomes specific to the nematocysts of Anthozoa

  7. MangroveDB: A comprehensive online database for mangroves based on...

    • zenodo.org
    • figshare.com
    zip
    Updated Oct 9, 2024
    Cite
    Chaoqun Xu (2024). MangroveDB: A comprehensive online database for mangroves based on multi-omics data [Dataset]. http://doi.org/10.5281/zenodo.13907062
    Available download formats: zip
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Chaoqun Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 8, 2024
    Description

    Mangroves are the dominant flora of intertidal zones along tropical and subtropical coastlines around the world and offer important ecological and economic value. Recently, the genomes of mangroves have been decoded, and massive omics data have been generated and deposited in public databases. Reanalysis of multi-omics data can provide new biological insights not covered in the original studies. However, the computational resources required and the limited bioinformatics skills of many experimental researchers hinder the effective use of the original data. To fill this gap, we uniformly processed 942 transcriptome datasets and 386 whole-genome sequencing datasets, and provided 13 reference genomes and 40 reference transcriptomes for 53 mangroves. Finally, we built an interactive web-based database platform, MangroveDB (https://github.com/Jasonxu0109/MangroveDB), which was designed to provide comprehensive gene expression datasets to facilitate their exploration and is equipped with several online analysis tools, including principal components analysis, differential gene expression analysis, tissue-specific gene expression analysis, and GO and KEGG enrichment analysis. MangroveDB not only provides query functions for gene annotation but also supports useful visualization functions for analysis results, such as volcano plots, heatmaps, dot plots, PCA plots, bubble plots, and population structure. In conclusion, MangroveDB is a valuable resource that helps the mangrove research community use the massive public omics datasets efficiently.

  8. Cancer Therapeutics Response Portal (CTRP v2)

    • bigomics.ch
    Updated Nov 8, 2024
    Cite
    Broad Institute of MIT and Harvard (2024). Cancer Therapeutics Response Portal (CTRP v2) [Dataset]. https://bigomics.ch/blog/top-databases-for-drug-discovery/
    Dataset updated
    Nov 8, 2024
    Dataset authored and provided by
    Broad Institute of MIT and Harvard
    Description

    A dataset linking genetic and molecular features of cancer cell lines to drug sensitivity.

  9. EukZoo, an aquatic protistan protein database for meta-omics studies.

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Liu, Zhenfeng (2020). EukZoo, an aquatic protistan protein database for meta-omics studies. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1476235
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Hu, Sarah
    Caron, David
    Liu, Zhenfeng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database contains protein sequences of aquatic microbial eukaryotes, or protists. Its purpose is to provide a database of reasonable quality that serves as a resource for both taxonomic and functional interpretation of metagenomic and metatranscriptomic studies of protists. The sequences come mainly from the Marine Microbial Eukaryotes Transcriptome Sequencing Project (MMETSP), supplemented with various genomes and transcriptomes of organisms that were not part of MMETSP.

    To use this database, one has to understand the main function of the three files here.

    (1) The protein sequences are stored in the .faa file. You can build an alignment/search database from it and search your meta-omics sequences against it. Each sequence in the FASTA file has an ID that always consists of two parts, like this: "MMETSP0004_1234567". The text before the first underscore is the source ID of that sequence.

    (2) Taxonomy information for each source ID is stored in "EukZoo_taxonomy_table_v_0.2.tsv". One can use this information together with database search results to assign taxonomy to sequences.

    (3) The KEGG annotation of each sequence is stored in "EukZoo_KEGG_annotation_v_0.2.tsv". One can use this information together with database search results to assign KEGG functional annotation (KO ID) to sequences.
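
    As a rough illustration of how the three files fit together, the sketch below splits a sequence ID at the first underscore and looks up the resulting source ID. Only the ID layout comes from the description above; the function names, the dictionary standing in for the taxonomy table, and the taxon label are hypothetical, and this is not the author's own script.

```python
def source_id(sequence_id: str) -> str:
    """Return the text before the first underscore of a sequence ID."""
    return sequence_id.split("_", 1)[0]


def assign_taxonomy(hit_ids, taxonomy_by_source):
    """Map each search hit to the taxonomy of its source ID (None if absent)."""
    return {hit: taxonomy_by_source.get(source_id(hit)) for hit in hit_ids}


# Toy stand-in for the taxonomy table; the real column layout is not shown here.
toy_taxonomy = {"MMETSP0004": "example taxon"}
result = assign_taxonomy(["MMETSP0004_1234567"], toy_taxonomy)
```

    The same lookup pattern would apply to the KEGG annotation file, except that it is keyed by the full sequence ID rather than the source ID.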

    I also provide scripts to assign taxonomy and KEGG annotation from database search results. You can find the scripts and explanations of how to use them on the EukZoo GitHub page, along with details on how the database was created and curated.

    Please contact me at zhenfeng.liu1@gmail.com if you have any questions or requests. Thank you for your interest in EukZoo.

  10. NCI-60 Cancer Cell Lines

    • bigomics.ch
    Updated Nov 8, 2024
    Cite
    National Cancer Institute (NCI) (2024). NCI-60 Cancer Cell Lines [Dataset]. https://bigomics.ch/blog/top-databases-for-drug-discovery/
    Dataset updated
    Nov 8, 2024
    Dataset provided by
    National Cancer Institute (http://www.cancer.gov/)
    Authors
    National Cancer Institute (NCI)
    Description

    A panel of 60 human cancer cell lines used for screening anticancer drugs.

  11. Additional file 2: Table S2. of ODG: Omics database generator - a tool for...

    • springernature.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Joseph Guhlin; Kevin Silverstein; Peng Zhou; Peter Tiffin; Nevin Young (2023). Additional file 2: Table S2. of ODG: Omics database generator - a tool for generating, querying, and analyzing multi-omics comparative databases to facilitate biological understanding [Dataset]. http://doi.org/10.6084/m9.figshare.c.3850801_D2.v1
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    figshare
    Authors
    Joseph Guhlin; Kevin Silverstein; Peng Zhou; Peter Tiffin; Nevin Young
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pfam domains and biological process GO categories for the four rhizobia strains. Predicted proteins related to multiple GO biological process categories are joined together with the pipe character. (XLSX 639 kb)
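
    A pipe-joined cell of this kind can be split back into individual categories with a few lines of Python. The function name and the placeholder strings are hypothetical; the actual cell contents of the spreadsheet are not shown in this listing.

```python
def split_go_categories(cell: str) -> list[str]:
    """Split a pipe-joined cell into a list of category strings.

    Empty cells and stray whitespace around entries are tolerated.
    """
    return [part.strip() for part in cell.split("|") if part.strip()]


# Placeholder text only; not real entries from Table S2.
example = split_go_categories("category one|category two")
```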

  12. Identification of novel biomarkers for thyroid cancer using multi omics data...

    • dataverse.harvard.edu
    Updated Jun 2, 2022
    Cite
    Cheena Dhingra (2022). Identification of novel biomarkers for thyroid cancer using multi omics data analysis [Dataset]. http://doi.org/10.7910/DVN/K4F6DM
    Dataset updated
    Jun 2, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Cheena Dhingra
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The biomarkers for thyroid cancer are still not well characterized. Such biomarkers could be targeted specifically when treating thyroid cancer. In this project, we used bioinformatics tools to identify biomarkers associated with thyroid cancer. The Gene Expression Omnibus (GEO) database was used to find datasets related to thyroid cancer, and their expression profiles were downloaded. Four datasets, GSE3467, GSE3678, GSE33630, and GSE53157, were identified from GEO. GSE3467 contains nine thyroid tumor samples and nine normal thyroid tissue samples. GSE3678 contains seven thyroid tumor samples and seven normal thyroid tissue samples. GSE53157 contains twenty-four thyroid tumor samples and three normal thyroid samples. GSE33630 contains sixty thyroid tumor samples and forty-five normal thyroid samples. The four datasets were analyzed individually and then integrated to find the genes they have in common. Microarray analysis of the datasets was performed in Excel. A t-test was performed for each of the four datasets on a separate Excel sheet. The data were normalized by converting the values to a log scale. Differential expression analysis of all four datasets was done to identify differentially expressed genes (DEGs). Only upregulated genes were taken into account. Principal component analysis (PCA) of all four datasets was performed on the raw data using the T-BioInfo server, and the scatterplots were prepared in Excel. RStudio was used to match gene symbols with the corresponding probe IDs using a left join. An inner join in R was used to find the genes shared between the four datasets. Heatmaps of all four datasets were produced in RStudio. To show the number of intersecting differentially expressed genes, an UpSet plot was prepared in RStudio.
    74 genes with their corresponding probe IDs were found to be shared among the four datasets; these genes are common to at least two datasets. These 74 common genes were analyzed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) to study their Gene Ontology (GO) functional annotations and pathways. According to the GO functional annotation results, most of the integrated upregulated genes were involved in protein binding, plasma membrane, and integral component of membrane. The most common pathways include extracellular matrix organization, neutrophil degranulation, the TGF-beta signaling pathway, and epithelial-to-mesenchymal transition in colorectal cancer. The 74 genes were submitted to the STRING database to find protein-protein interactions between them. Interactions between the nodes were downloaded from STRING and imported into Cytoscape. The Cytoscape analysis showed that only 19 genes had protein-protein interactions with each other. Disease-free survival analysis of the 13 genes common to three datasets was done using GEPIA. Boxplots of these 13 genes were also prepared using GEPIA. These showed that the differentially expressed genes are expressed differently in normal thyroid tissue and thyroid tumor samples. Hence, these 13 genes common to three datasets can be used as potential biomarkers for thyroid cancer. Among these 13 genes, four are implicated in cancer or cell proliferation and may be targets for treatment.
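
    The description speaks of genes shared by the datasets at several levels (at least two, three, all four). The sketch below shows one way to count, for each gene, how many datasets it appears in. It is written in Python rather than the R joins the authors report using, and the gene names and set contents are invented for illustration; only the four GEO accession numbers come from the description.

```python
from collections import Counter


def genes_shared_by(datasets, min_datasets):
    """Return genes present in at least `min_datasets` of the given gene sets."""
    counts = Counter(gene for genes in datasets.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_datasets)


# Invented gene sets; they do not reflect the real upregulated genes.
toy = {
    "GSE3467": {"GENE_A", "GENE_B", "GENE_C"},
    "GSE3678": {"GENE_A", "GENE_B"},
    "GSE33630": {"GENE_A", "GENE_C"},
    "GSE53157": {"GENE_A"},
}
in_all = genes_shared_by(toy, min_datasets=4)
in_two = genes_shared_by(toy, min_datasets=2)
```

    Setting `min_datasets` to the number of datasets reproduces a strict inner join, while lower values give the looser "common to at least N datasets" lists.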

  13. DataSheet_1_Combining Partially Overlapping Multi-Omics Data in Databases...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Deniz Akdemir; Ron Knox; Julio Isidro y Sánchez (2023). DataSheet_1_Combining Partially Overlapping Multi-Omics Data in Databases Using Relationship Matrices.pdf [Dataset]. http://doi.org/10.3389/fpls.2020.00947.s001
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Deniz Akdemir; Ron Knox; Julio Isidro y Sánchez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Private and public breeding programs, as well as companies and universities, have developed different genomics technologies that have generated unprecedented amounts of sequence data. The magnitude and complexity of these datasets bring new challenges in data management, query, and analysis, but also an opportunity to use the available data as a whole. Detailed phenotype data, combined with increasing amounts of genomic data, have enormous potential to accelerate the identification of key traits and improve our understanding of quantitative genetics. Data harmonization enables cross-national and international comparative research, facilitating the extraction of new scientific knowledge. In this paper, we address the complex issue of combining high-dimensional and unbalanced omics data. More specifically, we propose a covariance-based method for combining partial datasets in the genotype-to-phenotype spectrum. This method can be used to combine partially overlapping relationship/covariance matrices. Here, we show with applications that our approach might be advantageous compared with feature-imputation-based approaches; we demonstrate how this method can be used in genomic prediction using heterogeneous marker data and also how to combine the data from multiple phenotypic experiments to make inferences about previously unobserved trait relationships. Our results demonstrate that it is possible to harmonize datasets to improve available information across gene banks, data repositories, or other data resources.

  14. CLUE - CMap and LINCS Unified Environment

    • bigomics.ch
    Updated Nov 8, 2024
    Cite
    Broad Institute (2024). CLUE - CMap and LINCS Unified Environment [Dataset]. https://bigomics.ch/blog/top-databases-for-drug-discovery/
    Dataset updated
    Nov 8, 2024
    Dataset authored and provided by
    Broad Institute
    Description

    A platform integrating Connectivity Map (CMap) and LINCS data for drug discovery.

  15. MEANtools: multi-omics integration towards metabolite anticipation and...

    • dataverse.nl
    bin, csv
    Updated Apr 30, 2025
    Cite
    Kumar Saurabh Singh (2025). MEANtools: multi-omics integration towards metabolite anticipation and biosynthetic pathway prediction [Dataset]. http://doi.org/10.34894/2MVBGK
    Available download formats: csv (239905790), bin (260972544), csv (809150)
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    DataverseNL
    Authors
    Kumar Saurabh Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 6, 2025 - Jan 6, 2030
    Dataset funded by
    NWO
    Description

    During evolution, plants have developed the ability to produce a vast array of specialized metabolites, which play crucial roles in helping plants adapt to different environmental niches. However, their biosynthetic pathways remain largely elusive. In the past decades, increasing numbers of plant biosynthetic pathways have been elucidated based on approaches utilizing genomics, transcriptomics, and metabolomics. These efforts, however, are limited by the fact that they typically adopt a target-based approach, requiring prior knowledge. Here, we present MEANtools, a systematic and unsupervised computational integrative omics workflow to predict candidate metabolic pathways de novo by leveraging knowledge of general reaction rules and metabolic structures stored in public databases. In our approach, possible connections between metabolites and transcripts that show correlated abundance across samples are identified using reaction rules linked to the transcript-encoded enzyme families. MEANtools thus assesses whether these reactions can connect transcript-correlated mass features within a candidate metabolic pathway. We validate MEANtools using a paired transcriptomic-metabolomic dataset recently generated to reconstruct the falcarindiol biosynthetic pathway in tomato. MEANtools correctly anticipated five out of seven steps of the characterized pathway and also identified other candidate pathways involved in specialized metabolism, which demonstrates its potential for hypothesis generation. Altogether, MEANtools represents a significant advancement to integrate multi-omics data for the elucidation of biochemical pathways in plants and beyond.

  16. CLImate for Maize OMICS: CLIM4OMICS Analytics and Database

    • data.niaid.nih.gov
    Updated Jun 25, 2023
    Cite
    Munoz-Arriola, Francisco (2023). CLImate for Maize OMICS: CLIM4OMICS Analytics and Database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7490245
    Dataset updated
    Jun 25, 2023
    Dataset provided by
    Sarzaeim, Parisa
    Aslam, Hasnat
    Munoz-Arriola, Francisco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CLIM4OMICS Analytics and Database is an improved database of the G2F data repository that contains OMICs (genetic and phenotypic) and environmental data for maize yield predictability across 84 experimental fields in the U.S. and the province of Ontario, Canada, between 2014 and 2021. The goal of this pipeline is to aggregate, improve, and synthesize multi-dimensional G2F data, including genotype, phenotype, and environmental data, for GxE modeling. This dataset contains 79,122 phenotype measurements, 378 genotypes of maize lines, environmental data for 178 locations, and Python scripts for the quality control (QC) and consistency control (CC) steps and for ML models of GxE interactions. The environmental data are extracted from the NWS, DayMet, and NSRDB databases and processed for QC and CC. The environmental dataset contains the minimum temperature (Tmin), average temperature (Tmean), maximum temperature (Tmax), minimum dew point (DPmin), average dew point (DPmean), maximum dew point (DPmax), minimum relative humidity (RHmin), average relative humidity (RHmean), maximum relative humidity (RHmax), minimum solar radiation (SRmin), average solar radiation (SRmean), maximum solar radiation (SRmax), accumulative rainfall (Racc), average wind speed (WSmean), and average wind direction (WDmean). This package also contains the raw G2F data and preprocessing pipeline.

  17. Multi-omics for Understanding Climate Change (MUCC) database v1.0.0

    • explore.openaire.eu
    Updated Jul 28, 2023
    Cite
    Angela A. Oliverio; Mikayla A. Borton; Adrienne Narrowe; Kelly C. Wrighton (2023). Multi-omics for Understanding Climate Change (MUCC) database v1.0.0 [Dataset]. http://doi.org/10.5281/zenodo.8194033
    Dataset updated
    Jul 28, 2023
    Authors
    Angela A. Oliverio; Mikayla A. Borton; Adrienne Narrowe; Kelly C. Wrighton
    Description

    This is the Multi-omics for Understanding Climate Change (MUCC database) version 1.0.0. This current version is based on metagenomic and metatranscriptomic sequencing of Old Woman Creek wetland soils, but will be expanded in the future to include data from additional wetlands.

    Files and datasets included here:

    1. MAGs.zip Dereplicated database of 2,502 MAGs (fasta files)
    2. OWC_HQMQ_DB_genes.faa.gz MAG amino acid gene sequences derived from DRAM gene calls (fasta file)
    3. OWC_HQMQ_DB_ANNOTATIONS_20220208.txt.gz MAG DRAM annotations
    4. owc_metat_table_mags.csv Metatranscriptomic expression per MAG across 133 metatranscriptomes (csv table)
    5. owc_metat_table_mags_genes.csv Metatranscriptomic expression per gene across 133 metatranscriptomes (csv table)
    6. owc_metat_table_mags_genes_annotations.csv corresponding DRAM annotations to #5 for transcribed genes (csv table)

  18. Genomics of Drug Sensitivity in Cancer (GDSC)

    • bigomics.ch
    Updated Nov 8, 2024
    Cite
    Wellcome Sanger Institute (2024). Genomics of Drug Sensitivity in Cancer (GDSC) [Dataset]. https://bigomics.ch/blog/top-databases-for-drug-discovery/
    Dataset updated
    Nov 8, 2024
    Dataset authored and provided by
    Wellcome Sanger Institute
    Description

    A dataset containing drug response profiles for over 600 compounds across multiple cancer cell lines.

  19. OMICS metadata

    • enm-dev.adma.ai
    • enm-legacy.adma.ai
    • +1more
    Updated Oct 18, 2023
    Cite
    OMICS metadata (2023). OMICS metadata [Dataset]. https://enm-dev.adma.ai/about/enanomapper/
    Dataset updated
    Oct 18, 2023
    Dataset authored and provided by
    OMICS metadata
    License

    https://enanomapper.adma.ai/about/omics

    Description

    OMICS metadata: nanosafety-relevant omics data. A database covering metadata for transcriptomics, proteomics, and microRNA expression data relevant to safety assessment analyses of nanomaterials.

  20. Multi-omics for Understanding Climate Change (MUCC) database v2.0.0

    • zenodo.org
    bin, csv, tsv, zip
    Updated May 18, 2024
    + more versions
    Cite
    Emily Bechtold; Kelly Wrighton; Mike Wilkins (2024). Multi-omics for Understanding Climate Change (MUCC) database v2.0.0 [Dataset]. http://doi.org/10.5281/zenodo.10822869
    Available download formats: bin, zip, tsv, csv
    Dataset updated
    May 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Emily Bechtold; Kelly Wrighton; Mike Wilkins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the Multi-omics for Understanding Climate Change (MUCC database) version 2.0.0. This current version is based on amplicon and metagenomic sequencing of wetland soils from Old Woman Creek (OWC), Prairie Pothole Region (PPR7 and PPR8), Jean Lafitte National Historical Park and Preserve (JLA), AmeriFlux site US-LA2 (LA2), Stordalen Mire (STM-fen and STM-bog), AmeriFlux site-ID US-Twt (TWI), and Spruce and Peatland Responses Under Changing Environments (SPRUCE). It also includes metatranscriptome sequencing from OWC. In the future, it will be expanded to include more data from these sites and from additional wetlands.

    OWC, PPR, JLA, and LA2 data are deposited in NCBI BioProject PRJNA1007388.

    Stordalen Mire MAGs are deposited in BioProject PRJNA386538.

    AmeriFlux site-ID US-Twt data are deposited in SRA under SRP003022, SRP010671, SRP010730, SRP010738, SRP010741, SRP010747, SRP010748, SRP010751, SRP010862, SRP010870, and SRP011309.

    SPRUCE data are deposited in PRJNA638786 and PRJNA638601.

    Files and datasets included here:

    1. 16S.zip 16S amplicon sequencing data and site metadata for 1,112 samples (fastq files)
    2. MQ_HQ_MAGs.zip Database of 4,745 medium- and high-quality MAGs (fasta files)
    3. MUCC_v2.0.0_HQMQ_genes.faa.zip MAG amino acid gene sequences derived from DRAM gene calls (fasta file)
    4. MUCC_v2.0.0_HQMQ_annotations.tsv MAG DRAM annotations
    5. owc_metat_table_methanoregula_genes.csv Metatranscriptomic expression per gene in Methanoregula across 133 metatranscriptomes (csv table)
    6. gtdbtk.ar53.decorated.tree Newick file for the GTDB de novo workflow Methanoregula MAG tree
    7. Newick_gene_trees.zip Trees used in blast identification of methylotrophic gene homologs to curate MR for methylotrophy
    8. fasta_reference_genes.zip FASTA reference files of genes used as BLAST query to mine Methanoregula MAGs for genes involved in detoxification of reactive oxygen species (ROS) and methanogenic metabolism of methylated compounds