Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Successful machine learning models depend on high-quality training datasets with sufficient data volume and adequate preprocessing. However, although several public data portals exist, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open databases such as LinkedOmics, these resources cannot be used off the shelf with existing machine learning models. We propose MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2:
Table S1. Cell counts of the adult and fetal tissue groups at each omics level.
Table S2. Filtered matrix raw read counts for scRNA-Seq across tissues in both fetal and adult groups. Cell_Count_Filtered_Matrix column represents raw read counts initially obtained from published studies or after filtering for the removal of background noises.
Table S3. Statistics of the upregulated genes from adult and fetal tissues, filtered by average Log2FoldChange > 0.25 and adjusted P of 0.05. Clusters represent cell types. Genes were ranked by average log2-fold-change.
Table S4. Top receptor–ligand interaction profiles of the cell types in the 38 matching adult and fetal tissues. Interaction analysis was done separately for each tissue, and information on the interaction pairs can be viewed from the first column.
Table S5. Top clonotypes (VDJ gene combinations) of each cell type present in the T and B cell repertoires.
Table S6. Top TFs in the pseudotime transitions of adult and fetal colon cell types.
Table S7. Top receptor–ligand pairs in spatial transcriptomics of adult colons (colon 1 and colon 2) as well as in scRNA-seq adult and fetal colons. The first column represents the data type to which the interactions belong. Table ranked by decreasing interaction ratios.
Table S8. Comparison of SCA with other single-cell omics databases. Green tick indicates a yes and a red cross indicates a no.
Table S9. List of public resources included in the SCA database portal. SCA_PID refers to SCA-designated project identity number (PID).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gene annotations of top scoring BLAST+ hits for the predicted genes in the four rhizobia strains, as inferred from E. coli MG1655 and E. meliloti 1021. (XLSX 383 kb)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the Multi-omics for Understanding Climate Change (MUCC database) version 2.0.0. This current version is based on amplicon and metagenomic sequencing of Old Woman Creek (OWC), Prairie Pothole Region (PPR7 and PPR8), Jean Lafitte National Historical Park and Preserve (JLA), AmeriFlux site US-LA2 (LA2), Stordalen Mire (STM-fen and STM-bog), AmeriFlux site-ID US-Twt (TWI), and Peatland Responses Under Changing Environments (SPRUCE) wetland soils. Additionally, this includes metatranscriptome sequencing from OWC. In the future, this will be expanded to include more data from these sites and from additional wetlands.
OWC, PPR, JLA and LA2 data are deposited in NCBI Bioproject PRJNA1007388
Stordalen Mire MAGs are deposited in BioProject PRJNA386538
AmeriFlux site-ID US-Twt data are deposited in SRA under SRP003022, SRP010671, SRP010730, SRP010738, SRP010741, SRP010747, SRP010748, SRP010751, SRP010862, SRP010870, and SRP011309.
SPRUCE data are deposited in PRJNA638786 and PRJNA638601
Files and datasets included here:
A large-scale gene expression database capturing cellular responses to thousands of perturbations.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Non-duplicated protein libraries from seven databases of Cnidaria:
Db1 – 6 proteomes derived from sequenced genomes of Anthozoa
Db2 – 2 proteomes derived from sequenced genomes of Medusozoa
Db3 – 46 whole body/non-specific transcriptomes of Anthozoa
Db4 – 24 whole body/non-specific transcriptomes of Medusozoa
Db5 – 25 transcriptomes specific to the tentacles of Anthozoa
Db6 – 7 transcriptomes specific to the tentacles of Medusozoa
Db7 – 2 transcriptomes specific to the nematocysts of Anthozoa
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mangroves are the dominant flora of intertidal zones along tropical and subtropical coastlines around the world and offer important ecological and economic value. Recently, the genomes of mangroves have been decoded, and massive amounts of omics data have been generated and deposited in public databases. Reanalysis of multi-omics data can provide new biological insights not covered in the original studies. However, the computational resources required and the lack of bioinformatics skills among experimental researchers limit the effective use of the original data. To fill this gap, we uniformly processed 942 transcriptome datasets and 386 whole-genome sequencing datasets, and provided 13 reference genomes and 40 reference transcriptomes for 53 mangroves. Finally, we built MangroveDB (https://github.com/Jasonxu0109/MangroveDB), an interactive web-based database platform designed to provide comprehensive gene expression datasets and to facilitate their exploration. It is equipped with several online analysis tools, including principal component analysis, differential gene expression analysis, tissue-specific gene expression analysis, and GO and KEGG enrichment analysis. MangroveDB not only provides query functions for gene annotation but also supports useful visualizations of analysis results, such as volcano plots, heatmaps, dot plots, PCA plots, bubble plots, and population structure. In conclusion, MangroveDB is a valuable resource that helps the mangrove research community make efficient use of the massive public omics datasets.
A dataset linking genetic and molecular features of cancer cell lines to drug sensitivity.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains protein sequences of aquatic microbial eukaryotes, or protists. Its purpose is to provide a database of reasonable quality that can serve as a resource for both taxonomic and functional interpretation of metagenomic and metatranscriptomic studies of protists. The sequences come mainly from the Marine Microbial Eukaryotes Transcriptome Sequencing Project (MMETSP), supplemented with various genomes and transcriptomes of organisms that were not part of MMETSP.
To use this database, one has to understand the main function of the three files here.
(1) The protein sequences are stored in the .faa file. You can build an alignment/search database from it and search your meta-omics sequences against it. Each sequence in the FASTA file has an ID that always consists of two parts, like this: "MMETSP0004_1234567". The text before the first underscore is the source ID of that sequence.
(2) Taxonomy information for each source ID is stored in "EukZoo_taxonomy_table_v_0.2.tsv". One can use this information in conjunction with database search results to assign taxonomy to sequences.
(3) The KEGG annotation of each sequence is stored in "EukZoo_KEGG_annotation_v_0.2.tsv". One can use this information in conjunction with database search results to assign KEGG functional annotation (KO ID) to sequences.
I also provide scripts to assign taxonomy and KEGG annotation from database search results. You can find the scripts, along with explanations of how to use them, on the EukZoo GitHub page. Details on how the database was created and curated are available there as well.
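As a rough illustration of how the three files fit together, here is a minimal Python sketch. It is not one of the EukZoo scripts. It only relies on the stated ID rule (the source ID is the text before the first underscore) and assumes, hypothetically, that the first column of the taxonomy table holds the source ID. The column names and the table row below are made up for the example.

```python
import csv
import io


def source_id(sequence_id: str) -> str:
    """Return the text before the first underscore of a sequence ID."""
    return sequence_id.split("_", 1)[0]


def load_taxonomy(handle) -> dict:
    """Map source ID -> row of a tab-separated table.

    Assumption: the first column holds the source ID and the first
    line is a header. Check the real file before relying on this.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    return {row[0]: dict(zip(header, row)) for row in reader if row}


# Hypothetical stand-in for the taxonomy table (column names are invented).
example_table = io.StringIO(
    "Source_ID\tTaxonomy\n"
    "MMETSP0004\tplaceholder lineage\n"
)
taxonomy = load_taxonomy(example_table)

# A database search hit is identified by its sequence ID.
hit = "MMETSP0004_1234567"
print(source_id(hit), taxonomy[source_id(hit)]["Taxonomy"])
```

The same lookup pattern would apply to the KEGG annotation file, except that the key there would be the full sequence ID rather than the source ID.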
Please contact me at zhenfeng.liu1@gmail.com if you have any questions or requests. Thank you for your interest in EukZoo.
A panel of 60 human cancer cell lines used for screening anticancer drugs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PFam Domains and biological process GO categories for the four rhizobia strains. Predicted proteins related to multiple GO biological process categories are joined together with the pipe character. (XLSX 639 kb)
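When reading such a table programmatically, the pipe-joined cell can be split back into individual categories. The sketch below is only an illustration; the function name and the example cell value are hypothetical, not taken from the file.

```python
def split_go_categories(cell: str) -> list:
    """Split a pipe-joined cell into a list of category names."""
    return [term.strip() for term in cell.split("|") if term.strip()]


# Illustrative cell value, not copied from the spreadsheet.
print(split_go_categories("transport|metabolic process"))
```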
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The biomarkers for thyroid cancer are still not well characterized. Such biomarkers could be targeted specifically when treating thyroid cancer. In this project, we used bioinformatics tools to identify biomarkers associated with thyroid cancer. The Gene Expression Omnibus (GEO) database was searched for datasets related to thyroid cancer, and their expression profiles were downloaded. Four datasets, GSE3467, GSE3678, GSE33630, and GSE53157, were identified. GSE3467 contains nine thyroid tumor samples and nine normal thyroid tissue samples. GSE3678 contains seven thyroid tumor samples and seven normal thyroid tissue samples. GSE53157 contains twenty-four thyroid tumor samples and three normal thyroid samples. GSE33630 contains sixty thyroid tumor samples and forty-five normal thyroid samples. The four datasets were analyzed individually and then integrated to find the genes they have in common. The microarray analysis was performed in Excel. A t-test (T.Test) was run for each of the four datasets on a separate Excel sheet. The data were normalized by converting the values to a log scale. Differential expression analysis was carried out on all four datasets to identify differentially expressed genes (DEGs); only upregulated genes were taken into account. Principal component analysis (PCA) was performed on the raw data of all four datasets using the T-BioInfo server, and the scatterplots were prepared in Excel. RStudio was used to match the gene symbols with the corresponding probe IDs using a left join, and an inner join in R was used to find the genes shared between the four datasets. Heatmaps of all four datasets were generated in RStudio. To show the number of intersecting differentially expressed genes, an UpSet plot was prepared in RStudio.
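The per-gene comparison described above (log transformation followed by a t-test between tumor and normal samples) can be sketched in Python. This is only an illustration, not the authors' Excel workflow. The expression values are invented, the base-2 logarithm and the use of Welch's unequal-variance t statistic are assumptions, and the p-value step is left out.

```python
import math
from statistics import mean, variance


def log2_values(values):
    """Convert raw intensities to a log2 scale (base 2 is an assumption)."""
    return [math.log2(v) for v in values]


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Invented intensities for one probe; real data would come from a GEO series.
tumor = log2_values([8.0, 16.0, 32.0])
normal = log2_values([2.0, 4.0, 8.0])

log2_fold_change = mean(tumor) - mean(normal)  # positive means upregulated in tumor
t_statistic = welch_t(tumor, normal)
print(log2_fold_change, t_statistic)
```

A gene would count as upregulated here when the log2 fold change is positive and the test is significant at whatever threshold the analysis uses.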
In total, 74 genes with their corresponding probe IDs were found to be shared among the four datasets; each of these genes is common to at least two datasets. These 74 common genes were analyzed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) to study their Gene Ontology (GO) functional annotations and pathways. According to the GO functional annotation results, most of the integrated upregulated genes were involved in protein binding, plasma membrane, and integral component of membrane. The most common pathways include extracellular matrix organization, neutrophil degranulation, the TGF-beta signaling pathway, and epithelial to mesenchymal transition in colorectal cancer. The 74 genes were submitted to the STRING database to find protein-protein interactions between them. The interactions between the nodes were downloaded from STRING and imported into Cytoscape. The Cytoscape analysis showed that only 19 genes had protein-protein interactions with each other. Disease-free survival analysis of the 13 genes that were common to three datasets was done using GEPIA. Boxplots of these 13 genes were also prepared using GEPIA. These showed that the differentially expressed genes are expressed differently in normal thyroid tissue and thyroid tumor samples. Hence, these 13 genes common to three datasets can be used as potential biomarkers for thyroid cancer. Among these 13 genes, four are implicated in cancer or cell proliferation and may be probable targets for treatment.
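The integration step, counting in how many datasets each upregulated gene appears, can be sketched as follows. This is a minimal Python illustration rather than the R joins used in the project, and the gene names are placeholders, not results.

```python
from collections import Counter


def dataset_counts(upregulated):
    """Count, for every gene, the number of datasets in which it is upregulated."""
    return Counter(gene for genes in upregulated.values() for gene in set(genes))


# Placeholder gene sets keyed by GEO series; the memberships are invented.
upregulated = {
    "GSE3467": {"GENE_A", "GENE_B", "GENE_C"},
    "GSE3678": {"GENE_A", "GENE_B"},
    "GSE33630": {"GENE_A", "GENE_C", "GENE_D"},
    "GSE53157": {"GENE_A"},
}

counts = dataset_counts(upregulated)
in_at_least_two = sorted(g for g, n in counts.items() if n >= 2)
in_at_least_three = sorted(g for g, n in counts.items() if n >= 3)
print(in_at_least_two, in_at_least_three)
```

These per-gene counts carry the same information that an UpSet plot summarizes visually.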
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Private and public breeding programs, as well as companies and universities, have developed different genomics technologies that have generated unprecedented amounts of sequence data, which bring new challenges in terms of data management, query, and analysis. The magnitude and complexity of these datasets are demanding, but they also offer an opportunity to use the available data as a whole. Detailed phenotype data, combined with increasing amounts of genomic data, have enormous potential to accelerate the identification of key traits and to improve our understanding of quantitative genetics. Data harmonization enables cross-national and international comparative research, facilitating the extraction of new scientific knowledge. In this paper, we address the complex issue of combining high-dimensional and unbalanced omics data. More specifically, we propose a covariance-based method for combining partial datasets in the genotype-to-phenotype spectrum. This method can be used to combine partially overlapping relationship/covariance matrices. Here, we show with applications that our approach might be advantageous compared with feature-imputation-based approaches; we demonstrate how this method can be used in genomic prediction with heterogeneous marker data, and also how to combine data from multiple phenotypic experiments to make inferences about previously unobserved trait relationships. Our results demonstrate that it is possible to harmonize datasets to improve the information available across gene banks, data repositories, or other data resources.
A platform integrating Connectivity Map (CMap) and LINCS data for drug discovery.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
During evolution, plants have developed the ability to produce a vast array of specialized metabolites, which play crucial roles in helping plants adapt to different environmental niches. However, their biosynthetic pathways remain largely elusive. In the past decades, increasing numbers of plant biosynthetic pathways have been elucidated based on approaches utilizing genomics, transcriptomics, and metabolomics. These efforts, however, are limited by the fact that they typically adopt a target-based approach, requiring prior knowledge. Here, we present MEANtools, a systematic and unsupervised computational integrative omics workflow to predict candidate metabolic pathways de novo by leveraging knowledge of general reaction rules and metabolic structures stored in public databases. In our approach, possible connections between metabolites and transcripts that show correlated abundance across samples are identified using reaction rules linked to the transcript-encoded enzyme families. MEANtools thus assesses whether these reactions can connect transcript-correlated mass features within a candidate metabolic pathway. We validate MEANtools using a paired transcriptomic-metabolomic dataset recently generated to reconstruct the falcarindiol biosynthetic pathway in tomato. MEANtools correctly anticipated five out of seven steps of the characterized pathway and also identified other candidate pathways involved in specialized metabolism, which demonstrates its potential for hypothesis generation. Altogether, MEANtools represents a significant advancement to integrate multi-omics data for the elucidation of biochemical pathways in plants and beyond.
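The first idea in the workflow above, pairing transcripts and mass features whose abundances are correlated across samples, can be illustrated with a small Python sketch. This is not the MEANtools implementation. The choice of Pearson correlation, the 0.8 threshold, and all names and values are assumptions made for the example, and the reaction-rule step is not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Invented abundances across four samples.
transcripts = {
    "transcript_1": [1.0, 2.0, 3.0, 4.0],
    "transcript_2": [4.0, 1.0, 3.0, 2.0],
}
mass_features = {"feature_1": [2.0, 4.0, 6.0, 8.0]}

THRESHOLD = 0.8  # hypothetical cut-off

correlated_pairs = [
    (t_name, f_name)
    for t_name, t_profile in transcripts.items()
    for f_name, f_profile in mass_features.items()
    if pearson(t_profile, f_profile) >= THRESHOLD
]
print(correlated_pairs)
```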
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CLIM4OMICS Analytics and Database is an improved database of the G2F data repository that contains omics (genetic and phenotypic) and environmental data for maize yield predictability across 84 experimental fields in the U.S. and the province of Ontario, Canada, between 2014 and 2021. The goal of this pipeline is to aggregate, improve, and synthesize multi-dimensional G2F data, including genotype, phenotype, and environmental data, for GxE modeling. This dataset contains 79,122 phenotype measurements, 378 genotypes of maize lines, environmental data from 178 locations, and Python scripts for the quality control (QC) and consistency control (CC) steps and for ML models of GxE interactions. The environmental data are extracted from the NWS, DayMet, and NSRDB databases and processed for QC and CC. The environmental dataset contains the minimum temperature (Tmin), average temperature (Tmean), maximum temperature (Tmax), minimum dew point (DPmin), average dew point (DPmean), maximum dew point (DPmax), minimum relative humidity (RHmin), average relative humidity (RHmean), maximum relative humidity (RHmax), minimum solar radiation (SRmin), average solar radiation (SRmean), maximum solar radiation (SRmax), accumulative rainfall (Racc), average wind speed (WSmean), and average wind direction (WDmean). This package also contains the raw G2F data and the preprocessing pipeline.
This is the Multi-omics for Understanding Climate Change (MUCC database) version 1.0.0. This current version is based on metagenomic and metatranscriptomic sequencing of Old Woman Creek wetland soils, but will be expanded in the future to include data from additional wetlands.
Files and datasets included here:
(1) MAGs.zip: Dereplicated database of 2,502 MAGs (fasta files)
(2) OWC_HQMQ_DB_genes.faa.gz: MAG amino acid gene sequences derived from DRAM gene calls (fasta file)
(3) OWC_HQMQ_DB_ANNOTATIONS_20220208.txt.gz: MAG DRAM annotations
(4) owc_metat_table_mags.csv: Metatranscriptomic expression per MAG across 133 metatranscriptomes (csv table)
(5) owc_metat_table_mags_genes.csv: Metatranscriptomic expression per gene across 133 metatranscriptomes (csv table)
(6) owc_metat_table_mags_genes_annotations.csv: corresponding DRAM annotations to #5 for transcribed genes (csv table)
A dataset containing drug response profiles for over 600 compounds across multiple cancer cell lines.
https://enanomapper.adma.ai/about/omics
OMICS metadata: Nanosafety-relevant omics data, a database covering metadata for transcriptomics, proteomics and microRNA expression data relevant to safety assessment analyses of nanomaterials