Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification performance. Mean accuracy values obtained over the 30 bootstrap iterations. Acc is the overall accuracy, F is the F-score, and G is the G-score. The highest values are highlighted in bold. Note: all corresponding standard deviations are less than 0.02.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High-risk neuroblastoma is a very aggressive disease, with excessive tumor growth and poor outcomes. Proper stratification of high-risk patients by prognostic outcome is important for treatment. However, survival stratification for high-risk neuroblastoma is still lacking. To fill this gap, we adopt a deep learning algorithm, Autoencoder, to integrate multi-omics data, and combine it with K-means clustering to identify two subtypes with significant survival differences. Compared with PCA, iCluster, and DGscore for classification based on multi-omics data integration, the Autoencoder-based classification outperforms the alternative approaches. Furthermore, we validated the classification in two independent datasets by training machine-learning classification models, and confirmed its robustness. Functional analysis revealed that MYCN amplification occurred more frequently in the ultra-high-risk subtype, in accordance with the overexpression of MYC/MYCN targets in this subtype. In summary, prognostic subtypes identified by deep learning-based multi-omics integration could not only improve our understanding of the molecular mechanism, but also help clinicians make decisions.
In aquatic environments, the production and consumption of organic compounds are directly tied to the metabolic potential of the in situ microbial community. The community’s metabolic potential can be assessed using metatranscriptomics, which is a measure of gene expression in the environment. More recently, advances in metabolomics have delivered details on specific organic compounds found in the environment. While the integration of these two datasets is desirable, the vast data streams can hinder investigations into biological and chemical processes. Here, we use data in the Kyoto Encyclopedia of Genes and Genomes (KEGG) combined with K-means clustering to jointly interrogate metabolomics and metatranscriptomics data collected from a coastal marine environment. Using KEGG allowed us to focus our analysis on genes and compounds connected by a defined biochemical reaction. The K-means clustering provided an unbiased means to place metabolites and genes into groups based on their temporal variability. Each of the groups defined by the K-means clustering contained both transcripts and metabolites, which emphasized the interconnected nature of these two datasets. While conceptually simple, this analysis allowed us to explore a tractable number of biochemical pathways within our data. Continued development of computational tools to analyze meta-omics data is a pressing need. The application of tools such as those described here is an exciting step towards integrated multi-omics data analysis that can be used to address broad questions in biogeochemical cycling.
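The grouping of transcripts and metabolites by temporal variability can be illustrated with a minimal sketch. It assumes each feature is a time series that is z-scored before clustering, so that features with the same temporal shape group together regardless of scale. The feature names, the values, and the choice to seed centroids with the first k points are hypothetical and are not taken from the study.

```python
import math

def zscore(series):
    """Scale one time series to mean 0 and standard deviation 1."""
    mean = sum(series) / len(series)
    sd = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / sd if sd > 0 else 0.0 for x in series]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical abundances at four time points.
profiles = {
    "transcript_A": [1, 2, 3, 4],
    "transcript_B": [4, 3, 2, 1],
    "metabolite_X": [10, 20, 30, 40],
    "metabolite_Y": [40, 30, 20, 10],
}
names = list(profiles)
labels = kmeans([zscore(profiles[n]) for n in names], k=2)
clusters = dict(zip(names, labels))
```

In this toy input, each cluster ends up holding one transcript and one metabolite with the same temporal trend, which mirrors the mixed groups described above.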
Global changes in 3' UTR length and transcript abundance upon CPSF6 knockdown and temperature alterations. We apply 3'-end RNA-seq to globally quantify polyadenylation sites and transcript isoform abundance in human U-2 OS cells under wild-type and CPSF6 knockdown conditions at 3 different temperatures (32°C, 37°C and 39°C). We show that CPSF6 knockdown as well as temperature alterations lead to global changes in 3' UTR length and transcript abundance. By means of a differential response analysis with respect to temperature changes in wild-type and temperature-decompensated CPSF6 knockdown cells, we reveal candidate genes underlying circadian temperature compensation.
Additional file 2. Significant biomarkers for each disease. Excel spreadsheet with the significant biomarkers found in use case 2 for each disease, including the mean log2 FC between case and control samples for each gene.
Many cell types exhibit remarkable size homogeneity through successive divisions. In both yeast and humans, size homogeneity during division cycles results from tight control of the concentration of cell-cycle inhibitors such as the WHI5 or RB1 (Retinoblastoma 1) proteins, respectively. However, size control is often lost during oncogenesis, correlating with aggression. How size control is affected by mutations that drive proliferation, such as those of the RAS-ERK pathway, or even by loss of RB1 itself, is poorly understood. Using quantitative single-cell imaging of different melanoma cell lines, we show that melanoma cells exhibit both inter- and intra-line size variability. Integration of imaging with multi-omic data demonstrates that the translation machinery, G2 regulators, inflammatory mediators and growth-regulatory proteins are key determinants of size in melanoma. Theoretical modelling suggests that the size of daughter cells is determined by biosynthetic processes and engagement of stress responses in the mother. Biosynthesis and stress in mother cells affect the synthesis of heritable pro-division factors such as Cyclin D1 (CCND1), which control cell progression by inhibiting RB1 in a concentration-dependent fashion. Small cell sizes and uniform populations are driven by robust DNA repair and increased levels of biosynthetic factors that promote CCND1 accumulation. We propose that increased size and size heterogeneity in BRAF- and NRAS-mutated cells are determined by stress or DNA damage in the mother’s division cycle that slows cycle progression in the daughters. Such events lead to increased size and senescent-like states. Taken together, our data suggest that oncogenic events, such as the dysregulation of MAPK signalling, provide a means to circumvent normal mechanisms of size regulation and increase cell-to-cell variation in size, which may drive disease.
FASTA files containing the sequence data for assembled contigs (FastA), predicted genes (FastA), and predicted proteins (FastA), plus gene predictions (GFF v2). This dataset is not publicly accessible because these sequences have already been deposited in publicly available databases, so replication can be avoided. The data are also quite large and there are numerous files associated with these entries, which are included in the links below. It can be accessed through the following web links: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA299404 https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP065069 http://enve-omics.ce.gatech.edu/data/showerheads. Format: The data represent genome sequencing and assembly of 180 different contigs. This dataset is associated with the following publication: Soto-Giron, M.J., L. Rodriguez, C. Luo, M. Elk, H. Ryu, J. Santodomingo, and K. Konstantinidis. Biofilms on Hospital Shower Hoses: Characterization and Implications for Nosocomial Infections. APPLIED AND ENVIRONMENTAL MICROBIOLOGY. American Society for Microbiology, Washington, DC, USA, 82(9): 2872-2883, (2016).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Article: Implementing a Functional Precision Medicine Tumor Board for Acute Myeloid Leukemia
Cancer Discovery, DOI: 10.1158/2159-8290.CD-21-0410
Data Types:
1. Clinical summary
2. Drug response data
3. Exome-sequencing data
4. RNA-sequencing data
Updates:
- FILE: File_3.2. DATE: 28.11.2022.
1. Clinical summary
File_0: Common sample annotation including patient and sample IDs, stage of the disease, tissue type and availability of different data types.
File_1.1: Clinical data for 186 AML patients including clinical diagnosis, disease classification, gender, age at diagnosis, treatments, cytogenetic and molecular details. The description of the variables/column titles is given below the clinical data.
File_1.2: Description of the clinical variables in File_1.1.
2. Drug response data for 164 AML patient samples and 17 healthy samples
File_2: Drug library details for 515 chemical compounds. The compound collection includes drug names, drug class defined by molecular targets or mode of action, concentration range used for drug testing, supplier information, solvent information and vendor information.
File_3.1.: Drug response data including selective drug sensitivity scores (sDSS) for 515 compounds across 181 samples (164 AML patient samples and 17 healthy control samples). The DSS is a modified area-under-the-curve value calculated as described by Yadav et al. (1). The selective drug sensitivity score (sDSS) is the DSS normalized to healthy controls, which gives an estimate of cancer-selective drug response. Higher sDSS values indicate drug sensitivity, and negative sDSS values indicate drug resistance.
File_3.2.: Drug response data including drug sensitivity scores (DSS) and selective drug sensitivity scores (sDSS) for 515 compounds across 181 samples (164 AML patient samples and 17 healthy control samples). The data are identical to Supplementary Table 7 in the manuscript.
Note: We recommend using selective DSS values instead of raw values (% inhibition, IC50, DSS).
Note: If the value is missing, the drug was not tested for that given sample.
File_4: Drug sensitivity and resistance testing (DSRT) assay details for 181 samples (164 AML patient samples and 17 healthy control samples). The information includes the medium (MCM or CM) used for drug testing, % cell viability after 72 h without drug treatment, and the blast cell percentage of each sample.
Note: Column E is the ratio of luminescence values at 72 h and 0 h. The fold change in cell viability without drug treatment is reported as % cell viability, which is why the value can exceed 100%. For example, 70% cell viability means that 30% of the cells died during the 72 h incubation, and 300% cell viability means that the cells grew 3-fold over the same period.
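The % cell viability described in the note above can be written as a one-line calculation. This is a minimal sketch; the function name and the example luminescence readings are hypothetical and are not values from File_4.

```python
def percent_viability(luminescence_72h, luminescence_0h):
    """Untreated cell viability as a percentage of the 0 h luminescence signal."""
    if luminescence_0h <= 0:
        raise ValueError("0 h luminescence must be positive")
    return 100.0 * luminescence_72h / luminescence_0h

# Hypothetical readings: one sample that lost cells, one that expanded.
shrinking = percent_viability(7000, 10000)   # 30% of the cells died
growing = percent_viability(30000, 10000)    # cells grew 3-fold
```

Values below 100 indicate net cell loss during the incubation and values above 100 indicate net growth.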
3. Exome-sequencing data for 225 AML patient samples
Note: The number of samples in the manuscript is 226. The correct number used in the analyses is 225.
Mutation data. The cancer-specific gene list was prepared by combining AML-related genes from TCGA (2) (n=23), IntOGen (3) (n=32), Papaemmanuil et al. (4) (n=111) and the Census database (5) (n=616). Of these genes, 340 were mutated across the 225 AML patient samples. Mutations were called at P-values less than 0.05.
File_5: VAF (variant allele frequency) of 340 cancer-specific genes across 225 AML patient samples. The VAF was calculated using paired skin samples as a control from the same AML patient.
File_6: Binary data for 57 cancer specific genes frequently mutated (a given mutation detected in 5 or more samples) across 225 AML patient samples.
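The relationship between a VAF table such as File_5 and a binary table of frequently mutated genes such as File_6 can be sketched as follows. This is an illustration under stated assumptions: a gene is treated as mutated in a sample when its VAF is greater than zero, missing values count as not mutated, and the threshold of 5 samples is applied per gene rather than per individual mutation. The gene names and values are hypothetical.

```python
def binarize(vaf_by_gene):
    """Convert per-sample VAF values to 1 (mutated) or 0 (not mutated or missing)."""
    return {
        gene: [1 if (vaf or 0) > 0 else 0 for vaf in vafs]
        for gene, vafs in vaf_by_gene.items()
    }

def frequently_mutated(binary_by_gene, min_samples=5):
    """Keep genes that are mutated in at least min_samples samples."""
    return {
        gene: calls
        for gene, calls in binary_by_gene.items()
        if sum(calls) >= min_samples
    }

# Hypothetical VAFs for two genes across six samples (None = no call).
toy_vaf = {
    "GENE_A": [0.4, 0.5, 0.1, 0.3, 0.2, 0.0],
    "GENE_B": [0.0, 0.5, 0.0, None, 0.0, 0.0],
}
binary = binarize(toy_vaf)
frequent = frequently_mutated(binary, min_samples=5)
```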
4. RNA-sequencing data for 163 AML patient samples and 4 healthy CD34+ samples
CPM (count per million) data: The CPM values are batch corrected values used for direct comparison of gene expression.
File_7: Log2CPM values for 18,202 protein coding genes across 167 samples (163 AML patient samples and 4 healthy CD34+ samples).
File_8: Raw read count data for all 60,619 genes across 167 samples (163 AML patient samples and 4 healthy CD34+ samples). The raw read count data were used to calculate differential gene expression.
File_9: RNA-seq library information including RNA extraction method and sequencing library preparation information for 167 samples (163 AML patient samples and 4 healthy CD34+ samples).
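For readers who want to relate the raw counts in File_8 to the log2CPM scale of File_7, the basic conversion can be sketched as below. This is only the textbook definition of counts per million with an assumed pseudocount of 1. It does not reproduce the batch correction applied to the deposited values, and the gene names and counts are hypothetical.

```python
import math

def log2_cpm(counts, pseudocount=1.0):
    """Convert one sample's raw read counts to log2(CPM + pseudocount)."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no reads")
    return {
        gene: math.log2(count * 1e6 / library_size + pseudocount)
        for gene, count in counts.items()
    }

# Hypothetical two-gene sample: CPM is 250,000 and 750,000 respectively.
sample = {"gene_1": 1, "gene_2": 3}
values = log2_cpm(sample)
```

Because CPM divides by the library size, samples sequenced to different depths become comparable before any further normalization.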
References
1. Yadav B, Pemovska T, Szwajda A, Kulesskiy E, Kontro M, Karjalainen R, et al. Quantitative scoring of differential drug sensitivity for individually optimized anticancer therapies. Scientific Reports 2014;4:5193.
2. Ley TJ, Miller C, Ding L, Raphael BJ, Mungall AJ, Robertson A, et al. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. New England Journal of Medicine 2013;368(22):2059-74.
3. Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J, Tamborero D, Schroeder MP, Jene-Sanz A, et al. IntOGen-mutations identifies cancer drivers across tumor types. Nature Methods 2013;10(11):1081-2.
4. Papaemmanuil E, Gerstung M, Bullinger L, Gaidzik VI, Paschka P, Roberts ND, et al. Genomic classification and prognosis in acute myeloid leukemia. New England Journal of Medicine 2016;374(23):2209-21.
5. Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Research 2019;47(D1):D941-D7.
Superparamagnetic nanoparticles (SPMNPs) are appealing for use in organelle isolation strategies. Yet this potential remains largely unexplored because research has so far focused either on physicochemical design or on their application in micron-sized beads. For their use in life sciences, the biocompatibility of SPMNPs goes beyond their chemical composition and shape, and features such as their size and, more importantly, their surface properties are becoming increasingly relevant for exploiting extra- and intracellular interactions. Here we introduce thermal decomposition to manufacture iron oxide-based SPMNPs (Ø10nm) and demonstrate how different surface functionalizations can lead to different types of cellular interactions. Cationic aminolipid-coated SPMNPs are retained surprisingly strongly at the outer cell surface. In contrast, anionic dimercaptosuccinic acid-coated SPMNPs are efficiently internalized and accumulate in a time-dependent manner in endosomal and lysosomal populations. These features allowed us to establish a standardized magnetic isolation procedure to selectively isolate plasma membranes and intracellular late endosomes/lysosomes with high yields and purities, as consolidated by biochemical and ultrastructural analyses. Subsequent quantitative and qualitative proteome analysis underpins the overall high enrichment for hydrophobic (membrane) proteins as well as plasma membrane and lysosomal constituents in the respective purified fractions. This nano-based technology therefore provides a breakthrough in the field of subcellular ‘omics’, as it allows the identification of subtle alterations in the biomolecular composition of different SPMNP-isolated compartments that would otherwise not be detected in total cell or tissue analysis.
https://spdx.org/licenses/etalab-2.0.html
Dataset overview
This dataset provides:
- the updated Integrated Gene Catalog of the human gut microbiota (aka IGC2)
- 1,989 Metagenomic Species Pangenomes (MSPs)
This dataset can be used to analyze shotgun sequencing data of the human gut microbiota.
How to use this dataset
To perform taxonomic, functional and strain-level profiling with this dataset, we suggest using Meteor.
Methods
Gene catalog construction
The methodology for creating the IGC2 catalog is described in the original papers: Li et al., 2014 and Wen et al., 2017.
MSP creation
Reads from publicly available human gut metagenomes were aligned against the IGC2 catalog with Meteor to produce a raw gene abundance table (10.4M genes quantified in >2000 samples). Then, co-abundant genes were binned into 1,989 Metagenomic Species Pan-genomes (MSPs, i.e. clusters of co-abundant genes that likely belong to the same microbial species) using MSPminer.
MSPs taxonomic annotation
MSPs taxonomic annotation was performed by aligning MSP core and accessory genes against representative genomes of the Genome Taxonomy Database (GTDB r207) using blastn (task = megablast, word_size = 16). The 20 best hits for each gene were kept (--max-target-seq 20). Using an in-house pipeline, a species-level assignment was given if > 50% of the genes matched the representative genome of a given species, with a mean identity ≥ 95% and mean gene length coverage ≥ 90%. The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom) if more than 50% of their genes had the same annotation.
Construction of the phylogenetic tree
39 universal phylogenetic marker genes were extracted from the MSPs with fetchMGs. Then, the markers were separately aligned with MUSCLE. The alignments were merged and trimmed with trimAl (parameters: -automated1). Finally, the phylogenetic tree was computed with FastTreeMP (parameters: -gamma -pseudo -spr -mlacc 3 -slownni).
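The species-level assignment rule stated under MSPs taxonomic annotation (more than 50% of genes matching one representative genome, mean identity of at least 95%, mean gene length coverage of at least 90%) can be sketched as follows. The in-house pipeline itself is not published in this record, so the function, the input layout (at most one hit per gene and species), and the tie-break by highest matching fraction are assumptions made for illustration.

```python
from collections import defaultdict

def assign_species(hits, n_genes, min_fraction=0.5, min_identity=95.0, min_coverage=90.0):
    """Return the species passing all three thresholds, or None.

    hits: iterable of (gene_id, species, identity_pct, coverage_pct),
          assumed to hold at most one row per gene and species.
    n_genes: total number of genes in the MSP.
    """
    by_species = defaultdict(list)
    for _gene_id, species, identity, coverage in hits:
        by_species[species].append((identity, coverage))

    best_species, best_fraction = None, 0.0
    for species, rows in by_species.items():
        fraction = len(rows) / n_genes
        mean_identity = sum(r[0] for r in rows) / len(rows)
        mean_coverage = sum(r[1] for r in rows) / len(rows)
        passes = (
            fraction > min_fraction
            and mean_identity >= min_identity
            and mean_coverage >= min_coverage
        )
        # Assumed tie-break: prefer the species matched by the most genes.
        if passes and fraction > best_fraction:
            best_species, best_fraction = species, fraction
    return best_species

# Hypothetical MSP with four genes; three match species "s__A" well.
toy_hits = [
    ("g1", "s__A", 98.0, 95.0),
    ("g2", "s__A", 97.0, 92.0),
    ("g3", "s__A", 96.0, 91.0),
    ("g1", "s__B", 90.0, 80.0),
]
assigned = assign_species(toy_hits, n_genes=4)
unassigned = assign_species(toy_hits[:2], n_genes=4)  # only 50% of genes match
```

An MSP that returns None here would fall through to the higher-rank assignment step described in the text.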
This record contains raw data related to the article “Definition of a multi-omics signature for Esophageal Adenocarcinoma prognosis prediction”
Abstract: Esophageal cancer is a highly lethal malignancy that accounts for 5% of all cancer deaths. The two main sub-types of the disease are esophageal squamous-cell carcinoma (ESCC) and esophageal adenocarcinoma (EAC). To date, most studies have focused on analysing the transcriptional profile in ESCC, and only a few have analysed EAC for transcriptional signatures that might be associated with diagnosis and/or prognosis. In this work we performed a single-cell RNA sequencing (scRNAseq) analysis of CD45+ cells enriched from tumor and matched non-tumor tissues obtained from 3 therapy-naïve patients to identify all the types of immune cells present in the tumor's immune infiltrate and their transcriptomic profiles. Moreover, we analysed the whole transcriptome in a cohort of 23 patients from whom biopsies were taken from tumor and matched non-tumor tissues. The transcriptional signatures derived from both types of analysis were then used to stratify a larger cohort of TCGA EAC patients, showing a strong association with their prognosis. The transcriptional signatures described here therefore proved able to predict the clinical outcome of patients and could be used to better define prognosis in EAC after surgery and to direct patients towards effective therapies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Running time. Evaluation of the running time, reported as the mean over 30 bootstrap iterations. All methods investigated in this study were run single-threaded. For the proposed method, the running time is the sum of the execution times of the feature selection and prioritization steps.
Mass spectrometry imaging (MSI) experiments result in complex multi-dimensional datasets, which require specialist data analysis tools. Here we have developed massPix, an open-source R package for the analysis and statistical interpretation of data from MSI of lipids in tissue; it is particularly useful for lipidomics applications. MassPix produces single ion images, performs multivariate statistics and provides putative lipid annotations based on accurate mass matching against generated lipid libraries. Classification of tissue regions with high spectral similarity can be carried out by principal components analysis (PCA) or k-means clustering. Mouse cerebellum was analysed using matrix-assisted laser desorption ionisation (MALDI) MSI. The resulting MSI dataset forms the test data for massPix.
The combined use of multiple omics methods to answer complex systems biology questions is growing in the biological and medical sciences, as the importance of studying interrelated biological processes in their entirety is increasingly recognized. We applied a combination of metabolomics, lipidomics and proteomics to human bone to investigate the potential of this multi-omics approach to estimate the time elapsed since death (i.e., the post-mortem interval, PMI). This “ForensOMICS” approach has the potential to improve the accuracy and precision of PMI estimation for skeletonized human remains, thereby helping forensic investigators to establish the timeline of events surrounding death. Anterior midshaft tibial bone was collected from four female body donors in a fresh stage of decomposition before the bodies were placed to decompose outdoors at the human taphonomy facility managed by the Forensic Anthropological Center at Texas State (FACTS). Bone samples were collected again at selected PMIs (219, 790, 834 and 872 days). Liquid chromatography mass spectrometry (LC-MS) was used to obtain untargeted metabolomic, lipidomic and proteomic profiles from the pre- and post-placement bone samples. Multivariate analysis was used to investigate the three omics blocks by means of Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies (DIABLO), to identify the reduced number of markers that could effectively describe post-mortem changes and classify the individuals based on their PMI. The resulting model showed that pre-placement bone metabolome, lipidome and proteome profiles were clearly distinguishable from post-placement profiles. Metabolites associated with the pre-placement samples suggested an extinction of the energetic metabolism and a switch towards another source of fuelling (e.g., structural proteins).
We were able to identify certain biomolecules from the three groups that show excellent potential for estimation of the PMI, predominantly biomolecules from the metabolomics block. Our findings suggest that, by targeting a combination of compounds with different post-mortem stability, future studies may be able to estimate both short PMIs, by using metabolites and lipids, and longer PMIs, by including more stable proteins.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Selection consistency analysis. The number of significantly self-consistent genes and of all genes selected by a given method during the 30 bootstrap iterations. ns is the number of significantly self-consistent genes found, tot is the number of different features selected over the 30 bootstrap iterations, and mnsf is the mean number of selected features. The highest values are highlighted in bold.
A transmission mode-direct analysis in real time-quadrupole time of flight-mass spectrometry (TM-DART-QTOF-MS)-based analytical method coupled to multivariate statistical analysis was developed to interrogate lipophilic compounds in seawater samples without the need for desalination. An untargeted metabolomics approach, referred to here as seaomics, was successfully implemented to discriminate sea surface microlayer (SML) from underlying water (ULW) samples (n=22, 10 paired samples) collected during a field campaign at the Cape Verde islands in September-October 2017. A panel of 11 ionic species detected in all samples allowed sample class discrimination by means of supervised multivariate statistical models. Tentative identification of species enriched in SML samples suggests that fatty alcohols, halogenated compounds, and oxygenated boron-containing organic compounds are available at the surface for water-air transfer processes. A subset of SML samples (n=5) was subjected to on-site experiments during the campaign using a lab-to-the-field approach to test their secondary organic aerosol (SOA) formation potency. Results from these experiments and the analytical seaomics strategy provide a proof of concept for an approach to identifying organic molecules involved in aerosol formation processes at the water/air interface.
https://spdx.org/licenses/etalab-2.0.html
MetaNutriUnify Collection
This work is linked to the iTARGET project (https://qualiment.fr/des-projets-pour-anticiper-les-besoins-de-recherche-des-entreprises-agroalimentaires-2022/hot-topics/), which aims to perform in silico and in vitro targeting of healthy gut bacteria with fiber-degrading metabolic potential. In this context, we developed MetaNutriUnify, the first collection of curated and harmonized metagenomic data with unified nutritional data from public study cohorts, with particular attention to fiber.
Overview of the public studies included in MetaNutriUnify
MetaNutriUnify is composed of 21 harmonized and curated public projects with available shotgun metagenomes from human adult stool samples, together with nutritional and anthropometric data. It consists of 949 individuals from 15 countries, totaling 1656 metagenomes, from which we generated abundance tables of microbial species and associated functional modules. We also unified nutritional data, including diet type, study type (observational/interventional), time points for stool sampling, diet intervention and associated information, macro- and micronutrients when available, and reported available anthropometric data (gender, country, age, weight, height, BMI).
The 21 included studies are:
- Bioproject PRJEB8249 (2015, SWE, 21 subjects, PMID 26244932)
- Bioproject PRJNA278393 (2015, TZA & ITA, 33 subjects, PMID 25981789)
- Bioproject PRJNA328899 (2016, MNG & CHN, 110 subjects, PMID 27708392)
- Bioproject PRJNA305507 (2017, USA, 33 subjects, PMID 28797298)
- Bioproject PRJEB28687 (2018, USA & THA, 50 subjects, PMID 30388453)
- Bioproject PRJEB32794 (2019, IRL, 37 subjects, PMID 31558359)
- Bioproject PRJNA472785 (2019, USA, 12 subjects, PMID 31235964)
- Bioproject PRJNA386503 (2019, USA, 4 subjects, PMID 30810441)
- Bioproject PRJNA397112 (2019, IND, 88 subjects, PMID 30698687)
- Bioproject PRJEB33500 (2020, ITA, 82 subjects, PMID 32075887)
- Bioproject PRJNA647720 (2021, USA, 20 subjects, PMID 33727392)
- Bioproject PRJNA755720 (2021, ESP, 20 subjects, PMID 34444797)
- Bioproject PRJNA892265 (2022, ESP, 20 subjects, PMID 36364873)
- Bioproject PRJEB42906 (2022, USA, 50 subjects, PMID 35312171)
- Bioproject PRJEB45944 (2022, NLD, 149 subjects, PMID 35115599)
- Bioproject PRJEB48663 (2022, FRA, 39 subjects, PMID 35311446)
- Bioproject PRJNA762543 (2022, SGP, 62 subjects, PMID 35549618)
- Bioproject PRJEB48605 (2023, DEU, 68 subjects, PMID 35760036)
- Bioproject PRJNA939268 (2023, SGP, 10 subjects, PMID 36997838)
- Bioproject PRJEB26842 (2023, GBR, 29 subjects, PMID 37587110)
- Bioproject PRJNA906167 (2023, ESP, 12 subjects, PMID 37457982)
Method summary
Metagenomic data and associated metadata were recovered from the European Nucleotide Archive, while nutritional and anthropometric data were collected from various online resources (main publication, supplementary files, GitHub or BioProject information). QC validation was performed using fastp (version 0.23.4), and host-related reads were filtered out with bowtie using the human reference genome (Homo sapiens T2T-CHM13v2.0).
The resulting high-quality reads were mapped with the METEOR software onto the 10.4 million gene IGC2 catalogue of the human gut microbiome and onto the 8.4 million gene human oral microbial catalogue, both of which are clustered into Metagenomic Species Pangenomes (MSP species) that were previously taxonomically and functionally annotated.
MetaNutriUnify characteristics
The provided data consist of:
- Metagenomic Species Pangenomes (MSP) species abundance table and related GTDB taxonomy (GTDB-Tk version r220) (final_msp.7z and species_taxonomy_20241119.tab)
- KEGG, GMM and GBM functional modules abundance table and related module definitions (KEGG version 107, GMM modules and GBM modules) (final_modules.tab and all_modules_definition_GMM_GBM_KEGG_107_20241119.tab)
- Manually curated and unified data collected from bioprojects (final_metadata.tab):
  - Metadata from the metagenomes obtained from ENA, as well as nutritional and anthropometric data.
  - Additional data on each sample, such as the evaluation of cross-contamination within each bioproject using the CroCoDeEL tool, together with the number of high-quality reads, the MSP species richness and whether the number of reads was below 1M. We propose a “to_exclude” variable in the deposited MetaNutriUnify file, derived from these data: if one of the following conditions is met (low_read = YES, is.contaminated = YES or MSP_richness < 20), we propose excluding the sample from downstream analysis.
  - Nutritional data, carefully extracted from each study and reporting information on diet type, energy, and macro- and micronutrients when available. No frequency data were included, because of the great variability between studies. We reported variables and their modalities as they were described in the different studies, and only modified units of nutritional data when appropriate. We encourage users to modulate the modalities for some variables, such as time point description, and to refer to the original studies...
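The proposed exclusion rule can be expressed as a small predicate. This is a minimal sketch that assumes each row of final_metadata.tab has been loaded into a dictionary keyed by the column names given above (low_read, is.contaminated, MSP_richness) and that the first two hold the strings YES or NO. The sample identifiers and values are hypothetical.

```python
def to_exclude(sample, min_richness=20):
    """True if the sample meets any of the three exclusion conditions."""
    return (
        sample["low_read"] == "YES"
        or sample["is.contaminated"] == "YES"
        or sample["MSP_richness"] < min_richness
    )

# Hypothetical metadata rows.
samples = {
    "sample_ok": {"low_read": "NO", "is.contaminated": "NO", "MSP_richness": 150},
    "sample_low_reads": {"low_read": "YES", "is.contaminated": "NO", "MSP_richness": 90},
    "sample_contaminated": {"low_read": "NO", "is.contaminated": "YES", "MSP_richness": 120},
    "sample_poor": {"low_read": "NO", "is.contaminated": "NO", "MSP_richness": 12},
}
kept = [name for name, row in samples.items() if not to_exclude(row)]
```

The deposited file already carries a to_exclude column, so a function like this is mainly useful for re-deriving the flag with a different richness threshold.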
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository houses fully-adjusted methylome-wide association study (MWAS) summary statistics for 4,231 SomaScan protein measurements. These were generated as part of the study titled ‘Integrated methylome and phenome study of the circulating proteome reveals markers pertinent to brain health’ by Gadd et al. The Stratifying Resilience and Depression Longitudinally (STRADL) cohort used in this study is a subset of individuals from Generation Scotland: The Scottish Family Health Study. There were 744 individuals with complete protein and DNA methylation measurements available at 772,619 CpG probes. MWAS were performed with protein residuals as the outcome and DNA methylation as the exposure, using the Omics-data-based complex trait analysis (OSCA) software.
Fully-adjusted models were run using M-values that were adjusted for age, sex, DNA methylation-derived immune cell estimates, depression status, DNA methylation batch and set, body mass index and a DNA methylation-derived smoking score. Protein levels were rank-based inverse normalised and scaled to have a mean of 0 and standard deviation of 1. Protein levels were residualised by age, sex, available pQTLs, technical covariates and 20 genetic principal components.
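Rank-based inverse normalisation, as applied to the protein levels, can be sketched with the Python standard library. The record does not state which rank offset was used, so the offset of 0.5 below is an assumption (0.375 would give the Blom variant), as is the handling of ties by average rank. The input values are hypothetical.

```python
from statistics import NormalDist

def rank_inverse_normal(values, offset=0.5):
    """Map values to normal quantiles of their ranks; ties share an average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over a run of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [
        standard_normal.inv_cdf((rank - offset) / (n - 2 * offset + 1))
        for rank in ranks
    ]

# Hypothetical protein measurements for three individuals.
transformed = rank_inverse_normal([3.0, 1.0, 2.0])
```

The output depends only on the ordering of the inputs, which is what makes the transform robust to skewed protein distributions.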
Four of the 4,235 protein MWAS models did not converge (15509-2 - NAGLU, 15584-9 - CFHR2, 4407-10 - MST1 and 6402-8 - PILRA). Therefore, summary statistics are provided for 4,231 protein levels.
Each protein MWAS summary statistics file has been saved with the following naming system: "MWAS_SeqId_Protein_gene.csv". For example, the protein with gene name CRYBB2 and SeqId 10000-28 has the following file name: "MWAS_10000-28_CRYBB2.csv".
The SeqIds, UniProt codes, gene names and full UniProt names can be found in "annotation_formatted_for_paper.csv" and the full summary statistics are found within "compressed-protein-ewas.tar.gz".
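The naming system above makes it straightforward to build or parse file names programmatically. The helper names below are hypothetical, and the parser assumes that neither the SeqId nor the gene name contains an underscore.

```python
def mwas_filename(seq_id, gene):
    """Build a summary statistics file name from a SeqId and gene name."""
    return f"MWAS_{seq_id}_{gene}.csv"

def parse_mwas_filename(filename):
    """Recover (SeqId, gene) from a file name following the naming system."""
    prefix, suffix = "MWAS_", ".csv"
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        raise ValueError(f"unexpected file name: {filename}")
    seq_id, gene = filename[len(prefix):-len(suffix)].split("_", 1)
    return seq_id, gene

name = mwas_filename("10000-28", "CRYBB2")
parsed = parse_mwas_filename(name)
```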
Please contact either riccardo.marioni@ed.ac.uk or danni.gadd@ed.ac.uk with any queries. All code is available at the following GitHub repository: https://github.com/DanniGadd/Epigenome-and-phenome-wide-study-of-brain-health-outcomes.
Microorganisms play a key role in cycling nutrients and contaminants in the terrestrial environment, depending on their genetic potential. Here we present metagenome-assembled genomes (MAGs) for the bacterial and archaeal community in floodplain sediment samples taken roughly every month from May 18 to September 13, 2017, at a location (Pit2) close to DOE Legacy Management well 855 at the Riverton, Wyoming floodplain site in the Wind River Basin (WRB). The groundwater at this site exhibits persistent U, Mo, and sulfate plumes, and the site is one of the field sites in focus for the SLAC Groundwater Quality SFA program. Cores were taken with a hand auger and separated into 5-20 cm segments based on soil horizonation down to 150 cm depth below surface. Each segment was subsampled for microbial analyses. Corresponding 16S rRNA gene amplicon data are available at the NCBI Sequence Read Archive (SRA) database under BioProject ID PRJNA626616, and soil geochemistry data at doi:10.15485/1631972. 40 metagenomes were sequenced through JGI and can be found under Gold sequencing project: Gs0142591. Metagenomes were assembled, binned, and refined using metaWRAP to generate MAGs (>50% complete and <10% contamination based on CheckM scores). This dataset includes a zip file of 6993 MAG fasta files and a csv file with quality, taxonomic classification (GTDB RS220), and metagenome accessions for MAGs generated from the Wind River Basin (WRB). This dataset also includes a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata, and a data dictionary (dd.csv) file that contains the column/row headers used throughout the files along with a definition, units, and data type.
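The MAG quality thresholds quoted above (>50% complete and <10% contamination) can be applied to any table of CheckM-style scores with a simple filter. This is a minimal sketch; the record layout and MAG identifiers are hypothetical and do not reflect the column names of the deposited csv file.

```python
def passes_quality(completeness, contamination,
                   min_completeness=50.0, max_contamination=10.0):
    """True if a bin exceeds the completeness floor and stays under the contamination cap."""
    return completeness > min_completeness and contamination < max_contamination

# Hypothetical bins with completeness and contamination in percent.
bins = [
    {"id": "bin_001", "completeness": 92.3, "contamination": 1.8},
    {"id": "bin_002", "completeness": 48.9, "contamination": 0.5},
    {"id": "bin_003", "completeness": 75.0, "contamination": 12.4},
]
retained = [
    b["id"] for b in bins
    if passes_quality(b["completeness"], b["contamination"])
]
```

Stricter cut-offs can be explored by passing different values for min_completeness and max_contamination.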