71 datasets found
  1. GSE58095 Data Normalization Subtype Analysis R

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE58095 Data Normalization Subtype Analysis R [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse58095-data-normalization-subtype-analysis-r
    Explore at:
    zip (26134446 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains processed and normalized gene expression data from the public GEO series GSE58095.

    The dataset is prepared to support downstream analyses such as subtype classification, differential expression, and exploratory visualization.

    The content includes R scripts and processed matrices that guide users through normalization, quality control, and biological interpretation steps.

    Gene expression data were aligned, filtered, quality-checked, and normalized using widely accepted bioinformatics pipelines.

    The dataset aids researchers working in cancer genomics, transcriptomics, and molecular subtype discovery.

    Included analyses demonstrate how to classify samples into biologically meaningful subtypes using clustering and statistical approaches.

    The workflow supports reproducible research with clear steps for importing raw data, preprocessing, normalization, and generating subtype assignments.

    The dataset is intended for educational, research, and benchmarking purposes within computational biology and bioinformatics.

    All scripts are written in R for transparency and adaptability to various research workflows.

  2. GSE6740 Data Normalization SubtypeAnalysis Patient

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE6740 Data Normalization SubtypeAnalysis Patient [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse6740-data-normalization-subtypeanalysis-patient
    Explore at:
    zip (1838637 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    • This dataset contains processed gene expression data derived from the publicly available GEO series GSE6740.
    • The dataset focuses on the normalization, preprocessing, and subtype-level analysis of patient samples.
    • It includes R scripts and resources used to clean, transform, and standardize raw microarray expression values.
    • The uploaded files support the step-by-step workflow used to perform differential expression and subtype clustering.
    • The dataset is suitable for users working on microarray analysis, normalization pipelines, and cancer or immune cell subtype research.
    • All preprocessing steps follow standard bioinformatics workflows, including background correction, log transformation, and quantile normalization.
    • The dataset allows users to reproduce normalization results, explore subtype-level grouping, and run downstream statistical comparisons.
    • It includes annotated patient group information and cell-type–specific analytical procedures used in GSE6740-based research.
    • The content is designed for students, bioinformaticians, and researchers learning microarray data normalization with R.
    • The dataset can be directly used for training, teaching, method comparison, or as a reference workflow for microarray processing.

  3. Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data...

    • frontiersin.figshare.com
    application/cdfv2
    Updated Jun 1, 2023
    + more versions
    Cite
    Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao (2023). Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.doc [Dataset]. http://doi.org/10.3389/fgene.2019.00400.s001
    Explore at:
    application/cdfv2
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst case, a method evaluated as the best by one metric is evaluated as the poorest by another, or a method evaluated as the best on one dataset is evaluated as the poorest on another. This raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq or microarray data (the consistency of datasets). We then designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it, together with another metric, mSCC, to evaluate 14 commonly used normalization methods using both scRNA-seq and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best normalization method for their gene expression data, based on the evaluation of different methods (particularly data-driven methods or their own) under the principles of consistency of metrics and consistency of datasets.
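    The CV-threshold idea behind AUCVC can be sketched in a few lines of R. This is an illustrative toy, not the NormExpression implementation: the simulated count matrix, the threshold grid, and the Riemann-sum integration are all assumptions made for the example.

    ```r
    # Toy illustration of a CV-threshold curve (the idea behind AUCVC),
    # not the NormExpression package's implementation.
    set.seed(1)
    expr <- matrix(rpois(200 * 10, lambda = 50), nrow = 200)   # genes x samples (simulated)
    cv <- apply(expr, 1, function(x) sd(x) / mean(x))          # per-gene coefficient of variation
    thresholds <- seq(0, max(cv), length.out = 100)
    frac_below <- sapply(thresholds, function(t) mean(cv <= t))  # fraction of genes under each threshold
    aucvc <- sum(diff(thresholds) * head(frac_below, -1))        # area under the curve (left Riemann sum)
    ```

    A better-performing normalization should push per-gene CVs down for uniformly processed data, enlarging the area under this curve.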

  4. GSE65194 Data Normalization and Subtype Analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE65194 Data Normalization and Subtype Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse65194-data-normalization-and-subtype-analysis
    Explore at:
    zip (54989436 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Raw and preprocessed microarray expression data from the GSE65194 cohort.

    Includes samples from triple-negative breast cancer (TNBC), other breast cancer subtypes, and normal breast tissues.

    Expression profiles were generated using the Affymetrix Human Genome U133 Plus 2.0 Array (GPL570) platform.

    Provides normalized gene expression values suitable for downstream analyses such as differential expression, subtype classification, and clustering.

    Supports the identification of differentially expressed genes (DEGs) between TNBC, non-TNBC subtypes, and normal tissue.

    Useful for transcriptomic analyses in breast cancer research, including subtype analysis, biomarker discovery, and comparative studies.

  5. Data from: Size normalizing planktonic Foraminifera abundance in the water...

    • zenodo.org
    bin
    Updated Aug 12, 2024
    Cite
    Sonia Chaabane; Thibault de Garidel-Thoron; Xavier Giraud; Julie Meilland; Geert-Jan A. Brummer; Lukas Jonkers; P. Graham Mortyn; Mattia Greco; Nicolas Casajus; Michal Kucera; Olivier Sulpis; Azumi Kuroyanagi; Hélène Howa; Gregory Beaugrand; Ralf Schiebel (2024). Size normalizing planktonic Foraminifera abundance in the water column [Dataset]. http://doi.org/10.5281/zenodo.10750545
    Explore at:
    bin
    Dataset updated
    Aug 12, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sonia Chaabane; Thibault de Garidel-Thoron; Xavier Giraud; Julie Meilland; Geert-Jan A. Brummer; Lukas Jonkers; P. Graham Mortyn; Mattia Greco; Nicolas Casajus; Michal Kucera; Olivier Sulpis; Azumi Kuroyanagi; Hélène Howa; Gregory Beaugrand; Ralf Schiebel
    Time period covered
    Mar 2, 2024
    Description

    Data and R code for the paper Size normalizing planktonic Foraminifera abundance in the water column (https://doi.org/10.1002/lom3.10637) by Sonia Chaabane, Thibault de Garidel-Thoron, Xavier Giraud, Julie Meilland, Geert-Jan A. Brummer, Lukas Jonkers, P. Graham Mortyn, Mattia Greco, Nicolas Casajus, Olivier Sulpis, Michal Kucera, Azumi Kuroyanagi, Hélène Howa, Gregory Beaugrand, Ralf Schiebel

    The codes serve to generate a new normalization approach for estimating the abundance of planktonic Foraminifera (ind/m³) within the specified collection size fraction range. Data utilized in this study are sourced from the FORCIS database, containing records collected from the global ocean at various depths spanning the past century. A cumulative distribution across size fractions is identified and modeled using a Michaelis-Menten function. This modeling results in multiplication factors enabling the normalization of one fraction to any other size fraction equal to or larger than 100 µm. The resultant size normalization model is then tested across various depths and compared against a previous size normalization solution.
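    As a rough sketch of that modeling step, a Michaelis-Menten curve can be fitted to a cumulative size distribution with `nls()` and the fitted curve turned into a multiplication factor between two size fractions. The numbers below are invented for illustration and are not FORCIS values or the authors' fitted factors.

    ```r
    # Hypothetical cumulative fraction of individuals captured as the included
    # size range widens (values invented for illustration; not FORCIS data).
    range_um <- c(50, 100, 150, 200, 250, 300)
    cum_frac <- c(0.31, 0.44, 0.55, 0.59, 0.65, 0.67)

    # Fit a Michaelis-Menten curve: f(x) = Vm * x / (Km + x)
    fit <- nls(cum_frac ~ Vm * range_um / (Km + range_um),
               start = list(Vm = 1, Km = 100))
    mm <- function(x) predict(fit, newdata = data.frame(range_um = x))

    # Multiplication factor to normalize counts from a narrower to a wider range
    factor_150_to_300 <- mm(300) / mm(150)
    ```

    Because the fitted curve saturates, the factor shrinks as the starting fraction already covers most of the size range.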

    Scripts written by Sonia Chaabane.

    DATA SOURCES

    DATA

    1. data_raw_from_excel.RDS

    CODES

    1. Code 1_Prepare the data.R: Reads the data and prepares it for analysis.
    2. Code 2_Data-model_training.R: Analyzes the data and builds the model.
    3. Code 2_MM_confidence interval_all oceans_depths_seasons.R: Analyzes the data and computes confidence intervals across all oceans, depths, and seasons.
    4. Code 3_Validation.R: Compares actual vs. estimated number concentrations.
    5. Code 4_Test with berger scheme.R: Compares actual vs. estimated number concentrations using Berger 1969 correction scheme.
    6. Code 5_Cross validation_Retailleau et al.R: Applies the FORCIS number concentration-size correction scheme on an independent dataset.
    7. Code 6_Retailleau et al. using berger approach.R: Compares actual vs. estimated number concentrations using Berger 1969 correction scheme from an independent dataset.
    8. function.R: Additional functions used in the analysis.

  6. R script to reproduce "Improved normalization of species count data in...

    • repository.soilwise-he.eu
    Updated Jul 1, 2020
    + more versions
    Cite
    (2020). R script to reproduce "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities". [Dataset]. https://repository.soilwise-he.eu/cat/collections/metadata:main/items/b7260968-33ab-4b37-8158-4c7f6a599a75
    Explore at:
    Dataset updated
    Jul 1, 2020
    Description

    The R script and data are available for download:
    https://metadata.bonares.de/smartEditor/rest/upload/ID_7050_2020_05_13_Beule_Karlovsky.zip

    R script and data for the reproduction of the paper entitled "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities" by Lukas Beule and Petr Karlovsky.

    Comparison of scaling with ranked subsampling (SRS) with rarefying for the normalization of species count data in ecology. The example provided is a library obtained from next generation sequencing of a soil bacterial community. Different alpha diversity measures, community composition, and relative abundance of taxonomic units are compared.
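    A minimal sketch of the SRS idea, written from the paper's description rather than from the authors' code: scale each sample's counts to a common depth Cmin, keep the integer parts, then hand the remaining reads to the OTUs with the largest fractional parts. Note that real SRS breaks ties among equal fractional parts at random; this sketch keeps input order, and the counts are made up.

    ```r
    # Sketch of scaling with ranked subsampling (SRS) for one sample
    # (illustrative only; not the published SRS implementation).
    srs_one_sample <- function(counts, Cmin) {
      scaled <- counts * Cmin / sum(counts)   # scale to the common depth Cmin
      out <- floor(scaled)                    # keep integer parts
      remainder <- Cmin - sum(out)            # reads still to be assigned
      if (remainder > 0) {
        frac <- scaled - out                  # rank OTUs by fractional part
        top <- order(frac, decreasing = TRUE)[seq_len(remainder)]
        out[top] <- out[top] + 1              # give one read each to the top ranks
      }
      out
    }

    counts <- c(OTU1 = 530, OTU2 = 310, OTU3 = 105, OTU4 = 55)
    normalized <- srs_one_sample(counts, Cmin = 500)
    sum(normalized)  # exactly Cmin, unlike simple proportional rounding
    ```

    Unlike rarefying, every run of this procedure on the same input yields the same normalized counts (up to tie-breaking), which is the reproducibility argument made in the paper.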

  7. Methods for normalizing microbiome data: an ecological perspective

    • datadryad.org
    zip
    Updated Oct 30, 2018
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at:
    zip
    Dataset updated
    Oct 30, 2018
    Dataset provided by
    Dryad
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    Time period covered
    Oct 19, 2018
    Description

    Simulation script 1: This R script simulates two populations of microbiome samples and compares normalization methods.
    Simulation script 2: This R script simulates two populations of microbiome samples and compares normalization methods via PCoAs.
    Sample.OTU.distribution: OTU distribution used in the paper "Methods for normalizing microbiome data: an ecological perspective".

  8. GEOExpressionMatrixHandlingNormalization GSE157103

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GEOExpressionMatrixHandlingNormalization GSE157103 [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/geoexpressionmatrixhandlingnormalization-gse157103
    Explore at:
    zip (22797882 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    • This dataset provides processed outputs from the GEO dataset GSE157103.
    • It focuses on gene expression matrix handling and normalization workflows.
    • Includes normalized expression data and quality-control visualizations.
    • Demonstrates quantile normalization applied to raw expression values.
    • Contains plots used to assess distribution changes before and after normalization.
    • Useful for bioinformatics learners and researchers studying expression normalization.
    • Supports reproducible analysis for transcriptomics and preprocessing pipelines.

  9. GSE40012 Data Normalization and Subtype Analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE40012 Data Normalization and Subtype Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse40012-data-normalization-and-subtype-analysis
    Explore at:
    zip (19093460 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Blood transcriptome data from study GSE40012: critically ill patients with bacterial pneumonia, influenza A (H1N1) pneumonia, mixed bacterial–influenza pneumonia, systemic inflammatory response syndrome (SIRS), and healthy controls.

    Data were obtained via Illumina HT-12 microarray bead arrays.

    Whole-blood gene expression was measured at multiple time points (including day 1 and day 5 for healthy controls; up to 5 days of follow-up for influenza pneumonia patients) to profile host immune response dynamics.

    Provides normalized and preprocessed expression values suitable for downstream analysis (e.g., differential expression, subtype classification, immune response profiling).

    Useful for comparing host transcriptomic signatures between bacterial pneumonia, viral (influenza) pneumonia, mixed infections, SIRS, and healthy controls.

    Can be used to identify potential biomarkers distinguishing influenza vs bacterial pneumonia, or to study temporal changes in host response.

    Enables meta-analysis or integration with other blood transcriptome datasets for broader immunological studies.

  10. Processed data - DegNorm: Normalization of generalized transcript...

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Bin Xiong; Yiben Yang; Frank R Fineis; Ji-Ping Wang (2020). Processed data - DegNorm: Normalization of generalized transcript degradation improves accuracy in RNA-seq analysis [Dataset]. http://doi.org/10.5281/zenodo.2595303
    Explore at:
    zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Bin Xiong; Yiben Yang; Frank R Fineis; Ji-Ping Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Processed data from DegNorm:

    • "_raw.txt": raw read counts matrix;
    • "_DI.txt": Degradation index score matrix;
    • "_DegNorm.txt": normalized read counts matrix from DegNorm output;
    • "_coverage.Rdata": list of coverage matrix for the sample;
    • "_countsTIN.txt": TIN normalized counts.

  11. Data from: A systematic evaluation of normalization methods and probe...

    • nde-dev.biothings.io
    zip
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    Hospital for Sick Children
    University of Toronto
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods: This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results: The method we define as SeSAMe 2, which consists of applying the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson’s correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethic protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01 and (4) removed probes if more than 5% of the samples having a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using out-of-band probes empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.
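    The replicate-pair comparison at the end of this description can be sketched as follows. The beta matrices here are simulated stand-ins (an assumption made for illustration); the study itself compared the 16 replicated EPIC samples under each normalization method.

    ```r
    # Schematic of scoring a normalization method by the absolute beta-value
    # difference between technical replicate pairs (simulated data, not the
    # study's EPIC measurements).
    set.seed(42)
    n_probes <- 1000; n_pairs <- 4
    beta_rep1 <- matrix(runif(n_probes * n_pairs), ncol = n_pairs)         # replicate 1
    noise <- matrix(rnorm(n_probes * n_pairs, sd = 0.02), ncol = n_pairs)  # technical noise
    beta_rep2 <- pmin(pmax(beta_rep1 + noise, 0), 1)                       # replicate 2, clipped to [0, 1]
    mean_abs_diff <- mean(abs(beta_rep1 - beta_rep2))  # lower is better for a method
    ```

    Running this per normalization method and comparing the resulting mean |β| differences mirrors the evaluation strategy described above.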

  12. GEO ExpressionMatrixHandlingNormalization GSE32138

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GEO ExpressionMatrixHandlingNormalization GSE32138 [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/geo-expressionmatrixhandlingnormalization-gse32138
    Explore at:
    zip (8536153 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    • This dataset contains expression matrix handling and normalization results derived from GEO dataset GSE32138.
    • It includes raw gene expression values processed using standardized bioinformatics workflows.
    • The dataset demonstrates quantile normalization applied to microarray-based expression data.
    • It provides visualization outputs used to assess data distribution before and after normalization.
    • The goal of this dataset is to support reproducible analysis of GSE32138 preprocessing and quality control.
    • Researchers can use the files for practice in normalization, exploratory data analysis, and visualization.
    • This dataset is useful for learning microarray preprocessing techniques in R or Python.

  13. GC/MS Simulated Data Sets normalized using quantile normalization

    • dataverse.harvard.edu
    Updated Jan 25, 2017
    + more versions
    Cite
    Denise Scholtens (2017). GC/MS Simulated Data Sets normalized using quantile normalization [Dataset]. http://doi.org/10.7910/DVN/R3P9SS
    Explore at:
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Denise Scholtens
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    1000 simulated data sets, stored in a list of R data frames, used in support of Reisetter et al. (submitted), 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are the results after normalization with quantile normalization (Bolstad et al. 2003).
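    For reference, quantile normalization in the sense of Bolstad et al. (2003) can be written in a few lines of R. This is a generic sketch, not the code used to produce these data sets, and the input matrix is simulated.

    ```r
    # Generic quantile normalization (Bolstad et al. 2003): sort each column,
    # average across columns at each rank, and map the rank means back.
    quantile_normalize <- function(mat) {
      ranks <- apply(mat, 2, rank, ties.method = "first")
      rank_means <- rowMeans(apply(mat, 2, sort))
      apply(ranks, 2, function(r) rank_means[r])
    }

    set.seed(7)
    m <- matrix(rexp(20, rate = 0.1), ncol = 4)  # 5 features x 4 samples (simulated)
    qn <- quantile_normalize(m)
    # every column of qn now contains the same set of values
    ```

    After this transform, all samples share one common distribution, which is why the method is a standard baseline for cross-sample comparisons.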

  14. Additional file 3: of DBNorm: normalizing high-density oligonucleotide...

    • springernature.figshare.com
    txt
    Updated Nov 30, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy (2017). Additional file 3: of DBNorm: normalizing high-density oligonucleotide microarray data based on distributions [Dataset]. http://doi.org/10.6084/m9.figshare.5648932.v1
    Explore at:
    txt
    Dataset updated
    Nov 30, 2017
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DBNorm test script: code showing how we test the DBNorm package. (TXT 2 kb)

  15. Naturalistic Neuroimaging Database

    • openneuro.org
    Updated Apr 20, 2021
    + more versions
    Cite
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
    Explore at:
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    OpenNeuro: https://openneuro.org/
    Authors
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers with no history of neurological/psychiatric illnesses, no hearing impairments, unimpaired or corrected vision, and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10 minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.
    • The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

    v2.0 Changes

    • Overview
      • We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.
    • Normalization

      • Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through the implication: the resulting normalized timeseries amplitudes depend on run length, increasing as run length decreases (and maybe this should go in 3dDetrend's help text). To demonstrate this, I wrote an R version of 3dDetrend's -normalize so you can see for yourselves by running the following code:
      # Generate a resting state (rs) timeseries (ts)
      # Install / load package to make fake fMRI ts
      # install.packages("neuRosim")
      library(neuRosim)
      # Generate a ts
      ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
      # 3dDetrend -normalize
      # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
      # Do for the full timeseries
      ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
      # Do this again for a shorter version of the same timeseries
      ts.shorter.length <- length(ts.normalised.long)/4
      ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
      # By looking at the summaries, it can be seen that the values become larger in magnitude for the shorter timeseries
      summary(ts.normalised.long)
      summary(ts.normalised.short)
      # Plot results for the long and short ts
      # Truncate the longer ts for plotting only
      ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
      # Give the plot a title
      title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
      plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
      # Add zero line
      lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
      # 3dDetrend -normalize -polort 0 for long timeseries
      lines(ts.normalised.long.made.shorter, col='blue');
      # 3dDetrend -normalize -polort 0 for short timeseries
      lines(ts.normalised.short, col='red');
      
    • Standardization/modernization

      • The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.
    • New afni_proc.py command line

      • The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples; you can quickly understand our preprocessing decisions by scrutinising the command below. It is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html):

      afni_proc.py \
        -subj_id "$sub_id_name_1" \
        -blocks despike tshift align tlrc volreg mask blur scale regress \
        -radial_correlate_blocks tcat volreg \
        -copy_anat anatomical_warped/anatSS.1.nii.gz \
        -anat_has_skull no \
        -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
        -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
        -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
        -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
        -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
        -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
        -anat_follower_erode fsvent fswm \
        -dsets media_?.nii.gz \
        -tcat_remove_first_trs 8 \
        -tshift_opts_ts -tpattern alt+z2 \
        -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
        -tlrc_base "$basedset" \
        -tlrc_NL_warp \
        -tlrc_NL_warped_dsets \
          anatomical_warped/anatQQ.1.nii.gz \
          anatomical_warped/anatQQ.1.aff12.1D \
          anatomical_warped/anatQQ.1_WARP.nii.gz \
        -volreg_align_to MIN_OUTLIER \
        -volreg_post_vr_allin yes \
        -volreg_pvra_base_index MIN_OUTLIER \
        -volreg_align_e2a \
        -volreg_tlrc_warp \
        -mask_opts_automask -clfrac 0.10 \
        -mask_epi_anat yes \
        -blur_to_fwhm -blur_size $blur \
        -regress_motion_per_run \
        -regress_ROI_PC fsvent 3 \
        -regress_ROI_PC_per_run fsvent \
        -regress_make_corr_vols aeseg fsvent \
        -regress_anaticor_fast \
        -regress_anaticor_label fswm \
        -regress_censor_motion 0.3 \
        -regress_censor_outliers 0.1 \
        -regress_apply_mot_types demean deriv \
        -regress_est_blur_epits \
        -regress_est_blur_errts \
        -regress_run_clustsim no \
        -regress_polort 2 \
        -regress_bandpass 0.01 1 \
        -html_review_style pythonic

      We used similar command lines to generate ‘blurred and not censored’ and ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

      We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, our runs are quite long, averaging ~40 minutes, but run length is variable (thus leading to the above issue with 3dDetrend’s -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.

      Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:
      • Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
      • Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
      • For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
      • For censored data:
        • Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
        • If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
      In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.

    • Effect on results

      • From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
  16. Data from: hemaClass.org: Online One-By-One Microarray Normalization and...

    • datasetcatalog.nlm.nih.gov
    Updated Oct 5, 2016
    Young, Ken H.; Nielsen, Kasper Lindblad; Bilgrau, Anders Ellern; Brøndum, Rasmus Froberg; Johnsen, Hans Erik; El-Galaly, Tarec Christoffer; Have, Jonas; Dybkær, Karen; Schmitz, Alexander; Bøgsted, Martin; Falgreen, Steffen; Jakobsen, Lasse Hjort; Bødker, Julie Støve (2016). hemaClass.org: Online One-By-One Microarray Normalization and Classification of Hematological Cancers for Precision Medicine [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001602916
    Explore at:
    Dataset updated
    Oct 5, 2016
    Authors
    Young, Ken H.; Nielsen, Kasper Lindblad; Bilgrau, Anders Ellern; Brøndum, Rasmus Froberg; Johnsen, Hans Erik; El-Galaly, Tarec Christoffer; Have, Jonas; Dybkær, Karen; Schmitz, Alexander; Bøgsted, Martin; Falgreen, Steffen; Jakobsen, Lasse Hjort; Bødker, Julie Støve
    Description

    Background: Dozens of omics based cancer classification systems have been introduced with prognostic, diagnostic, and predictive capabilities. However, they often employ complex algorithms and are only applicable on whole cohorts of patients, making them difficult to apply in a personalized clinical setting. Results: This prompted us to create hemaClass.org, an online web application providing an easy interface to one-by-one RMA normalization of microarrays and subsequent risk classifications of diffuse large B-cell lymphoma (DLBCL) into cell-of-origin and chemotherapeutic sensitivity classes. Classification results for one-by-one array pre-processing with and without a laboratory specific RMA reference dataset were compared to cohort based classifiers in 4 publicly available datasets. Classifications showed high agreement between one-by-one and whole cohort pre-processed data when a laboratory specific reference set was supplied. The website is essentially the R-package hemaClass accompanied by a Shiny web application. The well-documented package can be used to run the website locally or to use the developed methods programmatically. Conclusions: The website and R-package is relevant for biological and clinical lymphoma researchers using Affymetrix U-133 Plus 2 arrays, as it provides reliable and swift methods for calculation of disease subclasses. The proposed one-by-one pre-processing method is relevant for all researchers using microarrays.
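    The one-by-one idea can be illustrated in plain R: instead of normalizing a whole cohort together, a single new array is mapped onto a stored reference distribution (e.g. from a lab-specific cohort). This is a conceptual sketch of that step only, not the hemaClass API; the function and variable names are invented for illustration:

    ```r
    # Conceptual sketch of one-by-one normalization against a stored
    # reference distribution (illustrative only; NOT the hemaClass API).
    normalize_one_by_one <- function(new_array, reference_distribution) {
      # rank the probes of the single new array...
      ranks <- rank(new_array, ties.method = "first")
      # ...and replace each value with the reference quantile of the same rank
      sort(reference_distribution)[ranks]
    }

    set.seed(1)
    ref <- rnorm(1000, mean = 8, sd = 2)  # stand-in for a lab-specific reference cohort
    x   <- rnorm(1000, mean = 6, sd = 1)  # one newly hybridized array
    x_norm <- normalize_one_by_one(x, ref)
    all.equal(sort(x_norm), sort(ref))    # TRUE: x now follows the reference distribution
    ```

    The point of the design is that the new array never influences the reference, so a single patient sample can be processed identically regardless of which other samples exist.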

  17. Scaling with ranked subsampling (SRS) algorithm for the normalization of...

    • repository.soilwise-he.eu
    Updated Jul 1, 2020
    (2020). Scaling with ranked subsampling (SRS) algorithm for the normalization of species count data. [Dataset]. https://repository.soilwise-he.eu/cat/collections/metadata:main/items/4b2b65c6-ff50-4669-99cc-ace343de3548
    Explore at:
    Dataset updated
    Jul 1, 2020
    Description

    Scaling with ranked subsampling (SRS) is an algorithm for the normalization of species count data in ecology. So far, SRS has successfully been applied to microbial community data. SRS is now available on CRAN: https://CRAN.R-project.org/package=SRS. An implementation of SRS in R is also available for download: https://metadata.bonares.de/smartEditor/rest/upload/ID_7049_2020_05_13_SRS_function_v1_0_R.zip

    SRS consists of two steps. In the first step, the counts for all OTUs (operational taxonomic units) are divided by a scaling factor chosen in such a way that the sum of the scaled counts (Cscaled with integer or non-integer values) equals Cmin. In the second step, the non-integer count values are converted into integers by an algorithm that we dub ranked subsampling. The scaled count Cscaled for each OTU is split into the integer part Cint by truncating the digits after the decimal separator (Cint = floor(Cscaled)) and the fractional part Cfrac (Cfrac = Cscaled - Cint). Since ΣCint ≤ Cmin, additional ∆C = Cmin - ΣCint counts have to be added to the library to reach the total count of Cmin. This is achieved as follows. OTUs are ranked in the descending order of their Cfrac values. Beginning with the OTU of the highest rank, a single count per OTU is added to the normalized library until the total number of added counts reaches ∆C and the sum of all counts in the normalized library equals Cmin. When the lowest Cfrac involved in picking ∆C counts is shared by several OTUs, the OTUs used for adding a single count to the library are selected in the order of their Cint values. This selection minimizes the effect of normalization on the relative frequencies of OTUs. OTUs with identical Cfrac as well as Cint are sampled randomly without replacement.
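    The two steps above can be sketched directly in R (a minimal illustration of the described procedure, not the CRAN SRS package, which is the reference implementation):

    ```r
    # Minimal R sketch of SRS as described above (illustrative only).
    srs <- function(counts, Cmin) {
      scaled <- counts * Cmin / sum(counts)  # step 1: scale so the sum equals Cmin
      c_int  <- floor(scaled)                # integer parts, sum(c_int) <= Cmin
      c_frac <- scaled - c_int               # fractional parts
      delta  <- Cmin - sum(c_int)            # remaining counts to distribute
      if (delta > 0) {
        # step 2 (ranked subsampling): rank OTUs by descending Cfrac, break
        # ties by Cint, and randomly for OTUs with identical Cfrac and Cint
        ord <- order(-c_frac, -c_int, sample(seq_along(counts)))
        top <- ord[seq_len(delta)]
        c_int[top] <- c_int[top] + 1
      }
      c_int
    }

    srs(c(60, 30, 12), Cmin = 10)  # returns c(6, 3, 1); the library now sums to 10
    ```

    In the example, the scaled counts are roughly (5.88, 2.94, 1.18); truncation yields (5, 2, 1) with ∆C = 2, and the two OTUs with the largest fractional parts each receive one additional count.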

  18. Novel R Pipeline for Analyzing Biolog Phenotypic Microarray Data

    • plos.figshare.com
    pdf
    Updated Jun 5, 2023
    Minna Vehkala; Mikhail Shubin; Thomas R Connor; Nicholas R Thomson; Jukka Corander (2023). Novel R Pipeline for Analyzing Biolog Phenotypic Microarray Data [Dataset]. http://doi.org/10.1371/journal.pone.0118392
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Minna Vehkala; Mikhail Shubin; Thomas R Connor; Nicholas R Thomson; Jukka Corander
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data produced by Biolog Phenotype MicroArrays are longitudinal measurements of cells’ respiration on distinct substrates. We introduce a three-step pipeline to analyze phenotypic microarray data with novel procedures for grouping, normalization and effect identification. Grouping and normalization are standard problems in the analysis of phenotype microarrays, defined as categorizing bacterial responses into active and non-active, and removing systematic errors from the experimental data, respectively. We expand existing solutions by introducing an important assumption that active and non-active bacteria manifest completely different metabolism and thus should be treated separately. Effect identification, in turn, provides new insights into detecting differing respiration patterns between experimental conditions, e.g. between different combinations of strains and temperatures, as not only the main effects but also their interactions can be evaluated. In the effect identification, the multilevel data are effectively processed by a hierarchical model in the Bayesian framework. The pipeline is tested on a data set of 12 phenotypic plates with bacterium Yersinia enterocolitica. Our pipeline is implemented in the R language on top of the opm R package and is freely available for research purposes.

  19. Data for A Systemic Framework for Assessing the Risk of Decarbonization to...

    • zenodo.org
    txt
    Updated Sep 18, 2025
    Soheil Shayegh; Soheil Shayegh; Giorgia Coppola; Giorgia Coppola (2025). Data for A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union [Dataset]. http://doi.org/10.5281/zenodo.17152310
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Soheil Shayegh; Soheil Shayegh; Giorgia Coppola; Giorgia Coppola
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 18, 2025
    Area covered
    European Union
    Description

    README — Code and data
    Project: LOCALISED

    Work Package 7, Task 7.1

    Paper: A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union

    What this repo does
    -------------------
    Builds the Transition‑Risk Index (TRI) for EU manufacturing at NUTS‑2 × NACE Rev.2, and reproduces the article’s Figures 3–6:
    • Exposure (emissions by region/sector)
    • Vulnerability (composite index)
    • Risk = Exposure ⊗ Vulnerability
    Outputs include intermediate tables, the final analysis dataset, and publication figures.

    Folder of interest
    ------------------
    Code and data/
    ├─ Code/ # R scripts (run in order 1A → 5)
    │ └─ Create Initial Data/ # scripts to (re)build Initial data/ from Eurostat API with imputation
    ├─ Initial data/ # Eurostat inputs imputed for missing values
    ├─ Derived data/ # intermediates
    ├─ Final data/ # final analysis-ready tables
    └─ Figures/ # exported figures

    Quick start
    -----------
    1) Open R (or RStudio) and set the working directory to “Code and data/Code”.
    Example: setwd(".../Code and data/Code")
    2) Initial data/ contains the required Eurostat inputs referenced by the scripts.
    To reproduce the inputs in Initial data/, run the scripts in Code/Create Initial Data/.
    These scripts download the required datasets from the respective API and impute missing values; outputs are written to ../Initial data/.
    3) Run scripts sequentially (they use relative paths to ../Raw data, ../Derived data, etc.):
    1A-non-sector-data.R → 1B-sector-data.R → 1C-all-data.R → 2-reshape-data.R → 3-normalize-data-by-n-enterpr.R → 4-risk-aggregation.R → 5A-results-maps.R, 5B-results-radar.R

    What each script does
    ---------------------
    Create Initial Data — Recreate inputs
    • Download source tables from the Eurostat API or the Localised DSP, apply light cleaning, and impute missing values.
    • Write the resulting inputs to Initial data/ for the analysis pipeline.

    1A / 1B / 1C — Build the unified base
    • Read individual Eurostat datasets (some sectoral, some only regional).
    • Harmonize, aggregate, and align them into a single analysis-ready schema.
    • Write aggregated outputs to Derived data/ (and/or Final data/ as needed).

    2 — Reshape and enrich
    • Reshapes the combined data and adds metadata.
    • Output: Derived data/2_All_data_long_READY.xlsx (all raw indicators in tidy long format, with indicator names and values).

    3 — Normalize (enterprises & min–max)
    • Divide selected indicators by number of enterprises.
    • Apply min–max normalization to [0.01, 0.99].
    • Exposure keeps real zeros (zeros remain zero).
    • Write normalized tables to Derived data/ or Final data/.

    4 — Aggregate indices
    • Vulnerability: build dimension scores (Energy, Labour, Finance, Supply Chain, Technology).
    – Within each dimension: equal‑weight mean of directionally aligned, [0.01,0.99]‑scaled indicators.
    – Dimension scores are re‑scaled to [0.01,0.99].
    • Aggregate Vulnerability: equal‑weight mean of the five dimensions.
    • TRI (Risk): combine Exposure (E) and Vulnerability (V) via a weighted geometric rule with α = 0.5 in the baseline.
    – Policy‑intuitive properties: high E & high V → high risk; imbalances penalized (non‑compensatory).
    • Output: Final data/ (main analysis tables).
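
    The two rules above can be sketched in a few lines of R (an illustration of the description, not the project scripts themselves; the baseline uses α = 0.5 and the [0.01, 0.99] bounds stated above):

    ```r
    # Illustrative sketch of the normalization (script 3) and aggregation
    # (script 4) rules described above.

    # Min–max normalization to [0.01, 0.99]
    minmax_scale <- function(x, lo = 0.01, hi = 0.99) {
      lo + (hi - lo) * (x - min(x, na.rm = TRUE)) /
        (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
    }

    # TRI: weighted geometric combination of Exposure (E) and Vulnerability (V).
    # High E and high V give high risk; imbalance between the two is
    # penalized (non-compensatory).
    tri <- function(E, V, alpha = 0.5) E^alpha * V^(1 - alpha)

    tri(0.9, 0.9)  # 0.9: balanced, high risk
    tri(0.9, 0.1)  # 0.3: imbalanced, below the arithmetic mean of 0.5
    ```

    The geometric rule is what makes the index non-compensatory: a region cannot offset very high exposure with low vulnerability (or vice versa) the way an arithmetic mean would allow.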

    5A / 5B — Visualize results
    • 5A: maps and distribution plots for Exposure, Vulnerability, and Risk → Figures 3 & 4.
    • 5B: comparative/radar profiles for selected countries/regions/subsectors → Figures 5 & 6.
    • Outputs saved to Figures/.

    Data flow (at a glance)
    -----------------------
    Initial data → (1A–1C) Aggregated base → (2) Tidy long file → (3) Normalized indicators → (4) Composite indices → (5) Figures
    • (1A–1C) write intermediates to Derived data/
    • (2) writes Derived data/2_All_data_long_READY.xlsx
    • (4)–(5) write Final data/ and Figures/

    Assumptions & conventions
    -------------------------
    • Geography: EU NUTS‑2 regions; Sector: NACE Rev.2 manufacturing subsectors.
    • Equal weights by default where no evidence supports alternatives.
    • All indicators directionally aligned so that higher = greater transition difficulty.
    • Relative paths assume working directory = Code/.

    Reproducing the article
    -----------------------
    • Optionally run the scripts in the Code/Create Initial Data subfolder
    • Run 1A → 5B without interruption to regenerate:
    – Figure 3: Exposure, Vulnerability, Risk maps (total manufacturing).
    – Figure 4: Vulnerability dimensions (Energy, Labour, Finance, Supply Chain, Technology).
    – Figure 5: Drivers of risk—highest vs. lowest risk regions (example: Germany & Greece).
    – Figure 6: Subsector case (e.g., basic metals) by selected regions.
    • Final tables for the paper live in Final data/. Figures export to Figures/.

    Requirements
    ------------
    • R (version per your environment).
    • Install any missing packages listed at the top of each script (e.g., install.packages("...")).

    Troubleshooting
    ---------------
    • “File not found”: check that the previous script finished and wrote its outputs to the expected folder.
    • Paths: confirm getwd() ends with /Code so relative paths resolve to ../Raw data, ../Derived data, etc.
    • Reruns: optionally clear Derived data/, Final data/, and Figures/ before a clean rebuild.

    Provenance & citation
    ---------------------
    • Inputs: Eurostat and related sources cited in the paper and headers of the scripts.
    • Methods: OECD composite‑indicator guidance; IPCC AR6 risk framing (see paper references).
    • If you use this code, please cite the article:
    A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union.

  20. Data from: Reference transcriptomics of porcine peripheral immune cells...

    • catalog.data.gov
    Updated Dec 2, 2025
    Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
    Explore at:
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

    Resources in this dataset (all titled 'Herrera-Uribe & Wiarda et al. PBMCs'):
    • All Cells 10X Format (PBMC7_AllCells.zip): zipped folder containing the PBMC counts matrix (matrix.mtx.gx), gene names (features.tsv.gz), and cell IDs (barcodes.tsv.gz). The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal; during ambient RNA removal we specified non-integer count estimations, so most gene counts are non-integer values, but they should still be treated as raw/unnormalized data requiring further normalization/transformation. Data can be read into R using the function Read10X().
    • All Cells Metadata (PBMC7_AllCells_meta.csv): metadata for cells included in the final dataset. Columns: nCount_RNA (number of transcripts detected in a cell), nFeature_RNA (number of genes detected in a cell), Loupe (cell barcodes, corresponding to the cell IDs in the .h5Seurat and 10X formatted objects), prcntMito (percent mitochondrial reads in a cell), Scrublet (doublet probability score assigned to a cell), seurat_clusters (cluster ID assigned to a cell), PaperIDs (sample ID for a cell), celltypes (cell type ID assigned to a cell).
    • All Cells PCA Coordinates (PBMC7_AllCells_PCAcoord.csv): first 100 PCA coordinates for cells.
    • All Cells t-SNE Coordinates (PBMC7_AllCells_tSNEcoord.csv): t-SNE coordinates for all cells.
    • All Cells UMAP Coordinates (PBMC7_AllCells_UMAPcoord.csv): UMAP coordinates for all cells.
    • CD4 T Cells t-SNE Coordinates (PBMC7_CD4only_tSNEcoord.csv) and UMAP Coordinates (PBMC7_CD4only_UMAPcoord.csv): coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from PBMC7_AllCells.h5Seurat, and the coordinates used in the publication can be re-assigned using these files.
    • Gamma Delta T Cells t-SNE Coordinates (PBMC7_GDonly_tSNEcoord.csv) and UMAP Coordinates (PBMC7_GDonly_UMAPcoord.csv): coordinates for only gamma delta T cells (clusters 6, 21, 24, 31), usable in the same way as the CD4 files.
    • Gene Annotation Information (UnfilteredGeneInfo.txt): gene nomenclature information used to assign gene names in the dataset; the 'Name' column corresponds to the name assigned to a feature.
    • All Cells H5Seurat (PBMC7.tar): .h5Seurat object of all cells in the PBMC dataset. The file needs to be untarred, then read into R using the function LoadH5Seurat().
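
    Assuming the Seurat and SeuratDisk packages are installed, the files above can be loaded roughly as follows (a sketch; adjust paths to wherever you extracted the archives, and note the file name inside PBMC7.tar may differ):

    ```r
    # Sketch: reading the provided resources into R (assumes the Seurat and
    # SeuratDisk packages; paths are illustrative).
    library(Seurat)

    # 10X-format counts (unzip PBMC7_AllCells.zip first)
    counts <- Read10X(data.dir = "PBMC7_AllCells/")
    pbmc   <- CreateSeuratObject(counts)

    # Per-cell metadata and embedding coordinates
    meta <- read.csv("PBMC7_AllCells_meta.csv")
    umap <- read.csv("PBMC7_AllCells_UMAPcoord.csv")

    # Full annotated object (untar PBMC7.tar first)
    library(SeuratDisk)
    pbmc7 <- LoadH5Seurat("PBMC7.h5Seurat")
    ```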
