100+ datasets found
  1. n

    Data from: A systematic evaluation of normalization methods and probe...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    University of Toronto
    Hospital for Sick Children
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2). Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethic protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01 and (4) removed probes if more than 5% of the samples having a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using out-of-band probes empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.

  2. Comparison of normalization approaches for gene expression studies completed...

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Farnoosh Abbas-Aghababazadeh; Qian Li; Brooke L. Fridley (2023). Comparison of normalization approaches for gene expression studies completed with high-throughput sequencing [Dataset]. http://doi.org/10.1371/journal.pone.0206312
    Explore at:
    tiffAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Farnoosh Abbas-Aghababazadeh; Qian Li; Brooke L. Fridley
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Normalization of RNA-Seq data has proven essential to ensure accurate inferences and replication of findings. Hence, various normalization methods have been proposed for various technical artifacts that can be present in high-throughput sequencing transcriptomic studies. In this study, we set out to compare the widely used library size normalization methods (UQ, TMM, and RLE) and across sample normalization methods (SVA, RUV, and PCA) for RNA-Seq data using publicly available data from The Cancer Genome Atlas (TCGA) cervical cancer study. Additionally, an extensive simulation study was completed to compare the performance of the across sample normalization methods in estimating technical artifacts. Lastly, we investigated the effect of reduction in degrees of freedom in the normalized data and their impact on downstream differential expression analysis results. Based on this study, the TMM and RLE library size normalization methods give similar results for CESC dataset. In addition, the simulated datasets results show that the SVA (“BE”) method outperforms the other methods (SVA “Leek”, PCA) by correctly estimating the number of latent artifacts. Moreover, ignoring the loss of degrees of freedom due to normalization results in an inflated type I error rates. We recommend adjusting not only for library size differences but also the assessment of known and unknown technical artifacts in the data, and if needed, complete across sample normalization. In addition, we suggest that one includes the known and estimated latent artifacts in the design matrix to correctly account for the loss in degrees of freedom, as opposed to completing the analysis on the post-processed normalized data.

  3. f

    Normalization methods and their properties.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    • +1more
    Updated Apr 17, 2013
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lu, Zhiyong; Hakala, Kai; Pyysalo, Sampo; Salakoski, Tapio; Ginter, Filip; Kao, Hung-Yu; Björne, Jari; Van Landeghem, Sofie; Van de Peer, Yves; Wei, Chih-Hsuan; Ananiadou, Sophia (2013). Normalization methods and their properties. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001706093
    Explore at:
    Dataset updated
    Apr 17, 2013
    Authors
    Lu, Zhiyong; Hakala, Kai; Pyysalo, Sampo; Salakoski, Tapio; Ginter, Filip; Kao, Hung-Yu; Björne, Jari; Van Landeghem, Sofie; Van de Peer, Yves; Wei, Chih-Hsuan; Ananiadou, Sophia
    Description

    The different normalization methods applied in this study, and whether or not they account for lexical variation, synonymy, orthology and species-specific resolution. By creating combinations of these algorithms, their individual strengths can be aggregated.

  4. A comparison of per sample global scaling and per gene normalization methods...

    • plos.figshare.com
    pdf
    Updated Jun 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xiaohong Li; Guy N. Brock; Eric C. Rouchka; Nigel G. F. Cooper; Dongfeng Wu; Timothy E. O’Toole; Ryan S. Gill; Abdallah M. Eteleeb; Liz O’Brien; Shesh N. Rai (2023). A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data [Dataset]. http://doi.org/10.1371/journal.pone.0176185
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Xiaohong Li; Guy N. Brock; Eric C. Rouchka; Nigel G. F. Cooper; Dongfeng Wu; Timothy E. O’Toole; Ryan S. Gill; Abdallah M. Eteleeb; Liz O’Brien; Shesh N. Rai
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, it remains a challenge to choose an optimal method due to multiple factors contributing to read count variability that affects the overall sensitivity and specificity. In order to properly determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines based on different dataset characteristics. Therefore, we set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic Curve (AUC), a specificity rate > 85%, the detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the top commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they trade off with a reduced specificity (

  5. d

    Data from: Evaluation of normalization procedures for oligonucleotide array...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    • +1more
    Updated Sep 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Institutes of Health (2025). Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls [Dataset]. https://catalog.data.gov/dataset/evaluation-of-normalization-procedures-for-oligonucleotide-array-data-based-on-spiked-crna
    Explore at:
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background Affymetrix oligonucleotide arrays simultaneously measure the abundances of thousands of mRNAs in biological samples. Comparability of array results is necessary for the creation of large-scale gene expression databases. The standard strategy for normalizing oligonucleotide array readouts has practical drawbacks. We describe alternative normalization procedures for oligonucleotide arrays based on a common pool of known biotin-labeled cRNAs spiked into each hybridization. Results We first explore the conditions for validity of the 'constant mean assumption', the key assumption underlying current normalization methods. We introduce 'frequency normalization', a 'spike-in'-based normalization method which estimates array sensitivity, reduces background noise and allows comparison between array designs. This approach does not rely on the constant mean assumption and so can be effective in conditions where standard procedures fail. We also define 'scaled frequency', a hybrid normalization method relying on both spiked transcripts and the constant mean assumption while maintaining all other advantages of frequency normalization. We compare these two procedures to a standard global normalization method using experimental data. We also use simulated data to estimate accuracy and investigate the effects of noise. We find that scaled frequency is as reproducible and accurate as global normalization while offering several practical advantages. Conclusions Scaled frequency quantitation is a convenient, reproducible technique that performs as well as global normalization on serial experiments with the same array design, while offering several additional features. Specifically, the scaled-frequency method enables the comparison of expression measurements across different array designs, yields estimates of absolute message abundance in cRNA and determines the sensitivity of individual arrays.

  6. f

    DataSheet1_TimeNorm: a novel normalization method for time course microbiome...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Sep 24, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei (2024). DataSheet1_TimeNorm: a novel normalization method for time course microbiome data.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001407445
    Explore at:
    Dataset updated
    Sep 24, 2024
    Authors
    An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei
    Description

    Metagenomic time-course studies provide valuable insights into the dynamics of microbial systems and have become increasingly popular alongside the reduction in costs of next-generation sequencing technologies. Normalization is a common but critical preprocessing step before proceeding with downstream analysis. To the best of our knowledge, currently there is no reported method to appropriately normalize microbial time-series data. We propose TimeNorm, a novel normalization method that considers the compositional property and time dependency in time-course microbiome data. It is the first method designed for normalizing time-series data within the same time point (intra-time normalization) and across time points (bridge normalization), separately. Intra-time normalization normalizes microbial samples under the same condition based on common dominant features. Bridge normalization detects and utilizes a group of most stable features across two adjacent time points for normalization. Through comprehensive simulation studies and application to a real study, we demonstrate that TimeNorm outperforms existing normalization methods and boosts the power of downstream differential abundance analysis.

  7. Data from: Profound effect of normalization on detection of differentially...

    • healthdata.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Jul 14, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis [Dataset]. https://healthdata.gov/NIH/Profound-effect-of-normalization-on-detection-of-d/her4-nexh
    Explore at:
    csv, xlsx, xmlAvailable download formats
    Dataset updated
    Jul 14, 2025
    Description

    A number of procedures for normalization and detection of differentially expressed genes have been proposed. Four different normalization methods and all possible combinations with three different statistical algorithms have been used for detection of differentially expressed genes on a dataset. The number of genes detected as differentially expressed differs by a factor of about three.

  8. d

    Data from: A new non-linear normalization method for reducing variability in...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    • +2more
    Updated Sep 7, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Institutes of Health (2025). A new non-linear normalization method for reducing variability in DNA microarray experiments [Dataset]. https://catalog.data.gov/dataset/a-new-non-linear-normalization-method-for-reducing-variability-in-dna-microarray-experimen
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    A simple and robust non-linear method is presented for normalization using array signal distribution analysis and cubic splines. Both the regression and spline-based methods described performed better than existing linear methods when assessed on the variability of replicate arrays

  9. f

    Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data...

    • frontiersin.figshare.com
    application/cdfv2
    Updated Jun 1, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao (2023). Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.doc [Dataset]. http://doi.org/10.3389/fgene.2019.00400.s001
    Explore at:
    application/cdfv2Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data normalization is a crucial step in the gene expression analysis as it ensures the validity of its downstream analyses. Although many metrics have been designed to evaluate the existing normalization methods, different metrics or different datasets by the same metric yield inconsistent results, particularly for the single-cell RNA sequencing (scRNA-seq) data. The worst situations could be that one method evaluated as the best by one metric is evaluated as the poorest by another metric, or one method evaluated as the best using one dataset is evaluated as the poorest using another dataset. Here raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose a principle that one normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics) and one method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). Then, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric mSCC to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings paved the way to guide future studies in the normalization of gene expression data with its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to select the best method for the normalization of their gene expression data based on the evaluation of different methods (particularly some data-driven methods or their own methods) in the principle of the consistency of metrics and the consistency of datasets.

  10. f

    Table_1_Comparison of Normalization Methods for Analysis of TempO-Seq...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jun 23, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bushel, Pierre R.; Ramaiahgari, Sreenivasa C.; Auerbach, Scott S.; Paules, Richard S.; Ferguson, Stephen S. (2020). Table_1_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.XLSX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000579045
    Explore at:
    Dataset updated
    Jun 23, 2020
    Authors
    Bushel, Pierre R.; Ramaiahgari, Sreenivasa C.; Auerbach, Scott S.; Paules, Richard S.; Ferguson, Stephen S.
    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven-fold change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well as did simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.

  11. d

    Methods for normalizing microbiome data: an ecological perspective

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Oct 30, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 30, 2018
    Dataset provided by
    Dryad
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    Time period covered
    Oct 19, 2018
    Description

    Simulation script 1This R script will simulate two populations of microbiome samples and compare normalization methods.Simulation script 2This R script will simulate two populations of microbiome samples and compare normalization methods via PcOAs.Sample.OTU.distributionOTU distribution used in the paper: Methods for normalizing microbiome data: an ecological perspective

  12. f

    Additional file 2: of Comparing the normalization methods for the...

    • datasetcatalog.nlm.nih.gov
    • springernature.figshare.com
    Updated Dec 15, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ryu, Keun; Piao, Yongjun; Shon, Ho; Li, Peipei (2016). Additional file 2: of Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001915419
    Explore at:
    Dataset updated
    Dec 15, 2016
    Authors
    Ryu, Keun; Piao, Yongjun; Shon, Ho; Li, Peipei
    Description

    Detailed spearman correlation coefficient results for all normalization methods. (XLSX 17Â kb)

  13. e

    Data from: A generic normalization method for proper quantification in...

    • ebi.ac.uk
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sandra Anjo, A generic normalization method for proper quantification in untargeted proteomics screening [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD009068
    Explore at:
    Authors
    Sandra Anjo
    Variables measured
    Proteomics
    Description

    The label-free quantitative mass spectrometry methods, in particular the SWATH-MS approach, have gained popularity and became a powerful technique for comparison of large datasets. In the present work, it is introduced the use of recombinant proteins as internal standards for untargeted label-free methods. The proposed internal standard strategy reveals a similar intragroup normalization capacity when compared with the most common normalization methods, with the additional advantage of maintaining the overall proteome changes between groups (which are lost using the methods referred above). Thus, proving to be able to maintain a good performance even when large qualitative and quantitative differences in sample composition are observed, such as the ones induced by biological regulation (as observed in secretome and other biofluids’ analyses) or by enrichment approaches (such as immunopurifications). Moreover, it corresponds to a cost-effective alternative, easier to implement than the current stable-isotope labeling internal standards, therefore being an appealing strategy for large quantitative screening, as clinical cohorts for biomarker discovery.

  14. f

    Normalization methods impact the number of significant genus-level...

    • datasetcatalog.nlm.nih.gov
    Updated Apr 23, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gibbons, Sean M.; Alm, Eric J.; Duvallet, Claire (2018). Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000653783
    Explore at:
    Dataset updated
    Apr 23, 2018
    Authors
    Gibbons, Sean M.; Alm, Eric J.; Duvallet, Claire
    Description

    Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases.

  15. d

    Data from: Normalization and analysis of DNA microarray data by...

    • catalog.data.gov
    • healthdata.gov
    • +1more
    Updated Sep 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Institutes of Health (2025). Normalization and analysis of DNA microarray data by self-consistency and local regression [Dataset]. https://catalog.data.gov/dataset/normalization-and-analysis-of-dna-microarray-data-by-self-consistency-and-local-regression
    Explore at:
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    A robust semi-parametric normalization technique has been developed, based on the assumption that the large majority of genes will not have their relative expression levels changed from one treatment group to the next, and on the assumption that departures of the response from linearity are small and slowly varying. The method was tested using data simulated under various error models and it performs well.

  16. A new non-linear normalization method for reducing variability in DNA...

    • healthdata.gov
    csv, xlsx, xml
    Updated Sep 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). A new non-linear normalization method for reducing variability in DNA microarray experiments - qasd-zvnh - Archive Repository [Dataset]. https://healthdata.gov/dataset/A-new-non-linear-normalization-method-for-reducing/gbaz-64pm
    Explore at:
    csv, xml, xlsxAvailable download formats
    Dataset updated
    Sep 10, 2025
    Description

    This dataset tracks the updates made on the dataset "A new non-linear normalization method for reducing variability in DNA microarray experiments" as a repository for previous versions of the data and metadata.

  17. dataset for applying normalization techniques

    • kaggle.com
    zip
    Updated Nov 4, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Akalya Subramanian (2020). dataset for applying normalization techniques [Dataset]. https://www.kaggle.com/akalyasubramanian/dataset-for-applying-normalization-techniques
    Explore at:
    zip(3155326 bytes)Available download formats
    Dataset updated
    Nov 4, 2020
    Authors
    Akalya Subramanian
    Description

    Dataset

    This dataset was created by Akalya Subramanian

    Contents

  18. m

    Data Normalization Method for Geo-Spatial Analysis on Ports

    • data.mendeley.com
    Updated Jun 6, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nazmus Sakib (2020). Data Normalization Method for Geo-Spatial Analysis on Ports [Dataset]. http://doi.org/10.17632/skn24jntn3.1
    Explore at:
    Dataset updated
    Jun 6, 2020
    Authors
    Nazmus Sakib
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Based on open access data, 79 Mediterranean passenger ports are analyzed to compare their infrastructure, hinterland accessibility and offered multi-modality. Comparative Geo-spatial analysis is also carried out by using the data normalization method in order to visualize the ports' performance on maps. These data driven comprehensive analytical results can bring added value to sustainable development policy and planning initiatives in the Mediterranean Region. The analyzed elements can be also contributed to the development of passenger port performance indicators. The empirical research methods used for the Mediterranean passenger ports can be replicated for transport nodes of any region around the world to determine their relative performance on selected criteria for improvement and planning.

    The Mediterranean passenger ports were initially categorizing into cruise and ferry ports. The cruise ports were identified from the member list of the Association for the Mediterranean Cruise Ports (MedCruise), representing more than 80% of the cruise tourism activities per country. The identified cruise ports were mapped by selecting the corresponding geo-referenced ports from the map layer developed by the European Marine Observation and Data Network (EMODnet). The United Nations (UN) Code for Trade and Transport Locations (LOCODE) was identified for each of the cruise ports as the common criteria to carry out the selection. The identified cruise ports not listed by the EMODnet were added to the geo-database by using under license the editing function of the ArcMap (version 10.1) geographic information system software. The ferry ports were identified from the open access industry initiative data provided by the Ferrylines, and were mapped in a similar way as the cruise ports (Figure 1).

    Based on the available data from the identified cruise ports, a database (see Table A1–A3) was created for a Mediterranean scale analysis. The ferry ports were excluded due to the unavailability of relevant information on selected criteria (Table 2). However, the cruise ports serving as ferry passenger ports were identified in order to maximize the scope of the analysis. Port infrastructure and hinterland accessibility data were collected from the recent statistical reports published by the MedCruise, which are a compilation of data provided by its individual member port authorities and the cruise terminal operators. Other supplementary sources were the European Sea Ports Organization (ESPO) and the Global Ports Holding, a cruise terminal operator with an established presence in the Mediterranean. Additionally, open access data sources (e.g. the Google Maps and Trip Advisor) were consulted in order to identify the multi-modal transports and bridge the data gaps on hinterland accessibility by measuring the approximate distances.

  19. d

    Data and Code for: \"Universal Adaptive Normalization Scale (AMIS):...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kravtsov, Gennady (2025). Data and Code for: \"Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System\" [Dataset]. http://doi.org/10.7910/DVN/BISM0N
    Explore at:
    Dataset updated
    Nov 15, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Kravtsov, Gennady
    Description

    Dataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System" Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. Includes educational performance data (student grades), economic statistics (World Bank GDP), and Python implementation of the AMIS algorithm with graphical interface. Contents: - Source data: educational grades and GDP statistics - AMIS normalization results (3, 5, 9, 17-point models) - Comparative analysis with linear normalization - Ready-to-use Python code for data processing Applications: - Educational data normalization and analysis - Economic indicators comparison - Development of unified metric systems - Methodology research in data scaling Technical info: Python code with pandas, numpy, scipy, matplotlib dependencies. Data in Excel format.

  20. d

    Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust...

    • datamed.org
    Updated Dec 27, 2016
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2016). Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust normalization of single-cell rna-seq data [Dataset]. https://datamed.org/display-item.php?repository=0008&id=5914e6815152c67771b63e90&query=H1-4
    Explore at:
    Dataset updated
    Dec 27, 2016
    Description

    Normalization of RNA-sequencing data is essential for accurate downstream inference, but the assumptions upon which most methods are based do not hold in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of scRNA-seq data. Overall design: Total 183 single cells (92 H1 cells, 91 H9 cells), sequenced twice, were used to evaluate SCnorm in normalizing single cell RNA-seq experiments. Total 48 bulk H1 samples were used to compare bulk and single cell properties. For single-cell RNA-seq, the identical single-cell indexed and fragmented cDNA were pooled at 96 cells per lane or at 24 cells per lane to test the effects of sequencing depth, resulting in approximately 1 million and 4 million mapped reads per cell in the two pooling groups, respectively.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v

Data from: A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data

Related Article
Explore at:
zipAvailable download formats
Dataset updated
May 30, 2023
Dataset provided by
Universidade de São Paulo
University of Toronto
Hospital for Sick Children
Authors
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
License

https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

Description

Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2). Methods

Study Participants and Samples

The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.

All individuals enrolled in the SABE cohort provided written consent, and the ethic protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

Blood Collection and Processing

Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

Characterization of DNA Methylation using the EPIC array

Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

Processing and Analysis of DNA Methylation Data

The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01 and (4) removed probes if more than 5% of the samples having a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using out-of-band probes empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

Normalization Methods Evaluated

The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.

Search
Clear search
Close search
Google apps
Main menu