100+ datasets found
  1. Robust RT-qPCR Data Normalization: Validation and Selection of Internal...

    • plos.figshare.com
    Updated May 31, 2023
    Cite
    Daijun Ling; Paul M. Salvaterra (2023). Robust RT-qPCR Data Normalization: Validation and Selection of Internal Reference Genes during Post-Experimental Data Analysis [Dataset]. http://doi.org/10.1371/journal.pone.0017762
    Explore at:
    Available download formats: tiff
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daijun Ling; Paul M. Salvaterra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reverse transcription followed by real-time PCR (RT-qPCR) is widely used for rapid quantification of relative gene expression. To offset confounding technical variation, stably expressed internal reference genes are measured alongside target genes for data normalization. Statistical methods have been developed for reference validation; however, normalization of RT-qPCR data remains somewhat arbitrary because particular reference genes are chosen before the experiment. To establish a method for determining the most stable normalization factor (NF) across samples for robust data normalization, we measured the expression of 20 candidate reference genes and 7 target genes in 15 Drosophila head cDNA samples using RT-qPCR. The 20 reference genes exhibit sample-specific variation in their expression stability. Unexpectedly, the NF variation across samples does not decrease continuously as more reference genes are included pairwise, suggesting that either too few or too many reference genes may compromise the robustness of data normalization. The optimal number of reference genes, predicted by the minimal and most stable NF variation, ranges from 1 to more than 10 depending on the sample set. We also found that GstD1, InR and Hsp70 expression exhibits an age-dependent increase in fly heads; however, their relative expression levels are significantly affected by NFs computed from different numbers of reference genes. Because reference gene stability is highly data-dependent, RT-qPCR reference genes should be validated and selected during post-experimental data analysis rather than determined before the experiment.
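
    The NF described above is conventionally the geometric mean of the selected reference genes' expression values, and its variation across samples is what is minimized when choosing how many genes to include. A minimal sketch of that computation (the function names and the stability measure are illustrative assumptions, not the paper's code):

```python
import numpy as np

def normalization_factor(ref_expr):
    """Per-sample normalization factor (NF) as the geometric mean of
    reference-gene expression values.

    ref_expr: 2-D array, rows = reference genes, columns = samples.
    """
    return np.exp(np.mean(np.log(ref_expr), axis=0))

def nf_stability(ref_expr):
    """Coefficient of variation of the NF across samples.

    Lower values indicate a more stable normalization factor; comparing
    this across candidate gene subsets mimics the search for an optimal
    number of reference genes (illustrative measure only).
    """
    nf = normalization_factor(ref_expr)
    return float(np.std(nf) / np.mean(nf))
```

    Re-evaluating `nf_stability` as genes are added or dropped is the post-experimental selection step the abstract argues for.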

  2. Data from: LVMED: Dataset of Latvian text normalisation samples for the...

    • repository.clarin.lv
    Updated May 30, 2023
    Cite
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
    Explore at:
    Dataset updated
    May 30, 2023
    Authors
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

    Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.

    All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
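
    The dataset is meant for learned text-to-text models, but the task itself can be illustrated with a toy dictionary-based expander (the lexicon below is an invented English stand-in, not part of LVMED):

```python
# Toy abbreviation lexicon: an invented English stand-in for illustration.
# LVMED itself pairs Latvian medical sentences with their expansions and is
# intended for training a learned text-to-text model, not a lookup table.
ABBREVIATIONS = {
    "approx.": "approximately",
    "mg": "milligrams",
    "dr.": "doctor",
}

def normalize_sentence(sentence: str) -> str:
    """Rewrite a sentence by expanding every known abbreviation in place."""
    tokens = sentence.split()
    return " ".join(ABBREVIATIONS.get(tok.lower(), tok) for tok in tokens)
```

    A learned model generalizes where a lookup table cannot, e.g. to inflected word forms, which is exactly why the dataset provides full sentence pairs.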

  3. Data from: A systematic evaluation of normalization methods and probe...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Hospital for Sick Children
    University of Toronto
    Universidade de São Paulo
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Background

    The Infinium EPIC array measures the methylation status of more than 850,000 CpG sites. The EPIC BeadChip uses two probe designs, Infinium Type I and Type II. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe-type bias as well as other issues such as background and dye bias.

    Methods

    This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.

    Results

    The method we define as SeSAMe 2, which consists of the regular SeSAMe pipeline with an additional round of QC via pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson’s correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poorly performing probes have beta values close to either 0 or 1 and relatively low standard deviations. These results suggest that poor probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point) after discontinuation of the equipment, but with the same commercial reagents. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing data are also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes included on the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values from the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the probes flagged in the two analyses were combined and removed from the data.

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This was followed by further removal of probes that did not pass the previous QC and had not already been removed by pOOBAH; SeSAMe 2 therefore has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effect of the different normalization methods on the absolute difference in beta values (|Δβ|) between replicated samples.
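
    The first evaluation metric, the absolute beta-value difference between replicate pairs, reduces to a simple computation; a sketch under the assumption that beta values are aligned probe-by-probe (not the authors' code):

```python
import numpy as np

def mean_abs_beta_diff(beta_a, beta_b):
    """Mean absolute beta-value difference between two replicate arrays.

    beta_a, beta_b: 1-D arrays of beta values, one entry per probe, in the
    same probe order. Smaller values indicate better replicate agreement
    after a given normalization method has been applied.
    """
    beta_a = np.asarray(beta_a, dtype=float)
    beta_b = np.asarray(beta_b, dtype=float)
    return float(np.mean(np.abs(beta_a - beta_b)))
```

    Computing this for each of the 16 replicate pairs under each normalization method gives the comparison described above.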

  4. Data from: proteiNorm – A User-Friendly Tool for Normalization and Analysis...

    • datasetcatalog.nlm.nih.gov
    Updated Sep 30, 2020
    + more versions
    Cite
    Byrd, Alicia K; Zafar, Maroof K; Graw, Stefan; Tang, Jillian; Byrum, Stephanie D; Peterson, Eric C.; Bolden, Chris (2020). proteiNorm – A User-Friendly Tool for Normalization and Analysis of TMT and Label-Free Protein Quantification [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000568582
    Explore at:
    Dataset updated
    Sep 30, 2020
    Authors
    Byrd, Alicia K; Zafar, Maroof K; Graw, Stefan; Tang, Jillian; Byrum, Stephanie D; Peterson, Eric C.; Bolden, Chris
    Description

    The technological advances in mass spectrometry allow us to collect more comprehensive data with higher quality and increasing speed. With the rapidly increasing amount of data generated, the need for streamlining analyses becomes more apparent. Proteomics data are often affected by systematic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed “proteiNorm”. The current implementation of proteiNorm accommodates preliminary filters on the peptide and sample levels, followed by an evaluation of several popular normalization methods and visualization of missing values. The user then selects an adequate normalization method and one of several imputation methods for the subsequent comparison of different differential expression methods and estimation of statistical power. The application of proteiNorm and the interpretation of its results are demonstrated on two tandem mass tag multiplex (TMT6plex and TMT10plex) and one label-free spike-in mass spectrometry example data sets. The three data sets reveal how normalization methods perform differently on different experimental designs, and hence the need to evaluate normalization methods for each mass spectrometry experiment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential expression analysis.

  5. Example of normalizing the word ‘foooooooooood’ and ‘welllllllllllll’ using...

    • plos.figshare.com
    Updated Mar 21, 2024
    Cite
    Zainab Mansur; Nazlia Omar; Sabrina Tiun; Eissa M. Alshari (2024). Example of normalizing the word ‘foooooooooood’ and ‘welllllllllllll’ using the proposed method and four other normalization methods. [Dataset]. http://doi.org/10.1371/journal.pone.0299652.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 21, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Zainab Mansur; Nazlia Omar; Sabrina Tiun; Eissa M. Alshari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example of normalizing the word ‘foooooooooood’ and ‘welllllllllllll’ using the proposed method and four other normalization methods.
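
    A common baseline for normalizing such elongated words is a regular expression that collapses character runs; a minimal sketch of that generic baseline (not the paper's proposed method, which the table compares against four others):

```python
import re

def squeeze_repeats(word: str, max_run: int = 2) -> str:
    """Collapse any run of a repeated character down to max_run copies.

    A generic rule-based baseline for elongated-word normalization; it
    cannot choose between e.g. 'food' and 'fod' for an elongated 'foood',
    which is where more sophisticated normalization methods come in.
    """
    return re.sub(r"(.)\1{%d,}" % max_run, r"\1" * max_run, word)
```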

  6. File S1 - Normalization of RNA-Sequencing Data from Samples with Varying...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Feb 25, 2014
    Cite
    Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan (2014). File S1 - Normalization of RNA-Sequencing Data from Samples with Varying mRNA Levels [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001266682
    Explore at:
    Dataset updated
    Feb 25, 2014
    Authors
    Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan
    Description

    Table S1 and Figures S1–S6. Table S1. List of primers: forward and reverse primers used for qPCR. Figure S1. Changes in total and polyA+ RNA during development. a) Amount of total RNA per embryo at different developmental stages. b) Amount of polyA+ RNA per 100 embryos at different developmental stages. Vertical bars represent standard errors. Figure S2. The TMM scaling factor. a) The TMM scaling factor estimated using datasets 1 and 2. We observe very similar values. b) The TMM scaling factor obtained using the replicates in dataset 2. The TMM values are very reproducible. c) The TMM scaling factor when RNA-seq data based on total RNA were used. Figure S3. Comparison of scales. We either square-root transformed the scales or used them directly, and compared the normalized fold-changes to RT-qPCR results. a) Transcripts with dynamic change pre-ZGA. b) Transcripts with decreased abundance post-ZGA. c) Transcripts with increased expression post-ZGA. Vertical bars represent standard deviations. Figure S4. Comparison of RT-qPCR results depending on RNA template (total or polyA+ RNA) and primers (random or oligo(dT) primers) for setd3 (a), gtf2e2 (b) and yy1a (c). The increase pre-ZGA is dependent on template (setd3 and gtf2e2), not primer type. Figure S5. Efficiency-calibrated fold-changes for a subset of transcripts. Vertical bars represent standard deviations. Figure S6. Comparison of normalization methods using dataset 2 for transcripts with decreased expression post-ZGA (a) and increased expression post-ZGA (b). Vertical bars represent standard deviations. (PDF)

  7. A comparison of per sample global scaling and per gene normalization methods...

    • plos.figshare.com
    Updated Jun 5, 2023
    Cite
    Xiaohong Li; Guy N. Brock; Eric C. Rouchka; Nigel G. F. Cooper; Dongfeng Wu; Timothy E. O’Toole; Ryan S. Gill; Abdallah M. Eteleeb; Liz O’Brien; Shesh N. Rai (2023). A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data [Dataset]. http://doi.org/10.1371/journal.pone.0176185
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Xiaohong Li; Guy N. Brock; Eric C. Rouchka; Nigel G. F. Cooper; Dongfeng Wu; Timothy E. O’Toole; Ryan S. Gill; Abdallah M. Eteleeb; Liz O’Brien; Shesh N. Rai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, it remains a challenge to choose an optimal one, because multiple factors contribute to read count variability and affect the overall sensitivity and specificity. To properly determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines on datasets with different characteristics. We therefore set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic curve (AUC), a specificity rate > 85%, detection power > 92% and an actual false discovery rate (FDR) under 0.06 at a nominal FDR of ≤ 0.05. Although the top commonly used methods (DESeq and TMM-edgeR) yield a higher power (> 93%) for MAQC2 data, they trade off with a reduced specificity.
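
    The two-stage idea behind Med-pgQ2 and UQ-pgQ2 (per-sample global scaling followed by a per-gene rescaling) can be sketched roughly as follows; the published pgQ2 formula differs in detail, so treat this as an illustration of the structure only:

```python
import numpy as np

def uq_scale(counts):
    """Stage 1: per-sample upper-quartile (75th percentile) global scaling.

    counts: 2-D array of read counts, rows = genes, columns = samples.
    """
    uq = np.percentile(counts, 75, axis=0)   # one scaling factor per sample
    return counts / uq * np.mean(uq)         # bring samples to a common level

def per_gene_scale(scaled):
    """Stage 2 (sketch): divide each gene by its own mean so conditions are
    compared on similar count levels. The actual pgQ2 divisor in the paper
    is different; this only illustrates the per-gene rescaling idea."""
    gene_means = scaled.mean(axis=1, keepdims=True)
    return scaled / np.where(gene_means == 0.0, 1.0, gene_means)
```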

  8. Example data of fusion features and growth indicators after Z-Score...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 20, 2025
    Cite
    Shen, Yingming; Wu, Mengyao; Tian, Peng; Qian, Ye; Li, Zhaowen; Sun, Jihong; Zhao, Jiawei; Wang, Xinrui (2025). Example data of fusion features and growth indicators after Z-Score normalization. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002086355
    Explore at:
    Dataset updated
    May 20, 2025
    Authors
    Shen, Yingming; Wu, Mengyao; Tian, Peng; Qian, Ye; Li, Zhaowen; Sun, Jihong; Zhao, Jiawei; Wang, Xinrui
    Description

    Example data of fusion features and growth indicators after Z-Score normalization.
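
    Z-Score normalization, as applied to these fusion features and growth indicators, centers each feature and scales it to unit variance; a minimal sketch:

```python
import numpy as np

def z_score(feature):
    """Z-Score normalization: subtract the mean and divide by the standard
    deviation, giving the feature mean 0 and unit variance."""
    feature = np.asarray(feature, dtype=float)
    return (feature - feature.mean()) / feature.std()
```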

  9. GSE206848 Data Normalization and Subtype Analysis

    • kaggle.com
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE206848 Data Normalization and Subtype Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse206848-data-normalization-and-subtype-analysis
    Explore at:
    Available download formats: zip (2631363 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset for human osteoarthritis (OA): microarray gene expression (Affymetrix GPL570).

    Contains expression data for 7 healthy control (normal) tissue samples and 7 osteoarthritis patient tissue samples from synovial/joint tissue.

    Pre-processed (background correction, log-transformation, normalization) to remove technical variation.

    Suitable for downstream analyses: differential gene expression (normal vs OA), subtype- or phenotype-based classification, machine learning.

    Can act as a validation dataset when combined with other GEO datasets to increase sample size or test reproducibility.

    Useful for biomarker discovery, pathway enrichment analysis (e.g., GO, KEGG), immune infiltration analysis, and subtype analysis in osteoarthritis research.

  10. DataSheet1_TimeNorm: a novel normalization method for time course microbiome...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Sep 24, 2024
    Cite
    An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei (2024). DataSheet1_TimeNorm: a novel normalization method for time course microbiome data.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001407445
    Explore at:
    Dataset updated
    Sep 24, 2024
    Authors
    An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei
    Description

    Metagenomic time-course studies provide valuable insights into the dynamics of microbial systems and have become increasingly popular alongside the reduction in costs of next-generation sequencing technologies. Normalization is a common but critical preprocessing step before proceeding with downstream analysis. To the best of our knowledge, currently there is no reported method to appropriately normalize microbial time-series data. We propose TimeNorm, a novel normalization method that considers the compositional property and time dependency in time-course microbiome data. It is the first method designed for normalizing time-series data within the same time point (intra-time normalization) and across time points (bridge normalization), separately. Intra-time normalization normalizes microbial samples under the same condition based on common dominant features. Bridge normalization detects and utilizes a group of most stable features across two adjacent time points for normalization. Through comprehensive simulation studies and application to a real study, we demonstrate that TimeNorm outperforms existing normalization methods and boosts the power of downstream differential abundance analysis.

  11. Hospital Management System

    • kaggle.com
    Updated Jun 9, 2025
    Cite
    Muhammad Shamoon Butt (2025). Hospital Management System [Dataset]. https://www.kaggle.com/mshamoonbutt/hospital-management-system
    Explore at:
    Available download formats: zip (1049391 bytes)
    Dataset updated
    Jun 9, 2025
    Authors
    Muhammad Shamoon Butt
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Hospital Management System project features a fully normalized relational database designed to manage hospital data including patients, doctors, appointments, diagnoses, medications, and billing. The schema applies database normalization (1NF, 2NF, 3NF) to reduce redundancy and maintain data integrity, providing an efficient, scalable structure for healthcare data management. Included are SQL scripts to create tables and insert sample data, making it a useful resource for learning practical database design and normalization in a healthcare context.
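
    To illustrate what such a normalized design looks like in practice, here is a small hypothetical fragment using Python's built-in sqlite3 (the table and column names are assumptions for this sketch, not the project's actual schema):

```python
import sqlite3

# Illustrative 3NF fragment (table and column names are assumptions made for
# this sketch). Patient and doctor facts each live in exactly one table;
# appointments reference them by key instead of repeating their details,
# which removes the redundancy that normalization (1NF-3NF) targets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE doctor (
    doctor_id  INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE appointment (
    appt_id    INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    doctor_id  INTEGER NOT NULL REFERENCES doctor(doctor_id),
    appt_date  TEXT NOT NULL
);
""")
conn.execute("INSERT INTO patient VALUES (1, 'A. Example')")
conn.execute("INSERT INTO doctor VALUES (1, 'Dr. Example')")
conn.execute("INSERT INTO appointment VALUES (1, 1, 1, '2025-01-01')")
# Joins reassemble the full picture from the normalized tables.
rows = conn.execute("""
    SELECT p.name, d.name, a.appt_date
    FROM appointment a
    JOIN patient p ON p.patient_id = a.patient_id
    JOIN doctor d  ON d.doctor_id  = a.doctor_id
""").fetchall()
```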

  12. Example of normalizing the word ‘aaaaaaannnnnndddd’ using the proposed...

    • plos.figshare.com
    Updated Mar 21, 2024
    Cite
    Zainab Mansur; Nazlia Omar; Sabrina Tiun; Eissa M. Alshari (2024). Example of normalizing the word ‘aaaaaaannnnnndddd’ using the proposed method and four other normalization methods. [Dataset]. http://doi.org/10.1371/journal.pone.0299652.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 21, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Zainab Mansur; Nazlia Omar; Sabrina Tiun; Eissa M. Alshari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example of normalizing the word ‘aaaaaaannnnnndddd’ using the proposed method and four other normalization methods.

  13. Comparison of normalization approaches for gene expression studies completed...

    • plos.figshare.com
    Updated Jun 1, 2023
    Cite
    Farnoosh Abbas-Aghababazadeh; Qian Li; Brooke L. Fridley (2023). Comparison of normalization approaches for gene expression studies completed with high-throughput sequencing [Dataset]. http://doi.org/10.1371/journal.pone.0206312
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Farnoosh Abbas-Aghababazadeh; Qian Li; Brooke L. Fridley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Normalization of RNA-Seq data has proven essential to ensure accurate inferences and replication of findings. Hence, various normalization methods have been proposed to address the technical artifacts that can be present in high-throughput sequencing transcriptomic studies. In this study, we set out to compare the widely used library size normalization methods (UQ, TMM, and RLE) and across-sample normalization methods (SVA, RUV, and PCA) for RNA-Seq data, using publicly available data from The Cancer Genome Atlas (TCGA) cervical cancer (CESC) study. Additionally, an extensive simulation study was completed to compare the performance of the across-sample normalization methods in estimating technical artifacts. Lastly, we investigated the effect of the reduction in degrees of freedom in the normalized data and its impact on downstream differential expression analysis results. Based on this study, the TMM and RLE library size normalization methods give similar results for the CESC dataset. In addition, the simulation results show that the SVA (“BE”) method outperforms the other methods (SVA “Leek”, PCA) by correctly estimating the number of latent artifacts. Moreover, ignoring the loss of degrees of freedom due to normalization results in inflated type I error rates. We recommend adjusting not only for library size differences but also assessing known and unknown technical artifacts in the data and, if needed, completing across-sample normalization. In addition, we suggest including the known and estimated latent artifacts in the design matrix to correctly account for the loss in degrees of freedom, as opposed to completing the analysis on the post-processed normalized data.
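
    Of the library size methods compared, RLE (the DESeq approach) assigns each sample the median ratio of its counts to the per-gene geometric means; a bare-bones sketch that assumes strictly positive counts (real implementations handle zeros and add further details):

```python
import numpy as np

def rle_size_factors(counts):
    """RLE (DESeq-style) size factors: for each sample, the median across
    genes of the ratio between its count and that gene's geometric mean.

    counts: 2-D array of strictly positive counts, rows = genes,
    columns = samples (genes with zero counts must be filtered out first).
    """
    counts = np.asarray(counts, dtype=float)
    log_geo_mean = np.mean(np.log(counts), axis=1, keepdims=True)
    log_ratios = np.log(counts) - log_geo_mean
    return np.exp(np.median(log_ratios, axis=0))
```

    Dividing each sample's counts by its size factor puts libraries of different depths on a common scale before differential expression testing.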

  14. Methods for normalizing microbiome data: an ecological perspective

    • datadryad.org
    • data.niaid.nih.gov
    Updated Oct 30, 2018
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 30, 2018
    Dataset provided by
    Dryad
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    Time period covered
    Oct 19, 2018
    Description

    Simulation script 1: an R script that simulates two populations of microbiome samples and compares normalization methods. Simulation script 2: an R script that simulates two populations of microbiome samples and compares normalization methods via PCoAs. Sample.OTU.distribution: the OTU distribution used in the paper “Methods for normalizing microbiome data: an ecological perspective”.

  15. Arabic OCR Project Dataset

    • kaggle.com
    Updated Nov 2, 2025
    Cite
    Yousef Gomaa (2025). Arabic OCR Project Dataset [Dataset]. https://www.kaggle.com/datasets/yousefgomaa43/arabic-ocr-project-dataset
    Explore at:
    Available download formats: zip (1873466285 bytes)
    Dataset updated
    Nov 2, 2025
    Authors
    Yousef Gomaa
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Summary

    Arabic handwritten paragraph dataset to be used for text normalization and generation using conditional deep generative models, such as:

    • Conditional Variational Autoencoder (CVAE)
    • Conditional Generative Adversarial Network (cGAN) (any GAN variant such as Pix2Pix, CycleGAN, or StyleGAN2)
    • Transformer-based generator (e.g., Vision Transformer with autoregressive decoding or text-to-image transformer)

    Data Example

    Sample page image (43.jpg): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F17351483%2Fe1f10b4e62e5186c26dbe1f6741e3bdc%2F43.jpg?generation=1761401307913748&alt=media

    Usage Examples

    1. Preprocessing & Data Analysis:

    • Dataset exploration and cleaning
    • Character/word-level segmentation and normalization
    • Data augmentation (e.g., rotation, distortion)

    2. Model Implementation:

    • Model 1: Conditional Variational Autoencoder (CVAE)
    • Model 2: Conditional GAN (any variant such as Pix2Pix or StyleGAN2)
    • Model 3: Transformer-based handwriting generator

    3. Evaluation:

    Quantitative metrics:

    • FID (Fréchet Inception Distance)
    • SSIM (Structural Similarity Index)
    • PSNR (Peak Signal-to-Noise Ratio)

    Qualitative comparison:

    • Visual quality and handwriting consistency
    • Accuracy in representing Arabic characters and diacritics
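Of the quantitative metrics listed, PSNR is the simplest to state precisely: PSNR = 10 · log10(MAX² / MSE). A minimal sketch, using hypothetical toy image patches rather than real dataset samples:

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy 8-bit "handwriting" patches (stand-ins for real dataset images).
ref = np.full((64, 64), 200, dtype=np.uint8)
gen = ref.copy()
gen[0, 0] = 190  # a single-pixel error

score = psnr(ref, gen)
```

Higher PSNR means the generated image is closer to the reference; FID and SSIM complement it by capturing distributional and structural similarity, respectively.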
  16. Data from: Evaluation of normalization procedures for oligonucleotide array...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    • +1more
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls [Dataset]. https://catalog.data.gov/dataset/evaluation-of-normalization-procedures-for-oligonucleotide-array-data-based-on-spiked-crna
    Explore at:
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background

    Affymetrix oligonucleotide arrays simultaneously measure the abundances of thousands of mRNAs in biological samples. Comparability of array results is necessary for the creation of large-scale gene expression databases. The standard strategy for normalizing oligonucleotide array readouts has practical drawbacks. We describe alternative normalization procedures for oligonucleotide arrays based on a common pool of known biotin-labeled cRNAs spiked into each hybridization.

    Results

    We first explore the conditions for validity of the 'constant mean assumption', the key assumption underlying current normalization methods. We introduce 'frequency normalization', a 'spike-in'-based normalization method which estimates array sensitivity, reduces background noise and allows comparison between array designs. This approach does not rely on the constant mean assumption and so can be effective in conditions where standard procedures fail. We also define 'scaled frequency', a hybrid normalization method relying on both spiked transcripts and the constant mean assumption while maintaining all other advantages of frequency normalization. We compare these two procedures to a standard global normalization method using experimental data. We also use simulated data to estimate accuracy and investigate the effects of noise. We find that scaled frequency is as reproducible and accurate as global normalization while offering several practical advantages.

    Conclusions

    Scaled frequency quantitation is a convenient, reproducible technique that performs as well as global normalization on serial experiments with the same array design, while offering several additional features. Specifically, the scaled-frequency method enables the comparison of expression measurements across different array designs, yields estimates of absolute message abundance in cRNA and determines the sensitivity of individual arrays.
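The 'frequency normalization' idea, estimating an array's sensitivity from spiked cRNA controls of known abundance and then converting raw intensities into absolute frequencies, can be sketched as follows. The calibration values and the least-squares fit through the origin are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Hypothetical spike-in calibration: known cRNA frequencies (ppm) and the
# intensities they produced on one array. Values are illustrative only.
spike_freq = np.array([3.0, 10.0, 30.0, 100.0, 300.0])       # known abundances
spike_intensity = np.array([45.0, 160.0, 430.0, 1500.0, 4400.0])

# Estimate array sensitivity as the slope of a least-squares fit through
# the origin: intensity ≈ sensitivity * frequency.
sensitivity = (spike_intensity @ spike_freq) / (spike_freq @ spike_freq)

# Convert raw gene intensities on the same array into estimated
# absolute frequencies using that per-array sensitivity.
gene_intensity = np.array([220.0, 950.0, 3100.0])
gene_freq = gene_intensity / sensitivity
```

Because each array gets its own sensitivity estimate from the same spike-in pool, frequencies are comparable across arrays (and array designs) without assuming a constant mean signal.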

  17. Multi-OEM VRF Data Normalization Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 4, 2025
    Cite
    Growth Market Reports (2025). Multi-OEM VRF Data Normalization Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/multi-oem-vrf-data-normalization-market
    Explore at:
    csv, pdf, pptxAvailable download formats
    Dataset updated
    Oct 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Multi-OEM VRF Data Normalization Market Outlook



    According to our latest research, the global Multi-OEM VRF Data Normalization market size reached USD 1.14 billion in 2024, with a robust year-on-year growth trajectory. The market is expected to expand at a CAGR of 12.6% during the forecast period, reaching a projected value of USD 3.38 billion by 2033. This impressive growth is primarily fueled by the increasing adoption of Variable Refrigerant Flow (VRF) systems across multiple sectors, the proliferation of multi-OEM environments, and the rising demand for seamless data integration and analytics within building management systems. The market’s expansion is further supported by advancements in IoT, AI-driven analytics, and the urgent need for energy-efficient HVAC solutions worldwide.




    One of the primary growth drivers for the Multi-OEM VRF Data Normalization market is the rapid digital transformation in the HVAC industry. Organizations are increasingly deploying VRF systems from multiple original equipment manufacturers (OEMs) to optimize performance, reduce costs, and future-proof their infrastructure. However, the lack of standardization in data formats across different OEMs presents significant integration challenges. Data normalization solutions bridge this gap by ensuring interoperability, enabling seamless aggregation, and facilitating advanced analytics for predictive maintenance and energy optimization. As facilities managers and building operators seek to harness actionable insights from disparate VRF systems, the demand for sophisticated data normalization platforms continues to rise, driving sustained market growth.




    Another significant factor propelling market expansion is the growing emphasis on energy efficiency and sustainability. Regulatory mandates and green building certifications are pushing commercial, industrial, and residential end-users to adopt smart HVAC solutions that minimize energy consumption and carbon emissions. Multi-OEM VRF Data Normalization platforms play a pivotal role in this transition by enabling real-time monitoring, granular energy management, and automated system optimization across heterogeneous VRF networks. The ability to consolidate and analyze operational data from multiple sources not only enhances system reliability and occupant comfort but also helps organizations achieve compliance with stringent environmental standards, further fueling market adoption.




    The proliferation of cloud computing, IoT connectivity, and AI-powered analytics is also transforming the Multi-OEM VRF Data Normalization landscape. Cloud-based deployment models offer unparalleled scalability, remote accessibility, and cost-efficiency, making advanced data normalization solutions accessible to a broader spectrum of users. Meanwhile, the integration of AI and machine learning algorithms enables predictive maintenance, anomaly detection, and automated fault diagnosis, reducing downtime and optimizing lifecycle costs. As more organizations recognize the strategic value of unified, normalized VRF data, investments in next-generation data normalization platforms are expected to accelerate, driving innovation and competitive differentiation in the market.




    Regionally, the Asia Pacific market dominates the Multi-OEM VRF Data Normalization sector, accounting for the largest share in 2024, driven by rapid urbanization, robust construction activity, and widespread adoption of VRF technology in commercial and residential buildings. North America and Europe follow closely, fueled by stringent energy efficiency standards, a mature building automation ecosystem, and strong investments in smart infrastructure. Latin America and the Middle East & Africa are also witnessing steady growth, underpinned by rising demand for modern HVAC solutions and increasing awareness about the benefits of data-driven facility management. The regional outlook remains highly positive, with each geography contributing uniquely to the global market’s upward trajectory.





    Component Analysis



    The Mul

  18. GSE65194 Data Normalization and Subtype Analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE65194 Data Normalization and Subtype Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse65194-data-normalization-and-subtype-analysis
    Explore at:
    zip(54989436 bytes)Available download formats
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Raw and preprocessed microarray expression data from the GSE65194 cohort.

    Includes samples from triple-negative breast cancer (TNBC), other breast cancer subtypes, and normal breast tissues.

    Expression profiles generated using the “Affymetrix Human Genome U133 Plus 2.0 Array (GPL570)” platform.

    Provides normalized gene expression values suitable for downstream analyses such as differential expression, subtype classification, and clustering.

    Supports the identification of differentially expressed genes (DEGs) between TNBC, non-TNBC subtypes, and normal tissue.

    Useful for transcriptomic analyses in breast cancer research, including subtype analysis, biomarker discovery, and comparative studies.

  19. Data from: Accurate normalization of real-time quantitative RT-PCR data by...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    • +1more
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes [Dataset]. https://catalog.data.gov/dataset/accurate-normalization-of-real-time-quantitative-rt-pcr-data-by-geometric-averaging-of-mul
    Explore at:
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Ten housekeeping genes from different abundance and functional classes were evaluated in various human tissues using real-time reverse transcription PCR. The conventional use of a single gene for normalization leads to relatively large errors in a significant proportion of the samples tested.
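The strategy in the dataset title, a normalization factor (NF) computed as the geometric mean of multiple internal control genes, can be sketched in a few lines; the gene names and quantities below are hypothetical.

```python
import math

# Hypothetical relative quantities of three internal control genes in one
# sample (e.g. 2^-dCt values); numbers are illustrative only.
controls = {"GAPDH": 1.20, "ACTB": 0.85, "B2M": 1.05}

# Normalization factor = geometric mean of the control-gene quantities,
# which dampens the influence of any single aberrant control.
nf = math.prod(controls.values()) ** (1 / len(controls))

# A target gene's quantity is then divided by NF to normalize it.
target_quantity = 2.40
normalized = target_quantity / nf
```

Using the geometric rather than the arithmetic mean keeps one outlying control gene from dominating the NF, which is why averaging several validated controls outperforms any single housekeeping gene.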

  20. The time-series gene expression data in PMA stimulated THP-1

    • datamed.org
    Cite
    The time-series gene expression data in PMA stimulated THP-1 [Dataset]. https://datamed.org/display-item.php?repository=0044&idName=ID&id=5841d9165152c649505fbb31
    Explore at:
    Description

    (1) qPCR Gene Expression Data

    The THP-1 cell line was sub-cloned and one clone (#5) was selected for its ability to differentiate relatively homogeneously in response to phorbol 12-myristate-13-acetate (PMA) (Sigma). THP-1.5 was used for all subsequent experiments. THP-1.5 cells were cultured in RPMI, 10% FBS, Penicillin/Streptomycin, 10mM HEPES, 1mM Sodium Pyruvate, 50uM 2-Mercaptoethanol. THP-1.5 cells were treated with 30ng/ml PMA over a time course of 96h. Total cell lysates were harvested in TRIzol reagent at 1, 2, 4, 6, 12, 24, 48, 72, and 96 hours, including an undifferentiated control. Undifferentiated cells were harvested in TRIzol reagent at the beginning of the LPS time course. One biological replicate was prepared for each time point. Total RNA was purified from TRIzol lysates according to the manufacturer's instructions.

    Gene-specific primer pairs were designed using Primer3 software, with an optimal primer size of 20 bases, amplicon size of 140bp, and annealing temperature of 60°C. Primer sequences were designed for 2,396 candidate genes, including four potential controls: GAPDH, beta actin (ACTB), beta-2-microglobulin (B2M), and phosphoglycerate kinase 1 (PGK1). The RNA samples were reverse transcribed to produce cDNA and then subjected to quantitative PCR with SYBR Green (Molecular Probes) on the ABI Prism 7900HT system (Applied Biosystems, Foster City, CA, USA) with a 384-well amplification plate; genes for each sample were assayed in triplicate. Reactions were carried out in 20μL volumes in 384-well plates; each reaction contained 0.5 U of HotStar Taq DNA polymerase (Qiagen) and the manufacturer's 1× amplification buffer adjusted to a final concentration of 1mM MgCl2, 160μM dNTPs, 1/38000 SYBR Green I (Molecular Probes), 7% DMSO, 0.4% ROX Reference Dye (Invitrogen), 300 nM of each primer (forward and reverse), and 2μL of 40-fold diluted first-strand cDNA synthesis reaction mixture (12.5ng total RNA equivalent). Polymerase activation at 95ºC for 15 min was followed by 40 cycles of 15 s at 94ºC, 30 s at 60ºC, and 30 s at 72ºC. Dissociation curve analysis, which verifies that each PCR product is amplified from a single cDNA, was carried out in accordance with the manufacturer's protocol. Expression levels were reported as Ct values.

    The large number of genes assayed and the replicate measures required that samples be distributed across multiple amplification plates, with an average of twelve plates per sample. Because it was envisioned that GAPDH would serve as a single-gene normalization control, this gene was included on each plate. All primer pairs were assayed in triplicate. Raw qPCR expression measures were quantified using Applied Biosystems SDS software and reported as Ct values. The Ct value represents the number of amplification cycles required for the fluorescence of a gene or primer pair to surpass an arbitrary threshold. The magnitude of the Ct value is inversely proportional to the expression level, so a gene expressed at a high level will have a low Ct value and vice versa.

    Replicate Ct values were combined by averaging, with additional quality-control constraints imposed by a standard filtering method developed by the RIKEN group for preprocessing of their qPCR data. Briefly, this method entails:

    1. Sort the triplicate Ct values in ascending order: Ct1, Ct2, Ct3. Calculate differences between consecutive Ct values: difference1 = Ct2 − Ct1 and difference2 = Ct3 − Ct2.
    2. Four regions are defined (where Region4 overrides the other regions): Region1: difference ≤ 0.2; Region2: 0.2 < difference ≤ 1.0; Region3: 1.0 < difference; Region4: one of the Ct values in the difference calculation is 40. If difference1 and difference2 fall in the same region, the three replicate Ct values are averaged to give a final representative measure. If difference1 and difference2 are in different regions, the two replicate Ct values in the smaller-numbered region are averaged instead.

    This particular filtering method is specific to the data set used here and is not part of the normalization procedure itself; alternate filtering methods can be applied, if appropriate, prior to normalization. Moreover, while this presentation uses Ct values as an example, any measure of transcript abundance, including measures corrected for primer efficiency, can be used as input to the data-driven methods.

    (2) Quantile Normalization Algorithm

    Quantile normalization proceeds in two stages. First, if samples are distributed across multiple plates, normalization is applied to all of the genes assayed for each sample to remove plate-to-plate effects by enforcing the same quantile distribution on each plate. Then an overall quantile normalization is applied between samples, ensuring that each sample has the same distribution of expression values as all of the other samples to be compared. A similar approach using quantile normalization has previously been described in the context of microarray normalization. Briefly, the method entails the following steps:

    i) qPCR data from a single RNA sample are stored in a matrix M of dimension k (maximum number of genes or primer pairs on a plate) rows by p (number of plates) columns. Plates with differing numbers of genes are made equivalent by padding plates with missing values to constrain M to a rectangular structure.
    ii) Each column is sorted into ascending order and stored in matrix M'. The sorted columns correspond to the quantile distribution of each plate. Missing values are placed at the end of each ordered column; all calculations in quantile normalization are performed on non-missing values.
    iii) The average quantile distribution is calculated by taking the average of each row in M'. Each column in M' is replaced by this average quantile distribution and rearranged to have the same ordering as the original row order in M. This gives the within-sample normalized data from one RNA sample.
    iv) Steps analogous to i–iii are repeated for each sample. Between-sample normalization is performed by storing the within-normalized data as a new matrix N of dimension k (total number of genes, in our example k = 2,396) rows by n (number of samples) columns. Steps ii and iii are then applied to this matrix.

    (3) Rank-Invariant Set Normalization Algorithm

    This is an extension of the rank-invariant method for use on qPCR data with any number of experimental conditions or samples, in which a set of stably-expressed genes is identified from within the measured expression data and then used to adjust expression between samples. Briefly:

    i) qPCR data from all samples are stored in matrix R of dimension g (total number of genes or primer pairs used for all plates) rows by s (total number of samples) columns.
    ii) Gene sets are first selected that are rank-invariant across a single sample compared to a common reference. The reference may be chosen in a variety of ways, depending on the experimental design and aims of the experiment. As described in Tseng et al., the reference may be a particular sample from the experiment (e.g. time zero in a time-course experiment), the average or median of all samples, or the sample closest to the average or median of all samples. Genes are considered rank-invariant if they retain their ordering, or rank, with respect to expression in the experimental sample versus the common reference sample. Sets of rank-invariant genes are collected for all of the s pairwise comparisons relative to the common reference, and the intersection of all s sets gives the final set of rank-invariant genes used for normalization.
    iii) Let αj represent the average expression value of the rank-invariant genes in sample j; (α1, …, αs) then represents the vector of rank-invariant average expression values for all conditions 1 to s.
    iv) We calculate the scale f
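The core sort/average/restore step of the quantile normalization described above (steps ii and iii) can be sketched as follows; this is a minimal illustration assuming a complete matrix with no missing values, so the padding and missing-value handling from step i are omitted.

```python
import numpy as np

def quantile_normalize(matrix: np.ndarray) -> np.ndarray:
    """Force every column (plate or sample) to share the same quantile distribution."""
    order = np.argsort(matrix, axis=0)                 # step ii: sort each column
    sorted_cols = np.take_along_axis(matrix, order, axis=0)
    mean_quantiles = sorted_cols.mean(axis=1)          # step iii: average each row
    # Put the averaged quantiles back in each column's original order.
    ranks = np.argsort(order, axis=0)
    return mean_quantiles[ranks]

# Toy Ct-like matrix: 4 genes x 3 samples with sample-specific shifts.
ct = np.array([[20.0, 21.0, 19.5],
               [25.0, 26.0, 24.5],
               [30.0, 31.0, 29.5],
               [35.0, 36.0, 34.5]])
normalized = quantile_normalize(ct)
```

Applying this first within each sample (plates as columns) and then across samples (samples as columns) reproduces the two-stage scheme: after normalization every column has exactly the same distribution of values, differing only in which gene occupies which rank.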
