Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method evaluated as the best by one metric is evaluated as the poorest by another, or a method evaluated as the best on one dataset is evaluated as the poorest on another. This raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). We then designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it, together with another metric, mSCC, to evaluate 14 commonly used normalization methods using both scRNA-seq and bulk RNA-seq data, satisfying both the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best method for normalizing their gene expression data, based on the evaluation of different methods (particularly data-driven methods or their own methods) under the principles of the consistency of metrics and the consistency of datasets.
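A minimal numpy sketch of the AUCVC idea described above (the function name and threshold grid are our own choices, not part of NormExpression): for each coefficient-of-variation threshold, count the fraction of genes whose CV falls below it, then integrate that curve.

```python
import numpy as np

def aucvc(expr, thresholds=np.linspace(0, 1, 101)):
    """Area Under the normalized CV threshold Curve, sketched.

    expr: genes x samples matrix of normalized expression values.
    A better normalization yields more low-CV (stable) genes and
    hence a larger area under the curve.
    """
    mean = expr.mean(axis=1)
    sd = expr.std(axis=1)
    cv = np.divide(sd, mean, out=np.zeros_like(sd), where=mean > 0)
    curve = [(cv < t).mean() for t in thresholds]
    return np.trapz(curve, thresholds)
```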
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The technological advances in mass spectrometry allow us to collect more comprehensive data of higher quality at increasing speed. With the rapidly increasing amount of data generated, the need to streamline analyses becomes more apparent. Proteomics data is known to be often affected by systemic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed “proteiNorm”. The current implementation of proteiNorm accommodates preliminary filters on the peptide and sample levels, followed by an evaluation of several popular normalization methods and visualization of missing values. The user then selects an adequate normalization method and one of several imputation methods for the subsequent comparison of different differential expression methods and estimation of statistical power. The application of proteiNorm and the interpretation of its results are demonstrated on two tandem mass tag multiplex (TMT6plex and TMT10plex) and one label-free spike-in mass spectrometry example data sets. The three data sets reveal how the normalization methods perform differently on different experimental designs and show the need to evaluate normalization methods for each mass spectrometry experiment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential expression analysis.
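proteiNorm itself is an R/Shiny application; as a language-agnostic illustration of how such a comparison can be scored, the Python sketch below centers each sample on the global median and computes a pooled within-group CV (the scoring choice and the function names are our own assumptions, not proteiNorm's API):

```python
import numpy as np

def median_normalize(log_int):
    """Center each sample (column) on the global median intensity."""
    col_med = np.nanmedian(log_int, axis=0)
    return log_int - col_med + np.nanmedian(log_int)

def pooled_cv(log_int, groups):
    """Mean within-group coefficient of variation; lower is better.

    log_int: peptides x samples matrix of (positive) log2 intensities;
    groups: per-sample group labels, e.g. ["ctrl", "ctrl", "treat", ...].
    """
    scores = []
    for g in set(groups):
        cols = [i for i, lab in enumerate(groups) if lab == g]
        sub = log_int[:, cols]
        cv = np.nanstd(sub, axis=1) / np.nanmean(sub, axis=1)
        scores.append(np.nanmean(np.abs(cv)))
    return float(np.mean(scores))
```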
Background
The Infinium EPIC array measures the methylation status of more than 850,000 CpG sites. The EPIC BeadChip uses two probe designs: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics that may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe-type bias as well as other issues such as background and dye bias.
Methods
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson's correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2-normalized data.
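As an illustration, a minimal sketch of the first metric (mean absolute beta-value difference between replicate pairs), assuming a probes-by-samples beta matrix; the function name is hypothetical:

```python
import numpy as np

def mean_abs_beta_diff(beta, pairs):
    """Mean absolute beta-value difference across replicate pairs.

    beta: probes x samples array of beta values in [0, 1];
    pairs: list of (i, j) column indices of replicate pairs.
    Lower values indicate better technical reproducibility.
    """
    diffs = [np.nanmean(np.abs(beta[:, i] - beta[:, j])) for i, j in pairs]
    return float(np.mean(diffs))
```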
Results
The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the b...
Study Participants and Samples
The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a census-based cohort of elderly adults from the city of São Paulo, Brazil, followed up every five years since 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points, for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012; the second time point was set in 2020, in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.
All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University o... We provide data in an Excel file, with absolute differences in beta values between replicate samples for each probe given in different tabs for raw data and for the different normalization methods.
Agreement among different normalization methods.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zipped file contains all public datasets used in our benchmark of bulk ATAC-seq normalization methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Based on open-access data, 79 Mediterranean passenger ports are analyzed to compare their infrastructure, hinterland accessibility, and offered multi-modality categories. A comparative geo-spatial analysis is also carried out using data normalization in order to visualize the ports' performance on maps. These data-driven, comprehensive analytical results can add value to sustainable development policy and planning initiatives in the Mediterranean region. The analyzed elements can also contribute to the development of passenger port performance indicators. The empirical research methods used for the Mediterranean passenger ports can be replicated for transport nodes of any region around the world to determine their relative performance on selected criteria for improvement and planning.
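The entry does not name the exact normalization formula used for the maps; one common choice for putting heterogeneous port indicators on a comparable scale is min-max rescaling, sketched below (the berth counts are made-up example values):

```python
def min_max(values):
    """Rescale indicator values to [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# e.g., hypothetical berth counts for three ports:
print(min_max([4, 10, 28]))  # [0.0, 0.25, 1.0]
```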
The Mediterranean passenger ports were initially categorized into cruise and ferry ports. The cruise ports were identified from the member list of the Association for the Mediterranean Cruise Ports (MedCruise), representing more than 80% of cruise tourism activities per country. The identified cruise ports were mapped by selecting the corresponding geo-referenced ports from the map layer developed by the European Marine Observation and Data Network (EMODnet). The United Nations (UN) Code for Trade and Transport Locations (LOCODE) was identified for each of the cruise ports as the common criterion for carrying out the selection. The identified cruise ports not listed by EMODnet were added to the geo-database by using, under license, the editing function of the ArcMap (version 10.1) geographic information system software. The ferry ports were identified from the open-access industry data provided by Ferrylines and were mapped in the same way as the cruise ports (Figure 1).
Based on the available data from the identified cruise ports, a database (see Tables A1–A3) was created for a Mediterranean-scale analysis. The ferry ports were excluded due to the unavailability of relevant information on the selected criteria (Table 2). However, cruise ports also serving as ferry passenger ports were identified in order to maximize the scope of the analysis. Port infrastructure and hinterland accessibility data were collected from the statistical reports published by MedCruise, which are a compilation of data provided by its individual member port authorities and cruise terminal operators. Other supplementary sources were the European Sea Ports Organization (ESPO) and Global Ports Holding, a cruise terminal operator with an established presence in the Mediterranean. Additionally, open-access data sources (e.g., Google Maps and TripAdvisor) were consulted to identify multi-modal transport options and to bridge the data gaps on hinterland accessibility by measuring approximate distances.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Normalization of RNA-Seq data has proven essential to ensure accurate inferences and replication of findings. Hence, various normalization methods have been proposed for the technical artifacts that can be present in high-throughput sequencing transcriptomic studies. In this study, we set out to compare the widely used library-size normalization methods (UQ, TMM, and RLE) and across-sample normalization methods (SVA, RUV, and PCA) for RNA-Seq data, using publicly available data from The Cancer Genome Atlas (TCGA) cervical cancer study. Additionally, an extensive simulation study was completed to compare the performance of the across-sample normalization methods in estimating technical artifacts. Lastly, we investigated the reduction in degrees of freedom in the normalized data and its impact on downstream differential expression analysis results. Based on this study, the TMM and RLE library-size normalization methods give similar results for the CESC dataset. In addition, the simulated datasets show that the SVA ("BE") method outperforms the other methods (SVA "Leek", PCA) by correctly estimating the number of latent artifacts. Moreover, ignoring the loss of degrees of freedom due to normalization inflates type I error rates. We recommend adjusting not only for library-size differences but also assessing known and unknown technical artifacts in the data and, if needed, completing across-sample normalization. In addition, we suggest including the known and estimated latent artifacts in the design matrix to correctly account for the loss in degrees of freedom, as opposed to completing the analysis on the post-processed normalized data.
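As a concrete reference for one of the library-size methods compared above, the snippet below sketches RLE-style (median-of-ratios) size factors in numpy; it follows the standard DESeq-style definition but is our own minimal rendition, not the code used in the study:

```python
import numpy as np

def rle_size_factors(counts):
    """RLE (median-of-ratios) size factors.

    counts: genes x samples array of raw counts. The reference is the
    per-gene log geometric mean; each sample's factor is the median
    ratio of its counts to that reference, over genes with no zeros.
    """
    with np.errstate(divide="ignore"):
        logc = np.log(counts)
    log_geo_means = logc.mean(axis=1)
    usable = np.isfinite(log_geo_means)  # drops genes with any zero count
    return np.exp(np.median(logc[usable] - log_geo_means[usable, None], axis=0))
```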
Label-free quantitative mass spectrometry methods, in particular the SWATH-MS approach, have gained popularity and become a powerful technique for the comparison of large datasets. In the present work, we introduce the use of recombinant proteins as internal standards for untargeted label-free methods. The proposed internal standard strategy shows an intragroup normalization capacity similar to that of the most common normalization methods, with the additional advantage of maintaining the overall proteome changes between groups (which are lost using the methods referred to above). It thus maintains good performance even when large qualitative and quantitative differences in sample composition are observed, such as those induced by biological regulation (as observed in secretome and other biofluid analyses) or by enrichment approaches (such as immunopurifications). Moreover, it is a cost-effective alternative that is easier to implement than current stable-isotope-labeled internal standards, making it an appealing strategy for large quantitative screens, such as clinical cohorts for biomarker discovery.
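A minimal sketch of the internal-standard scaling described above, assuming the recombinant standards occupy known rows of the intensity matrix (function and variable names are ours):

```python
import numpy as np

def internal_standard_normalize(intensities, standard_rows):
    """Scale each sample by its recombinant internal-standard signal.

    intensities: proteins x samples matrix; standard_rows: row indices
    of the spiked recombinant proteins. Unlike global median scaling,
    this preserves genuine overall proteome differences between groups.
    """
    signal = np.nanmean(intensities[standard_rows, :], axis=0)
    return intensities * (np.nanmean(signal) / signal)
```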
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The objective of the study was to introduce a normalization algorithm that highlights short-term, localized, non-periodic fluctuations in hyper-temporal satellite data by dividing each pixel by the mean value of its spatial neighbourhood set. The algorithm was designed to suppress signal patterns that are common to the central and surrounding pixels, utilizing spatial and temporal information at different scales. Two folders ('Normalized_different_framesizes' and 'Retrieval_different_anomalies') are too large for upload and will be sent separately via SURF Filesender.
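A minimal sketch of the core operation, assuming a (time, rows, cols) stack of strictly positive values and a square window that includes the central pixel (the original algorithm's exact neighbourhood definition and frame sizes live in the folders named above):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def neighborhood_normalize(stack, size=11):
    """Divide each pixel by the mean of its spatial neighbourhood.

    stack: (time, rows, cols) hyper-temporal image stack; size: square
    neighbourhood width in pixels. Signal common to a pixel and its
    surroundings cancels out, highlighting short-term, localized,
    non-periodic fluctuations.
    """
    neigh_mean = uniform_filter(stack.astype(float), size=(1, size, size))
    return stack / neigh_mean
```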
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Real-time functional magnetic resonance imaging (rtfMRI) is a recently emerged technique that demands fast data processing within a single repetition time (TR), such as a TR of 2 seconds. Data preprocessing in rtfMRI has rarely involved spatial normalization, which cannot be accomplished in a short time period. However, spatial normalization may be critical for accurate functional localization in a stereotactic space and is an essential procedure for some emerging applications of rtfMRI. In this study, we introduced an online spatial normalization method that adopts a novel affine registration (AFR) procedure, based on principal axes registration (PA) and Gauss-Newton optimization (GN) with a self-adaptive β parameter, termed PA-GN(β) AFR, together with nonlinear registration (NLR) based on the discrete cosine transform (DCT). In AFR, PA provides an appropriate initial estimate for GN to induce its rapid convergence. In addition, the β parameter, which relies on the rate of change of the cost function, is employed to self-adaptively adjust the iteration step of GN. The accuracy and performance of PA-GN(β) AFR were confirmed using both simulated and real data and compared with traditional AFR. The appropriate cutoff frequency of the DCT basis functions in NLR was determined to balance the accuracy and computational load of the online spatial normalization. Finally, the validity of the online spatial normalization method was further demonstrated by the brain activation found in the rtfMRI data.
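The self-adaptive β step follows the authors' formulation and is not reproduced here; the sketch below covers only the generic principal-axes part that supplies the initial affine estimate (intensity-weighted centroid plus covariance eigenvectors; the function name is ours):

```python
import numpy as np

def principal_axes(volume):
    """Intensity-weighted centroid and principal axes of a volume.

    Aligning the centroids and principal axes of two images yields an
    initial affine estimate, giving Gauss-Newton a good starting point
    so that it converges in few iterations.
    """
    coords = np.argwhere(volume > 0).astype(float)
    weights = volume[volume > 0].astype(float)
    centroid = (coords * weights[:, None]).sum(axis=0) / weights.sum()
    centered = coords - centroid
    cov = (centered * weights[:, None]).T @ centered / weights.sum()
    _, eigvecs = np.linalg.eigh(cov)
    return centroid, eigvecs  # columns are the principal axes
```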
Label-free quantification is a powerful method for studying cellular protein phosphorylation dynamics. However, whether current data normalization methods achieve sufficient accuracy has not been examined systematically. Here, we demonstrate that a large uni-directional shift in the phosphopeptide abundance distribution is problematic for global median centering and quantile-based normalizations and may mislead the biological conclusions drawn from unlabeled phosphoproteome data. Instead, we present a novel normalization strategy, named pairwise normalization, which is based on adjusting phosphopeptide abundances measured before and after the enrichment. The superior performance of pairwise normalization was validated by statistical methods, western blotting, and bioinformatics analysis. In addition, we demonstrate that the choice of normalization method influences the downstream analyses of the data and the perceived pathway activities. Furthermore, we demonstrate that the developed normalization method, combined with pathway analysis algorithms, revealed a novel biological synergism between Ras signalling and PP2A inhibition by CIP2A.
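As we read the description above, pairwise normalization keys on peptides quantified both before and after the enrichment; the sketch below is our own minimal reading in Python, not the authors' implementation:

```python
import numpy as np

def pairwise_factors(pre, post):
    """Per-sample correction factors from pre/post-enrichment pairs.

    pre, post: peptides x samples log2-intensity matrices for the same
    samples, rows matched to the same peptide sequences. The median
    enrichment shift per sample, centered across samples, serves as a
    correction that is robust to a uni-directional abundance shift.
    """
    shift = np.nanmedian(post - pre, axis=0)
    return shift - np.mean(shift)

# corrected = post - pairwise_factors(pre, post)  (in log2 space)
```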
Normalization of RNA-sequencing data is essential for accurate downstream inference, but the assumptions upon which most methods are based do not hold in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of scRNA-seq data. A total of 183 single cells (92 H1 cells, 91 H9 cells), each sequenced twice, were used to evaluate SCnorm in normalizing single-cell RNA-seq experiments. A total of 48 bulk H1 samples were used to compare bulk and single-cell properties. For single-cell RNA-seq, the identical single-cell indexed and fragmented cDNA were pooled at 96 cells per lane or at 24 cells per lane to test the effects of sequencing depth, resulting in approximately 1 million and 4 million mapped reads per cell in the two pooling groups, respectively.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is vital to single-cell sequencing, addressing limitations presented by low input material and various forms of bias or noise present in the sequencing process. Several such normalization methods exist, some of which rely on spike-in genes, molecules added in known quantities to serve as a basis for a normalization model. Depending on the available information and the type of data, some methods may offer certain advantages over others. We compare the effectiveness of seven available normalization methods designed specifically for single-cell sequencing, using two real data sets containing spike-in genes and one simulation study. Additionally, we test the methods not dependent on spike-in genes using a real data set with three distinct cell-cycle states and a real data set from the 10X Genomics GemCode platform with multiple cell types represented. We demonstrate the differences in effectiveness for the featured methods using visualization and classification assessment, and we conclude which methods are preferable for normalizing a certain type of data for further downstream analysis, such as classification or differential analysis. The comparison in computational time for all methods is addressed as well.
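For the spike-in-dependent methods in this comparison, the shared core is a per-cell size factor computed from spike-in counts alone; a minimal numpy sketch (function name ours, not any particular package's API):

```python
import numpy as np

def spikein_size_factors(counts, spike_rows):
    """Per-cell scaling factors from spike-in counts only.

    counts: genes x cells count matrix; spike_rows: row indices of the
    spike-in genes. Because spike-ins are added in equal known amounts
    to every cell, their per-cell totals estimate capture and
    sequencing efficiency.
    """
    totals = counts[spike_rows, :].sum(axis=0).astype(float)
    return totals / np.median(totals)

# normalized = counts / spikein_size_factors(counts, spike_rows)
```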
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides a simulated retail data warehouse designed using star schema modeling principles.
It includes both normalized and denormalized versions of a retail sales star schema, making it a valuable resource for data engineers, analysts, and data warehouse enthusiasts who want to explore real-world scenarios, performance tuning, and modeling strategies.
This dataset has two fact tables, a normalized and a denormalized version of the sales fact:
- fact_sales_normalized.csv – contains keys and measures only; no columns from the dim_* tables have been denormalised into it.
- a denormalized counterpart, in which the dim_* attributes are folded into the fact rows.
(Image: Normalized Retail Star Schema)
However, the dim_* tables stay the same for both, as follows:
- Dim_Customers.csv
- Dim_Products.csv
- Dim_Stores.csv
- Dim_Dates.csv
- Dim_Salesperson
- Dim_Campaign
Explore how denormalization affects storage, redundancy, and performance
All data is synthetic and randomly generated via Python scripts that use the polars library for data manipulation; no real customer or business data is included.
Ideal for use with tools like SQL engines, Redshift, BigQuery, Snowflake, or even DuckDB.
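For example, a short polars session can rebuild a denormalized slice from the normalized fact table (the join keys customer_id and product_id are assumptions, not documented above):

```python
import polars as pl

fact = pl.read_csv("fact_sales_normalized.csv")
customers = pl.read_csv("Dim_Customers.csv")
products = pl.read_csv("Dim_Products.csv")

# Fold dimension attributes back into the fact rows.
denorm = (
    fact
    .join(customers, on="customer_id", how="left")
    .join(products, on="product_id", how="left")
)
print(denorm.head())
```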
Shrinivas Vishnupurikar, Data Engineer @Velotio Technologies.
This dataset extends NAMEXACT by including words that can be used as names, but may not exclusively be used as names in every context.
Dataset Details
Dataset Description
Unlike NAMEXACT, this dataset contains words that are mostly used as names but may also be used in other contexts, such as:
- Christian (believer in Christianity)
- Drew (simple past of the verb to draw)
- Florence (an Italian city)
- Henry (the SI unit of inductance)
- Mercedes (a car brand)
In addition, names with ambiguous gender are included, once for each gender. For instance, Skyler is included as a female (F) name with a probability of 37.3%, and as a male (M) name with a probability of 62.7%.
Dataset Sources
Repository: github.com/aieng-lab/gradiend
Original Dataset: Gender by Name
Dataset Structure
- name: the name
- gender: the gender of the name (M for male and F for female)
- count: the count value of this name (raw value from the original dataset)
- probability: the probability of this name (raw value from the original dataset; not normalized to this dataset!)
- gender_agreement: a value describing the certainty that this name has an unambiguous gender, computed as the maximum probability of that name across both genders, e.g., $\max(37.3\%, 62.7\%) = 62.7\%$ for Skyler. For names with a unique gender in this dataset, this value is 1.0
- primary_gender: equal to gender for names with a unique gender in this dataset; otherwise, the gender of that name with the higher probability
- genders: label B if both genders are contained for this name in this dataset, otherwise equal to gender
- prob_F: the probability of that name being used as a female name (i.e., 0.0 or 1.0 if genders != B)
- prob_M: the probability of that name being used as a male name
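A small pandas illustration of how gender_agreement and primary_gender relate to the per-gender rows (toy numbers; column names follow the list above):

```python
import pandas as pd

# Toy rows mirroring the schema; probabilities within a name sum to 1.
df = pd.DataFrame({
    "name":        ["Skyler", "Skyler", "Mary"],
    "gender":      ["F",      "M",      "F"],
    "probability": [0.373,    0.627,    1.0],
})

# gender_agreement: maximum probability of the name across both genders.
df["gender_agreement"] = df.groupby("name")["probability"].transform("max")

# primary_gender: the gender of the row holding that maximum.
idx = df.groupby("name")["probability"].idxmax()
primary = df.loc[idx].set_index("name")["gender"]
df["primary_gender"] = df["name"].map(primary)
print(df)  # Skyler -> agreement 0.627, primary M; Mary -> 1.0, F
```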
Dataset Creation
Source Data
The data is created by filtering Gender by Name.
Data Collection and Processing
The original data is filtered to contain only names with a count of at least 100, to remove very rare names. This threshold reduces the total number of names by about 72%, from 133,910 to 37,425.
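The filtering step itself is a one-liner; a sketch with a hypothetical filename and column name for the source data:

```python
import pandas as pd

raw = pd.read_csv("name_gender_dataset.csv")  # hypothetical filename
filtered = raw[raw["Count"] >= 100]           # keep names seen >= 100 times
print(len(raw), "->", len(filtered))          # roughly 133910 -> 37425
```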
Bias, Risks, and Limitations
The original dataset provides counts of names (with their gender) for male and female babies, drawn from open-source government authorities in the US (1880-2019), the UK (2011-2018), Canada (2011-2018), and Australia (1944-2019).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Surface reflectance is a critical physical variable that affects the energy budget in land-atmosphere interactions, feature recognition and classification, and climate change research. This dataset uses the relative radiometric normalization method, taking the Landsat-8 Operational Land Imager (OLI) surface reflectance products as the reference images to normalize the cloud-free GF-1 satellite WFV sensor images of Shandong Province in 2018. Relative radiometric normalization processing mainly includes atmospheric correction, image resampling, image registration, masking, extraction of no-change pixels, and calculation of normalization coefficients. After relative radiometric normalization, for the no-change pixels of each GF-1 WFV image and its reference image, R² is above 0.7295 and RMSE is below 0.0172. The surface reflectance accuracy of the GF-1 WFV images is improved, so they can be used together with Landsat data to provide data support for quantitative remote sensing inversion. This dataset is in GeoTIFF format, and the spatial resolution of the images is 16 m.
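The normalization coefficients amount to a per-image linear fit over the no-change pixels; a minimal numpy sketch (function name ours) that also reports the R² and RMSE figures quoted above:

```python
import numpy as np

def normalization_coefficients(target, reference):
    """Linear gain/offset from no-change pixels (least squares).

    target: GF-1 WFV reflectances at the no-change pixels;
    reference: Landsat-8 OLI reflectances at the same pixels.
    Returns (gain, offset, r2, rmse) such that
    gain * target + offset approximates reference.
    """
    gain, offset = np.polyfit(target, reference, 1)
    pred = gain * target + offset
    resid = reference - pred
    r2 = 1 - np.sum(resid ** 2) / np.sum((reference - reference.mean()) ** 2)
    rmse = np.sqrt(np.mean(resid ** 2))
    return gain, offset, r2, rmse
```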
The goal of this study was to investigate the generation and normalization of microRNA expression data based on microarray technologies, in order to comparatively assess their quality. Two profiling platforms were compared: the single-channel Affymetrix GeneChip® and the dual-channel Exiqon miRCURY LNA™, which was processed as a single-channel array. Due to fundamental differences in platform constitution, the normalization methods developed for gene expression need to be applied very cautiously to microRNA raw data. This motivated the development of a novel normalization method, based on controllable assumptions, that uses the intensities of spike-in control probes. The results showed that the novel normalization method reduced the data variability in the most consistent way, and an RT-qPCR experiment performed for a subset of microRNAs confirmed the reliability of the differentially expressed microRNAs obtained. The conclusion was that the Exiqon platform, combined with the novel spike-in-controls-based normalization method, provides high-quality microRNA expression data suitable for reliable downstream analysis. This preprocessing pipeline was implemented in an R package called ExiMiR and deposited in the Bioconductor repository. Data generated in a sister experiment on the Affymetrix platform has been submitted to ArrayExpress under accession E-MTAB-875.
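ExiMiR's actual algorithm is documented in the package itself; the snippet below only illustrates its underlying assumption, namely that spike-in probes added in equal amounts should land at the same level on every array (function and variable names are ours):

```python
import numpy as np

def spike_in_scale(raw, spike_rows):
    """Scale each array so its spike-in probes share a common median.

    raw: probes x arrays intensity matrix; spike_rows: row indices of
    the spike-in control probes added in known, equal amounts.
    """
    spike_med = np.median(raw[spike_rows, :], axis=0)
    return raw * (np.median(spike_med) / spike_med)
```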
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We developed a normalization method utilizing the expression levels of a panel of endogenous proteins as normalization standards (EPNS herein). We tested the validity of the method using two sets of tandem mass tag (TMT)-labeled data and found that this normalization method effectively reduced global intensity bias at the protein level. The coefficient of variation (CV) of the overall median was reduced by 55% and 82% on average, compared to reductions of 72% and 86% after normalization using the upper quartile. Furthermore, we used differential protein expression analysis and statistical learning to identify biomarkers for colorectal cancer from a CPTAC data set. A panel of proteins including NUP205, GTPBP4, CNN2, GNL3, and S100A11 showed expression changes that correlate highly with colorectal cancer. Applying these five proteins as model features, random forest modeling obtained prediction results with a maximum AUC of 0.9998 using EPNS-normalized data, comparing favorably to the AUC of 0.9739 using the raw data. Thus, the normalization method based on EPNS reduces global intensity bias and is applicable for quantitative proteomic analysis.
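A minimal sketch of the EPNS idea, assuming log2 intensities and a known panel of endogenous standard proteins (the names and the log2 convention are our assumptions, not the authors' code):

```python
import numpy as np

def epns_normalize(log_int, panel_rows):
    """Center each sample using a panel of endogenous proteins (EPNS).

    log_int: proteins x samples log2 intensities; panel_rows: row
    indices of the endogenous normalization panel.
    """
    panel_med = np.nanmedian(log_int[panel_rows, :], axis=0)
    return log_int - panel_med + np.nanmean(panel_med)

def cv_of_overall_medians(log_int):
    """CV of per-sample overall medians, the bias score quoted above."""
    med = np.nanmedian(log_int, axis=0)
    return float(np.nanstd(med) / np.nanmean(med))
```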