50 datasets found
  1. Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data...

    • frontiersin.figshare.com
    application/cdfv2
    Updated Jun 1, 2023
    + more versions
    Cite
    Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao (2023). Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.doc [Dataset]. http://doi.org/10.3389/fgene.2019.00400.s001
    Explore at:
    Available download formats: application/cdfv2
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method evaluated as the best by one metric is evaluated as the poorest by another, or a method evaluated as the best on one dataset is evaluated as the poorest on another. This raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq or microarray data (the consistency of datasets). We then designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it, together with another metric, mSCC, to evaluate 14 commonly used normalization methods on both scRNA-seq and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study are included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best method for normalizing their gene expression data, based on the evaluation of different methods (particularly data-driven methods or their own methods) under the principles of metric consistency and dataset consistency.
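
    A minimal sketch of the CV-threshold idea behind AUCVC, assuming a genes-by-cells count matrix counts and a vector of per-cell scaling factors sf (both hypothetical names; the packaged implementation ships in NormExpression):

      # Sketch of the AUCVC idea: after normalizing, accumulate the fraction of
      # genes whose coefficient of variation (CV) falls under each threshold,
      # then take the area under that curve. A larger area means more genes
      # are stabilized by the normalization.
      aucvc <- function(counts, sf, thresholds = seq(0.1, 1, by = 0.01)) {
        normed <- sweep(counts, 2, sf, "/")                # apply the normalization
        cv     <- apply(normed, 1, sd) / rowMeans(normed)  # per-gene CV
        frac   <- sapply(thresholds, function(t) mean(cv < t, na.rm = TRUE))
        sum(frac) / length(thresholds)                     # crude area approximation
      }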

  2. Methods for normalizing microbiome data: an ecological perspective

    • datadryad.org
    • data.niaid.nih.gov
    • +2more
    zip
    Updated Oct 30, 2018
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 30, 2018
    Dataset provided by
    Dryad
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    Time period covered
    2018
    Description

    Simulation script 1: this R script simulates two populations of microbiome samples and compares normalization methods.
    Simulation script 2: this R script simulates two populations of microbiome samples and compares normalization methods via PCoAs.
    Sample.OTU.distribution: the OTU distribution used in the paper "Methods for normalizing microbiome data: an ecological perspective".
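
    A toy version of what such simulation scripts do, under stated assumptions (total-sum scaling stands in for the normalization methods compared; all names are illustrative):

      # Simulate two microbiome populations with unequal library sizes, then
      # normalize by total-sum scaling (TSS) before computing an ordination.
      set.seed(1)
      n_otu  <- 200
      prob_a <- rexp(n_otu)                        # OTU profile, population A
      prob_b <- prob_a * exp(rnorm(n_otu, 0, 1))   # shifted profile, population B
      simulate <- function(prob, n = 20)
        t(sapply(seq_len(n), function(i)
          rmultinom(1, size = sample(5000:50000, 1), prob = prob)[, 1]))
      otu  <- rbind(simulate(prob_a), simulate(prob_b))
      tss  <- otu / rowSums(otu)                   # total-sum scaling
      pcoa <- cmdscale(dist(tss), k = 2)           # PCoA-style ordination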

  3. Data from: A systematic evaluation of normalization methods and probe...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    University of Toronto
    Universidade de São Paulo
    Hospital for Sick Children
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which consists of the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson's correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1 and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points, for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer's recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point) due to discontinuation of the equipment, but using the same commercial reagents. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing data are also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000 ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the out-of-band probes' empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the probes flagged in both analyses were combined and removed from the data.
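
    A hedged sketch of the pOOBAH step just described, assuming the current sesame API (searchIDATprefixes() and readIDATpair() to build SigDFs; the IDAT directory name is a placeholder):

      library(sesame)
      # Build one SigDF per sample from the IDAT files in "idat_dir".
      sdfs <- lapply(searchIDATprefixes("idat_dir"), readIDATpair)
      # Mask probes failing the out-of-band detection p-value threshold (0.05 by
      # default), pooling negative controls as in the text (combine.neg = TRUE).
      sdfs_masked <- lapply(sdfs, pOOBAH, combine.neg = TRUE)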

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi's read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi's preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi's Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi's RG Channel Sets. In the first, which we call "SeSAMe 1", SeSAMe's pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call "SeSAMe 2", pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass the previous QC and had not already been removed by pOOBAH; SeSAMe 2 therefore has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
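
    A minimal sketch of one branch of this comparison (Noob via minfi, then absolute beta differences for a replicate pair), using functions named above; the IDAT directory and replicate sample names are placeholders:

      library(minfi)
      rgSet <- read.metharray.exp(base = "idat_dir")  # RG Channel Set from IDATs
      beta  <- getBeta(preprocessNoob(rgSet))         # Noob background/dye correction
      # Per-probe absolute beta difference for one replicate pair ("rep1" and
      # "rep2" are hypothetical sample names); smaller is better, per method.
      abs_diff <- abs(beta[, "rep1"] - beta[, "rep2"])
      summary(abs_diff)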

  4. R script to reproduce "Improved normalization of species count data in...

    • search.dataone.org
    Updated Mar 21, 2025
    + more versions
    Cite
    BonaRes Repository (2025). R script to reproduce "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities" [Dataset]. https://search.dataone.org/view/sha256%3Aa934b23425b0e7e7d9d4278f89745fc842e75fdfe8b47de25c797034dadc1f51
    Explore at:
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    BonaRes Repository
    Description

    R script to reproduce "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities".

  5. Additional file 3: of DBNorm: normalizing high-density oligonucleotide...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy (2023). Additional file 3: of DBNorm: normalizing high-density oligonucleotide microarray data based on distributions [Dataset]. http://doi.org/10.6084/m9.figshare.5648932.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DBNorm test script. Code showing how we test the DBNorm package. (TXT 2 kb)

  6. GC/MS Simulated Data Sets normalized using Batch Normalizer

    • search.dataone.org
    Updated Nov 21, 2023
    + more versions
    Cite
    Scholtens, Denise (2023). GC/MS Simulated Data Sets normalized using Batch Normalizer [Dataset]. http://doi.org/10.7910/DVN/NMCKWV
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Scholtens, Denise
    Description

    1000 simulated data sets stored in a list of R data frames, used in support of Reisetter et al. (submitted), 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are results after normalization using Batch Normalizer (Wang et al. 2012).

  7. GC/MS Simulated Data Sets normalized using mean centering

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    + more versions
    Cite
    Scholtens, Denise (2023). GC/MS Simulated Data Sets normalized using mean centering [Dataset]. http://doi.org/10.7910/DVN/UYO4YF
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Scholtens, Denise
    Description

    1000 simulated data sets stored in a list of R data frames, used in support of Reisetter et al. (submitted), 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are results after normalization using mean centering, as described in Reisetter et al.
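
    A sketch of per-batch mean centering, one common reading of the method (mat is a samples x metabolites matrix and batch a label per sample; both are hypothetical names, and Reisetter et al. is authoritative for the exact procedure):

      center_by_batch <- function(mat, batch) {
        centered <- mat
        for (b in unique(batch)) {
          idx <- batch == b
          # subtract each metabolite's within-batch mean
          centered[idx, ] <- sweep(mat[idx, , drop = FALSE], 2,
                                   colMeans(mat[idx, , drop = FALSE]), "-")
        }
        centered
      }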

  8. Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust...

    • data.niaid.nih.gov
    Updated May 15, 2019
    + more versions
    Cite
    Bacher R; Chu L; Kendziorski C; Swanson S (2019). Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust normalization of single-cell rna-seq data [Dataset]. https://data.niaid.nih.gov/resources?id=gse85917
    Explore at:
    Dataset updated
    May 15, 2019
    Dataset provided by
    University of Florida
    Authors
    Bacher R; Chu L; Kendziorski C; Swanson S
    Description

    Normalization of RNA-sequencing data is essential for accurate downstream inference, but the assumptions upon which most methods are based do not hold in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of scRNA-seq data. A total of 183 single cells (92 H1 cells, 91 H9 cells), sequenced twice, were used to evaluate SCnorm in normalizing single-cell RNA-seq experiments. A total of 48 bulk H1 samples were used to compare bulk and single-cell properties. For single-cell RNA-seq, the identical single-cell indexed and fragmented cDNA were pooled at 96 cells per lane or at 24 cells per lane to test the effects of sequencing depth, resulting in approximately 1 million and 4 million mapped reads per cell in the two pooling groups, respectively.
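
    A hedged usage sketch for the SCnorm Bioconductor package (counts and conds are placeholders for a gene-by-cell count matrix and per-cell condition labels; argument and accessor names follow the package vignette as best recalled, so treat them as assumptions):

      library(SCnorm)
      DataNorm    <- SCnorm(Data = counts, Conditions = conds)
      norm_counts <- results(DataNorm, type = "NormalizedData")  # normalized matrix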

  9. Additional file 4: of DBNorm: normalizing high-density oligonucleotide...

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy (2023). Additional file 4: of DBNorm: normalizing high-density oligonucleotide microarray data based on distributions [Dataset]. http://doi.org/10.6084/m9.figshare.5648956.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DBNorm installation. Describes how to install DBNorm via devtools in R. (TXT 4 kb)
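
    The file presumably documents the standard devtools pattern; a generic sketch (the exact GitHub repository path is an assumption, shown as a placeholder):

      install.packages("devtools")
      # Replace the placeholder with the repository path given in Additional file 4.
      devtools::install_github("<github-user>/DBNorm")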

  10. Scaling with ranked subsampling (SRS) algorithm for the normalization of...

    • repository.soilwise-he.eu
    Updated Jul 1, 2020
    Cite
    (2020). Scaling with ranked subsampling (SRS) algorithm for the normalization of species count data. [Dataset]. https://repository.soilwise-he.eu/cat/collections/metadata:main/items/4b2b65c6-ff50-4669-99cc-ace343de3548
    Explore at:
    Dataset updated
    Jul 1, 2020
    Description

    Scaling with ranked subsampling (SRS) is an algorithm for the normalization of species count data in ecology. So far, SRS has successfully been applied to microbial community data. "SRS is now available on CRAN: https://CRAN.R-project.org/package=SRS" An implementation of SRS in R is available for download: https://metadata.bonares.de/smartEditor/rest/upload/ID_7049_2020_05_13_SRS_function_v1_0_R.zip

    SRS consists of two steps. In the first step, the counts for all OTUs (operational taxonomic units) are divided by a scaling factor chosen in such a way that the sum of the scaled counts (Cscaled, with integer or non-integer values) equals Cmin. In the second step, the non-integer count values are converted into integers by an algorithm that we dub ranked subsampling. The scaled count Cscaled for each OTU is split into the integer part Cint, obtained by truncating the digits after the decimal separator (Cint = floor(Cscaled)), and the fractional part Cfrac (Cfrac = Cscaled - Cint). Since ΣCint ≤ Cmin, additional ∆C = Cmin - ΣCint counts have to be added to the library to reach the total count of Cmin. This is achieved as follows. OTUs are ranked in descending order of their Cfrac values. Beginning with the OTU of the highest rank, a single count per OTU is added to the normalized library until the total number of added counts reaches ∆C and the sum of all counts in the normalized library equals Cmin. When the lowest Cfrac involved in picking ∆C counts is shared by several OTUs, the OTUs used for adding a single count to the library are selected in the order of their Cint values. This selection minimizes the effect of normalization on the relative frequencies of OTUs. OTUs with identical Cfrac as well as Cint are sampled randomly without replacement.
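
    A direct transcription of the two steps into R (a sketch: counts is one library as a vector of OTU counts and Cmin the target depth; tie handling follows the description above, with random order for full ties):

      srs <- function(counts, Cmin) {
        scaled <- counts * Cmin / sum(counts)  # step 1: scale so the sum equals Cmin
        Cint   <- floor(scaled)                # integer parts
        Cfrac  <- scaled - Cint                # fractional parts
        delta  <- Cmin - sum(Cint)             # counts still to be distributed
        # step 2: ranked subsampling -- rank by Cfrac (descending), break ties by
        # Cint (descending), and randomize any remaining full ties
        ord <- order(Cfrac, Cint, runif(length(counts)), decreasing = TRUE)
        add <- integer(length(counts))
        add[ord[seq_len(delta)]] <- 1L
        Cint + add                             # normalized library, sums to Cmin
      }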

  11. Water-quality trends and trend component estimates for the Nation's rivers...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Water-quality trends and trend component estimates for the Nation's rivers and streams using Weighted Regressions on Time, Discharge, and Season (WRTDS) models and generalized flow normalization, 1972-2012 [Dataset]. https://catalog.data.gov/dataset/water-quality-trends-and-trend-component-estimates-for-the-nations-rivers-and-streams-1972
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    Nonstationary streamflow due to environmental and human-induced causes can affect water quality over time, yet these effects are poorly accounted for in water-quality trend models. This data release provides instream water-quality trends and estimates of two components of change for sites across the Nation previously presented in Oelsner et al. (2017). We used previously calibrated Weighted Regressions on Time, Discharge, and Season (WRTDS) models published in De Cicco et al. (2017) to estimate instream water-quality trends and associated uncertainties with the generalized flow normalization procedure available in EGRET version 3.0 (Hirsch et al., 2018a) and EGRETci version 2.0 (Hirsch et al., 2018b). The procedure allows for nonstationarity in the flow regime, whereas previous versions of EGRET assumed streamflow stationarity. Water-quality trends of annual mean concentrations and loads (also referred to as fluxes) are provided as an annual series and as the change between the start and end year for four trend periods (1972-2012, 1982-2012, 1992-2012, and 2002-2012). Information about the sites, including the collecting agency and associated streamflow gage, and about site selection and the data screening process, can be found in Oelsner et al. (2017). This data release includes results for 19 water-quality parameters, including nutrients (ammonia, nitrate, filtered and unfiltered orthophosphate, total nitrogen, total phosphorus), major ions (calcium, chloride, magnesium, potassium, sodium, sulfate), salinity indicators (specific conductance, total dissolved solids), carbon (alkalinity, dissolved organic carbon, total organic carbon), and sediment (total suspended solids, suspended-sediment concentration) at over 1,200 sites. Note that the number of parameters with data varies by site, with most sites having data for 1-4 parameters. Each water-quality trend was parsed into two components of change: (1) the streamflow trend component (QTC) and (2) the watershed management trend component (MTC). The QTC is an indicator of the amount of change in the water-quality trend attributed to changes in the streamflow regime, and the MTC is an indicator of the amount of change that may be attributed to human actions and changes in point and non-point sources in a watershed. Note that the MTC is referred to as the concentration-discharge trend component (CQTC) in the EGRET version 3.0 software; we chose to refer to this trend component as the MTC because it provides a more conceptual description (Murphy and Sprague, 2019). The trend results presented here expand upon the results in De Cicco et al. (2017) and Oelsner et al. (2017), which were analyzed using flow normalization under the stationary streamflow assumption. The results presented in this data release are intended to complement those previously published results and to support investigations into natural and human effects on water-quality trends across the United States. Data preparation information and WRTDS model specifications are described in Oelsner et al. (2017) and Murphy and Sprague (2019). This work was completed as part of the National Water-Quality Assessment (NAWQA) project of the National Water-Quality Program. A code sketch of the generalized flow-normalization step follows the references below.
    De Cicco, L.A., Sprague, L.A., Murphy, J.C., Riskin, M.L., Falcone, J.A., Stets, E.G., Oelsner, G.P., and Johnson, H.M., 2017, Water-quality and streamflow datasets used in the Weighted Regressions on Time, Discharge, and Season (WRTDS) models to determine trends in the Nation's rivers and streams, 1972-2012 (ver. 1.1, July 7, 2017): U.S. Geological Survey data release, https://doi.org/10.5066/F7KW5D4H.
    Hirsch, R., De Cicco, L., Watkins, D., Carr, L., and Murphy, J., 2018a, EGRET: Exploration and Graphics for RivEr Trends, version 3.0, https://CRAN.R-project.org/package=EGRET.
    Hirsch, R., De Cicco, L., and Murphy, J., 2018b, EGRETci: Exploration and Graphics for RivEr Trends (EGRET) Confidence Intervals, version 2.0, https://CRAN.R-project.org/package=EGRETci.
    Murphy, J.C., and Sprague, L.A., 2019, Water-quality trends in US rivers: Exploring effects from streamflow trends and changes in watershed management: Science of the Total Environment, vol. 656, p. 645-658, https://doi.org/10.1016/j.scitotenv.2018.11.255.
    Oelsner, G.P., Sprague, L.A., Murphy, J.C., Zuellig, R.E., Johnson, H.M., Ryberg, K.R., Falcone, J.A., Stets, E.G., Vecchia, A.V., Riskin, M.L., De Cicco, L.A., Mills, T.J., and Farmer, W.H., 2017, Water-quality trends in the Nation's rivers and streams, 1972-2012: Data preparation, statistical methods, and trend results (ver. 2.0, October 2017): U.S. Geological Survey Scientific Investigations Report 2017-5006, 136 p., https://doi.org/10.3133/sir20175006.
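
    A hedged sketch of the generalized flow-normalization step, assuming EGRET 3.0's runSeries() interface described in Hirsch et al. (2018a) (eList is a prepared EGRET object of site, sample and daily-flow data; the windowSide value is illustrative):

      library(EGRET)
      eList <- modelEstimation(eList)    # calibrate the WRTDS model
      # windowSide > 0 requests generalized (non-stationary) flow normalization;
      # a stationary analysis corresponds to windowSide = 0.
      eList <- runSeries(eList, windowSide = 7)
      plotConcHist(eList)                # annual flow-normalized concentrations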

  12. Citizen Science in the Ironbound Community

    • catalog.data.gov
    Updated Nov 12, 2020
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Citizen Science in the Ironbound Community [Dataset]. https://catalog.data.gov/dataset/citizen-science-in-the-ironbound-community
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Ironbound
    Description

    Time stamp data of non-identified locations within the Ironbound community. A normalization data table is provided that defines the regression established between sensor pod units 1-4 and State of New Jersey reference monitoring data during a collocation event. In addition, a table is provided that defines the 90th percentile of air quality measures following data normalization. This dataset is associated with the following publication: Kaufman, A., R. Williams, T. Barzyk, M. Greenberg, M. OShea, P. Sheridan, A. Hoang, C. Ash, A. Teitz, M. Mustafa, and S. Garvey. A Citizen Science and Government Collaboration: Developing Tools to Facilitate Community Air Monitoring. ENVIRONMENTAL JUSTICE. Mary Ann Liebert, Inc., New Rochelle, NY, USA, 10(2): 1-11, (2017).
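
    A sketch of how such a collocation-derived regression is typically applied (object and column names are placeholders, not the table's actual schema):

      # Calibrate a pod against the reference monitor over the collocation period.
      fit <- lm(reference ~ pod, data = collocation)
      # Apply the fitted regression to normalize field measurements from that pod.
      field$normalized <- predict(fit, newdata = field["pod"])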

  13. Data from: A High Statistics Measurement of the Deuteron Structure Functions...

    • hepdata.net
    Cite
    A High Statistics Measurement of the Deuteron Structure Functions F2 (X, $Q^2$) and R From Deep Inelastic Muon Scattering at High $Q^2$ [Dataset]. http://doi.org/10.17182/hepdata.6191.v1
    Explore at:
    Description

    CERN-SPS. NA4/BCDMS Collaboration. Plab 120-280 GeV/c. These are data from the BCDMS Collaboration on F2 and R = SIG(L)/SIG(T) with a deuterium target. The ranges of X and Q**2 are 0.06 < X < 0.8 and 8 < Q**2 < 260 GeV**2. The publication lists values of F2 corresponding to R = 0 and R = R(QCD) at each of the three energies: 120, 200 and 280 GeV. As well as the statistical errors, also given are 5 factors representing the effects of estimated systematic errors on F2, associated with (1) beam momentum calibration, (2) magnetic field calibration, (3) spectrometer resolution, (4) detector and trigger inefficiencies, and (5) relative normalization uncertainty of data taken from external and internal targets. This record contains our attempt to merge the data at the different energies using the statistical errors as weight factors. The final one-sigma systematic errors given here were calculated using a prescription from the authors: new merged F2 values are computed with each of the systematic errors applied individually, and the differences between these new merged F2 values and the original merged F2 are then combined in quadrature. The individual F2 values at each energy (Plab 120, 200 and 280 GeV/c) are given in separate database records; in those records, the systematic error shown in the tables is the quadratic sum of the 5 individual errors, following the same prescription from the authors.
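
    A small sketch of that prescription in R for a single (x, Q**2) bin (f2 and stat are per-energy F2 values and statistical errors, and f2_sys the merged F2 recomputed with each of the 5 systematic sources applied individually; all hypothetical names):

      w           <- 1 / stat^2                 # statistical errors as weights
      f2_merged   <- sum(w * f2) / sum(w)       # weighted mean across energies
      stat_merged <- sqrt(1 / sum(w))           # merged statistical error
      # combine in quadrature the shifts each systematic source induces
      sys_merged  <- sqrt(sum((f2_sys - f2_merged)^2))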

  14. Mitoplate S-1 analysis using R

    • data.mendeley.com
    Updated Mar 5, 2020
    Cite
    Flavia Radogna (2020). Mitoplate S-1 analysis using R [Dataset]. http://doi.org/10.17632/b9mprfdvmv.1
    Explore at:
    Dataset updated
    Mar 5, 2020
    Authors
    Flavia Radogna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This R script performs normalisation of data obtained with the MitoPlate S-1 commercialised by Biolog. In addition, it creates a scatterplot of initial-rate values between conditions of interest. The script includes a first normalisation step using the "No substrate" well (A1), required for rows A to H, and a second normalisation step using the "L-Malic Acid 100 µM" well (G1), required only for rows G and H. Initial-rate values are calculated as the slope of a linear regression fitted between 30 minutes and 2 hours.
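
    A minimal sketch of the two normalisation steps and the initial-rate fit (assumed layout: mat is a wells x timepoints signal matrix with rownames like "A1", and time_h the timepoints in hours; the archived script is authoritative):

      norm <- sweep(mat, 2, mat["A1", ], "-")                # subtract "No substrate" (A1)
      gh   <- grep("^[GH]", rownames(norm))                  # rows G and H only
      norm[gh, ] <- sweep(norm[gh, ], 2, norm["G1", ], "-")  # subtract G1 (L-Malic Acid)
      win   <- time_h >= 0.5 & time_h <= 2                   # 30 min to 2 h window
      rates <- apply(norm[, win], 1, function(y) coef(lm(y ~ time_h[win]))[2])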

  15. Data from: Copy number variants outperform SNPs to reveal...

    • datadryad.org
    • data.niaid.nih.gov
    • +2more
    zip
    Updated Aug 19, 2020
    Cite
    Yann Dorant; Hugo Cayuela; Kyle Wellband; Martin Laporte; Quentin Rougemont; Claire Mérot; Éric Normandeau; Rémy Rochette; Louis Bernatchez (2020). Copy number variants outperform SNPs to reveal genotype-temperature association in a marine species [Dataset]. http://doi.org/10.5061/dryad.vt4b8gtnv
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Dryad
    Authors
    Yann Dorant; Hugo Cayuela; Kyle Wellband; Martin Laporte; Quentin Rougemont; Claire Mérot; Éric Normandeau; Rémy Rochette; Louis Bernatchez
    Time period covered
    2020
    Description

    Raw VCF consisted of 44374 unfiltered SNPs batch_1.vcf

    File containing estimated metrics for SNP characterization (i.e. singleton, duplicated, diverged, low confidence) SNPs_caracterization_metrics.txt

    Filtered VCF of 14534 SNPs identified as singletons filtered_singleton_SNPs.vcf

    Filtered VCF of 9659 SNPs identified as duplicated filtered_duplicated_SNPs.vcf

    Matrix of normalized read depth for CNV loci CNVs_normalized_read_depth.txt

    Sea surface temperature data Sea_surface_temperatures.txt

    Script for SNPs classification (singleton, duplicated) 00_classify_snps_lobster_Rapture.R

    Script for CNV data normalization 01-edgeR_normalization_CNVs_data.R

  16. Transposon DNA sequences facilitate the tissue-specific horizontal transfer:...

    • zenodo.org
    zip
    Updated Sep 7, 2023
    Cite
    Munevver Cinar; Lourdes Martinez-Medina; Pavan K. Puvvula; Arsen Arakelyan; Badri N. Vardarajan; Neil Anthony; Ganji P. Nagaraju; Dongkyoo Park; Lei Feng; Faith Sheff; Marina Mosunjac; Debra Saxe; Steven Flygare; Olatunji B. Alese; Jonathan Kaufman; Sagar Lonial; Juan Sarmiento; Izidore S. Lossos; Paula M. Vertino; Jose A. Lopez; Bassel El-Rayes; Leon Bernal-Mizrachi (2023). Transposon DNA sequences facilitate the tissue-specific horizontal transfer: te expression supplementary datasets [Dataset]. http://doi.org/10.5281/zenodo.8005564
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Munevver Cinar; Lourdes Martinez-Medina; Pavan K. Puvvula; Arsen Arakelyan; Badri N. Vardarajan; Neil Anthony; Ganji P. Nagaraju; Dongkyoo Park; Lei Feng; Faith Sheff; Marina Mosunjac; Debra Saxe; Steven Flygare; Olatunji B. Alese; Jonathan Kaufman; Sagar Lonial; Juan Sarmiento; Izidore S. Lossos; Paula M. Vertino; Jose A. Lopez; Bassel El-Rayes; Leon Bernal-Mizrachi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets contain data on analyses of TE expression stability. Raw RNA-seq counts were processed using the DESeq2 R package. Low-count genes and TEs were removed. Counts across samples were normalized for library sizes and log-transformed using the 'regularized log' transformation. Batch normalization was performed on the log-transformed data with the ComBat function from the sva R package. Expression variability (EV) of TEs and genes (probes) was estimated using the previously described method [1, 2]. A code sketch of this pipeline appears at the end of this entry.

    1. Bashkeel, N., Perkins, T.J., Kærn, M. et al. Human gene expression variability and its dependence on methylation and aging. BMC Genomics 20, 941 (2019). https://doi.org/10.1186/s12864-019-6308-7
    2. Alemu EY, Carl JW Jr, Corrada Bravo H, Hannenhalli S. Determinants of expression variability. Nucleic Acids Res. 2014;42(6):3503-3514. doi:10.1093/nar/gkt1364

    PC.zip - the results of TE expression and stability in prostate cancer.

    • 0.PC.RlogMAD.pdf - count barplots for TE identified with MAD criteria
    • 0.PC.RlogSD.pdf - count barplots for TE identified with SD criteria
    • 0.PC.TE.rlogcpm.mad.xls - stability measures according to the MAD (median absolute deviation) criterion
    • 0.PC.TE.rlogcpm.sd.xls - stability measures according to the SD criterion
    • 0.PC_TE_bootstrap.pdf - TE expression stability
    • PC.deseq.logCPM.csv - TE log-transformed expression matrix
    • PC.TE_count_table.csv - TE raw count matrix
    • PC_Deseq2data.Rdata - R data object with the DESeq object, raw and normalized counts

    MM.zip - the results of TE expression and stability in multiple myeloma.

    • 0.MM.RlogMAD.pdf - count barplots for TE identified with MAD criteria
    • 0.MM.RlogSD.pdf - count barplots for TE identified with SD criteria
    • 0.MM_TE_bootstrap.pdf - TE expression stability
    • MM.deseq.logCPM.csv - TE log-transformed expression matrix
    • MM.rlog.mad.xlsx - stability measures according to the MAD (median absolute deviation) criterion
    • MM.rlog.sd.xlsx - stability measures according to the SD criterion
    • MM.TE_count_table.csv - TE raw count matrix
    • Myeloma_Deseq2data.Rdata - R data object with the DESeq object, raw and normalized counts
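
    A sketch of the processing pipeline described in this entry (cts, a raw count matrix, and coldata, a sample table with a batch column, are placeholder objects):

      library(DESeq2)
      library(sva)
      dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ 1)
      dds <- dds[rowSums(counts(dds)) > 10, ]  # drop low-count genes/TEs
      rld <- rlog(dds, blind = TRUE)           # 'regularized log' transformation
      mat <- assay(rld)                        # log-scale, library-size-normalized
      mat_batch_corrected <- ComBat(dat = mat, batch = coldata$batch)
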
  17. Geospatial Deep Learning Seminar Online Course

    • ckan.americaview.org
    Updated Nov 2, 2021
    Cite
    ckan.americaview.org (2021). Geospatial Deep Learning Seminar Online Course [Dataset]. https://ckan.americaview.org/dataset/geospatial-deep-learning-seminar-online-course
    Explore at:
    Dataset updated
    Nov 2, 2021
    Dataset provided by
    CKAN (https://ckan.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This seminar is an applied study of deep learning methods for extracting information from geospatial data, such as aerial imagery, multispectral imagery, digital terrain data, and other digital cartographic representations. We first provide an introduction and conceptualization of artificial neural networks (ANNs). Next, we explore appropriate loss and assessment metrics for different use cases, followed by the tensor data model, which is central to applying deep learning methods. Convolutional neural networks (CNNs) are then conceptualized with scene classification use cases. Lastly, we explore semantic segmentation, object detection, and instance segmentation. The primary focus of this course is semantic segmentation for pixel-level classification. The associated GitHub repo provides a series of applied examples. We hope to continue to add examples as methods and technologies further develop. These examples make use of a variety of datasets (e.g., SAT-6, topoDL, Inria, LandCover.ai, vfillDL, and wvlcDL). Please see the repo for links to the data and associated papers. All examples have associated videos that walk through the process, which are also linked to the repo. A variety of deep learning architectures are explored, including UNet, UNet++, DeepLabv3+, and Mask R-CNN. Currently, two examples use ArcGIS Pro and require no coding. The remaining five examples require coding and make use of PyTorch, Python, and R within the RStudio IDE. It is assumed that you have prior knowledge of coding in the Python and R environments. If you do not have experience coding, please take a look at our Open-Source GIScience and Open-Source Spatial Analytics (R) courses, which explore coding in Python and R, respectively. After completing this seminar you will be able to:
    • explain how ANNs work, including weights, bias, activation, and optimization.
    • describe and explain different loss and assessment metrics and determine appropriate use cases.
    • use the tensor data model to represent data as input for deep learning.
    • explain how CNNs work, including convolutional operations/layers, kernel size, stride, padding, max pooling, activation, and batch normalization.
    • use PyTorch, Python, and R to prepare data, produce and assess scene classification models, and infer to new data.
    • explain common semantic segmentation architectures, how these methods allow for pixel-level classification, and how they differ from traditional CNNs.
    • use PyTorch, Python, and R (or ArcGIS Pro) to prepare data, produce and assess semantic segmentation models, and infer to new data.

  18. GC/MS Simulated Data Sets including batch effects and data truncation (not...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Scholtens, Denise (2023). GC/MS Simulated Data Sets including batch effects and data truncation (not normalized) [Dataset]. http://doi.org/10.7910/DVN/JDRJGY
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Scholtens, Denise
    Description

    1000 simulated data sets stored in a list of R data frames, used in support of Reisetter et al. (submitted), 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are simulated data sets that include batch effects and data truncation and are not yet normalized.

  19. Data from: Novel R pipeline for analyzing Biolog phenotypic microarray data

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated Feb 17, 2016
    Cite
    Minna Vehkala; Mikhail Shubin; Thomas R. Connor; Nicholas R. Thomson; Jukka Corander (2016). Novel R pipeline for analyzing Biolog phenotypic microarray data [Dataset]. http://doi.org/10.5061/dryad.r98g7
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 17, 2016
    Dataset provided by
    Wellcome Trust
    University of Helsinki
    Cardiff University
    Authors
    Minna Vehkala; Mikhail Shubin; Thomas R. Connor; Nicholas R. Thomson; Jukka Corander
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Data produced by Biolog Phenotype MicroArrays are longitudinal measurements of cells' respiration on distinct substrates. We introduce a three-step pipeline to analyze phenotypic microarray data with novel procedures for grouping, normalization and effect identification. Grouping and normalization are standard problems in the analysis of phenotype microarrays, defined as categorizing bacterial responses into active and non-active, and removing systematic errors from the experimental data, respectively. We expand existing solutions by introducing an important assumption that active and non-active bacteria manifest completely different metabolism and thus should be treated separately. Effect identification, in turn, provides new insights into detecting differing respiration patterns between experimental conditions, e.g. between different combinations of strains and temperatures, as not only the main effects but also their interactions can be evaluated. In the effect identification, the multilevel data are effectively processed by a hierarchical model in the Bayesian framework. The pipeline is tested on a data set of 12 phenotypic plates with the bacterium Yersinia enterocolitica. Our pipeline is implemented in the R language on top of the opm R package and is freely available for research purposes.
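
    A generic, hedged sketch of the grouping-then-normalization idea (not the authors' implementation; the plate object, activity cutoff, and centering choice are all illustrative):

      # plate: wells x timepoints respiration matrix (illustrative object)
      active <- apply(plate, 1, function(curve) max(curve) >= 100)  # grouping step
      remove_run_effects <- function(x) sweep(x, 2, colMeans(x), "-")
      plate_norm <- plate
      plate_norm[active, ]  <- remove_run_effects(plate[active, , drop = FALSE])
      plate_norm[!active, ] <- remove_run_effects(plate[!active, , drop = FALSE])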

  20. Data and code archive for project "Tracing caffeine and its metabolite in...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jun 12, 2023
    Cite
    Chenhui Li; Mohamed Bayati; Shu-Yu Hsu; Hsin-Yeh Hsieh; Lindsi Wilfing; Anthony Belenchia; Sally A. Zemmer; Jessica Klutts; Mary Samuelson; Melissa Reynolds; Elizabeth Semkiw; Hwei-Yiing Johnson; Trevor Foley; Chris G. Wieberg; Jeff Wenzel; Terri D. Lyddon; Mary LePique; Clayton Rushford; Braxton Salcedo; Kara Young; Madalyn Graham; Reinier Suarez; Anarose Ford; Dagmara S. Antkiewicz; Kayley H. Janssen; Martin M. Shafer; Marc C. Johnson; Chung-Ho Lin; Sally Qasim (2023). Data and code archive for project "Tracing caffeine and its metabolite in wastewater to understand the spread of SARS-CoV-2" [Dataset]. http://doi.org/10.5281/zenodo.7730498
    Explore at:
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Chenhui Li; Mohamed Bayati; Shu-Yu Hsu; Hsin-Yeh Hsieh; Lindsi Wilfing; Anthony Belenchia; Sally A. Zemmer; Jessica Klutts; Mary Samuelson; Melissa Reynolds; Elizabeth Semkiw; Hwei-Yiing Johnson; Trevor Foley; Chris G. Wieberg; Jeff Wenzel; Terri D. Lyddon; Mary LePique; Clayton Rushford; Braxton Salcedo; Kara Young; Madalyn Graham; Reinier Suarez; Anarose Ford; Dagmara S. Antkiewicz; Kayley H. Janssen; Martin M. Shafer; Marc C. Johnson; Chung-Ho Lin; Sally Qasim
    Description

    This dataset/code archive includes all the data and R code used to explore universal and robust wastewater biomarkers for population normalization in SARS-CoV-2 wastewater-based epidemiology. There are nine R code files that produce the figures and tables. A sketch of the normalization idea follows the list. The data include:

    1. Raw data of weekly biomarker (caffeine, paraxanthine, and PMMoV) wastewater concentrations, weekly new COVID-19 case numbers, SARS-CoV-2 N1/N2 copies in wastewater, and wastewater flow rate
      • A total of 2,624 wastewater samples (41 weeks) were collected weekly from May 2021 to April 2022 from 64 wastewater treatment plants across Missouri, US;
      • PMMoV data were only available from Sep 13, 2021 to April 2022 for the Missouri data;
      • A validation dataset from 10 wastewater treatment plants across Wisconsin, US, was used to test the relationship between wastewater biomarkers and population.
    2. Downloaded Apple mobility data during the pandemic
    3. Validation dataset for wastewater flow rate estimation using paraxanthine concentrations.
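
    A sketch of the biomarker-based population normalization idea (column names are placeholders): viral loads are expressed per unit of a human-waste biomarker so trends are comparable across sewersheds of different sizes.

      df$viral_flux     <- df$n1_n2_copies_L * df$flow_L_per_day  # copies/day
      df$biomarker_flux <- df$caffeine_ng_L  * df$flow_L_per_day  # ng/day
      df$normalized     <- df$viral_flux / df$biomarker_flux      # copies per ng biomarker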