66 datasets found

f
Data_Sheet_2_NormExpression: An R Package to Normalize Gene Expression Data...
frontiersin.figshare.com
zip
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao (2023). Data_Sheet_2_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.zip [Dataset]. http://doi.org/10.3389/fgene.2019.00400.s002
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/fgene.2019.00400.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data normalization is a crucial step in the gene expression analysis as it ensures the validity of its downstream analyses. Although many metrics have been designed to evaluate the existing normalization methods, different metrics or different datasets by the same metric yield inconsistent results, particularly for the single-cell RNA sequencing (scRNA-seq) data. The worst situations could be that one method evaluated as the best by one metric is evaluated as the poorest by another metric, or one method evaluated as the best using one dataset is evaluated as the poorest using another dataset. Here raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose a principle that one normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics) and one method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). Then, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric mSCC to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings paved the way to guide future studies in the normalization of gene expression data with its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to select the best method for the normalization of their gene expression data based on the evaluation of different methods (particularly some data-driven methods or their own methods) in the principle of the consistency of metrics and the consistency of datasets.
n
Methods for normalizing microbiome data: an ecological perspective
data.niaid.nih.gov
datadryad.org
zip
Updated Oct 30, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.tn8qs35
Dataset updated
Oct 30, 2018
Dataset provided by
University of New England
James Cook University
Authors
Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Microbiome sequencing data often need to be normalized due to differences in read depths, and recommendations for microbiome analyses generally warn against using proportions or rarefying to normalize data and instead advocate alternatives, such as upper quartile, CSS, edgeR-TMM, or DESeq-VS. Those recommendations are, however, based on studies that focused on differential abundance testing and variance standardization, rather than community-level comparisons (i.e., beta diversity), Also, standardizing the within-sample variance across samples may suppress differences in species evenness, potentially distorting community-level patterns. Furthermore, the recommended methods use log transformations, which we expect to exaggerate the importance of differences among rare OTUs, while suppressing the importance of differences among common OTUs. 2. We tested these theoretical predictions via simulations and a real-world data set. 3. Proportions and rarefying produced more accurate comparisons among communities and were the only methods that fully normalized read depths across samples. Additionally, upper quartile, CSS, edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed. 4. Based on our simulations, normalizing via proportions may be superior to other commonly used methods for comparing ecological communities.
n
Data from: A systematic evaluation of normalization methods and probe...
data.niaid.nih.gov
dataone.org
+2more
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.cnp5hqc7v
Dataset updated
May 30, 2023
Dataset provided by
Universidade de São Paulo
University of Toronto
Hospital for Sick Children
Authors
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2). Methods

Study Participants and Samples

The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.

All individuals enrolled in the SABE cohort provided written consent, and the ethic protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

Blood Collection and Processing

Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

Characterization of DNA Methylation using the EPIC array

Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

Processing and Analysis of DNA Methylation Data

The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01 and (4) removed probes if more than 5% of the samples having a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using out-of-band probes empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

Normalization Methods Evaluated

The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.
d
GC/MS Simulated Data Sets normalized using median scaling
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scholtens, Denise (2023). GC/MS Simulated Data Sets normalized using median scaling [Dataset]. http://doi.org/10.7910/DVN/OYOLXD
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/OYOLXD
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Scholtens, Denise
Description
1000 simulated data sets stored in a list of R dataframes used in support of Reisetter et al. (submitted) 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are results after normalization using median scaling as described in Reisetter et al.
d
R script to reproduce \"Improved normalization of species count data in...
search.dataone.org
Updated Mar 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BonaRes Repository (2025). R script to reproduce \"Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities\".@en [Dataset]. https://search.dataone.org/view/sha256%3Aa934b23425b0e7e7d9d4278f89745fc842e75fdfe8b47de25c797034dadc1f51
Explore at:
Dataset updated
Mar 21, 2025
Dataset provided by
BonaRes Repository
Area covered

Description
R script to reproduce "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities"..
Additional file 4: of DBNorm: normalizing high-density oligonucleotide...
springernature.figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy (2023). Additional file 4: of DBNorm: normalizing high-density oligonucleotide microarray data based on distributions [Dataset]. http://doi.org/10.6084/m9.figshare.5648956.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5648956.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
DBNorm installation. Describes how to install DBNorm via devtools in R. (TXT 4Â kb)
d
GC/MS Simulated Data Sets normalized using quantile normalization
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scholtens, Denise (2023). GC/MS Simulated Data Sets normalized using quantile normalization [Dataset]. http://doi.org/10.7910/DVN/R3P9SS
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/R3P9SS
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Scholtens, Denise
Description
1000 simulated data sets stored in a list of R dataframes used in support of Reisetter et al. (submitted) 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are results after normalization using quantile normalization (Bolstad et al. 2003).
Naturalistic Neuroimaging Database
openneuro.org
Updated Apr 20, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v2.0.0
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds002837.v2.0.0
Dataset updated
Apr 20, 2021
Dataset provided by
OpenNeurohttps://openneuro.org/
Authors
Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Overview

The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI).The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, with no hearing impairments, unimpaired or corrected vision and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10 minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.

The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

v2.0 Changes

Overview

We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.

Normalization

Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:

# Generate a resting state (rs) timeseries (ts) # Install / load package to make fake fMRI ts # install.packages("neuRosim") library(neuRosim) # Generate a ts ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1) # 3dDetrend -normalize # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1" # Do for the full timeseries ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2)); # Do this again for a shorter version of the same timeseries ts.shorter.length <- length(ts.normalised.long)/4 ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2)); # By looking at the summaries, it can be seen that the median values become larger summary(ts.normalised.long) summary(ts.normalised.short) # Plot results for the long and short ts # Truncate the longer ts for plotting only ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length] # Give the plot a title title <- "3dDetrend -normalize for long (blue) and short (red) timeseries"; plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short))); # Add zero line lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey'); # 3dDetrend -normalize -polort 0 for long timeseries lines(ts.normalised.long.made.shorter, col='blue'); # 3dDetrend -normalize -polort 0 for short timeseries lines(ts.normalised.short, col='red');

Standardization/modernization

The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewers steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analyses practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.

New afni_proc.py command line

The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html): afni_proc.py \ -subj_id "$sub_id_name_1" \ -blocks despike tshift align tlrc volreg mask blur scale regress \ -radial_correlate_blocks tcat volreg \ -copy_anat anatomical_warped/anatSS.1.nii.gz \ -anat_has_skull no \ -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \ -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \ -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \ -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \ -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \ -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \ -anat_follower_erode fsvent fswm \ -dsets media_?.nii.gz \ -tcat_remove_first_trs 8 \ -tshift_opts_ts -tpattern alt+z2 \ -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \ -tlrc_base "$basedset" \ -tlrc_NL_warp \ -tlrc_NL_warped_dsets \ anatomical_warped/anatQQ.1.nii.gz \ anatomical_warped/anatQQ.1.aff12.1D \ anatomical_warped/anatQQ.1_WARP.nii.gz \ -volreg_align_to MIN_OUTLIER \ -volreg_post_vr_allin yes \ -volreg_pvra_base_index MIN_OUTLIER \ -volreg_align_e2a \ -volreg_tlrc_warp \ -mask_opts_automask -clfrac 0.10 \ -mask_epi_anat yes \ -blur_to_fwhm -blur_size $blur \ -regress_motion_per_run \ -regress_ROI_PC fsvent 3 \ -regress_ROI_PC_per_run fsvent \ -regress_make_corr_vols aeseg fsvent \ -regress_anaticor_fast \ -regress_anaticor_label fswm \ -regress_censor_motion 0.3 \ -regress_censor_outliers 0.1 \ -regress_apply_mot_types demean deriv \ -regress_est_blur_epits \ -regress_est_blur_errts \ -regress_run_clustsim no \ -regress_polort 2 \ -regress_bandpass 0.01 1 \ -html_review_style pythonic We used similar command lines to generate ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes but this number can be variable (thus leading to the above issue with 3dDetrend’s -normalise). A discussion on the AFNI message board with one of our team (starting here, https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256), led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.

Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words: * Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice). * Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere). * For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data). * For censored data: * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation. * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data. In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.

Effect on results

From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
d
Data from: Normalized Difference Vegetation Index for Fanno Creek, Oregon
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 1, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Normalized Difference Vegetation Index for Fanno Creek, Oregon [Dataset]. https://catalog.data.gov/dataset/normalized-difference-vegetation-index-for-fanno-creek-oregon
Explore at:
Dataset updated
Nov 1, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Oregon, Fanno Creek
Description
Fanno Creek is a tributary to the Tualatin River and flows though parts of the southwest Portland metropolitan area. The stream is heavily influenced by urban runoff and shows characteristic flashy streamflow and poor water quality commonly associated with urban streams. This data set represents the Normalized Difference Vegetation Index (NDVI), or "greenness" of the Fanno Creek floodplain study area. Aerial photography was used to isolate areas of vegetation based on comparing different bandwidths within the imagery. In this case, the NDVI is calculated as the quotient of the near infrared band minus the red band divided by the near infared plus the red band. NDVI = (NIR - R)/(NIR + R).
t
Supplemental materials to the conference paper "validating 111.1 million...
service.tib.eu
Updated May 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Supplemental materials to the conference paper "validating 111.1 million marc records" [Dataset]. https://service.tib.eu/ldmservice/dataset/goe-doi-10-25625-amf8jc
Explore at:
Dataset updated
May 16, 2025
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
User guide To generate the reports: prerequisite: Java 8 runtime environment download metadata-qa-marc project as it is described at https://github.com/pkiraly/metadata-qa-marc (e.g. into ~/git/metadata-qa-marc directory) download the .sh and .R files from this project to a subdirectory (e.g. 'scripts') adjust the DIR variable in the [library-name].sh files according to your directory structure run-all.sh creates -details.csv and -summary.csv files into $DIR/_reports directory If you do not want to generate the reports, but would like to use the data files provided, download *.csv.gz files to a '_reports' directory. To generate Table 2. and 3. of the paper: prerequisite: R move normalize-summary.sh, distill-ids.sh, and normalize-ids.sh into $DIR/_reports directory cd $DIR/_reports ./normalize-summary.sh ./distill-ids.sh ./normalize-ids.sh Rscript evaluate-details.R Rscript evaluate-summary.R
Z
AI4PROFHEALTH - Automatic Silver Gazetteer for Named Entity Recognition and...
data.niaid.nih.gov
zenodo.org
Updated Nov 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rodríguez Miret, Jan (2024). AI4PROFHEALTH - Automatic Silver Gazetteer for Named Entity Recognition and Normalization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14210424
Explore at:
Dataset updated
Nov 25, 2024
Dataset provided by
Krallinger, Martin
Marsol Torrent, Sergi
Rodríguez Miret, Jan
Lima-López, Salvador
Becerra-Tomé, Alberto
Farré-Maduell, Eulàlia
Rodríguez Ortega, Miguel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises a professions gazetteer generated with automatically extracted terminology from the Mesinesp2 corpus, a manually annotated corpus in which domain experts have labeled a set of scientific literature, clinical trials, and patent abstracts, as well as clinical case reports.

A silver gazetteer for mention classification and normalization is created combining the predictions of automatic Named Entity Recognition models and normalization using Entity Linking to three controlled vocabularies SNOMED CT, NCBI and ESCO. The sources are 265,025 different documents, where 249,538 correspond to MESINESP2 Corpora and 15,487 to clinical cases from open clinical journals. From them, 5,682,000 mentions are extracted and 4,909,966 (86.42%) are normalized to any of the ontologies: SNOMED CT (4,909,966) for diseases, symptoms, drugs, locations, occupations, procedures and species; ESCO (215,140) for occupations; and NCBI (1,469,256) for species.

The repository contains a .tsv file with the following columns:

filenameid: A unique identifier combining the file name and mention span within the text. This ensures each extracted mention is uniquely traceable. Example: biblio-1000005#239#256 refers to a mention spanning characters 239–256 in the file with the name biblio-1000005.

span: The specific text span (mention) extracted from the document, representing a term or phrase identified in the dataset. Example: centro oncológico.

source: The origin of the document, indicating the corpus from which the mention was extracted. Possible values: mesinesp2, clinical_cases.

filename: The name of the file from which the mention was extracted. Example: biblio-1000005.

mention_class: Categories or semantic tags assigned to the mention, describing its type or context in the text. Example: ['ENFERMEDAD', 'SINTOMA'].

codes_esco: The normalized ontology codes from the European Skills, Competences, Qualifications, and Occupations (ESCO) vocabulary for the identified mention (if applicable). This field may be empty if no ESCO mapping exists. Example: 30629002.

terms_esco: The human-readable terms from the ESCO ontology corresponding to the codes_esco. Example: ['responsable de recursos', 'director de recursos', 'directora de recursos'].

codes_ncbi: The normalized ontology codes from the NCBI Taxonomy vocabulary for species (if applicable). This field may be empty if no NCBI mapping exists.

terms_ncbi: The human-readable terms from the NCBI Taxonomy vocabulary corresponding to the codes_ncbi. Example: ['Lacandoniaceae', 'Pandanaceae R.Br., 1810', 'Pandanaceae', 'Familia'].

codes_sct: The normalized ontology codes from SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) vocabulary for diseases, symptoms, drugs, locations, occupations, procedures, and species (if applicable). Example: 22232009.

terms_sct: The human-readable terms from the SNOMED CT ontology corresponding to the codes_sct. Example: ['adjudicador de regulaciones del seguro nacional'].

sct_sem_tag: The semantic category tag assigned by SNOMED CT to describe the general classification of the mention. Example: environment.

Suggestion: If you load the dataset using python, it is recommended to read the columns containing lists as follows

import ast

df["mention_class"] = df["mention_class"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

License

This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). This means you are free to:

Share: Copy and redistribute the material in any medium or format.

Adapt: Remix, transform, and build upon the material for any purpose, even commercially.

Attribution Requirement: Please credit the dataset creators appropriately, provide a link to the license, and indicate if changes were made.

Contact

If you have any questions or suggestions, please contact us at:

Martin Krallinger ()

Additional resources and corpora

If you are interested, you might want to check out these corpora and resources:

MESINESP-2 (Corpus of manually indexed records with DeCS /MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts, different document collection)

MEDDOPROF corpus

Codes Reference List (for MEDDOPROF-NORM)

Annotation Guidelines

Occupations Gazetteer
f
Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Jun 23, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Paules, Richard S.; Ramaiahgari, Sreenivasa C.; Ferguson, Stephen S.; Auerbach, Scott S.; Bushel, Pierre R. (2020). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000579048
Explore at:
Dataset updated
Jun 23, 2020
Authors
Paules, Richard S.; Ramaiahgari, Sreenivasa C.; Ferguson, Stephen S.; Auerbach, Scott S.; Bushel, Pierre R.
Description
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven-fold change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well as did simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
Processed data - DegNorm: Normalization of generalized transcript...
zenodo.org
zip
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bin Xiong; Yiben Yang; Frank R Fineis; Ji-Ping Wang; Bin Xiong; Yiben Yang; Frank R Fineis; Ji-Ping Wang (2020). Processed data - DegNorm: Normalization of generalized transcript degradation improves accuracy in RNA-seq analysis [Dataset]. http://doi.org/10.5281/zenodo.2595303
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.2595303
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Bin Xiong; Yiben Yang; Frank R Fineis; Ji-Ping Wang; Bin Xiong; Yiben Yang; Frank R Fineis; Ji-Ping Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Processed data from DegNorm:

"_raw.txt": raw read counts matrix;

"_DI.txt": Degradation index score matrix;

"_DegNorm.txt": normalized read counts matrix from DegNorm output;

"_coverage.Rdata": list of coverage matrix for the sample;

"_countsTIN.txt": TIN normalized counts.
The sensitivity of transcriptomics BMD modeling to the methods used for...
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
catalog.data.gov
Updated Aug 20, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2021). The sensitivity of transcriptomics BMD modeling to the methods used for microarray data normalization [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/the-sensitivity-of-transcriptomics-bmd-modeling-to-the-methods-used-for-microarray-data-no
Explore at:
Dataset updated
Aug 20, 2021
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
This dataset is a project file generated by BMDExpress 2.2 SW (Sciome, Research Triangle Park, NC). It contains gene expression data for livers of rats exposed to 4 chemicals (crude MCHM, neat MCHM, DMPT, p-toluidine) and kidneys of rats exposed to PPH. The project file includes normalized expression data (GeneChip Rat 230 2.0 Array) using 7 different pre-processing methods (RMA, GCRMA, MAS5.0, MAS5.0_noA calls, PLIER, PLIER16, and PLIER16_noA calls); differentially expressed probe-sets detected by William's method (p<0.05, and minimum fold change of 1.5); probeset-level and pathway-level BMD and BMDL values from transcriptomic dose-response modeling. This dataset is associated with the following publication: Mezencev, R., and S. Auerbach. The sensitivity of transcriptomics BMD modeling to the methods used for microarray data normalization. PLOS ONE. Public Library of Science, San Francisco, CA, USA, 15(5): e0232955, (2020).
Additional file 3: of DBNorm: normalizing high-density oligonucleotide...
springernature.figshare.com
txt
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy (2023). Additional file 3: of DBNorm: normalizing high-density oligonucleotide microarray data based on distributions [Dataset]. http://doi.org/10.6084/m9.figshare.5648932.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5648932.v1
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
DBNorm test script. Code of how we test DBNorm package. (TXT 2Â kb)
e
Normalized Halpha line profiles of FGK stars - Dataset - B2FIND
b2find.eudat.eu
Updated Nov 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Normalized Halpha line profiles of FGK stars - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/e3a21cd3-7814-502b-9087-2a670561e73f
Explore at:
Dataset updated
Nov 4, 2023
Description
The determination of stellar effective temperature (T_eff_) in F, G, and K stars using Halpha profile fitting is a quite remarkable and powerful tool because it does not depend on reddening and is only slightly sensitive to other atmospheric parameters. Nevertheless, this technique is not frequently used because of the complex procedure needed to recover the profile of broad lines in echelle spectra. As a consequence, tests performed on different models have sometimes provided ambiguous results. The main aim of this work is to test the ability of the Halpha profile fitting technique to derive T_eff. We also aim to improve the applicability of this technique to echelle spectra and to test how well 1D+LTE models perform on a variety of F-K stars. We also apply the technique to HARPS spectra and test the reliability and the stability of the HARPS response over several years using the Sun. We have developed a normalization method for recovering undistorted Halpha profiles and we have first applied it to spectra acquired with the single-order coude instrument (resolution R=45000) at do Pico dos Dias Observatory to avoid the problem of blaze correction. The continuum location around Halpha is optimised using an iterative procedure, where the identification of minute telluric features is performed. A set of spectra was acquired with the MUSICOS echelle spectrograph (R=40000) to independently validate the normalization method. The accuracy of the method and of the 1D+LTE model is determined using coude/HARPS/MUSICOS spectra of the Sun and coude-only spectra of a sample of ten Gaia Benchmark Stars with T_eff_ determined from interferometric measurements. HARPS, coude, and MUSICOS spectra are used to determine T_eff_ of 43 sample stars. We find that a proper choice of spectral windows of fits plus the identification of telluric features allow for a very careful normalization of the spectra and produce reliable H{alpha} profiles. We also find that the most used solar atlases cannot be used as templates for Halpha temperature diagnostics without renormalization. The comparison with the Sun shows that H{alpha} profiles from 1D+LTE models underestimate the solar T_eff_ by 28K. We find the same agreement between Halpha and interferometry and between Halpha and Infrared Flux Method: a shallow dependency on metallicity according to the relation T_eff_=T_eff_^Halpha^-159[Fe/H]+28K within the metallicity range -0.70 to +0.40dex. The comparison with the Infrared Flux Method shows a scatter of 59K dominated by photometric errors (52K). In order to investigate the origin of this dependency, we analysed spectra from 3D models and found that they produce hotter temperatures, and that their use largely improves the agreement with the interferometric and Infrared Flux Method measurements. Finally, we find HARPS spectra to be fully suitable for Halpha profile temperature diagnostics; they are perfectly compatible with the coude spectra, and lead to the same T_eff_ for the Sun as that found when analysing HARPS spectra over a timespan of more than 7 years.
o
Data from: R package: SRS-Scaling with Ranked Subsampling
openagrar.de
Updated Mar 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lukas Beule; Vitor Heidrich; Petr Karlovsky (2022). R package: SRS-Scaling with Ranked Subsampling [Dataset]. https://www.openagrar.de/receive/openagrar_mods_00078523
Explore at:
Dataset updated
Mar 27, 2022
Dataset provided by
Julius Kühn-Institute (JKI), Federal Research Centre for Cultivated Plants, Institute for Ecological Chemistry, Plant Analysis and Stored Product Production, Berlin, Germany
Authors
Lukas Beule; Vitor Heidrich; Petr Karlovsky
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Analysis of species count data in ecology often requires normalization to an identical sample size. Rarefying (random subsampling without replacement), which is a popular method for normalization, has been widely criticized for its poor reproducibility and potential distortion of the community structure. In the context of microbiome count data, researchers explicitly advised against the use of rarefying. An alternative to rarefying is scaling with ranked subsampling (SRS). SRS consists of two steps. In the first step, the total counts for all OTUs (operational taxonomic units) or species in each sample are divided by a scaling factor chosen in such a way that the sum of the scaled counts Cscaled equals Cmin. In the second step, the non-integer Cscaled values are converted into integers by an algorithm that we dub ranked subsampling. The Cscaled value for each OTU or species is split into the integer part Cint (Cint = floor(Cscaled)) and the fractional part Cfrac (Cfrac = Cscaled - Cints). Since the sum of Cint is smaller or equal to Cmin, additional delta C = Cmin - the sum of Cint counts have to be added to the library to reach the total count of Cmin. This is achieved as follows. OTUs are ranked in the descending order of their Cfrac values. Beginning with the OTU of the highest rank, single count per OTU is added to the normalized library until the total number of added counts reaches delta C and the sum of all counts in the normalized library equals Cmin. When the lowest Cfrag involved in picking delta C counts is shared by several OTUs, the OTUs used for adding a single count to the library are selected in the order of their Cint values. This selection minimizes the effect of normalization on the relative frequencies of OTUs. OTUs with identical Cfrag as well as Cint are sampled randomly without replacement. See Beule & Karlovsky (2020) doi:10.7717/peerj.9593
h
dagw-word-frequencies-normalized-by-domain
huggingface.co
Updated Jun 4, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Center for Humanities Computing Aarhus (2025). dagw-word-frequencies-normalized-by-domain [Dataset]. https://huggingface.co/datasets/chcaa/dagw-word-frequencies-normalized-by-domain
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 4, 2025
Dataset authored and provided by
Center for Humanities Computing Aarhus
Description
Dataset Card for DAGW Word Frequencies (normalized)

Paper: Derczynski, L., Ciosici, M. R., Baglini, R., Christiansen, M. H., Dalsgaard, J. A., Fusaroli, R., ... & Varab, D. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 413-421). Point of Contact: Kenneth Enevoldsen (Kennethcenevoldsen (at) gmail (dot) com )

This is a list of word frequencies derived from the Danish Gigaword (collected before… See the full description on the dataset page: https://huggingface.co/datasets/chcaa/dagw-word-frequencies-normalized-by-domain.
N
Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust...
data.niaid.nih.gov
Updated May 15, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bacher R; Chu L; Kendziorski C; Swanson S (2019). Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust normalization of single-cell rna-seq data [Dataset]. https://data.niaid.nih.gov/resources?id=gse85917
Explore at:
Dataset updated
May 15, 2019
Dataset provided by
University of Florida
Authors
Bacher R; Chu L; Kendziorski C; Swanson S
Description
Normalization of RNA-sequencing data is essential for accurate downstream inference, but the assumptions upon which most methods are based do not hold in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of scRNA-seq data. Total 183 single cells (92 H1 cells, 91 H9 cells), sequenced twice, were used to evaluate SCnorm in normalizing single cell RNA-seq experiments. Total 48 bulk H1 samples were used to compare bulk and single cell properties. For single-cell RNA-seq, the identical single-cell indexed and fragmented cDNA were pooled at 96 cells per lane or at 24 cells per lane to test the effects of sequencing depth, resulting in approximately 1 million and 4 million mapped reads per cell in the two pooling groups, respectively.
d
Water-quality trends and trend component estimates for the Nation's rivers...
catalog.data.gov
data.usgs.gov
+1more
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Water-quality trends and trend component estimates for the Nation's rivers and streams using Weighted Regressions on Time, Discharge, and Season (WRTDS) models and generalized flow normalization, 1972-2012 [Dataset]. https://catalog.data.gov/dataset/water-quality-trends-and-trend-component-estimates-for-the-nations-rivers-and-streams-1972
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Nonstationary streamflow due to environmental and human-induced causes can affect water quality over time, yet these effects are poorly accounted for in water-quality trend models. This data release provides instream water-quality trends and estimates of two components of change, for sites across the Nation previously presented in Oelsner et al. (2017). We used previously calibrated Weighted Regressions on Time, Discharge, and Season (WRTDS) models published in De Cicco et al. (2017) to estimate instream water-quality trends and associated uncertainties with the generalized flow normalization procedure available in EGRET version 3.0 (Hirsch et al., 2018a) and EGRETci version 2.0 (Hirsch et al., 2018b). The procedure allows for nonstationarity in the flow regime, whereas previous versions of EGRET assumed streamflow stationarity. Water-quality trends of annual mean concentrations and loads (also referred to as fluxes) are provided as an annual series and the change between the start and end year for four trend periods (1972-2012, 1982-2012, 1992-2012, and 2002-2012). Information about the sites, including the collecting agency and associated streamflow gage, and information about site selection and the data screening process can be found in Oelsner et al. (2017). This data release includes results for 19 water-quality parameters including nutrients (ammonia, nitrate, filtered and unfiltered orthophosphate, total nitrogen, total phosphorus), major ions (calcium, chloride, magnesium, potassium, sodium, sulfate), salinity indicators (specific conductance, total dissolved solids), carbon (alkalinity, dissolved organic carbon, total organic carbon), and sediment (total suspended solids, suspended-sediment concentration) at over 1,200 sites. Note, the number of parameters with data varies by site with most sites having data for 1-4 parameters. Each water-quality trend was parsed into two components of change: (1) the streamflow trend component (QTC) and (2) the watershed management trend component (MTC). The QTC is an indicator of the amount of change in the water-quality trend attributed to changes in the streamflow regime, and the MTC is an indicator of the amount of change in the water-quality trend that may be attributed to human actions and changes in point and non-point sources in a watershed. Note, the MTC is referred to as the concentration-discharge trend component (CQTC) in the EGRET version 3.0 software. For our work, we chose to refer to this trend component as the MTC because it provides a more conceptual description (Murphy and Sprague, 2019). The trend results presented here expand upon the results in De Cicco et al. (2017) and Oelsner et al. (2017), which were analyzed using flow-normalization under the stationary streamflow assumption. The results presented in this data release are intended to complement these previously published results and support investigations into natural and human effects on water-quality trends across the United States. Data preparation information and WRTDS model specifications are described in Oelsner et al. (2017) and Murphy and Sprague (2019). This work was completed as part of the National Water-Quality Assessment (NAWQA) project of the National Water-Quality Program. De Cicco, L.A., Sprague, L.A., Murphy, J.C., Riskin, M.L., Falcone, J.A., Stets, E.G., Oelsner, G.P., and Johnson, H.M., 2017, Water-quality and streamflow datasets used in the Weighted Regressions on Time, Discharge, and Season (WRTDS) models to determine trends in the Nation’s rivers and streams, 1972-2012 (ver. 1.1 July 7, 2017): U.S. Geological Survey data release, https://doi.org/10.5066/F7KW5D4H. Hirsch, R., De Cicco, L., Watkins, D., Carr, L., and Murphy, J., 2018a, EGRET: Exploration and Graphics for RivEr Trends, version 3.0, https://CRAN.R-project.org/package=EGRET. Hirsch, R., De Cicco, L., and Murphy, J., 2018b, EGRETci: Exploration and Graphics for RivEr Trends (EGRET) Confidence Intervals, version 2.0. https://CRAN.R-project.org/package=EGRETci. Murphy, J.C., and Sprague, L.A., 2019, Water-quality trends in US rivers: Exploring effects from streamflow trends and changes in watershed management: The Science of the total environment, ISSN: 1879-1026, Vol: 656, Page: 645-658, https://doi.org/10.1016/j.scitotenv.2018.11.255. Oelsner, G.P., Sprague, L.A., Murphy, J.C., Zuellig, R.E., Johnson, H.M., Ryberg, K.R., Falcone, J.A., Stets, E.G., Vecchia, A.V., Riskin, M.L., De Cicco, L.A., Mills, T.J., and Farmer, W.H., 2017, Water-quality trends in the Nation’s rivers and streams, 1972–2012—Data preparation, statistical methods, and trend results (ver. 2.0, October 2017): U.S. Geological Survey Scientific Investigations Report 2017–5006, 136 p., https://doi.org/10.3133/sir20175006.

Facebook

Twitter

Click to copy link

Link copied

Cite

Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao (2023). Data_Sheet_2_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.zip [Dataset]. http://doi.org/10.3389/fgene.2019.00400.s002

Data_Sheet_2_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.zip

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.3389/fgene.2019.00400.s002

Dataset updated

Jun 1, 2023

Dataset provided by

Frontiers

Authors

Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Data normalization is a crucial step in the gene expression analysis as it ensures the validity of its downstream analyses. Although many metrics have been designed to evaluate the existing normalization methods, different metrics or different datasets by the same metric yield inconsistent results, particularly for the single-cell RNA sequencing (scRNA-seq) data. The worst situations could be that one method evaluated as the best by one metric is evaluated as the poorest by another metric, or one method evaluated as the best using one dataset is evaluated as the poorest using another dataset. Here raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose a principle that one normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics) and one method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). Then, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric mSCC to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings paved the way to guide future studies in the normalization of gene expression data with its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to select the best method for the normalization of their gene expression data based on the evaluation of different methods (particularly some data-driven methods or their own methods) in the principle of the consistency of metrics and the consistency of datasets.

Clear search

Close search

Google apps

Main menu

Data_Sheet_2_NormExpression: An R Package to Normalize Gene Expression Data...

Methods for normalizing microbiome data: an ecological perspective

Data from: A systematic evaluation of normalization methods and probe...

GC/MS Simulated Data Sets normalized using median scaling

R script to reproduce \"Improved normalization of species count data in...

Additional file 4: of DBNorm: normalizing high-density oligonucleotide...

GC/MS Simulated Data Sets normalized using quantile normalization

Naturalistic Neuroimaging Database

Overview

v2.0 Changes

Data from: Normalized Difference Vegetation Index for Fanno Creek, Oregon

Supplemental materials to the conference paper "validating 111.1 million...

AI4PROFHEALTH - Automatic Silver Gazetteer for Named Entity Recognition and...

Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...

Processed data - DegNorm: Normalization of generalized transcript...

The sensitivity of transcriptomics BMD modeling to the methods used for...

Additional file 3: of DBNorm: normalizing high-density oligonucleotide...

Normalized Halpha line profiles of FGK stars - Dataset - B2FIND

Data from: R package: SRS-Scaling with Ranked Subsampling

dagw-word-frequencies-normalized-by-domain

Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust...

Water-quality trends and trend component estimates for the Nation's rivers...

Data_Sheet_2_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.zipSee More Versions

Data_Sheet_2_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.zip