24 datasets found

q
REMNet Tutorial, R Part 5: Normalizing Microbiome Data in R 5.2.19
qubeshub.org
Updated Aug 28, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jessica Joyner (2019). REMNet Tutorial, R Part 5: Normalizing Microbiome Data in R 5.2.19 [Dataset]. http://doi.org/10.25334/M13H-XT81
Explore at:
Unique identifier
https://doi.org/10.25334/M13H-XT81
Dataset updated
Aug 28, 2019
Dataset provided by
QUBES
Authors
Jessica Joyner
Description
Video on normalizing microbiome data from the Research Experiences in Microbiomes Network
n
Data from: A systematic evaluation of normalization methods and probe...
data.niaid.nih.gov
dataone.org
+2more
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.cnp5hqc7v
Dataset updated
May 30, 2023
Dataset provided by
Universidade de São Paulo
Hospital for Sick Children
University of Toronto
Authors
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2). Methods

Study Participants and Samples

The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.

All individuals enrolled in the SABE cohort provided written consent, and the ethic protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

Blood Collection and Processing

Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

Characterization of DNA Methylation using the EPIC array

Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

Processing and Analysis of DNA Methylation Data

The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01 and (4) removed probes if more than 5% of the samples having a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using out-of-band probes empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

Normalization Methods Evaluated

The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.
Additional file 3: of DBNorm: normalizing high-density oligonucleotide...
springernature.figshare.com
txt
Updated Nov 30, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy (2017). Additional file 3: of DBNorm: normalizing high-density oligonucleotide microarray data based on distributions [Dataset]. http://doi.org/10.6084/m9.figshare.5648932.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5648932.v1
Dataset updated
Nov 30, 2017
Dataset provided by
Figsharehttp://figshare.com/
Authors
Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
DBNorm test script. Code of how we test DBNorm package. (TXT 2Â kb)
Additional file 4: of DBNorm: normalizing high-density oligonucleotide...
springernature.figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy (2023). Additional file 4: of DBNorm: normalizing high-density oligonucleotide microarray data based on distributions [Dataset]. http://doi.org/10.6084/m9.figshare.5648956.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5648956.v1
Dataset updated
May 30, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Qinxue Meng; Daniel Catchpoole; David Skillicorn; Paul Kennedy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
DBNorm installation. Describes how to install DBNorm via devtools in R. (TXT 4Â kb)
d
Methods for normalizing microbiome data: an ecological perspective
datadryad.org
data.niaid.nih.gov
zip
Updated Oct 30, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.tn8qs35
Dataset updated
Oct 30, 2018
Dataset provided by
Dryad
Authors
Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
Time period covered
Oct 19, 2018
Description
Simulation script 1This R script will simulate two populations of microbiome samples and compare normalization methods.Simulation script 2This R script will simulate two populations of microbiome samples and compare normalization methods via PcOAs.Sample.OTU.distributionOTU distribution used in the paper: Methods for normalizing microbiome data: an ecological perspective
Dataset supporting: Normalizing and denoising protein expression data from...
nih.figshare.com
figshare.com
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matthew P. Mulé; Andrew J. Martins; John Tsang (2023). Dataset supporting: Normalizing and denoising protein expression data from droplet-based single cell profiling [Dataset]. http://doi.org/10.35092/yhjc.13370915.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.35092/yhjc.13370915.v2
Dataset updated
May 30, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Matthew P. Mulé; Andrew J. Martins; John Tsang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data for reproducing analysis in the manuscript:Normalizing and denoising protein expression data from droplet-based single cell profilinglink to manuscript: https://www.biorxiv.org/content/10.1101/2020.02.24.963603v1

Data deposited here are for the purposes of reproducing the analysis results and figures reported in the manuscript above. These data are all publicly available downloaded and converted to R datasets prior to Dec 4, 2020. For a full description of all the data included in this repository and instructions for reproducing all analysis results and figures, please see the repository: https://github.com/niaid/dsb_manuscript.

For usage of the dsb R package for normalizing CITE-seq data please see the repository: https://github.com/niaid/dsb

If you use the dsb R package in your work please cite:Mulè MP, Martins AJ, Tsang JS. Normalizing and denoising protein expression data from droplet-based single cell profiling. bioRxiv. 2020;2020.02.24.963603.

General contact: John Tsang (john.tsang AT nih.gov)

Questions about software/code: Matt Mulè (mulemp AT nih.gov)
Naturalistic Neuroimaging Database
openneuro.org
Updated Apr 20, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds002837.v1.1.3
Dataset updated
Apr 20, 2021
Dataset provided by
OpenNeurohttps://openneuro.org/
Authors
Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Overview

The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI).The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, with no hearing impairments, unimpaired or corrected vision and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10 minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.

The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

v2.0 Changes

Overview

We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.

Normalization

Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:

# Generate a resting state (rs) timeseries (ts) # Install / load package to make fake fMRI ts # install.packages("neuRosim") library(neuRosim) # Generate a ts ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1) # 3dDetrend -normalize # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1" # Do for the full timeseries ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2)); # Do this again for a shorter version of the same timeseries ts.shorter.length <- length(ts.normalised.long)/4 ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2)); # By looking at the summaries, it can be seen that the median values become larger summary(ts.normalised.long) summary(ts.normalised.short) # Plot results for the long and short ts # Truncate the longer ts for plotting only ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length] # Give the plot a title title <- "3dDetrend -normalize for long (blue) and short (red) timeseries"; plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short))); # Add zero line lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey'); # 3dDetrend -normalize -polort 0 for long timeseries lines(ts.normalised.long.made.shorter, col='blue'); # 3dDetrend -normalize -polort 0 for short timeseries lines(ts.normalised.short, col='red');

Standardization/modernization

The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewers steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analyses practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.

New afni_proc.py command line

The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html): afni_proc.py \ -subj_id "$sub_id_name_1" \ -blocks despike tshift align tlrc volreg mask blur scale regress \ -radial_correlate_blocks tcat volreg \ -copy_anat anatomical_warped/anatSS.1.nii.gz \ -anat_has_skull no \ -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \ -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \ -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \ -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \ -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \ -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \ -anat_follower_erode fsvent fswm \ -dsets media_?.nii.gz \ -tcat_remove_first_trs 8 \ -tshift_opts_ts -tpattern alt+z2 \ -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \ -tlrc_base "$basedset" \ -tlrc_NL_warp \ -tlrc_NL_warped_dsets \ anatomical_warped/anatQQ.1.nii.gz \ anatomical_warped/anatQQ.1.aff12.1D \ anatomical_warped/anatQQ.1_WARP.nii.gz \ -volreg_align_to MIN_OUTLIER \ -volreg_post_vr_allin yes \ -volreg_pvra_base_index MIN_OUTLIER \ -volreg_align_e2a \ -volreg_tlrc_warp \ -mask_opts_automask -clfrac 0.10 \ -mask_epi_anat yes \ -blur_to_fwhm -blur_size $blur \ -regress_motion_per_run \ -regress_ROI_PC fsvent 3 \ -regress_ROI_PC_per_run fsvent \ -regress_make_corr_vols aeseg fsvent \ -regress_anaticor_fast \ -regress_anaticor_label fswm \ -regress_censor_motion 0.3 \ -regress_censor_outliers 0.1 \ -regress_apply_mot_types demean deriv \ -regress_est_blur_epits \ -regress_est_blur_errts \ -regress_run_clustsim no \ -regress_polort 2 \ -regress_bandpass 0.01 1 \ -html_review_style pythonic We used similar command lines to generate ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes but this number can be variable (thus leading to the above issue with 3dDetrend’s -normalise). A discussion on the AFNI message board with one of our team (starting here, https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256), led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.

Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words: * Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice). * Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere). * For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data). * For censored data: * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation. * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data. In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.

Effect on results

From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
d
R script to reproduce \"Improved normalization of species count data in...
search.dataone.org
Updated Mar 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BonaRes Repository (2025). R script to reproduce \"Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities\".@en [Dataset]. https://search.dataone.org/view/sha256%3Aa934b23425b0e7e7d9d4278f89745fc842e75fdfe8b47de25c797034dadc1f51
Explore at:
Dataset updated
Mar 21, 2025
Dataset provided by
BonaRes Repository
Area covered

Description
R script to reproduce "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities"..
New size-normalised weight (SNW) data
doi.pangaea.de
resodate.org
html, tsv
Updated Jan 13, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruby Barrett (2025). New size-normalised weight (SNW) data [Dataset]. http://doi.org/10.1594/PANGAEA.973571
Explore at:
tsv, htmlAvailable download formats
Unique identifier
https://doi.org/10.1594/PANGAEA.973571
Dataset updated
Jan 13, 2025
Dataset provided by
PANGAEA
Authors
Ruby Barrett
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Sep 1, 1984 - Oct 19, 1999
Area covered

Variables measured
LATITUDE, Data type, ELEVATION, LONGITUDE, Sample ID, Event label, Size fraction, Sieve-based weight, Number of specimens, DEPTH, sediment/rock, and 5 more
Description
This table includes the new SNW data produced for this manuscript. The foraminiferal weight data is normalized using the measurement-based weight (MBW) method of Barker (2002). SNW measurements were collected from Atlantic core-tops and sediment cores for G. truncatulinoides, G. ruber, O. universa, N. pachyderma, N. incompta and G. bulloides.
N
Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust...
data.niaid.nih.gov
Updated May 15, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bacher R; Chu L; Kendziorski C; Swanson S (2019). Single cell RNA-seq data of human hESCs to evaluate SCnorm: robust normalization of single-cell rna-seq data [Dataset]. https://data.niaid.nih.gov/resources?id=gse85917
Explore at:
Dataset updated
May 15, 2019
Dataset provided by
University of Florida
Authors
Bacher R; Chu L; Kendziorski C; Swanson S
Description
Normalization of RNA-sequencing data is essential for accurate downstream inference, but the assumptions upon which most methods are based do not hold in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of scRNA-seq data. Total 183 single cells (92 H1 cells, 91 H9 cells), sequenced twice, were used to evaluate SCnorm in normalizing single cell RNA-seq experiments. Total 48 bulk H1 samples were used to compare bulk and single cell properties. For single-cell RNA-seq, the identical single-cell indexed and fragmented cDNA were pooled at 96 cells per lane or at 24 cells per lane to test the effects of sequencing depth, resulting in approximately 1 million and 4 million mapped reads per cell in the two pooling groups, respectively.
d
(high-temp) No 8. Metadata Analysis (16S rRNA/ITS) Output
search.dataone.org
smithsonian.figshare.com
+1more
Updated Aug 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jarrod Scott (2024). (high-temp) No 8. Metadata Analysis (16S rRNA/ITS) Output [Dataset]. https://search.dataone.org/view/urn%3Auuid%3A718e0794-b5ff-4919-95ef-4a90a7890a5b
Explore at:
Dataset updated
Aug 15, 2024
Dataset provided by
Smithsonian Research Data Repository
Authors
Jarrod Scott
Description
Output files from the 8. Metadata Analysis Workflow page of the SWELTR high-temp study. In this workflow, we compared environmental metadata with microbial communities. The workflow is split into two parts.

metadata_ssu18_wf.rdata : Part 1 contains all variables and objects for the 16S rRNA analysis. To see the Objects, in R run _load("metadata_ssu18_wf.rdata", verbose=TRUE)_

metadata_its18_wf.rdata : Part 2 contains all variables and objects for the ITS analysis. To see the Objects, in R run _load("metadata_its18_wf.rdata", verbose=TRUE)_
Additional files:

In both workflows, we run the following steps:

1) Metadata Normality Tests: Shapiro-Wilk Normality Test to test whether each matadata parameter is normally distributed.
2) Normalize Parameters: R package bestNormalize to find and execute the best normalizing transformation.
3) Split Metadata parameters into groups: a) Environmental and edaphic properties, b) Microbial functional responses, and c) Temperature adaptation properties.
4) Autocorrelation Tests: Test all possible pair-wise comparisons, on both normalized and non-normalized data sets, for each group.
5) Remove autocorrelated parameters from each group.
6) Dissimilarity Correlation Tests: Use Mantel Tests to see if any on the metadata groups are significantly correlated with the community data.
7) Best Subset of Variables: Determine which of the metadata parameters from each group are the most strongly correlated with the community data. For this we use the bioenv function from the vegan package.
8) Distance-based Redundancy Analysis: Ordination analysis of samples and metadata vector overlays using capscale, also from the vegan package.

Source code for the workflow can be found here:
https://github.com/sweltr/high-temp/blob/master/metadata.Rmd
f
Table_1_Comparison of Normalization Methods for Analysis of TempO-Seq...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Jun 23, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bushel, Pierre R.; Ramaiahgari, Sreenivasa C.; Auerbach, Scott S.; Paules, Richard S.; Ferguson, Stephen S. (2020). Table_1_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.XLSX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000579045
Explore at:
Dataset updated
Jun 23, 2020
Authors
Bushel, Pierre R.; Ramaiahgari, Sreenivasa C.; Auerbach, Scott S.; Paules, Richard S.; Ferguson, Stephen S.
Description
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven-fold change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well as did simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
d
Data from: Quantitative proteomics reveals rapid divergence in the...
datadryad.org
data.niaid.nih.gov
+1more
zip
Updated Jun 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erin McCullough; Caitlin McDonough; Scott Pitnick; Steve Dorus (2020). Quantitative proteomics reveals rapid divergence in the postmating response of female reproductive tracts among sibling species [Dataset]. http://doi.org/10.5061/dryad.8cz8w9gm8
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.8cz8w9gm8
Dataset updated
Jun 24, 2020
Dataset provided by
Dryad
Authors
Erin McCullough; Caitlin McDonough; Scott Pitnick; Steve Dorus
Time period covered
Jun 24, 2020
Description
Fertility depends, in part, on interactions between male and female reproductive proteins inside the female reproductive tract (FRT) that mediate postmating changes in female behavior, morphology, and physiology. Coevolution between interacting proteins within species may drive reproductive incompatibilities between species, yet the mechanisms underlying postmating-prezygotic isolating barriers remain poorly resolved. Here, we used quantitative proteomics in sibling Drosophila species to investigate the molecular composition of the FRT environment and its role in mediating species-specific postmating responses. We found that (1) FRT proteomes in D. simulans and D. mauritiana virgin females express unique combinations of secreted proteins and are enriched for distinct functional categories, (2) mating induces substantial changes to the FRT proteome in D. mauritiana but not in D. simulans, and (3) the D. simulans FRT pr...
Group level size-normalised weight data
doi.pangaea.de
html, tsv
Updated Jan 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruby Barrett (2025). Group level size-normalised weight data [Dataset]. http://doi.org/10.1594/PANGAEA.973592
Explore at:
html, tsvAvailable download formats
Unique identifier
https://doi.org/10.1594/PANGAEA.973592
Dataset updated
Jan 13, 2025
Dataset provided by
PANGAEA
Authors
Ruby Barrett
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Variables measured
Basin, Ecogroup, LATITUDE, Salinity, Author(s), Data type, ELEVATION, LONGITUDE, Phosphate, Sample ID, and 9 more
Description
This dataset contains a compilation of published and new SNW data with corresponding sea surface (≤ 20 m) environmental data extracted from CMIP6 that are used in the group level Bayesian regression modelling.
Species level size-normalised weight data for at depth analysis
doi.pangaea.de
html, tsv
Updated Jan 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruby Barrett (2025). Species level size-normalised weight data for at depth analysis [Dataset]. http://doi.org/10.1594/PANGAEA.973594
Explore at:
tsv, htmlAvailable download formats
Unique identifier
https://doi.org/10.1594/PANGAEA.973594
Dataset updated
Jan 13, 2025
Dataset provided by
PANGAEA
Authors
Ruby Barrett
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Variables measured
Basin, Ecogroup, LATITUDE, Salinity, Author(s), Data type, ELEVATION, LONGITUDE, Phosphate, Sample ID, and 9 more
Description
This dataset contains a compilation of published and new SNW data with corresponding environmental data extracted from CMIP6 that are used in the at depth species level Bayesian regression modelling. Environmental data for G. truncatulinoides comes from 200m depth, all other environmental data is from the sea surface (≤ 20 m).
Data from: Normalisation of Early Modern Science: Digitized Corpus of 17th-...
zenodo.org
recerca.uoc.edu
+1more
Updated Sep 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrea Sangiacomo; Andrea Sangiacomo; Raluca Tanasescu; Raluca Tanasescu; Silvia Donker; Silvia Donker; Hugo D. Hogenbirk; Hugo D. Hogenbirk (2023). Normalisation of Early Modern Science: Digitized Corpus of 17th- and 18th-Century Sources [Dataset]. http://doi.org/10.5281/zenodo.8351598
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.8351598
Dataset updated
Sep 16, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrea Sangiacomo; Andrea Sangiacomo; Raluca Tanasescu; Raluca Tanasescu; Silvia Donker; Silvia Donker; Hugo D. Hogenbirk; Hugo D. Hogenbirk
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a digitized corpus of early modern natural philosophy works that underlie the European Research Commission-funded Starting Grant “The Normalisation of Natural Philosophy: How Teaching Practices Shaped the Evolution of Early Modern Science,” (grant agreement No. 801653 NaturalPhilosophy), led by Dr. Andrea Sangiacomo at the Faculty of Philosophy at the University of Groningen.

The methodology we used for the digitization of the present dataset is described in the paper:

A. Sangiacomo, H. Hogenbirk, R. Tanasescu, A. Karaisl, N White. 2022. “Reading in the Mist: High-Quality Optical Character Recognition Based on Early Modern Digitized Books.” Digital Scholarship in the Humanities. https://bit.ly/3vwvwKI

The inventory of the present dataset is available at DOI: 10.5281/zenodo.5566681

The methodology behind the retrieval, cleaning, and annotation of the above inventory is described in the paper:

Sangiacomo, Andrea; Tanasescu, Raluca; Donker, Silvia; Hogenbirk, Hugo. 2021. “Mapping the Evolution of Early Modern Natural Philosophy: Corpus Collection and Authority Acknowledgement,” published in the Annals of Science (DOI: 10.1080/00033790.2021.1992502; permanent link: https://doi.org/10.1080/00033790.2021.1992502

The dictionaries from which we selected the data in worksheets 2-5 in the inventory are the following:

Wiep van Bunge, Henri Krop, Bart Leeuwenburgh, Paul Schuurman, Han van Ruler and Michiel Wielema, Dictionary of Seventeenth- and Eighteenth-Century Dutch Philosophers (London: Bloomsbury, 2003);

John Yolton, Valdimir Price and John Stephens. Dictionary of Eighteenth-Century British Philosophers (London: Bloomsbury, 1999);

Andrew Pyle. Dictionary of Seventeenth-Century British Philosophers (London: Bloomsbury, 2000);

Luc Foisneau. Dictionary of Seventeenth-Century French Philosophers (London: Bloomsbury, 2008);

Heiner F. Klemme and Manfred Kuehn. Dictionary of Eighteenth-Century Philosophers (London: Bloomsbury, 2011).

University of Groningen Team:

Andrea Sangiacomo (principal investigator)

Raluca Tanasescu (postdoctoral researcher)

Silvia Donker and Hugo Hogenbirk (PhD students)

Cristian A. Marocico (scientific programmer, Center for Information Technology)

Wim Breakman (bibliographer, University of Groningen Library)
R
TCGA case study for ASTERICS
entrepot.recherche.data.gouv.fr
csv +4
Updated Sep 26, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nathalie Vialaneix; Nathalie Vialaneix (2022). TCGA case study for ASTERICS [Dataset]. http://doi.org/10.15454/YNMQUY
Explore at:
text/x-r-source(1088), csv(2148636), type/x-r-syntax(864), csv(1003176), csv(2752164), csv(1003170), csv(33405040), csv(812120), txt(8901), text/comma-separated-values(808595)Available download formats
Unique identifier
https://doi.org/10.15454/YNMQUY
Dataset updated
Sep 26, 2022
Dataset provided by
Recherche Data Gouv
Authors
Nathalie Vialaneix; Nathalie Vialaneix
License
https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.15454/YNMQUYhttps://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.15454/YNMQUY
Time period covered
Sep 15, 2020 - Aug 18, 2021
Dataset funded by
Région Occitanie
Description
This dataset is issued from the public repository TCGA (https://portal.gdc.cancer.gov/) and contain several files, each corresponding to a given omic on the same individuals with breast cancer. Raw data have been obtained from the mixOmics case study described in http://mixomics.org/mixdiablo/case-study-tcga/ [link accessed on August 18, 2021] and were made available by the package authors at http://mixomics.org/wp-content/uploads/2016/08/TCGA.normalised.mixDIABLO.RData_.zip (R data format). Data in the zip file had been normalised for technical biases by the package authors. Data from the train and test sets were exported as TXT/CSV files and completed with miRNA expression on the smae individuals and toy datasets to handle missing value cases and alike. They serve as a basis for the illustration of the web data analysis tool ASTERICS (Project 20008788 funded by Région Occitanie).
d
Microbial counts, Picophytoplankton from the R/V Melville IronEx II cruise...
search.dataone.org
bco-dmo.org
Updated Dec 5, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ken Johnson; Kenneth H. Coale; Evelyn Armstrong (2021). Microbial counts, Picophytoplankton from the R/V Melville IronEx II cruise in the Equatorial Pacific Ocean in 1995 (IronEx II project) [Dataset]. https://search.dataone.org/view/sha256%3A5bab1e8a40607041681bb503a8d84c670843134f482b384459a44ea087ee6658
Explore at:
Dataset updated
Dec 5, 2021
Dataset provided by
Biological and Chemical Oceanography Data Management Office (BCO-DMO)
Authors
Ken Johnson; Kenneth H. Coale; Evelyn Armstrong
Description
Microbial Counts - Picophytoplankton

Data were normalized with the following values:

# values used for normalizing from "out" by group # group fals(rel) redFL(rel) FL/fals ratio # group1 0.09 0.62 7.19 # group2 0.92 0.61 6.84 #
d
Microbial counts, Eukaryotes from the R/V Melville IronEx II cruise in the...
search.dataone.org
Updated Dec 5, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kenneth H. Coale; Ken Johnson; Evelyn Armstrong (2021). Microbial counts, Eukaryotes from the R/V Melville IronEx II cruise in the Equatorial Pacific Ocean in 1995 (IronEx II project) [Dataset]. https://search.dataone.org/view/sha256%3A0a421420bf7715f2ca68243b90eae214c895d22c62e5bd9240695e49ce3dbf0e
Explore at:
Dataset updated
Dec 5, 2021
Dataset provided by
Biological and Chemical Oceanography Data Management Office (BCO-DMO)
Authors
Kenneth H. Coale; Ken Johnson; Evelyn Armstrong
Description
Microbial counts - Eukaryote

Data were normalized with the following values:

# values used for normalizing from "out" by group # group fals(rel) redFL(rel) FL/fals ratio # group1 0.45 0.64 1.58 # group2 2.55 9.01 3.57 # group3 0.33 7.94 27.74 # group4 nd nd nd #
France Weekly Real Estate Listings 2022-2023
kaggle.com
zip
Updated Apr 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Artur Dragunov (2024). France Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/france-weekly-real-estate-listings-2022-2023
Explore at:
zip(2750497 bytes)Available download formats
Dataset updated
Apr 3, 2024
Authors
Artur Dragunov
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Area covered
France
Description
These Kaggle datasets provide downloaded real estate listings from the French real estate market, capturing data from a leading platform in France (Seloger), reminiscent of the approach taken for the US dataset from Redfin and UK dataset from Zoopla. It encompasses detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named as France_clean_unique.csv.

The cleaning process mirrored that of the US dataset, involving removing irrelevant features, normalizing variable names for dataset consistency with USA and UK, and adjusting variable value ranges to get rid of extreme outliers. To augment the dataset's depth, external factors like inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on France's real estate market drivers.

For exact column descriptions, see columns for France_clean_unique.csv and my thesis.

Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.

If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

Let me know if you want to see how I got from raw data to France_clean_unique.csv. There are multiple steps, including cleaning Tableau Prep and R, downloading and merging external variables to the dataset, removing duplicates, and renaming some columns.