91 datasets found
  1. MIMIC-IV Lab Events Subset - Preprocessed for Data Normalization...

    • zenodo.org
    bin, text/x-python +1
    Updated Jan 13, 2025
    Cite
    Ali Azadi (2025). MIMIC-IV Lab Events Subset - Preprocessed for Data Normalization Analysis.xlsx [Dataset]. http://doi.org/10.5281/zenodo.14641824
    Explore at:
    Available download formats: txt, bin, text/x-python
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ali Azadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file contains a preprocessed subset of the MIMIC-IV dataset (Medical Information Mart for Intensive Care, Version IV), specifically focusing on laboratory event data related to glucose levels. It has been curated and processed for research on data normalization and integration within Clinical Decision Support Systems (CDSS) to improve Human-Computer Interaction (HCI) elements.

    The dataset includes the following key features:

    • Raw Lab Data: Original values of glucose levels as recorded in the clinical setting.
    • Normalized Data: Glucose levels transformed into a standardized range for comparison and analysis.
    • Demographic Information: Includes patient age and gender to support subgroup analyses.

    This data has been used to analyze the impact of normalization and integration techniques on improving data accuracy and usability in CDSS environments. The file is provided as part of ongoing research on enhancing clinical decision-making and user interaction in healthcare systems.

    Key Applications:

    • Research on the effects of data normalization on clinical outcomes.
    • Study of demographic variations in laboratory values to support personalized healthcare.
    • Exploration of data integration and its role in reducing cognitive load in CDSS.

    Data Source:

    The data originates from the publicly available MIMIC-IV database, developed and maintained by the Massachusetts Institute of Technology (MIT). Proper ethical guidelines for accessing and preprocessing the dataset have been followed.

    File Content:

    • Filename: MIMIC-IV_LabEvents_Subset_Normalization.xlsx
    • File Format: Microsoft Excel
    • Number of Rows: 100 samples for demonstration purposes.
    • Fields Included: Patient ID, Age, Gender, Raw Glucose Value, Normalized Glucose Value, and additional derived statistics.
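
    The listing does not state which normalization was applied to produce the Normalized Glucose Value column; below is a minimal sketch that assumes a simple min-max scaling and assumes the Excel column headers match the field names listed above (both are assumptions, not documented facts).

    ```
    import pandas as pd

    # Assumed column headers, mirroring the "Fields Included" list above.
    df = pd.read_excel("MIMIC-IV_LabEvents_Subset_Normalization.xlsx")

    # Min-max rescale the raw glucose values into [0, 1].
    lo, hi = df["Raw Glucose Value"].min(), df["Raw Glucose Value"].max()
    df["Glucose (min-max rescaled)"] = (df["Raw Glucose Value"] - lo) / (hi - lo)

    # Example subgroup view for the demographic fields mentioned in the description.
    print(df.groupby("Gender")["Glucose (min-max rescaled)"].describe())
    ```
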
  2. Hospital Management System

    • kaggle.com
    zip
    Updated Jun 9, 2025
    Cite
    Muhammad Shamoon Butt (2025). Hospital Management System [Dataset]. https://www.kaggle.com/mshamoonbutt/hospital-management-system
    Explore at:
    Available download formats: zip (1,049,391 bytes)
    Dataset updated
    Jun 9, 2025
    Authors
    Muhammad Shamoon Butt
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Hospital Management System project features a fully normalized relational database designed to manage hospital data including patients, doctors, appointments, diagnoses, medications, and billing. The schema applies database normalization (1NF, 2NF, 3NF) to reduce redundancy and maintain data integrity, providing an efficient, scalable structure for healthcare data management. Included are SQL scripts to create tables and insert sample data, making it a useful resource for learning practical database design and normalization in a healthcare context.
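
    As a minimal sketch of the kind of normalized (3NF) schema such a project describes, here is a small example using Python's sqlite3; the table and column names are illustrative and are not taken from the project's actual SQL scripts.

    ```
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Patients and doctors live in their own tables; appointments reference both by
    # foreign key instead of repeating names, which is the redundancy 2NF/3NF removes.
    cur.executescript("""
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        full_name  TEXT NOT NULL,
        birth_date TEXT
    );
    CREATE TABLE doctors (
        doctor_id INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL,
        specialty TEXT
    );
    CREATE TABLE appointments (
        appointment_id INTEGER PRIMARY KEY,
        patient_id   INTEGER NOT NULL REFERENCES patients(patient_id),
        doctor_id    INTEGER NOT NULL REFERENCES doctors(doctor_id),
        scheduled_at TEXT NOT NULL
    );
    """)

    cur.execute("INSERT INTO patients VALUES (1, 'Jane Doe', '1980-02-17')")
    cur.execute("INSERT INTO doctors VALUES (1, 'Dr. Smith', 'Cardiology')")
    cur.execute("INSERT INTO appointments VALUES (1, 1, 1, '2025-06-09 10:00')")

    # Reassemble the human-readable view with joins rather than duplicated columns.
    for row in cur.execute("""
        SELECT a.scheduled_at, p.full_name, d.full_name
        FROM appointments a
        JOIN patients p ON p.patient_id = a.patient_id
        JOIN doctors d  ON d.doctor_id  = a.doctor_id
    """):
        print(row)
    ```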

  3. File S1 - Normalization of RNA-Sequencing Data from Samples with Varying...

    • datasetcatalog.nlm.nih.gov
    Updated Feb 25, 2014
    Cite
    Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan (2014). File S1 - Normalization of RNA-Sequencing Data from Samples with Varying mRNA Levels [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001266682
    Explore at:
    Dataset updated
    Feb 25, 2014
    Authors
    Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan
    Description

    Table S1 and Figures S1–S6. Table S1. List of primers. Forward and reverse primers used for qPCR. Figure S1. Changes in total and polyA+ RNA during development. a) Amount of total RNA per embryo at different developmental stages. b) Amount of polyA+ RNA per 100 embryos at different developmental stages. Vertical bars represent standard errors. Figure S2. The TMM scaling factor. a) The TMM scaling factor estimated using dataset 1 and 2. We observe very similar values. b) The TMM scaling factor obtained using the replicates in dataset 2. The TMM values are very reproducible. c) The TMM scale factor when RNA-seq data based on total RNA was used. Figure S3. Comparison of scales. We either square-root transformed or used that scales directly and compared the normalized fold-changes to RT-qPCR results. a) Transcripts with dynamic change pre-ZGA. b) Transcripts with decreased abundance post-ZGA. c) Transcripts with increased expression post-ZGA. Vertical bars represent standard deviations. Figure S4. Comparison of RT-qPCR results depending on RNA template (total or poly+ RNA) and primers (random or oligo(dT) primers) for setd3 (a), gtf2e2 (b) and yy1a (c). The increase pre-ZGA is dependent on template (setd3 and gtf2e2) and not primer type. Figure S5. Efficiency calibrated fold-changes for a subset of transcripts. Vertical bars represent standard deviations. Figure S6. Comparison normalization methods using dataset 2 for transcripts with decreased expression post-ZGA (a) and increased expression post-ZGA (b). Vertical bars represent standard deviations. (PDF)

  4. Data_Sheet_1_SMIXnorm: Fast and Accurate RNA-Seq Data Normalization for...

    • frontiersin.figshare.com
    pdf
    Updated Jun 11, 2023
    Cite
    Shen Yin; Xiaowei Zhan; Bo Yao; Guanghua Xiao; Xinlei Wang; Yang Xie (2023). Data_Sheet_1_SMIXnorm: Fast and Accurate RNA-Seq Data Normalization for Formalin-Fixed Paraffin-Embedded Samples.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.650795.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Frontiers
    Authors
    Shen Yin; Xiaowei Zhan; Bo Yao; Guanghua Xiao; Xinlei Wang; Yang Xie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RNA-sequencing (RNA-seq) provides a comprehensive quantification of transcriptomic activities in biological samples. Formalin-Fixed Paraffin-Embedded (FFPE) samples are collected as part of routine clinical procedure, and are the most widely available biological sample format in medical research and patient care. Normalization is an essential step in RNA-seq data analysis. A number of normalization methods, though developed for RNA-seq data from fresh frozen (FF) samples, can be used with FFPE samples as well. MIXnorm, the only extant normalization method specifically designed for FFPE RNA-seq data, has been shown to outperform these methods, but at the cost of a complex mixture model and a high computational burden. It is therefore important to adapt MIXnorm for simplicity and computational efficiency while maintaining superior performance. Furthermore, it is critical to develop an integrated tool that performs commonly used normalization methods for both FF and FFPE RNA-seq data. We developed a new normalization method for FFPE RNA-seq data, named SMIXnorm, based on a simplified two-component mixture model compared to MIXnorm to facilitate computation. The expression levels of expressed genes are modeled by normal distributions without truncation, and those of non-expressed genes are modeled by zero-inflated Poisson distributions. The maximum likelihood estimates of the model parameters are obtained by a nested Expectation-Maximization algorithm with a less complicated latent variable structure, and closed-form updates are available within each iteration. Real data applications and simulation studies show that SMIXnorm greatly reduces computing time compared to MIXnorm, without sacrificing performance. More importantly, we developed a web-based tool, RNA-seq Normalization (RSeqNorm), that offers a simple workflow to compute normalized RNA-seq data for both FFPE and FF samples. It includes SMIXnorm and MIXnorm for FFPE RNA-seq data, together with five commonly used normalization methods for FF RNA-seq data. Users can easily upload a raw RNA-seq count matrix and select one of the seven normalization methods to produce a downloadable normalized expression matrix for any downstream analysis. The R package is available at https://github.com/S-YIN/RSEQNORM. The web-based tool, RSeqNorm, is available at http://lce.biohpc.swmed.edu/rseqnorm with no restriction to use or redistribute.

  5. Data from: A systematic evaluation of normalization methods and probe...

    • nde-dev.biothings.io
    zip
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    Hospital for Sick Children
    University of Toronto
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which consists of applying the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-Estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the out-of-band probes' empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
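
    To make the replicate-pair metrics concrete, here is a minimal Python sketch (the study itself used R/Bioconductor) of the per-probe absolute beta-value difference and a one-way ICC across replicate pairs; the simulated beta values are invented.

    ```
    import numpy as np

    def abs_beta_difference(beta_rep1, beta_rep2):
        """Per-probe mean absolute beta difference across replicate pairs.

        Inputs are arrays of shape (n_probes, n_pairs) containing beta values.
        """
        return np.abs(beta_rep1 - beta_rep2).mean(axis=1)

    def icc_oneway(beta_rep1, beta_rep2):
        """One-way ICC(1,1) per probe for paired technical replicates (k = 2)."""
        x = np.stack([beta_rep1, beta_rep2], axis=2)        # (n_probes, n_pairs, 2)
        n, k = x.shape[1], 2
        pair_mean = x.mean(axis=2)                          # (n_probes, n_pairs)
        grand_mean = pair_mean.mean(axis=1, keepdims=True)
        msb = k * ((pair_mean - grand_mean) ** 2).sum(axis=1) / (n - 1)
        msw = ((x - pair_mean[..., None]) ** 2).sum(axis=(1, 2)) / (n * (k - 1))
        return (msb - msw) / (msb + (k - 1) * msw)

    # Toy example: 5 probes measured on 16 replicate pairs, beta values in [0, 1]
    rng = np.random.default_rng(1)
    truth = rng.uniform(0, 1, size=(5, 16))
    rep1 = np.clip(truth + rng.normal(0, 0.02, truth.shape), 0, 1)
    rep2 = np.clip(truth + rng.normal(0, 0.02, truth.shape), 0, 1)
    print(abs_beta_difference(rep1, rep2))
    print(icc_oneway(rep1, rep2))   # values near 1 indicate reproducible probes
    ```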

  6. Methods for normalizing microbiome data: an ecological perspective

    • nde-dev.biothings.io
    zip
    Updated Oct 30, 2018
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 30, 2018
    Dataset provided by
    James Cook University
    University of New England
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description
    1. Microbiome sequencing data often need to be normalized due to differences in read depths, and recommendations for microbiome analyses generally warn against using proportions or rarefying to normalize data, instead advocating alternatives such as upper quartile, CSS, edgeR-TMM, or DESeq-VS. Those recommendations are, however, based on studies that focused on differential abundance testing and variance standardization rather than community-level comparisons (i.e., beta diversity). Also, standardizing the within-sample variance across samples may suppress differences in species evenness, potentially distorting community-level patterns. Furthermore, the recommended methods use log transformations, which we expect to exaggerate the importance of differences among rare OTUs while suppressing the importance of differences among common OTUs.
    2. We tested these theoretical predictions via simulations and a real-world data set.
    3. Proportions and rarefying produced more accurate comparisons among communities and were the only methods that fully normalized read depths across samples. Additionally, upper quartile, CSS, edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed.
    4. Based on our simulations, normalizing via proportions may be superior to other commonly used methods for comparing ecological communities.
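
    As an illustration of the two approaches the authors favour for community-level comparisons, here is a minimal sketch of total-sum scaling (proportions) and rarefying (random subsampling without replacement to an even depth) on an invented OTU table.

    ```
    import numpy as np

    rng = np.random.default_rng(42)

    def to_proportions(counts):
        """Total-sum scaling: divide each sample's OTU counts by its read depth."""
        counts = np.asarray(counts, dtype=float)
        return counts / counts.sum(axis=1, keepdims=True)

    def rarefy(counts, depth, rng=rng):
        """Subsample each sample without replacement down to a common read depth."""
        counts = np.asarray(counts, dtype=int)
        out = np.zeros_like(counts)
        for i, row in enumerate(counts):
            reads = np.repeat(np.arange(row.size), row)      # one entry per read, labelled by OTU
            keep = rng.choice(reads, size=depth, replace=False)
            out[i] = np.bincount(keep, minlength=row.size)
        return out

    # Toy OTU table: 3 samples x 4 OTUs with unequal read depths
    otu = np.array([[120, 30, 40, 10],
                    [600, 150, 200, 50],
                    [ 60, 15, 20,  5]])
    print(to_proportions(otu))
    print(rarefy(otu, depth=100))
    ```
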
  7. GSE58095 Data Normalization Subtype Analysis R

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE58095 Data Normalization Subtype Analysis R [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse58095-data-normalization-subtype-analysis-r
    Explore at:
    Available download formats: zip (26,134,446 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains processed and normalized gene expression data from the public GEO series GSE58095.

    The dataset is prepared to support downstream analyses such as subtype classification, differential expression, and exploratory visualization.

    The content includes R scripts and processed matrices that guide users through normalization, quality control, and biological interpretation steps.

    Gene expression data were aligned, filtered, quality-checked, and normalized using widely accepted bioinformatics pipelines.

    The dataset aids researchers working in cancer genomics, transcriptomics, and molecular subtype discovery.

    Included analyses demonstrate how to classify samples into biologically meaningful subtypes using clustering and statistical approaches.

    The workflow supports reproducible research with clear steps for importing raw data, preprocessing, normalization, and generating subtype assignments.

    The dataset is intended for educational, research, and benchmarking purposes within computational biology and bioinformatics.

    All scripts are written in R for transparency and adaptability to various research workflows.

  8. The time-series gene expression data in PMA stimulated THP-1

    • datamed.org
    + more versions
    Cite
    The time-series gene expression data in PMA stimulated THP-1 [Dataset]. https://datamed.org/display-item.php?repository=0044&idName=ID&id=5841d9165152c649505fbb31
    Explore at:
    Description

    (1) qPCR Gene Expression Data
    The THP-1 cell line was sub-cloned and one clone (#5) was selected for its ability to differentiate relatively homogeneously in response to phorbol 12-myristate-13-acetate (PMA) (Sigma). THP-1.5 was used for all subsequent experiments. THP-1.5 cells were cultured in RPMI, 10% FBS, Penicillin/Streptomycin, 10mM HEPES, 1mM Sodium Pyruvate, 50uM 2-Mercaptoethanol. THP-1.5 cells were treated with 30ng/ml PMA over a time-course of 96h. Total cell lysates were harvested in TRIzol reagent at 1, 2, 4, 6, 12, 24, 48, 72 and 96 hours, including an undifferentiated control. Undifferentiated cells were harvested in TRIzol reagent at the beginning of the LPS time-course. One biological replicate was prepared for each time point. Total RNA was purified from TRIzol lysates according to the manufacturer’s instructions. Gene-specific primer pairs were designed using Primer3 software, with an optimal primer size of 20 bases, amplification size of 140bp, and annealing temperature of 60°C. Primer sequences were designed for 2,396 candidate genes including four potential controls: GAPDH, beta actin (ACTB), beta-2-microglobulin (B2M), and phosphoglycerate kinase 1 (PGK1). The RNA samples were reverse transcribed to produce cDNA and then subjected to quantitative PCR using SYBR Green (Molecular Probes) on the ABI Prism 7900HT system (Applied Biosystems, Foster City, CA, USA) with a 384-well amplification plate; genes for each sample were assayed in triplicate. Reactions were carried out in 20μL volumes in 384-well plates; each reaction contained 0.5 U of HotStar Taq DNA polymerase (Qiagen) and the manufacturer’s 1× amplification buffer adjusted to a final concentration of 1mM MgCl2, 160μM dNTPs, 1/38000 SYBR Green I (Molecular Probes), 7% DMSO, 0.4% ROX Reference Dye (Invitrogen), 300 nM of each primer (forward and reverse), and 2μL of 40-fold diluted first-strand cDNA synthesis reaction mixture (12.5ng total RNA equivalent). Polymerase activation at 95ºC for 15 min was followed by 40 cycles of 15 s at 94ºC, 30 s at 60ºC, and 30 s at 72ºC. Dissociation curve analysis, which verifies that each PCR product is amplified from a single cDNA, was carried out in accordance with the manufacturer’s protocol. Expression levels were reported as Ct values. The large number of genes assayed and the replicate measures required that samples be distributed across multiple amplification plates, with an average of twelve plates per sample. Because it was envisioned that GAPDH would serve as a single-gene normalization control, this gene was included on each plate. All primer pairs were replicated in triplicate. Raw qPCR expression measures were quantified using Applied Biosystems SDS software and reported as Ct values. The Ct value represents the number of cycles or rounds of amplification required for the fluorescence of a gene or primer pair to surpass an arbitrary threshold. The magnitude of the Ct value is inversely proportional to the expression level, so a gene expressed at a high level will have a low Ct value and vice versa. Replicate Ct values were combined by averaging, with additional quality control constraints imposed by a standard filtering method developed by the RIKEN group for the preprocessing of their qPCR data. Briefly, this method entails: 1. Sort the triplicate Ct values in ascending order: Ct1, Ct2, Ct3. Calculate differences between consecutive Ct values: difference1 = Ct2 – Ct1 and difference2 = Ct3 – Ct2. 2. Four regions are defined (where Region4 overrides the other regions): Region1: difference ≤ 0.2; Region2: 0.2 < difference ≤ 1.0; Region3: 1.0 < difference; Region4: one of the Ct values in the difference calculation is 40. If difference1 and difference2 fall in the same region, then the three replicate Ct values are averaged to give a final representative measure. If difference1 and difference2 are in different regions, then the two replicate Ct values that are in the lower-numbered region are averaged instead. This particular filtering method is specific to the data set used here and does not represent a part of the normalization procedure itself; alternate methods of filtering can be applied if appropriate prior to normalization. Moreover, while the presentation in this manuscript has used Ct values as an example, any measure of transcript abundance, including those corrected for primer efficiency, can be used as input to our data-driven methods.

    (2) Quantile Normalization Algorithm
    Quantile normalization proceeds in two stages. First, if samples are distributed across multiple plates, normalization is applied to all of the genes assayed for each sample to remove plate-to-plate effects by enforcing the same quantile distribution on each plate. Then, an overall quantile normalization is applied between samples, assuring that each sample has the same distribution of expression values as all of the other samples to be compared. A similar approach using quantile normalization has been previously described in the context of microarray normalization. Briefly, our method entails the following steps: i) qPCR data from a single RNA sample are stored in a matrix M of dimension k (maximum number of genes or primer pairs on a plate) rows by p (number of plates) columns. Plates with differing numbers of genes are made equivalent by padding plates with missing values to constrain M to a rectangular structure. ii) Each column is sorted into ascending order and stored in matrix M'. The sorted columns correspond to the quantile distribution of each plate. The missing values are placed at the end of each ordered column; all calculations in quantile normalization are performed on non-missing values. iii) The average quantile distribution is calculated by taking the average of each row in M'. Each column in M' is replaced by this average quantile distribution and rearranged to have the same ordering as the original row order in M. This gives the within-sample normalized data from one RNA sample. iv) Steps analogous to i–iii are repeated for each sample. Between-sample normalization is performed by storing the within-normalized data as a new matrix N of dimension k (total number of genes, in our example k = 2,396) rows by n (number of samples) columns. Steps ii and iii are then applied to this matrix.

    (3) Rank-Invariant Set Normalization Algorithm
    We describe an extension of this method for use on qPCR data with any number of experimental conditions or samples, in which we identify a set of stably expressed genes from within the measured expression data and then use these to adjust expression between samples. Briefly: i) qPCR data from all samples are stored in a matrix R of dimension g (total number of genes or primer pairs used for all plates) rows by s (total number of samples) columns. ii) We first select gene sets that are rank-invariant across a single sample compared to a common reference. The reference may be chosen in a variety of ways, depending on the experimental design and aims of the experiment. As described in Tseng et al., the reference may be designated as a particular sample from the experiment (e.g. time zero in a time-course experiment), the average or median of all samples, or the sample closest to the average or median of all samples. Genes are considered to be rank-invariant if they retain their ordering, or rank, with respect to expression in the experimental sample versus the common reference sample. We collect sets of rank-invariant genes for all of the s pairwise comparisons relative to the common reference, and take the intersection of all s sets to obtain the final set of rank-invariant genes that is used for normalization. iii) Let αj represent the average expression value of the rank-invariant genes in sample j; (α1, …, αs) then represents the vector of rank-invariant average expression values for all conditions 1 to s. iv) We calculate the scale f
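
    Below is a minimal sketch of the between-sample quantile normalization step outlined in (2), for a complete genes-by-samples Ct matrix with no missing values; the published procedure additionally handles plate padding and missing entries.

    ```
    import numpy as np

    def quantile_normalize(ct):
        """Force every column (sample) of a genes x samples Ct matrix onto the same distribution.

        Each column is sorted, the mean quantile distribution is computed across columns,
        and the mean quantiles are mapped back to each column's original row order.
        """
        ct = np.asarray(ct, dtype=float)
        order = np.argsort(ct, axis=0)                     # sort order within each column
        mean_quantiles = np.sort(ct, axis=0).mean(axis=1)  # average quantile distribution
        normalized = np.empty_like(ct)
        for j in range(ct.shape[1]):
            normalized[order[:, j], j] = mean_quantiles
        return normalized

    # Toy Ct matrix: 5 genes x 3 samples
    ct = np.array([[22.1, 23.4, 21.8],
                   [30.0, 31.2, 29.5],
                   [18.5, 19.0, 18.2],
                   [25.3, 26.1, 24.9],
                   [35.0, 36.5, 34.0]])
    print(quantile_normalize(ct))   # all columns now share the same set of values
    ```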

  9. Raw and Normalized Foraminiferal Data for Chincoteague Bay and the Marshes...

    • catalog.data.gov
    Updated Jan 6, 2026
    + more versions
    Cite
    U.S. Geological Survey (2026). Raw and Normalized Foraminiferal Data for Chincoteague Bay and the Marshes of Assateague Island and the Adjacent Vicinity, Maryland and Virginia- July 2014 [Dataset]. https://catalog.data.gov/dataset/raw-and-normalized-foraminiferal-data-for-chincoteague-bay-and-the-marshes-of-assateague-i-e83d4
    Explore at:
    Dataset updated
    Jan 6, 2026
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Assateague Island, Maryland, Chincoteague Bay, Virginia
    Description

    Foraminiferal samples were collected from Chincoteague Bay, Newport Bay, and Tom’s Cove as well as the marshes on the back-barrier side of Assateague Island and the Delmarva (Delaware-Maryland-Virginia) mainland by U.S. Geological Survey (USGS) researchers from the St. Petersburg Coastal and Marine Science Center in March, April (14CTB01), and October (14CTB02) 2014. Samples were also collected by the Woods Hole Coastal and Marine Science Center (WHCMSC) in July 2014 and shipped to the St. Petersburg office for processing. The dataset includes raw foraminiferal and normalized counts for the estuarine grab samples (G), terrestrial surface samples (S), and inner shelf grab samples (G). For further information regarding data collection and sample site coordinates, processing methods, or related datasets, please refer to USGS Data Series 1060 (https://doi.org/10.3133/ds1060), USGS Open-File Report 2015–1219 (https://doi.org/10.3133/ofr20151219), and USGS Open-File Report 2015-1169 (https://doi.org/10.3133/ofr20151169). Downloadable data are available as Excel spreadsheets, comma-separated values text files, and formal Federal Geographic Data Committee metadata.

  10. Tick Data Normalization Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). Tick Data Normalization Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/tick-data-normalization-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Tick Data Normalization Market Outlook




    According to our latest research, the global Tick Data Normalization market size reached USD 1.02 billion in 2024, reflecting robust expansion driven by the increasing complexity and volume of financial market data. The market is expected to grow at a CAGR of 13.1% during the forecast period, reaching approximately USD 2.70 billion by 2033. This growth is fueled by the rising adoption of algorithmic trading, regulatory demands for accurate and consistent data, and the proliferation of advanced analytics across financial institutions. As per our analysis, the market’s trajectory underscores the critical role of data normalization in ensuring data integrity and operational efficiency in global financial markets.




    The primary growth driver for the tick data normalization market is the exponential surge in financial data generated by modern trading platforms and electronic exchanges. With the proliferation of high-frequency trading and the integration of diverse market data feeds, financial institutions face the challenge of processing vast amounts of tick-by-tick data from multiple sources, each with unique formats and structures. Tick data normalization solutions address this complexity by transforming disparate data streams into consistent, standardized formats, enabling seamless downstream processing for analytics, trading algorithms, and compliance reporting. This standardization is particularly vital in the context of regulatory mandates such as MiFID II and Dodd-Frank, which require accurate data lineage and auditability, further propelling market growth.




    Another significant factor contributing to market expansion is the growing reliance on advanced analytics and artificial intelligence within the financial sector. As firms seek to extract actionable insights from historical and real-time tick data, the need for high-quality, normalized datasets becomes paramount. Data normalization not only enhances the accuracy and reliability of predictive models but also facilitates the integration of machine learning algorithms for tasks such as anomaly detection, risk assessment, and portfolio optimization. The increasing sophistication of trading strategies, coupled with the demand for rapid, data-driven decision-making, is expected to sustain robust demand for tick data normalization solutions across asset classes and geographies.




    Furthermore, the transition to cloud-based infrastructure has transformed the operational landscape for banks, hedge funds, and asset managers. Cloud deployment offers scalability, flexibility, and cost-efficiency, enabling firms to manage large-scale tick data normalization workloads without the constraints of on-premises hardware. This shift is particularly relevant for smaller institutions and emerging markets, where cloud adoption lowers entry barriers and accelerates the deployment of advanced data management capabilities. At the same time, the availability of managed services and API-driven platforms is fostering innovation and expanding the addressable market, as organizations seek to outsource complex data normalization tasks to specialized vendors.




    Regionally, North America continues to dominate the tick data normalization market, accounting for the largest share in terms of revenue and technology adoption. The presence of leading financial centers, advanced IT infrastructure, and a strong regulatory framework underpin the region’s leadership. Meanwhile, Asia Pacific is emerging as the fastest-growing market, driven by rapid digitalization of financial services, burgeoning capital markets, and increasing participation of retail and institutional investors. Europe also maintains a significant market presence, supported by stringent compliance requirements and a mature financial ecosystem. Latin America and the Middle East & Africa are witnessing steady growth, albeit from a lower base, as financial modernization initiatives gain momentum.





    Component Analysis




    The tick data normalizati

  11. UniCourt Legal Analytics API - USA Legal Data (AI Normalized)

    • datarade.ai
    Cite
    UniCourt, UniCourt Legal Analytics API - USA Legal Data (AI Normalized) [Dataset]. https://datarade.ai/data-products/unicourt-legal-analytics-api-usa-legal-data-ai-normalized-unicourt
    Explore at:
    Dataset authored and provided by
    UniCourt
    Area covered
    United States
    Description

    UniCourt provides easy access to normalized legal analytics data via our Attorney Analytics API, Law Firm Analytics API, Judge Analytics API, Party Analytics API, and Court Analytics API, giving you the flexibility you need to intuitively move between interconnected data points. This structure can be used for AI & ML Training Data.

    Build the Best Legal Analytics Possible

    • UniCourt collects court data from hundreds of state and federal trial court databases, as well as attorney bar data, judicial records data, and Secretary of State data.
    • We then combine all of those data sets together through our entity normalization process to identify who’s who in litigation, so you can download structured data via our APIs and build the best legal analytics possible.

    Flexible Analytics APIs for Meaningful Integrations

    • UniCourt’s Legal Analytics APIs put billions of data points at your fingertips and give you the flexibility you need to integrate analytics into your matter management systems, BI dashboards, data lakes, CRMs, and other data management tools.
    • Create on-demand, self-service reporting options within your internal applications and set up automated data feeds to keep your mission critical analytics reports regularly refreshed with updated data.

    What Legal Analytics APIs Are Available?

    UniCourt offers a wide range of Legal Analytics APIs and various end-points to provide the data you need. Here are the core analytics APIs we provide:

    • Attorney Analytics API
    • Law Firm Analytics API
    • Judge Analytics API
    • Party Analytics API
    • Case Analytics API

  12. Single non-normalized data of electron probe analyses of all glass shard...

    • doi.pangaea.de
    zip
    Updated Apr 13, 2016
    Cite
    Josefine Lenz; Sebastian Wetterich; Benjamin M Jones; Hanno Meyer; Guido Grosse; Anatoly A Bobrov (2016). Single non-normalized data of electron probe analyses of all glass shard samples from the Seward Peninsula and the Lipari obsidian reference standard [Dataset]. http://doi.org/10.1594/PANGAEA.859554
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 13, 2016
    Dataset provided by
    PANGAEA
    Authors
    Josefine Lenz; Sebastian Wetterich; Benjamin M Jones; Hanno Meyer; Guido Grosse; Anatoly A Bobrov
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jul 1, 2008 - Apr 22, 2009
    Description

    Permafrost degradation influences the morphology, biogeochemical cycling and hydrology of Arctic landscapes over a range of time scales. To reconstruct temporal patterns of early to late Holocene permafrost and thermokarst dynamics, site-specific palaeo-records are needed. Here we present a multi-proxy study of a 350-cm-long permafrost core from a drained lake basin on the northern Seward Peninsula, Alaska, revealing Lateglacial to Holocene thermokarst lake dynamics in a central location of Beringia. Use of radiocarbon dating, micropalaeontology (ostracods and testaceans), sedimentology (grain-size analyses, magnetic susceptibility, tephra analyses), geochemistry (total nitrogen and carbon, total organic carbon, d13Corg) and stable water isotopes (d18O, dD, d excess) of ground ice allowed the reconstruction of several distinct thermokarst lake phases. These include a pre-lacustrine environment at the base of the core characterized by the Devil Mountain Maar tephra (22 800±280 cal. a BP, Unit A), which has vertically subsided in places due to subsequent development of a deep thermokarst lake that initiated around 11 800 cal. a BP (Unit B). At about 9000 cal. a BP this lake transitioned from a stable depositional environment to a very dynamic lake system (Unit C) characterized by fluctuating lake levels, potentially intermediate wetland development, and expansion and erosion of shore deposits. Complete drainage of this lake occurred at 1060 cal. a BP, including post-drainage sediment freezing from the top down to 154 cm and gradual accumulation of terrestrial peat (Unit D), as well as uniform upward talik refreezing. This core-based reconstruction of multiple thermokarst lake generations since 11 800 cal. a BP improves our understanding of the temporal scales of thermokarst lake development from initiation to drainage, demonstrates complex landscape evolution in the ice-rich permafrost regions of Central Beringia during the Lateglacial and Holocene, and enhances our understanding of biogeochemical cycles in thermokarst-affected regions of the Arctic.

  13. Data from: Filtration and Normalization of Sequencing Read Data in...

    • datasetcatalog.nlm.nih.gov
    Updated Oct 20, 2016
    Cite
    Losada, Patricia Moran; Chouvarine, Philippe; DeLuca, David S.; Wiehlmann, Lutz; Tümmler, Burkhard (2016). Filtration and Normalization of Sequencing Read Data in Whole-Metagenome Shotgun Samples [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001539099
    Explore at:
    Dataset updated
    Oct 20, 2016
    Authors
    Losada, Patricia Moran; Chouvarine, Philippe; DeLuca, David S.; Wiehlmann, Lutz; Tümmler, Burkhard
    Description

    Ever-increasing affordability of next-generation sequencing makes whole-metagenome sequencing an attractive alternative to traditional 16S rDNA, RFLP, or culturing approaches for the analysis of microbiome samples. The advantage of whole-metagenome sequencing is that it allows direct inference of the metabolic capacity and physiological features of the studied metagenome without reliance on the knowledge of genotypes and phenotypes of the members of the bacterial community. It also makes it possible to overcome problems of 16S rDNA sequencing, such as unknown copy number of the 16S gene and lack of sufficient sequence similarity of the “universal” 16S primers to some of the target 16S genes. On the other hand, next-generation sequencing suffers from biases resulting in non-uniform coverage of the sequenced genomes. To overcome this difficulty, we present a model of GC-bias in sequencing metagenomic samples as well as filtration and normalization techniques necessary for accurate quantification of microbial organisms. While there has been substantial research in normalization and filtration of read-count data in such techniques as RNA-seq or Chip-seq, to our knowledge, this has not been the case for the field of whole-metagenome shotgun sequencing. The presented methods assume that complete genome references are available for most microorganisms of interest present in metagenomic samples. This is often a valid assumption in such fields as medical diagnostics of patient microbiota. Testing the model on two validation datasets showed four-fold reduction in root-mean-square error compared to non-normalized data in both cases. The presented methods can be applied to any pipeline for whole metagenome sequencing analysis relying on complete microbial genome references. We demonstrate that such pre-processing reduces the number of false positive hits and increases accuracy of abundance estimates.
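
    To illustrate the general idea (not the paper's specific model), here is a minimal sketch of GC-bias normalization that bins positions by the GC content of the reference and rescales observed coverage by each bin's bias factor; all inputs are simulated.

    ```
    import numpy as np

    def gc_bias_factors(gc_content, coverage, n_bins=10):
        """Estimate a per-GC-bin bias factor: mean coverage in the bin / overall mean coverage."""
        bins = np.clip((gc_content * n_bins).astype(int), 0, n_bins - 1)
        overall = coverage.mean()
        factors = np.ones(n_bins)
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                factors[b] = coverage[mask].mean() / overall
        return bins, factors

    def gc_normalize(gc_content, coverage, n_bins=10):
        """Divide each position's coverage by the bias factor of its GC bin."""
        bins, factors = gc_bias_factors(gc_content, coverage, n_bins)
        return coverage / factors[bins]

    # Toy data: coverage artificially depressed at GC extremes
    rng = np.random.default_rng(7)
    gc = rng.uniform(0.2, 0.8, size=1000)
    cov = rng.poisson(lam=100 * (1 - 2 * np.abs(gc - 0.5)) + 20)
    corrected = gc_normalize(gc, cov)
    print(corrected.mean())   # per-GC-bin mean coverage is now roughly constant
    ```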

  14. Khmer Word Image Patches For Training OCR

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Chanveasna ENG (2025). Khmer Word Image Patches For Training OCR [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/khwordpatches
    Explore at:
    Available download formats: zip (5,065,136,281 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Chanveasna ENG
    Description

    Synthetic Khmer OCR - Pre-processed Chunks

    This dataset is a pre-processed and optimized version of the original "Synthetic Khmer OCR" dataset. All word images have been cropped, resized with padding to a uniform size, and stored in highly efficient PyTorch tensor chunks for extremely fast loading during model training.

    This format is designed to completely eliminate the I/O bottleneck that comes from reading millions of individual small image files, allowing you to feed a powerful GPU without any waiting.

    Why This Format?

    Extreme Speed: Loading a single chunk of 100,000 images from one file is hundreds of times faster than loading 100,000 individual PNG files.
    
    No More Pre-processing: All images are already cropped and resized. The data is ready for training right out of the box.
    
    Memory Efficient: The dataset is split into manageable chunks, so you don't need to load all ~34GB of data into RAM at once.
    

    Data Structure

    The dataset is organized into two main folders: train and val.

    /
    ├── train/
    │   ├── train_chunk_0.pt
    │   ├── train_chunk_1.pt
    │   └── ... (and so on for all training chunks)
    └── val/
        ├── val_chunk_0.pt
        ├── val_chunk_1.pt
        └── ... (and so on for all validation chunks)

    Inside Each Chunk File (.pt)

    Each .pt file is a standard PyTorch file containing a single Python dictionary with two keys: 'images' and 'labels'.

    'images':
    
      Type: torch.Tensor
    
      Shape: (N, 3, 40, 80), where N is the number of samples in the chunk (typically 100,000).
    
      Data Type (dtype): torch.uint8 (values from 0-255). This is done to save a massive amount of disk space. You will need to convert this to float and normalize it before feeding it to a model.
    
      Description: This tensor contains N raw, uncompressed image pixels. Each image is a 3-channel (RGB) color image with a height of 40 pixels and a width of 64 pixels.
    
    'labels':
    
      Type: list of str
    
      Length: N (matches the number of images in the tensor).
    
      Description: This is a simple Python list of strings. The label at labels[i] corresponds to the image at images[i].
    

    How to Use This Dataset in PyTorch

    Here is a simple example of how to load a chunk and access the data.

    ```
    import torch
    from torchvision import transforms
    from PIL import Image

    # --- 1. Load a single chunk file ---
    chunk_path = 'train/train_chunk_0.pt'
    data_chunk = torch.load(chunk_path)

    image_tensor_chunk = data_chunk['images']
    labels_list = data_chunk['labels']

    print(f"Loaded chunk: {chunk_path}")
    print(f"Image tensor shape: {image_tensor_chunk.shape}")
    print(f"Number of labels: {len(labels_list)}")

    # --- 2. Get a single sample (e.g., the 42nd item in this chunk) ---
    index = 42
    image_uint8 = image_tensor_chunk[index]
    label = labels_list[index]

    print(f"--- Sample at index {index} ---")
    print(f"Label: {label}")
    print(f"Image tensor shape (as saved): {image_uint8.shape}")
    print(f"Image data type (as saved): {image_uint8.dtype}")

    # --- 3. Prepare the image for a model ---
    # You need to convert the uint8 tensor (0-255) to a float tensor (0.0-1.0)
    # and then normalize it.

    # a. Convert to float
    image_float = image_uint8.float() / 255.0

    # b. Define the normalization (must be the same as used in training)
    normalize_transform = transforms.Normalize(
        mean=[0.5, 0.5, 0.5],
        std=[0.5, 0.5, 0.5]
    )

    # c. Apply normalization
    normalized_image = normalize_transform(image_float)

    print(f"Image tensor shape (normalized): {normalized_image.shape}")
    print(f"Image data type (normalized): {normalized_image.dtype}")
    print(f"Min value: {normalized_image.min():.2f}, Max value: {normalized_image.max():.2f}")

    # --- (Optional) 4. How to view the image ---
    # To convert a tensor back to an image you can view, we would need to
    # un-normalize it first to see the original colors. For simplicity, just
    # convert the float tensor from before normalization.
    image_to_view = transforms.ToPILImage()(image_float)

    # You can now display or save image_to_view:
    # image_to_view.show()
    # image_to_view.save('sample_image.png')

    print("Successfully prepared a sample for model input and viewing!")
    ```

  15. GSE65194 Data Normalization and Subtype Analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE65194 Data Normalization and Subtype Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse65194-data-normalization-and-subtype-analysis
    Explore at:
    Available download formats: zip (54,989,436 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Raw and preprocessed microarray expression data from the GSE65194 cohort.

    Includes samples from triple-negative breast cancer (TNBC), other breast cancer subtypes, and normal breast tissues.

    Expression profiles generated using the “Affymetrix Human Genome U133 Plus 2.0 Array (GPL570)” platform.

    Provides normalized gene expression values suitable for downstream analyses such as differential expression, subtype classification, and clustering.

    Supports the identification of differentially expressed genes (DEGs) between TNBC, non-TNBC subtypes, and normal tissue.

    Useful for transcriptomic analyses in breast cancer research, including subtype analysis, biomarker discovery, and comparative studies.

  16. LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time...

    • plos.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas (2023). LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time PCR (qPCR) Data as an Alternative to Reference Gene Based Methods [Dataset]. http://doi.org/10.1371/journal.pone.0135852
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized by excellent sensitivity, dynamic range and reproducibility, and is still regarded as the gold standard for quantifying transcript abundance. Parallelization of qPCR, such as on the microfluidic TaqMan Fluidigm Biomark platform, enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. The most widely used methods for evaluating or calculating gene expression data include geNorm and ΔΔCt, respectively. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.
    Results We developed an RG-independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods give similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria but became differentially expressed in normalized data evaluated by a t-test. geNorm-normalized data showed an effect of a shifted mean per gene per condition, whereas LEMming-normalized data did not. Comparing the decrease of standard deviation from raw data to geNorm and to LEMming, the latter was superior. In data set 3, stable RGs were available according to geNorm's calculated average expression stability and pairwise variation, but t-tests of raw data contradicted this. Normalization with RGs resulted in distorted data contradicting the literature, while LEMming-normalized data did not.
    Conclusions If RGs are coexpressed but are not independent of the experimental conditions, the stability criteria based on inter- and intragroup variation fail. The linear error model developed here, LEMming, overcomes the dependency on RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect, an additional measurement is needed. Quantification of total cDNA content per sample helps to identify systematic errors.
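
    For context, here is a minimal sketch of the reference-gene-based ΔΔCt calculation that LEMming is positioned as an alternative to; the Ct values are invented and a perfect amplification efficiency of 2 is assumed.

    ```
    # Minimal illustration of the classic delta-delta-Ct (ΔΔCt) method that
    # reference-gene-based normalization relies on. All values are invented.
    def delta_delta_ct(ct_target_treated, ct_ref_treated,
                       ct_target_control, ct_ref_control):
        """Return the fold change of a target gene, normalized to a reference gene."""
        d_ct_treated = ct_target_treated - ct_ref_treated   # ΔCt in the treated sample
        d_ct_control = ct_target_control - ct_ref_control   # ΔCt in the control sample
        dd_ct = d_ct_treated - d_ct_control                 # ΔΔCt
        return 2 ** (-dd_ct)                                # fold change, assuming 100% efficiency

    # Example: the target gene Ct drops by ~2 cycles relative to the reference gene
    print(delta_delta_ct(24.0, 18.0, 26.0, 18.0))  # ≈ 4.0-fold up-regulation
    ```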

  17. Data_Sheet_2_Ensuring That Fundamentals of Quantitative Microbiology Are...

    • figshare.com
    txt
    Updated Jun 6, 2023
    Cite
    Philip J. Schmidt; Ellen S. Cameron; Kirsten M. Müller; Monica B. Emelko (2023). Data_Sheet_2_Ensuring That Fundamentals of Quantitative Microbiology Are Reflected in Microbial Diversity Analyses Based on Next-Generation Sequencing.csv [Dataset]. http://doi.org/10.3389/fmicb.2022.728146.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Philip J. Schmidt; Ellen S. Cameron; Kirsten M. Müller; Monica B. Emelko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diversity analysis of amplicon sequencing data has mainly been limited to plug-in estimates calculated using normalized data to obtain a single value of an alpha diversity metric or a single point on a beta diversity ordination plot for each sample. As recognized for count data generated using classical microbiological methods, amplicon sequence read counts obtained from a sample are random data linked to source properties (e.g., proportional composition) by a probabilistic process. Thus, diversity analysis has focused on diversity exhibited in (normalized) samples rather than probabilistic inference about source diversity.

    This study applies fundamentals of statistical analysis for quantitative microbiology (e.g., microscopy, plating, and most probable number methods) to sample collection and processing procedures of amplicon sequencing methods to facilitate inference reflecting the probabilistic nature of such data and evaluation of uncertainty in diversity metrics. Following description of types of random error, mechanisms such as clustering of microorganisms in the source, differential analytical recovery during sample processing, and amplification are found to invalidate a multinomial relative abundance model. The zeros often abounding in amplicon sequencing data and their implications are addressed, and Bayesian analysis is applied to estimate the source Shannon index given unnormalized data (both simulated and experimental).

    Inference about source diversity is found to require knowledge of the exact number of unique variants in the source, which is practically unknowable due to library size limitations and the inability to differentiate zeros corresponding to variants that are actually absent in the source from zeros corresponding to variants that were merely not detected. Given these problems with estimation of diversity in the source even when the basic multinomial model is valid, diversity analysis at the level of samples with normalized library sizes is discussed.
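
    As a point of reference for the plug-in estimates the study critiques, the sketch below computes a plug-in Shannon index directly from raw read counts; the variant counts are invented for illustration, and this is not the Bayesian source-diversity estimator developed in the paper.

    import numpy as np

    def shannon_plugin(counts):
        """Plug-in Shannon index from a vector of amplicon read counts (zeros are dropped)."""
        counts = np.asarray(counts, dtype=float)
        p = counts[counts > 0] / counts.sum()
        return -(p * np.log(p)).sum()

    # Hypothetical read counts for the same source sequenced at two library sizes.
    shallow = np.array([500, 300, 150, 40, 10, 0, 0])
    deep = np.array([5000, 2900, 1600, 380, 90, 20, 10])

    print(shannon_plugin(shallow))  # diversity exhibited in the shallow sample
    print(shannon_plugin(deep))     # rare variants detected only at higher depth shift the estimate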

  18. Data from: Breast cancer patient-derived whole-tumor cell culture model for...

    • figshare.scilifelab.se
    Updated Jan 15, 2025
    Cite
    Xinsong Chen; Emmanouil Sifakis; Johan Hartman (2025). Data from: Breast cancer patient-derived whole-tumor cell culture model for efficient drug profiling and treatment response prediction [Dataset]. http://doi.org/10.17044/scilifelab.21516993.v1
    Explore at:
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Xinsong Chen; Emmanouil Sifakis; Johan Hartman
    License

    https://www.scilifelab.se/data/restricted-access/

    Description

    Dataset Description

    This record is a collection of whole-genome sequencing (WGS), RNA sequencing (RNA-seq), NanoString nCounter® Breast Cancer 360 (BC360) Panel, and cell viability assay data generated as part of the study “Breast cancer patient-derived whole-tumor cell culture model for efficient drug profiling and treatment response prediction” by Chen et al., 2022.

    The WGS dataset contains raw sequencing data (BAM files) from tumor scraping cells (TSCs) collected at the time of surgical resection, derived whole-tumor cell (WTC) cultures from each patient's specimen, and a normal skin biopsy used as germline control, from five (5) breast cancer (BC) patients. Genomic DNA was isolated using the QIAamp DNA mini kit (QIAGEN). Libraries were prepared using Illumina TruSeq PCR-free (350 bp) according to the manufacturer's protocol. The bulk DNA samples were then sequenced on an Illumina HiSeq X and processed via the Science for Life Laboratory CAW workflow version 1.2.362 (Stockholm, Sweden; https://github.com/SciLifeLab/Sarek).

    The RNA-seq dataset contains raw sequencing data (FASTQ files) from the TSC pellets collected at the time of surgical resection and from the pellets of derived WTC cultures with or without tamoxifen metabolite treatment (1 nM 4OHT and 25 nM Z-Endoxifen), from 16 BC patients. From each sample, 2000 ng of RNA was extracted using the RNeasy mini kit (QIAGEN), and 1 μg of total RNA was used for rRNA depletion with RiboZero (Illumina). Stranded RNA-seq libraries were constructed using the TruSeq Stranded Total RNA Library Prep Kit (Illumina), and paired-end sequencing was performed on a HiSeq 2500 with a 2 x 126 setup at the Science for Life Laboratory platform (Stockholm, Sweden).

    The NanoString nCounter® BC360 Panel dataset contains normalized data from FFPE tissue samples of 43 BC patients. RNA was extracted from the macrodissected sections using the High Pure FFPET RNA Isolation Kit (Roche) following the manufacturer's protocols. Then, 200 ng of RNA per sample was loaded and analyzed according to the manufacturer's recommendations on a NanoString nCounter® system using the Breast Cancer 360 code set, which comprises 18 housekeeping genes and 752 target genes covering key pathways in tumor biology, the tumor microenvironment, and the immune response. Raw data was assessed with several quality assurance (QA) metrics measuring imaging quality, oversaturation, and overall signal-to-noise ratio. All samples satisfying the QA metric checks were background corrected (background thresholding) using the negative probes and normalized with their mean minus two standard deviations. The background-corrected data were then normalized to the geometric mean of five housekeeping genes, namely ACTB, MRPL19, PSMC4, RPLP0, and SF3A1.

    The cell viability assay dataset for the main study contains drug sensitivity score (DSS) values for each tested drug, derived from the WTC spheroids of 45 BC patients. For patient DP-45, multiple regions were sampled to establish WTCs and perform drug profiling. For the neoadjuvant-setting validation study, the DSS values correspond to WTCs of 15 BC patients. In the drug profiling assay, each compound covered five concentrations ranging from 10 μM to 1 nM (2 μM to 0.2 nM for trastuzumab and pertuzumab) in 10-fold dilutions and was dispensed with the Echo 550 acoustic liquid handling system (Labcyte Inc.) to prepare spotted 384-well plates.

    For the neoadjuvant-setting validation assay, cyclophosphamide was replaced with its active metabolite, 4-hydroperoxy cyclophosphamide (4-OOH-cyclophosphamide). Each relevant compound covered eight concentrations ranging from 10 μM to 1 nM (2 μM to 0.2 nM for trastuzumab and pertuzumab) and was dispensed with the Tecan D300e Digital Dispenser (Tecan) to prepare spotted 384-well plates. In both experimental settings, a total volume of 40 nl of each compound condition was dispensed into each well to limit the final DMSO concentration to 0.1% during the treatment period. Further details on the cell viability assay and the DSS estimation are available in the Materials & Methods section of Chen et al., 2022.
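
    A minimal sketch of the housekeeping-gene normalization summarized above (scaling each sample to the geometric mean of ACTB, MRPL19, PSMC4, RPLP0, and SF3A1 after background correction with the negative probes) is shown below. The counts are made up and the background step is reduced to a simple per-sample floor; this illustrates the idea only and is not the nCounter analysis pipeline used in the study.

    import numpy as np
    import pandas as pd

    # Hypothetical raw nCounter counts: rows = probes, columns = samples.
    raw = pd.DataFrame(
        {"S1": [1200, 800, 950, 700, 500, 15, 9, 30],
         "S2": [2400, 1500, 1900, 1500, 1100, 20, 12, 55]},
        index=["ACTB", "MRPL19", "PSMC4", "RPLP0", "SF3A1", "NEG_A", "NEG_B", "TARGET_X"],
    )
    housekeepers = ["ACTB", "MRPL19", "PSMC4", "RPLP0", "SF3A1"]
    negatives = ["NEG_A", "NEG_B"]

    # 1. Simplified background correction: floor each sample at its mean negative-probe count.
    background = raw.loc[negatives].mean()
    corrected = raw.clip(lower=background, axis=1)

    # 2. Scale each sample by the geometric mean of its housekeeping genes so samples are comparable.
    geo_mean = np.exp(np.log(corrected.loc[housekeepers]).mean())
    scale = geo_mean.mean() / geo_mean
    normalized = corrected.mul(scale, axis=1)
    print(normalized.round(1))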

  19. Communities and Crime Data Set (Normalized)

    • kaggle.com
    zip
    Updated Dec 1, 2018
    Cite
    Rohit Kanrar (2018). Communities and Crime Data Set (Normalized) [Dataset]. https://www.kaggle.com/anonymous13635/communities-and-crime-data-set-normalized
    Explore at:
    zip, 959436 bytes (available download formats)
    Dataset updated
    Dec 1, 2018
    Authors
    Rohit Kanrar
    Description

    Many variables are included so that algorithms that select or learn weights for attributes can be tested. Clearly unrelated attributes were not included; attributes were picked if there was any plausible connection to crime (N=122), plus the attribute to be predicted (per capita violent crimes). The variables in the dataset describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units.

    The per capita violent crimes variable was calculated using population and the sum of the crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. There was apparently some controversy in some states concerning the counting of rapes, which led to missing values for rape and, in turn, incorrect values for per capita violent crime; those communities are not included in the dataset. Many of the omitted communities were from the midwestern USA.

    Data is described below based on the original values. All numeric data was normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Attributes retain their distribution and skew (hence, for example, the population attribute has a mean value of 0.06 because most communities are small). An attribute described as 'mean people per household' is therefore actually the normalized (0-1) version of that value.

    The normalization preserves rough ratios of values WITHIN an attribute (e.g., double the value for double the population, within the available precision), except for extreme values: all values more than 3 SD above the mean are normalized to 1.00, and all values more than 3 SD below the mean are normalized to 0.00.

    However, the normalization does not preserve relationships BETWEEN attributes (e.g., it would not be meaningful to compare the value for whitePerCap with the value for blackPerCap for a community).
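
    A rough reconstruction of the kind of per-attribute scaling described above (values beyond 3 SD from the mean pushed to 0.00 or 1.00, everything else mapped into 0.00-1.00) is sketched below; it illustrates the clipping-and-rescaling idea and is not the exact equal-interval binning procedure used to build the dataset.

    import numpy as np

    def normalize_attribute(values):
        """Scale one attribute to [0, 1], sending values beyond +/-3 SD to 1.0 / 0.0."""
        x = np.asarray(values, dtype=float)
        lo, hi = x.mean() - 3 * x.std(), x.mean() + 3 * x.std()
        return (np.clip(x, lo, hi) - lo) / (hi - lo)

    # Hypothetical community populations: one large city dominates the scale,
    # so most normalized values end up near the low end of [0, 1].
    population = np.array([1200, 3400, 5600, 8000, 12000, 950000])
    print(normalize_attribute(population).round(2))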

    A limitation is that the LEMAS survey covered only police departments with at least 100 officers, plus a random sample of smaller departments. For our purposes, communities not found in both the census and crime datasets were omitted. Many communities are missing LEMAS data.

  20. Khmer Subsyllables Image Patches For Training OCR

    • kaggle.com
    zip
    Updated Nov 22, 2025
    Cite
    Chanveasna ENG (2025). Khmer Subsyllables Image Patches For Training OCR [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/khsubsyllablespatches
    Explore at:
    zip, 3561443645 bytes (available download formats)
    Dataset updated
    Nov 22, 2025
    Authors
    Chanveasna ENG
    Description

    Synthetic Khmer OCR - Pre-processed Chunks

    This dataset is a pre-processed and optimized version of the original "Synthetic Khmer OCR" dataset. All word images have been cropped, resized with padding to a uniform size, and stored in highly efficient PyTorch tensor chunks for extremely fast loading during model training.

    This format is designed to eliminate the I/O bottleneck that comes from reading millions of individual small image files, allowing you to keep a powerful GPU fed without waiting.

    Why This Format?

    Extreme Speed: Loading a single chunk of 100,000 images from one file is hundreds of times faster than loading 100,000 individual PNG files.
    
    No More Pre-processing: All images are already cropped and resized. The data is ready for training right out of the box.
    
    Memory Efficient: The dataset is split into manageable chunks, so you don't need to load all ~34GB of data into RAM at once.
    

    Data Structure

    The dataset is organized into two main folders: train and val.

    /
    ├── train/
    │   ├── train_chunk_0.pt
    │   ├── train_chunk_1.pt
    │   └── ... (and so on for all training chunks)
    └── val/
        ├── val_chunk_0.pt
        ├── val_chunk_1.pt
        └── ... (and so on for all validation chunks)

    Inside Each Chunk File (.pt)

    Each .pt file is a standard PyTorch file containing a single Python dictionary with two keys: 'images' and 'labels'.

    'images':
    
      Type: torch.Tensor
    
      Shape: (N, 3, 40, 64), where N is the number of samples in the chunk (typically 100,000).
    
      Data Type (dtype): torch.uint8 (values from 0-255). This is done to save a massive amount of disk space. You will need to convert this to float and normalize it before feeding it to a model.
    
      Description: This tensor contains the raw, uncompressed pixel data for N images. Each image is a 3-channel (RGB) color image with a height of 40 pixels and a width of 64 pixels.
    
    'labels':
    
      Type: list of str
    
      Length: N (matches the number of images in the tensor).
    
      Description: This is a simple Python list of strings. The label at labels[i] corresponds to the image at images[i].
    

    How to Use This Dataset in PyTorch

    Here is a simple example of how to load a chunk and access the data.

    import torch
    from torchvision import transforms
    from PIL import Image
    
    # --- 1. Load a single chunk file ---
    chunk_path = 'train/train_chunk_0.pt'
    data_chunk = torch.load(chunk_path)
    
    image_tensor_chunk = data_chunk['images']
    labels_list = data_chunk['labels']
    
    print(f"Loaded chunk: {chunk_path}")
    print(f"Image tensor shape: {image_tensor_chunk.shape}")
    print(f"Number of labels: {len(labels_list)}")
    
    # --- 2. Get a single sample (e.g., the 42nd item in this chunk) ---
    index = 42
    image_uint8 = image_tensor_chunk[index]
    label = labels_list[index]
    
    print(f"\n--- Sample at index {index} ---")
    print(f"Label: {label}")
    print(f"Image tensor shape (as saved): {image_uint8.shape}")
    print(f"Image data type (as saved): {image_uint8.dtype}")
    
    
    # --- 3. Prepare the image for a model ---
    # You need to convert the uint8 tensor (0-255) to a float tensor (0.0-1.0)
    # and then normalize it.
    
    # a. Convert to float
    image_float = image_uint8.float() / 255.0
    
    # b. Define the normalization (must be the same as used in training)
    normalize_transform = transforms.Normalize(
      mean=[0.5, 0.5, 0.5],
      std=[0.5, 0.5, 0.5]
    )
    
    # c. Apply normalization
    normalized_image = normalize_transform(image_float)
    
    print(f"\nImage tensor shape (normalized): {normalized_image.shape}")
    print(f"Image data type (normalized): {normalized_image.dtype}")
    print(f"Min value: {normalized_image.min():.2f}, Max value: {normalized_image.max():.2f}")
    
    
    # --- (Optional) 4. How to view the image ---
    # To convert a tensor back to an image you can view:
    # We need to un-normalize it first if we want to see the original colors.
    # For simplicity, let's just convert the float tensor before normalization.
    image_to_view = transforms.ToPILImage()(image_float)
    
    # You can now display this image_to_view
    # image_to_view.show() 
    # image_to_view.save('sample_image.png')
    print("\nSuccessfully prepared a sample for model input and viewing!")
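
    For training, the chunk files can also be wrapped in a torch.utils.data.Dataset so a DataLoader handles batching and shuffling. The sketch below keeps a single chunk in memory and uses a hypothetical path; it is one possible way to consume the format described above, not part of the published dataset.

    import torch
    from torch.utils.data import Dataset, DataLoader
    from torchvision import transforms

    class ChunkDataset(Dataset):
        """Serves (normalized image, label) pairs from one pre-processed chunk file."""

        def __init__(self, chunk_path):
            chunk = torch.load(chunk_path)  # dict with 'images' (uint8 tensor) and 'labels' (list of str)
            self.images = chunk['images']
            self.labels = chunk['labels']
            self.normalize = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            img = self.images[idx].float() / 255.0  # uint8 (0-255) -> float (0.0-1.0)
            return self.normalize(img), self.labels[idx]

    # Usage (path is hypothetical):
    # ds = ChunkDataset('train/train_chunk_0.pt')
    # loader = DataLoader(ds, batch_size=256, shuffle=True)
    # images, labels = next(iter(loader))  # images: (256, 3, 40, 64) float tensor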
    