Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).
Training dataset: 64,665 sentence pairs Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.
All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The goal of metabolomics is to measure the entire range of small organic molecules in biological samples. In liquid chromatography–mass spectrometry-based metabolomics, formidable analytical challenges remain in removing the nonbiological factors that affect chromatographic peak areas. These factors include sample matrix-induced ion suppression, chromatographic quality, and analytical drift. The combination of these factors is referred to as obscuring variation. Some metabolomics samples can exhibit intense obscuring variation due to matrix-induced ion suppression, rendering large amounts of data unreliable and difficult to interpret. Existing normalization techniques have limited applicability to these sample types. Here we present a data normalization method to minimize the effects of obscuring variation. We normalize peak areas using a batch-specific normalization process, which matches measured metabolites with isotope-labeled internal standards that behave similarly during the analysis. This method, called best-matched internal standard (B-MIS) normalization, can be applied to targeted or untargeted metabolomics data sets and yields relative concentrations. We evaluate and demonstrate the utility of B-MIS normalization using marine environmental samples and laboratory grown cultures of phytoplankton. In untargeted analyses, B-MIS normalization allowed for inclusion of mass features in downstream analyses that would have been considered unreliable without normalization due to obscuring variation. B-MIS normalization for targeted or untargeted metabolomics is freely available at https://github.com/IngallsLabUW/B-MIS-normalization.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Background
The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson's correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results
The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson's correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example of normalizing the word ‘foooooooooood’ and ‘welllllllllllll’ using the proposed method and four other normalization methods.
Simulation script 1This R script will simulate two populations of microbiome samples and compare normalization methods.Simulation script 2This R script will simulate two populations of microbiome samples and compare normalization methods via PcOAs.Sample.OTU.distributionOTU distribution used in the paper: Methods for normalizing microbiome data: an ecological perspective
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias - Table 1
One of the body fluids often used in metabolomics studies is urine. The peak intensities of metabolites in urine are affected by the urine history of an individual resulting in dilution differences. This requires therefore normalization of the data to correct for such differences. Two normalization techniques are commonly applied to urine samples prior to their further statistical analysis. First, AUC normalization aims to normalize a group of signals with peaks by standardizing the area under the curve (AUC) within a sample to the median, mean or any other proper representation of the amount of dilution. The second approach uses specific end-product metabolites such as creatinine and all intensities within a sample are expressed relative to the creatinine intensity. Another way of looking at urine metabolomics data is by realizing that the ratios between peak intensities are the information-carrying features. This opens up possibilities to use another class of data analysis techniques designed to deal with such ratios: compositional data analysis. In this approach special transformations are defined to deal with the ratio problem. In essence, it comes down to using another distance measure than the Euclidian Distance that is used in the conventional analysis of metabolomics data. We will illustrate using this type of approach in combination with three-way methods (i.e. PARAFAC) to be used in cases where samples of some biological material are measured at multiple time points. Aim of the paper is to develop PARAFAC modeling of three-way metabolomics data in the context of compositional data and compare this with standard normalization techniques for the specific case of urine metabolomics data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Normalization of RNA-Seq data has proven essential to ensure accurate inferences and replication of findings. Hence, various normalization methods have been proposed for various technical artifacts that can be present in high-throughput sequencing transcriptomic studies. In this study, we set out to compare the widely used library size normalization methods (UQ, TMM, and RLE) and across sample normalization methods (SVA, RUV, and PCA) for RNA-Seq data using publicly available data from The Cancer Genome Atlas (TCGA) cervical cancer study. Additionally, an extensive simulation study was completed to compare the performance of the across sample normalization methods in estimating technical artifacts. Lastly, we investigated the effect of reduction in degrees of freedom in the normalized data and their impact on downstream differential expression analysis results. Based on this study, the TMM and RLE library size normalization methods give similar results for CESC dataset. In addition, the simulated datasets results show that the SVA (“BE”) method outperforms the other methods (SVA “Leek”, PCA) by correctly estimating the number of latent artifacts. Moreover, ignoring the loss of degrees of freedom due to normalization results in an inflated type I error rates. We recommend adjusting not only for library size differences but also the assessment of known and unknown technical artifacts in the data, and if needed, complete across sample normalization. In addition, we suggest that one includes the known and estimated latent artifacts in the design matrix to correctly account for the loss in degrees of freedom, as opposed to completing the analysis on the post-processed normalized data.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Analyzing big data is a task encountered across disciplines. Addressing the challenges inherent in dealing with big data necessitate solutions that cover its three defining properties: volume, variety, and velocity. However, what is less understood is the treatment of the data that must be completed even before any analysis can begin. Specifically, there is often a non-trivial amount of time and resources that are utilized to the end of retrieving and preprocessing big data. This problem, known collectively as data integration, is a term frequently used for the general problem of taking data in some initial form and transforming it into a desired form. Examples of this include the rearranging of fields, changing the form of expression of one or more fields, altering the boundary notation of records and/or fields, encrypting or decrypting records and/or fields, parsing non-record data and organizing it into a record-oriented form, etc. In this work, we present our progress in creating a benchmarking suite that characterizes a diverse set of data integration applications.
https://doi.org/10.4121/resource:terms_of_usehttps://doi.org/10.4121/resource:terms_of_use
This is a normalized dataset from the original RNAseq dataset downloaded from Genotype-Tissue Expression (GTEx) project: www.gtexportal.org: RNA-SeQCv1.1.8 gene rpkm Pilot V3 patch1. The data was used to analyze how tissue samples are related to each other in terms of gene expression data The data can be used to get insights in how gene expression levels behave in in the different human tissues.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Although the basic structure of logit-mixture models is well understood, important identification and normalization issues often get overlooked. This paper addresses issues related to the identification of parameters in logit-mixture models containing normally distributed error components associated with alternatives or nests of alternatives (normal error component logit mixture, or NECLM, models). NECLM models include special cases such as unrestricted, fixed covariance matrices; alternative-specific variances; nesting and cross-nesting structures; and some applications to panel data. A general framework is presented for determining which parameters are identified as well as what normalization to impose when specifying NECLM models. It is generally necessary to specify and estimate NECLM models at the levels, or structural, form. This precludes working with utility differences, which would otherwise greatly simplify the identification and normalization process. Our results show that identification is not always intuitive; for example, normalization issues present in logit-mixture models are not present in analogous probit models. To identify and properly normalize the NECLM, we introduce the equality condition, an addition to the standard order and rank conditions. The identifying conditions are worked through for a number of special cases, and our findings are demonstrated with empirical examples using both synthetic and real data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of samples per conditions in the Full and Validation data sets.
(1) qPCR Gene Expression Data The THP-1 cell line was sub-cloned and one clone (#5) was selected for its ability to differentiate relatively homogeneously in response to phorbol 12-myristate-13-acetate (PMA) (Sigma). THP-1.5 was used for all subsequent experiments. THP-1.5 cells were cultured in RPMI, 10% FBS, Penicillin/Streptomycin, 10mM HEPES, 1mM Sodium Pyruvate, 50uM 2-Mercaptoethanol. THP-1.5 were treated with 30ng/ml PMA over a time-course of 96h. Total cell lysates were harvested in TRIzol reagent at 1, 2, 4, 6, 12, 24, 48, 72, 96 hours, including an undifferentiated control. Undifferentiated cells were harvested in TRIzol reagent at the beginning of the LPS time-course. One biological replicate was prepared for each time point. Total RNA was purified from TRIzol lysates according to manufacturer’s instructions. Genespecific primer pairs were designed using Primer3 software, with an optimal primer size of 20 bases, amplification size of 140bp, and annealing temperature of 60°C. Primer sequences were designed for 2,396 candidate genes including four potential controls: GAPDH, beta actin (ACTB), beta-2-microglobulin (B2M), phosphoglycerate kinase 1 (PGK1). The RNA samples were reverse transcribed to produce cDNA and then subjected to quantitative PCR using SYBR Green (Molecular Probes) using the ABI Prism 7900HT system (Applied Biosystems, Foster City, CA, USA) with a 384-well amplification plate; genes for each sample were assayed in triplicate. Reactions were carried out in 20μL volumes in 384-well plates; each reaction contained: 0.5 U of HotStar Taq DNA polymerase (Qiagen) and the manufacturer’s 1× amplification buffer adjusted to a final concentration of 1mM MgCl2, 160μM dNTPs, 1/38000 SYBR Green I (Molecular Probes), 7% DMSO, 0.4% ROX Reference Dye (Invitrogen), 300 nM of each primer (forward and reverse), and 2μL of 40-fold diluted first-strand cDNA synthesis reaction mixture (12.5ng total RNA equivalent). Polymerase activation at 95ºC for 15 min was followed by 40 cycles of 15 s at 94ºC, 30 s at 60ºC, and 30 s at 72ºC. The dissociation curve analysis, which evaluates each PCR product to be amplified from single cDNA, was carried out in accordance with the manufacturer’s protocol. Expression levels were reported as Ct values. The large number of genes assayed and the replicates measures required that samples be distributed across multiple amplification plates, with an average of twelve plates per sample. Because it was envisioned that GAPDH would serve as a single-gene normalization control, this gene was included on each plate. All primer pairs were replicated in triplicates. Raw qPCR expression measures were quantified using Applied Biosystems SDS software and reported as Ct values. The Ct value represents the number of cycles or rounds of amplification required for the fluorescence of a gene or primer pair to surpass an arbitrary threshold. The magnitude of the Ct value is inversely proportional to the expression level so that a gene expressed at a high level will have a low Ct value and vice versa. Replicate Ct values were combined by averaging, with additional quality control constraints imposed by a standard filtering method developed by the RIKEN group for the preprocessing of their qPCR data. Briefly this method entails: 1. Sort the triplicate Ct values in ascending order: Ct1, Ct2, Ct3. Calculate differences between consecutive Ct values: difference1 = Ct2 – Ct1 and difference2 = Ct3 – Ct2. 2. Four regions are defined (where Region4 overrides the other regions): Region1: difference ≦ 0.2, Region2: 0.2 < difference ≦ 1.0, Region3: 1.0 < difference, Region4: one of the Ct values in the difference calculation is 40 If difference1 and difference2 fall in the same region, then the three replicate Ct values are averaged to give a final representative measure. If difference1 and difference2 are in different regions, then the two replicate Ct values that are in the small number region are averaged instead. This particular filtering method is specific to the data set we used here and does not represent a part of the normalization procedure itself; Alternate methods of filtering can be applied if appropriate prior to normalization. Moreover while the presentation in this manuscript has used Ct values as an example, any measure of transcript abundance, including those corrected for primer efficiency can be used as input to our data-driven methods. (2) Quantile Normalization Algorithm Quantile normalization proceeds in two stages. First, if samples are distributed across multiple plates, normalization is applied to all of the genes assayed for each sample to remove plate-to-plate effects by enforcing the same quantile distribution on each plate. Then, an overall quantile normalization is applied between samples, assuring that each sample has the same distribution of expression values as all of the other samples to be compared. A similar approach using quantile ormalization has been previously described in the context of microarray normalization. Briefly, our method entails the following steps: i) qPCR data from a single RNA sample are stored in a matrix M of dimension k (maximum number of genes or primer pairs on a plate) rows by p (number of plates) columns. Plates with differing numbers of genes are made equivalent by padded plates with missing values to constrain M to a rectangular structure. ii) Each column is sorted into ascending order and stored in matrix M’. The sorted columns correspond to the quantile distribution of each plate. The missing values are placed at the end of each ordered column. All calculations in quantile normalization are performed on non-missing values. iii) The average quantile distribution is calculated by taking the average of each row in M’. Each column in M’ is replaced by this average quantile distribution and rearranged to have the same ordering as the original row order in M. This gives the within-sample normalized data from one RNA sample. iv) Steps analogous to 1 – 3 are repeated for each sample. Between-sample normalization is performed by storing the within-normalized data as a new matrix N of dimension k (total number of genes, in our example k = 2,396) rows by n (number of samples) columns. Steps 2 and 3 are then applied to this matrix. (3) Rank-Invariant Set Normalization Algorithm We describe an extension of this method for use on qPCR data with any number of experimental conditions or samples in which we identify a set of stably-expressed genes from within the measured expression data and then use these to adjust expression between samples. Briefly, i) qPCR data from all samples are stored in matrix R of dimension g (total number of genes or primer pairs used for all plates) rows by s (total number of samples). ii) We first select gene sets that are rank-invariant across a single sample compared to a common reference. The reference may be chosen in a variety of ways, depending on the experimental design and aims of the experiment. As described in Tseng et al., the reference may be designated as a particular sample from the experiment (e.g. time zero in a time course experiment), the average or median of all samples, or selecting the sample which is closest to the average or median of all samples. Genes are considered to be rank-invariant if they retain their ordering or rank with respect to expression across the experimental sample versus the common reference sample. We collect sets of rank-invariant genes for all of the s pairwise comparisons, relative to a common reference. We take the intersection of all s sets to obtain the final set of rank-invariant genes that is used for normalization. iii) Let αj represent the average expression value of the rank-invariant genes in sample j. (α1, …, αs) then represents the vector of rank-invariant average expression values for all conditions 1 to s iv) We calculate the scale f The THP-1 cell line was sub-cloned and one clone (#5) was selected for its ability to differentiate relatively homogeneously in response to phorbol 12-myristate-13-acetate (PMA) (Sigma). THP-1.5 was used for all subsequent experiments. THP-1.5 cells were cultured in RPMI, 10% FBS, Penicillin/Streptomycin, 10mM HEPES, 1mM Sodium Pyruvate, 50uM 2-Mercaptoethanol. THP-1.5 were treated with 30ng/ml PMA over a time-course of 96h. Total cell lysates were harvested in TRIzol reagent at 1, 2, 4, 6, 12, 24, 48, 72, 96 hours, including an undifferentiated control. Total RNA was purifed from TRIzol lysates according to manufacturer’s instructions. The RNA samples were reverse transcribed to produce cDNA and then subjected to quantitative PCR using SYBR Green (Molecular Probes) using the ABI Prism 7900HT system (Applied Biosystems, Foster City, CA,USA) with a 384-well amplification plate; genes for each sample were assayed in triplicate.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).
This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.
This sample dataset also includes files relative to metadata, static data, normalization, and plotting.
To use the data, clone the corresponding repository and unzip this zip file in the data folder.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MEDDOPROF corpus is a collection of 2000 clinical cases from over 20 different specialties annotated with professions, employment statuses and other work-related activities. It is used for the MEDDOPROF Shared Task on occupations and employment status detection and normalization in Spanish medical documents, which will be celebrated as part of IberLEF 2021.
The sample set is composed of 15 clinical cases extracted from the training set from four different specialties: radiology, oncology, psychiatry and occupational health. The files are distributed as follows:
For the subtask 1 (MEDDOPROF-NER), annotations are distributed in Brat standoff format with PROFESION/SITUACION_LABORAL tags only.
For the subtask 2 (MEDDOPROF-CLASS), annotations are distributed in Brat standoff format with PACIENTE/FAMILIAR/SANITARIO/OTROS tags only.
For the subtask 3 (MEDDOPROF-NORM), annotations are distributed in a tab-separated file (TSV) with a code column that includes the mapping of entities to ESCO and SNOMED-CT.
For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
The label-free quantitative mass spectrometry methods, in particular the SWATH-MS approach, have gained popularity and became a powerful technique for comparison of large datasets. In the present work, it is introduced the use of recombinant proteins as internal standards for untargeted label-free methods. The proposed internal standard strategy reveals a similar intragroup normalization capacity when compared with the most common normalization methods, with the additional advantage of maintaining the overall proteome changes between groups (which are lost using the methods referred above). Thus, proving to be able to maintain a good performance even when large qualitative and quantitative differences in sample composition are observed, such as the ones induced by biological regulation (as observed in secretome and other biofluids’ analyses) or by enrichment approaches (such as immunopurifications). Moreover, it corresponds to a cost-effective alternative, easier to implement than the current stable-isotope labeling internal standards, therefore being an appealing strategy for large quantitative screening, as clinical cohorts for biomarker discovery.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Data used in the experiments described in:
Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.
https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf
(https://www.linguistics.rub.de/konvens16/)
Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.
There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850 - 1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)
The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/).
The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example of normalizing the word ‘aaaaaaannnnnndddd’ using the proposed method and four other normalization methods.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a GATE-simulated data using a cylindrical PET scanner (based on the GATE cylindrical PET example) and a NEMA-like phantom. Included are the sinograms created by OMEGA software, for both TOF and non-TOF cases as mat-files as well as normalization correction coefficients. You can open these mat-files in MATLAB, Octave, Python, Julia or in practically any other language. This data can be also used as a testing data for OMEGA. Included is also the ground truth image (i.e. the source image), attenuation image as created by GATE, as well as the original ROOT files. The ROOT file package also contains the original macros that give details on the scanner and the phantom.
The sinogram data contains several different sinograms. raw_SinM is the raw sinogram with no modifications, SinM contains normalization and randoms correction precorrected (not available for TOF data), SinDelayed contains delayed coincidences, SinTrues the trues, SinRandoms the true randoms, SinScatter the true scattered photons, appliedCorrections show the corrections applied to SinM and RandProp and ScatterProp show whether variance reduction or smoothing was applied to the delayed coincidences or (not present) scatter estimation data. The attenuation data is already correctly scaled and is saved as a MetaImage file.
Also included is the ground truth, or original source, image. This is saved as the variable C. RA is the randoms image while SC is the scatter image. The latter two are in singles mode, i.e. they show the locations of the photons that either were random (two different events) or scattered along the way.
The normalization data works for other measurements with the same scanner as long as the sinogram dimensions remain the same. OMEGA will automatically use the normalization data if it's present in the mat-files folder.
This new version includes also the detector coordinates for each measurement. This is mainly for OMEGA-example showcasing the use of custom data. See, for example, https://github.com/villekf/OMEGA/blob/master/main-files/custom_detector_exampleSimple.m
Normalization of RNA-sequencing data is essential for accurate downstream inference, but the assumptions upon which most methods are based do not hold in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of scRNA-seq data. Total 183 single cells (92 H1 cells, 91 H9 cells), sequenced twice, were used to evaluate SCnorm in normalizing single cell RNA-seq experiments. Total 48 bulk H1 samples were used to compare bulk and single cell properties. For single-cell RNA-seq, the identical single-cell indexed and fragmented cDNA were pooled at 96 cells per lane or at 24 cells per lane to test the effects of sequencing depth, resulting in approximately 1 million and 4 million mapped reads per cell in the two pooling groups, respectively.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).
Training dataset: 64,665 sentence pairs Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.
All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.