11 datasets found
  1. Data from: A systematic evaluation of normalization methods and probe...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    Hospital for Sick Children
    University of Toronto
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which consists of the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson’s correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that poor probe reliability largely reflects limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-Estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 as part of a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals (13 men and 11 women) were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point); the automated equipment had been discontinued, but the same commercial reagents were used. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing (WGS) data are also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlapped with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. When pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the probes flagged by the two analyses were combined and removed from the data.
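
    The probe-level filtering criteria above can be summarised with a small, hedged sketch. The authors worked in R/Bioconductor (RnBeads, wateRmelon, SeSAMe); the Python/NumPy version below only illustrates the stated thresholds (detection p-value 0.01, at most 5% missing values per probe, at most 5% of samples with bead count < 3) on hypothetical matrices, not the study's pipeline or data.

        import numpy as np

        # Hypothetical (probes x samples) matrices standing in for the IDAT-derived data;
        # thresholds mirror the text above.
        rng = np.random.default_rng(0)
        n_probes, n_samples = 1000, 64
        detection_p = rng.exponential(0.002, size=(n_probes, n_samples))
        betas = rng.uniform(0, 1, size=(n_probes, n_samples))
        beads = rng.poisson(12, size=(n_probes, n_samples))

        betas[detection_p > 0.01] = np.nan            # failed detections become missing values
        frac_missing = np.isnan(betas).mean(axis=1)   # per-probe fraction of missing samples
        frac_low_beads = (beads < 3).mean(axis=1)     # per-probe fraction with bead count < 3

        keep = (frac_missing <= 0.05) & (frac_low_beads <= 0.05)
        print(f"probes kept: {keep.sum()} / {n_probes}")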

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This was followed by removal of probes that did not pass the previous QC and had not already been removed by pOOBAH; SeSAMe 2 therefore has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effect that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
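
    A minimal sketch of the replicate-based comparison metric described above: for each normalization method, compute the absolute beta-value difference between the two members of each replicate pair and summarise it. The matrices and pair indices below are hypothetical placeholders, not the study's data.

        import numpy as np

        # Hypothetical beta matrices (probes x samples) for two methods, plus column
        # indices of the 16 replicate pairs (first and second member of each pair).
        rng = np.random.default_rng(1)
        betas_by_method = {"raw": rng.uniform(0, 1, (1000, 32)),
                           "sesame2": rng.uniform(0, 1, (1000, 32))}
        pair_a = np.arange(0, 32, 2)
        pair_b = np.arange(1, 32, 2)

        for method, b in betas_by_method.items():
            abs_diff = np.abs(b[:, pair_a] - b[:, pair_b])   # per-probe, per-pair |beta| difference
            print(method, "mean absolute difference =", round(abs_diff.mean(), 4))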

  2. Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach (2023). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. http://doi.org/10.3389/fgene.2020.00594.s002
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq, and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between the treated and untreated groups. For all FC levels, the specificity of the UQ normalization was greater than 0.84 and the sensitivity greater than 0.90, except for the no-change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite its assumption that the majority of genes are unchanged, the DESeq2 scaling-factor normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥ 2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
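
    As an illustration of the UQ method highlighted above, below is a minimal sketch of one common formulation of Upper Quartile normalization (divide each sample by the 75th percentile of its non-zero counts, then rescale to a common level). The counts are simulated placeholders, and the exact scaling used in the paper may differ.

        import numpy as np

        # Hypothetical genes x samples matrix of TempO-Seq counts.
        rng = np.random.default_rng(2)
        counts = rng.negative_binomial(5, 0.1, size=(3000, 12)).astype(float)

        # Upper Quartile scaling: per-sample 75th percentile of non-zero counts.
        uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
        norm_counts = counts / uq * uq.mean()
        print(norm_counts.sum(axis=0)[:3])   # library sizes after normalization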

  3. DEMANDE Dataset

    • zenodo.org
    • researchdiscovery.drexel.edu
    zip
    Updated Apr 13, 2023
    Cite
    Joseph A. Gallego-Mejia; Fabio A Gonzalez (2023). DEMANDE Dataset [Dataset]. http://doi.org/10.5281/zenodo.7822851
    Available download formats: zip
    Dataset updated
    Apr 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joseph A. Gallego-Mejia; Fabio A Gonzalez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the features and probabilities of ten different functions. Each dataset is saved using numpy arrays.

    • The data set Arc corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\mathcal{N}(x_2\mid 0,4)\,\mathcal{N}(x_1\mid 0.25x_2^2,1)$$, where $$\mathcal{N}(u\mid\mu,\sigma^2)$$ denotes the density function of a normal distribution with mean $$\mu$$ and variance $$\sigma^2$$. Papamakarios et al. (2017) used this data set to evaluate their neural density estimation methods.

    • The data set Potential 1 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\frac{1}{2}\left(\frac{\|x\|-2}{0.4}\right)^2 - \ln\left(\exp\left\{-\frac{1}{2}\left[\frac{x_1-2}{0.6}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_1+2}{0.6}\right]^2\right\}\right)$$, with a normalizing constant of approximately 6.52 calculated by Monte Carlo integration.

    • The data set Potential 2 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.4}\right]^2$$, where $$w_1(x)=\sin\left(\frac{2\pi x_1}{4}\right)$$, with a normalizing constant of approximately 8 calculated by Monte Carlo integration.

    • The data set Potential 3 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=-\ln\left(\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.35}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)+w_2(x)}{0.35}\right]^2\right\}\right)$$, where $$w_1(x)=\sin\left(\frac{2\pi x_1}{4}\right)$$ and $$w_2(x)=3\exp\left\{-\frac{1}{2}\left[\frac{x_1-1}{0.6}\right]^2\right\}$$, with a normalizing constant of approximately 13.9 calculated by Monte Carlo integration.

    • The data set Potential 4 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=-\ln\left(\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.4}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)+w_3(x)}{0.35}\right]^2\right\}\right)$$, where $$w_1(x)=\sin\left(\frac{2\pi x_1}{4}\right)$$, $$w_3(x)=3\,\sigma\!\left(\left[\frac{x_1-1}{0.3}\right]^2\right)$$ and $$\sigma(x)=\frac{1}{1+\exp(x)}$$, with a normalizing constant of approximately 13.9 calculated by Monte Carlo integration.

    • The data set 2D mixture corresponds to a two-dimensional random sample drawn from the random vector $$X=(X_1,X_2)$$ with probability density function $$f(x)=\frac{1}{2}\mathcal{N}(x\mid\mu_1,\Sigma_1)+\frac{1}{2}\mathcal{N}(x\mid\mu_2,\Sigma_2)$$, with means and covariance matrices $$\mu_1=[1,-1]^T$$, $$\mu_2=[-2,2]^T$$, $$\Sigma_1=\left[\begin{array}{cc} 1 & 0 \\ 0 & 2 \end{array}\right]$$ and $$\Sigma_2=\left[\begin{array}{cc} 2 & 0 \\ 0 & 1 \end{array}\right]$$.

    • The data set 10D mixture corresponds to a 10-dimensional random sample drawn from the random vector $$X=(X_1,\cdots,X_{10})$$ with a mixture of four diagonal normal probability density functions $$\mathcal{N}(X_i\mid\mu_i,\sigma_i)$$, where each $$\mu_i$$ is drawn uniformly in the interval $$[-0.5,0.5]$$ and each $$\sigma_i$$ is drawn uniformly in the interval $$[-0.01,0.5]$$. Each diagonal normal probability density has the same mixing probability of $$1/4$$.
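
    For example, the Arc density can be sampled directly from its factorised form above, since $$X_2$$ is marginally normal and $$X_1$$ is normal conditional on $$X_2$$. A minimal sketch (the output file name is hypothetical):

        import numpy as np

        # Sample the "Arc" density: X2 ~ N(0, 4) (variance 4), X1 | X2 ~ N(0.25 * X2**2, 1).
        rng = np.random.default_rng(3)
        n = 10_000
        x2 = rng.normal(loc=0.0, scale=2.0, size=n)      # std = sqrt(4)
        x1 = rng.normal(loc=0.25 * x2**2, scale=1.0)     # conditional mean 0.25 * x2^2
        sample = np.column_stack([x1, x2])
        np.save("arc_sample.npy", sample)                # the dataset stores numpy arrays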

  4. Landsat 8-9 Normalized Difference Vegetation Index (NDVI) Colorized

    • hub.arcgis.com
    Updated Aug 11, 2016
    Cite
    Esri (2016). Landsat 8-9 Normalized Difference Vegetation Index (NDVI) Colorized [Dataset]. https://hub.arcgis.com/datasets/f6bb66f1c11e467f9a9a859343e27cf8
    Dataset updated
    Aug 11, 2016
    Dataset authored and provided by
    Esri (http://esri.com/)
    Area covered
    Description

    This layer includes Landsat 8 and 9 imagery rendered on-the-fly as NDVI Colorized for use in visualization and analysis. This layer is time enabled and includes a number of band combinations and indices rendered on demand. The imagery includes eight multispectral bands from the Operational Land Imager (OLI) and two bands from the Thermal Infrared Sensor (TIRS). It is updated daily with new imagery directly sourced from the USGS Landsat collection on AWS.

    Geographic Coverage
    • Global land surface.
    • Polar regions are available in polar-projected Imagery Layers: Landsat Arctic Views and Landsat Antarctic Views.

    Temporal Coverage
    • This layer is updated daily with new imagery.
    • Working in tandem, Landsat 8 and 9 revisit each point on Earth's land surface every 8 days.
    • Most images collected from January 2015 to present are included.
    • Approximately 5 images for each path/row from 2013 and 2014 are also included.

    Product Level
    • The Landsat 8 and 9 imagery in this layer is comprised of Collection 2 Level-1 data.
    • The imagery has Top of Atmosphere (TOA) correction applied, using the radiometric rescaling coefficients provided by the USGS.
    • The TOA reflectance values (ranging 0–1 by default) are scaled using a range of 0–10,000.

    Image Selection/Filtering
    • A number of fields are available for filtering, including Acquisition Date, Estimated Cloud Cover, and Product ID.
    • To isolate and work with specific images, either use the ‘Image Filter’ to create custom layers or add a ‘Query Filter’ to restrict the default layer display to a specified image or group of images.

    Visual Rendering
    • Default rendering is NDVI Colorized, calculated as (b5 - b4) / (b5 + b4) with a colormap applied.
    • Raster Functions enable on-the-fly rendering of band combinations and calculated indices from the source imagery.
    • The DRA version of each layer enables visualization of the full dynamic range of the images.
    • Other pre-defined Raster Functions can be selected via the renderer drop-down, or custom functions can be created.
    • Pre-defined functions: Natural Color with DRA, Agriculture with DRA, Geology with DRA, Color Infrared with DRA, Bathymetric with DRA, Short-wave Infrared with DRA, Normalized Difference Moisture Index Colorized, NDVI Raw, NDVI Colorized, NBR Raw.
    • 15 meter Landsat Imagery Layers are also available: Panchromatic and Pansharpened.

    Multispectral Bands
    The table below lists all available multispectral OLI bands. NDVI Colorized consumes bands 4 and 5.

    Band | Description | Wavelength (µm) | Spatial Resolution (m)
    1 | Coastal aerosol | 0.43–0.45 | 30
    2 | Blue | 0.45–0.51 | 30
    3 | Green | 0.53–0.59 | 30
    4 | Red | 0.64–0.67 | 30
    5 | Near Infrared (NIR) | 0.85–0.88 | 30
    6 | SWIR 1 | 1.57–1.65 | 30
    7 | SWIR 2 | 2.11–2.29 | 30
    8 | Cirrus (in OLI this is band 9) | 1.36–1.38 | 30
    9 | QA Band (available with Collection 1)* | NA | 30
    *More about the Quality Assessment Band

    TIRS Bands
    Band | Description | Wavelength (µm) | Spatial Resolution (m)
    10 | TIRS1 | 10.60–11.19 | 100* (30)
    11 | TIRS2 | 11.50–12.51 | 100* (30)
    *TIRS bands are acquired at 100 meter resolution, but are resampled to 30 meter in the delivered data product.

    Additional Usage Notes
    • Image exports are limited to 4,000 columns x 4,000 rows per request.
    • This dynamic imagery layer can be used in Web Maps and ArcGIS Pro as well as in web and mobile applications using the ArcGIS REST APIs.
    • WCS and WMS compatibility means this imagery layer can be consumed as WCS or WMS services.
    • The Landsat Explorer App is another way to access and explore the imagery.
    • This layer is part of a larger collection of Landsat Imagery Layers that you can use to perform a variety of mapping analysis tasks.

    Data Source
    Landsat imagery is sourced from the U.S. Geological Survey (USGS) and the National Aeronautics and Space Administration (NASA). Data is hosted by Amazon Web Services as part of their Public Data Sets program. For information, see Landsat 8 and Landsat 9.
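
    A minimal sketch of the NDVI calculation quoted above, (b5 - b4) / (b5 + b4), applied to small hypothetical reflectance arrays for the Red (band 4) and NIR (band 5) bands:

        import numpy as np

        # Hypothetical TOA reflectance tiles for Red (band 4) and NIR (band 5).
        red = np.array([[0.12, 0.30], [0.08, 0.25]])
        nir = np.array([[0.45, 0.32], [0.50, 0.27]])

        # NDVI = (NIR - Red) / (NIR + Red), guarding against division by zero.
        ndvi = (nir - red) / np.clip(nir + red, 1e-6, None)
        print(ndvi)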

  5. SILO Patched Point data for Narrabri (54120) and Gunnedah (55023) stations...

    • researchdata.edu.au
    • data.gov.au
    • +2more
    Updated Apr 8, 2016
    Cite
    Bioregional Assessment Program (2016). SILO Patched Point data for Narrabri (54120) and Gunnedah (55023) stations in the Namoi subregion [Dataset]. https://researchdata.edu.au/silo-patched-point-namoi-subregion/2980714
    Dataset updated
    Apr 8, 2016
    Dataset provided by
    data.gov.au
    Authors
    Bioregional Assessment Program
    Area covered
    Narrabri, Gunnedah, Namoi River
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

    SILO is a Queensland Government database containing continuous daily climate data for Australia from 1889 to present. Gridded datasets are constructed by spatially interpolating the observed point data. Continuous point datasets are constructed by supplementing the available point data with interpolated estimates when observed data are missing.

    Purpose

    SILO provides climate datasets that are ready to use. Raw observational data typically contain missing data and are only available at the location of meteorological recording stations. SILO provides point datasets with no missing data and gridded datasets which cover mainland Australia and some islands.

    Dataset History

    Lineage statement:

    (A) Processing System Version History

    • Prior to 2001

    The interpolation system used the algorithm detailed in Jeffrey et al. [1].

    • 2001-2009

    The normalisation procedure was modified. Observational rainfall, when accumulated over a sufficient period and raised to an appropriate fractional power, is (to a reasonable approximation) normally distributed. In the original procedure the fractional power was fixed at 0.5, and a normal distribution was fitted to the transformed data using a maximum likelihood technique. A Kolmogorov-Smirnov test was used to test the goodness of fit, with a threshold value of 0.8. In 2001 the procedure was modified to allow the fractional power to vary between 0.4 and 0.6. The normalisation parameters (fractional power, mean and standard deviation) at each station were spatially interpolated using a thin plate smoothing spline. (A code sketch of this power-transform normalisation appears after the version history below.)

    • 2009-2011

    The normalisation procedure was modified. The Kolmogorov-Smirnov test was removed, enabling normalisation parameters to be computed for all stations having sufficient data. Previously, parameters were only computed for those stations having data that were adequately modelled by a normal distribution, as determined by the Kolmogorov-Smirnov test.

    • January 2012 - November 2012

    The normalisation procedure was modified:

      • The Kolmogorov-Smirnov test was reintroduced, with a threshold value of 0.1.

      • Data from the Bellenden Ker Top station were included in the computation of normalisation parameters. The station was previously omitted on the basis of having insufficient data. It was forcibly included to ensure the steep rainfall gradient in the region was reflected in the normalisation parameters.

      • The elevation data used when interpolating normalisation parameters were modified. Previously a mean elevation was assigned to each station, taken from the nearest grid cell in a 0.05° × 0.05° digital elevation model. The procedure was modified to use the actual station elevation instead of the mean. In mountainous regions the discrepancy was substantial, and cross-validation tests showed a significant improvement in error statistics.

      • The station data are normalised using: (i) a power parameter extracted from the nearest pixel in the gridded power surface, obtained by interpolating the power parameters fitted at station locations using a maximum likelihood algorithm; and (ii) mean and standard deviation parameters fitted at station locations using a smoothing spline. Mean and standard deviation parameters were fitted at the subset of stations having at least 40 years of data, using a maximum likelihood algorithm. The fitted data were then spatially interpolated to construct: (a) gridded mean and standard deviation surfaces (for use in a subsequent de-normalisation procedure); and (b) interpolated estimates of the parameters at all station locations (not just the subset having long data records). The parameters fitted using maximum likelihood (at the subset of stations having long data records) may differ from those fitted by the interpolation algorithm, owing to the smoothing nature of the spline algorithm used. Previously, station data were normalised using mean and standard deviation parameters taken from the nearest pixel in the respective mean and standard deviation surfaces.

    • November 2012 - May 2013

    The algorithm used for selecting monthly rainfall data for interpolation was modified. Prior to November 2012, the system was as follows:

      • Accumulated monthly rainfall was computed by the Bureau of Meteorology;

      • Rainfall accumulations spanning the end of a month were assigned to the last month included in the accumulation period;

      • A monthly rainfall value was provided for all stations which submitted at least one daily report. Zero rainfall was assumed for all missing values; and

      • SILO imposed a complex set of ad-hoc rules which aimed to identify stations which had ceased reporting in real time. In such cases it would not be appropriate to assume zero rainfall for days when a report was not available. The rules were only applied when processing data for January 2001 and onwards.

    In November 2012 a modified algorithm was implemented:

      • SILO computed the accumulated monthly rainfall by summing the daily reports;

      • Rainfall accumulations spanning the end of a month were discarded;

      • A monthly rainfall value was not computed for a given station if any day throughout the month was not accounted for, either through a daily report or an accumulation; and

      • The SILO ad-hoc rules were not applied.

    • May 2013 - current

    The algorithm used for selecting monthly rainfall data for interpolation was modified. The modified algorithm is only applied to datasets for the period October 2001 - current and is as follows:

      • SILO computes the accumulated monthly rainfall by summing the daily reports;

      • Rainfall accumulations spanning the end of a month are pro-rata distributed onto the two months included in the accumulation period;

      • A monthly rainfall value is computed for all stations which have at least 21 days accounted for throughout the month. Zero rainfall is assumed for all missing values; and

      • The SILO ad-hoc rules are applied when processing data for January 2001 and onwards.

    Datasets for the period January 1889-September 2001 are prepared using the system that was in effect prior to November 2012.
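
    The code sketch referenced in the 2001-2009 entry above: a hedged illustration of the power-transform normalisation, fitting a normal distribution by maximum likelihood to rainfall raised to powers between 0.4 and 0.6 and assessing the fit with a Kolmogorov-Smirnov test. The rainfall series and the grid of candidate powers are hypothetical; SILO's actual implementation is not reproduced here.

        import numpy as np
        from scipy import stats

        # Hypothetical 20 years of monthly rainfall totals at one station.
        rng = np.random.default_rng(4)
        rainfall = rng.gamma(shape=2.0, scale=30.0, size=240)

        best = None
        for power in np.arange(0.4, 0.61, 0.05):
            transformed = rainfall ** power
            mu, sigma = transformed.mean(), transformed.std(ddof=0)        # normal MLE
            ks = stats.kstest(transformed, "norm", args=(mu, sigma))       # goodness of fit
            if best is None or ks.statistic < best[3].statistic:
                best = (power, mu, sigma, ks)

        power, mu, sigma, ks = best
        print(f"power={power:.2f}, mean={mu:.2f}, sd={sigma:.2f}, KS p={ks.pvalue:.3f}")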

    Lineage statement:

    (A) Processing System Version History

    No changes have been made to the processing system since SILO's inception.

    (B) Major Historical Data Updates

    • All observational data and station coordinates were updated in 2009.

    • Station coordinates were updated on 26 January 2012.

    Process step:

    The observed data are interpolated using a tri-variate thin plate smoothing spline, with latitude, longitude and elevation as independent variables [4]. A two-pass interpolation system is used. All available observational data are interpolated in the first pass and residuals computed for all data points. The residual is the difference between the observed and interpolated values. Data points with high residuals may be indicative of erroneous data and are excluded from a subsequent interpolation which generates the final gridded surface. The surface covers the region 112˚E - 154˚E, 10˚S - 44˚S on a regular 0.05˚ × 0.05˚ grid and is restricted to land areas on mainland Australia and some islands.
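
    A hedged sketch of the two-pass interpolation described above, using SciPy's thin-plate-spline radial basis interpolator as a stand-in for the tri-variate smoothing spline; the station data, the temperature-like observation field and the residual threshold are synthetic assumptions, not SILO's implementation.

        import numpy as np
        from scipy.interpolate import RBFInterpolator

        # Synthetic stations: longitude, latitude, elevation (m), plus a noisy observation field.
        rng = np.random.default_rng(5)
        coords = np.column_stack([rng.uniform(112, 154, 300),
                                  rng.uniform(-44, -10, 300),
                                  rng.uniform(0, 1500, 300)])
        obs = 20 - 0.0065 * coords[:, 2] + rng.normal(0, 1, 300)

        # Pass 1: fit a smoothing thin-plate spline and compute residuals at the stations.
        spline = RBFInterpolator(coords, obs, kernel="thin_plate_spline", smoothing=1.0)
        residuals = obs - spline(coords)

        # Pass 2: refit after excluding points with large residuals (suspect data).
        keep = np.abs(residuals) < 3 * residuals.std()
        final = RBFInterpolator(coords[keep], obs[keep],
                                kernel="thin_plate_spline", smoothing=1.0)
        print(final(np.array([[150.0, -30.0, 250.0]])))   # value at one grid location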

    Gridded datasets for the period 1957-current are obtained by interpolation of the raw data. Gridded datasets for the period 1889-1956 were constructed using an anomaly interpolation technique. The daily departure from the long term mean is interpolated, and the gridded dataset is constructed by adding the gridded anomaly to the gridded long term mean. The long term means were constructed using data from the period 1957-2001. The anomaly interpolation technique is described in Rayner et al. [6].

    The observed and interpolated datasets evolve as new data becomes available and the existing data are improved through quality control procedures. Modifications gradually decrease over time, with most datasets undergoing little change 12 months after the date of observation.

    Dataset Citation

    "Queensland Department of Science, Information Technology, Innovation and the Arts" (2013) SILO Patched Point data for Narrabri (54120) and Gunnedah (55023) stations in the Namoi subregion. Bioregional Assessment Source Dataset. Viewed 29 September 2017, http://data.bioregionalassessments.gov.au/dataset/0a018b43-58d3-4b9e-b339-4dae8fd54ce8.

  6. Data from: CsEnVi Pairwise Parallel Corpora

    • lindat.cz
    • live.european-language-grid.eu
    • +1more
    Updated Nov 10, 2015
    Cite
    Duc Tam Hoang; Ondřej Bojar (2015). CsEnVi Pairwise Parallel Corpora [Dataset]. https://lindat.cz/repository/xmlui/handle/11234/1-1595?locale-attribute=cs
    Dataset updated
    Nov 10, 2015
    Authors
    Duc Tam Hoang; Ondřej Bojar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:

    • OPUS, the open parallel corpus, is a growing multilingual corpus of translated open-source documents. The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series. The bitexts are paraphrases of each other's meaning rather than strict translations.

    • TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.

    The size of the original corpora collected from OPUS and TED talks is as follows:

                    CS/VI                EN/VI
    Sentence        1337199/1337199      2035624/2035624
    Word            9128897/12073975     16638364/17565580
    Unique word     224416/68237         91905/78333

    We improve the quality of the corpora in two steps: normalizing and filtering.

    In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. The sequences of dots are distributed differently in the source and the target side. Removing the sequence of dots, along with a number of other normalization rules, improves the quality of the alignment significantly. In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
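
    For illustration, below is a hypothetical normalization rule of the kind described above (removing continuation-dot sequences and collapsing whitespace); the actual rule set and the CzEng-based filtering are more extensive than this sketch.

        import re

        # Drop subtitle continuation dots ("..." sequences) and tidy whitespace.
        DOTS = re.compile(r"\.{2,}")

        def normalize_subtitle(line: str) -> str:
            line = DOTS.sub(" ", line)
            return re.sub(r"\s+", " ", line).strip()

        print(normalize_subtitle("...tomorrow we leave... ...at dawn"))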

    The size of cleaned corpora as published is as follows:

                    CS/VI                EN/VI
    Sentence        1091058/1091058      1113177/1091058
    Word            6718184/7646701      8518711/8140876
    Unique word     195446/59737         69513/58286

    The corpora are used as training data in [2].

    References:
    [1] Ondřej Bojar, Zdeněk Žabokrtský, et al. (2012). The Joy of Parallelism with CzEng 1.0. In Proceedings of LREC 2012. ELRA, Istanbul, Turkey.
    [2] Duc Tam Hoang and Ondřej Bojar (2015). The Prague Bulletin of Mathematical Linguistics, Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462, 9/2015.

  7. Data from: Citation data of arXiv eprints and the associated...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 7, 2024
    Cite
    Keisuke Okamura (2024). Citation data of arXiv eprints and the associated quantitatively-and-temporally normalised impact metrics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5803961
    Dataset updated
    Jan 7, 2024
    Dataset provided by
    Keisuke Okamura
    Hitoshi Koshiba
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data collection

    This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), plus the data on their citations and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentary, i.e. every electronic material that has been posted on arXiv.

    The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020, where the metadata included the eprint’s title, authors, abstract, subject category and arXiv ID (the arXiv’s original eprint identifier). In addition, the associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) from 24th January 2020 to 7th February 2020, containing the citation information into and out of the arXiv eprints and their published versions (if applicable). Here, whether an eprint has been published in a journal or by other means is assumed to be inferrable, albeit indirectly, from the status of the digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received c_pre and c_pub citations up to the data retrieval date (7th February 2020), before and after it was assigned a DOI respectively, then the citation count of this eprint is recorded in the Semantic Scholar dataset as c_pre + c_pub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as a key variable for merging the two datasets.

    The classification of research disciplines is based on that described in the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/). There, the arXiv subject categories are aggregated into several disciplines, of which we restrict our attention to the following six disciplines: Astrophysics (‘astro-ph’), Computer Science (‘comp-sci’), Condensed Matter Physics (‘cond-mat’), High Energy Physics (‘hep’), Mathematics (‘math’) and Other Physics (‘oth-phys’), which collectively accounted for 98% of all the eprints. Those eprints tagged to multiple arXiv disciplines were counted independently for each discipline. Due to this overlapping feature, the current dataset contains a cumulative total of 2,011,216 eprints.

    Some general statistics and visualisations per research discipline are provided in the original article (Okamura, to appear), where the validity and limitations associated with the dataset are also discussed.

    Description of columns (variables)

    arxiv_id : arXiv ID

    category : Research discipline

    pre_year : Year of posting v1 on arXiv

    pub_year : Year of DOI acquisition

    c_tot : No. of citations acquired during 1991–2019

    c_pre : No. of citations acquired before and including the year of DOI acquisition

    c_pub : No. of citations acquired after the year of DOI acquisition

    c_yyyy (yyyy = 1991, …, 2019) : No. of citations acquired in the year yyyy (with ‘yyyy’ running from 1991 to 2019)

    gamma : The quantitatively-and-temporally normalised citation index

    gamma_star : The quantitatively-and-temporally standardised citation index

    Note: The definition of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and that of the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, to appear). Both indices can be used to compare the citational impact of papers/eprints published in different research disciplines at different times.

    Data files

    A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.
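
    A minimal sketch of loading the CSV file and summarising citations and the normalised index per discipline; the column names are taken from the variable list above, and the file is assumed to have been downloaded locally.

        import pandas as pd

        # Load the published CSV and summarise per research discipline.
        df = pd.read_csv("arXiv_impact.csv")
        summary = (df.groupby("category")
                     .agg(n_eprints=("arxiv_id", "size"),
                          mean_citations=("c_tot", "mean"),
                          mean_gamma=("gamma", "mean")))
        print(summary)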

  8. Data from: Advancing Fifth Percentile Hazard Concentration Estimation Using...

    • acs.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Alexander K. Dhond; Mace G. Barron (2023). Advancing Fifth Percentile Hazard Concentration Estimation Using Toxicity-Normalized Species Sensitivity Distributions [Dataset]. http://doi.org/10.1021/acs.est.2c06857.s005
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Alexander K. Dhond; Mace G. Barron
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The species sensitivity distribution (SSD) is an internationally accepted approach to hazard estimation using the probability distribution of toxicity values that is representative of the sensitivity of a group of species to a chemical. Application of SSDs in ecological risk assessment has been limited by insufficient taxonomic diversity of species to estimate a statistically robust fifth percentile hazard concentration (HC5). We used the toxicity-normalized SSD (SSDn) approach (Lambert, F. N.; Raimondo, S.; Barron, M. G. Environ. Sci. Technol. 2022, 56, 8278–8289), modified to include all possible normalizing species, to estimate HC5 values from acute toxicity data for groups of carbamate and organophosphorus insecticides. We computed the mean and variance of single-chemical HC5 values for each chemical using leave-one-out (LOO) variance estimation and compared them to SSDn and conventionally estimated HC5 values. SSDn-estimated HC5 values showed low uncertainty and high accuracy compared to single-chemical SSDs when including all possible combinations of normalizing species within the chemical-taxa grouping (carbamate-all species, carbamate-fish, organophosphate-fish, and organophosphate-invertebrate). The SSDn approach is recommended for estimating HC5 values for compounds with insufficient species diversity for HC5 computation or with high uncertainty in estimated single-chemical HC5 values. Furthermore, the LOO variance approach provides SSD practitioners with a simple computational method to estimate confidence intervals around an HC5 estimate that is nearly identical to the conventionally estimated HC5.
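
    For orientation, below is a simplified single-chemical SSD sketch (not the SSDn method itself): fit a log-normal distribution to hypothetical species toxicity values, take the 5th percentile as the HC5, and compute a leave-one-out spread. The distributional choice and the data are assumptions for illustration only.

        import numpy as np
        from scipy import stats

        # Hypothetical acute toxicity values (e.g. LC50s in ug/L) for several species.
        toxicity = np.array([12.0, 35.0, 5.1, 88.0, 20.5, 47.0, 9.8, 150.0])

        def hc5(values):
            # Fit a log-normal SSD and return its 5th percentile.
            logs = np.log10(values)
            mu, sigma = logs.mean(), logs.std(ddof=1)
            return 10 ** stats.norm.ppf(0.05, loc=mu, scale=sigma)

        # Leave-one-out (LOO) estimates of HC5 and their spread on the log10 scale.
        loo = np.array([hc5(np.delete(toxicity, i)) for i in range(len(toxicity))])
        print(f"HC5 = {hc5(toxicity):.2f}, LOO variance of log10(HC5) = {np.log10(loo).var(ddof=1):.4f}")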

  9. TrafficDator Madrid

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Apr 6, 2024
    Cite
    Iván Gómez; Sergio Ilarri (2024). TrafficDator Madrid [Dataset]. http://doi.org/10.5281/zenodo.10435154
    Available download formats: csv
    Dataset updated
    Apr 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Iván Gómez; Sergio Ilarri
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Madrid
    Description

    Data Origin: This dataset was generated using information from the Community of Madrid, including traffic data collected by multiple sensors located throughout the city, as well as work calendar and meteorological data, all provided by the Community.

    Data Type: The data consists of traffic measurements in Madrid from June 1, 2022, to September 30, 2023. Each record includes information on the date, time, location (longitude and latitude), traffic intensity, and associated road and weather conditions (e.g., whether it is a working day, holiday, information on wind, temperature, precipitation, etc.).

    Technical Details:

    • Data Preprocessing: We utilized advanced techniques for cleaning and normalizing traffic data collected from sensors across Madrid. This included handling outliers and missing values to ensure data quality.

    • Geospatial Analysis: We used GeoPandas and OSMnx to map traffic data points onto Madrid's road network. This process involved processing spatial attributes such as street lanes and speed limits to add context to the traffic data.

    • Meteorological Data Integration: We incorporated Madrid's weather data, including temperature, precipitation, and wind speed. Understanding the impact of weather conditions on traffic patterns was crucial in this step.

    • Traffic Data Clustering: We implemented K-Means clustering to identify patterns in traffic data. This approach facilitated the selection of representative sensors from each cluster, focusing on the most relevant data points (a code sketch follows the Data Structure list below).

    • Calendar Integration: We combined the traffic data with the work calendar to distinguish between different types of days. This provided insights into traffic variations on working days and holidays.

    • Comprehensive Analysis Approach: The analysis was conducted using Python libraries such as Pandas, NumPy, scikit-learn, and Shapely. It covered data from the years 2022 and 2023, focusing on the unique characteristics of the Madrid traffic dataset.

    • Data Structure: Each row of the dataset represents an individual measurement from a traffic sensor, including:
      • id: Unique sensor identifier.
      • date: Date and time of the measurement.
      • longitude and latitude: Geographical coordinates of the sensor.
      • day type: Information about the day being a working day, holiday, or festive Sunday.
      • intensity: Measured traffic intensity.
      • Additional data like wind, temperature, precipitation, etc.
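
    The sketch referenced in the Traffic Data Clustering item above: a hedged illustration of clustering sensors with scikit-learn's K-Means and keeping the sensor nearest each cluster centre as a representative. The sensor coordinates and intensities are synthetic, and the number of clusters is an arbitrary choice rather than the value used for this dataset.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler

        # Synthetic sensor table: longitude, latitude and mean measured intensity.
        rng = np.random.default_rng(6)
        sensors = np.column_stack([rng.uniform(-3.85, -3.55, 500),
                                   rng.uniform(40.30, 40.55, 500),
                                   rng.uniform(0, 1500, 500)])

        features = StandardScaler().fit_transform(sensors)       # put columns on one scale
        km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(features)

        # Keep the sensor closest to each cluster centre as that cluster's representative.
        dist = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
        representatives = [np.where(km.labels_ == c)[0][np.argmin(dist[km.labels_ == c])]
                           for c in range(km.n_clusters)]
        print(representatives[:5])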

    Purpose of the Dataset: This dataset is useful for traffic analysis, urban mobility studies, infrastructure planning, and research related to traffic behavior under different environmental and temporal conditions.

    Acknowledgment and Funding:

    • This dataset was obtained as part of the R&D project PID2020-113037RB-I00, funded by MCIN/AEI/10.13039/501100011033.
    • In addition to the NEAT-AMBIENCE project, support from the Department of Science, University, and Knowledge Society of the Government of Aragon (Government of Aragon: group reference T64_23R, COSMOS research group) is also acknowledged.
    • For academic and research purposes, please reference this dataset using its DOI for proper attribution and tracking.
  10. ScienceBase Item Summary Page

    • datadiscoverystudio.org
    Updated Jun 27, 2018
    Cite
    (2018). ScienceBase Item Summary Page [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/48c2b51b458e4528bd05b02c76ba14e7/html
    Dataset updated
    Jun 27, 2018
    Area covered
    Description

    Link to the ScienceBase Item Summary page for the item described by this metadata record.
    Service Protocol: Link to the ScienceBase Item Summary page for the item described by this metadata record.
    Application Profile: Web Browser.
    Link Function: information.

  11. Environmental assessment of pig slurry management after local...

    • iepnb.es
    • pre.iepnb.es
    Cite
    Environmental assessment of pig slurry management after local characterization and normalization - Dataset - CKAN [Dataset]. https://iepnb.es/catalogo/dataset/environmental-assessment-of-pig-slurry-management-after-local-characterization-and-normalizatio
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Due to its environmental impact, pig slurry management is of great importance in the Region of Murcia, Spain, where pig production is considerable. The current slurry management system consists of its direct use on agricultural land, yet this entails an important associated problem, namely the nitrogen limits imposed by legislation (<170 kg N ha⁻¹ yr⁻¹). The use of constructed wetlands affords another possibility, achieving a reduction of physicochemical parameters of up to 80%. This paper presents a comparison of both alternatives by means of life cycle assessment, and reports a major impact in the categories of acidification and eutrophication potential. However, the abiotic depletion potential could be minimized by avoiding the application of fertilizers and irrigation water. For the category of global warming potential, the wetland building displayed a negative role as compared to direct use. After normalization of the data, the main environmental problem for both management alternatives proved to be eutrophication, followed by acidification and global warming potential. The use of the great amounts of pig slurry produced in the Region of Murcia generates a problem of eutrophication in the Valle del Guadalentin, the Campo de Cartagena and the Mar Menor lagoon. For the purposes of normalizing, a database is needed to take all the inputs and outputs into account, in order to establish more realistic scenarios for the use and management of pig slurries in different geographical areas. (C) 2012 Elsevier Ltd. All rights reserved.
