68 datasets found
  1. Data from: A systematic evaluation of normalization methods and probe...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 30, 2023
    + more versions
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    Hospital for Sick Children
    University of Toronto
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which applies the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson’s correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1 and relatively low standard deviations. These results suggest that poor probe reliability largely reflects limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points, for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012; the second time point was set in 2020 as part of a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point) due to discontinuation of the equipment, but using the same commercial reagents. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing (WGS) data are also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000 ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the sets of probes flagged in the two analyses were combined and removed from the data.
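
    Two of the probe filters above (missing values and bead number) reduce to simple per-probe fractions across samples. A minimal Python sketch of that logic, assuming hypothetical probes-by-samples matrices betas and bead_counts (the study itself used RnBeads and wateRmelon in R):

        import pandas as pd

        def filter_probes(betas, bead_counts, bead_min=3, max_bad_fraction=0.05):
            # Fraction of samples per probe with a missing beta value (QC step 4).
            missing = betas.isna().mean(axis=1)
            # Fraction of samples per probe with bead number < 3 (wateRmelon-style filter).
            low_beads = (bead_counts < bead_min).mean(axis=1)
            keep = (missing <= max_bad_fraction) & (low_beads <= max_bad_fraction)
            return keep[keep].index  # probes retained for downstream analysis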

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass the previous QC and had not already been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|Δβ|) between replicated samples.
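
    To illustrate the replicate-based comparison, a minimal Python sketch of the mean absolute beta-value difference per replicate pair, assuming a hypothetical probes-by-samples DataFrame betas and a list of replicate sample-ID pairs (the study’s analysis was done in R):

        import pandas as pd

        def mean_abs_beta_diff(betas, pairs):
            # pairs: e.g. [("sample01", "sample01_rep"), ...]
            per_pair = {f"{a}|{b}": (betas[a] - betas[b]).abs().mean()
                        for a, b in pairs}
            return pd.Series(per_pair)  # one mean |Δβ| value per replicate pair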

  2. Normalized Foraminiferal Data for Chincoteague Bay and the Marshes of...

    • catalog.data.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Normalized Foraminiferal Data for Chincoteague Bay and the Marshes of Assateague Island and the Adjacent Vicinity, Maryland and Virginia- Fall 2014 [Dataset]. https://catalog.data.gov/dataset/normalized-foraminiferal-data-for-chincoteague-bay-and-the-marshes-of-assateague-island-an-100f0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Assateague Island, Virginia, Maryland, Chincoteague Bay
    Description

    Foraminiferal samples were collected from Chincoteague Bay, Newport Bay, and Tom’s Cove, as well as the marshes on the back-barrier side of Assateague Island and the Delmarva (Delaware-Maryland-Virginia) mainland, by U.S. Geological Survey (USGS) researchers from the St. Petersburg Coastal and Marine Science Center in March, April (14CTB01), and October (14CTB02) 2014. Samples were also collected by the Woods Hole Coastal and Marine Science Center (WHCMSC) in July 2014 and shipped to the St. Petersburg office for processing. The dataset includes raw and normalized foraminiferal counts for the estuarine grab samples (G), terrestrial surface samples (S), and inner shelf grab samples (G). For further information regarding data collection and sample site coordinates, processing methods, or related datasets, please refer to USGS Data Series 1060 (https://doi.org/10.3133/ds1060), USGS Open-File Report 2015-1219 (https://doi.org/10.3133/ofr20151219), and USGS Open-File Report 2015-1169 (https://doi.org/10.3133/ofr20151169). Downloadable data are available as Excel spreadsheets, comma-separated values text files, and formal Federal Geographic Data Committee metadata.

  3. Data from: Methods for normalizing microbiome data: an ecological...

    • data.niaid.nih.gov
    • datadryad.org
    Updated May 30, 2022
    Cite
    Huerlimann, Roger (2022). Data from: Methods for normalizing microbiome data: an ecological perspective [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4950179
    Explore at:
    Dataset updated
    May 30, 2022
    Dataset provided by
    Huerlimann, Roger
    Alford, Ross A.
    Schwarzkopf, Lin
    Bower, Deborah S.
    McKnight, Donald T.
    Zenger, Kyall R.
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description
    1. Microbiome sequencing data often need to be normalized due to differences in read depths, and recommendations for microbiome analyses generally warn against using proportions or rarefying to normalize data and instead advocate alternatives, such as upper quartile, CSS, edgeR-TMM, or DESeq-VS. Those recommendations are, however, based on studies that focused on differential abundance testing and variance standardization rather than community-level comparisons (i.e., beta diversity). Also, standardizing the within-sample variance across samples may suppress differences in species evenness, potentially distorting community-level patterns. Furthermore, the recommended methods use log transformations, which we expect to exaggerate the importance of differences among rare OTUs while suppressing the importance of differences among common OTUs.

    2. We tested these theoretical predictions via simulations and a real-world data set.

    3. Proportions and rarefying produced more accurate comparisons among communities and were the only methods that fully normalized read depths across samples. Additionally, upper quartile, CSS, edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed.

    4. Based on our simulations, normalizing via proportions may be superior to other commonly used methods for comparing ecological communities.
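
    The two methods the study favors are simple to state in code. A minimal Python sketch of both, for a single sample’s OTU read counts (illustrative only, not the authors’ implementation):

        import numpy as np

        rng = np.random.default_rng(0)

        def to_proportions(counts):
            # Divide each OTU count by the sample's total read depth.
            return counts / counts.sum()

        def rarefy(counts, depth):
            # Subsample reads without replacement down to a fixed depth.
            reads = np.repeat(np.arange(counts.size), counts)
            picked = rng.choice(reads, size=depth, replace=False)
            return np.bincount(picked, minlength=counts.size)
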
  4. GTEx (Genotype-Tissue Expression) data normalized

    • data.4tu.nl
    • figshare.com
    zip
    Cite
    Erdogan Taskesen, GTEx (Genotype-Tissue Expression) data normalized [Dataset]. http://doi.org/10.4121/uuid:ec5bfa66-5531-482a-904f-b693aa999e8b
    Explore at:
    Available download formats: zip
    Dataset provided by
    TU Delft
    Authors
    Erdogan Taskesen
    License

    https://doi.org/10.4121/resource:terms_of_use

    Description

    This is a normalized dataset derived from the original RNA-seq dataset downloaded from the Genotype-Tissue Expression (GTEx) project (www.gtexportal.org): RNA-SeQCv1.1.8 gene rpkm Pilot V3 patch1. The data were used to analyze how tissue samples are related to each other in terms of gene expression. The data can be used to gain insight into how gene expression levels behave in the different human tissues.

  5. Normalized Difference Water Index for Douro Valley based on MODIS

    • data.subak.org
    csv
    Updated Feb 16, 2023
    Cite
    Normalized Difference Water Index for Douro Valley based on MODIS [Dataset]. https://data.subak.org/dataset/normalized-difference-water-index-for-douro-valley-based-on-modis
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    GMV Aerospace and Defense
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Douro River
    Description

    Normalized Difference Water Index (NDWI) calculated using MODIS09 imagery provided by USGS/EROS. The original MODIS09 bands were used as the data source, and the NDWI was then calculated.

    The Normalized Difference Water Index (NDWI) (Gao, 1996) is a satellite-derived index from the Near-Infrared (NIR) and Short Wave Infrared (SWIR) channels. Its usefulness for drought monitoring and early warning has been demonstrated in different studies (e.g., Gu et al., 2007; Ceccato et al., 2002). It is computed using the near-infrared (NIR) and the shortwave-infrared (SWIR) reflectance, which makes it sensitive to changes in liquid water content and in spongy mesophyll of vegetation canopies (Gao, 1996; Ceccato et al., 2001).

    https://edo.jrc.ec.europa.eu/documents/factsheets/factsheet_ndwi.pdf
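
    As a sketch of the computation described above, NDWI per Gao (1996) from NIR and SWIR reflectance arrays; for MOD09A1 these would plausibly be band 2 (NIR, ~858 nm) and band 6 (SWIR, ~1640 nm), though that band choice is an assumption here:

        import numpy as np

        def ndwi(nir, swir):
            # Gao (1996): NDWI = (NIR - SWIR) / (NIR + SWIR)
            nir = np.asarray(nir, dtype=float)
            swir = np.asarray(swir, dtype=float)
            return (nir - swir) / (nir + swir)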

    Each compressed file contains the index values by year. Internally, each file has ISO TC 211 metadata with a complete geographical description.

    spatial resolution: 463.313 m

    format: GeoTIFF

    reference system: SR-ORG 6842

    To easily manage the data, each file follows the name structure:

    YYYYMMDD_medgold_workpackage_AoI_sensor_index

    YYYYMMDD: imagery acquisition date

    medgold: project name

    sensor: sensor name

    workpackage: sectoral work package name (WP2 – Olive Oil Sector, WP3 – Wine Grape Sector, WP4 – Durum Wheat Pasta Sector)

    AoI: area of interest (Andalusia, Douro Valley)

    index: NDVI, NMDI, NDWI

    Example:

    20000218_medgold_wp3_douro_MOD09A1_ndwi

  6. Example to illustrate the concept of normalized weight (or probability) of...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Rithun Mukherjee; Perry Evans; Larry N. Singh; Sridhar Hannenhalli (2023). Example to illustrate the concept of normalized weight (or probability) of ancestral assignments. [Dataset]. http://doi.org/10.1371/journal.pone.0055521.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rithun Mukherjee; Perry Evans; Larry N. Singh; Sridhar Hannenhalli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a simplified situation of two species related through a common ancestor, where the evolutionary tree has just one internal node representing the ancestor, with four possible ancestral assignments. For a sample PWM with 5 sites aligned over the two species, we provide representative values (in blue) for the probability of the tree corresponding to each site given a particular ancestral assignment. From these we work out the overall probability of an ancestral assignment given the data (last column). For details, see the text in the Materials and Methods section that references this table.
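
    For illustration, turning raw weights into normalized weights is a single division. A minimal Python sketch with made-up numbers (not the paper’s values):

        def normalized_weights(raw):
            # Divide each weight by the total so the assignments sum to 1.
            total = sum(raw)
            return [w / total for w in raw]

        # normalized_weights([0.02, 0.01, 0.005, 0.005]) -> [0.5, 0.25, 0.125, 0.125]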

  7. Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Jun 2, 2023
    + more versions
    Cite
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach (2023). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. http://doi.org/10.3389/fgene.2020.00594.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90, except for the no-change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
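
    For reference, a minimal Python sketch of upper-quartile scaling as commonly defined (each sample is divided by the 75th percentile of its nonzero gene counts); this is an illustration, not the paper’s exact implementation:

        import numpy as np
        import pandas as pd

        def upper_quartile_normalize(counts):
            # counts: genes x samples DataFrame of raw reads.
            factors = counts.apply(lambda col: np.percentile(col[col > 0], 75), axis=0)
            factors = factors / factors.mean()  # keep overall magnitude stable
            return counts / factors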

  8. Python Time Normalized Superposed Epoch Analysis (SEAnorm) Example Data Set

    • data.niaid.nih.gov
    Updated Jul 15, 2022
    + more versions
    Cite
    Walton, Sam D. (2022). Python Time Normalized Superposed Epoch Analysis (SEAnorm) Example Data Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6835136
    Explore at:
    Dataset updated
    Jul 15, 2022
    Dataset provided by
    Walton, Sam D.
    Murphy, Kyle R.
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Solar Wind OMNI and SAMPEX (Solar Anomalous and Magnetospheric Particle Explorer) datasets used in examples for SEAnorm, a time-normalized superposed epoch analysis package in Python.

    Both data sets are stored as either an HDF5 file or a compressed CSV file (csv.bz2), each containing a pandas DataFrame of either the Solar Wind OMNI or the SAMPEX data. The data sets were written with pandas.DataFrame.to_hdf() and pandas.DataFrame.to_csv() using a compression level of 9. The DataFrames can be read using pandas.read_hdf() or pandas.read_csv(), depending on the file format.
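
    A minimal reading sketch, with hypothetical file names standing in for the archive’s actual ones:

        import pandas as pd

        omni = pd.read_hdf("omni.h5")            # HDF5 copy (single pandas object)
        sampex = pd.read_csv("sampex.csv.bz2",   # bz2 compression is inferred
                             index_col=0, parse_dates=True)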

    The Solar Wind OMNI data set contains solar wind velocity (V) and dynamic pressure (P), the southward interplanetary magnetic field in Geocentric Solar Ecliptic (GSE) coordinates (B_Z_GSE), the auroral electrojet index (AE), and the Sym-H index, all at 1-minute cadence.

    The SAMPEX data set contains electron flux from the Proton/Electron Telescope (PET) in two energy channels, 1.5-6.0 MeV (ELO) and 2.5-14 MeV (EHI), at an approximately 6-second cadence.

  9. Data_Sheet_2_Ensuring That Fundamentals of Quantitative Microbiology Are...

    • figshare.com
    txt
    Updated Jun 6, 2023
    + more versions
    Cite
    Philip J. Schmidt; Ellen S. Cameron; Kirsten M. Müller; Monica B. Emelko (2023). Data_Sheet_2_Ensuring That Fundamentals of Quantitative Microbiology Are Reflected in Microbial Diversity Analyses Based on Next-Generation Sequencing.csv [Dataset]. http://doi.org/10.3389/fmicb.2022.728146.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Philip J. Schmidt; Ellen S. Cameron; Kirsten M. Müller; Monica B. Emelko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diversity analysis of amplicon sequencing data has mainly been limited to plug-in estimates calculated using normalized data to obtain a single value of an alpha diversity metric or a single point on a beta diversity ordination plot for each sample. As recognized for count data generated using classical microbiological methods, amplicon sequence read counts obtained from a sample are random data linked to source properties (e.g., proportional composition) by a probabilistic process. Thus, diversity analysis has focused on diversity exhibited in (normalized) samples rather than probabilistic inference about source diversity. This study applies fundamentals of statistical analysis for quantitative microbiology (e.g., microscopy, plating, and most probable number methods) to sample collection and processing procedures of amplicon sequencing methods to facilitate inference reflecting the probabilistic nature of such data and evaluation of uncertainty in diversity metrics. Following description of types of random error, mechanisms such as clustering of microorganisms in the source, differential analytical recovery during sample processing, and amplification are found to invalidate a multinomial relative abundance model. The zeros often abounding in amplicon sequencing data and their implications are addressed, and Bayesian analysis is applied to estimate the source Shannon index given unnormalized data (both simulated and experimental). Inference about source diversity is found to require knowledge of the exact number of unique variants in the source, which is practically unknowable due to library size limitations and the inability to differentiate zeros corresponding to variants that are actually absent in the source from zeros corresponding to variants that were merely not detected. Given these problems with estimation of diversity in the source even when the basic multinomial model is valid, diversity analysis at the level of samples with normalized library sizes is discussed.
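
    For reference, the plug-in Shannon estimate that the study contrasts with Bayesian inference is a one-liner over a sample’s relative abundances. A minimal Python sketch (illustrative; the paper’s analysis goes well beyond this):

        import numpy as np

        def shannon_plugin(counts):
            # counts: 1-D array of read counts for one sample's variants.
            p = counts / counts.sum()
            p = p[p > 0]  # zero-count variants contribute nothing to the sum
            return -np.sum(p * np.log(p))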

  10. Dataset of normalised Slovene text KonvNormSl 1.0

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Sep 18, 2016
    + more versions
    Cite
    (2016). Dataset of normalised Slovene text KonvNormSl 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8217
    Explore at:
    Available download formats: binary format
    Dataset updated
    Sep 18, 2016
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Data used in the experiments described in:

    Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.

    https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf

    (https://www.linguistics.rub.de/konvens16/)

    Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.

    There are four datasets:

    - goo300k-bohoric: historical Slovene, hard case (<1850)

    - goo300k-gaj: historical Slovene, easy case (1850 - 1900)

    - tweet-L3: Slovene tweets, hard case (non-standard language)

    - tweet-L1: Slovene tweets, easy case (mostly standard language)

    The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/).

    The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
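
    The character-splitting transform is straightforward to reproduce. A minimal Python sketch (illustrative, not the authors’ preprocessing code):

        def split_chars(segment):
            # Replace spaces with underscores, then space out the characters.
            return " ".join(segment.replace(" ", "_"))

        # split_chars("novo leto") -> "n o v o _ l e t o"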

  11. Sample dataset for the models trained and tested in the paper 'Can AI be...

    • zenodo.org
    zip
    Updated Aug 1, 2024
    + more versions
    Cite
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

    This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

    This sample dataset also includes files related to metadata, static data, normalization, and plotting.

    To use the data, clone the corresponding repository and unzip this zip file in the data folder.

  12. Data from: The Chord-Normalized Expected Species Shared (CNESS)-distance...

    • data.niaid.nih.gov
    • datadryad.org
    Updated Jun 2, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Axmacher, Jan (2022). Data from: The Chord-Normalized Expected Species Shared (CNESS)-distance represents a superior measure of species turnover patterns [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4995823
    Explore at:
    Dataset updated
    Jun 2, 2022
    Dataset provided by
    Axmacher, Jan
    Zou, Yi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description
    1. Measures of β-diversity characterizing the difference in species composition between samples are commonly used in ecological studies. Nonetheless, commonly used dissimilarity measures require high sample completeness, or at least similar sample sizes between samples. In contrast, the Chord-Normalized Expected Species Shared (CNESS) dissimilarity measure calculates the probability of collecting the same set of species in random samples of a standardized size, and hence is not sensitive to completeness or size of compared samples. To date, this index has enjoyed limited use due to difficulties in its calculation and scarcity of studies systematically comparing it with other measures.

    2. Here, we developed a novel R function that enables users to calculate ESS (Expected Species Shared)-associated measures. We evaluated the performance of the CNESS index on simulated datasets of known species distribution structure, and compared CNESS with more widespread dissimilarity measures (Bray-Curtis index, Chao-Sørensen index, and proportionality-based Euclidean distances) for varying sample completeness and sample sizes.

    3. Simulation results indicated that for small sample size (m) values, CNESS chiefly reflects similarities in dominant species, while selecting large m values emphasizes differences in the overall species assemblages. Permutation tests revealed that CNESS has a consistently low CV (coefficient of variation) even where sample completeness varies, while the Chao-Sørensen index has a high CV, particularly for low sampling completeness. CNESS distances are also more robust than other indices with regard to undersampling, particularly when chiefly rare species are shared between two assemblages.

    4. Our results emphasize the superiority of CNESS for comparisons of samples diverging in sample completeness and size, which is particularly important in studies of highly mobile and species-rich taxa where sample completeness is often low. Via changes in the sample size parameter m, CNESS can furthermore not only provide insights into the similarity of the overall distribution structure of shared species, but also into the differences in dominant and rare species, hence allowing additional, valuable insights beyond the capability of more widespread measures.

  13. Water normalized geochemistry data for marine ferromanganese crusts and...

    • data.usgs.gov
    • catalog.data.gov
    Updated Dec 26, 2024
    Cite
    Katlin Adamczyk; Kira Mizell (2024). Water normalized geochemistry data for marine ferromanganese crusts and phosphorite minerals in the Southern California Borderland [Dataset]. http://doi.org/10.5066/P1QURTBR
    Explore at:
    Dataset updated
    Dec 26, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Katlin Adamczyk; Kira Mizell
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Oct 27, 2020 - Aug 6, 2021
    Area covered
    California
    Description

    Ferromanganese crust and phosphorite minerals were collected using remotely operated vehicles in the Southern California Borderland during two separate research cruises – NOAA Ocean Exploration Trust cruise NA124 onboard the E/V Nautilus in 2020, and Schmidt Ocean Institute cruise FK210726 onboard the R/V Falkor in 2021. Ferromanganese crust and phosphorite samples were described and subsampled for geochemical analysis at the USGS Pacific Coastal and Marine Science Center. Geochemical analyses were completed by outside commercial laboratories, and the results were provided to the USGS. Geochemical data, bulk and layer thickness information, as well as location information (latitude, longitude, depth) for each sample are provided here. The geochemical Borderland work was funded by the Pacific Coastal and Marine Science Center and ship time was funded by NOAA Office of Ocean Exploration and Research (grant number NA19OAR110305).

  14. Normalized Difference Water Index for Andalusia Region based on MODIS

    • data.subak.org
    • data.niaid.nih.gov
    csv
    Updated Feb 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    GMV Aerospace and Defense (2023). Normalized Difference Water Index for Andalusia Region based on MODIS [Dataset]. https://data.subak.org/dataset/normalized-difference-water-index-for-andalusia-region-based-on-modis
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    GMV Aerospace and Defense
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Normalized Difference Water Index (NDWI) calculated using MODIS09 imagery provided by USGS/EROS. The original MODIS09 bands were used as the data source, and the NDWI was then calculated.

    The Normalized Difference Water Index (NDWI) (Gao, 1996) is a satellite-derived index from the Near-Infrared (NIR) and Short Wave Infrared (SWIR) channels. Its usefulness for drought monitoring and early warning has been demonstrated in different studies (e.g., Gu et al., 2007; Ceccato et al., 2002). It is computed using the near-infrared (NIR) and the shortwave-infrared (SWIR) reflectance, which makes it sensitive to changes in liquid water content and in spongy mesophyll of vegetation canopies (Gao, 1996; Ceccato et al., 2001).

    https://edo.jrc.ec.europa.eu/documents/factsheets/factsheet_ndwi.pdf

    Each compressed file contains the index values by year. Internally, each file has ISO TC 211 metadata with a complete geographical description.

    spatial resolution: 463.313 m

    format: GeoTIFF

    reference system: SR-ORG 6842

    To easily manage the data, each file follows the name structure:

    YYYYMMDD_medgold_workpackage_AoI_sensor_index

    YYYYMMDD: imagery acquisition date

    medgold: project name

    sensor: sensor name

    workpackage: sectoral work package name (WP2 – Olive Oil Sector, WP3 – Wine Grape Sector, WP4 – Durum Wheat Pasta Sector)

    AoI: area of interest (Andalusia, Douro Valley)

    index: NDVI, NMDI, NDWI

    Example:

    20000218_medgold_wp2_andalusia_MOD09A1_ndwi

  15. COVID 19 Dataset

    • kaggle.com
    Updated Sep 23, 2020
    + more versions
    Cite
    Rahul Gupta (2020). COVID 19 Dataset [Dataset]. https://www.kaggle.com/rahulgupta21/datahub-covid19/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rahul Gupta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Coronavirus disease 2019 (COVID-19) time series listing confirmed cases, reported deaths, and reported recoveries. Data are disaggregated by country (and sometimes subregion). Coronavirus disease (COVID-19) is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, pointing to the over 118,000 cases of the illness in over 110 countries and territories around the world at the time.

    This dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:

    • confirmed tested cases of coronavirus infection
    • the number of people who have reportedly died while sick with coronavirus
    • the number of people who have reportedly recovered from it

    Content

    Data is in CSV format and updated daily. It is sourced from this upstream repository maintained by the amazing team at Johns Hopkins University Center for Systems Science and Engineering (CSSE), who have been doing a great public service from an early point by collating data from around the world.

    We have cleaned and normalized that data, for example by tidying dates and consolidating several files into normalized time series. We have also added some metadata, such as column descriptions, and packaged the data.
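
    A sketch of that kind of tidying, assuming the upstream JHU CSSE wide layout (one row per region, one column per date); the file and column names are assumptions:

        import pandas as pd

        wide = pd.read_csv("time_series_covid19_confirmed_global.csv")
        tidy = wide.melt(
            id_vars=["Province/State", "Country/Region", "Lat", "Long"],
            var_name="Date", value_name="Confirmed")
        tidy["Date"] = pd.to_datetime(tidy["Date"], format="%m/%d/%y")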

  16. Normalized subject indexing data of K10plus library union catalog

    • zenodo.org
    application/gzip +1
    Updated Dec 10, 2024
    + more versions
    Cite
    Jakob Voß (2024). Normalized subject indexing data of K10plus library union catalog [Dataset]. http://doi.org/10.5281/zenodo.8116348
    Explore at:
    Available download formats: application/gzip, json
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jakob Voß
    Description

    This dataset contains normalized subject indexing data of the K10plus library union catalog. It includes links between bibliographic records in K10plus and concepts (subjects or classes) from controlled vocabularies:

    • kxp-subjects.tsv.gz: TSV format
    • kxp-subjects.nt.gz: RDF format (in form of NTriples)
    • vocabularies.json: information about vocabularies
    • stats.json: statistics (number of records, subjects per vocabulary etc.)

    The dataset is based on a K10plus database dump dated 2023-06-30.

    K10plus

    K10plus is a union catalog of German libraries, run by library service centers BSZ and VZG since 2019. The catalog contains bibliographic data of the majority of academic libraries in Germany. Bibliographic records in K10plus are uniquely identified by a PPN identifier.

    Several APIs exist to retrieve more data for a record via its PPN, e.g. a link into the K10plus OPAC:

    https://opac.k10plus.de/PPNSET?PPN={PPN}

    Retrieve full record in MARC/XML format:

    https://unapi.k10plus.de/?format=marcxml&id=opac-de-627:ppn:{PPN}

    Get formatted citation for display:

    https://ws.gbv.de/suggest/csl2?citationstyle=ieee&language=en&database=opac-de-627&query=pica.ppn=${PPN}

    APIs to look up more data from a notation or identifier of a vocabulary can be found at https://bartoc.org/. For instance, BK class 58.55 can be retrieved via the DANTE API:

    https://api.dante.gbv.de/data?uri=http%3A%2F%2Furi.gbv.de%2Fterminology%2Fbk%2F58.55

    See vocabularies.json for the mapping of vocabulary symbols to BARTOC URIs and additional information.
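
    For example, fetching the MARC/XML record for one PPN via the unAPI endpoint above (a minimal Python sketch):

        import requests

        ppn = "010000011"  # example PPN, also used in the TSV sample below
        marcxml = requests.get(
            "https://unapi.k10plus.de/",
            params={"format": "marcxml", "id": f"opac-de-627:ppn:{ppn}"},
        ).text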

    Statistics (stats.json):

    {
     "records": 24217319,
     "links": 85125454,
     "triples": 151621266,
     "subjects": {
      "gnd": 40286966,
      "bk": 13862540,
      "rvk": 10160482,
      "ddc": 9314821,
      "lcc": 5481479,
      "sdnb": 4776006,
      "sfb": 474683,
      "asb": 201470,
      "kab": 165575,
      "ssd": 159570,
      "nlm": 135890,
      "stw": 105972
     }
    }

    TSV

    The .tsv file contains three tab-separated columns:

    1. Bibliographic record identifier (PPN)
    2. Vocabulary symbol
    3. Notation or identifier in the vocabulary

    An example:

    010000011 bk 58.55
    010000011 gnd 4036582-7

    Record 010000011 is indexed with class 58.55 from the Basic Classification (BK) and with authority record 4036582-7 from the Integrated Authority File (GND).
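
    A minimal Python sketch for loading the TSV dump and reproducing per-vocabulary link counts like the "subjects" block in stats.json:

        import pandas as pd

        links = pd.read_csv("kxp-subjects.tsv.gz", sep="\t", dtype=str,
                            names=["ppn", "vocabulary", "notation"])
        print(links["vocabulary"].value_counts())  # links per vocabulary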

    RDF

    The NTriples file contains the same information as the TSV file, but identifiers are mapped to URIs. An example:

    <http://uri.gbv.de/document/opac-de-627:ppn:010000011> <http://purl.org/dc/terms/subject> <http://d-nb.info/gnd/4036582-7> .
    <http://uri.gbv.de/document/opac-de-627:ppn:010000011> <http://purl.org/dc/terms/subject> <http://uri.gbv.de/terminology/bk/58.55> .

    Changelog

    • 2023-05-07: New dump. Number of records slightly reduced because K10plus cleaned up duplicate records.
    • 2023-04-13: New dump, added stats.json
    • 2023-01-20: New dump
    • 2022-09-11: New dump, fixed PPN URIs and broken UTF-8 encoding
    • 2022-08-24: Fixed GND URIs, added LCC and KAB (https://doi.org/10.5281/zenodo.7018350)
    • 2022-08-24: First version (https://doi.org/10.5281/zenodo.7016626)

    License and provenance

    All data is public domain but references are welcome. See https://coli-conc.gbv.de/ for related projects and documentation.

    The data has been derived from a larger dataset of all subject indexing data, possibly published at https://doi.org/10.5281/zenodo.6817455.

    This dataset has been created with public scripts from git repository https://github.com/gbv/k10plus-subjects.

  17. Planktonic foraminiferal area normalized shell weight data and model...

    • service.tib.eu
    Updated Nov 30, 2024
    + more versions
    Cite
    (2024). Planktonic foraminiferal area normalized shell weight data and model carbonate system simulations for the 20th century California Current ecosystem [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-909101
    Explore at:
    Dataset updated
    Nov 30, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    California
    Description

    These data are associated with a study conducted in the Santa Barbara Basin, California, USA, and were used to develop a 20th-century record of ocean acidification in the California Current Ecosystem. These data are based on the shell characteristics of the planktonic foraminifera species G. bulloides, specifically area-normalized shell weight (ANSW), and include both core sample population average values and the individual data used to calculate sample averages. Additional information on the collection of ANSW data can be obtained in the Osborne et al. (2016) publication in Paleoceanography (doi:10.1002/2016PA002933). Carbonate system calculations for the downcore record, determined from the proxy CO32- values using the CO2SYS program, are also included in this archive. Stable isotope geochemistry data measured by IRMS for individuals used in the ANSW record and for two additional study species (N. dutertrei and N. incompta) are also included. Core radioisotope geochemistry data used to develop an age model for the multi-core record are included in this archive.

  18. MNIST Dataset

    • paperswithcode.com
    Updated Nov 16, 2021
    Cite
    Y. LeCun; L. Bottou; Y. Bengio; P. Haffner (2021). MNIST Dataset [Dataset]. https://paperswithcode.com/dataset/mnist
    Explore at:
    Dataset updated
    Nov 16, 2021
    Authors
    Y. LeCun; L. Bottou; Y. Bengio; P. Haffner
    Description

    The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
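
    A minimal Python sketch of the centering step described above, assuming a size-normalized 20x20 grey-level digit as input (illustrative; the published MNIST images are already centered):

        import numpy as np
        from scipy.ndimage import center_of_mass, shift

        def center_digit(img20):
            # Place the 20x20 digit on a 28x28 canvas, then translate it so
            # its center of mass sits at the canvas center (13.5, 13.5).
            canvas = np.zeros((28, 28))
            canvas[4:24, 4:24] = img20
            cy, cx = center_of_mass(canvas)
            return shift(canvas, (13.5 - cy, 13.5 - cx))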

  19. Data from: Cooperation and coexpression: how coexpression networks shift in...

    • datadryad.org
    • data.niaid.nih.gov
    • +2more
    zip
    Updated Mar 19, 2018
    Cite
    Sathvik X. Palakurty; John R. Stinchcombe; Michelle E. Afkhami (2018). Cooperation and coexpression: how coexpression networks shift in response to multiple mutualists [Dataset]. http://doi.org/10.5061/dryad.2hj343f
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 19, 2018
    Dataset provided by
    Dryad
    Authors
    Sathvik X. Palakurty; John R. Stinchcombe; Michelle E. Afkhami
    Time period covered
    2018
    Description

    Differential Coexpression Script (differentialCoexpression.r): uses previously normalized data to execute the DiffCoEx computational pipeline on an experiment with four treatment groups.

    Normalized Transformed Expression Count Data (Expression_Data.zip): normalized, transformed expression count data of Medicago truncatula and mycorrhizal fungi, given as an R data frame where the columns denote different genes and the rows denote different samples. This data is used for downstream differential coexpression analyses.

    Normalization and Transformation of Raw Count Data Script (dataPrep.r): raw count data is transformed and normalized with available R packages and RNA-Seq best practices.

    Raw_Count_Data_Mycorrhizal_Fungi: raw count data from HTSeq for mycorrhizal fungi reads, later transformed and normalized for use in differential coexpression analysis. 'R+' indicates that the sample was obtained from a plant grown in the presence of both mycorrhizal fungi and rhizobia. 'R-' indicate...

  20. Normalized Multi-band Drought Index for Douro Valley based on MODIS

    • data.subak.org
    csv
    Updated Feb 16, 2023
    Cite
    Normalized Multi-band Drought Index for Douro Valley based on MODIS [Dataset]. https://data.subak.org/dataset/normalized-multi-band-drought-index-for-douro-valley-based-on-modis
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    GMV Aerospace and Defense
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Douro River
    Description

    Normalized Multi-band Drought Index (NMDI) calculated using MODIS09 imagery provided by USGS/EROS. The original MODIS09 bands were used as the data source, and the NMDI was then calculated.

    NMDI uses the 860 nm channel as the reference; instead of using a single liquid water absorption channel, however, it uses the difference between two liquid water absorption channels centered at 1640 nm and 2130 nm as the soil and vegetation moisture sensitive band. Analysis revealed that by combining information from multiple near-infrared and shortwave-infrared channels, NMDI has enhanced sensitivity to drought severity and is well suited to estimating both soil and vegetation moisture (Wang & Qu, 2007).

    https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2007GL031021
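
    A sketch of the index as defined in Wang & Qu (2007), using the three reflectance bands named above:

        import numpy as np

        def nmdi(r860, r1640, r2130):
            # NMDI = (R860 - (R1640 - R2130)) / (R860 + (R1640 - R2130))
            diff = np.asarray(r1640, dtype=float) - np.asarray(r2130, dtype=float)
            r860 = np.asarray(r860, dtype=float)
            return (r860 - diff) / (r860 + diff)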

    Each compressed file contains the index values by year. Internally, each file has ISO TC 211 metadata with a complete geographical description.

    spatial resolution: 463.313 m

    format: GeoTIFF

    reference system: SR-ORG 6842

    To easily manage the data, each file follows the name structure:

    YYYYMMDD_medgold_workpackage_AoI_sensor_index

    YYYYMMDD: imagery acquisition date

    medgold: project name

    sensor: sensor name

    workpackage: sectoral work package name (WP2 – Olive Oil Sector, WP3 – Wine Grape Sector, WP4 – Durum Wheat Pasta Sector)

    AoI: area of interest (Andalusia, Douro Valley)

    index: NDVI, NMDI, NDWI

    Example:

    20000218_medgold_wp3_douro_MOD09A1_nmdi
