https://spdx.org/licenses/CC0-1.0.html
Background
The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results
The method we define as SeSAMe 2, which consists of the standard SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst-performing. Whole-array Pearson’s correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that poor probe reliability largely reflects limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
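To make the ICC analysis concrete, a minimal R sketch follows; the probes x samples beta matrix 'betas' and the assumption that columns 1:16 and 17:32 hold matching replicate pairs are illustrative, not the authors' code.

# Per-probe one-way ICC, i.e. ICC(1,1) with k = 2 replicates:
# ICC = (MSB - MSW) / (MSB + MSW)
icc_oneway <- function(x, y) {
  m   <- (x + y) / 2                        # per-pair means
  msb <- 2 * var(m)                         # between-pair mean square
  msw <- sum((x - y)^2) / (2 * length(x))   # within-pair mean square
  (msb - msw) / (msb + msw)
}

rep1 <- betas[, 1:16]                       # original samples (assumed order)
rep2 <- betas[, 17:32]                      # matching technical replicates
icc  <- vapply(seq_len(nrow(betas)),
               function(i) icc_oneway(rep1[i, ], rep2[i, ]),
               numeric(1))
mean(icc > 0.50, na.rm = TRUE)              # proportion of reliable probes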
Methods
Study Participants and Samples
The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of elderly adults drawn from the census of the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points, for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012; the second time point was set in 2020 as part of a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals (13 men and 11 women) were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two.
All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).
Blood Collection and Processing
Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with automated Gentra extraction (first time point) or manual extraction (second time point, after the equipment was discontinued, but using the same commercial reagents). DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing (WGS) data are also available for the samples described above.
Characterization of DNA Methylation using the EPIC array
Approximately 1,000 ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).
Processing and Analysis of DNA Methylation Data
The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values from the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. Where pOOBAH filtering was carried out, it was done in parallel with the QC steps described above, and the probes flagged by the two analyses were combined and removed from the data.
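The two probe filters that fall outside RnBeads can be sketched in R as below; this is an illustration under assumptions (the IDAT directory name, and dropping probes that pOOBAH masks in any sample), not the study's exact code.

library(minfi)
library(wateRmelon)
library(sesame)

rgSet <- read.metharray.exp(base = "idats")        # hypothetical IDAT folder

# (a) bead-count filter: beadcount() returns NA where the bead number is < 3
bc        <- beadcount(rgSet)
bead_fail <- rownames(bc)[rowMeans(is.na(bc)) > 0.05]   # > 5% of samples

# (b) pOOBAH detection p-values from the out-of-band probe distribution
sdfs  <- lapply(searchIDATprefixes("idats"), readIDATpair)
pvals <- sapply(sdfs, function(s)
  pOOBAH(s, return.pval = TRUE, combine.neg = TRUE))    # probes x samples
poobah_fail <- rownames(pvals)[rowSums(pvals > 0.05) > 0]

to_remove <- union(bead_fail, poobah_fail)              # combined removal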
Normalization Methods Evaluated
The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were those that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This was followed by the removal of probes that did not pass the previous QC and had not already been removed by pOOBAH; SeSAMe 2 therefore has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effect that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
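A minimal sketch of the replicate-based comparison, restricted to the minfi-implemented methods; 'rgSet' (the 63 QC-passing samples) and the replicate index vectors rep1_idx and rep2_idx are assumed, and the BMIQ and SeSAMe branches are omitted for brevity.

library(minfi)

methods <- list(
  Raw      = preprocessRaw,
  Illumina = preprocessIllumina,
  Noob     = preprocessNoob,
  SWAN     = preprocessSWAN,
  Quantile = preprocessQuantile,
  Funnorm  = preprocessFunnorm
)

# Mean absolute beta difference between replicate pairs, per method;
# lower values indicate better technical agreement.
abs_diff <- sapply(methods, function(f) {
  b <- getBeta(f(rgSet))                     # probes x samples beta matrix
  mean(abs(b[, rep1_idx] - b[, rep2_idx]), na.rm = TRUE)
})
sort(abs_diff)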
Foraminiferal samples were collected from Chincoteague Bay, Newport Bay, and Tom’s Cove as well as the marshes on the back-barrier side of Assateague Island and the Delmarva (Delaware-Maryland-Virginia) mainland by U.S. Geological Survey (USGS) researchers from the St. Petersburg Coastal and Marine Science Center in March, April (14CTB01), and October (14CTB02) 2014. Samples were also collected by the Woods Hole Coastal and Marine Science Center (WHCMSC) in July 2014 and shipped to the St. Petersburg office for processing. The dataset includes raw foraminiferal and normalized counts for the estuarine grab samples (G), terrestrial surface samples (S), and inner shelf grab samples (G). For further information regarding data collection and sample site coordinates, processing methods, or related datasets, please refer to USGS Data Series 1060 (https://doi.org/10.3133/ds1060), USGS Open-File Report 2015–1219 (https://doi.org/10.3133/ofr20151219), and USGS Open-File Report 2015-1169 (https://doi.org/10.3133/ofr20151169). Downloadable data are available as Excel spreadsheets, comma-separated values text files, and formal Federal Geographic Data Committee metadata.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
https://doi.org/10.4121/resource:terms_of_use
This is a normalized dataset derived from the original RNA-seq dataset downloaded from the Genotype-Tissue Expression (GTEx) project (www.gtexportal.org): RNA-SeQCv1.1.8 gene rpkm Pilot V3 patch1. The data were used to analyze how tissue samples are related to each other in terms of gene expression. The data can be used to gain insight into how gene expression levels behave in the different human tissues.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Normalized Difference Water Index (NDWI) calculated using MODIS09 imagery provided by USGS/EROS. The original MODIS09 bands were used as the data source, and the NDWI was then calculated.
The Normalized Difference Water Index (NDWI) (Gao, 1996) is a satellite-derived index based on the Near-Infrared (NIR) and Short-Wave Infrared (SWIR) channels. Its usefulness for drought monitoring and early warning has been demonstrated in different studies (e.g., Gu et al., 2007; Ceccato et al., 2002). It is computed from the near-infrared (NIR) and short-wave infrared (SWIR) reflectance, which makes it sensitive to changes in the liquid water content and in the spongy mesophyll of vegetation canopies (Gao, 1996; Ceccato et al., 2001).
https://edo.jrc.ec.europa.eu/documents/factsheets/factsheet_ndwi.pdf
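Following Gao (1996), the index is NDWI = (NIR - SWIR) / (NIR + SWIR). A hedged R sketch using the terra package follows; the file name and the choice of MOD09A1 bands 2 (NIR, 841-876 nm) and 6 (SWIR, 1628-1652 nm) are assumptions for illustration.

library(terra)

r    <- rast("20000218_MOD09A1.tif")   # hypothetical multi-band GeoTIFF
nir  <- r[[2]]                         # sur_refl_b02 (NIR)
swir <- r[[6]]                         # sur_refl_b06 (SWIR)

ndwi <- (nir - swir) / (nir + swir)
writeRaster(ndwi, "20000218_medgold_wp3_douro_MOD09A1_ndwi.tif")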
Each compressed file contains the NDWI grouped by year. Internally, each file has ISO TC 211 metadata with a complete geographical description.
spatial resolution: 463.313 m
format: GeoTiff
reference system: SR-ORG 6842
To easily manage the data, each file follows the naming structure:
YYYYMMDD_medgold_workpackage_AoI_sensor_index
YYYYMMDD: imagery acquisition date
medgold: project name
sensor: sensor name
workpackage: sectorial work package name (WP2 - Olive Oil sector, WP3 - Wine Grape sector, WP4 - Durum Wheat Pasta sector)
AoI: area of interest (Andalusia, Douro Valley)
Index: NDVI, NMDI, NDWI
Example:
20000218_medgold_wp3_douro_MOD09A1_ndwi
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a simplified situation of two species related through a common ancestor, where the evolutionary tree has just one internal node representing the ancestor, with four possible ancestral assignments. For a sample PWM with 5 sites aligned over the two species, we provide representative values (in blue) for the probability of the tree corresponding to each site given a particular ancestral assignment. From these we work out the overall probability of an ancestral assignment given the data (last column). For details, see the text in the Materials and Methods section that references this table.
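A minimal R sketch of the computation the table illustrates: the per-site probabilities are multiplied into a likelihood for each ancestral assignment and then normalized (a uniform prior over assignments is assumed here); the numbers are hypothetical stand-ins for the blue values.

# Rows: the four possible ancestral assignments; columns: the 5 sites.
site_probs <- matrix(runif(4 * 5, 0.01, 0.2), nrow = 4,
                     dimnames = list(c("A", "C", "G", "T"), NULL))

lik       <- apply(site_probs, 1, prod)   # P(data | assignment), sites independent
posterior <- lik / sum(lik)               # P(assignment | data), the last column
posterior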
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between the treated and untreated groups. For all FC levels, the specificity of UQ normalization was greater than 0.84, and the sensitivity was greater than 0.90 except for the no-change and +1.5 FC levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
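For a concrete reference point, here is a minimal R sketch of upper-quartile normalization as it is commonly defined (a sketch, not necessarily the exact implementation compared in the study); 'counts' is an assumed genes x samples matrix.

uq_normalize <- function(counts) {
  # 75th percentile of the nonzero counts in each sample
  uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))
  # center the scaling factors so they average to one (geometric mean)
  sf <- uq / exp(mean(log(uq)))
  sweep(counts, 2, sf, "/")
}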
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Solar Wind OMNI and SAMPEX (Solar Anomalous and Magnetospheric Particle Explorer) datasets used in examples for SEAnorm, a time-normalized superposed epoch analysis package in Python.
Both data sets are stored as either an HDF5 file or a compressed CSV file (csv.bz2), each containing a Pandas DataFrame of either the Solar Wind OMNI or the SAMPEX data. The data sets were written with pandas.DataFrame.to_hdf() and pandas.DataFrame.to_csv() using a compression level of 9. The DataFrames can be read using pandas.read_hdf() or pandas.read_csv(), depending on the file format.
The Solar Wind OMNI data set contains the solar wind velocity (V) and dynamic pressure (P), the southward interplanetary magnetic field in Geocentric Solar Ecliptic (GSE) coordinates (B_Z_GSE), the auroral electrojet index (AE), and the Sym-H index, all at 1-minute cadence.
The SAMPEX data set contains electron flux from the Proton/Electron Telescope (PET) at two energy channels, 1.5-6.0 MeV (ELO) and 2.5-14 MeV (EHI), at an approximately 6-second cadence.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diversity analysis of amplicon sequencing data has mainly been limited to plug-in estimates calculated using normalized data to obtain a single value of an alpha diversity metric or a single point on a beta diversity ordination plot for each sample. As recognized for count data generated using classical microbiological methods, amplicon sequence read counts obtained from a sample are random data linked to source properties (e.g., proportional composition) by a probabilistic process. Thus, diversity analysis has focused on diversity exhibited in (normalized) samples rather than probabilistic inference about source diversity. This study applies fundamentals of statistical analysis for quantitative microbiology (e.g., microscopy, plating, and most probable number methods) to sample collection and processing procedures of amplicon sequencing methods to facilitate inference reflecting the probabilistic nature of such data and evaluation of uncertainty in diversity metrics. Following description of types of random error, mechanisms such as clustering of microorganisms in the source, differential analytical recovery during sample processing, and amplification are found to invalidate a multinomial relative abundance model. The zeros often abounding in amplicon sequencing data and their implications are addressed, and Bayesian analysis is applied to estimate the source Shannon index given unnormalized data (both simulated and experimental). Inference about source diversity is found to require knowledge of the exact number of unique variants in the source, which is practically unknowable due to library size limitations and the inability to differentiate zeros corresponding to variants that are actually absent in the source from zeros corresponding to variants that were merely not detected. Given these problems with estimation of diversity in the source even when the basic multinomial model is valid, diversity analysis at the level of samples with normalized library sizes is discussed.
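As a hedged illustration of this kind of Bayesian estimation (not the authors' exact model), the R sketch below samples a Dirichlet posterior over source proportions from raw counts, under the basic multinomial model with a known number of variants, and propagates it to the Shannon index.

counts <- c(120, 45, 30, 8, 3, 1, 0, 0)     # hypothetical reads per variant

rdirichlet1 <- function(alpha) {            # one Dirichlet draw via gammas
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
shannon <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

# Posterior of the source Shannon index with a Dirichlet(0.5) prior
post <- replicate(10000, shannon(rdirichlet1(counts + 0.5)))
quantile(post, c(0.025, 0.5, 0.975))        # median and 95% credible interval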
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Data used in the experiments described in:
Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.
https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf
(https://www.linguistics.rub.de/konvens16/)
Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.
There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850 - 1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)
The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/).
The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
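A minimal R sketch that undoes this encoding for one line of the files; the example string is hypothetical.

decode_line <- function(line) {
  # remove the spaces between characters, then turn '_' back into spaces
  gsub("_", " ", gsub(" ", "", line), fixed = TRUE)
}
decode_line("d a n e s _ j e _ l e p _ d a n")   # "danes je lep dan"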
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).
This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.
This sample dataset also includes files for metadata, static data, normalization, and plotting.
To use the data, clone the corresponding repository and unzip this zip file in the data folder.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Measures of β-diversity characterizing the difference in species composition between samples are commonly used in ecological studies. Nonetheless, commonly used dissimilarity measures require high sample completeness, or at least similar sample sizes between samples. In contrast, the Chord-Normalized Expected Species Shared (CNESS) dissimilarity measure calculates the probability of collecting the same set of species in random samples of a standardized size, and hence is not sensitive to completeness or size of compared samples. To date, this index has enjoyed limited use due to difficulties in its calculation and scarcity of studies systematically comparing it with other measures.
Here, we developed a novel R function that enables users to calculate ESS (Expected Species Shared)-associated measures. We evaluated the performance of the CNESS index on simulated datasets of known species distribution structure and compared CNESS with more widespread dissimilarity measures (the Bray-Curtis index, the Chao-Sørensen index, and proportionality-based Euclidean distances) across varying levels of sample completeness and sample size.
Simulation results indicated that for small sample size (m) values, CNESS chiefly reflects similarities in dominant species, while selecting large m values emphasizes differences in the overall species assemblages. Permutation tests revealed that CNESS has a consistently low CV (coefficient of variation) even where sample completeness varies, while the Chao-Sørensen index has a high CV particularly for low sampling completeness. CNESS distances are also more robust than other indices with regards to undersampling, particularly when chiefly rare species are shared between two assemblages.
Our results emphasize the superiority of CNESS for comparisons of samples diverging in completeness and size, which is particularly important in studies of highly mobile and species-rich taxa where sample completeness is often low. Via changes in the sample size parameter m, CNESS can furthermore provide insights not only into the similarity of the overall distribution structure of shared species, but also into differences in dominant and rare species, allowing additional, valuable insights beyond the capability of more widespread measures.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Ferromanganese crust and phosphorite minerals were collected using remotely operated vehicles in the Southern California Borderland during two separate research cruises – NOAA Ocean Exploration Trust cruise NA124 onboard the E/V Nautilus in 2020, and Schmidt Ocean Institute cruise FK210726 onboard the R/V Falkor in 2021. Ferromanganese crust and phosphorite samples were described and subsampled for geochemical analysis at the USGS Pacific Coastal and Marine Science Center. Geochemical analyses were completed by outside commercial laboratories, and the results were provided to the USGS. Geochemical data, bulk and layer thickness information, as well as location information (latitude, longitude, depth) for each sample are provided here. The geochemical Borderland work was funded by the Pacific Coastal and Marine Science Center and ship time was funded by NOAA Office of Ocean Exploration and Research (grant number NA19OAR110305).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Normalized Difference Water Index (NDWI) calculated using MODIS09 imagery provided by USGS/EROS. The original MODIS09 bands were used as the data source, and the NDWI was then calculated.
The Normalized Difference Water Index (NDWI) (Gao, 1996) is a satellite-derived index based on the Near-Infrared (NIR) and Short-Wave Infrared (SWIR) channels. Its usefulness for drought monitoring and early warning has been demonstrated in different studies (e.g., Gu et al., 2007; Ceccato et al., 2002). It is computed from the near-infrared (NIR) and short-wave infrared (SWIR) reflectance, which makes it sensitive to changes in the liquid water content and in the spongy mesophyll of vegetation canopies (Gao, 1996; Ceccato et al., 2001).
https://edo.jrc.ec.europa.eu/documents/factsheets/factsheet_ndwi.pdf
Each compressed file contains the NDWI grouped by year. Internally, each file has ISO TC 211 metadata with a complete geographical description.
spatial resolution: 463.313 m
format: GeoTiff
reference system: SR-ORG 6842
To easily manage the data, each file follows the naming structure:
YYYYMMDD_medgold_workpackage_AoI_sensor_index
YYYYMMDD: imagery acquisition date
medgold: project name
sensor: sensor name
workpackage: sectorial work package name (WP2 - Olive Oil sector, WP3 - Wine Grape sector, WP4 - Durum Wheat Pasta sector)
AoI: area of interest (Andalusia, Douro Valley)
Index: NDVI, NMDI, NDWI
Example:
20000218_medgold_wp2_andalusia_MOD09A1_ndwi
https://creativecommons.org/publicdomain/zero/1.0/
Coronavirus disease 2019 (COVID-19) time series listing confirmed cases, reported deaths and reported recoveries. Data is disaggregated by country (and sometimes subregion). Coronavirus disease (COVID-19) is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, pointing to the over 118,000 cases of the illness in over 110 countries and territories around the world at the time.
This dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:
- confirmed tested cases of Coronavirus infection
- the number of people who have reportedly died while sick with Coronavirus
- the number of people who have reportedly recovered from it
Data is in CSV format and updated daily. It is sourced from this upstream repository maintained by the amazing team at Johns Hopkins University Center for Systems Science and Engineering (CSSE) who have been doing a great public service from an early point by collating data from around the world.
We have cleaned and normalized that data, for example by tidying dates and consolidating several files into normalized time series. We have also added some metadata, such as column descriptions, and packaged the data.
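A hedged R sketch of the consolidation step: reshaping the upstream wide-format time series (one column per date) into a long, tidy table. The file name and column layout are assumptions based on the upstream repository.

library(tidyr)

raw  <- read.csv("time_series_covid19_confirmed_global.csv",
                 check.names = FALSE)
long <- pivot_longer(raw,
                     cols      = -(1:4),   # skip Province/State, Country/Region, Lat, Long
                     names_to  = "date",
                     values_to = "confirmed")
long$date <- as.Date(long$date, format = "%m/%d/%y")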
This dataset contains normalized subject indexing data from the K10plus library union catalog. It includes links between bibliographic records in K10plus and concepts (subjects or classes) from controlled vocabularies:
- kxp-subjects.tsv.gz: TSV format
- kxp-subjects.nt.gz: RDF format (NTriples)
- vocabularies.json: information about vocabularies
- stats.json: statistics (number of records, subjects per vocabulary, etc.)
The dataset is based on a K10plus database dump from 2023-06-30.
K10plus
K10plus is a union catalog of German libraries, run by library service centers BSZ and VZG since 2019. The catalog contains bibliographic data of the majority of academic libraries in Germany. Bibliographic records in K10plus are uniquely identified by a PPN identifier.
Several APIs exist to retrieve more data for a record via its PPN, e.g. link into K10plus OPAC:
https://opac.k10plus.de/PPNSET?PPN={PPN}
Retrieve full record in MARC/XML format:
https://unapi.k10plus.de/?format=marcxml&id=opac-de-627:ppn:{PPN}
Get formatted citation for display:
https://ws.gbv.de/suggest/csl2?citationstyle=ieee&language=en&database=opac-de-627&query=pica.ppn=${PPN}
APIs to look up more data from a notation or identifier of a vocabulary can be found in https://bartoc.org/. For instance, BK class 58.55 can be retrieved via the DANTE API:
https://api.dante.gbv.de/data?uri=http%3A%2F%2Furi.gbv.de%2Fterminology%2Fbk%2F58.55
See vocabularies.json for the mapping of vocabulary symbols to BARTOC URIs and additional information.
Statistics (stats.json):
{
"records": 24217319,
"links": 85125454,
"triples": 151621266,
"subjects": {
"gnd": 40286966,
"bk": 13862540,
"rvk": 10160482,
"ddc": 9314821,
"lcc": 5481479,
"sdnb": 4776006,
"sfb": 474683,
"asb": 201470,
"kab": 165575,
"ssd": 159570,
"nlm": 135890,
"stw": 105972
}
}
TSV
The .tsv file contains three tab-separated columns: the PPN of the bibliographic record, the symbol of the vocabulary, and the notation or identifier within that vocabulary.
An example:
010000011 bk 58.55
010000011 gnd 4036582-7
Record 010000011 is indexed with class 58.55 from the Basic Classification and with authority record 4036582-7 from the Integrated Authority File.
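A minimal R sketch for working with the TSV file; the column names follow the description above (note the file is large).

links <- read.delim("kxp-subjects.tsv.gz", header = FALSE,
                    col.names  = c("ppn", "vocabulary", "notation"),
                    colClasses = "character")
# Reproduce the per-vocabulary link counts reported in stats.json
sort(table(links$vocabulary), decreasing = TRUE)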
RDF
The NTriples file contains the same information as the TSV file, but identifiers are mapped to URIs. An example:
<http://uri.gbv.de/document/opac-de-627:ppn:010000011> <http://purl.org/dc/terms/subject> <http://d-nb.info/gnd/4036582-7> .
<http://uri.gbv.de/document/opac-de-627:ppn:010000011> <http://purl.org/dc/terms/subject> <http://uri.gbv.de/terminology/bk/58.55> .
License and provenance
All data is public domain but references are welcome. See https://coli-conc.gbv.de/ for related projects and documentation.
The data has been derived from a larger dataset of all subject indexing data, possibly published at https://doi.org/10.5281/zenodo.6817455.
This dataset has been created with public scripts from git repository https://github.com/gbv/k10plus-subjects.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data are associated with a study conducted in the Santa Barbara Basin, California, USA, and were used to develop a 20th-century record of ocean acidification in the California Current Ecosystem. The data are based on shell characteristics of the planktonic foraminifera species G. bulloides, specifically area-normalized shell weight (ANSW), and include both core-sample population average values and the individual data used to calculate sample averages. Additional information on the collection of ANSW data can be found in the Osborne et al. (2016) publication in Paleoceanography (doi:10.1002/2016PA002933). Carbonate system calculations for the downcore record, determined from the proxy CO32- values using the CO2SYS program, are also included in this archive. Stable isotope geochemistry data measured by IRMS for the individuals used in the ANSW analysis and for two additional study species (N. dutertrei and N. incompta) are also included. Core radioisotope geochemistry data used to develop an age model for the multi-core record are included in this archive.
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
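A hedged R sketch of the centering step just described; 'img20' is an assumed 20x20 numeric matrix of grey levels, and the computed shift is assumed to keep all indices within the 28x28 field.

center_digit <- function(img20) {
  total <- sum(img20)
  cm_r  <- sum(row(img20) * img20) / total   # center of mass (row)
  cm_c  <- sum(col(img20) * img20) / total   # center of mass (column)
  off_r <- round(14.5 - cm_r)                # shift so the CM lands at the center
  off_c <- round(14.5 - cm_c)
  out <- matrix(0, 28, 28)
  out[(1:20) + off_r, (1:20) + off_c] <- img20
  out
}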
Differential Coexpression Script (differentialCoexpression.r): This script contains the use of previously normalized data to execute the DiffCoEx computational pipeline on an experiment with four treatment groups.
Normalized Transformed Expression Count Data (Expression_Data.zip): Normalized, transformed expression count data of Medicago truncatula and mycorrhizal fungi, given as an R data frame where the columns denote different genes and the rows denote different samples. This data is used for downstream differential coexpression analyses.
Normalization and Transformation of Raw Count Data Script (dataPrep.r): Raw count data is transformed and normalized with available R packages and RNA-Seq best practices.
Raw_Count_Data_Mycorrhizal_Fungi: Raw count data from HtSeq for mycorrhizal fungi reads, later transformed and normalized for use in differential coexpression analysis. 'R+' indicates that the sample was obtained from a plant grown in the presence of both mycorrhizal fungi and rhizobia. 'R-' indicate...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Normalized Multi-band Drought Index (NMDI) calculated using MODIS09 imagery provided by USGS/EROS. The original MODIS09 bands were used as the data source, and the NMDI was then calculated.
NMDI uses the 860 nm channel as the reference; instead of using a single liquid water absorption channel, however, it uses the difference between two liquid water absorption channels centered at 1640 nm and 2130 nm as the soil and vegetation moisture sensitive band. Analysis revealed that, by combining information from multiple near-infrared and short-wave infrared channels, NMDI has enhanced sensitivity to drought severity, and it is well suited to estimating both soil and vegetation moisture (Wang & Qu, 2007).
https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2007GL031021
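Following Wang & Qu (2007), NMDI = (NIR - (SWIR1 - SWIR2)) / (NIR + (SWIR1 - SWIR2)), with NIR at 860 nm and the two SWIR channels at 1640 nm and 2130 nm. A hedged R sketch with terra; the file name and the MOD09A1 band indices (2, 6, 7) are assumptions for illustration.

library(terra)

r     <- rast("20000218_MOD09A1.tif")   # hypothetical multi-band GeoTIFF
nir   <- r[[2]]                         # sur_refl_b02, ~858 nm
swir1 <- r[[6]]                         # sur_refl_b06, ~1640 nm
swir2 <- r[[7]]                         # sur_refl_b07, ~2130 nm

nmdi <- (nir - (swir1 - swir2)) / (nir + (swir1 - swir2))
writeRaster(nmdi, "20000218_medgold_wp3_douro_MOD09A1_nmdi.tif")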
Each compressed file contains the NMDI grouped by year. Internally, each file has ISO TC 211 metadata with a complete geographical description.
spatial resolution: 463.313 m
format: GeoTiff
reference system: SR-ORG 6842
To easily manage the data, each file follows the naming structure:
YYYYMMDD_medgold_workpackage_AoI_sensor_index
YYYYMMDD: imagery acquisition date
medgold: project name
sensor: sensor name
workpackage: sectorial work package name (WP2 - Olive Oil sector, WP3 - Wine Grape sector, WP4 - Durum Wheat Pasta sector)
AoI: area of interest (Andalusia, Douro Valley)
Index: NDVI, NMDI, NDWI
Example:
20000218_medgold_wp3_douro_MOD09A1_nmdi