https://spdx.org/licenses/CC0-1.0.html
Background
The Infinium EPIC array measures the methylation status of more than 850,000 CpG sites. The EPIC BeadChip uses two probe designs, Infinium Type I and Type II, which exhibit different technical characteristics that may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe-type bias as well as other issues such as background and dye bias.
Methods
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results
The method we define as SeSAMe 2, which consists of the standard SeSAMe pipeline with an additional round of QC based on pOOBAH masking, was the best-performing normalization method, while quantile-based methods performed worst. Whole-array Pearson’s correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poorly performing probes had beta values close to either 0 or 1 and relatively low standard deviations. These results suggest that poor probe reliability largely reflects limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
Methods
Study Participants and Samples
Whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-Estar e Envelhecimento, SABE) study cohort. SABE is a cohort of elderly adults drawn from census data in the city of São Paulo, Brazil, followed up every five years since 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points, for a total of 48 samples. The first time point corresponds to the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 as part of a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals (13 men and 11 women) were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two.
All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).
Blood Collection and Processing
Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point, after the extraction equipment was discontinued, but using the same commercial reagents). DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing (WGS) data are also available for the samples described above.
Characterization of DNA Methylation using the EPIC array
Approximately 1,000 ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the Infinium MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).
Processing and Analysis of DNA Methylation Data
The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Using the 59 SNP probes included on the EPIC array, we calculated the concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not provide probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated, for each probe, the proportion of samples with a bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values from the empirical distribution of out-of-band probes using the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. Where pOOBAH filtering was carried out, it was done in parallel with the QC steps above, and the probes flagged by the two analyses were combined and removed from the data.
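As an illustration of the bead-number filter just described, here is a minimal R sketch assuming minfi and wateRmelon; the IDAT directory is a hypothetical placeholder, and wateRmelon's beadcount() (which requires an extended RGChannelSet) returns NA for probes measured with fewer than 3 beads.
# Minimal sketch of the bead-number filter (assumptions noted above)
library(minfi)
library(wateRmelon)
rgset <- read.metharray.exp(base = "idat_dir", extended = TRUE)  # hypothetical IDAT folder
bc <- beadcount(rgset)                                           # probes x samples; NA where bead number < 3
low.bead.fraction <- rowMeans(is.na(bc))
probes.to.remove <- rownames(bc)[low.bead.fraction > 0.05]       # > 5% of samples with low bead number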
Normalization Methods Evaluated
The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RGChannelSets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and subsequent normalization steps were carried out on the remaining 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested; for both, the inputs were unmasked SigDFs converted from minfi’s RGChannelSets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were those that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset and masked probes were removed; this was followed by removal of probes that did not pass the previous QC and had not already been removed by pOOBAH. SeSAMe 2 therefore has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effect of each normalization method on the absolute difference in beta values (|Δβ|) between replicate pairs.
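For concreteness, the following is a minimal sketch of the "SeSAMe 2"-style processing of a single sample, assuming sesame (>= 1.14); function names follow the sesame API, but the IDAT prefix and the QC blacklist object are hypothetical placeholders and exact argument names may differ between package versions.
# Minimal sketch of SeSAMe 2-style processing for one sample (assumptions noted above)
library(sesame)
sdf <- readIDATpair("idats/sample01_R01C01")             # one SigDF from a Grn/Red IDAT pair
# Round 1: pOOBAH masking on the unfiltered data, then drop masked probes
sdf <- pOOBAH(sdf, pval.threshold = 0.05, combine.neg = TRUE)
sdf <- sdf[!sdf$mask, ]
# Round 2: drop probes flagged by the independent RnBeads/wateRmelon QC
failed.qc <- readRDS("failed_qc_probes.rds")             # hypothetical character vector of probe IDs
sdf <- sdf[!sdf$Probe_ID %in% failed.qc, ]
# Noob background correction with nonlinear dye bias correction, then beta values
sdf <- noob(dyeBiasNL(sdf))
betas <- getBetas(sdf)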
Sichkar V. N. Effect of various dimension convolutional layer filters on traffic sign classification accuracy. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 3, pp. 546–552. DOI: 10.17586/2226-1494-2019-19-3-546-552 (full text available at ResearchGate.net/profile/Valentyn_Sichkar)
This is ready-to-use preprocessed data saved into a pickle file.
Preprocessing stages are as follows:
- Normalizing the whole dataset by dividing by 255.0.
- Dividing the whole dataset into three subsets: train, validation and test.
- Normalizing the whole dataset by subtracting the mean image and dividing by the standard deviation.
- Transposing every dataset so that channels come first.
The mean image and standard deviation were calculated from the train dataset and applied to all datasets. When using a user's image for classification, it has to be preprocessed first in the same way: normalized, mean-image subtracted and divided by the standard deviation.
Data is written as a dictionary with the following keys:
x_train: (59000, 1, 28, 28)
y_train: (59000,)
x_validation: (1000, 1, 28, 28)
y_validation: (1000,)
x_test: (1000, 1, 28, 28)
y_test: (1000,)
Contains pretrained weights (model_params_ConvNet1.pickle) for the model with the following architecture:
Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax
Parameters: the Pool layer has stride 2 and height = width = 2.
The architecture can also be seen in the following diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2Fc23041248e82134b7d43ed94307b720e%2FModel_1_Architecture_MNIST.png?generation=1563654250901965&alt=media
The initial data is MNIST, which was collected by Yann LeCun, Corinna Cortes and Christopher J.C. Burges.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how these methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq, and normalized the data using seven normalization methods. Upper quartile (UQ) normalization performed the best with regard to maintaining FC levels as detected by a limma contrast between the treated and untreated groups. For all FC levels, the specificity of UQ normalization was greater than 0.84 and the sensitivity greater than 0.90, except for the no-change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments (adjusted Rand index (ARI) = 0.67). Despite its assumption that the majority of genes are unchanged, the DESeq2 scaling-factor normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥ 2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
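As a minimal illustration of the upper-quartile scaling compared above (a sketch only; edgeR's calcNormFactors(method = "upperquartile") provides an equivalent, more robust implementation), assuming a genes x samples count matrix:
# Minimal sketch of upper-quartile (UQ) scaling of a genes x samples count matrix
uq.normalize <- function(counts) {
  uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))  # per-sample upper quartile of non-zero counts
  sf <- uq / mean(uq)                                           # scale factors centred on 1
  sweep(counts, 2, sf, "/")                                     # divide each sample (column) by its factor
}
# Usage (hypothetical): counts.uq <- uq.normalize(simulated.counts)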
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# The summaries show that the quartiles and extremes of the shorter normalised series are larger in magnitude
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
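# Added check (not part of the original script): both normalised series have a
# sum-of-squares of 1, so the shorter series must have larger per-timepoint values
# because the same total "energy" is spread over a quarter as many samples
sum(ts.normalised.long^2)                          # ~ 1
sum(ts.normalised.short^2)                         # ~ 1
sd(ts.normalised.short) / sd(ts.normalised.long)   # ~ 2 for these lengths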
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will make the code used to generate all derivative files available on our GitHub site (https://github.com/lab-lab/nndb). We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, averaging ~40 minutes but variable in length (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
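For instance, under the 1 + int(D/150) rule, a 40-minute run (D ≈ 2400 s) would receive 1 + int(2400/150) = 17 polynomial drift regressors per run, whereas shorter runs would receive fewer; the fixed 'polort 2' setting keeps the polynomial order constant across runs of different lengths.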
Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
  * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
Effect on results
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data collection
This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), plus data on their citations and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentary, i.e. all electronic material that has been posted on arXiv.
The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020, where the metadata included the eprint’s title, author, abstract, subject category and arXiv ID (arXiv’s original eprint identifier). In addition, the associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) from 24th January 2020 to 7th February 2020, containing the citation information in and out of the arXiv eprints and their published versions (if applicable). Here, whether an eprint has been published in a journal or by other means is assumed to be inferrable, albeit indirectly, from the status of the digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received c_pre and c_pub citations up to the data retrieval date (7th February 2020), before and after it was assigned a DOI respectively, then the citation count of this eprint is recorded in the Semantic Scholar dataset as c_pre + c_pub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as a key variable to merge the two datasets.
The classification of research disciplines is based on that described in the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/). There, the arXiv subject categories are aggregated into several disciplines, of which we restrict our attention to the following six disciplines: Astrophysics (‘astro-ph’), Computer Science (‘comp-sci’), Condensed Matter Physics (‘cond-mat’), High Energy Physics (‘hep’), Mathematics (‘math’) and Other Physics (‘oth-phys’), which collectively accounted for 98% of all the eprints. Those eprints tagged to multiple arXiv disciplines were counted independently for each discipline. Due to this overlapping feature, the current dataset contains a cumulative total of 2,011,216 eprints.
Some general statistics and visualisations per research discipline are provided in the original article (Okamura, to appear), where the validity and limitations associated with the dataset are also discussed.
Description of columns (variables)
arxiv_id : arXiv ID
category : Research discipline
pre_year : Year of posting v1 on arXiv
pub_year : Year of DOI acquisition
c_tot : No. of citations acquired during 1991–2019
c_pre : No. of citations acquired before and including the year of DOI acquisition
c_pub : No. of citations acquired after the year of DOI acquisition
c_yyyy (yyyy = 1991, …, 2019) : No. of citations acquired in the year yyyy (with ‘yyyy’ running from 1991 to 2019)
gamma : The quantitatively-and-temporally normalised citation index
gamma_star : The quantitatively-and-temporally standardised citation index
Note: The definition of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and that of the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, to appear). Both indices can be used to compare the citational impact of papers/eprints published in different research disciplines at different times.
Data files
A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.
This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.
SILO is a Queensland Government database containing continuous daily climate data for Australia from 1889 to present. Gridded datasets are constructed by spatially interpolating the observed point data. Continuous point datasets are constructed by supplementing the available point data with interpolated estimates when observed data are missing.
SILO provides climate datasets that are ready to use. Raw observational data typically contain missing data and are only available at the location of meteorological recording stations. SILO provides point datasets with no missing data and gridded datasets which cover mainland Australia and some islands.
Lineage statement:
(A) Processing System Version History
* Prior to 2001
The interpolation system used the algorithm detailed in Jeffrey et al. [1].
* 2001-2009
The normalisation procedure was modified. Observational rainfall, when accumulated over a sufficient period and raised to an appropriate fractional power, is (to a reasonable approximation) normally distributed. In the original procedure the fractional power was fixed at 0.5 and a normal distribution was fitted to the transformed data using a maximum likelihood technique. A Kolmogorov-Smirnov test was used to test the goodness of fit, with a threshold value of 0.8. In 2001 the procedure was modified to allow the fractional power to vary between 0.4 and 0.6. The normalisation parameters (fractional power, mean and standard deviation) at each station were spatially interpolated using a thin plate smoothing spline. (A minimal code sketch of this normalisation step is given after the version history below.)
* 2009-2011
The normalisation procedure was modified. The Kolmogorov-Smirnov test was removed, enabling normalisation parameters to be computed for all stations having sufficient data. Previously parameters were only computed for those stations having data that were adequately modelled by a normal distribution, as determined by the Kolmogorov-Smirnov test.
* January 2012 - November 2012
The normalisation procedure was modified:
o The Kolmogorov-Smirnov test was reintroduced, with a threshold value of 0.1.
o Data from Bellenden Ker Top station were included in the computation of normalisation parameters. The station was previously omitted on the basis of having insufficient data. It was forcibly included to ensure the steep rainfall gradient in the region was reflected in the normalisation parameters.
o The elevation data used when interpolating normalisation parameters were modified. Previously a mean elevation was assigned to each station, taken from the nearest grid cell in a 0.05° × 0.05° digital elevation model. The procedure was modified to use the actual station elevation instead of the mean. In mountainous regions the discrepancy was substantial, and cross-validation tests showed a significant improvement in error statistics.
o The station data are normalised using: (i) a power parameter extracted from the nearest pixel in the gridded power surface. The surface was obtained by interpolating the power parameters fitted at station locations using a maximum likelihood algorithm; and (ii) mean and standard deviation parameters which had been fitted at station locations using a smoothing spline. Mean and standard deviation parameters were fitted at the subset of stations having at least 40 years of data, using a maximum likelihood algorithm. The fitted data were then spatially interpolated to construct: (a) gridded mean and standard deviation surfaces (for use in a subsequent de-normalisation procedure); and (b) interpolated estimates of the parameters at all station locations (not just the subset having long data records). The parameters fitted using maximum likelihood (at the subset of stations having long data records) may differ from those fitted by the interpolation algorithm, owing to the smoothing nature of the spline algorithm which was used. Previously, station data were normalised using mean and standard deviation parameters which were taken from the nearest pixel in the respective mean and standard deviation surfaces.
* November 2012 - May 2013
The algorithm used for selecting monthly rainfall data for interpolation was modified. Prior to November 2012, the system was as follows:
o Accumulated monthly rainfall was computed by the Bureau of Meteorology;
o Rainfall accumulations spanning the end of a month were assigned to the last month included in the accumulation period;
o A monthly rainfall value was provided for all stations which submitted at least one daily report. Zero rainfall was assumed for all missing values; and
o SILO imposed a complex set of ad-hoc rules which aimed to identify stations which had ceased reporting in real time. In such cases it would not be appropriate to assume zero rainfall for days when a report was not available. The rules were only applied when processing data for January 2001 and onwards.
In November 2012 a modified algorithm was implemented:
o SILO computed the accumulated monthly rainfall by summing the daily reports;
o Rainfall accumulations spanning the end of a month were discarded;
o A monthly rainfall value was not computed for a given station if any day throughout the month was not accounted for - either through a daily report or an accumulation; and
o The SILO ad-hoc rules were not applied.
* May 2013 - current
The algorithm used for selecting monthly rainfall data for interpolation was modified. The modified algorithm is only applied to datasets for the period October 2001 - current and is as follows:
o SILO computes the accumulated monthly rainfall by summing the daily reports;
o Rainfall accumulations spanning the end of a month are pro-rata distributed onto the two months included in the accumulation period;
o A monthly rainfall value is computed for all stations which have at least 21 days accounted for throughout the month. Zero rainfall is assumed for all missing values; and
o The SILO ad-hoc rules are applied when processing data for January 2001 and onwards.
Datasets for the period January 1889-September 2001 are prepared using the system that was in effect prior to November 2012.
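The following is a minimal R sketch of the rainfall normalisation step referenced in the 2001-2009 entry above; it is illustrative only (the operational SILO code is not reproduced here), the input vector is a hypothetical series of accumulated station rainfall, and the Kolmogorov-Smirnov threshold is treated as a p-value purely for the purpose of the sketch.
# Minimal sketch: fit a fractional power transform so the rainfall is approximately normal
normalise.rainfall <- function(rain, powers = seq(0.4, 0.6, by = 0.01)) {
  fits <- lapply(powers, function(p) {
    x <- rain^p                                 # fractional power transform
    mu <- mean(x); sigma <- sd(x)               # normal fit (maximum likelihood up to the n vs n-1 factor)
    ks <- suppressWarnings(ks.test(x, "pnorm", mean = mu, sd = sigma))
    list(power = p, mean = mu, sd = sigma, ks.p = ks$p.value)
  })
  fits[[which.max(vapply(fits, function(f) f$ks.p, numeric(1)))]]  # keep the best-fitting power
}
# Usage (hypothetical): params <- normalise.rainfall(monthly.rain.at.station)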
Lineage statement:
(A) Processing System Version History
No changes have been made to the processing system since SILO's inception.
(B) Major Historical Data Updates
* All observational data and station coordinates were updated in 2009.
* Station coordinates were updated on 26 January 2012.
Process step:
The observed data are interpolated using a tri-variate thin plate smoothing spline, with latitude, longitude and elevation as independent variables [4]. A two-pass interpolation system is used. All available observational data are interpolated in the first pass and residuals computed for all data points. The residual is the difference between the observed and interpolated values. Data points with high residuals may be indicative of erroneous data and are excluded from a subsequent interpolation which generates the final gridded surface. The surface covers the region 112°E - 154°E, 10°S - 44°S on a regular 0.05° × 0.05° grid and is restricted to land areas on mainland Australia and some islands.
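A minimal R sketch of the two-pass interpolation described above is given below, using the fields package; the operational system uses its own spline implementation, and the station table obs, the residual threshold and the prediction grid grid.xyz are hypothetical.
# Minimal sketch of the two-pass tri-variate thin plate smoothing spline (assumptions noted above)
library(fields)
xyz <- as.matrix(obs[, c("lon", "lat", "elev")])   # station locations and elevations
fit1 <- Tps(xyz, obs$value)                        # pass 1: fit to all observations
resid1 <- obs$value - predict(fit1, xyz)           # residuals at the stations
keep <- abs(resid1) < 3 * sd(resid1)               # drop points with high residuals (threshold illustrative)
fit2 <- Tps(xyz[keep, ], obs$value[keep])          # pass 2: final surface
surface <- predict(fit2, grid.xyz)                 # evaluate on the regular 0.05 degree grid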
Gridded datasets for the period 1957-current are obtained by interpolation of the raw data. Gridded datasets for the period 1889-1956 were constructed using an anomaly interpolation technique: the daily departure from the long-term mean is interpolated, and the gridded dataset is constructed by adding the gridded anomaly to the gridded long-term mean. The long-term means were constructed using data from the period 1957-2001. The anomaly interpolation technique is described in Rayner et al. [6].
The observed and interpolated datasets evolve as new data becomes available and the existing data are improved through quality control procedures. Modifications gradually decrease over time, with most datasets undergoing little change 12 months after the date of observation.
"Queensland Department of Science, Information Technology, Innovation and the Arts" (2013) SILO Patched Point data for Narrabri (54120) and Gunnedah (55023) stations in the Namoi subregion. Bioregional Assessment Source Dataset. Viewed 29 September 2017, http://data.bioregionalassessments.gov.au/dataset/0a018b43-58d3-4b9e-b339-4dae8fd54ce8.
CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
- OPUS, the open parallel corpus, is a growing multilingual corpus of translated open-source documents. The majority of the Vi-En and Vi-Cs bitexts are subtitles from movies and television series. The nature of these bitexts is paraphrasing of each other's meaning, rather than translation.
- TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.
The size of the original corpora collected from OPUS and TED talks is as follows:
              CS/VI               EN/VI
Sentence      1337199/1337199     2035624/2035624
Word          9128897/12073975    16638364/17565580
Unique word   224416/68237        91905/78333
We improve the quality of the corpora in two steps: normalizing and filtering. In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. The sequences of dots are distributed differently on the source and the target side. Removing the sequences of dots, along with a number of other normalization rules, improves the quality of the alignment significantly. In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
The size of the cleaned corpora as published is as follows:
              CS/VI               EN/VI
Sentence      1091058/1091058     1113177/1091058
Word          6718184/7646701     8518711/8140876
Unique word   195446/59737        69513/58286
The corpora are used as training data in [2].
References:
[1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC 2012. ELRA. Istanbul, Turkey.
[2] Duc Tam Hoang and Ondřej Bojar. The Prague Bulletin of Mathematical Linguistics, Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this work we present results of all the major global models and normalise the model results by looking at changes over time relative to a common base year value.
We give an analysis of the variability across the models, both before and after normalisation, in order to give insights into variance at the national and regional levels.
A dataset of harmonised results (based on means) and measures of dispersion is presented, providing a baseline dataset for CBCA validation and analysis.
The dataset is intended as a go-to dataset for country and regional results of consumption- and production-based accounts. The normalised mean for each country/region is the principal result that can be used to assess the magnitude and trend of the emission accounts. An additional key element of the dataset is the set of measures of robustness and spread of the results across the source models. These metrics give insight into how much trust should be placed in the individual country/region results.
The Measuring the Information Society Report presents a global overview of the latest developments in information and communication technologies (ICTs), based on internationally comparable data and agreed methodologies. It aims to stimulate the ICT policy debate in ITU Member States by providing an objective assessment of countries' performance in the field of ICT and by highlighting areas that need further improvement. The ICT Development Index (IDI) is a composite index that combines 11 indicators into one benchmark measure. It is used to monitor and compare developments in ICT between countries and over time. The IDI is divided into the following three sub-indices, with a total of 11 indicators.
Access sub-index: captures ICT readiness and includes five infrastructure and access indicators (fixed telephone subscriptions, mobile cellular telephone subscriptions, international Internet bandwidth per Internet user, households with a computer, and households with Internet access).
Use sub-index: captures ICT intensity and includes three intensity and usage indicators (individuals using the Internet, fixed broadband subscriptions, and mobile broadband subscriptions).
Skills sub-index: seeks to capture capabilities or skills which are important for ICTs; it includes three proxy indicators (mean years of schooling, gross secondary enrolment, and gross tertiary enrolment). As these are proxy indicators, rather than indicators directly measuring ICT-related skills, the skills sub-index is given less weight in the computation of the IDI than the other two sub-indices.
The data have been normalized to ensure that the data set uses the same unit of measurement. The values for the indicators selected to construct the IDI are converted into the same unit of measurement, since some indicators have a maximum value of 100 whereas for other indicators the maximum value exceeds 100. After normalizing the data, the individual series were all rescaled to identical ranges, from 1 to 10.
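A minimal sketch of the final rescaling step described above is given below; note that the actual IDI uses fixed reference (ideal) values per indicator rather than the observed minimum and maximum used here.
# Minimal sketch: min-max rescale a normalised indicator to the common 1-10 range
rescale.1.10 <- function(x, lo = min(x, na.rm = TRUE), hi = max(x, na.rm = TRUE)) {
  1 + 9 * (x - lo) / (hi - lo)
}
# Usage (hypothetical): idi.component <- rescale.1.10(indicator.values)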
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We measured local tissue blood flow and muscle stiffness in the upper trapezius muscle, as well as blood pressure and heart rate (HR) in the whole body, to investigate the effects of manipulative therapy (MT) interventions. A diffuse correlation spectroscopy (DCS) system, developed in our laboratory, was used to measure the blood flow index (BFI) as a non-invasive measure of local tissue blood flow. The other measurement devices are explained in the descriptions of the tables below. We measured blood flow, heart rate and blood pressure simultaneously and continuously for 27 minutes. During this 27-minute period, the participants rested for 2 minutes lying in the prone position, followed by 5 minutes of MT, and then rested again for 20 minutes. Notably, MT was performed exclusively on the right shoulder (MT side), while the left shoulder served as the control (CT side). Muscle stiffness was also measured before and after the 27-minute recording of blood flow, heart rate and blood pressure. At the end of the experiment, skin thickness was measured. All of the above measurements were taken at specific measurement points in the upper trapezius muscle, set at the midpoint between the spinous process of the seventh cervical vertebra and the acromion of the scapula on both sides.
Table 1 Participant information A list of each participant's sex, age at the time of the experiment, and adipose layer thickness [cm] at the measurement site. Adipose layer thickness was assessed using a compact high-resolution ultrasonic diagnostic device (LS MUS-P0301-L75, FujikinSoft).
Table 2 Time series data of HR The data of heart rate (HR) [bpm] measured every minute for 27 minutes in each participant, using a blood pressure monitor (TangoM2, 99-0088-40, SunTech Medical), by the oscillometric method in non-exercise mode with the cuff placed on the left arm. Data is sampled every minute. In this data, 1~2 minutes is the rest period before MT, 3~7 minutes is the MT intervention time, and 8~27 minutes is the rest period after MT.
Table 3 Time series data of MAP The data of mean arterial pressure (MAP) [mmHg] measured every minute for 27 minutes in each participant, using a blood pressure monitor (TangoM2, 99-0088-40, SunTech Medical), by the oscillometric method in non-exercise mode with the cuff placed on the left arm. Data is sampled every minute. In this data, 1~2 minutes is the rest period before MT, 3~7 minutes is the MT intervention time, and 8~27 minutes is the rest period after MT.
Table 4 Tissue stiffness The data of muscle stiffness [N/m] measured before and after MT (preMT and postMT) at the same measurement points on MT and CT side for each participant. Muscle stiffness was evaluated using a digital palpation device (MyotonPro, Myoton). The measurement was automatically repeated five times, and the average value was taken as the muscle stiffness.
Table 5 Time series data of BFI (3 cm between the probes on the CT side) The data of blood flow index (BFI) [×10-9 cm2/s] measured at a source-detector distance of 3 cm on the CT side. The BFI data was measured every second for each participant. Data is sampled every second. In this data, 1~120 seconds is the rest period before MT, 121~420 seconds is the MT intervention time, and 421~1621 seconds is the rest period after MT.
Table 6 Time series data of BFI (1cm between the probes on the MT side) The data of BFI [×10-9 cm2/s] measured at a source-detector distance of 1 cm on the MT side. The BFI data was measured every second for each participant. Data is sampled every second. In this data, 1~120 seconds is the rest period before MT, 121~420 seconds is the MT intervention time, and 421~1621 seconds is the rest period after MT.
Table 7 Time series data of BFI (3 cm between the probes on the MT side) The data of BFI [×10-9 cm2/s] measured at a source-detector distance of 3 cm on the MT side. The BFI data was measured every second for each participant. Data is sampled every second. In this data, 1~120 seconds is the rest period before MT, 121~420 seconds is the MT intervention time, and 421~1621 seconds is the rest period after MT.
Table 8 Time series data of rBFI (3 cm between the probes on the CT side) The relative blood flow index (rBFI) data measured at a source-detector distance of 3 cm on the CT side. Data is sampled every second. In this data, 1~120 seconds is the rest period before MT, 121~420 seconds is the MT intervention time, and 421~1621 seconds is the rest period after MT. The rBFI was calculated from the data shown in Table 5 by normalizing the BFI for the entire measurement time by the average value of the BFI for the rest period before MT (1~120 seconds) for each participant.
Table 9 Time series data of rBFI (1 cm between the probes on the MT side) The rBFI data measured at a source-detector distance of 1 cm on the MT side. Data is sampled every second. In this data, 1~120 seconds is the rest period before MT, 121~420 seconds is the MT intervention time, and 421~1621 seconds is the rest period after MT. The rBFI was calculated from the data shown in Table 6 by normalizing the BFI for the entire measurement time by the average value of the BFI for the rest period before MT (1~120 seconds) for each participant.
Table 10 Time series data of rBFI (3 cm between the probes on the MT side) The rBFI data measured at a source-detector distance of 3 cm on the MT side. Data is sampled every second. In this data, 1~120 seconds is the rest period before MT, 121~420 seconds is the MT intervention time, and 421~1621 seconds is the rest period after MT. The rBFI was calculated from the data shown in Table 7 by normalizing the BFI for the entire measurement time by the average value of the BFI for the rest period before MT (1~120 seconds) for each participant.
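For reference, a minimal R sketch of the rBFI calculation described for Tables 8-10, assuming a per-second BFI vector as in the tables (the input object name is hypothetical):
# Minimal sketch: divide the whole BFI trace by the mean BFI over the pre-MT rest period (seconds 1-120)
rbfi <- function(bfi) bfi / mean(bfi[1:120])
# Usage (hypothetical): rbfi.mt.3cm <- rbfi(bfi.mt.3cm)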
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Water quality data. These data have been normalised to their means over the time period, so that each normalised series has a mean of 100.
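A minimal R sketch of this normalisation (each series scaled so that its mean over the period equals 100; the input vector is hypothetical):
# Minimal sketch: scale a series so its mean over the time period equals 100
normalise.to.100 <- function(x) 100 * x / mean(x, na.rm = TRUE)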
Link to the ScienceBase Item Summary page for the item described by this metadata record. Service Protocol: Link to the ScienceBase Item Summary page for the item described by this metadata record. Application Profile: Web Browser. Link Function: information
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Left ventricular mass normalization for body size is recommended, but a question remains: which body size variable is best for this normalization (body surface area, height, or lean body mass computed from a predictive equation)? Since body surface area and computed lean body mass are derivatives of body mass, normalizing for them may result in underestimation of left ventricular mass in overweight children. The aim of this study is to indicate which of the body size variables normalizes left ventricular mass without underestimating it in overweight children.
Methods
Left ventricular mass assessed by echocardiography, height and body mass were collected for 464 healthy boys, 5-18 years old. Lean body mass and body surface area were calculated. Left ventricular mass z-scores, computed based on reference data developed for height, body surface area and lean body mass, were compared between overweight and non-overweight children. The next step was a comparison of paired samples of expected left ventricular mass, estimated for each normalizing variable based on two allometric equations: the first developed for overweight children, the second for children of normal body mass.
Results
The mean left ventricular mass z-score is higher in overweight children than in non-overweight children for normative data based on height (0.36 vs. 0.00) and lower for normative data based on body surface area (-0.64 vs. 0.00). Left ventricular mass estimated by normalizing for height, based on the equation for overweight children, is higher in overweight children (128.12 vs. 118.40); however, masses estimated by normalizing for body surface area and lean body mass, based on equations for overweight children, are lower in overweight children (109.71 vs. 122.08 and 118.46 vs. 120.56, respectively).
Conclusion
Normalization for body surface area and for computed lean body mass, but not for height, underestimates left ventricular mass in overweight children.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study was performed in accordance with the PHS Policy on Humane Care and Use of Laboratory Animals and federal and state regulations, and was approved by the Institutional Animal Care and Use Committees (IACUC) of Cornell University and the Ethics and Welfare Committee at the Royal Veterinary College.
Study design: Adult horses were recruited if in good health and following evaluation of the upper airways through endoscopic exam, at rest and during exercise, either overground or on a high-speed treadmill using a wireless videoendoscope. Horses were categorized as “DDSP”-affected horses if they consistently presented with exercise-induced intermittent dorsal displacement of the soft palate during multiple (n=3) exercise tests, or as “control” horses if they did not experience dorsal displacement of the soft palate during exercise and had no signs compatible with DDSP, such as palatal instability during exercise or soft palate or sub-epiglottic ulcerations. Horses were instrumented with intramuscular electrodes in one or both thyro-hyoid (TH) muscles for EMG recording, hard-wired to a wireless transmitter for remote recording implanted in the cervical area. EMG recordings were then made during an incremental exercise test based on the percentage of maximum heart rate (HRmax).
Incremental exercise test: After surgical instrumentation, each horse performed a 4-step incremental test while recording TH electromyographic activity, heart rate, upper airway videoendoscopy, pharyngeal airway pressures, and gait frequency measurements. Horses were evaluated at exercise intensities corresponding to 50, 80, 90 and 100% of their maximum heart rate, with each speed maintained for 1 minute. Laryngeal function during the incremental test was recorded using a wireless videoendoscope (Optomed, Les Ulis, France), which was placed into the nasopharynx via the right ventral nasal meatus. Nasopharyngeal pressure was measured using a Teflon catheter (1.3 mm ID, Neoflon) inserted through the left ventral nasal meatus to the level of the left guttural pouch ostium. The catheter was attached to differential pressure transducers (Celesco LCVR, Celesco Transducers Products, Canoga Park, CA, USA) referenced to atmospheric pressure and calibrated from -70 to 70 mmHg. The occurrence of episodes of dorsal displacement of the soft palate was recorded, and the number of swallows during each exercise trial was counted for each speed interval.
EMG recording: EMG data were recorded through a wireless transmitter device implanted subcutaneously. Two different transmitters were used: 1) TR70BB (Telemetry Research Ltd, Auckland, New Zealand), with 12-bit A/D conversion resolution, AC-coupled amplifier, -3 dB point at 1.5 Hz, and 2 kHz sampling frequency (n=5 horses); or 2) ELI (Center for Medical Physics and Biomedical Engineering, Medical University of Vienna, Vienna, Austria) [23], with 12-bit A/D conversion resolution, AC-coupled amplifier, amplifier gain 1450, and 1 kHz sampling frequency (n=4 horses). The EMG signal was transmitted through a receiver (TR70BB) or Bluetooth (ELI) to a data acquisition system (PowerLab 16/30 - ML880/P, ADInstruments, Bella Vista, Australia). The EMG signal was amplified with an octal bio-amplifier (Octal Bioamp, ML138, ADInstruments, Bella Vista, Australia) with a bandwidth ranging from 20-1000 Hz (input impedance = 200 MΩ, common mode rejection ratio = 85 dB, gain = 1000), and transmitted to a personal computer. All EMG and pharyngeal pressure signals were collected at a 2000 Hz sampling rate with LabChart 6 software (ADInstruments, Bella Vista, Australia), which allows real-time monitoring and storage for post-processing and analysis.
EMG signal processing: Electromyographic signals from the TH muscles were processed using two methods: 1) a classical approach based on myoelectrical activity and median frequency, and 2) wavelet decomposition. For both methods, the beginning and end of recording segments including twenty consecutive breaths, at the end of each speed interval, were marked with comments in the acquisition software (LabChart). The relationship of EMG activity with the phase of the respiratory cycle was determined by comparing pharyngeal pressure waveforms with the raw EMG and time-averaged EMG traces. For the classical approach, in a graphical user interface-based software (LabChart), a sixth-order Butterworth filter was applied (common mode rejection ratio, 90 dB; band pass, 20 to 1,000 Hz); the EMG signal was then amplified, full-wave rectified, and smoothed using a triangular Bartlett window (time constant: 150 ms). The digitized area under the time-averaged full-wave rectified EMG signal was calculated to define the raw mean electrical activity (MEA) in mV·s. The median power frequency (MF) of the EMG power spectrum was calculated after a Fast Fourier Transformation (1024 points, Hann cosine window processing). For the wavelet decomposition, the whole dataset, including comments and comment locations, was exported as .mat files for processing in MATLAB R2018a with the Signal Processing Toolbox (The MathWorks Inc, Natick, MA, USA). A custom-written automated script based on Hodson-Tole & Wakeling [24] was used to first cut the .mat file into the selected 20-breath segments and subsequently process each segment. A bank of 16 wavelets with time and frequency resolution optimized for EMG was used. The center frequencies of the bank ranged from 6.9 Hz to 804.2 Hz [25]. The intensity was summed (mV2) to a total, and the intensity contribution of each wavelet was calculated across all 20 breaths for each horse, with separate results for each trial date and exercise level (80, 90, 100% of HRmax, as well as the period preceding episodes of DDSP). To determine the relevant bandwidths for the analysis, a Fast Fourier transform frequency analysis was performed on the horses unaffected by DDSP from 0 to 1000 Hz in increments of 50 Hz, and the contribution of each interval was calculated as a percentage of the total spectrum (median and interquartile range). According to the Shannon-Nyquist sampling theorem, the relevant signal is below half the sample rate, and because our instrumentation sampled at either 1000 Hz or 2000 Hz we chose to perform the frequency analysis up to 1000 Hz. The 0-50 Hz interval, mostly stride frequency and background noise, was excluded from further analysis. Of the remaining frequency spectrum, we included all intervals from 50-100 Hz to 450-500 Hz and excluded the remainder because they contributed less than 5% of the total amplitude.
Data analysis: At the end of each exercise speed interval, twenty consecutive breaths were selected and analyzed as described above. To standardize MEA, MF and mV2 within and between horses and trials, and to control for different electrode sizes (i.e. different impedance and area of sampling), data were afterwards normalized to the value at 80% of HRmax (HRmax80), referred to as normalized MEA (nMEA), normalized MF (nMF) and normalized mV2 (nmV2). During the initial processing, it became clear that the TH muscle is inconsistently activated at 50% of HRmax, and that speed level was therefore excluded from further analysis.
The endoscopy video was reviewed and episodes of palatal displacement were marked with comments. For both the classical approach and the wavelet analysis, an EMG segment preceding and concurrent with the DDSP episode was analyzed. If multiple episodes were recorded during the same trial, only the period preceding the first palatal displacement was analyzed. In horses that had both TH muscles implanted, the average of the two sides was used for the analysis. Averaged data from multiple trials were considered for each horse. Descriptive data are expressed as means with standard deviation (SD). Normal distribution of the data was assessed using the Kolmogorov-Smirnov test and quantile-quantile (Q-Q) plots. To determine the frequency clusters in the EMG signal, hierarchical agglomerative clustering with a dendrogram was applied using the packages Matplotlib, pandas, NumPy and SciPy in Python (version 3.6.6), executed through Spyder (version 3.2.2) and Anaconda Navigator. Based on the frequency analysis, the wavelets included in the cluster analysis were 92.4 Hz, 128.5 Hz, 170.4 Hz, 218.1 Hz, 271.5 Hz, 330.6 Hz, 395.4 Hz and 465.9 Hz. The number of frequency clusters was set to two, based on the maximum acceleration in a scree plot and the maximum vertical distance in the dendrogram. For continuous outcome measures (number of swallows, MEA, MF, and mV2), a mixed-effects model was fitted to the data to determine the relationship between the outcome variable and relevant fixed effects (breed, sex, age, weight, speed, group), using horse as a random effect. Tukey's post hoc tests and linear contrasts were used as appropriate. Statistical analysis was performed using JMP Pro 13 (SAS Institute, Cary, NC, USA). Significance was set at P < 0.05 throughout.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The species sensitivity distribution (SSD) is an internationally accepted approach to hazard estimation that uses the probability distribution of toxicity values representative of the sensitivity of a group of species to a chemical. Application of SSDs in ecological risk assessment has been limited by insufficient taxonomic diversity of species to estimate a statistically robust fifth-percentile hazard concentration (HC5). We used the toxicity-normalized SSD (SSDn) approach (Lambert, F. N.; Raimondo, S.; Barron, M. G. Environ. Sci. Technol. 2022, 56, 8278–8289), modified to include all possible normalizing species, to estimate HC5 values from acute toxicity data for groups of carbamate and organophosphorus insecticides. We computed the mean and variance of single-chemical HC5 values for each chemical using leave-one-out (LOO) variance estimation and compared them to SSDn and conventionally estimated HC5 values. SSDn-estimated HC5 values showed low uncertainty and high accuracy compared to single-chemical SSDs when including all possible combinations of normalizing species within the chemical-taxa grouping (carbamate-all species, carbamate-fish, organophosphate-fish, and organophosphate-invertebrate). The SSDn approach is recommended for estimating HC5 values for compounds with insufficient species diversity for HC5 computation or high uncertainty in estimated single-chemical HC5 values. Furthermore, the LOO variance approach provides SSD practitioners with a simple computational method to estimate confidence intervals around an HC5 estimate that is nearly identical to the conventionally estimated HC5.
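As an illustration of the leave-one-out (LOO) variance idea described above, here is a minimal R sketch assuming a log-normal SSD fitted to hypothetical species-mean toxicity values; the published SSDn approach additionally normalises toxicity values across chemicals before pooling.
# Minimal sketch: HC5 from a log-normal SSD with leave-one-out variance (toxicity values are hypothetical)
tox <- c(1.2, 3.4, 0.8, 5.1, 2.2, 9.7, 0.4, 1.9)           # hypothetical species-mean LC50s (ug/L)
hc5 <- function(x) qlnorm(0.05, meanlog = mean(log(x)), sdlog = sd(log(x)))
loo <- sapply(seq_along(tox), function(i) hc5(tox[-i]))    # refit leaving out one species at a time
c(hc5 = hc5(tox), loo.mean = mean(loo), loo.var = var(loo))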
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file contains six worksheets: (1) the “XRF raw data” worksheet provides the X-ray fluorescence data obtained in samples from the Sergipe-Alagoas Basin. Analyses were performed using a Rigaku Supermini 200 wavelength-dispersive XRF sequential spectrometer and the ZSX software by Rigaku. Results are displayed in counts per second. Only the data discussed in the paper “Palaeoceanographic changes during the late Albian-early Turonian in the Sergipe-Alagoas Basin (Northeast of Brazil) and their link with global events” are reported. ND: not determined. (2) The “XRF normalized” worksheet presents the elemental data normalized to the sum of all elements measured in a sample and as log ratios. ND: not determined. (3) The “Magnetic data” worksheet contains magnetic susceptibility (MS) values measured at 976 Hz (Frequency 1) using an MFK1-FA Kappabridge (Advanced Geophysical Instruments Company, AGICO), and palaeomagnetic data obtained with a 2G Enterprises 755-4K long-core SQUID magnetometer and a pulser. They include the saturation isothermal remanent magnetization (SIRM), backfield isothermal remanent magnetization (BIRM), and S-ratio. MS/Fe values were obtained after normalizing both the MS (m³/kg) and Fe/sum data so that the mean equals zero and the standard deviation equals one (i.e., zero-mean normalization). (4) The “Age model” worksheet contains the ln(K/Al) data linearly interpolated to an average sampling spacing of 3 m, linearly detrended, and filtered for frequencies from 0.04 to 0.005 cycles/m. The worksheet also includes the tie points obtained by tuning the filtered ln(K/Al) data to the astronomical solution from Laskar (2020), and the sedimentation rates. (5) Cross-correlation between the ln(K/Al) and ln(Ca/Fe) data. (6) ln(Ca/Fe) data interpolated to an average sampling spacing of 3 m.
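A minimal sketch of the zero-mean, unit-standard-deviation normalisation applied to the MS and Fe/sum series (vector names are hypothetical; base R's scale() gives the same result):
# Minimal sketch: zero-mean, unit-standard-deviation normalisation of a series
z.norm <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
# Usage (hypothetical): ms.z <- z.norm(ms.values); fe.z <- z.norm(fe.sum.values)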
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains expression data for positionally conserved lncRNAs (pcRNAs) measured using the Nanostring assay. Sheet A: annotation of the probes used in the assay. Sheet B: list of samples tested in the assay. Sheet C: normalised expression of the tested human and mouse pcRNAs and their associated coding genes.
Methods
We designed probes to detect 50 pairs of pcRNAs and corresponding coding genes in human and mouse. The probes were designed according to the Nanostring guidelines to maximize their specificity, and included 9 housekeeping genes for normalization (ALAS1, B2M, CLTC, GAPDH, GUSB, HPRT, PGK1, TDB, TUBB). The raw count data were first normalized by Nanostring Technologies with the nSolver software using a two-step protocol: data were first normalized to the internal positive controls, then to the geometric mean of the housekeeping genes. The normalised data were then imported into R for further analysis.
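A minimal R sketch of the two-step normalization described above, applied to a made-up count matrix; it reproduces the general logic (positive-control scaling followed by housekeeping-gene scaling) rather than the nSolver implementation itself. Probe and sample names are placeholders.

```r
set.seed(1)
probe_names <- c("POS_A", "POS_B",              # positive controls
                 "ALAS1", "B2M", "GAPDH",       # housekeeping genes
                 "pcRNA1", "geneA", "pcRNA2")   # target probes
counts <- matrix(rpois(length(probe_names) * 3, lambda = 200),
                 nrow = length(probe_names),
                 dimnames = list(probe_names, paste0("sample", 1:3)))

geo_mean <- function(x) exp(mean(log(x)))

# Step 1: scale each sample by the geometric mean of its positive controls
pos_gm <- apply(counts[1:2, ], 2, geo_mean)
step1  <- sweep(counts, 2, mean(pos_gm) / pos_gm, "*")

# Step 2: scale each sample by the geometric mean of the housekeeping genes
hk_gm      <- apply(step1[3:5, ], 2, geo_mean)
normalized <- sweep(step1, 2, mean(hk_gm) / hk_gm, "*")
```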
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thoracic diseases, including pneumonia, tuberculosis, lung cancer, and others, pose significant health risks and require timely and accurate diagnosis to ensure proper treatment. In this research, a deep learning model for thorax disease classification from chest X-rays is proposed. The input is pre-processed by resizing, normalizing pixel values, and applying data augmentation to address class imbalance and improve model generalization. Significant features are extracted from the images using an Enhanced Auto-Encoder (EnAE) model, which combines a stacked auto-encoder architecture with an attention module to enhance feature representation and classification accuracy. To further improve feature selection, the Chaotic Whale Optimization (ChWO) algorithm is used to select the most relevant attributes from the extracted features. Finally, disease classification is performed using the novel Improved Swin Transformer (IMSTrans) model, which is designed to efficiently process high-dimensional medical image data and achieve superior classification performance. The proposed EnAE+ChWO+IMSTrans model was evaluated on extensive chest X-ray datasets and the Lung Disease Dataset, achieving Accuracy, Precision, Recall, F-Score, MCC, and MAE of 0.964, 0.977, 0.9845, 0.964, 0.9647, and 0.184, respectively, indicating a reliable and efficient solution for thorax disease classification.
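As a small illustration of the pre-processing step only (the EnAE, ChWO and IMSTrans components are not reproduced here), the following R sketch resizes, rescales and flips a made-up grayscale image matrix; image size and values are assumptions.

```r
set.seed(1)
img <- matrix(sample(0:255, 64 * 64, replace = TRUE), nrow = 64)  # hypothetical X-ray patch

# Resize to a fixed input size with nearest-neighbour sampling
resize_nn <- function(m, h, w) {
  m[round(seq(1, nrow(m), length.out = h)), round(seq(1, ncol(m), length.out = w))]
}
img_resized <- resize_nn(img, 32, 32)

# Normalize pixel values to [0, 1]
img_norm <- img_resized / 255

# Simple augmentation: horizontal flip (adds a mirrored copy to the training set)
img_flipped <- img_norm[, ncol(img_norm):1]
```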
Characterization of DNA Methylation using the EPIC array
Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).
Processing and Analysis of DNA Methylation Data
The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer sample sex and compared the inferred sex to the reported sex. Using the 59 SNP probes included on the EPIC array, we calculated the concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not provide probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated, for each probe, the proportion of samples with bead number < 3; probes with low bead number (< 3) in more than 5% of samples were removed. For the comparison of normalization methods, we also computed detection p-values from the empirical distribution of out-of-band probes using the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the probes flagged in the two analyses were combined and removed from the data.
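To make the two 5% filtering rules concrete, here is a minimal R sketch on hypothetical detection p-value and bead-count matrices (probes × samples); it illustrates only the thresholding logic, not the RnBeads/Greedycut or wateRmelon code actually used.

```r
set.seed(1)
n_probes  <- 1000
n_samples <- 64

# Hypothetical detection p-values and bead counts (probes x samples)
detP  <- matrix(runif(n_probes * n_samples, 0, 0.005), nrow = n_probes)
beads <- matrix(rpois(n_probes * n_samples, 10),       nrow = n_probes)

# Make a handful of probes problematic so the filters have something to catch
bad_probes <- sample(n_probes, 50)
detP[bad_probes, 1:10]  <- 0.5   # detection failures in 10 of 64 samples
beads[bad_probes, 1:10] <- 1     # low bead counts in the same samples

# Rule 1: drop probes with detection p-value > 0.01 in more than 5% of samples
fail_detP  <- rowMeans(detP > 0.01) > 0.05

# Rule 2: drop probes with bead number < 3 in more than 5% of samples
fail_beads <- rowMeans(beads < 3) > 0.05

keep <- !(fail_detP | fail_beads)
sum(keep)   # number of probes retained by these two rules
```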
Normalization Methods Evaluated
The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RGChannelSet objects using minfi's read.metharray.exp() function. One sample that was flagged during QC was removed, and subsequent normalization steps were carried out on the remaining 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi's preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi's Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested; for both, the inputs were unmasked SigDF sets converted from minfi's RGChannelSet objects. In the first, which we call “SeSAMe 1”, SeSAMe's pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were those that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset and masked probes were removed; probes that did not pass the previous QC and had not already been removed by pOOBAH were then also removed. SeSAMe 2 therefore involves two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effect of each normalization method on the absolute difference in beta values (|β|) between replicate pairs.
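For illustration, a minimal sketch of one normalization run and the replicate comparison in R/minfi; the IDAT directory and replicate sample names are placeholders, and only Noob is shown rather than the full set of methods evaluated.

```r
library(minfi)

# Read raw intensities from a folder of IDAT files (placeholder path)
rgSet <- read.metharray.exp(base = "idat_dir")

# Noob background correction and dye bias adjustment, then beta values
mSet  <- preprocessNoob(rgSet)
betas <- getBeta(mSet)

# Suppose replicate pairs are recorded as two vectors of sample (column) names
rep1 <- c("sampleA_1", "sampleB_1")   # hypothetical replicate IDs
rep2 <- c("sampleA_2", "sampleB_2")

# Per-probe mean absolute beta difference across replicate pairs
abs_diff      <- abs(betas[, rep1, drop = FALSE] - betas[, rep2, drop = FALSE])
mean_abs_diff <- rowMeans(abs_diff)
summary(mean_abs_diff)
```

Repeating the same comparison after each normalization method (SWAN, Quantile, Funnorm, Illumina, BMIQ, SeSAMe 1, SeSAMe 2) yields the per-method |β| distributions that the evaluation is based on.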