License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Background: Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized by excellent sensitivity, dynamic range and reproducibility, and is still regarded as the gold standard for quantifying transcript abundance. Parallelization of qPCR, such as on the microfluidic TaqMan Fluidigm Biomark platform, enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. The most widely used methods for evaluating or calculating gene expression data are geNorm and ΔΔCt, respectively. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.
Results: We developed an RG-independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods give similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria, but became differentially expressed in normalized data evaluated by a t-test. geNorm-normalized data showed an effect of a shifted mean per gene per condition, whereas LEMming-normalized data did not. Comparing the decrease in standard deviation from raw data after geNorm and after LEMming normalization, the latter was superior. In data set 3, stable RGs were available according to the geNorm-calculated average expression stability and pairwise variation, but t-tests of raw data contradicted this. Normalization with RGs resulted in distorted data contradicting the literature, while LEMming-normalized data did not.
Conclusions: If RGs are coexpressed but are not independent of the experimental conditions, the stability criteria based on inter- and intragroup variation fail. The linear error model developed, LEMming, overcomes the dependency on RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect, an additional measurement is needed. Quantification of total cDNA content per sample helps to identify systematic errors.
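To make the normalization idea concrete, here is a minimal R sketch of the underlying assumption (equal mean Ct across samples of a similarly treated group); it is an illustration only, not the LEMming implementation, and all object names are made up.
# Hypothetical Ct matrix: rows = genes, columns = samples from one similarly treated group
set.seed(1)
ct <- matrix(rnorm(5 * 4, mean = 25, sd = 2), nrow = 5,
             dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))
# Centre each sample on its own mean Ct so that all samples share the same mean,
# removing sample-wise offsets without relying on reference genes
ct_centred <- sweep(ct, 2, colMeans(ct), FUN = "-")
round(colMeans(ct_centred), 10)  # numerically zero for every sample after centring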
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how these methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq, and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between the treated and untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84, and sensitivity was greater than 0.90 except for the no-change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite assuming that the majority of genes are unchanged, the DESeq2 scaling-factor normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥ 2.0. These findings will help guide researchers in normalizing TempO-Seq gene expression data for more reliable results.
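For readers who want to try upper-quartile scaling on their own count matrices, a minimal R sketch follows; the toy matrix and the rescaling to a common factor are illustrative assumptions, not the exact procedure used in the study.
# Toy count matrix: rows = genes, columns = samples
set.seed(42)
counts <- matrix(rnbinom(20 * 6, mu = 100, size = 5), nrow = 20,
                 dimnames = list(paste0("g", 1:20), paste0("s", 1:6)))
# Upper-quartile (UQ) scaling: divide each sample by the 75th percentile of its
# non-zero counts, rescaled so the average scaling factor is 1
uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))
counts_uq <- sweep(counts, 2, uq / mean(uq), FUN = "/")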
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# By looking at the summaries, it can be seen that the median values become larger
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the "blurred and not censored" and the "not blurred and not censored" timeseries files (described more fully below). We will make the code used to generate all derivative files available on our GitHub site (https://github.com/lab-lab/nndb). We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes, but this number can be variable (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
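As a small aside, the duration-based rule mentioned above can be computed directly; the run lengths below are illustrative (R):
# Previous rule: polort = 1 + int(D / 150), with D the run duration in seconds
run_minutes <- c(30, 40, 50)          # illustrative run lengths
1 + floor(run_minutes * 60 / 150)     # gives 13, 17, 21 for these runs
# The new approach instead fixes -regress_polort 2 and adds -regress_bandpass 0.01 1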
Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul's own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
 * Performing ISC requires the users to take the union of the censoring patterns during the correlation calculation.
 * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might still do for naturalistic tasks), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
Effect on results
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
README – Code and data
Project: LOCALISED
Work Package 7, Task 7.1
Paper: A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union
What this repo does
-------------------
Builds the Transition-Risk Index (TRI) for EU manufacturing at NUTS-2 × NACE Rev. 2, and reproduces the article's Figures 3–6:
• Exposure (emissions by region/sector)
• Vulnerability (composite index)
• Risk: Exposure combined with Vulnerability via a weighted geometric rule (see script 4)
Outputs include intermediate tables, the final analysis dataset, and publication figures.
Folder of interest
------------------
Code and data/
├── Code/                    # R scripts (run in order 1A → 5)
│   └── Create Initial Data/ # scripts to (re)build Initial data/ from the Eurostat API with imputation
├── Initial data/            # Eurostat inputs imputed for missing values
├── Derived data/            # intermediates
├── Final data/              # final analysis-ready tables
└── Figures/                 # exported figures
Quick start
-----------
1) Open R (or RStudio) and set the working directory to "Code and data/Code".
Example: setwd(".../Code and data/Code")
2) Initial data/ contains the required Eurostat inputs referenced by the scripts.
To reproduce the inputs in Initial data/, run the scripts in Code/Create Initial Data/.
These scripts download the required datasets from the respective API and impute missing values; outputs are written to ../Initial data/.
3) Run scripts sequentially (they use relative paths to ../Raw data, ../Derived data, etc.):
1A-non-sector-data.R → 1B-sector-data.R → 1C-all-data.R → 2-reshape-data.R → 3-normalize-data-by-n-enterpr.R → 4-risk-aggregation.R → 5A-results-maps.R, 5B-results-radar.R
What each script does
---------------------
Create Initial Data – Recreate inputs
• Download source tables from the Eurostat API or the Localised DSP, apply light cleaning, and impute missing values.
• Write the resulting inputs to Initial data/ for the analysis pipeline.
1A / 1B / 1C – Build the unified base
• Read individual Eurostat datasets (some sectoral, some only regional).
• Harmonize, aggregate, and align them into a single analysis-ready schema.
• Write aggregated outputs to Derived data/ (and/or Final data/ as needed).
2 – Reshape and enrich
• Reshapes the combined data and adds metadata.
• Output: Derived data/2_All_data_long_READY.xlsx (all raw indicators in tidy long format, with indicator names and values).
3 – Normalize (enterprises & min-max)
• Divide selected indicators by the number of enterprises.
• Apply min-max normalization to [0.01, 0.99] (see the sketch after this block).
• Exposure keeps real zeros (zeros remain zero).
• Write normalized tables to Derived data/ or Final data/.
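For illustration only (not the project's 3-normalize-data-by-n-enterpr.R script), a minimal R sketch of this step with made-up names:
# Hypothetical indicator table: one row per NUTS-2 region x subsector
df <- data.frame(energy_cost = c(10, 40, 25, 90),
                 n_enterprises = c(100, 200, 50, 300))
# Normalize by number of enterprises, then min-max scale to [0.01, 0.99]
x <- df$energy_cost / df$n_enterprises
minmax_01_99 <- function(v) 0.01 + (v - min(v)) / (max(v) - min(v)) * 0.98
df$energy_cost_scaled <- minmax_01_99(x)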
4 – Aggregate indices
• Vulnerability: build dimension scores (Energy, Labour, Finance, Supply Chain, Technology).
  – Within each dimension: equal-weight mean of directionally aligned, [0.01, 0.99]-scaled indicators.
  – Dimension scores are re-scaled to [0.01, 0.99].
• Aggregate Vulnerability: equal-weight mean of the five dimensions.
• TRI (Risk): combine Exposure (E) and Vulnerability (V) via a weighted geometric rule with α = 0.5 in the baseline (see the sketch after this block).
  – Policy-intuitive properties: high E & high V → high risk; imbalances penalized (non-compensatory).
• Output: Final data/ (main analysis tables).
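For illustration only (not the project's 4-risk-aggregation.R script), a minimal R sketch of the baseline aggregation with made-up scores:
# Illustrative dimension scores for one region, already scaled to [0.01, 0.99]
dims <- c(energy = 0.6, labour = 0.4, finance = 0.7, supply_chain = 0.5, technology = 0.3)
vulnerability <- mean(dims)   # equal-weight mean of the five dimensions
exposure <- 0.8               # illustrative exposure score
alpha <- 0.5                  # baseline weight
tri <- exposure^alpha * vulnerability^(1 - alpha)   # weighted geometric combination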
5A / 5B – Visualize results
• 5A: maps and distribution plots for Exposure, Vulnerability, and Risk → Figures 3 & 4.
• 5B: comparative/radar profiles for selected countries/regions/subsectors → Figures 5 & 6.
• Outputs saved to Figures/.
Data flow (at a glance)
-----------------------
Initial data → (1A–1C) Aggregated base → (2) Tidy long file → (3) Normalized indicators → (4) Composite indices → (5) Figures
                      |                        |                                                      |
                      v                        v                                                      v
               Derived data/       2_All_data_long_READY.xlsx                            Final data/ & Figures/
Assumptions & conventions
-------------------------
• Geography: EU NUTS-2 regions; Sector: NACE Rev. 2 manufacturing subsectors.
• Equal weights by default where no evidence supports alternatives.
• All indicators directionally aligned so that higher = greater transition difficulty.
• Relative paths assume working directory = Code/.
Reproducing the article
-----------------------
• Optionally run the scripts in the Code/Create Initial Data subfolder.
• Run 1A → 5B without interruption to regenerate:
  – Figure 3: Exposure, Vulnerability, Risk maps (total manufacturing).
  – Figure 4: Vulnerability dimensions (Energy, Labour, Finance, Supply Chain, Technology).
  – Figure 5: Drivers of risk: highest vs. lowest-risk regions (example: Germany & Greece).
  – Figure 6: Subsector case (e.g., basic metals) by selected regions.
• Final tables for the paper live in Final data/. Figures export to Figures/.
Requirements
------------
• R (version per your environment).
• Install any missing packages listed at the top of each script (e.g., install.packages("...")).
Troubleshooting
---------------
• "File not found": check that the previous script finished and wrote its outputs to the expected folder.
• Paths: confirm getwd() ends with /Code so relative paths resolve to ../Raw data, ../Derived data, etc.
• Reruns: optionally clear Derived data/, Final data/, and Figures/ before a clean rebuild.
Provenance & citation
---------------------
• Inputs: Eurostat and related sources cited in the paper and in the headers of the scripts.
• Methods: OECD composite-indicator guidance; IPCC AR6 risk framing (see paper references).
• If you use this code, please cite the article:
  A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union.
This data release consists of three products relating to an 82 × 50 neuron Emergent Self-Organizing Map (ESOM), which describes the multivariate topology of reservoir temperature and geochemical data for 190 samples of produced and geothermal waters from across the United States. Variables included in the ESOM are coordinates derived from reservoir temperature and the concentrations of Sc, Nd, Pr, Tb, Lu, Gd, Tm, Ce, Yb, Sm, Ho, Er, Eu, Dy, F, alkalinity as bicarbonate, Si, B, Br, Li, Ba, Sr, sulfate, H (derived from pH), K, Mg, Ca, Cl, and Na, converted to units of proportion. The concentration data were converted to isometric log-ratio coordinates (following Hron et al., 2010), where the first ratio is Sc serving as the denominator to the geometric mean of all of the remaining elements (Nd to Na), the second ratio is Nd serving as the denominator to the geometric mean of all of the remaining elements (Pr to Na), and so on, until the final ratio is Na to Cl. Both the temperature and the log-ratio coordinates of the concentration data were normalized to a mean of zero and a sample standard deviation of one. The first table gives the mean and standard deviation of all of the data in this dataset, which are used to standardize the data. The second table contains the codebook vectors from the trained ESOM, where all variables were standardized and compositional data converted to isometric log-ratios. The final table provides rare earth element potentials predicted for a subset of the U.S. Geological Survey Produced Waters Geochemical Database, Version 2.3 (Blondes et al., 2017) through the use of the ESOM. The original source data used to create the ESOM all come from the U.S. Department of Energy Resources Geothermal Data Repository and are detailed in Engle (2019).
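The sequential ratio construction can be sketched loosely in R as below; this is only an illustration of the wording above with a truncated element list and made-up numbers, and it omits the scaling constants of the full isometric log-ratio transform defined in Hron et al. (2010).
# Illustrative composition for one sample (concentrations, arbitrary units; truncated element list)
comp <- c(Sc = 2, Nd = 5, Pr = 3, Na = 120)
gm <- function(v) exp(mean(log(v)))   # geometric mean
# Each coordinate: log of the geometric mean of the remaining elements over the current element
ratios <- sapply(seq_len(length(comp) - 1),
                 function(i) log(gm(comp[(i + 1):length(comp)]) / comp[i]))
# Standardize to zero mean and unit standard deviation using the stored table values
stored_mean <- c(0.9, 0.4, -1.2); stored_sd <- c(0.5, 0.3, 0.8)   # illustrative
z <- (ratios - stored_mean) / stored_sd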
License: Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
These data accompany the publication "Trends, Reversion, and Critical Phenomena in Financial Markets".
They contain daily data from Jan 1992 to Dec 2019 on 24 financial markets, namely
The data are provided in 13 columns:
The trend strengths are defined in the accompanying paper. They are cut off at plus/minus 2.5. The daily log returns were computed from daily futures prices, rolled 5 days prior to first notice, which were taken from Bloomberg. The following mean returns and volatilities were used to normalize the daily log returns in column 3 (a small sketch of this standardization follows the table):
Market        Mean     St. Dev.
S&P 500       2.217%   1.100%
TSE 60        2.416%   1.067%
DAX 30        1.199%   1.366%
FTSE 100      1.053%   1.103%
Nikkei 225   -0.483%   1.486%
Hang Seng     0.768%   1.674%
US 10-year    3.734%   0.366%
Can. 10-year  3.637%   0.376%
Ger. 10-year  4.141%   0.337%
UK 10-year    2.983%   0.419%
Jap. 10-year  4.453%   0.249%
Aus. 3-year   3.029%   0.074%
CAD/USD       0.048%   0.479%
EUR/USD      -0.222%   0.619%
GBP/USD       0.316%   0.597%
JPY/USD      -0.761%   0.667%
AUD/USD       0.851%   0.725%
NZD/USD       1.563%   0.724%
Crude Oil     0.093%   2.243%
Natural Gas  -2.649%   2.985%
Gold          0.580%   0.987%
Copper        0.936%   1.586%
Soybeans      0.631%   1.360%
Live Cattle   0.483%   0.894%
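As a rough illustration of this normalization (assuming a simple location-scale standardization; see the paper for the exact convention), in R:
# Simulated daily log returns for one market (illustrative only)
set.seed(7)
r <- rnorm(250, mean = 0.0002, sd = 0.011)
# Standardize with a given mean return and volatility; mapping the tabulated
# values to daily units follows the paper's convention
normalize_returns <- function(r, mean_ret, vol) (r - mean_ret) / vol
r_norm <- normalize_returns(r, mean_ret = mean(r), vol = sd(r))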
A novel brain tumor dataset containing 4500 2D MRI-CT slices. The original MRI and CT scans are also included in this dataset.
Pre-processing strategy: The pre-processing data pipeline includes pairing MRI and CT scans according to a specific time interval between the CT and MRI scans of the same patient, MRI image registration to a standard template, MRI-CT image registration, intensity normalization, and extracting 2D slices from 3D volumes. The pipeline can be used to obtain classic 2D MRI-CT images from 3D DICOM-format MRI and CT scans, which can be directly used as training data for end-to-end synthetic-CT deep learning networks. Details:
Pairing MRI and CT scans: If the time interval between the MRI and CT scans is too long, the information in the MRI and CT images will not match. Therefore, we pair MRI and CT scans according to a certain time interval between the CT and MRI scans of the same patient, which should not exceed half a year.
MRI image registration: Considering the differences both in the human brain and in the space coordinates of radiation images during scanning, the dataset must avoid individual differences and unify the coordinates, which means all the CT and MRI images should be registered to a standard template. The generated images can be more accurate after registration. The template proposed by the Montreal Neurological Institute is called the MNI ICBM 152 non-linear 6th Generation Symmetric Average Brain Stereotaxic Registration Model (MNI 152) (Grabner et al., 2006). Affine registration is first applied to register the MRI scans to the MNI152 template.
Intensity normalization: The registered scans have some extreme values, which introduce errors that would affect the generation accuracy. We normalized the image data and eliminated these extreme values by selecting the pixels ranked in the top 1% and bottom 1% of intensities and replacing their original values with the 99th- and 1st-percentile values, respectively.
Extracting 2D slices from 3D volumes: After carrying out the registration, the 3D MRI and CT scans can be represented as 237 × 197 × 189 matrices. To ensure compatibility between training models and inputs, each 3D image is sliced, and 4500 2D MRI-CT image pairs are selected as the final training data.
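A minimal R sketch of this clipping step (the array below simply stands in for a registered 3D volume; the actual pipeline operates on the registered MRI scans):
# Illustrative intensities standing in for a registered 3D volume
set.seed(3)
vol <- array(rnorm(64 * 64 * 8, mean = 100, sd = 20), dim = c(64, 64, 8))
# Replace values below the 1st percentile or above the 99th percentile with those percentiles
lo <- quantile(vol, 0.01)
hi <- quantile(vol, 0.99)
vol_clipped <- pmin(pmax(vol, lo), hi)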
Source database: 1. https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=33948305 2. https://wiki.cancerimagingarchive.net/display/Public/CPTAC-GBM 3. https://wiki.cancerimagingarchive.net/display/Public/TCGA-GBM
Patient information: Number of patients: 41
Introduction of each file: Dicom: contains the source files collected from the three websites above. data(processed): contains the processed data, saved as .npy files. You can use train_input.npy and train_output.npy as the input and output of the encoder-decoder structure to train the model. The test and validation input and output files can be used as the test and validation datasets.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0–100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Test_Score → Predict test score based on study hours, attendance, age, and gender.
Predict the studentâs test score using their study hours, attendance percentage, and age.
Sample features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)']; y = ['Test_Score']
You can use:
And analyze feature influence using correlation or SHAP/LIME explainability.
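A short R sketch of the kind of preprocessing and prediction exercise the dataset is designed for; the file name below is a placeholder and the cleaning choices (median imputation, IQR capping) are just one reasonable option.
# Placeholder file name; adjust to your local copy of the dataset
df <- read.csv("student_performance.csv", check.names = FALSE)
# Remove exact duplicate rows
df <- df[!duplicated(df), ]
# Impute missing values with the column median
for (col in c("Study_Hours", "Attendance(%)")) {
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}
# Cap Study_Hours outliers at the 1.5 * IQR whiskers
q <- quantile(df$Study_Hours, c(0.25, 0.75)); iqr <- diff(q)
df$Study_Hours <- pmin(pmax(df$Study_Hours, q[1] - 1.5 * iqr), q[2] + 1.5 * iqr)
# Simple regression of Test_Score on study habits, attendance, and age
fit <- lm(Test_Score ~ Study_Hours + `Attendance(%)` + Age, data = df)
summary(fit)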
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Explore how public SaaS valuations now match private markets, with forward revenue multiples hitting 9x. Key analysis of ARR metrics and growth trends in 2018.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The "Zurich Summer v1.0" dataset is a collection of 20 chips (crops), taken from a QuickBird acquisition of the city of Zurich (Switzerland) in August 2002. QuickBird images are composed by 4 channels (NIR-R-G-B) and were pansharpened to the PAN resolution of about 0.62 cm GSD. We manually annotated 8 different urban and periurban classes : Roads, Buildings, Trees, Grass, Bare Soil, Water, Railways and Swimming pools. The cumulative number of class samples is highly unbalanced, to reflect real world situations. Note that annotations are not perfect, are not ultradense (not every pixel is annotated) and there might be some errors as well. We performed annotations by jointly selecting superpixels (SLIC) and drawing (freehand) over regions which we could confidently assign an object class.
The dataset is composed of 20 image / ground-truth pairs, in GeoTIFF format. Images are distributed in raw DN values. We provide a rough and dirty MATLAB script (preprocess.m) to:
i) extract basic statistics from images (min, max, mean and average std) which should be used to globally normalize the data (note that class distribution of the chips is highly uneven, so single-frame normalization would shift distribution of classes).
ii) Visualize raw DN images (with unsaturated values) and a corresponding stretched version (good for illustration purposes). It also saves a raw and adjusted image version in MATLAB format (.mat) in a local subfolder.
iii) Convert RGB annotations to index mask (CLASS \in {1,...,C}) (via rgb2label.m provided).
iv) Convert index mask to georeferenced RGB annotations (via rgb2label.m provided). Useful if you want to see the final maps of the tiles in some GIS software (coordinate system copied from original geotiffs).
Some requests from you
We encourage researchers to report the ID of images used for training / validation / test (e.g. train: zh1 to zh7, validation zh8 to zh12 and test zh13 to zh20). The purpose of distributing datasets is to encourage reproducibility of experiments.
Acknowledgements
We release this data after a kind agreement obtained with DigitalGlobe. This data can be redistributed freely, provided that this document and the corresponding license are part of the distribution. Ideally, since the dataset could be updated over time, I suggest distributing it via the official link from which this archive was downloaded.
We would like to thank (a lot) Nathan Longbotham @ DigitalGlobe and the whole DG team for their help in granting the distribution of the dataset.
We release this dataset hoping that it will help researchers working on semantic classification / segmentation of remote sensing data, both in comparing to other state-of-the-art methods that use this dataset and in testing models on a larger and more complete set of images (with respect to most benchmarks available in our community). As you can imagine, preparing everything was tedious work. Just for you.
If you are using the data please cite the following work
Volpi, M. & Ferrari, V.; Semantic segmentation of urban scenes by learning local class interactions, In IEEE CVPR 2015 Workshop "Looking from above: when Earth observation meets vision" (EARTHVISION), Boston, USA, 2015.
In kidney transplantation, the donor kidney inevitably undergoes ischemia-reperfusion injury. It is of great importance to study the pathogenesis of ischemia-reperfusion injury and find effective measures to attenuate acute injury of renal tubules after ischemia-reperfusion. We systematically analyzed differences in the expression profiles of three SHP-1 (encoded by Ptpn6)-insufficient mice and three wild-type mice and obtained expression data for 21367 genes by RNA sequencing. TopHat v2.1.0 was used with the default parameters to align the RNA sequencing paired-end reads against the reference genome (Ensembl release 90, GRCm38.p5) and to generate acceptable alignments for Cufflinks. The expression of the annotated genes in the RNA-seq data was evaluated in fragments per kilobase million (FPKM) using Cufflinks. The following formula was used to calculate the FPKM value: FPKM = (number of mapped fragments) × 10^3 × 10^6 / [(length of transcript) × (total number of fragments)]. Log transformation and zero-mean normalization were used to normalize the expression data for comparisons. A false discovery rate (FDR) < 0.05, after applying Benjamini-Hochberg correction, was chosen for determining significant differentially expressed genes.
Data sheet name: allsymbol.genes.expression. This data sheet includes the whole expression data of 21367 genes in three SHP-1 (encoded by Ptpn6)-insufficient mice and three wild-type mice.
Data sheet name: WT-vs-HE.genes.annot. This data sheet includes the comparison of expression data of 21367 genes between the three SHP-1-insufficient mice and the three wild-type mice.
Data sheet name: WT-vs-HE.genes.filter.annot. This data sheet includes the comparison of expression data of the 161 significant differentially expressed genes between the three SHP-1-insufficient mice and the three wild-type mice.
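A small R sketch of the quoted FPKM formula and the subsequent log transformation and zero-mean normalization (here centred per gene, which is an assumption); counts and lengths are illustrative.
# Illustrative fragment counts (rows = genes, columns = samples) and transcript lengths in bp
counts <- matrix(c(500, 800, 120, 450, 900, 100), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("WT", "HE")))
lengths_bp <- c(g1 = 2000, g2 = 3500, g3 = 1200)
# FPKM = mapped fragments x 10^3 x 10^6 / (transcript length x total fragments in the sample)
totals <- colSums(counts)
fpkm <- counts * 1e3 * 1e6 / outer(lengths_bp, totals)
# Log transformation and zero-mean normalization for comparisons
log_fpkm <- log2(fpkm + 1)
log_fpkm_centred <- log_fpkm - rowMeans(log_fpkm)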
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Water quality data. These data have been normalised to their means over the time period, so each normalised series has a mean of 100.
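In R, this convention amounts to the following (illustrative values):
# Dividing a series by its mean over the period and scaling by 100
# gives a normalised series whose mean is 100
x <- c(4.2, 5.1, 3.8, 4.9, 5.6)
x_norm <- x / mean(x) * 100
mean(x_norm)   # 100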
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.
We release our dataset as a set of folders indicating the patch target label (e.g., `banana`), each containing 1000 subfolders, one for each ImageNet output class.
An example showing how to use the dataset is shown below.
# code for testing robustness of a model
import os.path
from torchvision import datasets, transforms, models
import torch.utils.data
class ImageFolderWithEmptyDirs(datasets.ImageFolder):
"""
This is required for handling empty folders from the ImageFolder Class.
"""
def find_classes(self, directory):
classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
if not classes:
raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
len(os.listdir(os.path.join(directory, cls_name))) > 0}
return classes, class_to_idx
# extract and unzip the dataset, then write top folder here
dataset_folder = 'data/ImageNet-Patch'
available_labels = {
487: 'cellular telephone',
513: 'cornet',
546: 'electric guitar',
585: 'hair spray',
804: 'soap dispenser',
806: 'sock',
878: 'typewriter keyboard',
923: 'plate',
954: 'banana',
968: 'cup'
}
# select folder with specific target
target_label = 954
dataset_folder = os.path.join(dataset_folder, str(target_label))
normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
# compose preprocessing without shadowing the torchvision.transforms module
preprocess = transforms.Compose([
    transforms.ToTensor(),
    normalizer
])
dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=preprocess)
model = models.resnet50(pretrained=True)
loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
model.eval()
batches = 10
correct, attack_success, total = 0, 0, 0
for batch_idx, (images, labels) in enumerate(loader):
if batch_idx == batches:
break
pred = model(images).argmax(dim=1)
correct += (pred == labels).sum()
attack_success += sum(pred == target_label)
total += pred.shape[0]
accuracy = correct / total
attack_sr = attack_success / total
print("Robust Accuracy: ", accuracy)
print("Attack Success: ", attack_sr)
License: MIT License, https://opensource.org/licenses/MIT
This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study to investigate the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, but also in the emotional response of the students. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires passed to assess the emotional response and user experience), as well as the collected data and data analysis scripts used for the discussion in the paper.
Dataset
The artifact contains the resources described below.
Experiment resources
The resources needed for replicating the experiment, namely in directory experiment:
alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.
alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.
docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.
api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.
Experiment data
The task database used in our application of the experiment, namely in directory data/experiment:
Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.
identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.
Collected data
Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files with a header row, namely in directory data/results:
data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).
data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:
participant identification: participant's unique identifier (ID);
socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclosure, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes preference to not disclosure).
data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);
detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.
data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID);
user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).
participants.txt: the list of participant identifiers that have registered for the experiment.
Analysis scripts
The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:
analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.
requirements.r: An R script to install the required libraries for the analysis script.
normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.
normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.
Dockerfile: Docker script to automate the analysis script from the collected data.
Setup
To replicate the experiment and the analysis of the results, only Docker is required.
If you wish to manually replicate the experiment and collect your own data, you'll need to install:
A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.
If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:
Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.
R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.
Usage
Experiment replication
This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.
To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch, wait for the "Started your app" message to show.
cd experiment
docker-compose up
This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.
In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, each for each hint group:
Group N (no hints): http://localhost:3000/0CAN
Group L (error locations): http://localhost:3000/CA0L
Group E (counter-example): http://localhost:3000/350E
Group D (error description): http://localhost:3000/27AD
In the 2nd session, as in the 1st session, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5 mins). The permalink is constructed by prepending the participant's identifier with P-. So participant 0CAN would just access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.
Before the 1st session the participants should answer the socio-demographic questionnaire, that should ask the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.
Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment with each of the 14 depicted emotions, expressed on a 5-point Likert scale.
After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:
Kraig Finstad. 2010. The usability metric for user experience. Interacting with Computers 22, 5 (2010), 323–327.
Analysis of other applications of the experiment
This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.
The analysis script expects data in 4 CSV files,
Machine failure prediction refers to the task of using machine learning and data analysis techniques to predict when a machine or piece of equipment is likely to fail or experience a breakdown. By analyzing historical data and identifying patterns and indicators, machine failure prediction models can provide early warnings or alerts, enabling proactive maintenance and minimizing downtime.
Here is an overview of the process of machine failure prediction:
Data Collection: Relevant data is collected from the machines or equipment, such as sensor readings, operational parameters, maintenance records, and historical failure data. This data serves as the basis for training and building the predictive models.
Data Preprocessing: The collected data is cleaned, organized, and preprocessed to remove noise, handle missing values, and normalize the data. Feature engineering techniques may be applied to extract relevant features that capture patterns related to machine failures.
Feature Selection: Selecting the most informative features is crucial for building accurate prediction models. Various techniques, such as statistical analysis, correlation analysis, or domain knowledge, can be employed for feature selection.
Model Development: Machine learning algorithms, such as classification, regression, or time series analysis methods, are applied to train prediction models using the preprocessed data. The choice of algorithms depends on the nature of the data and the specific requirements of the prediction task.
Model Evaluation and Validation: The developed models are evaluated using suitable evaluation metrics to assess their performance and generalization capabilities. Cross-validation techniques may be employed to ensure robustness and reliability of the models.
Prediction and Maintenance Planning: Once the models are trained and validated, they can be used to predict machine failures in real-time. These predictions can help in scheduling preventive maintenance, optimizing resource allocation, and minimizing costly unplanned downtime.
By accurately predicting machine failures in advance, organizations can improve operational efficiency, reduce maintenance costs, enhance safety, and maximize the lifespan of their machines and equipment.
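To make the workflow concrete, a compact R sketch on synthetic sensor data is shown below (normalization, train/test split, a logistic-regression failure model, and a simple accuracy check); everything here is illustrative rather than a recommended production setup.
set.seed(123)
# Synthetic sensor data with a binary failure label
n <- 1000
sensors <- data.frame(temperature = rnorm(n, 70, 10),
                      vibration   = rnorm(n, 0.5, 0.2),
                      run_hours   = runif(n, 0, 5000))
logit <- -6 + 0.05 * sensors$temperature + 4 * sensors$vibration + 0.0005 * sensors$run_hours
sensors$failure <- rbinom(n, 1, plogis(logit))
# Preprocessing: z-score normalization of the predictors
sensors[1:3] <- scale(sensors[1:3])
# Train/test split, model development, and evaluation
train_idx <- sample(n, 0.7 * n)
fit <- glm(failure ~ temperature + vibration + run_hours,
           data = sensors[train_idx, ], family = binomial)
pred <- predict(fit, newdata = sensors[-train_idx, ], type = "response") > 0.5
mean(pred == (sensors$failure[-train_idx] == 1))   # test-set accuracy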
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.
It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.
Attributes (16 in total): StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions, Resources, Internet, EduTech, Motivation, StressLevel, Gender, Age (18–30 years), LearningStyle, ExamScore, and FinalGrade.
The dataset can be used for a range of educational data mining tasks, including modelling the performance indicators (ExamScore, FinalGrade). The dataset was analyzed in Python, including clustering LearningStyle categories and extracting insights for adaptive learning.
merged_dataset.csv – 14,003 rows × 16 columns
Includes student demographics, behaviors, engagement, learning styles, and performance indicators. This dataset is an excellent playground for educational data mining: from clustering and behavioral analytics to predictive modeling and personalized learning applications.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variances has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid users of bioinformatics software with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R-package updates, an renv.lock file is provided to ensure R-package stability. Data management includes missing-value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate bio-statistical framework.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Background: Left ventricular mass normalization for body size is recommended, but a question remains: which body size variable is best for this normalization, body surface area, height, or lean body mass computed from a predictive equation? Since body surface area and computed lean body mass are derivatives of body mass, normalizing for them may result in underestimation of left ventricular mass in overweight children. The aim of this study is to indicate which of the body size variables normalizes left ventricular mass without underestimating it in overweight children.
Methods: Left ventricular mass assessed by echocardiography, height and body mass were collected for 464 healthy boys, 5–18 years old. Lean body mass and body surface area were calculated. Left ventricular mass z-scores, computed based on reference data developed for height, body surface area and lean body mass, were compared between overweight and non-overweight children. The next step was a comparison of paired samples of expected left ventricular mass, estimated for each normalizing variable based on two allometric equations: the first developed for overweight children, the second for children of normal body mass.
Results: The mean left ventricular mass z-score is higher in overweight children than in non-overweight children for normative data based on height (0.36 vs. 0.00) and lower for normative data based on body surface area (-0.64 vs. 0.00). Left ventricular mass estimated normalizing for height, based on the equation for overweight children, is higher in overweight children (128.12 vs. 118.40); however, masses estimated normalizing for body surface area and lean body mass, based on equations for overweight children, are lower in overweight children (109.71 vs. 122.08 and 118.46 vs. 120.56, respectively).
Conclusion: Normalization for body surface area and for computed lean body mass, but not for height, underestimates left ventricular mass in overweight children.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Western blot data are widely used in quantitative applications such as statistical testing and mathematical modelling. To ensure accurate quantitation and comparability between experiments, Western blot replicates must be normalised, but it is unclear how the available methods affect statistical properties of the data. Here we evaluate three commonly used normalisation strategies: (i) by fixed normalisation point or control; (ii) by sum of all data points in a replicate; and (iii) by optimal alignment of the replicates. We consider how these different strategies affect the coefficient of variation (CV) and the results of hypothesis testing with the normalised data. Normalisation by fixed point tends to increase the mean CV of normalised data in a manner that naturally depends on the choice of the normalisation point. Thus, in the context of hypothesis testing, normalisation by fixed point reduces false positives and increases false negatives. Analysis of published experimental data shows that choosing normalisation points with low quantified intensities results in a high normalised data CV and should thus be avoided. Normalisation by sum or by optimal alignment redistributes the raw data uncertainty in a mean-dependent manner, reducing the CV of high intensity points and increasing the CV of low intensity points. This causes the effect of normalisations by sum or optimal alignment on hypothesis testing to depend on the mean of the data tested; for high intensity points, false positives are increased and false negatives are decreased, while for low intensity points, false positives are decreased and false negatives are increased. These results will aid users of Western blotting to choose a suitable normalisation strategy and also understand the implications of this normalisation for subsequent hypothesis testing.
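The first two strategies and their effect on the coefficient of variation can be sketched in a few lines of R (simulated replicates; the optimal-alignment strategy is omitted because it requires the fitting procedure described in the paper):
set.seed(11)
# Simulated quantified intensities: rows = data points (lanes), columns = replicates
raw <- matrix(rlnorm(5 * 4, meanlog = log(1000), sdlog = 0.3), nrow = 5)
# (i) Normalisation by a fixed point: divide each replicate by one chosen data point
by_fixed_point <- sweep(raw, 2, raw[1, ], FUN = "/")
# (ii) Normalisation by sum: divide each replicate by its total signal
by_sum <- sweep(raw, 2, colSums(raw), FUN = "/")
# Coefficient of variation across replicates for each data point
cv <- function(m) apply(m, 1, sd) / rowMeans(m)
round(cbind(raw = cv(raw), fixed = cv(by_fixed_point), sum = cv(by_sum)), 3)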
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Depression and anxiety are among the most common mental health concerns for science and engineering (S&E) undergraduates in the United States (U.S.), and students perceive they would benefit from knowing a S&E instructor with depression or anxiety. However, it is unknown how prevalent depression and anxiety are among S&E instructors and whether instructors disclose their depression or anxiety to their undergraduates. These identities are unique because they are concealable stigmatized identities (CSIs), meaning they can be kept hidden and carry negative stereotypes. To address these gaps, we surveyed 2013 S&E faculty instructors across U.S. very high research activity doctoral-granting institutions. The survey assessed the extent to which they had and revealed depression or anxiety to undergraduates, why they chose to reveal or conceal their depression or anxiety, and the benefits of revealing depression or anxiety. These items were developed based on prior studies exploring why individuals conceal or reveal CSIs including mental health conditions. Of the university S&E instructors surveyed, 23.9% (n = 482) reported having depression and 32.8% (n = 661) reported having anxiety. Instructors who are women, white, Millennials, or LGBTQ+ are more likely to report depression or anxiety than their counterparts. Very few participants revealed their depression (5.4%) or anxiety (8.3%) to undergraduates. Instructors reported concealing their depression and anxiety because they do not typically disclose to others or because it is not relevant to course content. Instructors anticipated that undergraduates would benefit from disclosure because it would normalize struggling with mental health and provide an example of someone with depression and anxiety who is successful in S&E. Despite undergraduates reporting a need for role models in academic S&E who struggle with mental health and depression/anxiety being relatively common among U.S. S&E instructors, our study found that instructors rarely reveal these identities to their undergraduates.