26 datasets found
  1. LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time...

    • plos.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas (2023). LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time PCR (qPCR) Data as an Alternative to Reference Gene Based Methods [Dataset]. http://doi.org/10.1371/journal.pone.0135852
    Available download formats: pdf
    Dataset updated: Jun 2, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized by excellent sensitivity, dynamic range, and reproducibility and is still regarded as the gold standard for quantifying transcript abundance. Parallelization of qPCR, such as by the microfluidic Taqman Fluidigm Biomark Platform, enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. The most widely used methods for evaluating or calculating gene expression data include geNorm and ΔΔCt, respectively. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.

    Results: We developed an RG-independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods gave similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria, but became differentially expressed in normalized data evaluated by a t-test. geNorm-normalized data showed an effect of a shifted mean per gene per condition whereas LEMming-normalized data did not. Comparing the decrease of standard deviation from raw data to geNorm and to LEMming, the latter was superior. In data set 3, according to the geNorm-calculated average expression stability and pairwise variation, stable RGs were available, but t-tests of raw data contradicted this. Normalization with RGs resulted in distorted data contradicting the literature, while LEMming-normalized data did not.

    Conclusions: If RGs are coexpressed but are not independent of the experimental conditions, the stability criteria based on inter- and intragroup variation fail. The linear error model developed, LEMming, overcomes the dependency on RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect, an additional measurement is needed. Quantification of total cDNA content per sample helps to identify systematic errors.
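To make the reference-gene dependency concrete, here is a minimal Python sketch of the standard ΔΔCt calculation mentioned in the abstract. It is not the LEMming model. It assumes perfect amplification efficiency (one cycle corresponds to a two-fold change), and all Ct values and the function name are made up for illustration.

```python
def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression by ddCt, assuming perfect doubling per cycle."""
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)

# Hypothetical case 1: target and reference gene are both unchanged.
stable_ref = fold_change_ddct(24.0, 18.0, 24.0, 18.0)

# Hypothetical case 2: the target is still unchanged, but the reference
# gene itself shifts by one cycle under treatment.
shifted_ref = fold_change_ddct(24.0, 19.0, 24.0, 18.0)

print(stable_ref, shifted_ref)
```

In the second case the target gene appears two-fold upregulated although only the reference gene moved, which is the kind of bias the abstract attributes to reference genes that depend on the experimental condition.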

  2. Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    xlsx
    Updated Jun 2, 2023
    + more versions
    Cite
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach (2023). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. http://doi.org/10.3389/fgene.2020.00594.s002
    Available download formats: xlsx
    Dataset updated: Jun 2, 2023
    Dataset provided by: Frontiers
    Authors: Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
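As a rough illustration of two of the normalization methods named above, the following Python sketch computes counts per million (CPM) and an upper-quartile (UQ) scaling for a single sample. It is a simplified sketch, not the study's pipeline: the function names and toy counts are hypothetical, and taking the upper quartile over nonzero counts only is an assumption about one common variant.

```python
def cpm(counts):
    """Counts per million for one sample (a list of per-gene counts)."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def upper_quartile(values):
    """75th percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = 0.75 * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def uq_normalize(counts):
    """Divide counts by the upper quartile of the sample's nonzero counts
    (excluding zeros is an assumption made for this sketch)."""
    uq = upper_quartile([c for c in counts if c > 0])
    return [c / uq for c in counts]

# Toy samples: same expression profile, sample_b sequenced twice as deep.
sample_a = [0, 10, 20, 30, 40]
sample_b = [0, 20, 40, 60, 80]

print(cpm(sample_a), cpm(sample_b))
print(uq_normalize(sample_a), uq_normalize(sample_b))
```

Both methods remove the pure depth difference between the two toy samples; they differ in how strongly a few highly expressed genes influence the scaling factor.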

  3. Naturalistic Neuroimaging Database

    • openneuro.org
    Updated Apr 20, 2021
    + more versions
    Cite
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
    Dataset updated: Apr 20, 2021
    Dataset provided by: OpenNeuro (https://openneuro.org/)
    Authors: Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, no hearing impairments, unimpaired or corrected vision, and taking no medication. Each movie was stopped at 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10 minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.
    • The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

    v2.0 Changes

    • Overview
      • We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.
    • Normalization

      • Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:
      # Generate a resting state (rs) timeseries (ts)
      # Install / load package to make fake fMRI ts
      # install.packages("neuRosim")
      library(neuRosim)
      # Generate a ts
      ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
      # 3dDetrend -normalize
      # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
      # Do for the full timeseries
      ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
      # Do this again for a shorter version of the same timeseries
      ts.shorter.length <- length(ts.normalised.long)/4
      ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
      # By looking at the summaries, it can be seen that the median values become  larger
      summary(ts.normalised.long)
      summary(ts.normalised.short)
      # Plot results for the long and short ts
      # Truncate the longer ts for plotting only
      ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
      # Give the plot a title
      title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
      plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
      # Add zero line
      lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
      # 3dDetrend -normalize -polort 0 for long timeseries
      lines(ts.normalised.long.made.shorter, col='blue');
      # 3dDetrend -normalize -polort 0 for short timeseries
      lines(ts.normalised.short, col='red');
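The same run-length effect can be checked without the plotting step. The Python sketch below mirrors the R demonstration above using only the standard library; Gaussian noise stands in for the simulated resting-state timeseries, which is an assumption made for brevity, and the function names are illustrative.

```python
import math
import random

def normalize_sum_of_squares(ts):
    """Demean, then scale so that the sum of squares equals 1."""
    mean = sum(ts) / len(ts)
    centered = [x - mean for x in ts]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]

def population_sd(ts):
    """Standard deviation with an N denominator."""
    mean = sum(ts) / len(ts)
    return math.sqrt(sum((x - mean) ** 2 for x in ts) / len(ts))

random.seed(0)
ts = [random.gauss(0.0, 1.0) for _ in range(2000)]

long_norm = normalize_sum_of_squares(ts)         # full run
short_norm = normalize_sum_of_squares(ts[:500])  # first quarter only

sd_long = population_sd(long_norm)
sd_short = population_sd(short_norm)
print(sd_long, sd_short, sd_short / sd_long)
```

Because the sum of squares is forced to 1, the standard deviation of the normalized series is 1/sqrt(N) whatever the data, so a run one quarter as long ends up with twice the amplitude.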
      
    • Standardization/modernization

      • The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.
    • New afni_proc.py command line

      • The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html):
      afni_proc.py \
        -subj_id "$sub_id_name_1" \
        -blocks despike tshift align tlrc volreg mask blur scale regress \
        -radial_correlate_blocks tcat volreg \
        -copy_anat anatomical_warped/anatSS.1.nii.gz \
        -anat_has_skull no \
        -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
        -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
        -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
        -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
        -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
        -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
        -anat_follower_erode fsvent fswm \
        -dsets media_?.nii.gz \
        -tcat_remove_first_trs 8 \
        -tshift_opts_ts -tpattern alt+z2 \
        -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
        -tlrc_base "$basedset" \
        -tlrc_NL_warp \
        -tlrc_NL_warped_dsets \
          anatomical_warped/anatQQ.1.nii.gz \
          anatomical_warped/anatQQ.1.aff12.1D \
          anatomical_warped/anatQQ.1_WARP.nii.gz \
        -volreg_align_to MIN_OUTLIER \
        -volreg_post_vr_allin yes \
        -volreg_pvra_base_index MIN_OUTLIER \
        -volreg_align_e2a \
        -volreg_tlrc_warp \
        -mask_opts_automask -clfrac 0.10 \
        -mask_epi_anat yes \
        -blur_to_fwhm -blur_size $blur \
        -regress_motion_per_run \
        -regress_ROI_PC fsvent 3 \
        -regress_ROI_PC_per_run fsvent \
        -regress_make_corr_vols aeseg fsvent \
        -regress_anaticor_fast \
        -regress_anaticor_label fswm \
        -regress_censor_motion 0.3 \
        -regress_censor_outliers 0.1 \
        -regress_apply_mot_types demean deriv \
        -regress_est_blur_epits \
        -regress_est_blur_errts \
        -regress_run_clustsim no \
        -regress_polort 2 \
        -regress_bandpass 0.01 1 \
        -html_review_style pythonic

      We used similar command lines to generate ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

      We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes but this number can be variable (thus leading to the above issue with 3dDetrend’s -normalize). A discussion on the AFNI message board with one of our team (starting here, https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
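For a sense of scale, the 1 + int(D/150) rule mentioned above can be evaluated for a few run lengths. This short Python sketch assumes D is the run duration in seconds; the function name is illustrative.

```python
def default_polort(run_duration_seconds):
    """Polynomial detrending order from the 1 + int(D/150) rule,
    with D assumed to be the run duration in seconds."""
    return 1 + int(run_duration_seconds / 150)

for minutes in (10, 40, 50):
    print(minutes, "min ->", "polort", default_polort(minutes * 60))
```

Under that assumption, a ~40 minute run would call for a polynomial of order 17, which gives some intuition for why a low fixed order combined with a bandpass filter was suggested for long runs.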

      Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:

      • Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
      • Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
      • For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
      • For censored data:
        • Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
        • If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.

      In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
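The point about combining censoring patterns for ISC can be sketched in a few lines of Python. This is only an illustration under stated assumptions: censor lists are taken to use 1 for a kept TR and 0 for a censored TR, the function names are hypothetical, and the toy timeseries are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def isc_with_censoring(ts_a, keep_a, ts_b, keep_b):
    """Correlate two subjects using only TRs kept in both censor lists,
    i.e. drop the union of the censored TRs."""
    kept = [i for i, (ka, kb) in enumerate(zip(keep_a, keep_b)) if ka and kb]
    return pearson([ts_a[i] for i in kept], [ts_b[i] for i in kept])

# Toy data: subject B has a motion spike at TR 2, subject A at TR 4.
ts_a = [1.0, 2.0, 3.0, 4.0, 50.0, 6.0]
ts_b = [2.0, 4.0, 100.0, 8.0, 10.0, 12.0]
keep_a = [1, 1, 1, 1, 0, 1]
keep_b = [1, 1, 0, 1, 1, 1]

print(pearson(ts_a, ts_b))                             # spikes included
print(isc_with_censoring(ts_a, keep_a, ts_b, keep_b))  # spikes dropped
```

Dropping a TR whenever either subject censored it keeps the two series aligned in time, at the cost of losing more data than either subject's own censor list would remove.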

    • Effect on results

      • From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
  4. Data for A Systemic Framework for Assessing the Risk of Decarbonization to...

    • zenodo.org
    txt
    Updated Sep 18, 2025
    Cite
    Soheil Shayegh; Giorgia Coppola (2025). Data for A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union [Dataset]. http://doi.org/10.5281/zenodo.17152310
    Available download formats: txt
    Dataset updated: Sep 18, 2025
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Soheil Shayegh; Giorgia Coppola
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered: Sep 18, 2025
    Area covered: European Union
    Description

    README — Code and data
    Project: LOCALISED

    Work Package 7, Task 7.1

    Paper: A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union

    What this repo does
    -------------------
    Builds the Transition‑Risk Index (TRI) for EU manufacturing at NUTS‑2 × NACE Rev.2, and reproduces the article’s Figures 3–6:
    • Exposure (emissions by region/sector)
    • Vulnerability (composite index)
    • Risk = Exposure ⊗ Vulnerability
    Outputs include intermediate tables, the final analysis dataset, and publication figures.

    Folder of interest
    ------------------
    Code and data/
    ├─ Code/ # R scripts (run in order 1A → 5)
    │ └─ Create Initial Data/ # scripts to (re)build Initial data/ from Eurostat API with imputation
    ├─ Initial data/ # Eurostat inputs imputed for missing values
    ├─ Derived data/ # intermediates
    ├─ Final data/ # final analysis-ready tables
    └─ Figures/ # exported figures

    Quick start
    -----------
    1) Open R (or RStudio) and set the working directory to “Code and data/Code”.
    Example: setwd(".../Code and data/Code")
    2) Initial data/ contains the required Eurostat inputs referenced by the scripts.
    To reproduce the inputs in Initial data/, run the scripts in Code/Create Initial Data/.
    These scripts download the required datasets from the respective API and impute missing values; outputs are written to ../Initial data/.
    3) Run scripts sequentially (they use relative paths to ../Raw data, ../Derived data, etc.):
    1A-non-sector-data.R → 1B-sector-data.R → 1C-all-data.R → 2-reshape-data.R → 3-normalize-data-by-n-enterpr.R → 4-risk-aggregation.R → 5A-results-maps.R, 5B-results-radar.R

    What each script does
    ---------------------
    Create Initial Data — Recreate inputs
    • Download source tables from the Eurostat API or the Localised DSP, apply light cleaning, and impute missing values.
    • Write the resulting inputs to Initial data/ for the analysis pipeline.

    1A / 1B / 1C — Build the unified base
    • Read individual Eurostat datasets (some sectoral, some only regional).
    • Harmonize, aggregate, and align them into a single analysis-ready schema.
    • Write aggregated outputs to Derived data/ (and/or Final data/ as needed).

    2 — Reshape and enrich
    • Reshape the combined data and add metadata.
    • Output: Derived data/2_All_data_long_READY.xlsx (all raw indicators in tidy long format, with indicator names and values).

    3 — Normalize (enterprises & min–max)
    • Divide selected indicators by number of enterprises.
    • Apply min–max normalization to [0.01, 0.99].
    • Exposure keeps real zeros (zeros remain zero).
    • Write normalized tables to Derived data/ or Final data/.
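The normalization step can be sketched as follows. The repository's scripts are in R; this Python version is only an illustration of the idea. The function name is hypothetical, and the choice to exclude zeros from the min/max when zeros are preserved is an assumption, not something the README states.

```python
def min_max_scale(values, lo=0.01, hi=0.99, keep_zeros=False):
    """Rescale values linearly to [lo, hi]; optionally leave exact zeros at 0.

    With keep_zeros=True the min/max are taken over nonzero values only
    (an assumption made for this sketch).
    """
    candidates = [v for v in values if not (keep_zeros and v == 0)]
    if not candidates:
        return [0.0 for _ in values]
    vmin, vmax = min(candidates), max(candidates)
    span = vmax - vmin
    scaled = []
    for v in values:
        if keep_zeros and v == 0:
            scaled.append(0.0)            # real zeros remain zero
        elif span == 0:
            scaled.append((lo + hi) / 2)  # constant indicator: use the midpoint
        else:
            scaled.append(lo + (v - vmin) / span * (hi - lo))
    return scaled

vulnerability_indicator = min_max_scale([2.0, 4.0, 6.0])
exposure_indicator = min_max_scale([0.0, 5.0, 10.0, 15.0], keep_zeros=True)
print(vulnerability_indicator)
print(exposure_indicator)
```

Scaling to [0.01, 0.99] rather than [0, 1] keeps every scaled indicator strictly positive, which matters if the values are later combined multiplicatively; preserved zeros in Exposure are the deliberate exception.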

    4 — Aggregate indices
    • Vulnerability: build dimension scores (Energy, Labour, Finance, Supply Chain, Technology).
    – Within each dimension: equal‑weight mean of directionally aligned, [0.01,0.99]‑scaled indicators.
    – Dimension scores are re‑scaled to [0.01,0.99].
    • Aggregate Vulnerability: equal‑weight mean of the five dimensions.
    • TRI (Risk): combine Exposure (E) and Vulnerability (V) via a weighted geometric rule with α = 0.5 in the baseline.
    – Policy‑intuitive properties: high E & high V → high risk; imbalances penalized (non‑compensatory).
    • Output: Final data/ (main analysis tables).
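The README does not spell out the aggregation formula, so the following Python sketch assumes the usual weighted geometric mean, TRI = E^α · V^(1−α), and uses invented scores to show the non-compensatory behaviour. The function name is hypothetical.

```python
def transition_risk(exposure, vulnerability, alpha=0.5):
    """Weighted geometric mean E**alpha * V**(1 - alpha).
    The exact functional form is an assumption; see the paper for the
    authors' definition."""
    return exposure ** alpha * vulnerability ** (1 - alpha)

balanced = transition_risk(0.5, 0.5)      # equal exposure and vulnerability
imbalanced = transition_risk(0.9, 0.1)    # same arithmetic mean, unequal parts
arithmetic = 0.5 * 0.9 + 0.5 * 0.1        # compensatory alternative
no_exposure = transition_risk(0.0, 0.99)  # zero exposure gives zero risk

print(balanced, imbalanced, arithmetic, no_exposure)
```

With these toy values the imbalanced pair scores about 0.3 under the geometric rule versus 0.5 under an arithmetic mean, which is what "imbalances penalized" means in practice.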

    5A / 5B — Visualize results
    • 5A: maps and distribution plots for Exposure, Vulnerability, and Risk → Figures 3 & 4.
    • 5B: comparative/radar profiles for selected countries/regions/subsectors → Figures 5 & 6.
    • Outputs saved to Figures/.

    Data flow (at a glance)
    -----------------------
    Initial data → (1A–1C) Aggregated base → (2) Tidy long file → (3) Normalized indicators → (4) Composite indices → (5) Figures
    | | |
    v v v
    Derived data/ 2_All_data_long_READY.xlsx Final data/ & Figures/

    Assumptions & conventions
    -------------------------
    • Geography: EU NUTS‑2 regions; Sector: NACE Rev.2 manufacturing subsectors.
    • Equal weights by default where no evidence supports alternatives.
    • All indicators directionally aligned so that higher = greater transition difficulty.
    • Relative paths assume working directory = Code/.

    Reproducing the article
    -----------------------
    • Optionally run the scripts in the Code/Create Initial Data subfolder.
    • Run 1A → 5B without interruption to regenerate:
    – Figure 3: Exposure, Vulnerability, Risk maps (total manufacturing).
    – Figure 4: Vulnerability dimensions (Energy, Labour, Finance, Supply Chain, Technology).
    – Figure 5: Drivers of risk—highest vs. lowest risk regions (example: Germany & Greece).
    – Figure 6: Subsector case (e.g., basic metals) by selected regions.
    • Final tables for the paper live in Final data/. Figures export to Figures/.

    Requirements
    ------------
    • R (version per your environment).
    • Install any missing packages listed at the top of each script (e.g., install.packages("...")).

    Troubleshooting
    ---------------
    • “File not found”: check that the previous script finished and wrote its outputs to the expected folder.
    • Paths: confirm getwd() ends with /Code so relative paths resolve to ../Raw data, ../Derived data, etc.
    • Reruns: optionally clear Derived data/, Final data/, and Figures/ before a clean rebuild.

    Provenance & citation
    ---------------------
    • Inputs: Eurostat and related sources cited in the paper and headers of the scripts.
    • Methods: OECD composite‑indicator guidance; IPCC AR6 risk framing (see paper references).
    • If you use this code, please cite the article:
    A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union.

  5. Data from: Codebook vectors and predicted rare earth potential from a...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Oct 30, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Codebook vectors and predicted rare earth potential from a trained emergent self-organizing map displaying multivariate topology of geochemical and reservoir temperature data from produced and geothermal waters of the United States [Dataset]. https://catalog.data.gov/dataset/codebook-vectors-and-predicted-rare-earth-potential-from-a-trained-emergent-self-organizin
    Dataset updated: Oct 30, 2025
    Dataset provided by: United States Geological Survey (http://www.usgs.gov/)
    Area covered: United States, Earth
    Description

    This data release consists of three products relating to an 82 x 50 neuron Emergent Self-Organizing Map (ESOM), which describes the multivariate topology of reservoir temperature and geochemical data for 190 samples of produced and geothermal waters from across the United States. Variables included in the ESOM are coordinates derived from reservoir temperature and concentration of Sc, Nd, Pr, Tb, Lu, Gd, Tm, Ce, Yb, Sm, Ho, Er, Eu, Dy, F, alkalinity as bicarbonate, Si, B, Br, Li, Ba, Sr, sulfate, H (derived from pH), K, Mg, Ca, Cl, and Na converted to units of proportion. The concentration data were converted to isometric log-ratio coordinates (following Hron et al., 2010), where the first ratio has Sc serving as the denominator to the geometric mean of all of the remaining elements (Nd to Na), the second ratio has Nd serving as the denominator to the geometric mean of all of the remaining elements (Pr to Na), and so on, until the final ratio is Na to Cl. Both the temperature and log-ratio coordinates of the concentration data were normalized to a mean of zero and a sample standard deviation of one. The first table is the mean and standard deviation of all of the data in this dataset, which is used to standardize the data. The second table is the codebook vectors from the trained ESOM where all variables were standardized and compositional data converted to isometric log-ratios. The final table provides rare earth element potentials predicted for a subset of the U.S. Geological Survey Produced Waters Geochemical Database, Version 2.3 (Blondes et al., 2017) through the use of the ESOM. The original source data used to create the ESOM all come from the U.S. Department of Energy Resources Geothermal Data Repository and are detailed in Engle (2019).
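The two transformations described above can be sketched in Python as follows. This is an illustrative reading of the description, not the USGS code: the function names are hypothetical, the sqrt(k/(k+1)) scaling is the standard pivot-coordinate factor, and the sign convention (the leading part in the denominator) is taken from the wording of the description.

```python
import math

def ilr_pivot(composition):
    """Pivot (isometric log-ratio) coordinates for one composition.

    Coordinate i compares part i with the geometric mean of the parts that
    follow it. Part i sits in the denominator, as in the description above.
    """
    d = len(composition)
    logs = [math.log(x) for x in composition]
    coords = []
    for i in range(d - 1):
        remaining = d - i - 1  # number of parts after part i
        log_gmean_rest = sum(logs[i + 1:]) / remaining
        scale = math.sqrt(remaining / (remaining + 1))
        coords.append(scale * (log_gmean_rest - logs[i]))
    return coords

def standardize(column):
    """Center to mean 0 and scale to a sample standard deviation of 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

# A D-part composition yields D - 1 coordinates; rescaling the whole
# composition (e.g. changing units) leaves the coordinates unchanged.
print(ilr_pivot([0.2, 0.3, 0.5]))
print(ilr_pivot([2.0, 3.0, 5.0]))
print(standardize([1.0, 2.0, 3.0]))
```

The scale invariance shown by the two `ilr_pivot` calls is the reason log-ratio coordinates are used for concentration data before standardizing and feeding it to the map.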

  6. Data for: Trends, Reversion, and Critical Phenomena in Financial Markets

    • data.mendeley.com
    Updated Dec 11, 2020
    Cite
    Christof Schmidhuber (2020). Data for: Trends, Reversion, and Critical Phenomena in Financial Markets [Dataset]. http://doi.org/10.17632/v73nzdt7rt.1
    Dataset updated: Dec 11, 2020
    Authors: Christof Schmidhuber
    License: Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    These data accompany the publication "Trends, Reversion, and Critical Phenomena in Financial Markets".

    They contain daily data from Jan 1992 to Dec 2019 on 24 financial markets, namely

    • 6 equity indices: S&P 500, TSE 60, DAX 30, FTSE 100, Nikkei 225, Hang Seng
    • 6 Interest rates for government bonds: US 10-year, Canada 10-year, Germany 10-year, UK 10-year, Japan 10-year, Australia 3-year
    • 6 FX rates: CAD/USD, EUR/USD, GBP/USD, JPY/USD, AUD/USD, NZD/USD
    • 6 Commodities: Crude Oil, Natural Gas, Gold, Copper, Soybeans, Live Cattle

    The data are provided in 13 columns:

    • Column 1: date
    • Column 2: market
    • Column 3: daily log return of futures on that market, normalized to have mean 0 and standard deviation 1 over the 28-year time period
    • Columns 4-13: trend strengths in that market over 10 different time horizons of (2,4,8,16,32,64,128,256,512,1024) business days.

    The trend strengths are defined in the accompanying paper. They are cut off at plus/minus 2.5. The daily log returns were computed from daily futures prices, rolled 5 days prior to first notice, which were taken from Bloomberg. The following mean returns and volatilities were used to normalize the daily log returns in column 3:

    Market          Mean       St. Dev.
    S&P 500          2.217%    1.100%
    TSE 60           2.416%    1.067%
    DAX 30           1.199%    1.366%
    FTSE 100         1.053%    1.103%
    Nikkei 225      -0.483%    1.486%
    Hang Seng        0.768%    1.674%
    US 10-year       3.734%    0.366%
    Can. 10-year     3.637%    0.376%
    Ger. 10-year     4.141%    0.337%
    UK 10-year       2.983%    0.419%
    Jap. 10-year     4.453%    0.249%
    Aus. 3-year      3.029%    0.074%
    CAD/USD          0.048%    0.479%
    EUR/USD         -0.222%    0.619%
    GBP/USD          0.316%    0.597%
    JPY/USD         -0.761%    0.667%
    AUD/USD          0.851%    0.725%
    NZD/USD          1.563%    0.724%
    Crude Oil        0.093%    2.243%
    Natural Gas     -2.649%    2.985%
    Gold             0.580%    0.987%
    Copper           0.936%    1.586%
    Soybeans         0.631%    1.360%
    Live Cattle      0.483%    0.894%

  7. Brain tumor MRI and CT scan

    • kaggle.com
    zip
    Updated Oct 3, 2022
    Cite
    chenghan pu (2022). Brain tumor MRI and CT scan [Dataset]. https://www.kaggle.com/datasets/chenghanpu/brain-tumor-mri-and-ct-scan/code
    Available download formats: zip (3685497430 bytes)
    Dataset updated: Oct 3, 2022
    Authors: chenghan pu
    Description

    A novel brain tumor dataset containing 4500 2D MRI-CT slices. The original MRI and CT scans are also contained in this dataset.

    Pre-processing strategy: The pre-processing data pipeline includes pairing MRI and CT scans according to a specific time interval between CT and MRI scans of the same patient, MRI image registration to a standard template, MRI-CT image registration, intensity normalization, and extracting 2D slices from 3D volumes. The pipeline can be used to obtain classic 2D MRI-CT images from 3D Dicom format MRI and CT scans, which can be directly used as the training data for end-to-end synthetic CT deep learning networks.

    Detail:

    • Pairing MRI and CT scans: If the time interval between MRI and CT scans is too long, the information in the MRI and CT images will not match. Therefore, we pair MRI and CT scans of the same patient according to the time interval between them, which should not exceed half a year.
    • MRI image registration: Considering the differences both in the human brain and in the spatial coordinates of radiation images during scanning, the dataset must avoid individual differences and unify the coordinates, which means all the CT and MRI images should be registered to a standard template. The generated images can be more accurate after registration. The template proposed by the Montreal Neurological Institute is called the MNI ICBM 152 non-linear 6th Generation Symmetric Average Brain Stereotaxic Registration Model (MNI 152) (Grabner et al., 2006). Affine registration is first applied to register MRI scans to the MNI152 template.
    • Intensity normalization: The registered scans have some extreme values, which introduce errors that would affect the generation accuracy. We normalize the image data and eliminate these extreme values by replacing pixel values in the bottom 1% and top 1% with the pixel values at the 1% and 99% ranks, respectively.
    • Extracting 2D slices from 3D volumes: After carrying out the registration, the 3D MRI and CT scans can be represented as 237×197×189 matrices. To ensure the compatibility between training models and inputs, each 3D image is sliced, and 4500 2D MRI-CT image pairs are selected as the final training data.
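The intensity-normalization step described above amounts to clipping at the 1st and 99th percentiles. The Python sketch below shows the idea on a flat list of pixel values; it is not the dataset author's code, and the function names and the linear-interpolation percentile are assumptions made for illustration.

```python
def percentile(sorted_values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    pos = q * (len(sorted_values) - 1) / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def clip_extremes(pixels, low_q=1, high_q=99):
    """Replace values below/above the given percentiles with the
    percentile values themselves."""
    s = sorted(pixels)
    low, high = percentile(s, low_q), percentile(s, high_q)
    return [min(max(p, low), high) for p in pixels]

# Toy "image": intensities 0..100 with two extreme values swapped in.
pixels = list(range(101))
pixels[0] = -1000
pixels[100] = 5000

clipped = clip_extremes(pixels)
print(min(clipped), max(clipped))
```

For a real 3D volume the same operation would be applied to the flattened voxel array; values inside the percentile range are left untouched.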

    Source database: 1. https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=33948305 2. https://wiki.cancerimagingarchive.net/display/Public/CPTAC-GBM 3. https://wiki.cancerimagingarchive.net/display/Public/TCGA-GBM

    Patient information: Number of patients: 41

    Introduction of each file:
    Dicom: contains the source files collected from the three websites above.
    data(processed): contains the processed data, saved as .npy files. You can use train_input.npy and train_output.npy as the input and output of the encoder-decoder structure to train the model. The test and validation input and output files can be used as the test and validation datasets.

  8. Student Academic Performance (Synthetic Dataset)

    • kaggle.com
    zip
    Updated Oct 10, 2025
    Cite
    Mamun Hasan (2025). Student Academic Performance (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/student-academic-performance-synthetic-dataset
    Explore at:
    Available download formats: zip (9287 bytes)
    Dataset updated
    Oct 10, 2025
    Authors
    Mamun Hasan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.

    The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:

    • Handling missing values
    • Removing duplicates
    • Detecting and treating outliers
    • Data normalization and transformation
    • Encoding categorical variables
    • Exploratory data analysis (EDA)
    ‱ Regression Analysis

    📊 Columns Description

    Column Name: Description
    Student_ID: Unique identifier for each student (e.g., S0001, S0002, 
)
    Age: Age of the student (between 18 and 25 years)
    Gender: Gender of the student (Male/Female)
    Study_Hours: Average number of study hours per day (contains missing values and outliers)
    Attendance(%): Percentage of class attendance (contains missing values)
    Test_Score: Final exam score (0–100 scale)
    Grade: Letter grade derived from test scores (F, C, B, A, A+)

    🧠 Example Lab Tasks Using This Dataset:

    • Identify and impute missing values using mean/median.
    • Detect and remove duplicate records.
    • Use IQR or Z-score methods to handle outliers.
    • Normalize Study_Hours and Test_Score using Min-Max scaling.
    • Encode categorical variables (Gender, Grade) for model input.
    • Prepare a clean dataset ready for classification/regression analysis.
    • Can be used for Limited Regression
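The lab tasks above can be sketched in plain Python. This is an illustration under stated assumptions: the six sample rows are made up, the helper names are hypothetical, outliers are treated by clipping to the 1.5×IQR fences (one of several possible treatments), and quartiles use the "inclusive" method of `statistics.quantiles`.

```python
import statistics

# Made-up sample rows using column names from the dataset description.
rows = [
    {"Student_ID": "S0001", "Study_Hours": 2.0, "Test_Score": 55},
    {"Student_ID": "S0002", "Study_Hours": None, "Test_Score": 70},
    {"Student_ID": "S0002", "Study_Hours": None, "Test_Score": 70},  # duplicate
    {"Student_ID": "S0003", "Study_Hours": 4.0, "Test_Score": 80},
    {"Student_ID": "S0004", "Study_Hours": 3.0, "Test_Score": 65},
    {"Student_ID": "S0005", "Study_Hours": 40.0, "Test_Score": 90},  # outlier
]


def drop_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))
    return unique


def impute_median(records, column):
    """Fill missing (None) values with the median of the known values."""
    known = [r[column] for r in records if r[column] is not None]
    median = statistics.median(known)
    for r in records:
        if r[column] is None:
            r[column] = median


def clip_iqr_outliers(records, column, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values."""
    values = [r[column] for r in records]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    for r in records:
        r[column] = min(max(r[column], low), high)


def min_max_scale(records, column, out_column):
    """Write a Min-Max scaled copy of `column` (range 0..1) into `out_column`."""
    values = [r[column] for r in records]
    lo, hi = min(values), max(values)
    for r in records:
        r[out_column] = (r[column] - lo) / (hi - lo) if hi > lo else 0.0


clean = drop_duplicates(rows)
impute_median(clean, "Study_Hours")
clip_iqr_outliers(clean, "Study_Hours")
min_max_scale(clean, "Study_Hours", "Study_Hours_scaled")
```

The order matters: duplicates are removed before imputation so that repeated rows do not bias the median, and outliers are treated before Min-Max scaling so that a single extreme value does not compress the rest of the range.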

    🎯 Possible Regression Targets

    Test_Score → Predict test score based on study hours, attendance, age, and gender.

    đŸ§© Example Regression Problem

    Predict the student's test score using their study hours, attendance percentage, and age.

    🧠 Sample Features:
    X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)']
    y = ['Test_Score']

    You can use:

    • Linear Regression (for simplicity)
    • Polynomial Regression (to explore nonlinear patterns)
    • Decision Tree Regressor or Random Forest Regressor

    And analyze feature influence using correlation or SHAP/LIME explainability.

  9. Normalization of Valuations in the Public and Private Software Markets -...

    • tomtunguz.com
    Updated Oct 9, 2018
    Cite
    Tomasz Tunguz (2018). Normalization of Valuations in the Public and Private Software Markets - Data Analysis [Dataset]. https://tomtunguz.com/publi-arr-multiples/
    Explore at:
    Dataset updated
    Oct 9, 2018
    Dataset provided by
    Theory Ventures
    Authors
    Tomasz Tunguz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explore how public SaaS valuations now match private markets, with forward revenue multiples hitting 9x. Key analysis of ARR metrics and growth trends in 2018.

  10. Zurich Summer Dataset

    • zenodo.org
    zip
    Updated Jan 31, 2022
    Cite
    Vittorio Ferrari; Michele Volpi (2022). Zurich Summer Dataset [Dataset]. http://doi.org/10.5281/zenodo.5914759
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vittorio Ferrari; Michele Volpi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    ZĂŒrich
    Description

    The "Zurich Summer v1.0" dataset is a collection of 20 chips (crops) taken from a QuickBird acquisition of the city of Zurich (Switzerland) in August 2002. QuickBird images are composed of 4 channels (NIR-R-G-B) and were pansharpened to the PAN resolution of about 0.62 m GSD. We manually annotated 8 urban and periurban classes: Roads, Buildings, Trees, Grass, Bare Soil, Water, Railways and Swimming pools. The cumulative number of class samples is highly unbalanced, to reflect real-world situations. Note that the annotations are not perfect and not ultradense (not every pixel is annotated), and there might be some errors as well. We performed the annotations by jointly selecting superpixels (SLIC) and drawing (freehand) over regions to which we could confidently assign an object class.

    The dataset is composed of 20 image - ground truth pairs, in GeoTIFF format. Images are distributed as raw DN values. We provide a rough and dirty MATLAB script (preprocess.m) to:

    i) extract basic statistics from images (min, max, mean and average std) which should be used to globally normalize the data (note that class distribution of the chips is highly uneven, so single-frame normalization would shift distribution of classes).

    ii) Visualize raw DN images (with unsaturated values) and a corresponding stretched version (good for illustration purposes). It also saves a raw and adjusted image version in MATLAB format (.mat) in a local subfolder.

    iii) Convert RGB annotations to index mask (CLASS \in {1,...,C}) (via rgb2label.m provided).

    iv) Convert index mask to georeferenced RGB annotations (via rgb2label.m provided). Useful if you want to see the final maps of the tiles in some GIS software (coordinate system copied from original geotiffs).
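The point of step i), normalizing with statistics pooled over all chips instead of per chip, can be sketched as follows. This is not the provided preprocess.m: the function names are hypothetical, each chip is assumed to be a flat list of values for a single channel, and standardization by the global mean and standard deviation is an assumed choice of normalization.

```python
import math


def global_stats(images):
    """Mean and standard deviation pooled over every pixel of every image."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for img in images:
        for v in img:
            n += 1
            total += v
            total_sq += v * v
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, math.sqrt(max(var, 0.0))


def normalize(img, mean, std):
    """Standardize one image with the shared (global) statistics."""
    return [(v - mean) / std for v in img]


# Two toy single-channel chips with very different brightness (hypothetical data).
chips = [[0, 2, 4], [10, 12, 14]]
mean, std = global_stats(chips)
normalized = [normalize(chip, mean, std) for chip in chips]
```

With global statistics the dark chip stays darker than the bright one after normalization. Normalizing each chip by its own mean would map both to the same range and hide that difference, which is the distribution shift the description warns about.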

    Some requests from you

    We encourage researchers to report the ID of images used for training / validation / test (e.g. train: zh1 to zh7, validation zh8 to zh12 and test zh13 to zh20). The purpose of distributing datasets is to encourage reproducibility of experiments.

    Acknowledgements

    We release this data after a kind agreement obtained with DigitalGlobe, co. This data can be redistributed freely, provided that this document and the corresponding license are part of the distribution. Ideally, since the dataset could be updated over time, I suggest distributing the dataset via the official link from which this archive was downloaded.

    We would like to thank (a lot) Nathan Longbotham @ DigitalGlobe and the whole DG team for his / their help for granting the distribution of the dataset.

    We release this dataset hoping that it will help researchers working in semantic classification / segmentation of remote sensing data, both in comparing to other state-of-the-art methods using this dataset and in testing models on a larger and more complete set of images (with respect to most benchmarks available in our community). As you can imagine, preparing everything has been tedious work. Just for you.

    If you are using the data please cite the following work

  11. Transcriptomic analysis in three SHP-1 insufficiency mice and three wide...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Apr 7, 2022
    + more versions
    Cite
    Yan, Sijia; Tian, Hongzhe; Fu, Jiazhao; Sui, Mingxing; Zeng, Li; Ding, Xianting; Chen, Jing; Li, Yanfeng (2022). Transcriptomic analysis in three SHP-1 insufficiency mice and three wide type mice after renal ischemia-reperfusion injury [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000290215
    Explore at:
    Dataset updated
    Apr 7, 2022
    Authors
    Yan, Sijia; Tian, Hongzhe; Fu, Jiazhao; Sui, Mingxing; Zeng, Li; Ding, Xianting; Chen, Jing; Li, Yanfeng
    Description

    In kidney transplantation, the donor kidney inevitably undergoes ischemia-reperfusion injury. It is therefore important to study the pathogenesis of ischemia-reperfusion injury and to find effective measures to attenuate acute injury of the renal tubules after ischemia-reperfusion. We systematically analyzed differences in the expression profiles of three SHP-1 (encoded by Ptpn6)-insufficient mice and three wild-type mice and obtained expression data for 21367 genes by RNA sequencing.

    TopHat v2.1.0 was used with the default parameters to align the paired-end RNA-sequencing reads against the reference genome (Ensembl release 90, GRCm38.p5) and to generate alignments suitable for Cufflinks. The expression of the annotated genes in the RNA-seq data was evaluated in fragments per kilobase million (FPKM) using Cufflinks. The following formula was used to calculate the FPKM value: FPKM = (number of mapped fragments) × 10^3 × 10^6 / [(length of transcript) × (total number of fragments)]. Log transformation and zero-mean normalization were used to normalize the expression data for comparisons. A false discovery rate (FDR) of <0.05, after applying Benjamini-Hochberg correction, was chosen for determining significantly differentially expressed genes.

    Data sheet name: allsymbol.genes.expression
    This data sheet includes the whole expression data of 21367 genes in three SHP-1 (encoded by Ptpn6)-insufficient mice and three wild-type mice.

    Data sheet name: WT-vs-HE.genes.annot
    This data sheet includes the comparison of expression data of 21367 genes in three SHP-1 (encoded by Ptpn6)-insufficient mice and three wild-type mice.

    Data sheet name: WT-vs-HE.genes.filter.annot
    This data sheet includes the comparison of expression data of 161 significantly differentially expressed genes in three SHP-1 (encoded by Ptpn6)-insufficient mice and three wild-type mice.
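The FPKM formula and the subsequent normalization can be sketched as below. The function names are hypothetical, the transcript length is assumed to be in base pairs, and the description does not state the logarithm base or how zeros are handled, so log2 with a pseudocount of 1 is an assumption made only for illustration.

```python
import math


def fpkm(mapped_fragments, transcript_length_bp, total_fragments):
    """FPKM = fragments x 10^3 x 10^6 / (transcript length x total fragments)."""
    return mapped_fragments * 1e3 * 1e6 / (transcript_length_bp * total_fragments)


def log_zero_mean(values, pseudocount=1.0):
    """Log-transform (assumed: log2 with a pseudocount), then subtract the mean."""
    logged = [math.log2(v + pseudocount) for v in values]
    mean = sum(logged) / len(logged)
    return [x - mean for x in logged]


# Hypothetical numbers: 500 fragments on a 2000 bp transcript, 10 million fragments in total.
value = fpkm(500, 2000, 10_000_000)
centered = log_zero_mean([value, 3.0, 0.0])
```

The factor 10^3 converts transcript length to kilobases and the factor 10^6 converts the library size to millions of fragments, which is what makes values comparable across transcripts and samples.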

  12. Normalized Water Quality Data

    • figshare.com
    txt
    Updated Feb 22, 2022
    Cite
    Dov Stekel (2022). Normalized Water Quality Data [Dataset]. http://doi.org/10.6084/m9.figshare.19213386.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 22, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dov Stekel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Water quality data. These data have been normalised to their means over the time period with a normalised mean of 100.
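Normalizing a series to a mean of 100 amounts to dividing each value by the series mean and multiplying by 100. A minimal sketch with made-up values follows; the function name is hypothetical and the dataset's actual columns are not assumed.

```python
def normalize_to_mean_100(series):
    """Rescale a series so that its mean over the period equals 100."""
    mean = sum(series) / len(series)
    return [100.0 * x / mean for x in series]


# Made-up measurements for one variable over three time points.
scaled = normalize_to_mean_100([2.0, 4.0, 6.0])
```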

  13. Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jun 30, 2022
    Cite
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli (2022). ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches [Dataset]. http://doi.org/10.5281/zenodo.6568778
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

    We release our dataset as a set of folders indicating the patch target label (e.g., `banana`), each containing 1000 subfolders as the ImageNet output classes.

    An example showing how to use the dataset is shown below.

    # code for testing robustness of a model
    import os.path
    
    from torchvision import datasets, transforms, models
    import torch.utils.data
    
    
    class ImageFolderWithEmptyDirs(datasets.ImageFolder):
      """
      This is required for handling empty folders from the ImageFolder Class.
      """
    
      def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
          raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx
    
    
    # extract and unzip the dataset, then write top folder here
    dataset_folder = 'data/ImageNet-Patch'
    
    available_labels = {
      487: 'cellular telephone',
      513: 'cornet',
      546: 'electric guitar',
      585: 'hair spray',
      804: 'soap dispenser',
      806: 'sock',
      878: 'typewriter keyboard',
      923: 'plate',
      954: 'banana',
      968: 'cup'
    }
    
    # select folder with specific target
    target_label = 954
    
    dataset_folder = os.path.join(dataset_folder, str(target_label))
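    # standard ImageNet channel statistics
    normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                      std=[0.229, 0.224, 0.225])
    # use a distinct name so the torchvision `transforms` module is not shadowed
    preprocess = transforms.Compose([
      transforms.ToTensor(),
      normalizer
    ])
    
    dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=preprocess)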
    model = models.resnet50(pretrained=True)
    loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
    model.eval()
    
    batches = 10  # evaluate only the first 10 batches
    correct, attack_success, total = 0, 0, 0
    with torch.no_grad():  # inference only, no gradients needed
      for batch_idx, (images, labels) in enumerate(loader):
        if batch_idx == batches:
          break
        pred = model(images).argmax(dim=1)
        # .item() converts the 0-dim tensors to plain Python ints
        correct += (pred == labels).sum().item()
        attack_success += (pred == target_label).sum().item()
        total += pred.shape[0]
    
    accuracy = correct / total
    attack_sr = attack_success / total
    
    print("Robust Accuracy: ", accuracy)
    print("Attack Success: ", attack_sr)
    

  14. Assessing the impact of hints in learning formal specification: Research...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 29, 2024
    Cite
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    INESC TEC
    Centro de Computação Gråfica
    Authors
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study to investigate the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, but also in the emotional response of the students. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires passed to assess the emotional response and user experience), as well as the collected data and data analysis scripts used for the discussion in the paper.

    Dataset

    The artifact contains the resources described below.

    Experiment resources

    The resources needed for replicating the experiment, namely in directory experiment:

    alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

    alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.

    docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

    api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

    Experiment data

    The task database used in our application of the experiment, namely in directory data/experiment:

    Model.json, Instance.json, and Link.json: JSON files to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

    identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

    Collected data

    Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files with a header row, namely in directory data/results:

    data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

    data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20; NA denotes a preference not to disclose).

    data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

    detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

    data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

    participants.txt: the list of participant identifiers that have registered for the experiment.

    Analysis scripts

    The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

    analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

    requirements.r: An R script to install the required libraries for the analysis script.

    normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

    normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

    Dockerfile: Docker script to automate the analysis script from the collected data.

    Setup

    To replicate the experiment and the analysis of the results, only Docker is required.

    If you wish to manually replicate the experiment and collect your own data, you'll need to install:

    A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

    If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

    Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

    R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

    Usage

    Experiment replication

    This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

    To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch, wait for the "Started your app" message to show.

    cd experiment
    docker-compose up

    This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

    In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task becomes available after a correct submission to the current task or when a time-out occurs (5 min). Each participant was assigned to one of the treatment groups, so different kinds of hints are provided depending on the permalink. Below are 4 permalinks, one for each hint group:

    Group N (no hints): http://localhost:3000/0CAN

    Group L (error locations): http://localhost:3000/CA0L

    Group E (counter-example): http://localhost:3000/350E

    Group D (error description): http://localhost:3000/27AD

    In the 2nd session, as in the 1st, each permalink gave access to 12 sequential tasks, and the next task becomes available after a correct submission or a time-out (5 min). The permalink is constructed by prefixing the participant's identifier with P-, so participant 0CAN would access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints, so the permalinks of the different groups behave identically.
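The permalink scheme for the two sessions can be sketched as a small helper. The function names `hint_group` and `permalink` are hypothetical and not part of the artifact; the base URL, the P- prefix, and the group letters are taken from the description above.

```python
BASE_URL = "http://localhost:3000"

# Treatment groups, identified by the last character of the participant identifier.
GROUPS = {
    "N": "no hints",
    "L": "error locations",
    "E": "counter-example",
    "D": "error description",
}


def hint_group(identifier):
    """Return the group letter encoded in the last character of the identifier."""
    code = identifier[-1:]
    if code not in GROUPS:
        raise ValueError(f"unknown hint group in identifier {identifier!r}")
    return code


def permalink(identifier, session):
    """Build the task permalink for session 1 or 2."""
    hint_group(identifier)  # validate the identifier first
    if session == 1:
        return f"{BASE_URL}/{identifier}"
    if session == 2:
        return f"{BASE_URL}/P-{identifier}"
    raise ValueError("session must be 1 or 2")


first = permalink("0CAN", 1)
second = permalink("0CAN", 2)
```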

    Before the 1st session, the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

    Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment to each of the 14 depicted emotions, expressed on a 5-point Likert scale.

    After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the unique user identifier and answers to the standard 4 questions on a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric (a score ranging from 0 to 100) from the answers, please see the original paper:

    Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.

    Analysis of other applications of the experiment

    This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

    The analysis script expects data in 4 CSV files,

  15. Machine Failure Predictions

    • kaggle.com
    zip
    Updated Jun 20, 2023
    Cite
    kozue (2023). Machine Failure Predictions [Dataset]. https://www.kaggle.com/datasets/shashanknecrothapa/machine-failure-predictions
    Explore at:
    Available download formats: zip (1191488 bytes)
    Dataset updated
    Jun 20, 2023
    Authors
    kozue
    Description

    Machine failure prediction refers to the task of using machine learning and data analysis techniques to predict when a machine or equipment is likely to fail or experience a breakdown. By analyzing historical data and identifying patterns and indicators, machine failure prediction models can provide early warnings or alerts, enabling proactive maintenance and minimizing downtime.

    Here is an overview of the process of machine failure predictions:

    1. Data Collection: Relevant data is collected from the machines or equipment, such as sensor readings, operational parameters, maintenance records, and historical failure data. This data serves as the basis for training and building the predictive models.

    2. Data Preprocessing: The collected data is cleaned, organized, and preprocessed to remove noise, handle missing values, and normalize the data. Feature engineering techniques may be applied to extract relevant features that capture patterns related to machine failures.

    3. Feature Selection: Selecting the most informative features is crucial for building accurate prediction models. Various techniques, such as statistical analysis, correlation analysis, or domain knowledge, can be employed for feature selection.

    4. Model Development: Machine learning algorithms, such as classification, regression, or time series analysis methods, are applied to train prediction models using the preprocessed data. The choice of algorithms depends on the nature of the data and the specific requirements of the prediction task.

    5. Model Evaluation and Validation: The developed models are evaluated using suitable evaluation metrics to assess their performance and generalization capabilities. Cross-validation techniques may be employed to ensure robustness and reliability of the models.

    6. Prediction and Maintenance Planning: Once the models are trained and validated, they can be used to predict machine failures in real-time. These predictions can help in scheduling preventive maintenance, optimizing resource allocation, and minimizing costly unplanned downtime.

    By accurately predicting machine failures in advance, organizations can improve operational efficiency, reduce maintenance costs, enhance safety, and maximize the lifespan of their machines and equipment.

  16. Student Performance and Learning Behavior Dataset

    • kaggle.com
    zip
    Updated Sep 4, 2025
    + more versions
    Cite
    Adil Shamim (2025). Student Performance and Learning Behavior Dataset [Dataset]. https://www.kaggle.com/datasets/adilshamim8/student-performance-and-learning-style
    Explore at:
    Available download formats: zip (78897 bytes)
    Dataset updated
    Sep 4, 2025
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.

    It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.

    Key Features

    ‱ Study behaviors & engagement → StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions
    ‱ Resources & environment → Resources, Internet, EduTech
    ‱ Motivation & psychology → Motivation, StressLevel
    ‱ Demographics → Gender, Age (18–30 years)
    ‱ Learning preference → LearningStyle
    ‱ Performance indicators → ExamScore, FinalGrade

    Objectives & Use Cases

    The dataset can be used for:

    ‱ Predictive modeling → Regression/classification of student performance (ExamScore, FinalGrade)
    ‱ Clustering analysis → Identifying learning behavior groups with K-Means or other unsupervised methods
    ‱ Educational analytics → Exploring how study habits, stress, and motivation affect outcomes
    ‱ Adaptive learning research → Linking behavioral patterns to personalized learning pathways

    Analysis Pipeline (from original study)

    The dataset was analyzed in Python using:

    • Preprocessing → Encoding, normalization (z-score, Min–Max), deduplication
    • Clustering → K-Means, Elbow Method, Silhouette Score, Davies–Bouldin Index
    • Dimensionality Reduction → PCA (2D/3D visualizations)
    • Statistical Analysis → ANOVA, regression for group differences
    • Interpretation → Mapping clusters to LearningStyle categories & extracting insights for adaptive learning
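The two normalization schemes named in the preprocessing step can be sketched as follows. This is a minimal standard-library illustration, not the original study's code: the function names and the sample values are assumptions made for the example, and the values are not taken from merged_dataset.csv.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values to zero mean and unit population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical StudyHours values, made up for illustration only.
study_hours = [2.0, 4.0, 6.0, 8.0]
z = z_score(study_hours)
mm = min_max(study_hours)
```

Z-scores are often preferred ahead of distance-based clustering such as K-Means because every feature ends up with comparable spread, whereas Min-Max scaling keeps values in a fixed [0, 1] range.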

    File

    • merged_dataset.csv → 14,003 rows × 16 columns. Includes student demographics, behaviors, engagement, learning styles, and performance indicators.

    Provenance

    This dataset is an excellent playground for educational data mining — from clustering and behavioral analytics to predictive modeling and personalized learning applications.

  17. CAncer bioMarker Prediction Pipeline (CAMPP)—A standardized framework for...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Thilde Terkelsen; Anders Krogh; Elena Papaleo (2023). CAncer bioMarker Prediction Pipeline (CAMPP)—A standardized framework for the analysis of quantitative biological data [Dataset]. http://doi.org/10.1371/journal.pcbi.1007665
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Thilde Terkelsen; Anders Krogh; Elena Papaleo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variances has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid bioinformatic software-users with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R-package updates, a renv .lock file is provided to ensure R-package stability. Data-management includes missing value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate bio-statistical framework.

  18. Left ventricular mass is underestimated in overweight children because of...

    • plos.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Hubert Krysztofiak; Marcel Młyńczak; Łukasz A. Małek; Andrzej Folga; Wojciech Braksator (2023). Left ventricular mass is underestimated in overweight children because of incorrect body size variable chosen for normalization [Dataset]. http://doi.org/10.1371/journal.pone.0217637
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hubert Krysztofiak; Marcel Młyńczak; Łukasz A. Małek; Andrzej Folga; Wojciech Braksator
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Left ventricular mass normalization for body size is recommended, but a question remains: what is the best body size variable for this normalization: body surface area, height or lean body mass computed based on a predictive equation? Since body surface area and computed lean body mass are derivatives of body mass, normalizing for them may result in underestimation of left ventricular mass in overweight children. The aim of this study is to indicate which of the body size variables normalize left ventricular mass without underestimating it in overweight children.

    Methods: Left ventricular mass assessed by echocardiography, height and body mass were collected for 464 healthy boys, 5–18 years old. Lean body mass and body surface area were calculated. Left ventricular mass z-scores computed based on reference data, developed for height, body surface area and lean body mass, were compared between overweight and non-overweight children. The next step was a comparison of paired samples of expected left ventricular mass, estimated for each normalizing variable based on two allometric equations, the first developed for overweight children, the second for children of normal body mass.

    Results: The mean of left ventricular mass z-scores is higher in overweight children compared to non-overweight children for normative data based on height (0.36 vs. 0.00) and lower for normative data based on body surface area (-0.64 vs. 0.00). Left ventricular mass estimated normalizing for height, based on the equation for overweight children, is higher in overweight children (128.12 vs. 118.40); however, masses estimated normalizing for body surface area and lean body mass, based on equations for overweight children, are lower in overweight children (109.71 vs. 122.08 and 118.46 vs. 120.56, respectively).

    Conclusion: Normalization for body surface area and for computed lean body mass, but not for height, underestimates left ventricular mass in overweight children.

  19. Evaluating Strategies to Normalise Biological Replicates of Western Blot...

    • plos.figshare.com
    tiff
    Updated Jun 2, 2023
    Cite
    Andrea Degasperi; Marc R. Birtwistle; Natalia Volinsky; Jens Rauch; Walter Kolch; Boris N. Kholodenko (2023). Evaluating Strategies to Normalise Biological Replicates of Western Blot Data [Dataset]. http://doi.org/10.1371/journal.pone.0087293
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrea Degasperi; Marc R. Birtwistle; Natalia Volinsky; Jens Rauch; Walter Kolch; Boris N. Kholodenko
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Western blot data are widely used in quantitative applications such as statistical testing and mathematical modelling. To ensure accurate quantitation and comparability between experiments, Western blot replicates must be normalised, but it is unclear how the available methods affect statistical properties of the data. Here we evaluate three commonly used normalisation strategies: (i) by fixed normalisation point or control; (ii) by sum of all data points in a replicate; and (iii) by optimal alignment of the replicates. We consider how these different strategies affect the coefficient of variation (CV) and the results of hypothesis testing with the normalised data. Normalisation by fixed point tends to increase the mean CV of normalised data in a manner that naturally depends on the choice of the normalisation point. Thus, in the context of hypothesis testing, normalisation by fixed point reduces false positives and increases false negatives. Analysis of published experimental data shows that choosing normalisation points with low quantified intensities results in a high normalised data CV and should thus be avoided. Normalisation by sum or by optimal alignment redistributes the raw data uncertainty in a mean-dependent manner, reducing the CV of high intensity points and increasing the CV of low intensity points. This causes the effect of normalisations by sum or optimal alignment on hypothesis testing to depend on the mean of the data tested; for high intensity points, false positives are increased and false negatives are decreased, while for low intensity points, false positives are decreased and false negatives are increased. These results will aid users of Western blotting to choose a suitable normalisation strategy and also understand the implications of this normalisation for subsequent hypothesis testing.
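Two of the three strategies described above, normalisation by a fixed point and normalisation by sum, can be sketched in a few lines, together with the coefficient of variation (CV) used to compare them. This is an illustrative sketch only: the function names and the intensity values are assumptions made up for the example, not data or code from the study, and normalisation by optimal alignment is not shown.

```python
from statistics import mean, stdev


def normalise_by_fixed_point(replicate, index):
    """Divide every data point in a replicate by the value at the chosen index."""
    reference = replicate[index]
    return [x / reference for x in replicate]


def normalise_by_sum(replicate):
    """Divide every data point in a replicate by the sum of that replicate."""
    total = sum(replicate)
    return [x / total for x in replicate]


def cv_per_condition(replicates):
    """CV (sample standard deviation / mean) of each condition across replicates."""
    return [stdev(column) / mean(column) for column in zip(*replicates)]


# Hypothetical band intensities: three replicates (rows) of three conditions (columns).
replicates = [
    [10.0, 20.0, 40.0],
    [12.0, 22.0, 50.0],
    [9.0, 21.0, 38.0],
]

fixed = [normalise_by_fixed_point(r, 0) for r in replicates]
by_sum = [normalise_by_sum(r) for r in replicates]

cv_fixed = cv_per_condition(fixed)
cv_sum = cv_per_condition(by_sum)
```

With a fixed normalisation point, the chosen condition equals 1 in every replicate, so its CV is zero by construction and its variability is pushed onto the other conditions. With normalisation by sum, each replicate sums to 1.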

  20. De-identified data for use in analyses.

    • plos.figshare.com
    • figshare.com
    csv
    Updated Dec 19, 2024
    + more versions
    Cite
    Carly A. Busch; Margaret Barstow; Sara E. Brownell; Katelyn M. Cooper (2024). De-identified data for use in analyses. [Dataset]. http://doi.org/10.1371/journal.pmen.0000086.s008
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Carly A. Busch; Margaret Barstow; Sara E. Brownell; Katelyn M. Cooper
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression and anxiety are among the most common mental health concerns for science and engineering (S&E) undergraduates in the United States (U.S.), and students perceive they would benefit from knowing a S&E instructor with depression or anxiety. However, it is unknown how prevalent depression and anxiety are among S&E instructors and whether instructors disclose their depression or anxiety to their undergraduates. These identities are unique because they are concealable stigmatized identities (CSIs), meaning they can be kept hidden and carry negative stereotypes. To address these gaps, we surveyed 2013 S&E faculty instructors across U.S. very high research activity doctoral-granting institutions. The survey assessed the extent to which they had and revealed depression or anxiety to undergraduates, why they chose to reveal or conceal their depression or anxiety, and the benefits of revealing depression or anxiety. These items were developed based on prior studies exploring why individuals conceal or reveal CSIs including mental health conditions. Of the university S&E instructors surveyed, 23.9% (n = 482) reported having depression and 32.8% (n = 661) reported having anxiety. Instructors who are women, white, Millennials, or LGBTQ+ are more likely to report depression or anxiety than their counterparts. Very few participants revealed their depression (5.4%) or anxiety (8.3%) to undergraduates. Instructors reported concealing their depression and anxiety because they do not typically disclose to others or because it is not relevant to course content. Instructors anticipated that undergraduates would benefit from disclosure because it would normalize struggling with mental health and provide an example of someone with depression and anxiety who is successful in S&E. Despite undergraduates reporting a need for role models in academic S&E who struggle with mental health and depression/anxiety being relatively common among U.S. S&E instructors, our study found that instructors rarely reveal these identities to their undergraduates.

Cite
Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas (2023). LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time PCR (qPCR) Data as an Alternative to Reference Gene Based Methods [Dataset]. http://doi.org/10.1371/journal.pone.0135852

LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time PCR (qPCR) Data as an Alternative to Reference Gene Based Methods

Explore at:
10 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: pdf
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Background: Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized with excellent sensitivity, dynamic range, reproducibility and is still regarded to be the gold standard for quantifying transcripts abundance. Parallelization of qPCR such as by microfluidic Taqman Fluidigm Biomark Platform enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. Most widely used methods for evaluating or calculating gene expression data include geNorm and ΔΔCt, respectively. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.

Results: We developed a RG independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods gave similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria, but became differentially expressed in normalized data evaluated by a t-test. geNorm-normalized data showed an effect of a shifted mean per gene per condition whereas LEMming-normalized data did not. Comparing the decrease of standard deviation from raw data to geNorm and to LEMming, the latter was superior. In data set 3 according to geNorm calculated average expression stability and pairwise variation, stable RGs were available, but t-tests of raw data contradicted this. Normalization with RGs resulted in distorted data contradicting literature, while LEMming normalized data did not.

Conclusions: If RGs are coexpressed but are not independent of the experimental conditions the stability criteria based on inter- and intragroup variation fail. The linear error model developed, LEMming, overcomes the dependency of using RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect an additional measurement is needed. Quantification of total cDNA content per sample helps to identify systematic errors.
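
For context on the reference-gene approach that the description contrasts with LEMming, the standard ΔΔCt calculation can be sketched as below. This is not the LEMming model. It is the textbook ΔΔCt formula under the usual assumption of perfect doubling per cycle (100% amplification efficiency), and the function names and Ct values are hypothetical, chosen only for illustration.

```python
def delta_delta_ct(ct_target_treated, ct_ref_treated,
                   ct_target_control, ct_ref_control):
    """Return ΔΔCt: the reference-normalized Ct difference, treated minus control."""
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    return delta_treated - delta_control


def fold_change(ddct):
    """Relative expression 2^(-ΔΔCt), assuming the product doubles every cycle."""
    return 2.0 ** (-ddct)


# Hypothetical Ct values: the target gene crosses the threshold two cycles
# earlier in the treated sample, while the reference gene is unchanged.
ddct = delta_delta_ct(24.0, 18.0, 26.0, 18.0)
fc = fold_change(ddct)
```

Because the reference-gene Ct enters both ΔCt terms directly, any treatment-dependent shift in the reference gene propagates into the fold change, which is the kind of bias the description attributes to reference-gene-based methods.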
