12 datasets found
  1. Input data and some models (all except multi-model ensembles) for JAMES...

    • zenodo.org
    tar
    Updated Nov 8, 2023
    Cite
    Ryan Lagerquist (2023). Input data and some models (all except multi-model ensembles) for JAMES paper "Machine-learned uncertainty quantification is not magic" [Dataset]. http://doi.org/10.5281/zenodo.10081205
    Explore at:
    Available download formats: tar
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ryan Lagerquist
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The tar file contains two directories: data and models. Within "data," there are 4 subdirectories: "training" (the clean training data, without perturbations), "training_all_perturbed_for_uq" (the lightly perturbed training data), "validation_all_perturbed_for_uq" (the moderately perturbed validation data), and "testing_all_perturbed_for_uq" (the heavily perturbed testing data). The data in these directories are unnormalized. The subdirectories "training" and "training_all_perturbed_for_uq" each contain a normalization file. These normalization files contain the parameters used to normalize the data (from physical units to z-scores) for Experiment 1 and Experiment 2, respectively. To do the normalization, you can use the script normalize_examples.py in the code library (ml4rt) with the argument input_normalization_file_name set to one of these two file paths. The other arguments should be as follows (an example invocation is sketched after the argument list):

    --uniformize=1

    --predictor_norm_type_string="z_score"

    --vector_target_norm_type_string=""

    --scalar_target_norm_type_string=""
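    Putting these arguments together, a full invocation would look roughly like the line below. The normalization-file path is a placeholder, and the script presumably also takes arguments pointing to the example files to be normalized, which are omitted here; check the script's help text in ml4rt for the exact set of flags.

    python normalize_examples.py --input_normalization_file_name="data/training/<normalization_file>" --uniformize=1 --predictor_norm_type_string="z_score" --vector_target_norm_type_string="" --scalar_target_norm_type_string=""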

    Within the directory "models," there are 6 subdirectories: for the BNN-only models trained with clean and lightly perturbed data, for the CRPS-only models trained with clean and lightly perturbed data, and for the BNN/CRPS models trained with clean and lightly perturbed data. To read the models into Python, you can use the method neural_net.read_model in the ml4rt library.
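    For example, a model could be read with a snippet along the following lines; the import path inside ml4rt and the file path are assumptions (only the existence of neural_net.read_model is stated above), so this is a sketch rather than verified usage:

    # Hedged sketch: the module location and file layout inside ml4rt may differ.
    from ml4rt import neural_net

    model_file_name = "models/<subdirectory>/<model_file>"  # placeholder path
    model = neural_net.read_model(model_file_name)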

  2. Transformer network trained on simulated X-ray photoelectron spectroscopy...

    • researchdata.tuwien.at
    bin, csv, json, zip
    Updated Oct 17, 2025
    Cite
    Florian Simperl (2025). Transformer network trained on simulated X-ray photoelectron spectroscopy data for organic and inorganic compounds [Dataset]. http://doi.org/10.48436/eybcx-t0a02
    Explore at:
    Available download formats: csv, json, bin, zip
    Dataset updated
    Oct 17, 2025
    Dataset provided by
    TU Wien
    Authors
    Florian Simperl
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This data repository provides the underlying data and neural network training scripts associated with the manuscript "A Transformer Network for High-Throughput Materials Characterization with X-ray Photoelectron Spectroscopy" by Simperl and Werner, published in the Journal of Applied Physics (2025), https://doi.org/10.1063/5.0296600.

    All data files are released under the Creative Commons Attribution 4.0 International (CC-BY) license, while all code files are distributed under the MIT license.

    The repository contains simulated X-ray photoelectron spectroscopy (XPS) spectra, stored as HDF5 files in the zipped folder h5_files.zip, which were generated using the software developed by the authors. The NIST Standard Reference Database 100 – Simulation of Electron Spectra for Surface Analysis (SESSA) is freely available at https://www.nist.gov/srd/nist-standard-reference-database-100.

    The neural network architecture is implemented using the PyTorch Lightning framework and is fully available within the attached materials as Transformer_SimulatedSpectra.py contained in the python_scripts.zip.

    The trained model and the list of materials for the train, test and validation sets are contained in the models.zip folder.

    The repository contains all the data necessary to replot the figures from the manuscript. These data are available in the form of .csv files or .h5 files for the spectra. In addition, the repository also contains a Python script (Plot_Data_Manuscript.ipynb) which is contained in the python_scripts.zip file.

    Context and methodology

    The dataset and accompanying Python code files included in this repository were used to train a transformer-based neural network capable of directly inferring chemical concentrations from simulated survey X-ray photoelectron spectroscopy (XPS) spectra of bulk compounds.

    The spectral dataset provided here represents the raw output from the SESSA software (version 2.2.2), prior to the normalization procedure described in the associated manuscript. This normalization step is essential for effective training of the neural network.

    The repository contains the Python scripts utilised to execute the spectral simulations and the neural network training on the Vienna Scientific Cluster (VSC5) which is part of the Austrian Scientific Computing Infrastructure (ASC). In order to obtain guidance on the proper configuration of the Command Line Interface (CLI) tools required for SESSA, users are advised to consult the official SESSA manual, which is available at the following address: https://nvlpubs.nist.gov/nistpubs/NSRDS/NIST.NSRDS.100-2024.pdf.

    To run the neural network training, we provide the requirements_nn_training.txt file, which lists all the necessary Python packages and version numbers. All other Python scripts can be run locally with the Python libraries listed in requirements_data_analysis.txt.

    Data details

    HDF5 (in zip folder): As described in the manuscript, we simulate X-ray photoelectron spectra for each of the 7,587 inorganic [1] and organic [2] materials in our dataset. To reflect realistic experimental conditions, each simulated spectrum was augmented by systematically varying parameters such as peak width, peak shift, and peak type—all configurable within the SESSA software—as well as by applying statistical Poisson noise to simulate varying signal-to-noise ratios. These modifications account for experimentally observed and material-specific spectral broadening, peak shifts, and detector-induced noise. Each material is represented by an individual HDF5 (.h5) file, named according to its chemical formula and mass density (in g/cm³). For example, the file for SiO2 with a density of 2.196 g/cm³ is named SiO2_2.196.h5. For more complex chemical formulas, such as Co(ClO4)2 with a density of 3.33 g/cm³, the file is named Co_ClO4_2_3.33.h5. Within each HDF5 file, the metadata for each spectrum is stored alongside a fixed energy axis and the corresponding intensity values. The spectral data are organized hierarchically by augmentation parameters in the following directory structure, e.g. for Ac_10.0.h5 we have SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0/. These files can be easily inspected with H5Web in Visual Studio Code, with h5py in Python, or with any other HDF5-capable tool.
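    A quick way to explore one of these files is with h5py; the group path below follows the example above, while the names of the datasets holding the energy axis and intensities may differ, so the snippet simply lists whatever the file contains:

    import h5py

    # Walk the augmentation hierarchy of a single material file.
    with h5py.File("Ac_10.0.h5", "r") as f:
        f.visit(print)  # print every group/dataset path in the file
        group = f["SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0"]
        for name, item in group.items():
            print(name, item)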

    Session Files: The .ses files are SESSA-specific input files that can be directly loaded into SESSA to specify certain input parameters for the initialization (ini), the geometry (geo), and the simulation parameters (sim_para); they are required by the Python script Simulation_Script_VSC_json.py to run the simulation on the cluster.

    Json Files: The two json files (MaterialsListVSC_gauss.json, MaterialsListVSC_lorentz.json) are used as the input files to the Python script Simulation_Script_VSC_json.py. These files contain all the material specific information for the SESSA simulation.

    csv files: The csv files are used to generate the plots from the manuscript described in the section "Plotting Scripts".

    npz files: The two .npz files (element_counts.npz, single_elements.npz) are NumPy array files needed by the Transformer_SimulatedSpectra.py script; they contain the number of occurrences of each single element in the dataset and an array of each single element present, respectively.

    SESSA Simulation Script

    There is one Python file that sets up the communication with SESSA:

    • Simulation_Script_VSC_json.py: This script is the heart of the simulation, as it controls the communication with SESSA through the CLI using the input parameters specified in the .json and .ses files, together with external functions specified in VSC_function.py.

    Technical Details

    Simulation_Script_VSC_json.py: This script uses the functions of the VSC_function.py script (which therefore needs to be placed in the same directory as this script) and can be called with the following command:

    python3 Simulation_Script_VSC_json.py MaterialsListVSC_gauss.json 0

    It simulates the spectrum for the material at index 0 in the .json file and with the corresponding parameters specified in the .json file.

    Before running this script, the following paths need to be specified:

    • sessa_path: The path to the SESSA installation.
    • folder_path: The path to the .ses session files. In this directory, an output folder will be generated where all the output files, including the simulated spectra, are written.

    To run SESSA on a computing cluster, it is important to have a working Xvfb (virtual frame buffer) or a similar tool available to which any graphical output from SESSA can be written.

    Neural Network Training Script

    Before running the training script, it is important to normalize the data such that the squared integral of the spectrum is 1, as described in the manuscript and implemented in normalize_spectra.py.
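    A minimal sketch of that normalization, assuming "squared integral" means that the integral of the squared intensities over the energy axis equals one and that the energy grid is (approximately) uniform; check normalize_spectra.py for the exact convention used in the paper:

    import numpy as np

    def normalize_spectrum(energy, intensity):
        # Scale the spectrum so that the integral of intensity**2 over energy is 1.
        de = np.diff(energy).mean()  # assumes a uniform energy grid
        norm = np.sqrt(np.sum(intensity ** 2) * de)
        return intensity / norm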

    For the neural network training we use Transformer_SimulatedSpectra.py, with external helper functions defined in external_functions.py. This script contains the full description of the neural network architecture, the hyperparameter tuning, and the Wandb logging.

    The models.zip folder contains the fully trained network presented in the manuscript (final_trained_model.ckpt) as well as the lists of training, validation, and testing materials (train_materials_list.pt, val_materials_list.pt, test_materials_list.pt), from which the corresponding spectra are extracted from the HDF5 files. The .ckpt and .pt files can be read with the PyTorch-specific load functions in Python, e.g.

    torch.load('train_materials_list.pt')

    Technical Details

    normalize_spectra.py: To run this script properly it is important to set up a python environment with the necessary libraries specified in the requirements_data_analysis.txt file. Then it can be called with

    python3 normalize_spectra.py

    where it is important to specify the path to the .h5 files containing the unnormalized spectra.

    Transformer_SimulatedSpectra.py: To run this script properly on the cluster, it is important to set up a Python environment with the libraries specified in requirements_nn_training.txt. The script also relies on external_functions.py, single_elements.npz, and element_counts.npz, which should be placed in the same directory as the Python script. These files are needed to create the datasets for training, validation, and testing, and they ensure that all the single elements appear in the testing set. You can call this script (on the cluster) within a Slurm script to start the GPU training.

    python3 Transformer_SimulatedSpectra.py

    Before running this script, the following paths need to be specified:

    • data_path: General path where all the data is stored
    • neural_network_data: The location where you keep your normalized hdf5 files
    • wandb_api_key: The Weights & Biases API key to use for logging

  3. Assessing the impact of hints in learning formal specification: Research...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 29, 2024
    Cite
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    INESC TEC
    Centro de Computação Gráfica
    Authors
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study investigating the impact of different types of automated hints while learning a formal specification language, in terms of immediate performance and learning retention as well as the emotional response of the students. This research artifact provides all the material required to replicate the study (except for the proprietary questionnaires used to assess emotional response and user experience), as well as the collected data and the data analysis scripts used for the discussion in the paper.

    Dataset

    The artifact contains the resources described below.

    Experiment resources

    The resources needed for replicating the experiment, namely in directory experiment:

    alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

    alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.

    docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

    api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

    Experiment data

    The task database used in our application of the experiment, namely in directory data/experiment:

    Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

    identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

    Collected data

    Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files with a header row, namely in directory data/results:

    data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

    data_socio.csv: data collected from the socio-demographic questionnaire in the 1st session of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes preference not to disclose).

    data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

    detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

    data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

    participants.txt: the list of participant identifiers that have registered for the experiment.

    Analysis scripts

    The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

    analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

    requirements.r: An R script to install the required libraries for the analysis script.

    normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

    normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

    Dockerfile: Docker script to automate the analysis script from the collected data.

    Setup

    To replicate the experiment and the analysis of the results, only Docker is required.

    If you wish to manually replicate the experiment and collect your own data, you'll need to install:

    A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

    If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

    Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

    R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

    Usage

    Experiment replication

    This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

    To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch; wait for the "Started your app" message to show.

    cd experiment
    docker-compose up

    This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

    In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, one for each hint group:

    Group N (no hints): http://localhost:3000/0CAN

    Group L (error locations): http://localhost:3000/CA0L

    Group E (counter-example): http://localhost:3000/350E

    Group D (error description): http://localhost:3000/27AD

    In the 2nd session, as in the 1st, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-, so participant 0CAN would just access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.

    Before the 1st session, the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

    Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier and for the attachment to each of the 14 depicted emotions, expressed on a 5-point Likert scale.

    After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:

    Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.

    Analysis of other applications of the experiment

    This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

    The analysis script expects data in 4 CSV files,

  4. Metabolomics Data Preprocessing PQN PCA

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
    Explore at:
    Available download formats: zip (22763 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

    The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.

    Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.

    Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.

    Includes data visualization techniques to interpret PCA results effectively.

    Suitable for metabolomics researchers and data scientists working on omics data.

    Enables better reproducibility of preprocessing workflows for metabolomics studies.

    Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.

    Provides a Python-based notebook that is easy to adapt to new datasets.

    Includes example datasets and code snippets for immediate application.

    Helps users understand the impact of normalization on downstream statistical analyses.

    Supports integration with other metabolomics pipelines or machine learning workflows.
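    A minimal sketch of the two core steps (PQN followed by PCA); the array shapes and random toy data below are illustrative and not part of the dataset itself:

    import numpy as np
    from sklearn.decomposition import PCA

    def pqn_normalize(X):
        # Probabilistic Quotient Normalization: divide each sample (row) by the
        # median of its feature-wise quotients against a reference spectrum
        # (here the median spectrum across samples) to correct dilution effects.
        reference = np.median(X, axis=0)
        quotients = X / reference
        dilution = np.median(quotients, axis=1, keepdims=True)
        return X / dilution

    # Toy data: 20 samples x 50 metabolite features
    rng = np.random.default_rng(0)
    X = rng.lognormal(mean=1.0, sigma=0.3, size=(20, 50))
    X_norm = pqn_normalize(X)

    # PCA for exploratory analysis of the normalized data
    scores = PCA(n_components=2).fit_transform(X_norm)
    print(scores.shape)  # (20, 2)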

  5. Research data underpinning "Investigating Reinforcement Learning Approaches...

    • data.ncl.ac.uk
    application/csv
    Updated Aug 13, 2024
    Cite
    Zheng Luo (2024). Research data underpinning "Investigating Reinforcement Learning Approaches In Stock Market Trading" [Dataset]. http://doi.org/10.25405/data.ncl.26539735.v1
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    Newcastle University
    Authors
    Zheng Luo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The final dataset used for the publication "Investigating Reinforcement Learning Approaches In Stock Market Trading" was produced by downloading and combining data from multiple reputable sources to suit the specific needs of this project. Raw data were retrieved using a Python finance API; Python and NumPy were then used to combine and normalise the data into the final dataset.

    The raw data were sourced as follows: stock prices of NVIDIA and AMD, financial indexes, and commodity prices were retrieved from Yahoo Finance, and economic indicators were collected from the US Federal Reserve. The dataset was normalised to minute intervals, and the stock prices were adjusted to account for stock splits.

    This dataset was used to explore the application of reinforcement learning in stock market trading. After creating the dataset, it was used in a reinforcement learning environment to train several reinforcement learning algorithms, including deep Q-learning, policy networks, policy networks with baselines, actor-critic methods, and time-series incorporation. The performance of these algorithms was then compared based on profit made and other financial evaluation metrics.

    The attached 'README.txt' contains methodological information and a glossary of all the variables in the .csv file.
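    A hedged sketch of the retrieval and normalisation steps described above; the description only states that a Python finance API was used, so the package (yfinance), tickers, date range, and min-max normalisation below are illustrative assumptions rather than the authors' exact pipeline:

    import yfinance as yf

    # Download recent minute-interval closing prices for the two stocks named above.
    prices = yf.download(["NVDA", "AMD"], period="5d", interval="1m")["Close"].dropna()

    # Min-max normalise each column to [0, 1] (one common normalisation choice).
    normalised = (prices - prices.min()) / (prices.max() - prices.min())
    print(normalised.head())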

  6. Text Script Analytics Code for Automatic Video Generation

    • data.mendeley.com
    Updated Aug 22, 2025
    + more versions
    Cite
    gaganpreet gagan (2025). Text Script Analytics Code for Automatic Video Generation [Dataset]. http://doi.org/10.17632/kgngzzs5c8.5
    Explore at:
    Dataset updated
    Aug 22, 2025
    Authors
    gaganpreet gagan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Python notebook (research work) provides a comprehensive solution for text analysis and hint extraction, useful for building computational scenes from input text.

    It includes a collection of functions that can be used to preprocess textual data, extract information such as characters, relationships, emotions, dates, times, addresses, locations, purposes, and hints from the text.

    Key Features:

    • Preprocessing Collected Data: The notebook offers preprocessing capabilities to remove unwanted strings, normalize text data, and prepare it for further analysis.
    • Character Extraction: Functions to extract characters from the text, count the number of characters, and determine the number of male and female characters.
    • Relationship Extraction: Functions to calculate possible relationships among characters and extract the relationship names.
    • Dominant Emotion Extraction: A function to extract the dominant emotion from the text.
    • Date and Time Extraction: Functions to extract dates and times from the text, including handling phrases like "before," "after," "in the morning," and "in the evening."
    • Address and Location Extraction: Functions to extract addresses and locations from the text, including identifying specific places like offices, homes, rooms, or bathrooms.
    • Purpose Extraction: Functions to extract the purpose of the text.
    • Hint Collection: The ability to collect hints from the text based on specific keywords or phrases.
    • Sample Implementations: Sample Python code for each function, demonstrating how to use it effectively.

    This notebook serves as a valuable resource for text analysis tasks, assisting in extracting essential information and hints from textual data. It can be used in various domains such as natural language processing, sentiment analysis, and information retrieval. The code is well documented and can be easily integrated into existing projects or workflows.
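    A small illustration of the kind of extraction described above; the regular expression and function name are illustrative and not the notebook's actual API:

    import re

    def extract_times(text):
        # Find simple clock times such as "5 pm" or "10:30 am".
        pattern = r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b"
        return re.findall(pattern, text, flags=re.IGNORECASE)

    print(extract_times("They met at 10:30 am at the office and left before 5 pm."))
    # ['10:30 am', '5 pm']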

  7. EdX, Coursera, and Udemy Course Data

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Karar Haitham (2025). EdX, Coursera, and Udemy Course Data [Dataset]. https://www.kaggle.com/datasets/kararhaitham/courses
    Explore at:
    Available download formats: zip (49594908 bytes)
    Dataset updated
    Apr 11, 2025
    Authors
    Karar Haitham
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains course information and metadata scraped from two popular MOOC platforms: EdX and Coursera (Udemy will be added soon, although the script to scrape it is available on my GitHub: https://github.com/karar-git/Quamus). It includes various course attributes such as descriptions, program details, and other relevant data, making it useful for building recommendation systems, educational tools, or performing analysis on online learning content.

    The dataset includes the following files:

    combine_preprocessing.py: A Python script that preprocesses and combines the raw data into a unified format. You must run this script to generate the processed dataset.
    
    combined_dataset.json: The final, preprocessed, and combined dataset of course metadata from both EdX and Coursera.
    
    edx_courses.json: Raw course data scraped from the EdX platform.
    
    edx_degree_programs.json: Data on degree programs available on EdX.
    
    edx_executive_education_paidstuff.json: Paid course and executive education data from EdX.
    
    edx_programs.json: Data on various programs available on EdX.
    
    processed_coursera_data.json: Processed course data scraped from Coursera.
    

    How to Use:

    To generate the final combined_dataset.json, you need to run the combine_preprocessing.py script. This script will process the raw data files, then clean, normalize, and combine them into one unified dataset.

    Disclaimer:

    Data Source: The data in this dataset was scraped from publicly available information on EdX and Coursera. The scraping was done solely for educational and research purposes. The scraping process adheres to the terms of use of the respective platforms.
    
    Usage: This dataset is intended for non-commercial use only. Please use responsibly and adhere to the terms and conditions of the platforms from which the data was collected.
    
    No Warranty: This data is provided "as-is" without any warranty. Users are responsible for ensuring that their use of the data complies with the relevant platform policies.
    

    Models:

    In addition to the data, there are machine learning models related to this dataset available on GitHub. These models can help with content-based course recommendations and are built using the data you will find here. Specifically, the models include:

    A cosine similarity-based model for course recommendations.
    
    A two-tower model for personalized recommendations, trained using pseudo-labels.
    
    A transformer-based course predictor (work in progress) designed to suggest the next course based on a user's learning progression.
    

    Note:

    The dataset currently contains data only from EdX and Coursera. The script to scrape Udemy data can be found in the related GitHub repository.
    

    You will need to access the GitHub repository to view and experiment with the models and the Udemy scraping script. The models require the data files from this dataset to work properly.

    By uploading this dataset to Kaggle, you can explore these educational resources and leverage them for building custom educational tools or analyzing online course trends.

  8. Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals...

    • plos.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    François Serra; Davide Baù; Mike Goodstadt; David Castillo; Guillaume J. Filion; Marc A. Marti-Renom (2023). Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors [Dataset]. http://doi.org/10.1371/journal.pcbi.1005665
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    François Serra; Davide Baù; Mike Goodstadt; David Castillo; Guillaume J. Filion; Marc A. Marti-Renom
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The sequence of a genome is insufficient to understand all genomic processes carried out in the cell nucleus. To achieve this, the knowledge of its three-dimensional architecture is necessary. Advances in genomic technologies and the development of new analytical methods, such as Chromosome Conformation Capture (3C) and its derivatives, provide unprecedented insights in the spatial organization of genomes. Here we present TADbit, a computational framework to analyze and model the chromatin fiber in three dimensions. Our package takes as input the sequencing reads of 3C-based experiments and performs the following main tasks: (i) pre-process the reads, (ii) map the reads to a reference genome, (iii) filter and normalize the interaction data, (iv) analyze the resulting interaction matrices, (v) build 3D models of selected genomic domains, and (vi) analyze the resulting models to characterize their structural properties. To illustrate the use of TADbit, we automatically modeled 50 genomic domains from the fly genome revealing differential structural features of the previously defined chromatin colors, establishing a link between the conformation of the genome and the local chromatin composition. TADbit provides three-dimensional models built from 3C-based experiments, which are ready for visualization and for characterizing their relation to gene expression and epigenetic states. TADbit is an open-source Python library available for download from https://github.com/3DGenomes/tadbit.

  9. Koei Tecmo games

    • kaggle.com
    zip
    Updated Jun 20, 2025
    Cite
    Alonso Villa Rivera (2025). Koei Tecmo games [Dataset]. https://www.kaggle.com/datasets/alonsovillarivera/koei-tecmo-games
    Explore at:
    Available download formats: zip (35342 bytes)
    Dataset updated
    Jun 20, 2025
    Authors
    Alonso Villa Rivera
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains a curated list of video games developed or published by Tecmo Koei, compiled via web scraping from Wikipedia. Tecmo and Koei, two prominent Japanese video game companies, merged in 2009, creating a legacy of titles across a wide range of platforms and genres.

    Purpose

    The main purpose of this dataset is to serve as a learning tool for practicing data cleaning, standardization, and exploratory analysis. Real-world data, even when sourced from structured platforms like Wikipedia, often comes with inconsistencies, missing values, and formatting issues. This dataset offers a realistic example of how to:

    • Clean textual data (e.g., standardizing genres and platforms).
    • Handle missing or inconsistent entries.
    • Normalize categorical values.
    • Prepare scraped data for machine learning or visualization tasks.
    • It's especially useful for students, junior data analysts, or anyone learning to work with messy data in Python using tools like pandas, NumPy, and regex.

    New in This Version: Enhanced Data and Platform Explosion

    This updated version of the dataset includes expanded coverage and a key pre-processed transformation to enhance its utility for analysis.

    Previously, the dataset provided core information about game titles, platforms, release dates, genres, developers, publishers, and descriptions. In this release, we've focused on two significant improvements:

    1. Expanded Game Coverage: We've broadened the scope of the original dataset to include a more comprehensive list of Tecmo Koei titles, ensuring a richer and more complete view of their extensive game catalog. This means you'll find even more games to analyze, providing a deeper understanding of their history and output.

    2. Pre-processed Platform Data (Exploded View): To facilitate more granular analysis, particularly for games released on multiple platforms, we've included a transformed version of the dataset where the Platforms column has been "exploded."

    Originally, the Platforms column might contain comma-separated values (e.g., "PC, PlayStation, Nintendo Switch"). This structure can be challenging for direct analysis when you want to count games per individual platform. The "exploded" dataset now presents each unique platform for a game as a separate row, duplicating the other game details. This means if "Ninja Gaiden" was released on "Xbox, PlayStation 3," it will appear as two separate rows in the exploded dataset—one for "Xbox" and one for "PlayStation 3."

    This transformation significantly simplifies tasks like:

    Counting games released on specific platforms. Analyzing platform-specific trends in genres or release dates. Creating more accurate visualizations of platform distribution. The original, untransformed data is also included, allowing users to practice the exploding technique themselves if desired.
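    A minimal pandas sketch of the exploding step described above; the column names follow the description and may need adjusting to the actual CSV headers:

    import pandas as pd

    games = pd.DataFrame({
        "Title": ["Ninja Gaiden"],
        "Platforms": ["Xbox, PlayStation 3"],
    })

    # Split the comma-separated platforms and give each platform its own row.
    games["Platforms"] = games["Platforms"].str.split(",").apply(
        lambda parts: [p.strip() for p in parts])
    exploded = games.explode("Platforms")
    print(exploded)  # two rows for "Ninja Gaiden": one for Xbox, one for PlayStation 3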

  10. Insurance_claims

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Oct 19, 2025
    Cite
    Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
    Explore at:
    Available download formats: zip (68984 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Miannotti
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AQQAD, ABDELRAHIM (2023), “insurance_claims ”, Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

    https://data.mendeley.com/datasets/992mh7dk9y/2

    Latest version: Version 2, published 22 Aug 2023, DOI: 10.17632/992mh7dk9y.2

    Data Acquisition: - Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y - Download and store the dataset locally for easy access during subsequent steps.

    Data Loading & Initial Exploration: - Use Python's Pandas library to load the dataset into a DataFrame. Code used:

    import pandas as pd

    # Load the dataset file
    insurance_df = pd.read_csv('insurance_claims.csv')

    • Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.

    Data Cleaning & Pre-processing: - Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data. - Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed. - Normalize or standardize features if necessary.

    Exploratory Data Analysis (EDA): - Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration. - Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'. - Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

    Feature Engineering & Selection: - Create or transform existing features to improve model performance. - Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.

    Modeling: - Split the dataset into training and test sets to ensure the model's generalizability. - Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn. - Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).
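    A minimal sketch of this modelling step, continuing from the insurance_df loaded earlier; it keeps only numeric predictors for brevity (categorical encoding is omitted), assumes missing values have already been handled as described above, and uses the fraud_reported target named in the EDA step. It is not the exact code from the study:

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from imblearn.over_sampling import SMOTE

    X = insurance_df.select_dtypes("number")
    y = insurance_df["fraud_reported"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Oversample the minority (fraud) class on the training split only.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X_res, y_res)
    print(classification_report(y_test, model.predict(X_test)))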

    Model Evaluation: - Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix. - Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.

    Model Interpretation: - Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

    Deployment & Prediction: - Utilize the best-performing model to make predictions on unseen data. - If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).

    Software & Tools: - Programming Language: Python (version: GoogleColab) - Libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP. - Environment: Jupyter Notebook or any Python IDE.

  11. Brain Tumor CSV

    • kaggle.com
    zip
    Updated Oct 30, 2024
    Cite
    Akash Nath (2024). Brain Tumor CSV [Dataset]. https://www.kaggle.com/datasets/akashnath29/brain-tumor-csv/code
    Explore at:
    Available download formats: zip (538175483 bytes)
    Dataset updated
    Oct 30, 2024
    Authors
    Akash Nath
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.

    Motivation and Use Cases

    Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include: - Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images. - Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection. - Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.

    Data Structure

    This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).

    CSV File Contents

    • Pixel Values: Each row contains the pixel values of a single grayscale image, flattened into a 1-dimensional array. The original image dimensions vary, and rows in the CSV will correspondingly vary in length.
    • Simplified Access: By using a CSV format, this dataset avoids the need for specialized image processing libraries and can be easily loaded into data analysis and machine learning frameworks like Pandas, Scikit-Learn, and TensorFlow.

    How to Use This Dataset

    1. Loading the Data: The CSV can be loaded using standard data analysis libraries, making it compatible with Python, R, and other platforms.
    2. Data Preprocessing: Users may normalize pixel values (e.g., to between 0 and 1) for deep learning applications (see the sketch after this list).
    3. Splitting Data: While this dataset does not predefine training and testing splits, users can separate rows into training, validation, and test sets.
    4. Reshaping for Models: If needed, each row can be reshaped to the original dimensions (retrieved from the subfolder structure) to view or process as an image.
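    A minimal loading-and-normalization sketch following steps 1 and 2 above; the CSV file name is a placeholder, and a header row, if present, should be skipped first:

    import csv
    import numpy as np

    # Rows have different lengths (original resolutions), so each row becomes
    # its own 1-D array rather than one rectangular matrix.
    images = []
    with open("brain_tumor_pixels.csv", newline="") as f:  # placeholder file name
        for row in csv.reader(f):
            pixels = np.array(row, dtype=np.float32)
            images.append(pixels / 255.0)  # scale 0..255 to 0..1

    print(len(images), images[0].shape)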

    Technical Details

    • Image Format: Grayscale MRI images, with pixel values ranging from 0 to 255.
    • Resolution: Original resolution, no resizing applied.
    • Size: Each row’s length varies according to the original dimensions of each MRI image.
    • Data Type: CSV file with integer pixel values.

    Acknowledgments

    This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.

  12. Images used for training, validation, and testing.

    • kaggle.com
    Updated Mar 15, 2024
    Cite
    Chrysthian Chrisley (2024). Images used for training, validation, and testing. [Dataset]. https://www.kaggle.com/datasets/chrysthian/images-used-for-training-validation-and-testing
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 15, 2024
    Dataset provided by
    Kaggle
    Authors
    Chrysthian Chrisley
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Imports:

    # All Imports
    import os
    from matplotlib import pyplot as plt
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    import seaborn as sns
    import matplotlib.image as mpimg
    import cv2
    import numpy as np
    import pickle
    
    # TensorFlow and Keras: layers, models, optimizers, and losses
    import tensorflow as tf
    from tensorflow import keras
    from keras import Sequential
    from keras.layers import *
    
    # Optimizer (Adamax)
    from keras.optimizers import Adamax
    
    # PreTrained Model
    from keras.applications import *
    
    #Early Stopping
    from keras.callbacks import EarlyStopping
    import warnings 
    

    Warnings Suppression | Configuration

    # Warnings Remove 
    warnings.filterwarnings("ignore")
    
    # Define the base path for the training folder
    base_path = 'jaguar_cheetah/train'
    
    # Weights file
    weights_file = 'Model_train_weights.weights.h5'
    
    # Path to the saved or to save the model:
    model_file = 'Model-cheetah_jaguar_Treined.keras'
    
    # Model history
    history_path = 'training_history_cheetah_jaguar.pkl'
    
    # Initialize lists to store file paths and labels
    filepaths = []
    labels = []
    
    # Iterate over folders and files within the training directory
    for folder in ['Cheetah', 'Jaguar']:
      folder_path = os.path.join(base_path, folder)
      for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        filepaths.append(file_path)
        labels.append(folder)
    
    # Create the TRAINING dataframe
    file_path_series = pd.Series(filepaths , name= 'filepath')
    Label_path_series = pd.Series(labels , name = 'label')
    df_train = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
    
    
    # Define the base path for the test folder
    directory = "jaguar_cheetah/test"
    
    filepath =[]
    label = []
    
    folds = os.listdir(directory)
    
    for fold in folds:
      f_path = os.path.join(directory , fold)
      
      imgs = os.listdir(f_path)
      
      for img in imgs:
        
        img_path = os.path.join(f_path , img)
        filepath.append(img_path)
        label.append(fold)
        
    # Create the TEST dataframe
    file_path_series = pd.Series(filepath , name= 'filepath')
    Label_path_series = pd.Series(label , name = 'label')
    df_test = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
    
    # Display the first rows of the dataframe for verification
    #print(df_train)
    
    # Folders with Training and Test files
    data_dir = 'jaguar_cheetah/train'
    test_dir = 'jaguar_cheetah/test'
    
    # Image size 256x256
    IMAGE_SIZE = (256,256) 
    

    Train | Test

    #print('Training Images:')
    
    # Create the TRAIN dataframe
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.1,
      subset='training',
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    #Testing Data
    #print('Validation Images:')
    validation_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir, 
      validation_split=0.1,
      subset='validation',
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    print('Testing Images:')
    test_ds = tf.keras.utils.image_dataset_from_directory(
      test_dir, 
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    # Extract labels
    train_labels = train_ds.class_names
    test_labels = test_ds.class_names
    validation_labels = validation_ds.class_names
    
    # Encode labels
    # Defining the class labels
    class_labels = ['CHEETAH', 'JAGUAR'] 
    
    # Instantiate (encoder) LabelEncoder
    label_encoder = LabelEncoder()
    
    # Fit the label encoder on the class labels
    label_encoder.fit(class_labels)
    
    # Transform the labels for the training dataset
    train_labels_encoded = label_encoder.transform(train_labels)
    
    # Transform the labels for the validation dataset
    validation_labels_encoded = label_encoder.transform(validation_labels)
    
    # Transform the labels for the testing dataset
    test_labels_encoded = label_encoder.transform(test_labels)
    
    # Normalize the pixel values
    
    # Train files 
    train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
    # Validate files
    validation_ds = validation_ds.map(lambda x, y: (x / 255.0, y))
    # Test files
    test_ds = test_ds.map(lambda x, y: (x / 255.0, y))
    
    #TRAINING VISUALIZATION
    #Count the occurrences of each category in the column
    count = df_train['label'].value_counts()
    
    # Create a figure with 2 subplots
    fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')
    
    # Plot a pie chart on the first subplot
    palette = sns.color_palette("viridis")
    sns.set_palette(palette)
    axs[0].pie(count, labels=count.index, autopct='%1.1f%%', startangle=140)
    axs[0].set_title('Distribution of Training Categories')
    
    # Plot a bar chart on the second subplot
    sns.barplot(x=count.index, y=count.values, ax=axs[1], palette="viridis")
    axs[1].set_title('Count of Training Categories')
    
    # Adjust the layout
    plt.tight_layout()
    
    # Visualize
    plt.show()
    
    # TEST VISUALIZATION
    count = df_test['label'].value_counts()
    
    # Create a figure with 2 subplots
    fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')
    
