11 datasets found

Input data and some models (all except multi-model ensembles) for JAMES...
zenodo.org
tar
Updated Nov 8, 2023
Cite
Ryan Lagerquist (2023). Input data and some models (all except multi-model ensembles) for JAMES paper "Machine-learned uncertainty quantification is not magic" [Dataset]. http://doi.org/10.5281/zenodo.10081205
Available download formats: tar
Unique identifier
https://doi.org/10.5281/zenodo.10081205
Dataset updated
Nov 8, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Ryan Lagerquist
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The tar file contains two directories: data and models. Within "data," there are 4 subdirectories: "training" (the clean training data, without perturbations), "training_all_perturbed_for_uq" (the lightly perturbed training data), "validation_all_perturbed_for_uq" (the moderately perturbed validation data), and "testing_all_perturbed_for_uq" (the heavily perturbed testing data). The data in these directories are unnormalized. The subdirectories "training" and "training_all_perturbed_for_uq" each contain a normalization file. These normalization files contain parameters used to normalize the data (from physical units to z-scores) for Experiment 1 and Experiment 2, respectively. To do the normalization, you can use the script normalize_examples.py in the code library (ml4rt) with the argument input_normalization_file_name set to one of these two file paths. The other arguments should be as follows:
--uniformize=1
--predictor_norm_type_string="z_score"
--vector_target_norm_type_string=""
--scalar_target_norm_type_string=""
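For orientation, the z-score step itself can be sketched in a few lines of plain Python. This is only an illustration of the idea (parameters are estimated on the training data and then reused for the other splits). It is not the ml4rt implementation, it ignores whatever the --uniformize=1 option does, and the function names fit_z_score_params and apply_z_score are hypothetical.

```python
from statistics import fmean, pstdev


def fit_z_score_params(training_values):
    """Estimate the mean and standard deviation from training data only."""
    mean = fmean(training_values)
    stdev = pstdev(training_values)
    if stdev == 0:
        raise ValueError("cannot z-score a constant predictor")
    return mean, stdev


def apply_z_score(values, mean, stdev):
    """Convert values from physical units to z-scores with fixed parameters."""
    return [(value - mean) / stdev for value in values]


# Toy predictor values (made up for illustration).
training = [1.0, 2.0, 3.0, 4.0, 5.0]
validation = [3.0, 6.0]

mean, stdev = fit_z_score_params(training)
training_z = apply_z_score(training, mean, stdev)
# The validation split reuses the training parameters, never its own.
validation_z = apply_z_score(validation, mean, stdev)
```

Keeping the parameters in one place mirrors the role of the normalization file described above: the same stored values are applied to every split.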

Within the directory "models," there are 6 subdirectories: for the BNN-only models trained with clean and lightly perturbed data, for the CRPS-only models trained with clean and lightly perturbed data, and for the BNN/CRPS models trained with clean and lightly perturbed data. To read the models into Python, you can use the method neural_net.read_model in the ml4rt library.
Metabolomics Data Preprocessing PQN PCA
kaggle.com
zip
Updated Nov 29, 2025
Cite
Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
Available download formats: zip (22763 bytes)
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.

Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.

Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.

Includes data visualization techniques to interpret PCA results effectively.

Suitable for metabolomics researchers and data scientists working on omics data.

Enables better reproducibility of preprocessing workflows for metabolomics studies.

Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.

Provides a Python-based notebook that is easy to adapt to new datasets.

Includes example datasets and code snippets for immediate application.

Helps users understand the impact of normalization on downstream statistical analyses.

Supports integration with other metabolomics pipelines or machine learning workflows.
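As a rough illustration of the PQN step mentioned above, the following sketch uses only the Python standard library. It assumes a samples-by-features table of non-negative intensities, uses the feature-wise median across samples as the reference spectrum, and omits the total-intensity normalization that is often applied first. The function name pqn_normalize is hypothetical and this is not the code shipped with the dataset.

```python
from statistics import median


def pqn_normalize(samples):
    """Probabilistic Quotient Normalization of a samples-by-features table."""
    n_features = len(samples[0])
    # Reference spectrum: median intensity of each feature across samples.
    reference = [median(row[j] for row in samples) for j in range(n_features)]
    normalized = []
    for row in samples:
        # Quotients against the reference, skipping features with no signal.
        quotients = [row[j] / reference[j]
                     for j in range(n_features) if reference[j] > 0]
        # The median quotient is taken as the sample's dilution factor.
        factor = median(quotients)
        if factor <= 0:
            raise ValueError("non-positive dilution factor")
        normalized.append([value / factor for value in row])
    return normalized


# Three toy samples that differ only by dilution (made-up numbers).
table = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.5, 1.0, 1.5, 2.0],
]
corrected = pqn_normalize(table)
```

In this toy case the three rows become identical after correction, which is the dilution effect PQN is meant to remove.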
Transformer network trained on simulated X-ray photoelectron spectroscopy...
researchdata.tuwien.at
bin, csv, json, zip
Updated Oct 17, 2025
Cite
Florian Simperl (2025). Transformer network trained on simulated X-ray photoelectron spectroscopy data for organic and inorganic compounds [Dataset]. http://doi.org/10.48436/eybcx-t0a02
Available download formats: csv, json, bin, zip
Unique identifier
https://doi.org/10.48436/eybcx-t0a02
Dataset updated
Oct 17, 2025
Dataset provided by
TU Wien
Authors
Florian Simperl
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

This data repository provides the underlying data and neural network training scripts associated with the manuscript titled "A Transformer Network for High-Throughput Materials Characterization with X-ray Photoelectron Spectroscopy" by Simperl and Werner, published in the Journal of Applied Physics (2025), https://doi.org/10.1063/5.0296600.

All data files are released under the Creative Commons Attribution 4.0 International (CC-BY) license, while all code files are distributed under the MIT license.

The repository contains simulated X-ray photoelectron spectroscopy (XPS) spectra stored as HDF5 files in the zipped folder h5_files.zip, which were generated using the software developed by the authors. The NIST Standard Reference Database 100 – Simulation of Electron Spectra for Surface Analysis (SESSA) is freely available at https://www.nist.gov/srd/nist-standard-reference-database-100.

The neural network architecture is implemented using the PyTorch Lightning framework and is fully available within the attached materials as Transformer_SimulatedSpectra.py contained in the python_scripts.zip.

The trained model and the list of materials for the train, test and validation sets are contained in the models.zip folder.

The repository contains all the data necessary to replot the figures from the manuscript. These data are available as .csv files, or as .h5 files for the spectra. In addition, the repository contains a Jupyter notebook (Plot_Data_Manuscript.ipynb), which is included in the python_scripts.zip file.

Context and methodology

The dataset and accompanying Python code files included in this repository were used to train a transformer-based neural network capable of directly inferring chemical concentrations from simulated survey X-ray photoelectron spectroscopy (XPS) spectra of bulk compounds.

The spectral dataset provided here represents the raw output from the SESSA software (version 2.2.2), prior to the normalization procedure described in the associated manuscript. This normalization step is essential for effective training of the neural network.

The repository contains the Python scripts utilised to execute the spectral simulations and the neural network training on the Vienna Scientific Cluster (VSC5) which is part of the Austrian Scientific Computing Infrastructure (ASC). In order to obtain guidance on the proper configuration of the Command Line Interface (CLI) tools required for SESSA, users are advised to consult the official SESSA manual, which is available at the following address: https://nvlpubs.nist.gov/nistpubs/NSRDS/NIST.NSRDS.100-2024.pdf.

To run the neural network training, we provide the requirements_nn_training.txt file, which lists all the necessary Python packages and version numbers. All other Python scripts can be run locally with the Python libraries listed in requirements_data_analysis.txt.

Data details

HDF5 (in zip folder): As described in the manuscript, we simulate X-ray photoelectron spectra for each of the 7,587 inorganic [1] and organic [2] materials in our dataset. To reflect realistic experimental conditions, each simulated spectrum was augmented by systematically varying parameters such as peak width, peak shift, and peak type (all configurable within the SESSA software), as well as by applying statistical Poisson noise to simulate varying signal-to-noise ratios. These modifications account for experimentally observed and material-specific spectral broadening, peak shifts, and detector-induced noise. Each material is represented by an individual HDF5 (.h5) file, named according to its chemical formula and mass density (in g/cm³). For example, the file for SiO2 with a density of 2.196 g/cm³ is named SiO2_2.196.h5. For more complex chemical formulas, such as Co(ClO4)2 with a density of 3.33 g/cm³, the file is named Co_ClO4_2_3.33.h5. Within each HDF5 file, the metadata for each spectrum is stored alongside a fixed energy axis and the corresponding intensity values. The spectral data are organized hierarchically by augmentation parameters; for example, Ac_10.0.h5 contains the group path SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0/. These files can be inspected with H5Web in Visual Studio Code, with h5py in Python, or with any other program that can read HDF5 files.
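The naming scheme and group hierarchy described above can be reproduced with two small helpers. This is a sketch inferred only from the examples quoted in the description (SiO2_2.196.h5, Co_ClO4_2_3.33.h5 and the Ac_10.0 group path). The rule that parentheses become underscores is an assumption, formulas ending in a parenthesis may be handled differently in the real data, and the function names are hypothetical.

```python
def h5_file_name(formula, density):
    """Build a file name from a chemical formula and a density string.

    Assumption: parentheses in the formula are replaced by underscores,
    as suggested by Co(ClO4)2 -> Co_ClO4_2 in the dataset description.
    """
    safe_formula = formula.replace("(", "_").replace(")", "_")
    return f"{safe_formula}_{density}.h5"


def spectrum_group_path(material, snr, width, shift, peak):
    """Build the hierarchical group path for one augmented spectrum."""
    return f"SNR_{snr}/WIDTH_{width}/SHIFT_{shift}/PEAK_{peak}/{material}/"


# Densities are passed as strings so that their formatting is preserved.
simple_name = h5_file_name("SiO2", "2.196")
complex_name = h5_file_name("Co(ClO4)2", "3.33")
group_path = spectrum_group_path("Ac_10.0", "0", "0.3", "-3.0", "gauss")
```

With h5py, a path like group_path could then be used to index an opened file object, provided the file actually contains that combination of augmentation parameters.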

Session Files: The .ses files are SESSA-specific input files that can be loaded directly into SESSA to specify input parameters for the initialization (ini), the geometry (geo) and the simulation parameters (sim_para). They are required by the Python script Simulation_Script_VSC_json.py to run the simulation on the cluster.

Json Files: The two json files (MaterialsListVSC_gauss.json, MaterialsListVSC_lorentz.json) are used as the input files to the Python script Simulation_Script_VSC_json.py. These files contain all the material specific information for the SESSA simulation.

csv files: The csv files are used to generate the plots from the manuscript described in the section "Plotting Scripts".

npz files: The two .npz files (element_counts.npz, single_elements.npz) are NumPy archives needed by the Transformer_SimulatedSpectra.py script. They contain the count of each single element in the dataset and an array of each single element present, respectively.

SESSA Simulation Script

One Python file handles the communication with SESSA:

Simulation_Script_VSC_json.py: This script is the heart of the simulation, as it controls the communication with SESSA through the CLI, using the input parameters specified in the .json and .ses files together with external functions defined in VSC_function.py.

Technical Details

Simulation_Script_VSC_json.py: This script uses the functions of the VSC_function.py script (which therefore needs to be placed in the same directory as this script) and can be called with the following command:

python3 Simulation_Script_VSC_json.py MaterialsListVSC_gauss.json 0

It simulates the spectrum for the material at index 0 in the .json file and with the corresponding parameters specified in the .json file.

Before running this script, the following paths need to be specified:

sessa_path: The path to your SESSA installation.

folder_path: The path to your .ses files. In this directory an output folder will be generated where all the output files, including the simulated spectra, are written.

To run SESSA on a computing cluster it is important to have a working Xvfb (X virtual framebuffer) or a similar tool available, to which any graphical output from SESSA can be written.

Neural Network Training Script

Before running the training script it is important to normalize the data such that the squared integral of the spectrum is 1, as described in the manuscript and shown in the script normalize_spectra.py.
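One way to read this normalization is that the integral of the squared intensity over the energy axis should equal 1. The sketch below implements that reading with a trapezoidal rule in plain Python. It is an assumption about the intended definition, not a copy of normalize_spectra.py, and the function names are hypothetical.

```python
import math


def squared_integral(energies, intensities):
    """Trapezoidal integral of intensity squared over the energy axis."""
    total = 0.0
    for i in range(len(energies) - 1):
        step = energies[i + 1] - energies[i]
        total += 0.5 * step * (intensities[i] ** 2 + intensities[i + 1] ** 2)
    return total


def normalize_spectrum(energies, intensities):
    """Scale a spectrum so that its squared integral equals 1."""
    integral = squared_integral(energies, intensities)
    if integral <= 0:
        raise ValueError("spectrum has no signal to normalize")
    scale = math.sqrt(integral)
    return [value / scale for value in intensities]


# Toy spectrum on a short energy axis (made-up numbers).
energy_axis = [0.0, 1.0, 2.0]
raw_spectrum = [2.0, 2.0, 2.0]
unit_spectrum = normalize_spectrum(energy_axis, raw_spectrum)
check = squared_integral(energy_axis, unit_spectrum)
```

If the manuscript instead normalizes the square of the plain integral, only the scale factor changes; the rest of the procedure stays the same.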

For the neural network training we use the Transformer_SimulatedSpectra.py where the external functions used are specified in external_functions.py. This script contains the full description of the neural network architecture, the hyperparameter tuning and the Wandb logging.

The models.zip folder contains the fully trained network presented in the manuscript (final_trained_model.ckpt) as well as the lists of training, validation and testing materials (train_materials_list.pt, val_materials_list.pt, test_materials_list.pt), for which the corresponding spectra are extracted from the HDF5 files. The file types .ckpt and .pt can be read with the PyTorch load functions in Python, e.g.

torch.load("train_materials_list.pt")

Technical Details

normalize_spectra.py: To run this script properly it is important to set up a python environment with the necessary libraries specified in the requirements_data_analysis.txt file. Then it can be called with

python3 normalize_spectra.py

where it is important to specify the path to the .h5 files containing the unnormalized spectra.

Transformer_SimulatedSpectra.py: To run this script properly on the cluster it is important to set up a Python environment with the necessary libraries specified in the requirements_nn_training.txt file. This script also relies on the files external_functions.py, single_elements.npz and element_counts.npz, which should be placed in the same directory as the Python script. They are needed for creating the datasets for training, validation and testing, and for ensuring that all the single elements appear in the testing set. You can call this script (on the cluster) within a Slurm script to start the GPU training:

python3 Transformer_SimulatedSpectra.py

Before running this script, the following paths need to be specified:

data_path: General path where all the data is stored

neural_network_data: The location where you keep your normalized hdf5 files

wandb_api_key: The Wandb API key to use
Assessing the impact of hints in learning formal specification: Research...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jan 29, 2024
Cite
Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
Dataset updated
Jan 29, 2024
Dataset provided by
Centro de Computação Gráfica
INESC TEC
Authors
Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study to investigate the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, but also in the emotional response of the students. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires passed to assess the emotional response and user experience), as well as the collected data and data analysis scripts used for the discussion in the paper.

Dataset

The artifact contains the resources described below.

Experiment resources

The resources needed for replicating the experiment, namely in directory experiment:

alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

alloy_sheet_en.pdf: an English translation of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment.

docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

Experiment data

The task database used in our application of the experiment, namely in directory data/experiment:

Model.json, Instance.json, and Link.json: JSON files to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

Collected data

Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files with a header row, namely in directory data/results:

data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:

participant identification: participant's unique identifier (ID);

socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20; NA denotes a preference not to disclose).

data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

participant identification: participant's unique identifier (ID);

user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

participants.txt: the list of participant identifiers that have registered for the experiment.

Analysis scripts

The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

requirements.r: An R script to install the required libraries for the analysis script.

normalize_task.py: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

normalize_emo.py: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

Dockerfile: Docker script to automate the analysis script from the collected data.

Setup

To replicate the experiment and the analysis of the results, only Docker is required.

If you wish to manually replicate the experiment and collect your own data, you'll need to install:

A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

Usage

Experiment replication

This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

To launch the Alloy4Fun platform populated with tasks for each session, run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch; wait for the "Started your app" message to show.

cd experiment
docker-compose up

This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5 minutes). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, one for each hint group:

Group N (no hints): http://localhost:3000/0CAN

Group L (error locations): http://localhost:3000/CA0L

Group E (counter-example): http://localhost:3000/350E

Group D (error description): http://localhost:3000/27AD

In the 2nd session, as in the 1st session, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5 minutes). The permalink is constructed by prefixing the participant's identifier with P-, so participant 0CAN would access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints, so the permalinks from different groups are undifferentiated.
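The permalink rules for both sessions are simple enough to capture in a short helper, for example when preparing a handout for participants. The sketch below follows the rules stated above (group from the last character of the identifier, P- prefix in the 2nd session). The function names and the BASE_URL constant are hypothetical.

```python
BASE_URL = "http://localhost:3000"

# Treatment group encoded by the last character of the identifier.
HINT_GROUPS = {
    "N": "no hints",
    "L": "error locations",
    "E": "counter-example",
    "D": "error description",
}


def hint_group(identifier):
    """Return the treatment group of a participant identifier."""
    return HINT_GROUPS[identifier[-1]]


def session_permalink(identifier, session):
    """Return the permalink a participant uses in session 1 or 2."""
    if session == 1:
        return f"{BASE_URL}/{identifier}"
    if session == 2:
        return f"{BASE_URL}/P-{identifier}"
    raise ValueError("session must be 1 or 2")


first_link = session_permalink("0CAN", 1)
second_link = session_permalink("0CAN", 2)
group_of_ca0l = hint_group("CA0L")
```

Looping over the identifiers in identifiers.txt with these helpers would give one pair of links per participant.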

Before the 1st session the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment with each of the depicted 14 emotions, expressed on a 5-point Likert scale.

After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the unique user identifier and answers to the standard 4 questions on a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric (a score ranging from 0 to 100) from the answers, please see the original paper:

Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.
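As commonly summarised, the UMUX scoring rule treats the odd-numbered items as positively worded (contributing answer minus 1) and the even-numbered items as negatively worded (contributing 7 minus answer), then divides the sum by 24 and scales it to 100. The sketch below implements that summary. The function name umux_score is hypothetical and the item order is assumed to be the standard one, so the paper should be checked before relying on it.

```python
def umux_score(answers):
    """Compute a 0-100 UMUX score from four answers on a 1-7 scale."""
    if len(answers) != 4:
        raise ValueError("UMUX expects exactly 4 answers")
    total = 0
    for item_number, answer in enumerate(answers, start=1):
        if not 1 <= answer <= 7:
            raise ValueError("answers must be between 1 and 7")
        if item_number % 2 == 1:
            # Odd items are positively worded: higher agreement is better.
            total += answer - 1
        else:
            # Even items are negatively worded: lower agreement is better.
            total += 7 - answer
    # Each item contributes 0 to 6 points, so the maximum sum is 24.
    return total / 24 * 100


best_case = umux_score([7, 1, 7, 1])
neutral_case = umux_score([4, 4, 4, 4])
worst_case = umux_score([1, 7, 1, 7])
```

Scores computed this way correspond to the UMUX1 and UMUX2 columns only if the study used the same item order and scale direction.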

Analysis of other applications of the experiment

This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

The analysis script expects data in 4 CSV files,
Text Script Analytics Code for Automatic Video Generation
data.mendeley.com
Updated Aug 22, 2025
+ more versions
Cite
gaganpreet gagan (2025). Text Script Analytics Code for Automatic Video Generation [Dataset]. http://doi.org/10.17632/kgngzzs5c8.5
Unique identifier
https://doi.org/10.17632/kgngzzs5c8.5
Dataset updated
Aug 22, 2025
Authors
gaganpreet gagan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Python notebook (research work) provides a comprehensive solution for text analysis and hint extraction, useful for building computational scenes from input text.

It includes a collection of functions that can be used to preprocess textual data, extract information such as characters, relationships, emotions, dates, times, addresses, locations, purposes, and hints from the text.

Key Features:

Preprocessing Collected Data: The notebook offers preprocessing capabilities to remove unwanted strings, normalize text data, and prepare it for further analysis.

Character Extraction: The notebook includes functions to extract characters from the text, count the number of characters, and determine the number of male and female characters.

Relationship Extraction: Functions are provided to calculate possible relationships among characters and extract the relationship names.

Dominant Emotion Extraction: The notebook includes a function to extract the dominant emotion from the text.

Date and Time Extraction: Functions are available to extract dates and times from the text, including handling phrases like "before," "after," "in the morning," and "in the evening."

Address and Location Extraction: The notebook provides functions to extract addresses and locations from the text, including identifying specific places like offices, homes, rooms, or bathrooms.

Purpose Extraction: Functions are included to extract the purpose of the text.

Hint Collection: The notebook offers the ability to collect hints from the text based on specific keywords or phrases.

Sample Implementations: Sample Python code is provided for each function, demonstrating how to use them effectively.

This notebook serves as a valuable resource for text analysis tasks, assisting in extracting essential information and hints from textual data. It can be used in various domains such as natural language processing, sentiment analysis, and information retrieval. The code is well-documented and can be easily integrated into existing projects or workflows.
Benchmark data set for MSPypeline, a python package for streamlined mass...
data-staging.niaid.nih.gov
data.niaid.nih.gov
xml
Updated Jul 22, 2021
Cite
Alexander Held; Ursula Klingmüller (2021). Benchmark data set for MSPypeline, a python package for streamlined mass spectrometry-based proteomics data analysis [Dataset]. https://data-staging.niaid.nih.gov/resources?id=pxd025792
Available download formats: xml
Dataset updated
Jul 22, 2021
Dataset provided by
DKFZ Heidelberg
Division Systems Biology of Signal Transduction, German Cancer Research Center (DKFZ), Heidelberg, 69120, Germany
Authors
Alexander Held; Ursula Klingmüller
Variables measured
Proteomics
Description
Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and ensure comparability of results it is crucial to implement and standardize the quality control of the raw data, the data processing steps and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data including normalization and exploratory analyses by statistical inference plots. These standardized steps assess data quality, provide customizable figures and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.
Insurance_claims
kaggle.com
data.mendeley.com
zip
Updated Oct 19, 2025
Cite
Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
Available download formats: zip (68984 bytes)
Dataset updated
Oct 19, 2025
Authors
Miannotti
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
AQQAD, ABDELRAHIM (2023), "insurance_claims", Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

https://data.mendeley.com/datasets/992mh7dk9y/2

Latest version Version 2 Published: 22 Aug 2023 DOI: 10.17632/992mh7dk9y.2

Data Acquisition:
- Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y
- Download and store the dataset locally for easy access during subsequent steps.

Data Loading & Initial Exploration:
- Use Python's Pandas library to load the dataset into a DataFrame. Python code used:

# Load the dataset file
import pandas as pd
insurance_df = pd.read_csv('insurance_claims.csv')

- Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.

Data Cleaning & Pre-processing:
- Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data.
- Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed.
- Normalize or standardize features if necessary.

Exploratory Data Analysis (EDA):
- Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration.
- Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'.
- Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

Feature Engineering & Selection:
- Create or transform existing features to improve model performance.
- Use techniques like Recursive Feature Elimination with cross-validation (RFECV) to identify and retain only the most informative features.

Modeling:
- Split the dataset into training and test sets to ensure the model's generalizability.
- Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn.
- Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).

Model Evaluation:
- Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix.
- Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.

Model Interpretation:
- Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

Deployment & Prediction:
- Utilize the best-performing model to make predictions on unseen data.
- If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).

Software & Tools:
- Programming Language: Python (version: GoogleColab)
- Libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP.
- Environment: Jupyter Notebook or any Python IDE.
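To make the Model Evaluation step more concrete, the sketch below computes the confusion matrix, precision, recall and F1-score for a binary target in plain Python. The labels are made up, the function name is hypothetical, and in practice the Scikit-learn metrics listed above would normally be used instead.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and derived metrics for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 1 marks a fraudulent claim (made-up data).
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
metrics = binary_classification_metrics(actual, predicted)
```

Because fraudulent claims are usually the minority class, precision and recall on the positive class tend to be more informative than plain accuracy.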
Brain Tumor CSV
kaggle.com
zip
Updated Oct 30, 2024
Cite
Akash Nath (2024). Brain Tumor CSV [Dataset]. https://www.kaggle.com/datasets/akashnath29/brain-tumor-csv/code
Explore at:
Available download formats: zip (538175483 bytes)
Dataset updated
Oct 30, 2024
Authors
Akash Nath
License
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.

Motivation and Use Cases

Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include:
- Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images.
- Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection.
- Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.

Data Structure

This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).

CSV File Contents

Pixel Values: Each row contains the pixel values of a single grayscale image, flattened into a 1-dimensional array. The original image dimensions vary, and rows in the CSV will correspondingly vary in length.

Simplified Access: By using a CSV format, this dataset avoids the need for specialized image processing libraries and can be easily loaded into data analysis and machine learning frameworks like Pandas, Scikit-Learn, and TensorFlow.

How to Use This Dataset

Loading the Data: The CSV can be loaded using standard data analysis libraries, making it compatible with Python, R, and other platforms.

Data Preprocessing: Users may normalize pixel values (e.g., between 0 and 1) for deep learning applications.

Splitting Data: While this dataset does not predefine training and testing splits, users can separate rows into training, validation, and test sets.

Reshaping for Models: If needed, each row can be reshaped to the original dimensions (retrieved from the subfolder structure) to view or process as an image.
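The loading, normalization, and reshaping steps described above can be sketched as follows. Because rows vary in length, this sketch reads the file row by row with Python's csv module rather than assuming a rectangular table. The inline sample data, the absence of a header or label column, and the image dimensions are all assumptions made for illustration; the real dimensions would have to come from the dataset itself.

```python
import csv
import io

# Hypothetical stand-in for the real CSV: two flattened grayscale images
# of different sizes (4 pixels and 6 pixels), with no header row.
sample_csv = "0,128,255,64\n10,20,30,40,50,60\n"


def load_rows(handle):
    """Read each CSV row as a list of ints; rows may differ in length."""
    return [[int(value) for value in row] for row in csv.reader(handle) if row]


def normalize(pixels):
    """Scale 0-255 integer pixel values to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]


def reshape(pixels, height, width):
    """Turn a flat pixel list into a list of `height` rows of `width` pixels."""
    if height * width != len(pixels):
        raise ValueError(
            f"cannot reshape {len(pixels)} pixels into {height}x{width}"
        )
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


rows = load_rows(io.StringIO(sample_csv))

# Hypothetical (height, width) pairs, one per row, in file order.
dims = [(2, 2), (2, 3)]

images = [reshape(normalize(row), h, w) for row, (h, w) in zip(rows, dims)]
for image in images:
    print(len(image), "x", len(image[0]))
```

For the real file, `io.StringIO(sample_csv)` would be replaced by an opened file handle, for example `open(path, newline="")`.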

Technical Details

Image Format: Grayscale MRI images, with pixel values ranging from 0 to 255.

Resolution: Original resolution, no resizing applied.

Size: Each row’s length varies according to the original dimensions of each MRI image.

Data Type: CSV file with integer pixel values.

Acknowledgments

This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.
Koei Tecmo games
kaggle.com
zip
Updated Jun 20, 2025
Cite
Alonso Villa Rivera (2025). Koei Tecmo games [Dataset]. https://www.kaggle.com/datasets/alonsovillarivera/koei-tecmo-games
Explore at:
Available download formats: zip (35342 bytes)
Dataset updated
Jun 20, 2025
Authors
Alonso Villa Rivera
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset contains a curated list of video games developed or published by Tecmo Koei, compiled via web scraping from Wikipedia. Tecmo and Koei, two prominent Japanese video game companies, merged in 2009, creating a legacy of titles across a wide range of platforms and genres.

Purpose

The main purpose of this dataset is to serve as a learning tool for practicing data cleaning, standardization, and exploratory analysis. Real-world data, even when sourced from structured platforms like Wikipedia, often comes with inconsistencies, missing values, and formatting issues. This dataset offers a realistic example of how to:

Clean textual data (e.g., standardizing genres and platforms).

Handle missing or inconsistent entries.

Normalize categorical values.

Prepare scraped data for machine learning or visualization tasks.

It's especially useful for students, junior data analysts, or anyone learning to work with messy data in Python using tools like pandas, numpy, and regex.

New in This Version: Enhanced Data and Platform Explosion

This updated version of the dataset includes expanded coverage and a key pre-processed transformation to enhance its utility for analysis.

Previously, the dataset provided core information about game titles, platforms, release dates, genres, developers, publishers, and descriptions. In this release, we've focused on two significant improvements:

Expanded Game Coverage: We've broadened the scope of the original dataset to include a more comprehensive list of Tecmo Koei titles, ensuring a richer and more complete view of their extensive game catalog. This means you'll find even more games to analyze, providing a deeper understanding of their history and output.

Pre-processed Platform Data (Exploded View): To facilitate more granular analysis, particularly for games released on multiple platforms, we've included a transformed version of the dataset where the Platforms column has been "exploded."

Originally, the Platforms column might contain comma-separated values (e.g., "PC, PlayStation, Nintendo Switch"). This structure can be challenging for direct analysis when you want to count games per individual platform. The "exploded" dataset now presents each unique platform for a game as a separate row, duplicating the other game details. This means if "Ninja Gaiden" was released on "Xbox, PlayStation 3," it will appear as two separate rows in the exploded dataset—one for "Xbox" and one for "PlayStation 3."

This transformation significantly simplifies tasks like:

Counting games released on specific platforms.

Analyzing platform-specific trends in genres or release dates.

Creating more accurate visualizations of platform distribution.

The original, untransformed data is also included, allowing users to practice the exploding technique themselves if desired.
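For readers who want to practice the exploding technique on the untransformed data, a minimal pandas sketch is shown below. The "Platforms" column name and the "Ninja Gaiden" row follow the description above; the "Title" column name and the second row are placeholders made up for illustration, and the real file may use different column names.

```python
import pandas as pd

# Illustrative data only: the second row and the "Title" column are assumptions.
games = pd.DataFrame(
    {
        "Title": ["Ninja Gaiden", "Example Game"],
        "Platforms": ["Xbox, PlayStation 3", "PC"],
    }
)

# Split the comma-separated string into a list, then emit one row per platform.
exploded = games.assign(Platforms=games["Platforms"].str.split(",")).explode(
    "Platforms"
)

# Remove the whitespace left over from splitting "Xbox, PlayStation 3".
exploded["Platforms"] = exploded["Platforms"].str.strip()
exploded = exploded.reset_index(drop=True)

# Count games per individual platform.
platform_counts = exploded["Platforms"].value_counts()

print(exploded)
print(platform_counts)
```

Each exploded row repeats the other game details, so per-platform counts become a plain `value_counts()` call instead of string matching on the combined column.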
METAVERSE GAIT AUTHENTICATION DATASET (MGAD)
kaggle.com
zip
Updated Feb 11, 2025
+ more versions
Cite
bits rmit (2025). METAVERSE GAIT AUTHENTICATION DATASET (MGAD) [Dataset]. https://www.kaggle.com/bitsrmit/metaverse-gait-authentication-dataset-mgad
Explore at:
Available download formats: zip (380503 bytes)
Dataset updated
Feb 11, 2025
Authors
bits rmit
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Metaverse Gait Authentication Dataset (MGAD)

The Metaverse Gait Authentication Dataset (MGAD) is a large-scale dataset designed for biometric authentication using gait patterns in virtual environments. It contains 5,000 simulated user records, generated using Unity 3D and processed with OpenPose & MediaPipe to extract 16 key gait-based features.

This dataset is ideal for biometric security, AI-driven authentication, and gait analysis researchers.

Key Features:
✔ 5,000 Users – Simulated gait data from a diverse range of individuals.
✔ 16 Gait Features – Includes stride length, step frequency, joint angles, and ground reaction forces.
✔ CSV Format – Easy to integrate into AI/ML models.
✔ Preprocessed & Cleaned – Ready for machine learning applications.

Potential Use Cases:
🔹 Gait-based authentication for Metaverse security.
🔹 Human motion analysis in healthcare & sports.
🔹 AI-driven identity verification research.
🔹 Feature engineering & model training for biometric systems.

How to Use:

Load in Python:

import pandas as pd
data = pd.read_csv('MGAD.csv')
print(data.head())

Preprocess & Normalize Features before training AI models.

Train ML models (e.g., Random Forest, Autoencoders) for authentication.

Citation: If you use MGAD in your research, please cite: Sandeep Ravikanti, "Metaverse Gait Authentication Dataset (MGAD)," 2025. DOI: https://dx.doi.org/10.21227/rvh5-8842
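The "Preprocess & Normalize Features" step in the usage notes is not spelled out, so the following is only one plausible reading: z-score standardization of numeric feature columns with pandas. The column names and values below are hypothetical stand-ins (the description mentions stride length and step frequency, but not how the CSV names them), and a small in-memory frame is used in place of MGAD.csv.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of MGAD.csv; names and values are made up.
data = pd.DataFrame(
    {
        "stride_length": [0.6, 0.7, 0.8],
        "step_frequency": [1.5, 2.0, 2.5],
    }
)

feature_cols = ["stride_length", "step_frequency"]

# Z-score standardization: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0).
means = data[feature_cols].mean()
stds = data[feature_cols].std(ddof=0)
normalized = (data[feature_cols] - means) / stds

print(normalized)
```

A column with zero variance would produce a division by zero here, so constant features would need to be dropped or handled separately. When a train/test split is used, the means and standard deviations would normally be computed on the training rows only and reused for the test rows.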

Images used for training, validation, and testing.

kaggle.com

Updated Mar 15, 2024

Cite

Chrysthian Chrisley (2024). Images used for training, validation, and testing. [Dataset]. https://www.kaggle.com/datasets/chrysthian/images-used-for-training-validation-and-testing

Explore at:

Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)

Dataset updated

Mar 15, 2024

Dataset provided by

Kaggle

Authors

Chrysthian Chrisley

License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

Imports:

# All Imports
import os
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.image as mpimg
import cv2
import numpy as np
import pickle

# TensorFlow and Keras: layers, models, optimizers, and losses
import tensorflow as tf
from tensorflow import keras
from keras import Sequential
from keras.layers import *

# Optimizer
from keras.optimizers import Adamax

# PreTrained Model
from keras.applications import *

#Early Stopping
from keras.callbacks import EarlyStopping
import warnings

Warnings Suppression | Configuration

# Warnings Remove 
warnings.filterwarnings("ignore")

# Define the base path for the training folder
base_path = 'jaguar_cheetah/train'

# Weights file
weights_file = 'Model_train_weights.weights.h5'

# Path to the saved or to save the model:
model_file = 'Model-cheetah_jaguar_Treined.keras'

# Model history
history_path = 'training_history_cheetah_jaguar.pkl'

# Initialize lists to store file paths and labels
filepaths = []
labels = []

# Iterate over folders and files within the training directory
for folder in ['Cheetah', 'Jaguar']:
  folder_path = os.path.join(base_path, folder)
  for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    filepaths.append(file_path)
    labels.append(folder)

# Create the TRAINING dataframe
file_path_series = pd.Series(filepaths , name= 'filepath')
Label_path_series = pd.Series(labels , name = 'label')
df_train = pd.concat([file_path_series ,Label_path_series ] , axis = 1)


# Define the base path for the test folder
directory = "jaguar_cheetah/test"

filepath =[]
label = []

folds = os.listdir(directory)

for fold in folds:
  f_path = os.path.join(directory , fold)
  
  imgs = os.listdir(f_path)
  
  for img in imgs:
    
    img_path = os.path.join(f_path , img)
    filepath.append(img_path)
    label.append(fold)
    
# Create the TEST dataframe
file_path_series = pd.Series(filepath , name= 'filepath')
Label_path_series = pd.Series(label , name = 'label')
df_test = pd.concat([file_path_series ,Label_path_series ] , axis = 1)

# Display the first rows of the dataframe for verification
#print(df_train)

# Folders with Training and Test files
data_dir = 'jaguar_cheetah/train'
test_dir = 'jaguar_cheetah/test'

# Image size 256x256
IMAGE_SIZE = (256,256)

Train | Test

#print('Training Images:')

# Create the training dataset (90% of the images in data_dir)
train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.1,
  subset='training',
  seed=123,
  image_size=IMAGE_SIZE,
  batch_size=32)

# Validation data (the remaining 10% of the images in data_dir)
#print('Validation Images:')
validation_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir, 
  validation_split=0.1,
  subset='validation',
  seed=123,
  image_size=IMAGE_SIZE,
  batch_size=32)

print('Testing Images:')
test_ds = tf.keras.utils.image_dataset_from_directory(
  test_dir, 
  seed=123,
  image_size=IMAGE_SIZE,
  batch_size=32)

# Extract labels
train_labels = train_ds.class_names
test_labels = test_ds.class_names
validation_labels = validation_ds.class_names

# Encode labels
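# Define the class labels. They must match the folder names returned by
# class_names; otherwise transform() raises an error for unseen labels.
class_labels = ['Cheetah', 'Jaguar']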

# Instantiate (encoder) LabelEncoder
label_encoder = LabelEncoder()

# Fit the label encoder on the class labels
label_encoder.fit(class_labels)

# Transform the labels for the training dataset
train_labels_encoded = label_encoder.transform(train_labels)

# Transform the labels for the validation dataset
validation_labels_encoded = label_encoder.transform(validation_labels)

# Transform the labels for the testing dataset
test_labels_encoded = label_encoder.transform(test_labels)

# Normalize the pixel values

# Train files 
train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
# Validate files
validation_ds = validation_ds.map(lambda x, y: (x / 255.0, y))
# Test files
test_ds = test_ds.map(lambda x, y: (x / 255.0, y))

#TRAINING VISUALIZATION
#Count the occurrences of each category in the column
count = df_train['label'].value_counts()

# Create a figure with 2 subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')

# Plot a pie chart on the first subplot
palette = sns.color_palette("viridis")
sns.set_palette(palette)
axs[0].pie(count, labels=count.index, autopct='%1.1f%%', startangle=140)
axs[0].set_title('Distribution of Training Categories')

# Plot a bar chart on the second subplot
sns.barplot(x=count.index, y=count.values, ax=axs[1], palette="viridis")
axs[1].set_title('Count of Training Categories')

# Adjust the layout
plt.tight_layout()

# Visualize
plt.show()

# TEST VISUALIZATION
count = df_test['label'].value_counts()

# Create a figure with 2 subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6), facec...
