11 datasets found
  1. Input data and some models (all except multi-model ensembles) for JAMES...

    • zenodo.org
    tar
    Updated Nov 8, 2023
    Cite
    Ryan Lagerquist (2023). Input data and some models (all except multi-model ensembles) for JAMES paper "Machine-learned uncertainty quantification is not magic" [Dataset]. http://doi.org/10.5281/zenodo.10081205
    Explore at:
    Available download formats: tar
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ryan Lagerquist
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The tar file contains two directories: data and models. Within "data," there are 4 subdirectories: "training" (the clean training data, without perturbations), "training_all_perturbed_for_uq" (the lightly perturbed training data), "validation_all_perturbed_for_uq" (the moderately perturbed validation data), and "testing_all_perturbed_for_uq" (the heavily perturbed testing data). The data in these directories are unnormalized. The subdirectories "training" and "training_all_perturbed_for_uq" each contain a normalization file. These normalization files contain the parameters used to normalize the data (from physical units to z-scores) for Experiment 1 and Experiment 2, respectively. To do the normalization, you can use the script normalize_examples.py in the code library (ml4rt) with the argument input_normalization_file_name set to one of these two file paths. The other arguments should be as follows:

    --uniformize=1

    --predictor_norm_type_string="z_score"

    --vector_target_norm_type_string=""

    --scalar_target_norm_type_string=""

    Within the directory "models," there are 6 subdirectories: for the BNN-only models trained with clean and lightly perturbed data, for the CRPS-only models trained with clean and lightly perturbed data, and for the BNN/CRPS models trained with clean and lightly perturbed data. To read the models into Python, you can use the method neural_net.read_model in the ml4rt library.
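
    The conversion from physical units to z-scores mentioned above can be illustrated with a minimal sketch. This is not the ml4rt implementation and does not reproduce the uniformization step selected by --uniformize=1; the function names and the sample values are hypothetical. It only shows the idea of fitting the parameters on training data and reusing them elsewhere.

```python
import statistics


def fit_z_score(values):
    """Return (mean, standard deviation) computed from the training values."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return mean, stdev


def to_z_scores(values, mean, stdev):
    """Convert values from physical units to z-scores using stored parameters."""
    if stdev == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mean) / stdev for v in values]


# Hypothetical temperatures (K) standing in for one predictor variable.
training_values = [280.0, 285.0, 290.0, 295.0, 300.0]
mean, stdev = fit_z_score(training_values)

# Validation data are normalized with the parameters fitted on training data.
validation_z = to_z_scores([290.0, 310.0], mean, stdev)
```

    Fitting the parameters once and storing them plays the role that the normalization files play in the dataset: the same mean and standard deviation are applied to every split.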

  2. Metabolomics Data Preprocessing PQN PCA

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
    Explore at:
    Available download formats: zip (22763 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

    The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.

    Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.

    Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.

    Includes data visualization techniques to interpret PCA results effectively.

    Suitable for metabolomics researchers and data scientists working on omics data.

    Enables better reproducibility of preprocessing workflows for metabolomics studies.

    Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.

    Provides a Python-based notebook that is easy to adapt to new datasets.

    Includes example datasets and code snippets for immediate application.

    Helps users understand the impact of normalization on downstream statistical analyses.

    Supports integration with other metabolomics pipelines or machine learning workflows.
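
    As a rough illustration of the Probabilistic Quotient Normalization step listed above, the following sketch uses plain Python. It is not the notebook's code: the function name and the toy data are hypothetical, the feature-wise median across samples is assumed as the reference spectrum, and the total-area normalization that often precedes PQN is omitted.

```python
from statistics import median


def pqn_normalize(samples):
    """Apply probabilistic quotient normalization to rows of intensities.

    samples: list of samples, each a list of positive feature intensities.
    """
    n_features = len(samples[0])
    # Reference spectrum: feature-wise median across all samples.
    reference = [median(s[j] for s in samples) for j in range(n_features)]
    normalized = []
    for sample in samples:
        # Quotients of this sample against the reference, skipping zeros.
        quotients = [x / r for x, r in zip(sample, reference) if r > 0 and x > 0]
        dilution = median(quotients)  # most probable dilution factor
        normalized.append([x / dilution for x in sample])
    return normalized


# Toy data: one metabolite profile measured at three dilutions.
profile = [10.0, 20.0, 5.0, 40.0]
samples = [[f * x for x in profile] for f in (0.5, 1.0, 2.0)]
normalized = pqn_normalize(samples)
```

    In this toy case the three rows differ only by dilution, so after normalization they all coincide with the undiluted profile.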

  3. Transformer network trained on simulated X-ray photoelectron spectroscopy...

    • researchdata.tuwien.at
    bin, csv, json, zip
    Updated Oct 17, 2025
    Cite
    Florian Simperl (2025). Transformer network trained on simulated X-ray photoelectron spectroscopy data for organic and inorganic compounds [Dataset]. http://doi.org/10.48436/eybcx-t0a02
    Explore at:
    Available download formats: csv, json, bin, zip
    Dataset updated
    Oct 17, 2025
    Dataset provided by
    TU Wien
    Authors
    Florian Simperl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This data repository provides the underlying data and neural network training scripts associated with the manuscript titled "A Transformer Network for High-Throughput Materials Characterization with X-ray Photoelectron Spectroscopy" by Simperl and Werner published in the Journal of Applied Physics (https://doi.org/10.1063/5.0296600) (2025)

    All data files are released under the Creative Commons Attribution 4.0 International (CC-BY) license, while all code files are distributed under the MIT license.

    The repository contains simulated X-ray photoelectron spectroscopy (XPS) spectra stored as hdf5 files in the zipped (h5_files.zip) folder, which was generated using the software developed by the authors. The NIST Standard Reference Database 100 – Simulation of Electron Spectra for Surface Analysis (SESSA) is freely available at https://www.nist.gov/srd/nist-standard-reference-database-100.

    The neural network architecture is implemented using the PyTorch Lightning framework and is fully available within the attached materials as Transformer_SimulatedSpectra.py contained in the python_scripts.zip.

    The trained model and the list of materials for the train, test and validation sets are contained in the models.zip folder.

    The repository contains all the data necessary to replot the figures from the manuscript. These data are available as .csv files or, for the spectra, as .h5 files. The repository also contains a Jupyter notebook (Plot_Data_Manuscript.ipynb), which is included in the python_scripts.zip file.

    Context and methodology

    The dataset and accompanying Python code files included in this repository were used to train a transformer-based neural network capable of directly inferring chemical concentrations from simulated survey X-ray photoelectron spectroscopy (XPS) spectra of bulk compounds.

    The spectral dataset provided here represents the raw output from the SESSA software (version 2.2.2), prior to the normalization procedure described in the associated manuscript. This step of normalisation is of paramount importance for the effective training of the neural network.

    The repository contains the Python scripts utilised to execute the spectral simulations and the neural network training on the Vienna Scientific Cluster (VSC5) which is part of the Austrian Scientific Computing Infrastructure (ASC). In order to obtain guidance on the proper configuration of the Command Line Interface (CLI) tools required for SESSA, users are advised to consult the official SESSA manual, which is available at the following address: https://nvlpubs.nist.gov/nistpubs/NSRDS/NIST.NSRDS.100-2024.pdf.

    To run the neural network training, we provide the requirements_nn_training.txt file, which lists all the necessary Python packages and version numbers. All other Python scripts can be run locally with the Python libraries listed in requirements_data_analysis.txt.

    Data details

    HDF5 (in zip folder): As described in the manuscript, we simulate X-ray photoelectron spectra for each of the 7,587 inorganic [1] and organic [2] materials in our dataset. To reflect realistic experimental conditions, each simulated spectrum was augmented by systematically varying parameters such as peak width, peak shift, and peak type (all configurable within the SESSA software), as well as by applying statistical Poisson noise to simulate varying signal-to-noise ratios. These modifications account for experimentally observed and material-specific spectral broadening, peak shifts, and detector-induced noise. Each material is represented by an individual HDF5 (.h5) file, named according to its chemical formula and mass density (in g/cm³). For example, the file for SiO2 with a density of 2.196 g/cm³ is named SiO2_2.196.h5. For more complex chemical formulas, such as Co(ClO4)2 with a density of 3.33 g/cm³, the file is named Co_ClO4_2_3.33.h5. Within each HDF5 file, the metadata for each spectrum is stored alongside a fixed energy axis and the corresponding intensity values. The spectral data are organized hierarchically by augmentation parameters; for Ac_10.0.h5, for example, the group path is SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0/. These files can be inspected with H5Web in Visual Studio Code, with h5py in Python, or with any other program that can read HDF5.
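
    The naming scheme above can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, and the rule that parentheses become underscores is inferred from the single Co(ClO4)2 example, so it may not hold for every formula in the dataset.

```python
import re


def h5_file_name(formula, density):
    """Build a file name such as 'SiO2_2.196.h5' from a formula and density."""
    # Assumption: parentheses become underscores, as in Co(ClO4)2 -> Co_ClO4_2.
    safe_formula = re.sub(r"[()]", "_", formula)
    return f"{safe_formula}_{density}.h5"


def spectrum_group_path(snr, width, shift, peak, material):
    """Build the in-file group path for one augmented spectrum."""
    return f"SNR_{snr}/WIDTH_{width}/SHIFT_{shift}/PEAK_{peak}/{material}/"


simple_name = h5_file_name("SiO2", 2.196)
complex_name = h5_file_name("Co(ClO4)2", 3.33)
group_path = spectrum_group_path(0, 0.3, -3.0, "gauss", "Ac_10.0")
```

    The resulting group path can be used as a key when opening a file with an HDF5 reader such as h5py.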

    Session Files: The .ses files are SESSA-specific input files that can be loaded directly into SESSA to specify input parameters for the initialization (ini), the geometry (geo) and the simulation parameters (sim_para). They are required by the Python script Simulation_Script_VSC_json.py to run the simulation on the cluster.

    Json Files: The two json files (MaterialsListVSC_gauss.json, MaterialsListVSC_lorentz.json) are used as the input files to the Python script Simulation_Script_VSC_json.py. These files contain all the material specific information for the SESSA simulation.

    csv files: The csv files are used to generate the plots from the manuscript described in the section "Plotting Scripts".

    npz files: The two .npz files (element_counts.npz, single_elements.npz) are python arrays that are needed by the Transformer_SimulatedSpectra.py script and contain the number of each single element in the dataset and an array of each single element present, respectively.

    SESSA Simulation Script

    There is one python file that sets the communication with SESSA:

    • Simulation_Script_VSC_json.py: This script is the heart of the simulation, as it controls the communication with SESSA through the CLI, using the input parameters specified in the .json and .ses files together with external functions specified in VSC_function.py.

    Technical Details

    Simulation_Script_VSC_json.py: This script uses the functions of the VSC_function.py script (which therefore needs to be placed in the same directory) and can be called with the following command:

    python3 Simulation_Script_VSC_json.py MaterialsListVSC_gauss.json 0

    It simulates the spectrum for the material at index 0 in the .json file and with the corresponding parameters specified in the .json file.

    Before running this script, the following paths must be specified:

    • sessa_path: The path to the SESSA installation.
    • folder_path: The path to the .ses session files. In this directory an output folder will be generated, to which all output files, including the simulated spectra, are written.

    To run SESSA on a computing cluster, it is important to have a working Xvfb (X virtual framebuffer) or a similar tool available, to which any graphical output from SESSA can be written.

    Neural Network Training Script

    Before running the training script, it is important to normalize the data such that the squared integral of the spectrum is 1, as described in the manuscript and implemented in normalize_spectra.py.
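
    A minimal sketch of such a normalization is given below. It is not the authors' normalize_spectra.py: the function names and sample values are hypothetical, and it assumes that "squared integral" means the integral of the squared intensity over the energy axis, evaluated with the trapezoidal rule.

```python
import math


def trapezoid(y, x):
    """Integrate y over x with the trapezoidal rule."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))


def normalize_squared_integral(intensity, energy):
    """Scale a spectrum so that the integral of intensity**2 over energy is 1."""
    norm = math.sqrt(trapezoid([v * v for v in intensity], energy))
    if norm == 0:
        raise ValueError("spectrum is identically zero")
    return [v / norm for v in intensity]


# Hypothetical five-point spectrum on a uniform energy axis.
energy = [0.0, 1.0, 2.0, 3.0, 4.0]
intensity = [1.0, 3.0, 7.0, 3.0, 1.0]
normalized = normalize_squared_integral(intensity, energy)

# Integral of the squared, normalized spectrum (should be 1 up to rounding).
check = trapezoid([v * v for v in normalized], energy)
```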

    For the neural network training we use the Transformer_SimulatedSpectra.py where the external functions used are specified in external_functions.py. This script contains the full description of the neural network architecture, the hyperparameter tuning and the Wandb logging.

    In the models.zip folder the fully trained network final_trained_model.ckpt presented in the manuscript is available as well as the list of training, validation and testing materials (test_materials_list.pt, train_materials_list.pt, val_materials_list.pt) where the corresponding spectra are extracted from the hdf5 files. The file types .ckpt and .pt can be read in by using the pytorch specific load functions in Python, e.g.

    torch.load("train_materials_list.pt")

    Technical Details

    normalize_spectra.py: To run this script properly it is important to set up a python environment with the necessary libraries specified in the requirements_data_analysis.txt file. Then it can be called with

    python3 normalize_spectra.py

    where it is important to specify the path to the .h5 files containing the unnormalized spectra.

    Transformer_SimulatedSpectra.py: To run this script properly on the cluster, set up a Python environment with the libraries specified in the requirements_nn_training.txt file. This script also relies on the files external_functions.py, single_elements.npz and element_counts.npz, which should be placed in the same directory as the Python script. These files are needed to create the datasets for training, validation and testing, and they ensure that all the single elements appear in the testing set. You can call this script (on the cluster) within a Slurm script to start the GPU training.

    python3 Transformer_SimulatedSpectra.py

    Before running this script, the following paths must be specified:

    • data_path: General path where all the data is stored
    • neural_network_data: The location where you keep your normalized hdf5 files
    • wandb_api_key: The api key to use

  4. Assessing the impact of hints in learning formal specification: Research...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 29, 2024
    Cite
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    Centro de Computação Gráfica
    INESC TEC
    Authors
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study to investigate the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, but also in the emotional response of the students. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires passed to assess the emotional response and user experience), as well as the collected data and data analysis scripts used for the discussion in the paper.

    Dataset

    The artifact contains the resources described below.

    Experiment resources

    The resources needed for replicating the experiment, namely in directory experiment:

    alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

    alloy_sheet_en.pdf: an English translation of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment.

    docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

    api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

    Experiment data

    The task database used in our application of the experiment, namely in directory data/experiment:

    Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

    identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

    Collected data

    Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files with a header row, namely in directory data/results:

    data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

    data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes a preference not to disclose).

    data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

    detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

    data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

    participants.txt: the list of participant identifiers that have registered for the experiment.

    Analysis scripts

    The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

    analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

    requirements.r: An R script to install the required libraries for the analysis script.

    normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

    normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

    Dockerfile: Docker script to automate the analysis script from the collected data.

    Setup

    To replicate the experiment and the analysis of the results, only Docker is required.

    If you wish to manually replicate the experiment and collect your own data, you'll need to install:

    A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

    If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

    Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

    R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

    Usage

    Experiment replication

    This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

    To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch, wait for the "Started your app" message to show.

    cd experiment
    docker-compose up

    This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

    In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5 mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, one for each hint group:

    Group N (no hints): http://localhost:3000/0CAN

    Group L (error locations): http://localhost:3000/CA0L

    Group E (counter-example): http://localhost:3000/350E

    Group D (error description): http://localhost:3000/27AD

    In the 2nd session, as in the 1st session, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5 mins). The permalink is constructed by prefixing the participant's identifier with P-. So participant 0CAN would just access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.
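
    The permalink rules for both sessions can be summarised in a short sketch. The helper names are hypothetical and not part of the artifact; the base URL, the P- prefix and the group suffixes are taken from the description above.

```python
BASE_URL = "http://localhost:3000"
GROUPS = {
    "N": "no hints",
    "L": "error locations",
    "E": "counter-example",
    "D": "error description",
}


def hint_group(identifier):
    """Return the treatment group encoded in the identifier's last character."""
    group = identifier[-1]
    if group not in GROUPS:
        raise ValueError(f"unknown group suffix in {identifier!r}")
    return group


def permalink(identifier, session):
    """Build the permalink for a participant in session 1 or 2."""
    if session == 1:
        return f"{BASE_URL}/{identifier}"
    if session == 2:
        return f"{BASE_URL}/P-{identifier}"
    raise ValueError("session must be 1 or 2")


first_session = permalink("0CAN", 1)
second_session = permalink("0CAN", 2)
group = hint_group("27AD")
```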

    Before the 1st session the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

    Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment with each of the depicted 14 emotions, expressed in a 5-point Likert scale.

    After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the unique user identifier and answers to the standard 4 questions on a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric (a score ranging from 0 to 100) from the answers, please see the original paper:

    Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.
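
    For orientation, the UMUX score is commonly computed as sketched below; the function name is hypothetical and the scoring rule should be checked against the paper cited above. The sketch assumes that items 1 and 3 are positively worded (scored as response minus 1), that items 2 and 4 are negatively worded (scored as 7 minus response), and that the sum is rescaled from 0-24 to 0-100.

```python
def umux_score(responses):
    """Compute a 0-100 UMUX score from four answers on a 1-7 scale."""
    if len(responses) != 4 or any(not 1 <= r <= 7 for r in responses):
        raise ValueError("expected four responses between 1 and 7")
    total = 0
    for position, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if position % 2 == 1 else (7 - response)
    return total / 24 * 100


best = umux_score([7, 1, 7, 1])
worst = umux_score([1, 7, 1, 7])
neutral = umux_score([4, 4, 4, 4])
```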

    Analysis of other applications of the experiment

    This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

    The analysis script expects data in 4 CSV files,

  5. Text Script Analytics Code for Automatic Video Generation

    • data.mendeley.com
    Updated Aug 22, 2025
    + more versions
    Cite
    gaganpreet gagan (2025). Text Script Analytics Code for Automatic Video Generation [Dataset]. http://doi.org/10.17632/kgngzzs5c8.5
    Explore at:
    Dataset updated
    Aug 22, 2025
    Authors
    gaganpreet gagan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Python notebook (research work) provides a comprehensive solution for text analysis and hint extraction that will be useful for making computational scenes using input text.

    It includes a collection of functions that can be used to preprocess textual data, extract information such as characters, relationships, emotions, dates, times, addresses, locations, purposes, and hints from the text.

    Key Features:

    Preprocessing Collected Data: The notebook offers preprocessing capabilities to remove unwanted strings, normalize text data, and prepare it for further analysis.

    Character Extraction: The notebook includes functions to extract characters from the text, count the number of characters, and determine the number of male and female characters.

    Relationship Extraction: Functions are provided to calculate possible relationships among characters and extract the relationship names.

    Dominant Emotion Extraction: The notebook includes a function to extract the dominant emotion from the text.

    Date and Time Extraction: Functions are available to extract dates and times from the text, including handling phrases like "before," "after," "in the morning," and "in the evening."

    Address and Location Extraction: The notebook provides functions to extract addresses and locations from the text, including identifying specific places like offices, homes, rooms, or bathrooms.

    Purpose Extraction: Functions are included to extract the purpose of the text.

    Hint Collection: The notebook offers the ability to collect hints from the text based on specific keywords or phrases.

    Sample Implementations: Sample Python code is provided for each function, demonstrating how to use them effectively.

    This notebook serves as a valuable resource for text analysis tasks, assisting in extracting essential information and hints from textual data. It can be used in various domains such as natural language processing, sentiment analysis, and information retrieval. The code is well-documented and can be easily integrated into existing projects or workflows.

  6. Benchmark data set for MSPypeline, a python package for streamlined mass...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    xml
    Updated Jul 22, 2021
    Cite
    Alexander Held; Ursula Klingmüller (2021). Benchmark data set for MSPypeline, a python package for streamlined mass spectrometry-based proteomics data analysis [Dataset]. https://data-staging.niaid.nih.gov/resources?id=pxd025792
    Explore at:
    Available download formats: xml
    Dataset updated
    Jul 22, 2021
    Dataset provided by
    DKFZ Heidelberg
    Division Systems Biology of Signal Transduction, German Cancer Research Center (DKFZ), Heidelberg, 69120, Germany
    Authors
    Alexander Held; Ursula Klingmüller
    Variables measured
    Proteomics
    Description

    Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and ensure comparability of results it is crucial to implement and standardize the quality control of the raw data, the data processing steps and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data including normalization and exploratory analyses by statistical inference plots. These standardized steps assess data quality, provide customizable figures and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.

  7. Insurance_claims

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Oct 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
    Explore at:
    Available download formats: zip (68984 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Miannotti
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AQQAD, ABDELRAHIM (2023), “insurance_claims ”, Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

    https://data.mendeley.com/datasets/992mh7dk9y/2

    Latest version Version 2 Published: 22 Aug 2023 DOI: 10.17632/992mh7dk9y.2

    Data Acquisition: - Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y - Download and store the dataset locally for easy access during subsequent steps.

    Data Loading & Initial Exploration: - Use Python's Pandas library to load the dataset into a DataFrame. Code used:

    import pandas as pd

    # Load the dataset file
    insurance_df = pd.read_csv('insurance_claims.csv')

    • Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.

    Data Cleaning & Pre-processing: - Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data. - Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed. - Normalize or standardize features if necessary.

    Exploratory Data Analysis (EDA): - Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration. - Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'. - Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

    Feature Engineering & Selection: - Create or transform existing features to improve model performance. - Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.

    Modeling: - Split the dataset into training and test sets to ensure the model's generalizability. - Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn. - Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).

    Model Evaluation: - Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix. - Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.
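A minimal sketch of the evaluation and tuning step is shown below, assuming a random forest and a small hypothetical parameter grid on synthetic data. The metrics are the ones named above; the grid values and the F1 scoring choice are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the claims data (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid; tuning uses cross-validation on the training set only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_

# Evaluate once on the held-out test set.
y_pred = best_model.predict(X_test)
y_score = best_model.predict_proba(X_test)[:, 1]

print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred, digits=3))
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)
print("Confusion matrix:\n", cm)
print(f"ROC-AUC: {auc:.3f}")
```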

    Model Interpretation: - Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

    Deployment & Prediction: - Utilize the best-performing model to make predictions on unseen data. - If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).
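Saving and reloading a trained model with joblib could look like the sketch below. The model, the data, and the file name are placeholders for illustration. Files written this way should only be loaded from trusted sources, because loading them can execute arbitrary code.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder model trained on synthetic data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path works.
    model_path = os.path.join(tmp_dir, "fraud_model.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", same_predictions)
```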

    Software & Tools: - Programming Language: Python (the version provided by Google Colab) - Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP. - Environment: Jupyter Notebook or any Python IDE.

  8. Brain Tumor CSV

    • kaggle.com
    zip
    Updated Oct 30, 2024
    Cite
    Akash Nath (2024). Brain Tumor CSV [Dataset]. https://www.kaggle.com/datasets/akashnath29/brain-tumor-csv/code
    Explore at:
    Available download formats: zip (538175483 bytes)
    Dataset updated
    Oct 30, 2024
    Authors
    Akash Nath
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.

    Motivation and Use Cases

    Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include: - Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images. - Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection. - Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.

    Data Structure

    This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).

    CSV File Contents

    • Pixel Values: Each row contains the pixel values of a single grayscale image, flattened into a 1-dimensional array. The original image dimensions vary, and rows in the CSV will correspondingly vary in length.
    • Simplified Access: By using a CSV format, this dataset avoids the need for specialized image processing libraries and can be easily loaded into data analysis and machine learning frameworks like Pandas, Scikit-Learn, and TensorFlow.

    How to Use This Dataset

    1. Loading the Data: The CSV can be loaded using standard data analysis libraries, making it compatible with Python, R, and other platforms.
    2. Data Preprocessing: Users may normalize pixel values (e.g., between 0 and 1) for deep learning applications.
    3. Splitting Data: While this dataset does not predefine training and testing splits, users can separate rows into training, validation, and test sets.
    4. Reshaping for Models: If needed, each row can be reshaped to the original dimensions (retrieved from the subfolder structure) to view or process as an image.
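The four steps above can be sketched as follows. Because rows vary in length, the example reads the file row by row with the standard csv module instead of assuming a rectangular table. The in-memory sample and the per-row (height, width) list are hypothetical stand-ins; in practice the dimensions would have to come from the dataset's own metadata.

```python
import csv
import io

import numpy as np

# Tiny in-memory stand-in for the real CSV (assumption): two "images"
# of different sizes, 2x3 and 2x2, each flattened into one row.
sample_csv = io.StringIO("0,64,128,255,32,16\n10,20,30,40\n")

# Hypothetical (height, width) for each row, in file order.
shapes = [(2, 3), (2, 2)]

images = []
for row, (height, width) in zip(csv.reader(sample_csv), shapes):
    pixels = np.array([int(value) for value in row], dtype=np.uint8)
    if pixels.size != height * width:
        raise ValueError(
            f"Row has {pixels.size} pixels, expected {height * width}"
        )
    # Reshape to 2-D and scale 0..255 integers to 0..1 floats.
    image = pixels.reshape(height, width).astype(np.float32) / 255.0
    images.append(image)

for index, image in enumerate(images):
    print(index, image.shape, float(image.min()), float(image.max()))
```

Keeping the images in a Python list rather than one array avoids forcing a common size; resizing or padding would be a separate, model-specific decision.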

    Technical Details

    • Image Format: Grayscale MRI images, with pixel values ranging from 0 to 255.
    • Resolution: Original resolution, no resizing applied.
    • Size: Each row’s length varies according to the original dimensions of each MRI image.
    • Data Type: CSV file with integer pixel values.

    Acknowledgments

    This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.

  9. Koei Tecmo games

    • kaggle.com
    zip
    Updated Jun 20, 2025
    Cite
    Alonso Villa Rivera (2025). Koei Tecmo games [Dataset]. https://www.kaggle.com/datasets/alonsovillarivera/koei-tecmo-games
    Explore at:
    Available download formats: zip (35342 bytes)
    Dataset updated
    Jun 20, 2025
    Authors
    Alonso Villa Rivera
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains a curated list of video games developed or published by Tecmo Koei, compiled via web scraping from Wikipedia. Tecmo and Koei, two prominent Japanese video game companies, merged in 2009, creating a legacy of titles across a wide range of platforms and genres.

    Purpose

    The main purpose of this dataset is to serve as a learning tool for practicing data cleaning, standardization, and exploratory analysis. Real-world data, even when sourced from structured platforms like Wikipedia, often comes with inconsistencies, missing values, and formatting issues. This dataset offers a realistic example of how to:

    • Clean textual data (e.g., standardizing genres and platforms).
    • Handle missing or inconsistent entries.
    • Normalize categorical values.
    • Prepare scraped data for machine learning or visualization tasks.

    It's especially useful for students, junior data analysts, or anyone learning to work with messy data in Python using tools like pandas, NumPy, and regex.

    New in This Version: Enhanced Data and Platform Explosion

    This updated version of the dataset includes expanded coverage and a key pre-processed transformation to enhance its utility for analysis.

    Previously, the dataset provided core information about game titles, platforms, release dates, genres, developers, publishers, and descriptions. In this release, we've focused on two significant improvements:

    1. Expanded Game Coverage: We've broadened the scope of the original dataset to include a more comprehensive list of Tecmo Koei titles, ensuring a richer and more complete view of their extensive game catalog. This means you'll find even more games to analyze, providing a deeper understanding of their history and output.

    2. Pre-processed Platform Data (Exploded View): To facilitate more granular analysis, particularly for games released on multiple platforms, we've included a transformed version of the dataset where the Platforms column has been "exploded."

    Originally, the Platforms column might contain comma-separated values (e.g., "PC, PlayStation, Nintendo Switch"). This structure can be challenging for direct analysis when you want to count games per individual platform. The "exploded" dataset now presents each unique platform for a game as a separate row, duplicating the other game details. This means if "Ninja Gaiden" was released on "Xbox, PlayStation 3," it will appear as two separate rows in the exploded dataset—one for "Xbox" and one for "PlayStation 3."

    This transformation significantly simplifies tasks like:

    • Counting games released on specific platforms.
    • Analyzing platform-specific trends in genres or release dates.
    • Creating more accurate visualizations of platform distribution.

    The original, untransformed data is also included, allowing users to practice the exploding technique themselves if desired.
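For readers who want to try the exploding technique on the untransformed data, a minimal pandas sketch follows. The "Platforms" column name and the "Ninja Gaiden" example come from the description; the "Title" column name and the second row are assumptions made up for illustration.

```python
import pandas as pd

# Small stand-in for the untransformed data. "Title" and the second
# row are hypothetical; "Platforms" holds comma-separated values.
games = pd.DataFrame({
    "Title": ["Ninja Gaiden", "Example Game"],
    "Platforms": ["Xbox, PlayStation 3", "PC"],
})

exploded = (
    games
    # Turn "Xbox, PlayStation 3" into ["Xbox", " PlayStation 3"].
    .assign(Platforms=games["Platforms"].str.split(","))
    # One row per list element; other columns are duplicated.
    .explode("Platforms")
    # Remove the whitespace left over from the split.
    .assign(Platforms=lambda df: df["Platforms"].str.strip())
    .reset_index(drop=True)
)

games_per_platform = exploded["Platforms"].value_counts()
print(exploded)
print(games_per_platform)
```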

  10. METAVERSE GAIT AUTHENTICATION DATASET (MGAD)

    • kaggle.com
    zip
    Updated Feb 11, 2025
    + more versions
    Cite
    bits rmit (2025). METAVERSE GAIT AUTHENTICATION DATASET (MGAD) [Dataset]. https://www.kaggle.com/bitsrmit/metaverse-gait-authentication-dataset-mgad
    Explore at:
    Available download formats: zip (380503 bytes)
    Dataset updated
    Feb 11, 2025
    Authors
    bits rmit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metaverse Gait Authentication Dataset (MGAD)

    The Metaverse Gait Authentication Dataset (MGAD) is a large-scale dataset designed for biometric authentication using gait patterns in virtual environments. It contains 5,000 simulated user records, generated using Unity 3D and processed with OpenPose & MediaPipe to extract 16 key gait-based features.

    This dataset is intended for researchers working on biometric security, AI-driven authentication, and gait analysis.

    Key Features:

    ✔ 5,000 Users – Simulated gait data from a diverse range of individuals.
    ✔ 16 Gait Features – Includes stride length, step frequency, joint angles, and ground reaction forces.
    ✔ CSV Format – Easy to integrate into AI/ML models.
    ✔ Preprocessed & Cleaned – Ready for machine learning applications.

    Potential Use Cases:

    🔹 Gait-based authentication for Metaverse security.
    🔹 Human motion analysis in healthcare & sports.
    🔹 AI-driven identity verification research.
    🔹 Feature engineering & model training for biometric systems.

    How to Use:

    Load in Python:

    import pandas as pd
    data = pd.read_csv('MGAD.csv')
    print(data.head())

    Preprocess and normalize features before training AI models.
    Train ML models (e.g., Random Forest, Autoencoders) for authentication.

    Citation: If you use MGAD in your research, please cite: Sandeep Ravikanti, "Metaverse Gait Authentication Dataset (MGAD)," 2025. DOI: https://dx.doi.org/10.21227/rvh5-8842
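The normalize-then-train steps under "How to Use" might be sketched as below. Since the actual column names in MGAD.csv are not listed here, the example generates synthetic data: the feature names echo two of the features mentioned above, while the "is_genuine" label, the value ranges, and all model settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for MGAD.csv (assumption); the real file has
# 16 gait features and its own label column.
rng = np.random.default_rng(0)
n_rows = 200
data = pd.DataFrame({
    "stride_length": rng.normal(1.3, 0.15, n_rows),
    "step_frequency": rng.normal(1.8, 0.2, n_rows),
})
# Hypothetical binary label derived from one feature for illustration.
data["is_genuine"] = (data["stride_length"] > 1.3).astype(int)

X = data.drop(columns="is_genuine")
y = data["is_genuine"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fitting the scaler inside the pipeline keeps test data out of the
# normalization statistics.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```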

  11. Images used for training, validation, and testing.

    • kaggle.com
    Updated Mar 15, 2024
    Cite
    Chrysthian Chrisley (2024). Images used for training, validation, and testing. [Dataset]. https://www.kaggle.com/datasets/chrysthian/images-used-for-training-validation-and-testing
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 15, 2024
    Dataset provided by
    Kaggle
    Authors
    Chrysthian Chrisley
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Imports:

    # All Imports
    import os
    from matplotlib import pyplot as plt
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    import seaborn as sns
    import matplotlib.image as mpimg
    import cv2
    import numpy as np
    import pickle
    
    # TensorFlow and Keras: layers, models, optimizers, and losses
    import tensorflow as tf
    from tensorflow import keras
    from keras import Sequential
    from keras.layers import *
    
    # Optimizer
    from keras.optimizers import Adamax
    
    # PreTrained Model
    from keras.applications import *
    
    #Early Stopping
    from keras.callbacks import EarlyStopping
    import warnings 
    

    Warnings Suppression | Configuration

    # Warnings Remove 
    warnings.filterwarnings("ignore")
    
    # Define the base path for the training folder
    base_path = 'jaguar_cheetah/train'
    
    # Weights file
    weights_file = 'Model_train_weights.weights.h5'
    
    # Path to the saved or to save the model:
    model_file = 'Model-cheetah_jaguar_Treined.keras'
    
    # Model history
    history_path = 'training_history_cheetah_jaguar.pkl'
    
    # Initialize lists to store file paths and labels
    filepaths = []
    labels = []
    
    # Iterate over folders and files within the training directory
    for folder in ['Cheetah', 'Jaguar']:
      folder_path = os.path.join(base_path, folder)
      for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        filepaths.append(file_path)
        labels.append(folder)
    
    # Create the TRAINING dataframe
    file_path_series = pd.Series(filepaths , name= 'filepath')
    Label_path_series = pd.Series(labels , name = 'label')
    df_train = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
    
    
    # Define the base path for the test folder
    directory = "jaguar_cheetah/test"
    
    filepath =[]
    label = []
    
    folds = os.listdir(directory)
    
    for fold in folds:
      f_path = os.path.join(directory , fold)
      
      imgs = os.listdir(f_path)
      
      for img in imgs:
        
        img_path = os.path.join(f_path , img)
        filepath.append(img_path)
        label.append(fold)
        
    # Create the TEST dataframe
    file_path_series = pd.Series(filepath , name= 'filepath')
    Label_path_series = pd.Series(label , name = 'label')
    df_test = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
    
    # Display the first rows of the dataframe for verification
    #print(df_train)
    
    # Folders with Training and Test files
    data_dir = 'jaguar_cheetah/train'
    test_dir = 'jaguar_cheetah/test'
    
    # Image size 256x256
    IMAGE_SIZE = (256,256) 
    

    Train | Test

    #print('Training Images:')
    
    # Create the TRAINING dataset (90% of the images in the training folder)
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.1,
      subset='training',
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    # Validation data (the remaining 10% of the images in the training folder)
    #print('Validation Images:')
    validation_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir, 
      validation_split=0.1,
      subset='validation',
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    print('Testing Images:')
    test_ds = tf.keras.utils.image_dataset_from_directory(
      test_dir, 
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    # Extract labels
    train_labels = train_ds.class_names
    test_labels = test_ds.class_names
    validation_labels = validation_ds.class_names
    
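    # Encode labels
    # Define the class labels; they must match the folder names returned by
    # class_names, otherwise label_encoder.transform raises a ValueError
    class_labels = ['Cheetah', 'Jaguar']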
    
    # Instantiate (encoder) LabelEncoder
    label_encoder = LabelEncoder()
    
    # Fit the label encoder on the class labels
    label_encoder.fit(class_labels)
    
    # Transform the labels for the training dataset
    train_labels_encoded = label_encoder.transform(train_labels)
    
    # Transform the labels for the validation dataset
    validation_labels_encoded = label_encoder.transform(validation_labels)
    
    # Transform the labels for the testing dataset
    test_labels_encoded = label_encoder.transform(test_labels)
    
    # Normalize the pixel values
    
    # Train files 
    train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
    # Validate files
    validation_ds = validation_ds.map(lambda x, y: (x / 255.0, y))
    # Test files
    test_ds = test_ds.map(lambda x, y: (x / 255.0, y))
    
    #TRAINING VISUALIZATION
    #Count the occurrences of each category in the column
    count = df_train['label'].value_counts()
    
    # Create a figure with 2 subplots
    fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')
    
    # Plot a pie chart on the first subplot
    palette = sns.color_palette("viridis")
    sns.set_palette(palette)
    axs[0].pie(count, labels=count.index, autopct='%1.1f%%', startangle=140)
    axs[0].set_title('Distribution of Training Categories')
    
    # Plot a bar chart on the second subplot
    sns.barplot(x=count.index, y=count.values, ax=axs[1], palette="viridis")
    axs[1].set_title('Count of Training Categories')
    
    # Adjust the layout
    plt.tight_layout()
    
    # Visualize
    plt.show()
    
    # TEST VISUALIZATION
    count = df_test['label'].value_counts()
    
    # Create a figure with 2 subplots
    fig, axs = plt.subplots(1, 2, figsize=(12, 6), facec...
    