12 datasets found
  1. Input data and some models (all except multi-model ensembles) for JAMES...

    • zenodo.org
    tar
    Updated Nov 8, 2023
    Cite
    Ryan Lagerquist (2023). Input data and some models (all except multi-model ensembles) for JAMES paper "Machine-learned uncertainty quantification is not magic" [Dataset]. http://doi.org/10.5281/zenodo.10081205
    Explore at:
    Available download formats: tar
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ryan Lagerquist
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The tar file contains two directories: data and models. Within "data," there are 4 subdirectories: "training" (the clean training data, without perturbations), "training_all_perturbed_for_uq" (the lightly perturbed training data), "validation_all_perturbed_for_uq" (the moderately perturbed validation data), and "testing_all_perturbed_for_uq" (the heavily perturbed testing data). The data in these directories are unnormalized. The subdirectories "training" and "training_all_perturbed_for_uq" each contain a normalization file. These normalization files contain the parameters used to normalize the data (from physical units to z-scores) for Experiment 1 and Experiment 2, respectively. To do the normalization, you can use the script normalize_examples.py in the code library (ml4rt) with the argument input_normalization_file_name set to one of these two file paths. The other arguments should be as follows (an example invocation is sketched after the argument list):

    --uniformize=1

    --predictor_norm_type_string="z_score"

    --vector_target_norm_type_string=""

    --scalar_target_norm_type_string=""
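    Putting these arguments together, a full invocation would look roughly like the line below. The normalization-file path is a placeholder, and the script presumably also takes arguments pointing to the example files to be normalized, which are omitted here; check the script's help text in ml4rt for the exact set of flags.

    python normalize_examples.py --input_normalization_file_name="data/training/<normalization_file>" --uniformize=1 --predictor_norm_type_string="z_score" --vector_target_norm_type_string="" --scalar_target_norm_type_string=""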

    Within the directory "models," there are 6 subdirectories: for the BNN-only models trained with clean and lightly perturbed data, for the CRPS-only models trained with clean and lightly perturbed data, and for the BNN/CRPS models trained with clean and lightly perturbed data. To read the models into Python, you can use the method neural_net.read_model in the ml4rt library.
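    For example, a model could be read with a snippet along the following lines; the import path inside ml4rt and the file path are assumptions (only the existence of neural_net.read_model is stated above), so this is a sketch rather than verified usage:

    # Hedged sketch: the module location and file layout inside ml4rt may differ.
    from ml4rt import neural_net

    model_file_name = "models/<subdirectory>/<model_file>"  # placeholder path
    model = neural_net.read_model(model_file_name)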

  2. Transformer network trained on simulated X-ray photoelectron spectroscopy...

    • researchdata.tuwien.at
    bin, csv, json, zip
    Updated Oct 17, 2025
    Cite
    Florian Simperl (2025). Transformer network trained on simulated X-ray photoelectron spectroscopy data for organic and inorganic compounds [Dataset]. http://doi.org/10.48436/eybcx-t0a02
    Explore at:
    Available download formats: csv, json, bin, zip
    Dataset updated
    Oct 17, 2025
    Dataset provided by
    TU Wien
    Authors
    Florian Simperl
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This data repository provides the underlying data and neural network training scripts associated with the manuscript "A Transformer Network for High-Throughput Materials Characterization with X-ray Photoelectron Spectroscopy" by Simperl and Werner, published in the Journal of Applied Physics (2025), https://doi.org/10.1063/5.0296600.

    All data files are released under the Creative Commons Attribution 4.0 International (CC-BY) license, while all code files are distributed under the MIT license.

    The repository contains simulated X-ray photoelectron spectroscopy (XPS) spectra, stored as HDF5 files in the zipped folder h5_files.zip, which were generated using the software developed by the authors. The NIST Standard Reference Database 100 – Simulation of Electron Spectra for Surface Analysis (SESSA) is freely available at https://www.nist.gov/srd/nist-standard-reference-database-100.

    The neural network architecture is implemented using the PyTorch Lightning framework and is fully available within the attached materials as Transformer_SimulatedSpectra.py contained in the python_scripts.zip.

    The trained model and the list of materials for the train, test and validation sets are contained in the models.zip folder.

    The repository contains all the data necessary to replot the figures from the manuscript. These data are available in the form of .csv files or .h5 files for the spectra. In addition, the repository also contains a Python script (Plot_Data_Manuscript.ipynb) which is contained in the python_scripts.zip file.

    Context and methodology

    The dataset and accompanying Python code files included in this repository were used to train a transformer-based neural network capable of directly inferring chemical concentrations from simulated survey X-ray photoelectron spectroscopy (XPS) spectra of bulk compounds.

    The spectral dataset provided here represents the raw output from the SESSA software (version 2.2.2), prior to the normalization procedure described in the associated manuscript. This normalization step is essential for effective training of the neural network.

    The repository contains the Python scripts utilised to execute the spectral simulations and the neural network training on the Vienna Scientific Cluster (VSC5) which is part of the Austrian Scientific Computing Infrastructure (ASC). In order to obtain guidance on the proper configuration of the Command Line Interface (CLI) tools required for SESSA, users are advised to consult the official SESSA manual, which is available at the following address: https://nvlpubs.nist.gov/nistpubs/NSRDS/NIST.NSRDS.100-2024.pdf.

    To run the neural network training, we provide the requirements_nn_training.txt file, which lists all the necessary Python packages and version numbers. All other Python scripts can be run locally with the Python libraries listed in requirements_data_analysis.txt.

    Data details

    HDF5 (in zip folder): As described in the manuscript, we simulate X-ray photoelectron spectra for each of the 7,587 inorganic [1] and organic [2] materials in our dataset. To reflect realistic experimental conditions, each simulated spectrum was augmented by systematically varying parameters such as peak width, peak shift, and peak type—all configurable within the SESSA software—as well as by applying statistical Poisson noise to simulate varying signal-to-noise ratios. These modifications account for experimentally observed and material-specific spectral broadening, peak shifts, and detector-induced noise. Each material is represented by an individual HDF5 (.h5) file, named according to its chemical formula and mass density (in g/cm³). For example, the file for SiO2 with a density of 2.196 g/cm³ is named SiO2_2.196.h5. For more complex chemical formulas, such as Co(ClO4)2 with a density of 3.33 g/cm³, the file is named Co_ClO4_2_3.33.h5. Within each HDF5 file, the metadata for each spectrum is stored alongside a fixed energy axis and the corresponding intensity values. The spectral data are organized hierarchically by augmentation parameters in the following directory structure, e.g. for Ac_10.0.h5 we have SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0/. These files can be easily inspected with H5Web in Visual Studio Code, with h5py in Python, or with any other HDF5-capable tool.
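    A quick way to explore one of these files is with h5py; the group path below follows the example above, while the names of the datasets holding the energy axis and intensities may differ, so the snippet simply lists whatever the file contains:

    import h5py

    # Walk the augmentation hierarchy of a single material file.
    with h5py.File("Ac_10.0.h5", "r") as f:
        f.visit(print)  # print every group/dataset path in the file
        group = f["SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0"]
        for name, item in group.items():
            print(name, item)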

    Session Files: The .ses files are SESSA-specific input files that can be directly loaded into SESSA to specify certain input parameters for the initialization (ini), the geometry (geo), and the simulation parameters (sim_para); they are required by the Python script Simulation_Script_VSC_json.py to run the simulation on the cluster.

    Json Files: The two json files (MaterialsListVSC_gauss.json, MaterialsListVSC_lorentz.json) are used as the input files to the Python script Simulation_Script_VSC_json.py. These files contain all the material specific information for the SESSA simulation.

    csv files: The csv files are used to generate the plots from the manuscript described in the section "Plotting Scripts".

    npz files: The two .npz files (element_counts.npz, single_elements.npz) are NumPy array files needed by the Transformer_SimulatedSpectra.py script; they contain the number of occurrences of each single element in the dataset and an array of each single element present, respectively.

    SESSA Simulation Script

    There is one Python file that sets up the communication with SESSA:

    • Simulation_Script_VSC_json.py: This script is the heart of the simulation, as it controls the communication with SESSA through the CLI using the input parameters specified in the .json and .ses files, together with external functions specified in VSC_function.py.

    Technical Details

    Simulation_Script_VSC_json.py: This script uses the functions of the VSC_function.py script (which therefore needs to be placed in the same directory as this script) and can be called with the following command:

    python3 Simulation_Script_VSC_json.py MaterialsListVSC_gauss.json 0

    It simulates the spectrum for the material at index 0 in the .json file and with the corresponding parameters specified in the .json file.

    Before running this script, the following paths need to be specified:

    • sessa_path: The path to the SESSA installation.
    • folder_path: The path to the .ses session files. In this directory, an output folder will be generated where all the output files, including the simulated spectra, are written.

    To run SESSA on a computing cluster, it is important to have a working Xvfb (virtual frame buffer) or a similar tool available to which any graphical output from SESSA can be written.

    Neural Network Training Script

    Before running the training script, it is important to normalize the data such that the squared integral of the spectrum is 1, as described in the manuscript and implemented in normalize_spectra.py.
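    A minimal sketch of that normalization, assuming "squared integral" means that the integral of the squared intensities over the energy axis equals one and that the energy grid is (approximately) uniform; check normalize_spectra.py for the exact convention used in the paper:

    import numpy as np

    def normalize_spectrum(energy, intensity):
        # Scale the spectrum so that the integral of intensity**2 over energy is 1.
        de = np.diff(energy).mean()  # assumes a uniform energy grid
        norm = np.sqrt(np.sum(intensity ** 2) * de)
        return intensity / norm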

    For the neural network training we use Transformer_SimulatedSpectra.py, with external helper functions defined in external_functions.py. This script contains the full description of the neural network architecture, the hyperparameter tuning, and the Wandb logging.

    The models.zip folder contains the fully trained network presented in the manuscript (final_trained_model.ckpt) as well as the lists of training, validation, and testing materials (train_materials_list.pt, val_materials_list.pt, test_materials_list.pt), from which the corresponding spectra are extracted from the HDF5 files. The .ckpt and .pt files can be read with the PyTorch-specific load functions in Python, e.g.

    torch.load('train_materials_list.pt')

    Technical Details

    normalize_spectra.py: To run this script properly it is important to set up a python environment with the necessary libraries specified in the requirements_data_analysis.txt file. Then it can be called with

    python3 normalize_spectra.py

    where it is important to specify the path to the .h5 files containing the unnormalized spectra.

    Transformer_SimulatedSpectra.py: To run this script properly on the cluster, it is important to set up a Python environment with the libraries specified in requirements_nn_training.txt. The script also relies on external_functions.py, single_elements.npz, and element_counts.npz, which should be placed in the same directory as the Python script. These files are needed to create the datasets for training, validation, and testing, and they ensure that all the single elements appear in the testing set. You can call this script (on the cluster) within a Slurm script to start the GPU training.

    python3 Transformer_SimulatedSpectra.py

    Before running this script, the following paths need to be specified:

    • data_path: General path where all the data is stored
    • neural_network_data: The location where you keep your normalized hdf5 files
    • wandb_api_key: The Weights & Biases API key to use for logging

  3. Assessing the impact of hints in learning formal specification: Research...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 29, 2024
    Cite
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    INESC TEC
    Centro de Computação Gráfica
    Authors
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study investigating the impact of different types of automated hints while learning a formal specification language, in terms of immediate performance and learning retention as well as the emotional response of the students. This research artifact provides all the material required to replicate the study (except for the proprietary questionnaires used to assess emotional response and user experience), as well as the collected data and the data analysis scripts used for the discussion in the paper.

    Dataset

    The artifact contains the resources described below.

    Experiment resources

    The resources needed for replicating the experiment, namely in directory experiment:

    alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

    alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.

    docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

    api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

    Experiment data

    The task database used in our application of the experiment, namely in directory data/experiment:

    Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

    identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

    Collected data

    Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files with a header row, namely in directory data/results:

    data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

    data_socio.csv: data collected from the socio-demographic questionnaire in the 1st session of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes preference not to disclose).

    data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

    detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

    data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

    participants.txt: the list of participant identifiers that have registered for the experiment.

    Analysis scripts

    The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

    analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

    requirements.r: An R script to install the required libraries for the analysis script.

    normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

    normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

    Dockerfile: Docker script to automate the analysis script from the collected data.

    Setup

    To replicate the experiment and the analysis of the results, only Docker is required.

    If you wish to manually replicate the experiment and collect your own data, you'll need to install:

    A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

    If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

    Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

    R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

    Usage

    Experiment replication

    This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

    To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch; wait for the "Started your app" message to show.

    cd experiment
    docker-compose up

    This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

    In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, one for each hint group:

    Group N (no hints): http://localhost:3000/0CAN

    Group L (error locations): http://localhost:3000/CA0L

    Group E (counter-example): http://localhost:3000/350E

    Group D (error description): http://localhost:3000/27AD

    In the 2nd session, as in the 1st, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-, so participant 0CAN would just access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.

    Before the 1st session, the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

    Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier and for the attachment to each of the 14 depicted emotions, expressed on a 5-point Likert scale.

    After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:

    Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.

    Analysis of other applications of the experiment

    This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

    The analysis script expects data in 4 CSV files,

  4. Metabolomics Data Preprocessing PQN PCA

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
    Explore at:
    Available download formats: zip (22763 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

    The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.

    Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.

    Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.

    Includes data visualization techniques to interpret PCA results effectively.

    Suitable for metabolomics researchers and data scientists working on omics data.

    Enables better reproducibility of preprocessing workflows for metabolomics studies.

    Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.

    Provides a Python-based notebook that is easy to adapt to new datasets.

    Includes example datasets and code snippets for immediate application.

    Helps users understand the impact of normalization on downstream statistical analyses.

    Supports integration with other metabolomics pipelines or machine learning workflows.
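    A minimal sketch of the two core steps (PQN followed by PCA); the array shapes and random toy data below are illustrative and not part of the dataset itself:

    import numpy as np
    from sklearn.decomposition import PCA

    def pqn_normalize(X):
        # Probabilistic Quotient Normalization: divide each sample (row) by the
        # median of its feature-wise quotients against a reference spectrum
        # (here the median spectrum across samples) to correct dilution effects.
        reference = np.median(X, axis=0)
        quotients = X / reference
        dilution = np.median(quotients, axis=1, keepdims=True)
        return X / dilution

    # Toy data: 20 samples x 50 metabolite features
    rng = np.random.default_rng(0)
    X = rng.lognormal(mean=1.0, sigma=0.3, size=(20, 50))
    X_norm = pqn_normalize(X)

    # PCA for exploratory analysis of the normalized data
    scores = PCA(n_components=2).fit_transform(X_norm)
    print(scores.shape)  # (20, 2)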

  5. Research data underpinning "Investigating Reinforcement Learning Approaches...

    • data.ncl.ac.uk
    application/csv
    Updated Aug 13, 2024
    Cite
    Zheng Luo (2024). Research data underpinning "Investigating Reinforcement Learning Approaches In Stock Market Trading" [Dataset]. http://doi.org/10.25405/data.ncl.26539735.v1
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    Newcastle University
    Authors
    Zheng Luo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The final dataset used for the publication "Investigating Reinforcement Learning Approaches In Stock Market Trading" was produced by downloading and combining data from multiple reputable sources to suit the specific needs of this project. Raw data were retrieved using a Python finance API; Python and NumPy were then used to combine and normalise the data into the final dataset.

    The raw data were sourced as follows: stock prices of NVIDIA and AMD, financial indexes, and commodity prices were retrieved from Yahoo Finance, and economic indicators were collected from the US Federal Reserve. The dataset was normalised to minute intervals, and the stock prices were adjusted to account for stock splits.

    This dataset was used to explore the application of reinforcement learning in stock market trading. After creating the dataset, it was used in a reinforcement learning environment to train several reinforcement learning algorithms, including deep Q-learning, policy networks, policy networks with baselines, actor-critic methods, and time-series incorporation. The performance of these algorithms was then compared based on profit made and other financial evaluation metrics.

    The attached 'README.txt' contains methodological information and a glossary of all the variables in the .csv file.
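    A hedged sketch of the retrieval and normalisation steps described above; the description only states that a Python finance API was used, so the package (yfinance), tickers, date range, and min-max normalisation below are illustrative assumptions rather than the authors' exact pipeline:

    import yfinance as yf

    # Download recent minute-interval closing prices for the two stocks named above.
    prices = yf.download(["NVDA", "AMD"], period="5d", interval="1m")["Close"].dropna()

    # Min-max normalise each column to [0, 1] (one common normalisation choice).
    normalised = (prices - prices.min()) / (prices.max() - prices.min())
    print(normalised.head())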

  6. Text Script Analytics Code for Automatic Video Generation

    • data.mendeley.com
    Updated Aug 22, 2025
    + more versions
    Cite
    gaganpreet gagan (2025). Text Script Analytics Code for Automatic Video Generation [Dataset]. http://doi.org/10.17632/kgngzzs5c8.5
    Explore at:
    Dataset updated
    Aug 22, 2025
    Authors
    gaganpreet gagan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Python notebook (research work) provides a comprehensive solution for text analysis and hint extraction, useful for building computational scenes from input text.

    It includes a collection of functions that can be used to preprocess textual data, extract information such as characters, relationships, emotions, dates, times, addresses, locations, purposes, and hints from the text.

    Key Features:

    • Preprocessing Collected Data: The notebook offers preprocessing capabilities to remove unwanted strings, normalize text data, and prepare it for further analysis.
    • Character Extraction: Functions to extract characters from the text, count the number of characters, and determine the number of male and female characters.
    • Relationship Extraction: Functions to calculate possible relationships among characters and extract the relationship names.
    • Dominant Emotion Extraction: A function to extract the dominant emotion from the text.
    • Date and Time Extraction: Functions to extract dates and times from the text, including handling phrases like "before," "after," "in the morning," and "in the evening."
    • Address and Location Extraction: Functions to extract addresses and locations from the text, including identifying specific places like offices, homes, rooms, or bathrooms.
    • Purpose Extraction: Functions to extract the purpose of the text.
    • Hint Collection: The ability to collect hints from the text based on specific keywords or phrases.
    • Sample Implementations: Sample Python code for each function, demonstrating how to use it effectively.

    This notebook serves as a valuable resource for text analysis tasks, assisting in extracting essential information and hints from textual data. It can be used in various domains such as natural language processing, sentiment analysis, and information retrieval. The code is well documented and can be easily integrated into existing projects or workflows.
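    A small illustration of the kind of extraction described above; the regular expression and function name are illustrative and not the notebook's actual API:

    import re

    def extract_times(text):
        # Find simple clock times such as "5 pm" or "10:30 am".
        pattern = r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b"
        return re.findall(pattern, text, flags=re.IGNORECASE)

    print(extract_times("They met at 10:30 am at the office and left before 5 pm."))
    # ['10:30 am', '5 pm']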

  7. EdX, Coursera, and Udemy Course Data

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Karar Haitham (2025). EdX, Coursera, and Udemy Course Data [Dataset]. https://www.kaggle.com/datasets/kararhaitham/courses
    Explore at:
    Available download formats: zip (49594908 bytes)
    Dataset updated
    Apr 11, 2025
    Authors
    Karar Haitham
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains course information and metadata scraped from two popular MOOC platforms: EdX and Coursera (Udemy will be added soon, although the script to scrape it is available on my GitHub: https://github.com/karar-git/Quamus). It includes various course attributes such as descriptions, program details, and other relevant data, making it useful for building recommendation systems, educational tools, or performing analysis on online learning content.

    The dataset includes the following files:

    combine_preprocessing.py: A Python script that preprocesses and combines the raw data into a unified format. You must run this script to generate the processed dataset.
    
    combined_dataset.json: The final, preprocessed, and combined dataset of course metadata from both EdX and Coursera.
    
    edx_courses.json: Raw course data scraped from the EdX platform.
    
    edx_degree_programs.json: Data on degree programs available on EdX.
    
    edx_executive_education_paidstuff.json: Paid course and executive education data from EdX.
    
    edx_programs.json: Data on various programs available on EdX.
    
    processed_coursera_data.json: Processed course data scraped from Coursera.
    

    How to Use:

    To generate the final combined_dataset.json, you need to run the combine_preprocessing.py script. This script will process the raw data files, then clean, normalize, and combine them into one unified dataset.

    Disclaimer:

    Data Source: The data in this dataset was scraped from publicly available information on EdX and Coursera. The scraping was done solely for educational and research purposes. The scraping process adheres to the terms of use of the respective platforms.
    
    Usage: This dataset is intended for non-commercial use only. Please use responsibly and adhere to the terms and conditions of the platforms from which the data was collected.
    
    No Warranty: This data is provided "as-is" without any warranty. Users are responsible for ensuring that their use of the data complies with the relevant platform policies.
    

    Models:

    In addition to the data, there are machine learning models related to this dataset available on GitHub. These models can help with content-based course recommendations and are built using the data you will find here. Specifically, the models include:

    A cosine similarity-based model for course recommendations.
    
    A two-tower model for personalized recommendations, trained using pseudo-labels.
    
    A transformer-based course predictor (work in progress) designed to suggest the next course based on a user's learning progression.
    

    Note:

    The dataset currently contains data only from EdX and Coursera. The script to scrape Udemy data can be found in the related GitHub repository.
    

    You will need to access the GitHub repository to view and experiment with the models and the Udemy scraping script. The models require the data files from this dataset to work properly.

    By uploading this dataset to Kaggle, you can explore these educational resources and leverage them for building custom educational tools or analyzing online course trends.

  8. Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals...

    • plos.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    François Serra; Davide Baù; Mike Goodstadt; David Castillo; Guillaume J. Filion; Marc A. Marti-Renom (2023). Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors [Dataset]. http://doi.org/10.1371/journal.pcbi.1005665
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    François Serra; Davide Baù; Mike Goodstadt; David Castillo; Guillaume J. Filion; Marc A. Marti-Renom
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The sequence of a genome is insufficient to understand all genomic processes carried out in the cell nucleus. To achieve this, the knowledge of its three-dimensional architecture is necessary. Advances in genomic technologies and the development of new analytical methods, such as Chromosome Conformation Capture (3C) and its derivatives, provide unprecedented insights in the spatial organization of genomes. Here we present TADbit, a computational framework to analyze and model the chromatin fiber in three dimensions. Our package takes as input the sequencing reads of 3C-based experiments and performs the following main tasks: (i) pre-process the reads, (ii) map the reads to a reference genome, (iii) filter and normalize the interaction data, (iv) analyze the resulting interaction matrices, (v) build 3D models of selected genomic domains, and (vi) analyze the resulting models to characterize their structural properties. To illustrate the use of TADbit, we automatically modeled 50 genomic domains from the fly genome revealing differential structural features of the previously defined chromatin colors, establishing a link between the conformation of the genome and the local chromatin composition. TADbit provides three-dimensional models built from 3C-based experiments, which are ready for visualization and for characterizing their relation to gene expression and epigenetic states. TADbit is an open-source Python library available for download from https://github.com/3DGenomes/tadbit.

  9. Koei Tecmo games

    • kaggle.com
    zip
    Updated Jun 20, 2025
    Cite
    Alonso Villa Rivera (2025). Koei Tecmo games [Dataset]. https://www.kaggle.com/datasets/alonsovillarivera/koei-tecmo-games
    Explore at:
    Available download formats: zip (35342 bytes)
    Dataset updated
    Jun 20, 2025
    Authors
    Alonso Villa Rivera
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains a curated list of video games developed or published by Tecmo Koei, compiled via web scraping from Wikipedia. Tecmo and Koei, two prominent Japanese video game companies, merged in 2009, creating a legacy of titles across a wide range of platforms and genres.

    Purpose

    The main purpose of this dataset is to serve as a learning tool for practicing data cleaning, standardization, and exploratory analysis. Real-world data, even when sourced from structured platforms like Wikipedia, often comes with inconsistencies, missing values, and formatting issues. This dataset offers a realistic example of how to:

    • Clean textual data (e.g., standardizing genres and platforms).
    • Handle missing or inconsistent entries.
    • Normalize categorical values.
    • Prepare scraped data for machine learning or visualization tasks.
    • It's especially useful for students, junior data analysts, or anyone learning to work with messy data in Python using tools like pandas, NumPy, and regex.

    New in This Version: Enhanced Data and Platform Explosion

    This updated version of the dataset includes expanded coverage and a key pre-processed transformation to enhance its utility for analysis.

    Previously, the dataset provided core information about game titles, platforms, release dates, genres, developers, publishers, and descriptions. In this release, we've focused on two significant improvements:

    1. Expanded Game Coverage: We've broadened the scope of the original dataset to include a more comprehensive list of Tecmo Koei titles, ensuring a richer and more complete view of their extensive game catalog. This means you'll find even more games to analyze, providing a deeper understanding of their history and output.

    2. Pre-processed Platform Data (Exploded View): To facilitate more granular analysis, particularly for games released on multiple platforms, we've included a transformed version of the dataset where the Platforms column has been "exploded."

    Originally, the Platforms column might contain comma-separated values (e.g., "PC, PlayStation, Nintendo Switch"). This structure can be challenging for direct analysis when you want to count games per individual platform. The "exploded" dataset now presents each unique platform for a game as a separate row, duplicating the other game details. This means if "Ninja Gaiden" was released on "Xbox, PlayStation 3," it will appear as two separate rows in the exploded dataset—one for "Xbox" and one for "PlayStation 3."

    This transformation significantly simplifies tasks like:

    Counting games released on specific platforms. Analyzing platform-specific trends in genres or release dates. Creating more accurate visualizations of platform distribution. The original, untransformed data is also included, allowing users to practice the exploding technique themselves if desired.
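    A minimal pandas sketch of the exploding step described above; the column names follow the description and may need adjusting to the actual CSV headers:

    import pandas as pd

    games = pd.DataFrame({
        "Title": ["Ninja Gaiden"],
        "Platforms": ["Xbox, PlayStation 3"],
    })

    # Split the comma-separated platforms and give each platform its own row.
    games["Platforms"] = games["Platforms"].str.split(",").apply(
        lambda parts: [p.strip() for p in parts])
    exploded = games.explode("Platforms")
    print(exploded)  # two rows for "Ninja Gaiden": one for Xbox, one for PlayStation 3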

  10. Insurance_claims

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Oct 19, 2025
    Cite
    Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
    Explore at:
    Available download formats: zip (68984 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Miannotti
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AQQAD, ABDELRAHIM (2023), “insurance_claims ”, Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

    https://data.mendeley.com/datasets/992mh7dk9y/2

    Latest version: Version 2, published 22 Aug 2023, DOI: 10.17632/992mh7dk9y.2

    Data Acquisition: - Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y - Download and store the dataset locally for easy access during subsequent steps.

    Data Loading & Initial Exploration: - Use Python's Pandas library to load the dataset into a DataFrame. Code used:

    import pandas as pd

    # Load the dataset file
    insurance_df = pd.read_csv('insurance_claims.csv')

    • Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.

    Data Cleaning & Pre-processing: - Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data. - Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed. - Normalize or standardize features if necessary.

    Exploratory Data Analysis (EDA): - Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration. - Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'. - Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

    Feature Engineering & Selection: - Create or transform existing features to improve model performance. - Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.

    Modeling: - Split the dataset into training and test sets to ensure the model's generalizability. - Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn. - Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).
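    A minimal sketch of this modelling step, continuing from the insurance_df loaded earlier; it keeps only numeric predictors for brevity (categorical encoding is omitted), assumes missing values have already been handled as described above, and uses the fraud_reported target named in the EDA step. It is not the exact code from the study:

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from imblearn.over_sampling import SMOTE

    X = insurance_df.select_dtypes("number")
    y = insurance_df["fraud_reported"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Oversample the minority (fraud) class on the training split only.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X_res, y_res)
    print(classification_report(y_test, model.predict(X_test)))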

    Model Evaluation: - Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix. - Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.

    Model Interpretation: - Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

    Deployment & Prediction: - Utilize the best-performing model to make predictions on unseen data. - If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).

    Software & Tools: - Programming Language: Python (version: GoogleColab) - Libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP. - Environment: Jupyter Notebook or any Python IDE.

  11. Brain Tumor CSV

    • kaggle.com
    zip
    Updated Oct 30, 2024
    Cite
    Akash Nath (2024). Brain Tumor CSV [Dataset]. https://www.kaggle.com/datasets/akashnath29/brain-tumor-csv/code
    Explore at:
    Available download formats: zip (538175483 bytes)
    Dataset updated
    Oct 30, 2024
    Authors
    Akash Nath
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.

    Motivation and Use Cases

    Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include: - Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images. - Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection. - Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.

    Data Structure

    This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).

    CSV File Contents

    • Pixel Values: Each row contains the pixel values of a single grayscale image, flattened into a 1-dimensional array. The original image dimensions vary, and rows in the CSV will correspondingly vary in length.
    • Simplified Access: By using a CSV format, this dataset avoids the need for specialized image processing libraries and can be easily loaded into data analysis and machine learning frameworks like Pandas, Scikit-Learn, and TensorFlow.

    How to Use This Dataset

    1. Loading the Data: The CSV can be loaded using standard data analysis libraries, making it compatible with Python, R, and other platforms.
    2. Data Preprocessing: Users may normalize pixel values (e.g., to between 0 and 1) for deep learning applications (see the sketch after this list).
    3. Splitting Data: While this dataset does not predefine training and testing splits, users can separate rows into training, validation, and test sets.
    4. Reshaping for Models: If needed, each row can be reshaped to the original dimensions (retrieved from the subfolder structure) to view or process as an image.
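    A minimal loading-and-normalization sketch following steps 1 and 2 above; the CSV file name is a placeholder, and a header row, if present, should be skipped first:

    import csv
    import numpy as np

    # Rows have different lengths (original resolutions), so each row becomes
    # its own 1-D array rather than one rectangular matrix.
    images = []
    with open("brain_tumor_pixels.csv", newline="") as f:  # placeholder file name
        for row in csv.reader(f):
            pixels = np.array(row, dtype=np.float32)
            images.append(pixels / 255.0)  # scale 0..255 to 0..1

    print(len(images), images[0].shape)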

    Technical Details

    • Image Format: Grayscale MRI images, with pixel values ranging from 0 to 255.
    • Resolution: Original resolution, no resizing applied.
    • Size: Each row’s length varies according to the original dimensions of each MRI image.
    • Data Type: CSV file with integer pixel values.

    Acknowledgments

    This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.

  12. Images used for training, validation, and testing.

    • kaggle.com
    Updated Mar 15, 2024
    Cite
    Chrysthian Chrisley (2024). Images used for training, validation, and testing. [Dataset]. https://www.kaggle.com/datasets/chrysthian/images-used-for-training-validation-and-testing
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 15, 2024
    Dataset provided by
    Kaggle
    Authors
    Chrysthian Chrisley
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Imports:

    # All Imports
    import os
    from matplotlib import pyplot as plt
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    import seaborn as sns
    import matplotlib.image as mpimg
    import cv2
    import numpy as np
    import pickle
    
    # TensorFlow and Keras: layers, models, optimizers, and losses
    import tensorflow as tf
    from tensorflow import keras
    from keras import Sequential
    from keras.layers import *
    
    # Optimizer (Adamax)
    from keras.optimizers import Adamax
    
    # PreTrained Model
    from keras.applications import *
    
    #Early Stopping
    from keras.callbacks import EarlyStopping
    import warnings 
    

    Warnings Suppression | Configuration

    # Warnings Remove 
    warnings.filterwarnings("ignore")
    
    # Define the base path for the training folder
    base_path = 'jaguar_cheetah/train'
    
    # Weights file
    weights_file = 'Model_train_weights.weights.h5'
    
    # Path to the saved or to save the model:
    model_file = 'Model-cheetah_jaguar_Treined.keras'
    
    # Model history
    history_path = 'training_history_cheetah_jaguar.pkl'
    
    # Initialize lists to store file paths and labels
    filepaths = []
    labels = []
    
    # Iterate over folders and files within the training directory
    for folder in ['Cheetah', 'Jaguar']:
      folder_path = os.path.join(base_path, folder)
      for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        filepaths.append(file_path)
        labels.append(folder)
    
    # Create the TRAINING dataframe
    file_path_series = pd.Series(filepaths , name= 'filepath')
    Label_path_series = pd.Series(labels , name = 'label')
    df_train = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
    
    
    # Define the base path for the test folder
    directory = "jaguar_cheetah/test"
    
    filepath =[]
    label = []
    
    folds = os.listdir(directory)
    
    for fold in folds:
      f_path = os.path.join(directory , fold)
      
      imgs = os.listdir(f_path)
      
      for img in imgs:
        
        img_path = os.path.join(f_path , img)
        filepath.append(img_path)
        label.append(fold)
        
    # Create the TEST dataframe
    file_path_series = pd.Series(filepath , name= 'filepath')
    Label_path_series = pd.Series(label , name = 'label')
    df_test = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
    
    # Display the first rows of the dataframe for verification
    #print(df_train)
    
    # Folders with Training and Test files
    data_dir = 'jaguar_cheetah/train'
    test_dir = 'jaguar_cheetah/test'
    
    # Image size 256x256
    IMAGE_SIZE = (256,256) 
    

    Train | Test

    #print('Training Images:')
    
    # Create the TRAIN dataframe
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.1,
      subset='training',
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    #Testing Data
    #print('Validation Images:')
    validation_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir, 
      validation_split=0.1,
      subset='validation',
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    print('Testing Images:')
    test_ds = tf.keras.utils.image_dataset_from_directory(
      test_dir, 
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    # Extract labels
    train_labels = train_ds.class_names
    test_labels = test_ds.class_names
    validation_labels = validation_ds.class_names
    
    # Encode labels
    # Defining the class labels
    class_labels = ['CHEETAH', 'JAGUAR'] 
    
    # Instantiate (encoder) LabelEncoder
    label_encoder = LabelEncoder()
    
    # Fit the label encoder on the class labels
    label_encoder.fit(class_labels)
    
    # Transform the labels for the training dataset
    train_labels_encoded = label_encoder.transform(train_labels)
    
    # Transform the labels for the validation dataset
    validation_labels_encoded = label_encoder.transform(validation_labels)
    
    # Transform the labels for the testing dataset
    test_labels_encoded = label_encoder.transform(test_labels)
    
    # Normalize the pixel values
    
    # Train files 
    train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
    # Validate files
    validation_ds = validation_ds.map(lambda x, y: (x / 255.0, y))
    # Test files
    test_ds = test_ds.map(lambda x, y: (x / 255.0, y))
    
    #TRAINING VISUALIZATION
    #Count the occurrences of each category in the column
    count = df_train['label'].value_counts()
    
    # Create a figure with 2 subplots
    fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')
    
    # Plot a pie chart on the first subplot
    palette = sns.color_palette("viridis")
    sns.set_palette(palette)
    axs[0].pie(count, labels=count.index, autopct='%1.1f%%', startangle=140)
    axs[0].set_title('Distribution of Training Categories')
    
    # Plot a bar chart on the second subplot
    sns.barplot(x=count.index, y=count.values, ax=axs[1], palette="viridis")
    axs[1].set_title('Count of Training Categories')
    
    # Adjust the layout
    plt.tight_layout()
    
    # Visualize
    plt.show()
    
    # TEST VISUALIZATION
    count = df_test['label'].value_counts()
    
    # Create a figure with 2 subplots
    fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')
    
