17 datasets found
  1. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Sep 17, 2025
    Cite
    Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Explore at:
    Available download formats: bin, csv, zip, pdf
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Gasparrini; Jacopo Vanoli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

    • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available here, with code provided in this GitHub repo]
    • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available here, with code provided in this GitHub repo]

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, users should be aware that these data are fake and must not be used for testing or making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The synthetic datasets (each stored in two versions, csv and RDS format) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    In addition, this repository provides these additional files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].
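
    A minimal pandas sketch for assembling the csv versions of these files; the file paths and the participant identifier column name ("id") are assumptions for illustration, not documented values:

    ```python
    import pandas as pd

    # Assumed file names (csv versions) and an assumed participant ID column "id".
    cohort = pd.read_csv("synthbdcohortinfo.csv")   # follow-up period, birth/death dates
    basevar = pd.read_csv("synthbdbasevar.csv")     # baseline variables at recruitment
    pm = pd.read_csv("synthpmdata.csv")             # annual average PM2.5 per participant
    deaths = pd.read_csv("synthoutdeath.csv")       # death records with date and ICD-10 code

    # Merge cohort information with baseline variables (one row per participant),
    # then attach the annual PM2.5 exposures (one row per participant-year).
    main = cohort.merge(basevar, on="id", how="left")
    analysis = main.merge(pm, on="id", how="left")

    print(analysis.shape)
    print(analysis.head())
    ```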

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate the risks associated with various predictors (including the main exposure, PM2.5), and these relationships are then used to simulate death events in each year. Details on the modelling aspects are provided in the article.
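
    To illustrate the second step, here is a loose sketch of the idea using the lifelines package; it is not the authors' script, and the file name, column names, and baseline annual rate are all assumptions:

    ```python
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(42)

    # Hypothetical analysis dataset with follow-up time (years), event indicator and predictors.
    df = pd.read_csv("analysis_dataset.csv")
    predictors = ["pm25", "age", "sex"]   # assumed predictor names

    # Step 1: fit a Cox proportional hazards model on the (original) data.
    cph = CoxPHFitter()
    cph.fit(df[["futime", "death"] + predictors], duration_col="futime", event_col="death")

    # Step 2: use the fitted relative hazards to simulate death events for one year.
    rel_hazard = cph.predict_partial_hazard(df)   # per-subject relative hazard
    baseline_annual_rate = 0.005                  # assumed baseline yearly event rate
    p_death = 1.0 - np.exp(-baseline_annual_rate * rel_hazard.to_numpy())
    df["sim_death_year1"] = rng.binomial(1, np.clip(p_death, 0, 1))
    ```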

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are meant only for illustrative purposes, and they must not be used to test other research hypotheses.

  2. Cynthia Data - synthetic EHR records

    • kaggle.com
    zip
    Updated Jan 24, 2025
    Cite
    Craig Calderone (2025). Cynthia Data - synthetic EHR records [Dataset]. https://www.kaggle.com/datasets/craigcynthiaai/cynthia-data-synthetic-ehr-records
    Explore at:
    Available download formats: zip (2654924 bytes)
    Dataset updated
    Jan 24, 2025
    Authors
    Craig Calderone
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains 5 sample PDF Electronic Health Records (EHRs), generated as part of a synthetic healthcare data project. The purpose of this dataset is to assist with sales distribution, offering potential users and stakeholders a glimpse of how synthetic EHRs can look and function. These records have been crafted to mimic realistic admission data while ensuring privacy and compliance with all data protection regulations.

    Key Features:

    1. Synthetic Data: Entirely artificial data created for testing and demonstration purposes.
    2. PDF Format: Records are presented in PDF format, commonly used in healthcare systems.
    3. Diverse Use Cases: Useful for evaluating tools related to data parsing, machine learning in healthcare, or EHR management systems.
    4. Rich Admission Details: Includes admission-related data that highlights the capabilities of synthetic EHR generation.

    Potential Use Cases:

    • Demonstrating EHR-related tools or services.
    • Benchmarking data parsing models for PDF health records.
    • Showcasing synthetic healthcare data in sales or marketing efforts.

    Feel free to use this dataset for non-commercial testing and demonstration purposes. Feedback and suggestions for improvements are always welcome!
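
    For the parsing use case, a minimal sketch with the pypdf library (file name assumed) that pulls raw text out of one of the sample PDF records:

    ```python
    from pypdf import PdfReader

    # Assumed file name for one of the five sample records.
    reader = PdfReader("sample_ehr_1.pdf")

    # Concatenate the extracted text of every page; downstream parsers
    # (regexes, NER models, etc.) would operate on this raw text.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    print(text[:500])
    ```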

  3. Nightingale Health Synthetic Cohort Data

    • kaggle.com
    zip
    Updated Oct 21, 2024
    Cite
    Luke Jostins-Dean (2024). Nightingale Health Synthetic Cohort Data [Dataset]. https://www.kaggle.com/datasets/lukejostinsdean/nightingale-health-synthetic-cohort-data
    Explore at:
    Available download formats: zip (376725468 bytes)
    Dataset updated
    Oct 21, 2024
    Authors
    Luke Jostins-Dean
    Description

    A synthetic dataset including simulated versions of Nightingale Health's NMR quantification of 251 metabolites and biomarkers.

    Based on the UK Biobank synthetic data. See the UK Biobank Showcase schema for descriptions of the columns included.

  4. CrossLoc Benchmark Datasets

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Mar 27, 2022
    Cite
    Iordan Doytchinov; Qi Yan; Jianhao Zheng; Simon Reding; Shanci Li (2022). CrossLoc Benchmark Datasets [Dataset]. http://doi.org/10.5061/dryad.mgqnk991c
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 27, 2022
    Dataset provided by
    École Polytechnique Fédérale de Lausanne
    Authors
    Iordan Doytchinov; Qi Yan; Jianhao Zheng; Simon Reding; Shanci Li
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    To study data-scarcity mitigation for learning-based visual localization methods via sim-to-real transfer, we curate and present the CrossLoc benchmark datasets: multimodal aerial sim-to-real data for flights above natural and urban terrains. Unlike previous computer vision datasets that focus on localization in a single domain (mostly real RGB images), the provided benchmark datasets include various multimodal synthetic cues paired to all real photos. Complementary to the paired real and synthetic data, we offer rich synthetic data that efficiently fills the flight envelope volume in the vicinity of the real data.

    The synthetic data rendering was achieved using the proposed data generation workflow TOPO-DataGen. The provided CrossLoc datasets were used as an initial benchmark to showcase the use of synthetic data to assist visual localization in the real world with limited real data. Please refer to our main paper at https://arxiv.org/abs/2112.09081 and our code at https://github.com/TOPO-EPFL/CrossLoc for details. Methods The dataset collection, processing, and validation details are explained in our paper available at https://arxiv.org/abs/2112.09081 and our code available at https://github.com/TOPO-EPFL/CrossLoc.

  5. Data from: CarEvaluation

    • kaggle.com
    zip
    Updated Apr 27, 2020
    Cite
    Davor Budimir (2020). CarEvaluation [Dataset]. https://www.kaggle.com/davorbudimir/carevaluation
    Explore at:
    Available download formats: zip (5110 bytes)
    Dataset updated
    Apr 27, 2020
    Authors
    Davor Budimir
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a synthetic data set derived from a simple hierarchical decision model built to demonstrate the decision support system DEX. The decision model included six attributes, including the buying and maintenance price, the number of passengers, and the size of the luggage boot, and evaluated the utility of the car from a buyer's perspective. All attributes were discrete, with three or four values each. The data set provides the car's utility for all possible combinations of attribute values. It was originally created to showcase the ability of machine learning by function decomposition to recreate the hierarchy of the decision model.
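
    Since all attributes are discrete, a minimal scikit-learn sketch could look like the following; the file name and column layout (six attribute columns followed by the utility class) are assumptions:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Assumed file name and layout: six discrete attributes plus a utility class column.
    df = pd.read_csv("car_evaluation.csv")
    X = pd.get_dummies(df.iloc[:, :-1])   # one-hot encode the discrete attributes
    y = df.iloc[:, -1]                    # car utility class

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    ```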

  6. Synthetic Data for Precision Gauge Reading

    • kaggle.com
    zip
    Updated Jul 11, 2024
    Cite
    Endava (2024). Synthetic Data for Precision Gauge Reading [Dataset]. https://www.kaggle.com/datasets/endava/synthetic-data-for-precision-gauge-reading/data
    Explore at:
    Available download formats: zip (2455661096 bytes)
    Dataset updated
    Jul 11, 2024
    Dataset authored and provided by
    Endava
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset contains sample synthetic data used for training a solution for reading analog pressure gauge values. We used it while writing our paper and blog(s), which showcase how synthetic data can be used to train and use computer vision models. We chose the topic of analog gauge reading as it is a common problem in many industries and exemplifies how output from multiple models can be consumed in heuristics to get a final reading.

    Dataset contents

    The dataset contains the following:

    • A subset of the synthetic data used for training; we have included the two latest versions of the datasets. Each contains both the images and the COCO annotations for segmentation and pose estimation.
    • Inference data for the test videos available in the Kaggle dataset. For each video there is one CSV file which contains, for every frame, the bbox for the (main) gauge, keypoint locations for the needle tip, gauge center, min and max scale ticks, and the predicted reading.
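
    As a rough illustration of the kind of heuristic described above (not the authors' implementation), the per-frame keypoints can be turned into a reading by comparing the needle angle with the min/max tick angles; the coordinates and scale limits below are assumptions:

    ```python
    import math

    def gauge_reading(center, needle_tip, min_tick, max_tick, min_val, max_val):
        """Map the needle angle to a value between min_val and max_val.

        center, needle_tip, min_tick, max_tick are (x, y) pixel coordinates
        taken from the per-frame keypoints; min_val/max_val are the scale limits.
        """
        def angle(p):
            return math.atan2(p[1] - center[1], p[0] - center[0])

        # Sweep angles measured from the min tick (image y-axis points down).
        full_sweep = (angle(max_tick) - angle(min_tick)) % (2 * math.pi)
        needle_sweep = (angle(needle_tip) - angle(min_tick)) % (2 * math.pi)
        frac = min(needle_sweep / full_sweep, 1.0) if full_sweep else 0.0
        return min_val + frac * (max_val - min_val)

    # Hypothetical keypoints for one frame, assumed 0-10 bar scale.
    print(gauge_reading((320, 240), (250, 180), (230, 320), (410, 320), 0.0, 10.0))
    ```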

  7. Selection of best model based on criteria.

    • plos.figshare.com
    xls
    Updated Dec 14, 2023
    Cite
    Mahadee Al Mobin; Md. Kamrujjaman (2023). Selection of best model based on criteria. [Dataset]. http://doi.org/10.1371/journal.pone.0295803.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mahadee Al Mobin; Md. Kamrujjaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data scarcity and discontinuity are common occurrences in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, such data are often processed as monthly/yearly aggregates, on which the prevalent forecasting tools like Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. The paper proposes a novel algorithm named Stochastic Bayesian Downscaling (SBD), based on the Bayesian approach, that can regenerate downscaled time series of varying lengths from aggregated data, preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (dengue, COVID-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data regarding their statistical properties, trend, seasonality, and residuals. In terms of forecasting performance, using the last 12 years of dengue infection data in Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data over actual aggregated data.
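
    As a toy illustration of the sum-preserving constraint only (this is not the SBD algorithm from the paper, just a simple random allocation under assumed Dirichlet weights), a monthly aggregate can be split into daily counts whose total matches the original value:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    monthly_total = 930   # assumed aggregate count for one 30-day month
    days = 30

    # Draw daily weights from a Dirichlet prior and allocate the aggregate
    # with a multinomial draw, so the daily series always sums to the original total.
    weights = rng.dirichlet(np.ones(days))
    daily = rng.multinomial(monthly_total, weights)

    assert daily.sum() == monthly_total
    print(daily)
    ```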

  8. S2 Data -

    • plos.figshare.com
    txt
    Updated Dec 14, 2023
    Cite
    Mahadee Al Mobin; Md. Kamrujjaman (2023). S2 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0295803.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mahadee Al Mobin; Md. Kamrujjaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data scarcity and discontinuity are common occurrences in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, such data are often processed as monthly/yearly aggregates, on which the prevalent forecasting tools like Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. The paper proposes a novel algorithm named Stochastic Bayesian Downscaling (SBD), based on the Bayesian approach, that can regenerate downscaled time series of varying lengths from aggregated data, preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (dengue, COVID-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data regarding their statistical properties, trend, seasonality, and residuals. In terms of forecasting performance, using the last 12 years of dengue infection data in Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data over actual aggregated data.

  9. Fabricated Fraud Detection

    • kaggle.com
    zip
    Updated Dec 2, 2019
    Cite
    Gilad (2019). Fabricated Fraud Detection [Dataset]. https://www.kaggle.com/giladmanor/fraud-detection
    Explore at:
    Available download formats: zip (354814093 bytes)
    Dataset updated
    Dec 2, 2019
    Authors
    Gilad
    Description

    Demonstration of Synthetic data usability for Fraud Detection

    This demonstration utilized a fraud detection data set and kernel, referenced below, to showcase the accuracy and safety of using the products of the Kymera fabrication machine.

    The original data set we used is the Synthetic Financial Datasets For Fraud Detection. This file accurately mimics the original data set's features while in fact generating the entire data set from scratch.

  10. Coefficients of SARIMA (1, 0, 0)(0, 1, 1)12.

    • plos.figshare.com
    xls
    Updated Dec 14, 2023
    Cite
    Mahadee Al Mobin; Md. Kamrujjaman (2023). Coefficients of SARIMA (1, 0, 0)(0, 1, 1)12. [Dataset]. http://doi.org/10.1371/journal.pone.0295803.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mahadee Al Mobin; Md. Kamrujjaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data scarcity and discontinuity are common occurrences in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, such data are often processed as monthly/yearly aggregates, on which the prevalent forecasting tools like Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. The paper proposes a novel algorithm named Stochastic Bayesian Downscaling (SBD), based on the Bayesian approach, that can regenerate downscaled time series of varying lengths from aggregated data, preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (dengue, COVID-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data regarding their statistical properties, trend, seasonality, and residuals. In terms of forecasting performance, using the last 12 years of dengue infection data in Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data over actual aggregated data.

  11. Healthcare Dataset

    • kaggle.com
    zip
    Updated May 8, 2024
    + more versions
    Cite
    Prasad Patil (2024). Healthcare Dataset [Dataset]. https://www.kaggle.com/datasets/prasad22/healthcare-dataset/discussion
    Explore at:
    Available download formats: zip (3054550 bytes)
    Dataset updated
    May 8, 2024
    Authors
    Prasad Patil
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context:

    This synthetic healthcare dataset has been created to serve as a valuable resource for data science, machine learning, and data analysis enthusiasts. It is designed to mimic real-world healthcare data, enabling users to practice, develop, and showcase their data manipulation and analysis skills in the context of the healthcare industry.

    Inspiration:

    The inspiration behind this dataset is rooted in the need for practical and diverse healthcare data for educational and research purposes. Healthcare data is often sensitive and subject to privacy regulations, making it challenging to access for learning and experimentation. To address this gap, I have leveraged Python's Faker library to generate a dataset that mirrors the structure and attributes commonly found in healthcare records. By providing this synthetic data, I hope to foster innovation, learning, and knowledge sharing in the healthcare analytics domain.

    Dataset Information:

    Each column provides specific information about the patient, their admission, and the healthcare services provided, making this dataset suitable for various data analysis and modeling tasks in the healthcare domain. Here's a brief explanation of each column in the dataset:

    • Name: This column represents the name of the patient associated with the healthcare record.
    • Age: The age of the patient at the time of admission, expressed in years.
    • Gender: Indicates the gender of the patient, either "Male" or "Female."
    • Blood Type: The patient's blood type, which can be one of the common blood types (e.g., "A+", "O-", etc.).
    • Medical Condition: This column specifies the primary medical condition or diagnosis associated with the patient, such as "Diabetes," "Hypertension," "Asthma," and more.
    • Date of Admission: The date on which the patient was admitted to the healthcare facility.
    • Doctor: The name of the doctor responsible for the patient's care during their admission.
    • Hospital: Identifies the healthcare facility or hospital where the patient was admitted.
    • Insurance Provider: This column indicates the patient's insurance provider, which can be one of several options, including "Aetna," "Blue Cross," "Cigna," "UnitedHealthcare," and "Medicare."
    • Billing Amount: The amount of money billed for the patient's healthcare services during their admission. This is expressed as a floating-point number.
    • Room Number: The room number where the patient was accommodated during their admission.
    • Admission Type: Specifies the type of admission, which can be "Emergency," "Elective," or "Urgent," reflecting the circumstances of the admission.
    • Discharge Date: The date on which the patient was discharged from the healthcare facility, based on the admission date and a random number of days within a realistic range.
    • Medication: Identifies a medication prescribed or administered to the patient during their admission. Examples include "Aspirin," "Ibuprofen," "Penicillin," "Paracetamol," and "Lipitor."
    • Test Results: Describes the results of a medical test conducted during the patient's admission. Possible values include "Normal," "Abnormal," or "Inconclusive," indicating the outcome of the test.
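
    A minimal sketch of how records with a subset of these columns could be produced with the Faker library; the category lists and value ranges below are assumptions for illustration, not the generator actually used for this dataset:

    ```python
    import random
    from faker import Faker

    fake = Faker()
    random.seed(0)

    conditions = ["Diabetes", "Hypertension", "Asthma"]       # assumed subset
    admission_types = ["Emergency", "Elective", "Urgent"]
    test_results = ["Normal", "Abnormal", "Inconclusive"]

    def make_record():
        return {
            "Name": fake.name(),
            "Age": random.randint(18, 90),
            "Gender": random.choice(["Male", "Female"]),
            "Medical Condition": random.choice(conditions),
            "Date of Admission": fake.date_between(start_date="-2y", end_date="today"),
            "Doctor": fake.name(),
            "Billing Amount": round(random.uniform(500, 50000), 2),
            "Admission Type": random.choice(admission_types),
            "Test Results": random.choice(test_results),
        }

    records = [make_record() for _ in range(5)]
    print(records[0])
    ```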

    Usage Scenarios:

    This dataset can be utilized for a wide range of purposes, including:

    • Developing and testing healthcare predictive models.
    • Practicing data cleaning, transformation, and analysis techniques.
    • Creating data visualizations to gain insights into healthcare trends.
    • Learning and teaching data science and machine learning concepts in a healthcare context.
    • Treating it as a multi-class classification problem and solving it for Test Results, which contains 3 categories (Normal, Abnormal, and Inconclusive).

    Acknowledgments:

    • I acknowledge the importance of healthcare data privacy and security and emphasize that this dataset is entirely synthetic. It does not contain any real patient information or violate any privacy regulations.
    • I hope that this dataset contributes to the advancement of data science and healthcare analytics and inspires new ideas. Feel free to explore, analyze, and share your findings with the Kaggle community.

    Image Credit:

    Image by BC Y from Pixabay

  12. Data for Cyrillic Reference Parsing

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    Updated Dec 24, 2021
    Cite
    Shapiro, Igor; Saier, Tarek; Färber, Michael (2021). Data for Cyrillic Reference Parsing [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5801913
    Explore at:
    Dataset updated
    Dec 24, 2021
    Dataset provided by
    Karlsruhe Institute of Technology (KIT)
    Authors
    Shapiro, Igor; Saier, Tarek; Färber, Michael
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We provide a synthetic reference data set covering over 100,000 labeled references (mostly Russian language) and a manually annotated set of real references (771 in number) gathered from multidisciplinary Cyrillic script publications.

    Background:

    Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded consists of publications written in Cyrillic script languages. To address this problem, we introduce a new multilingual and multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and contains over 700 manually labeled references, while the remainder are generated synthetically. With random samples of varying size drawn from this data, we train multiple well-performing sequence-labeling BERT models and thus show the usability of our proposed data set. To this end, we showcase an implementation of a multilingual BERT model trained on the synthetic data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly outperforms a state-of-the-art model that we retrain and evaluate on our data.
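
    A minimal sketch of how such a sequence-labeling model could be applied to a reference string with the Hugging Face transformers library; the checkpoint name and label set below are assumptions, not the authors' released model:

    ```python
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-AUTHOR", "I-AUTHOR", "B-TITLE", "I-TITLE", "B-YEAR"]   # assumed label set
    checkpoint = "bert-base-multilingual-cased"  # assumed base model, not the trained one

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

    reference = "Иванов И. И. Пример библиографической ссылки. Москва, 2020."
    inputs = tokenizer(reference, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, num_labels)

    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for tok, pid in zip(tokens, pred_ids):
        print(tok, labels[pid])
    ```

    The classification head here is untrained, so it would need fine-tuning on the synthetic references before the predicted labels are meaningful.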

    The code for generating the data set is available at https://github.com/igor261/Sequence-Labeling-for-Citation-Field-Extraction-from-Cyrillic-Script-References

    When using the data set, please cite the following paper:

    Igor Shapiro, Tarek Saier, Michael Färber: "Sequence Labeling for Citation Field Extraction from Cyrillic Script References". In Proceedings of the AAAI-22 Workshop on Scientific Document Understanding (SDU@AAAI'22), 2022.

  13. Coefficients of ARIMA(7,0,7).

    • plos.figshare.com
    xls
    Updated Dec 14, 2023
    Cite
    Mahadee Al Mobin; Md. Kamrujjaman (2023). Coefficients of ARIMA(7,0,7). [Dataset]. http://doi.org/10.1371/journal.pone.0295803.t010
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mahadee Al Mobin; Md. Kamrujjaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data scarcity and discontinuity are common occurrences in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, such data are often processed as monthly/yearly aggregates, on which the prevalent forecasting tools like Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. The paper proposes a novel algorithm named Stochastic Bayesian Downscaling (SBD), based on the Bayesian approach, that can regenerate downscaled time series of varying lengths from aggregated data, preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (dengue, COVID-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data regarding their statistical properties, trend, seasonality, and residuals. In terms of forecasting performance, using the last 12 years of dengue infection data in Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data over actual aggregated data.

  14. Medical Appointment Scheduling System

    • kaggle.com
    zip
    Updated Dec 3, 2024
    Cite
    María Carolina Gonzalez Galtier (2024). Medical Appointment Scheduling System [Dataset]. https://www.kaggle.com/datasets/carogonzalezgaltier/medical-appointment-scheduling-system/discussion
    Explore at:
    Available download formats: zip (4274383 bytes)
    Dataset updated
    Dec 3, 2024
    Authors
    María Carolina Gonzalez Galtier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset simulates a medical appointment scheduling system, designed to demonstrate practical applications of data generation techniques in the healthcare field. Although synthetic, the data is based on real-world values to enhance its realism and utility.

    Purpose

    The primary goals of this dataset are:

    • Learning: To help newcomers to data science or software development understand how data is structured and applied in real-world contexts.
    • Prototyping: To provide a foundation for developing and testing projects or features related to appointment scheduling systems.
    • Portfolio Showcase: To demonstrate skills in data manipulation, software development, and system design within a healthcare context.

    Dataset Structure

    The dataset contains three main tables:

    1. Slots Table

    • slot_id (Integer): Unique identifier for each time slot.
    • appointment_date (Date): Date of the appointment.
    • appointment_time (Time): Scheduled time of the appointment (15-minute intervals).
    • is_available (Boolean): Indicates if the slot is available (True) or not (False).

    2. Patients Table

    • patient_id (Integer): Unique identifier for each patient.
    • name (String, up to 60 characters): Full name of the patient.
    • sex (String): Gender of the patient ('Male', 'Female', 'Non-binary').
    • dob (Date): Date of birth in YYYY-MM-DD format.
    • insurance (String, up to 30 characters): Name of the patient's insurance provider from a predefined list of fictitious names.

    3. Appointments Table

    • appointment_id (Integer): Unique identifier for each appointment.
    • slot_id (Integer): References the slot in the Slots table.
    • scheduling_date (Date): Date when the appointment was scheduled.
    • appointment_date (Date): Date of the appointment.
    • appointment_time (Time): Scheduled time of the appointment.
    • scheduling_interval (Integer): Days between scheduling date and appointment date.
    • status (String): Appointment status ('available', 'scheduled', 'completed', 'cancelled', 'no-show').
    • check_in_time (Time): Actual time the patient checked in.
    • appointment_duration (Float): Duration of the appointment in minutes.
    • start_time (Time): Actual start time of the appointment.
    • end_time (Time): Actual end time of the appointment.
    • waiting_time (Float): Waiting time in minutes.
    • patient_id (Integer): References the patient in the Patients table.
    • sex (String): Gender of the patient.
    • age (Integer): Age of the patient.
    • age_group (String): Age group category of the patient.
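
    A minimal pandas sketch joining the three tables on the keys described above; the csv file names are assumptions:

    ```python
    import pandas as pd

    # Assumed csv exports of the three tables.
    slots = pd.read_csv("slots.csv")
    patients = pd.read_csv("patients.csv")
    appointments = pd.read_csv("appointments.csv")

    # Join appointments to their slot and to the patient record,
    # then look at average waiting time by appointment status.
    full = (appointments
            .merge(slots, on="slot_id", how="left", suffixes=("", "_slot"))
            .merge(patients, on="patient_id", how="left", suffixes=("", "_patient")))

    print(full.groupby("status")["waiting_time"].mean())
    ```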

    Key Parameters

    The dataset simulates a medical office operating Monday to Friday, from 8:00 AM to 6:00 PM, with appointments scheduled every 15 minutes (4 per hour). Key parameters include:

    • Booking Horizon: Appointments can be scheduled up to 30 days in advance.
    • Fill Rate: 90% of available slots are filled.
    • Rebooking Rate: 50% of cancelled appointments are rescheduled.
    • Average Scheduling Interval: Appointments are scheduled an average of 7 days in advance.
    • Appointment Duration:
      • Mean: 17.4 minutes.
      • Median: 15.8 minutes.
    • Patient Arrival Times:
      • 84.4% of patients arrive before their scheduled time.
      • Average early arrival: 10 minutes early.
    • Appointment Status Rates: Outcomes include:
      • Attended.
      • Cancelled (in advance).
      • No-show (missed without cancellation).
      • Unknown (unspecified or indeterminate).
    • Future Appointments: Simulated for the next 30 days, following an exponentially decreasing occupancy rate model.
    • Patient Visit Frequency: Patients visit an average of 1.2 times per year.
    • Age Groups: Defined in 5-year intervals, starting at 15 years and above.
    • Insurance Data:
      • A Pareto principle distribution is applied to simulate realistic market coverage.
      • Fictitious names are used for insurance providers.

    Patient Demographics

    • Names: Generated using the Faker library to create realistic, unique names.
    • Age and Sex: Based on real-world outpatient attendance data, excluding pediatric patients (under 15 years).

    Date Ranges

    • Covered Period: January 1, 2015, to December 31, 2024.
    • Reference Date: December 1, 2024, dividing past attended appointments from future appointments.

    References

    1. Tai-Seale, M., McGuire, T. G., & Zhang, W. (2007). Time allocation in primary care office visits. Health Services Research, 42(5), 1871–1894. https://doi.org/10.1111/j.1475-6773.2006.00689.x
    2. Cerruti, B., Garavaldi, D., & Lerario, A. (2023). Patient's punctuality in an outpatient clinic: the role of age, medical branch and geographical factors. BMC Health Services Research, 23(1), 1385. [https://doi.org/10.1186/s12913-...
  15. Realistic Loan Approval Dataset | US & Canada

    • kaggle.com
    zip
    Updated Nov 1, 2025
    Cite
    Parth Patel2130 (2025). Realistic Loan Approval Dataset | US & Canada [Dataset]. https://www.kaggle.com/datasets/parthpatel2130/realistic-loan-approval-dataset-us-and-canada
    Explore at:
    Available download formats: zip (1717268 bytes)
    Dataset updated
    Nov 1, 2025
    Authors
    Parth Patel2130
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    Canada, United States
    Description

    🏦 Synthetic Loan Approval Dataset

    A Realistic, High-Quality Dataset for Credit Risk Modelling

    🎯 Why This Dataset?

    Most loan datasets on Kaggle have unrealistic patterns where:

    1. ❌ Credit scores don't matter
    2. ❌ Approval logic is backwards
    3. ❌ Models learn nonsense patterns

    Unlike most loan datasets available online, this one is built on real banking criteria from US and Canadian financial institutions. Drawing from 3 years of hands-on finance industry experience, the dataset incorporates realistic correlations and business logic that reflect how actual lending decisions are made. This makes it perfect for data scientists looking to build portfolio projects that showcase not just coding ability, but genuine understanding of credit risk modelling.

    📊 Dataset Overview

    Total Records: 50,000
    Features: 20 (customer_id + 18 predictors + 1 target)
    Target Distribution: 55% Approved, 45% Rejected
    Missing Values: 0 (complete dataset)
    Product Types: Credit Card, Personal Loan, Line of Credit
    Market: United States & Canada
    Use Case: Binary Classification (Approved/Rejected)

    🔑 Key Features

    Identifier:
    • Customer ID (unique identifier for each application)

    Demographics:
    • Age, Occupation Status, Years Employed

    Financial Profile:
    • Annual Income, Credit Score, Credit History Length
    • Savings/Assets, Current Debt

    Credit Behaviour:
    • Defaults on File, Delinquencies, Derogatory Marks

    Loan Request:
    • Product Type, Loan Intent, Loan Amount, Interest Rate

    Calculated Ratios:
    • Debt-to-Income, Loan-to-Income, Payment-to-Income

    💡 What Makes This Dataset Special?

    1️⃣ Real-World Approval Logic
    The dataset implements actual banking criteria:
    • DTI ratio > 50% = automatic rejection
    • Defaults on file = instant reject
    • Credit score bands match real lending thresholds
    • Employment verification for loans ≥ $20K

    2️⃣ Realistic Correlations
    • Higher income → better credit scores
    • Older applicants → longer credit history
    • Students → lower income, special treatment for small loans
    • Loan intent affects approval (Education best, Debt Consolidation worst)

    3️⃣ Product-Specific Rules
    • Credit Cards: more lenient, higher limits
    • Personal Loans: standard criteria, up to $100K
    • Line of Credit: capped at $50K, manual review for high amounts

    4️⃣ Edge Cases Included
    • Young applicants (age 18) building first credit
    • Students with thin credit files
    • Self-employed with variable income
    • High debt-to-income ratios
    • Multiple delinquencies

    🎓 Perfect For
    • Machine Learning Practice: binary classification with real patterns
    • Credit Risk Modelling: learn actual lending criteria
    • Portfolio Projects: build impressive, explainable models
    • Feature Engineering: rich dataset with meaningful relationships
    • Business Analytics: understand financial decision-making
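
    A minimal sketch of a rule-based baseline built from the approval criteria listed above; the column names and the credit-score cut-off are assumptions, and real scoring involves many more factors:

    ```python
    import pandas as pd

    def rule_based_decision(row):
        """Toy baseline mirroring the stated criteria: high DTI or a default on file
        is an automatic rejection; otherwise approve above an assumed score cut-off."""
        if row["debt_to_income"] > 0.50:     # DTI ratio > 50% = automatic rejection
            return "Rejected"
        if row["defaults_on_file"] > 0:      # defaults on file = instant reject
            return "Rejected"
        if row["credit_score"] >= 640:       # assumed cut-off within realistic lending bands
            return "Approved"
        return "Rejected"

    df = pd.read_csv("loan_applications.csv")                     # assumed file name
    df["baseline_decision"] = df.apply(rule_based_decision, axis=1)
    print((df["baseline_decision"] == df["loan_status"]).mean())  # assumed target column
    ```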

    📈 Quick Stats

    Approval Rates by Product
    • Credit Card: 60.4% (more lenient)
    • Personal Loan: 46.9% (standard)
    • Line of Credit: 52.6% (moderate)

    Loan Intent (Best → Worst Approval Odds)
    1. Education (63% approved)
    2. Personal (58% approved)
    3. Medical/Home (52% approved)
    4. Business (48% approved)
    5. Debt Consolidation (40% approved)

    Credit Score Distribution
    • Mean: 644
    • Range: 300-850
    • Realistic bell curve around 600-700

    Income Distribution
    • Mean: $50,063
    • Median: $41,608
    • Range: $15K - $250K

    🎯 Expected Model Performance

    With proper feature engineering and tuning:
    • Accuracy: 75-85%
    • ROC-AUC: 0.80-0.90
    • F1-Score: 0.75-0.85

    Important: Feature importance should show:
    1. Credit Score (most important)
    2. Debt-to-Income Ratio
    3. Delinquencies
    4. Loan Amount
    5. Income

    If your model shows different patterns, something's wrong!

    🏆 Use Cases & Projects

    Beginner
    • Binary classification with XGBoost/Random Forest
    • EDA and visualization practice
    • Feature importance analysis

    Intermediate
    • Custom threshold optimization (profit maximization)
    • Cost-sensitive learning (false positive vs false negative)
    • Ensemble methods and stacking

    Advanced
    • Explainable AI (SHAP, LIME)
    • Fairness analysis across demographics
    • Production-ready API with FastAPI/Flask
    • Streamlit deployment with business rules

    ⚠️ Important Notes

    This is SYNTHETIC data:
    • Generated based on real banking criteria
    • No real customer data was used
    • Safe for public sharing and portfolio use

    Limitations:
    • Simplified approval logic (real banks use 100+ factors)
    • No temporal component (no time series)
    • Single country/currency assumed (USD)
    • No external factors (economy, market conditions)

    Educational Purpose. This dataset is designed for:
    • Learning credit risk modeling
    • Portfolio projects
    • ML practice
    • Understanding lending criteria

    NOT for:
    • Actual lending decisions
    • Financial advice
    • Production use without validation

    🤝 Contributing

    Found an issue? Have suggestions? - Open an issue on GitHub - Suggest i...

  16. Minimal dataset for the study.

    • plos.figshare.com
    xlsx
    Updated Oct 9, 2024
    Cite
    Bernát Nógrádi; Tamás Ferenc Polgár; Valéria Meszlényi; Zalán Kádár; Péter Hertelendy; Anett Csáti; László Szpisjak; Dóra Halmi; Barbara Erdélyi-Furka; Máté Tóth; Fanny Molnár; Dávid Tóth; Zsófia Bősze; Krisztina Boda; Péter Klivényi; László Siklós; Roland Patai (2024). Minimal dataset for the study. [Dataset]. http://doi.org/10.1371/journal.pone.0310028.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bernát Nógrádi; Tamás Ferenc Polgár; Valéria Meszlényi; Zalán Kádár; Péter Hertelendy; Anett Csáti; László Szpisjak; Dóra Halmi; Barbara Erdélyi-Furka; Máté Tóth; Fanny Molnár; Dávid Tóth; Zsófia Bősze; Krisztina Boda; Péter Klivényi; László Siklós; Roland Patai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ChatGPT, a general artificial intelligence, has been recognized as a powerful tool in scientific writing and programming but its use as a medical tool is largely overlooked. The general accessibility, rapid response time and comprehensive training database might enable ChatGPT to serve as a diagnostic augmentation tool in certain clinical settings. The diagnostic process in neurology is often challenging and complex. In certain time-sensitive scenarios, rapid evaluation and diagnostic decisions are needed, while in other cases clinicians are faced with rare disorders and atypical disease manifestations. Due to these factors, the diagnostic accuracy in neurology is often suboptimal. Here we evaluated whether ChatGPT can be utilized as a valuable and innovative diagnostic augmentation tool in various neurological settings. We used synthetic data generated by neurological experts to represent descriptive anamneses of patients with known neurology-related diseases, then the probability for an appropriate diagnosis made by ChatGPT was measured. To give clarity to the accuracy of the AI-determined diagnosis, all cases have been cross-validated by other experts and general medical doctors as well. We found that ChatGPT-determined diagnostic accuracy (ranging from 68.5% ± 3.28% to 83.83% ± 2.73%) can reach the accuracy of other experts (81.66% ± 2.02%), furthermore, it surpasses the probability of an appropriate diagnosis if the examiner is a general medical doctor (57.15% ± 2.64%). Our results showcase the efficacy of general artificial intelligence like ChatGPT as a diagnostic augmentation tool in medicine. In the future, AI-based supporting tools might be useful amendments in medical practice and help to improve the diagnostic process in neurology.

  17. Sua Música Challenge: Recommendation System

    • kaggle.com
    zip
    Updated Jul 23, 2023
    Cite
    Osvaldo Pereira (2023). Sua Música Challenge: Recommendation System [Dataset]. https://www.kaggle.com/datasets/osvaldopereira/sua-msica-recommendation-system
    Explore at:
    Available download formats: zip (5512 bytes)
    Dataset updated
    Jul 23, 2023
    Authors
    Osvaldo Pereira
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Sua Música

    Sua Música (suamusica.com.br) is one of the largest online platforms in Latin America and the ultimate online destination for Brazilian music enthusiasts. Whether you're a passionate listener, a budding musician, or simply curious about the rich sounds of Brazilian culture, you've come to the right place. At suamusica.com.br, we revolutionize the way music is shared and enjoyed in Brazil. Our platform offers a vast collection of songs, albums, and playlists spanning various genres and artists, ensuring that there's something for everyone. From samba and bossa nova to funk and pagode, our extensive catalog covers it all. Explore, discover, and create personalized playlists that match your mood and taste. Immerse yourself in exclusive content, such as live performances, interviews, and behind-the-scenes glimpses into the lives of your favorite musicians. Join our thriving community of music lovers, connect with fellow fans, and embark on a musical journey that will transport you to the vibrant world of Brazilian music. Experience the rhythm, energy, and diversity of suamusica.com.br, and let the melodies of Brazil captivate your senses.

    Welcome to the Kaggle challenge dedicated to creating a recommendation system for the suamusica.com.br platform! If you're passionate about music and data science, this challenge is the perfect opportunity to showcase your skills and contribute to enhancing the music experience for users of suamusica.com.br. The platform, with its vast collection of songs and genres, presents an exciting opportunity to develop an intelligent recommendation system that can suggest personalized music choices to users based on their preferences. By participating in this challenge, you'll dive into the world of collaborative filtering, machine learning algorithms, and data analysis to create a recommendation system that will revolutionize how users discover new music on suamusica.com.br. Join us on this exciting journey and let's unlock the power of data to provide personalized music recommendations to millions of users.

    The Challenge

    This challenge is a little different from what Kaggle users are used to. It is not only about machine learning and high accuracy: we expect you to create a pipeline for a recommendation system for a music streaming platform. We provide a script that creates synthetic data in three datasets: one contains transactional data with the number of plays by user and by day, the second contains dimensional data correlating the id of tracks with the id of artists and the musical genre, and the final dataset contains metrics about artists. The actual values are not the main point of the challenge; the pipeline is. Focus on the algorithms you can use and on the type of features you can use, given that it is a streaming platform, so think about average track duration, likes, follows, plays received on specific days of the week or specific times of the day, bpm of songs, genres, and so forth. The synthetic data generation scripts are also left as a challenge if you want to improve them, for example by creating correlation between features, adding metric features to the transactional data, or adding more information to the dimensional datasets. Explore all the information available and be thorough in the pipeline description; the ETL is also very important, and naming technologies (stacks) matters too, for example the use of AWS Lambdas or Airflow to orchestrate the whole pipeline. The codes are written in Python, mostly NumPy. Feel free to explore other libraries with out-of-the-box solutions, but keep in mind that we will score higher points for creative and technical solutions with deterministic mathematical and statistical algorithms. Another important point: after the model is deployed, describe how you would monitor the performance of your system, the performance indicators (KPIs) you would use, and how you would measure them.
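
    A minimal sketch of the kind of pipeline step the challenge asks for: join the transactional plays with the track dimension table and recommend, per user, popular unheard tracks from their most-played genre. The file and column names below are assumptions based on the description above:

    ```python
    import pandas as pd

    # Assumed column names for the synthetic datasets described above.
    plays = pd.read_csv("plays.csv")      # user_id, track_id, date, plays
    tracks = pd.read_csv("tracks.csv")    # track_id, artist_id, genre

    data = plays.merge(tracks, on="track_id", how="left")

    # Each user's favourite genre by total plays.
    fav_genre = (data.groupby(["user_id", "genre"])["plays"].sum()
                     .reset_index()
                     .sort_values("plays", ascending=False)
                     .drop_duplicates("user_id")[["user_id", "genre"]])

    # Global popularity of tracks within each genre.
    popularity = (data.groupby(["genre", "track_id"])["plays"].sum()
                      .reset_index()
                      .sort_values("plays", ascending=False))

    def recommend(user_id, n=10):
        genre = fav_genre.loc[fav_genre.user_id == user_id, "genre"].iloc[0]
        heard = set(data.loc[data.user_id == user_id, "track_id"])
        candidates = popularity.loc[popularity.genre == genre, "track_id"]
        return [t for t in candidates if t not in heard][:n]

    print(recommend(user_id=1))
    ```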

    Disclaimer

    The scripts provided by the Data Science team of Sua Música do not contain information about the platform database; the averages and standard deviations do not represent statistical population information of the platform's users. The data structure is also generic and represents the kind of refined relational datasets that any streaming platform data team would possess.

    What we expect

    1. Illustration of the model pipeline (ETL, data mining, deploy and evaluation)
    2. Final dataset with id_user as rows with a list of tracks for each user.
    3. Formal description of the algorithm used to determine the tracks for the users
    4. Detailed methodology of evaluation of the model (name at least one KPI that indicates performance)
    5. Documented code

    Evaluation

    1. Grade from 0 to 10 on the deep knowledge of data pipelines, ETL and data engineering. For example knowing how to deal with Big ...
