Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, users should be aware that these data are fake and must not be used to test or make inferences about specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (each stored in two versions, in CSV and RDS formats) is the following:
The repository also provides the following additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the individual datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate risks associated with various predictors (including the main exposure, represented by PM2.5), and these relationships are then used to simulate death events in each year. Details on the modelling aspects are provided in the article.
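To make the second step concrete, here is a minimal Python/NumPy sketch of how yearly death events can be sampled from Cox-type hazards. This is only a conceptual illustration, not the authors' R code from the GitHub repo; the baseline hazards, coefficients, and covariates below are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical illustration: yearly death events sampled from risks implied by
# a Cox-type model. All numbers below are made-up placeholders.
n_subjects, n_years = 1000, 5
baseline_hazard = np.full(n_years, 0.01)               # assumed yearly baseline hazard h0(t)
beta_pm25, beta_age = 0.08, 0.07                       # assumed log-hazard ratios
pm25 = rng.normal(10, 2, size=(n_subjects, n_years))   # fake annual PM2.5 exposure
age0 = rng.integers(40, 70, size=n_subjects)

alive = np.ones(n_subjects, dtype=bool)
death_year = np.full(n_subjects, -1)
for t in range(n_years):
    # Cox-type yearly hazard: h_i(t) = h0(t) * exp(beta' x_i(t))
    lin_pred = beta_pm25 * (pm25[:, t] - 10) + beta_age * (age0 + t - 55)
    hazard = baseline_hazard[t] * np.exp(lin_pred)
    p_death = 1 - np.exp(-hazard)                      # probability of dying in year t
    dies = alive & (rng.random(n_subjects) < p_death)
    death_year[dies] = t
    alive &= ~dies

print(f"simulated deaths: {(~alive).sum()} / {n_subjects}")
```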
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description: This dataset contains 5 sample PDF Electronic Health Records (EHRs), generated as part of a synthetic healthcare data project. The purpose of this dataset is to assist with sales distribution, offering potential users and stakeholders a glimpse of how synthetic EHRs can look and function. These records have been crafted to mimic realistic admission data while ensuring privacy and compliance with all data protection regulations.
Key Features:
1. Synthetic Data: Entirely artificial data created for testing and demonstration purposes.
2. PDF Format: Records are presented in PDF format, commonly used in healthcare systems.
3. Diverse Use Cases: Useful for evaluating tools related to data parsing, machine learning in healthcare, or EHR management systems.
4. Rich Admission Details: Includes admission-related data that highlights the capabilities of synthetic EHR generation.
Potential Use Cases:
Feel free to use this dataset for non-commercial testing and demonstration purposes. Feedback and suggestions for improvements are always welcome!
A synthetic dataset including simulated versions of Nightingale Health's NMR quantification of 251 metabolites and biomarkers.
Based on the UK Biobank synthetic data. See the UK Biobank Showcase schema for descriptions of the columns included.
https://spdx.org/licenses/CC0-1.0.html
To study data-scarcity mitigation for learning-based visual localization methods via sim-to-real transfer, we curate and present the CrossLoc benchmark datasets: multimodal aerial sim-to-real data for flights over natural and urban terrain. Unlike previous computer vision datasets that focus on localization in a single domain (mostly real RGB images), the provided benchmark datasets include various multimodal synthetic cues paired with every real photo. Complementary to the paired real and synthetic data, we offer rich synthetic data that efficiently fills the flight envelope volume in the vicinity of the real data.
The synthetic data were rendered using the proposed data generation workflow TOPO-DataGen. The provided CrossLoc datasets were used as an initial benchmark to showcase the use of synthetic data to assist visual localization in the real world with limited real data. The dataset collection, processing, and validation details are explained in our paper at https://arxiv.org/abs/2112.09081, and the code is available at https://github.com/TOPO-EPFL/CrossLoc.
https://creativecommons.org/publicdomain/zero/1.0/
This is a synthetic data set derived from a simple hierarchical decision model, built to demonstrate the decision support system DEX. The decision model comprised six attributes, including buying and maintenance price, the number of passengers, and the size of the luggage boot, and evaluated the utility of the car from a buyer's perspective. All attributes were discrete, each having three or four values. The data set provides the car's utility for every possible combination of attribute values. It was originally created to showcase the ability of machine learning by function decomposition to recreate the hierarchy of the decision model.
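As a small illustration of "every possible combination of attribute values", the Python sketch below enumerates such a grid. The attribute names and values are an assumption (they mirror the classic car evaluation data this description appears to refer to), and the original DEX utility labels are not reproduced here.

```python
from itertools import product

# Hypothetical attribute space, assumed to mirror the classic car evaluation
# data; the original DEX utility labels are not reproduced here.
attributes = {
    "buying":   ["vhigh", "high", "med", "low"],
    "maint":    ["vhigh", "high", "med", "low"],
    "doors":    ["2", "3", "4", "5more"],
    "persons":  ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety":   ["low", "med", "high"],
}
rows = list(product(*attributes.values()))
print(len(rows))   # 4*4*4*3*3*3 = 1728 combinations, one row each
print(rows[0])
```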
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains sample synthetic data used for training a solution for reading analog pressure gauge values. We used this data while writing our paper and blog posts, which showcase how synthetic data can be used to train and apply computer vision models. We chose the topic of analog gauge reading because it is a common problem in many industries and exemplifies how the output from multiple models can be consumed by heuristics to produce a final reading.
The dataset contains the following:
- A subset of the synthetic data used for training; we have included the two latest dataset versions. Each contains both the images and the COCO annotations for segmentation and pose estimation.
- Inference data for the test videos available in the Kaggle dataset. For each video there is one CSV file that contains, for every frame, the bounding box of the (main) gauge, the keypoint locations for the needle tip, gauge center, and min and max scale ticks, and the predicted reading.
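Given keypoints like those in the inference CSVs (needle tip, gauge center, min and max scale ticks), a reading can be derived with a simple angle-interpolation heuristic. The Python sketch below is a hypothetical illustration of that idea, not the exact heuristic from the paper; the scale range and keypoint coordinates are made up.

```python
import math

def gauge_reading(center, needle_tip, min_tick, max_tick, min_value=0.0, max_value=10.0):
    """Estimate a gauge reading from predicted keypoints (hypothetical heuristic).

    center, needle_tip, min_tick, max_tick: (x, y) pixel coordinates.
    The scale range (min_value, max_value) is an assumption; real gauges vary.
    """
    def angle(p):
        # Angle of point p around the gauge center, in radians in [0, 2*pi).
        return math.atan2(p[1] - center[1], p[0] - center[0]) % (2 * math.pi)

    a_min, a_max, a_needle = angle(min_tick), angle(max_tick), angle(needle_tip)
    sweep = (a_max - a_min) % (2 * math.pi)          # arc from min tick to max tick
    pos = (a_needle - a_min) % (2 * math.pi)         # needle position along that arc
    frac = max(0.0, min(1.0, pos / sweep))           # clamp if needle is outside the scale
    return min_value + frac * (max_value - min_value)

# Example with made-up keypoints (pixel coordinates):
print(gauge_reading(center=(320, 240), needle_tip=(250, 180),
                    min_tick=(250, 310), max_tick=(390, 310)))
```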
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data scarcity and discontinuity are common in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, the data are often processed as monthly or yearly aggregates, on which prevalent forecasting tools such as the Autoregressive Integrated Moving Average (ARIMA), the Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. This paper proposes a novel Bayesian algorithm, Stochastic Bayesian Downscaling (SBD), that can regenerate downscaled time series of varying lengths from aggregated data while preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (Dengue, Covid-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data in terms of statistical properties, trend, seasonality, and residuals. In terms of forecasting performance, using the last 12 years of Dengue infection data from Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data instead of the actual aggregated data.
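As a rough illustration of the downscaling idea (not the SBD algorithm itself), the Python sketch below splits an aggregated value into sub-period values whose sum is preserved, using Dirichlet-distributed weights; the concentration parameter and case counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def downscale_preserving_sum(aggregate, n_periods, concentration=5.0):
    """Toy illustration of sum-preserving downscaling (not the SBD algorithm):
    sub-period shares are drawn from a Dirichlet distribution, so they vary
    stochastically while their total always equals the aggregate."""
    weights = rng.dirichlet(np.full(n_periods, concentration))
    return aggregate * weights

# Example: break a yearly case count into 12 synthetic monthly counts.
yearly_cases = 12_000
monthly = downscale_preserving_sum(yearly_cases, 12)
print(monthly.round(1), monthly.sum())   # the sum equals 12000 (up to float precision)
```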
This demonstration utilized a fraud detection data set and kernel, referenced below, to showcase the accuracy and safety of using the products of the kymera fabrication machine.
The original data set we used is the Synthetic Financial Datasets For Fraud Detection. This file accurately mimics the features of the original data set while in fact generating the entire data set from scratch.
https://creativecommons.org/publicdomain/zero/1.0/
Context: This synthetic healthcare dataset has been created to serve as a valuable resource for data science, machine learning, and data analysis enthusiasts. It is designed to mimic real-world healthcare data, enabling users to practice, develop, and showcase their data manipulation and analysis skills in the context of the healthcare industry.
Inspiration: The inspiration behind this dataset is rooted in the need for practical and diverse healthcare data for educational and research purposes. Healthcare data is often sensitive and subject to privacy regulations, making it challenging to access for learning and experimentation. To address this gap, I have leveraged Python's Faker library to generate a dataset that mirrors the structure and attributes commonly found in healthcare records. By providing this synthetic data, I hope to foster innovation, learning, and knowledge sharing in the healthcare analytics domain.
Dataset Information: Each column provides specific information about the patient, their admission, and the healthcare services provided, making this dataset suitable for various data analysis and modeling tasks in the healthcare domain. Here's a brief explanation of each column in the dataset:
- Name: This column represents the name of the patient associated with the healthcare record.
- Age: The age of the patient at the time of admission, expressed in years.
- Gender: Indicates the gender of the patient, either "Male" or "Female."
- Blood Type: The patient's blood type, which can be one of the common blood types (e.g., "A+", "O-", etc.).
- Medical Condition: This column specifies the primary medical condition or diagnosis associated with the patient, such as "Diabetes," "Hypertension," "Asthma," and more.
- Date of Admission: The date on which the patient was admitted to the healthcare facility.
- Doctor: The name of the doctor responsible for the patient's care during their admission.
- Hospital: Identifies the healthcare facility or hospital where the patient was admitted.
- Insurance Provider: This column indicates the patient's insurance provider, which can be one of several options, including "Aetna," "Blue Cross," "Cigna," "UnitedHealthcare," and "Medicare."
- Billing Amount: The amount of money billed for the patient's healthcare services during their admission. This is expressed as a floating-point number.
- Room Number: The room number where the patient was accommodated during their admission.
- Admission Type: Specifies the type of admission, which can be "Emergency," "Elective," or "Urgent," reflecting the circumstances of the admission.
- Discharge Date: The date on which the patient was discharged from the healthcare facility, based on the admission date and a random number of days within a realistic range.
- Medication: Identifies a medication prescribed or administered to the patient during their admission. Examples include "Aspirin," "Ibuprofen," "Penicillin," "Paracetamol," and "Lipitor."
- Test Results: Describes the results of a medical test conducted during the patient's admission. Possible values include "Normal," "Abnormal," or "Inconclusive," indicating the outcome of the test.
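For a sense of how such records can be produced with Faker, here is a minimal hypothetical sketch; the value pools, ranges, and helper function are illustrative guesses, not the exact generation code behind this dataset.

```python
import random
from datetime import timedelta

from faker import Faker

fake = Faker()
random.seed(7)

def make_record():
    """Generate one hypothetical record following the column description above."""
    admission = fake.date_between(start_date="-2y", end_date="today")
    return {
        "Name": fake.name(),
        "Age": random.randint(18, 90),
        "Gender": random.choice(["Male", "Female"]),
        "Blood Type": random.choice(["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]),
        "Medical Condition": random.choice(["Diabetes", "Hypertension", "Asthma"]),
        "Date of Admission": admission,
        "Doctor": fake.name(),
        "Hospital": fake.company(),
        "Insurance Provider": random.choice(
            ["Aetna", "Blue Cross", "Cigna", "UnitedHealthcare", "Medicare"]),
        "Billing Amount": round(random.uniform(500.0, 50_000.0), 2),
        "Room Number": random.randint(100, 500),
        "Admission Type": random.choice(["Emergency", "Elective", "Urgent"]),
        "Discharge Date": admission + timedelta(days=random.randint(1, 30)),
        "Medication": random.choice(["Aspirin", "Ibuprofen", "Penicillin", "Paracetamol", "Lipitor"]),
        "Test Results": random.choice(["Normal", "Abnormal", "Inconclusive"]),
    }

print(make_record())
```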
Usage Scenarios: This dataset can be utilized for a wide range of purposes, including:
- Developing and testing healthcare predictive models.
- Practicing data cleaning, transformation, and analysis techniques.
- Creating data visualizations to gain insights into healthcare trends.
- Learning and teaching data science and machine learning concepts in a healthcare context.
- Treating it as a multi-class classification problem and solving it for Test Results, which contains 3 categories (Normal, Abnormal, and Inconclusive).
Acknowledgments: Image Credit: Image by BC Y from Pixabay
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide a synthetic reference data set covering over 100,000 labeled references (mostly Russian language) and a manually annotated set of real references (771 in number) gathered from multidisciplinary Cyrillic script publications.
Background:
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded consists of publications written in Cyrillic script languages. To address this problem, we introduce a new multilingual and multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and contains over 700 manually labeled references, while the remainder are generated synthetically. Using random samples of varying sizes from this data, we train multiple well-performing sequence-labeling BERT models and thus show the usability of our proposed data set. To this end, we showcase an implementation of a multilingual BERT model trained on the synthetic data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly outperforms a state-of-the-art model we retrain and evaluate on our data.
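As a sketch of the modelling setup (multilingual BERT fine-tuned for token-level sequence labeling of reference strings), the snippet below shows how such a model can be instantiated with the Hugging Face transformers library. The label set is an assumption for illustration, the actual label scheme is defined in the linked repository, and the model here is untrained, so its predictions are meaningless.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed label scheme for illustration only; see the linked repo for the real one.
labels = ["O", "B-AUTHOR", "I-AUTHOR", "B-TITLE", "I-TITLE", "B-YEAR"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

reference = "Иванов И. И. Методы анализа данных. Москва: Наука, 2015."
inputs = tokenizer(reference, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, num_labels)

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, pid in zip(tokens, pred_ids):
    print(tok, labels[pid])                         # untrained model => arbitrary labels
```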
The code for generating the data set is available at https://github.com/igor261/Sequence-Labeling-for-Citation-Field-Extraction-from-Cyrillic-Script-References
When using the data set, please cite the following paper:
Igor Shapiro, Tarek Saier, Michael Färber: "Sequence Labeling for Citation Field Extraction from Cyrillic Script References". In Proceedings of the AAAI-22 Workshop on Scientific Document Understanding (SDU@AAAI'22), 2022.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset simulates a medical appointment scheduling system, designed to demonstrate practical applications of data generation techniques in the healthcare field. Although synthetic, the data is based on real-world values to enhance its realism and utility.
The primary goals of this dataset are:
The dataset contains three main tables:
The dataset simulates a medical office operating Monday to Friday, from 8:00 AM to 6:00 PM, with appointments scheduled every 15 minutes (4 per hour). Key parameters include:
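A toy Python sketch of the slot grid implied by these parameters (Monday to Friday, 8:00 AM to 6:00 PM, one slot every 15 minutes) is shown below; it only illustrates the schedule, not the actual generation scripts or the three tables.

```python
from datetime import datetime, timedelta

def weekly_slots(week_start: datetime):
    """Enumerate the appointment slots of one week: Mon-Fri, 8:00-18:00, every 15 min."""
    slots = []
    for day in range(5):                                  # Monday .. Friday
        t = week_start + timedelta(days=day, hours=8)     # 8:00 AM
        end = week_start + timedelta(days=day, hours=18)  # 6:00 PM
        while t < end:
            slots.append(t)
            t += timedelta(minutes=15)                    # 4 slots per hour
    return slots

slots = weekly_slots(datetime(2024, 1, 1))                # 2024-01-01 is a Monday
print(len(slots), slots[0], slots[-1])                    # 200 slots per week
```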
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
🏦 Synthetic Loan Approval Dataset
A Realistic, High-Quality Dataset for Credit Risk Modelling
🎯 Why This Dataset?
Most loan datasets on Kaggle have unrealistic patterns where:
Unlike most loan datasets available online, this one is built on real banking criteria from US and Canadian financial institutions. Drawing from 3 years of hands-on finance industry experience, the dataset incorporates realistic correlations and business logic that reflect how actual lending decisions are made. This makes it perfect for data scientists looking to build portfolio projects that showcase not just coding ability, but genuine understanding of credit risk modelling.
📊 Dataset Overview
| Metric | Value |
|---|---|
| Total Records | 50,000 |
| Features | 20 (customer_id + 18 predictors + 1 target) |
| Target Distribution | 55% Approved, 45% Rejected |
| Missing Values | 0 (Complete dataset) |
| Product Types | Credit Card, Personal Loan, Line of Credit |
| Market | United States & Canada |
| Use Case | Binary Classification (Approved/Rejected) |
🔑 Key Features
Identifier:
- Customer ID (unique identifier for each application)

Demographics:
- Age, Occupation Status, Years Employed

Financial Profile:
- Annual Income, Credit Score, Credit History Length
- Savings/Assets, Current Debt

Credit Behaviour:
- Defaults on File, Delinquencies, Derogatory Marks

Loan Request:
- Product Type, Loan Intent, Loan Amount, Interest Rate

Calculated Ratios:
- Debt-to-Income, Loan-to-Income, Payment-to-Income
💡 What Makes This Dataset Special?
1️⃣ Real-World Approval Logic
The dataset implements actual banking criteria (a rough code sketch of these rules follows the lists below):
- DTI ratio > 50% = automatic rejection
- Defaults on file = instant reject
- Credit score bands match real lending thresholds
- Employment verification for loans ≥$20K

2️⃣ Realistic Correlations
- Higher income → Better credit scores
- Older applicants → Longer credit history
- Students → Lower income, special treatment for small loans
- Loan intent affects approval (Education best, Debt Consolidation worst)

3️⃣ Product-Specific Rules
- Credit Cards: More lenient, higher limits
- Personal Loans: Standard criteria, up to $100K
- Line of Credit: Capped at $50K, manual review for high amounts

4️⃣ Edge Cases Included
- Young applicants (age 18) building first credit
- Students with thin credit files
- Self-employed with variable income
- High debt-to-income ratios
- Multiple delinquencies
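To make the approval logic under 1️⃣ concrete, here is a hypothetical Python re-implementation of the headline rules. The field names are illustrative, not the dataset's actual column names, and any threshold not explicitly stated above (such as the credit-score cut-off) is an invented assumption; the dataset's full generation logic is of course richer than this.

```python
def approve(application: dict) -> bool:
    """Hypothetical sketch of the headline approval rules listed above."""
    if application["defaults_on_file"] > 0:        # defaults on file = instant reject
        return False
    if application["debt_to_income"] > 0.50:       # DTI > 50% = automatic rejection
        return False
    if application["loan_amount"] >= 20_000 and not application["employment_verified"]:
        return False                               # employment verification for large loans
    return application["credit_score"] >= 640      # assumed cut-off within a realistic band

print(approve({"defaults_on_file": 0, "debt_to_income": 0.32,
               "loan_amount": 15_000, "employment_verified": False,
               "credit_score": 700}))              # True
```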
🎓 Perfect For
- Machine Learning Practice: Binary classification with real patterns
- Credit Risk Modelling: Learn actual lending criteria
- Portfolio Projects: Build impressive, explainable models
- Feature Engineering: Rich dataset with meaningful relationships
- Business Analytics: Understand financial decision-making
📈 Quick Stats
Approval Rates by Product
- Credit Card: 60.4% (more lenient)
- Personal Loan: 46.9% (standard)
- Line of Credit: 52.6% (moderate)

Loan Intent (Best → Worst Approval Odds)
1. Education (63% approved)
2. Personal (58% approved)
3. Medical/Home (52% approved)
4. Business (48% approved)
5. Debt Consolidation (40% approved)

Credit Score Distribution
- Mean: 644
- Range: 300-850
- Realistic bell curve around 600-700

Income Distribution
- Mean: $50,063
- Median: $41,608
- Range: $15K - $250K
🎯 Expected Model Performance
With proper feature engineering and tuning:
- Accuracy: 75-85%
- ROC-AUC: 0.80-0.90
- F1-Score: 0.75-0.85

Important: Feature importance should show:
1. Credit Score (most important)
2. Debt-to-Income Ratio
3. Delinquencies
4. Loan Amount
5. Income
If your model shows different patterns, something's wrong!
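A quick, hypothetical sanity check along these lines could look like the sketch below. The file name loan_approval.csv and the column names customer_id and approved are guesses for illustration; adjust them to the actual files.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file and column names; replace with the real ones.
df = pd.read_csv("loan_approval.csv")
y = df["approved"]
X = pd.get_dummies(df.drop(columns=["customer_id", "approved"]))   # one-hot categoricals

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])[:5]
print(top)   # expect credit score and debt-to-income ratio near the top
```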
🏆 Use Cases & Projects
Beginner
- Binary classification with XGBoost/Random Forest
- EDA and visualization practice
- Feature importance analysis

Intermediate
- Custom threshold optimization (profit maximization)
- Cost-sensitive learning (false positive vs false negative)
- Ensemble methods and stacking

Advanced
- Explainable AI (SHAP, LIME)
- Fairness analysis across demographics
- Production-ready API with FastAPI/Flask
- Streamlit deployment with business rules
⚠️ Important Notes
This is SYNTHETIC Data
- Generated based on real banking criteria
- No real customer data was used
- Safe for public sharing and portfolio use

Limitations
- Simplified approval logic (real banks use 100+ factors)
- No temporal component (no time series)
- Single country/currency assumed (USD)
- No external factors (economy, market conditions)

Educational Purpose
This dataset is designed for:
- Learning credit risk modeling
- Portfolio projects
- ML practice
- Understanding lending criteria

NOT for:
- Actual lending decisions
- Financial advice
- Production use without validation
🤝 Contributing
Found an issue? Have suggestions?
- Open an issue on GitHub
- Suggest i...
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ChatGPT, a general artificial intelligence, has been recognized as a powerful tool in scientific writing and programming, but its use as a medical tool is largely overlooked. The general accessibility, rapid response time and comprehensive training database might enable ChatGPT to serve as a diagnostic augmentation tool in certain clinical settings. The diagnostic process in neurology is often challenging and complex. In certain time-sensitive scenarios, rapid evaluation and diagnostic decisions are needed, while in other cases clinicians are faced with rare disorders and atypical disease manifestations. Due to these factors, the diagnostic accuracy in neurology is often suboptimal. Here we evaluated whether ChatGPT can be utilized as a valuable and innovative diagnostic augmentation tool in various neurological settings. We used synthetic data generated by neurological experts to represent descriptive anamneses of patients with known neurology-related diseases, then measured the probability of an appropriate diagnosis made by ChatGPT. To give clarity to the accuracy of the AI-determined diagnosis, all cases were cross-validated by other experts and by general medical doctors as well. We found that ChatGPT's diagnostic accuracy (ranging from 68.5% ± 3.28% to 83.83% ± 2.73%) can reach the accuracy of other experts (81.66% ± 2.02%); furthermore, it surpasses the probability of an appropriate diagnosis when the examiner is a general medical doctor (57.15% ± 2.64%). Our results showcase the efficacy of a general artificial intelligence like ChatGPT as a diagnostic augmentation tool in medicine. In the future, AI-based supporting tools might be useful amendments to medical practice and help to improve the diagnostic process in neurology.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Sua Música (suamusica.com.br) is one of the largest online platforms in Latin America and the ultimate online destination for Brazilian music enthusiasts. Whether you're a passionate listener, a budding musician, or simply curious about the rich sounds of Brazilian culture, you've come to the right place. At suamusica.com.br, we revolutionize the way music is shared and enjoyed in Brazil. Our platform offers a vast collection of songs, albums, and playlists spanning various genres and artists, ensuring that there's something for everyone. From samba and bossa nova to funk and pagode, our extensive catalog covers it all. Explore, discover, and create personalized playlists that match your mood and taste. Immerse yourself in exclusive content, such as live performances, interviews, and behind-the-scenes glimpses into the lives of your favorite musicians. Join our thriving community of music lovers, connect with fellow fans, and embark on a musical journey that will transport you to the vibrant world of Brazilian music. Experience the rhythm, energy, and diversity of suamusica.com.br, and let the melodies of Brazil captivate your senses.
Welcome to the Kaggle challenge dedicated to creating a recommendation system for the suamusica.com.br platform! If you're passionate about music and data science, this challenge is the perfect opportunity to showcase your skills and contribute to enhancing the music experience for its users. The suamusica.com.br platform, with its vast collection of songs and genres, presents an exciting opportunity to develop an intelligent recommendation system that can suggest personalized music choices to users based on their preferences. By participating in this challenge, you'll dive into the world of collaborative filtering, machine learning algorithms, and data analysis to create a recommendation system that will revolutionize how users discover new music on suamusica.com.br. Join us on this exciting journey and let's unlock the power of data to provide personalized music recommendations to millions of users.
This challenge is a little different from what Kaggle users are used to: it is not only about machine learning and high accuracy. We expect you to create a pipeline for a recommendation system for a music streaming platform. We provide a script that creates synthetic data: one dataset contains transactional data with play counts by user and by day, a second contains dimensional data correlating track ids with artist ids and musical genre, and a final dataset contains metrics about artists. The actual values are not the main point of the challenge; the pipeline is. Focus on the algorithms you can use and the types of features you can use, bearing in mind that it is a streaming platform, so think about average track duration, likes, follows, plays received on specific days of the week or at specific times of the day, bpm of songs, genres, and so forth. The synthetic data generation scripts were also left as a challenge if you want to improve them, for example by creating correlations between features, adding metric features to the transactional data, or adding more information to the dimensional datasets. Explore all the information available and be thorough in the pipeline description; the ETL is also very important, and naming technologies (stacks) matters as well, for example the use of AWS Lambdas or Airflow to orchestrate the whole pipeline. The codes are written in Python, mostly NumPy. Feel free to explore other libraries with out-of-the-box solutions, but keep in mind that we will score creative and technical solutions with deterministic mathematical and statistical algorithms higher. Another important point: once the model is deployed, describe how you would monitor the performance of your system, which performance indicators (KPIs) you would use, and how you would measure them.
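As a starting point for the modelling part, here is a tiny item-based collaborative-filtering sketch in NumPy (the style the provided scripts use). The play-count matrix below is invented; in a real pipeline it would be aggregated from the synthetic transactional data.

```python
import numpy as np

# Invented user x track play-count matrix, only to illustrate the technique.
plays = np.array([
    [12, 0, 3, 0],   # plays per track for user 0
    [0, 7, 0, 5],
    [10, 1, 4, 0],
    [0, 6, 0, 8],
], dtype=float)

# Cosine similarity between track columns.
norms = np.linalg.norm(plays, axis=0, keepdims=True)
norms[norms == 0] = 1.0
sim = (plays / norms).T @ (plays / norms)          # (n_tracks, n_tracks)
np.fill_diagonal(sim, 0.0)

# Score unseen tracks for a user by similarity-weighted plays, then recommend.
user = 0
scores = sim @ plays[user]
scores[plays[user] > 0] = -np.inf                  # do not re-recommend played tracks
print("recommended track:", int(np.argmax(scores)))
```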
The scripts provided by the Data Science team of Sua Música do not contain information about the platform database; the averages and standard deviations do not represent statistical population information about the platform's users. The data structure is also generic and represents the usual refined relational datasets that any streaming platform data team would possess.