100+ datasets found

Synthetic Healthcare Database for Research (SyH-DR)
catalog.data.gov
healthdata.gov
+2more
Updated Sep 16, 2023
Cite
Agency for Healthcare Research and Quality (2023). Synthetic Healthcare Database for Research (SyH-DR) [Dataset]. https://catalog.data.gov/dataset/synthetic-healthcare-database-for-research-syh-dr
Explore at:
Dataset updated
Sep 16, 2023
Dataset provided by
Agency for Healthcare Research and Quality (http://www.ahrq.gov/)
Description
The Agency for Healthcare Research and Quality (AHRQ) created SyH-DR from eligibility and claims files for Medicare, Medicaid, and commercial insurance plans in calendar year 2016. SyH-DR contains data from a nationally representative sample of insured individuals for the 2016 calendar year. SyH-DR uses synthetic data elements at the claim level to resemble the marginal distribution of the original data elements. SyH-DR person-level data elements are not synthetic, but identifying information is aggregated or masked.
Australian synthetic healthcare data with Synthea
researchdata.edu.au
data.csiro.au
datadownload
Updated Jul 4, 2024
Cite
John Grimes; Michael Lawley; Roc Reguant Comellas; Sankalp Khanna; David Hansen; Denis Bauer; Parnesh Raniga; Hoa Ngo; Donna Truran; Hamed Hassanzadeh; Mitchell O'Brien; Ibrahima Diouf (2024). Australian synthetic healthcare data with Synthea [Dataset]. http://doi.org/10.25919/EFCW-BM49
Explore at:
Available download formats: datadownload
Unique identifier
https://doi.org/10.25919/EFCW-BM49
Dataset updated
Jul 4, 2024
Dataset provided by
CSIRO (http://www.csiro.au/)
Authors
John Grimes; Michael Lawley; Roc Reguant Comellas; Sankalp Khanna; David Hansen; Denis Bauer; Parnesh Raniga; Hoa Ngo; Donna Truran; Hamed Hassanzadeh; Mitchell O'Brien; Ibrahima Diouf
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Australia
Description
We developed an Australianised version of Synthea. Synthea is synthetic data generation software that uses publicly available population aggregate statistics such as demographics, disease prevalence and incidence rates, and health reports. Synthea generates data based on manually curated models of clinical workflows and disease progression that cover a patient’s entire life. It does not use real patient data, which guarantees a completely synthetic dataset. We generated 117,258 synthetic patients from Queensland.
Mental Health Synthetic Dataset
kaggle.com
zip
Updated Sep 29, 2024
Cite
Anwesha (2024). Mental Health Synthetic Dataset [Dataset]. https://www.kaggle.com/datasets/anweshaghosh123/mental-health-synthetic-dataset
Explore at:
Available download formats: zip (77135 bytes)
Dataset updated
Sep 29, 2024
Authors
Anwesha
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
The dataset is a synthetic mental health dataset designed for use in predictive analytics, machine learning models, and research purposes. The dataset contains simulated patient information related to mental health conditions, symptoms, therapies, and other factors affecting mental well-being. Given the sensitivity of real-world mental health data, synthetic datasets provide a safe alternative for research and development without risking the privacy of individuals.

This dataset aims to provide a foundation for developing mental health applications that predict conditions, suggest therapies, and assess factors like stress and mood levels. It's intended to enhance the understanding of patient conditions in clinical or research settings, supporting AI-driven therapeutic solutions.

The features in this dataset are inspired by real-world factors commonly considered in mental health diagnostics and treatment. For instance:

Symptoms: Reflects psychological or physical symptoms patients may report during clinical sessions.

Therapy History: Considers the impact of previous treatments on current conditions.

Mood and Stress Levels: Important mental health markers that help in evaluating a patient's state of well-being.

By using synthetic data, this dataset allows for the development and testing of AI models without the ethical concerns tied to real patient data. The dataset could be used for:

Predictive analytics in mental health apps.

Training chatbots or virtual assistants to provide real-time therapy recommendations.

Educational purposes, where students and researchers can explore mental health prediction models.
Synthetic Healthcare Dataset
kaggle.com
zip
Updated May 15, 2024
Cite
Divya bhavana (2024). Synthetic Healthcare Dataset [Dataset]. https://www.kaggle.com/datasets/divyabhavana/synthetic-healthcare-dataset
Explore at:
Available download formats: zip (20168 bytes)
Dataset updated
May 15, 2024
Authors
Divya bhavana
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
The "Synthetic Healthcare Dataset: Demographics, Conditions, Treatments, and Outcomes for Research and Analysis" is a fully synthetic collection of realistic but fictitious data that represents various aspects of healthcare. It contains patient demographics (age, gender, and region), the medical conditions diagnosed, the treatments administered, and the outcomes observed.

The dataset has been created to resemble actual healthcare situations and can be used for research and analysis in the healthcare field. Researchers, data scientists, and healthcare professionals can use this dataset to discover the patterns, trends, and correlations related to disease prevalence, treatment effectiveness, patient outcomes, and other aspects. Besides, it is a good source for creating and testing models designed to enhance healthcare decision-making and patient care.

Through the collection of a wide variety of data, including patient characteristics, medical conditions, treatments and outcomes, this synthetic dataset provides a multifaceted base for conducting numerous analyses and experiments in the area of healthcare analytics.

Columns:

Patient_ID: Unique identifier for each patient.

Age: Age of the patient.

Gender: Gender of the patient.

Medical_Condition: The medical condition the patient is diagnosed with.

Treatment: The treatment administered to the patient.

Outcome: The outcome of the treatment (e.g., Improved, Stable, Worsened).

Insurance_Type: Type of insurance the patient has (e.g., Private, Public, Medicare).

Income: Annual income of the patient.

Region: Geographic region where the patient is located.

Smoking_Status: Smoking status of the patient (e.g., Non-smoker, Former smoker, Current smoker).

Admission_Type: Type of admission to the hospital (e.g., Elective, Emergency, Urgent).

Hospital_ID: Unique identifier for the hospital where the patient was treated.

Length_of_Stay: Length of hospital stay in days.
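
As a minimal sketch of how the columns above might be used, the following Python snippet parses a few rows with the same column names and averages Length_of_Stay per Admission_Type. The sample rows are invented for illustration and are not taken from the dataset, and the helper name mean_stay_by_admission is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using the documented column names (not real dataset values).
SAMPLE_CSV = """Patient_ID,Age,Gender,Medical_Condition,Treatment,Outcome,Insurance_Type,Income,Region,Smoking_Status,Admission_Type,Hospital_ID,Length_of_Stay
1,54,Female,Diabetes,Medication,Improved,Private,52000,North,Non-smoker,Elective,H01,3
2,67,Male,Hypertension,Medication,Stable,Medicare,31000,South,Former smoker,Emergency,H02,5
3,45,Male,Asthma,Therapy,Improved,Public,40000,North,Current smoker,Emergency,H01,7
"""


def mean_stay_by_admission(csv_text):
    """Return the mean Length_of_Stay (in days) for each Admission_Type."""
    stays = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        stays[row["Admission_Type"]].append(float(row["Length_of_Stay"]))
    return {kind: sum(days) / len(days) for kind, days in stays.items()}


result = mean_stay_by_admission(SAMPLE_CSV)
print(result)
```

With the real file, the same function could be fed the file contents instead of SAMPLE_CSV, assuming the header row matches the column names listed above.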
Synthetic version of anonymized Norway Registry data containing...
search.dataone.org
dataverse.azure.uit.no
+2more
Updated Sep 25, 2024
Cite
Chauhan, Pavitra (2024). Synthetic version of anonymized Norway Registry data containing prescriptions and hospitalization of the patients [Dataset]. http://doi.org/10.18710/YABAGM
Explore at:
Unique identifier
https://doi.org/10.18710/YABAGM
Dataset updated
Sep 25, 2024
Dataset provided by
DataverseNO
Authors
Chauhan, Pavitra
Time period covered
Jan 1, 2011 - Jan 1, 2013
Description
This dataset represents synthetic data derived from anonymized Norwegian Registry Data of patients aged 65 and above from 2011 to 2013. It includes the Norwegian Patient Registry (NPR), which contains hospitalization details, and the Norwegian Prescription Database (NorPD), which contains prescription details. The NPR and NorPD datasets are combined into a single CSV file. The real dataset was part of a project to study medication use in the elderly and its association with hospitalization. The project has ethical approval from the Regional Committees for Medical and Health Research Ethics in Norway (REK-Nord number: 2014/2182). The dataset was anonymized to ensure that no record in the synthetic version could reasonably be identical to any real-life individual. The anonymization process was done as follows: first, only relevant information was kept from the original data set. Second, individuals' birth year and gender were replaced with randomly generated values within a plausible range of values. Last, all dates were replaced with randomly generated dates. This dataset was sufficiently scrambled to generate a synthetic dataset and was only used for the current study. The dataset has details related to Patient, Prescriber, Hospitalization, Diagnosis, Location, Medications, Prescriptions, and Prescriptions dispatched. A publication using this data to create a machine learning model for predicting hospitalization risk is under review.
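
The three anonymization steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the field names, the KEEP_FIELDS list, and the birth-year range (1911 to 1946, chosen so that patients are 65 or older during 2011 to 2013) are all hypothetical.

```python
import random
from datetime import date, timedelta

# Hypothetical list of fields considered relevant (step 1).
KEEP_FIELDS = ("birth_year", "gender", "diagnosis", "admission_date")

# Study period taken from the dataset's stated time coverage.
PERIOD_START = date(2011, 1, 1)
PERIOD_END = date(2013, 1, 1)


def scramble_record(record, rng):
    """Apply the three described steps to one record (illustrative only)."""
    # Step 1: keep only the relevant fields.
    scrambled = {key: record[key] for key in KEEP_FIELDS if key in record}
    # Step 2: replace birth year and gender with random plausible values.
    # The 1911-1946 range is an assumption, not a value from the source.
    scrambled["birth_year"] = rng.randint(1911, 1946)
    scrambled["gender"] = rng.choice(["F", "M"])
    # Step 3: replace dates with random dates inside the study period.
    span_days = (PERIOD_END - PERIOD_START).days
    scrambled["admission_date"] = PERIOD_START + timedelta(days=rng.randrange(span_days))
    return scrambled


original = {
    "name": "Example Person",
    "birth_year": 1940,
    "gender": "F",
    "diagnosis": "I10",
    "admission_date": date(2012, 3, 14),
}
scrambled = scramble_record(original, random.Random(0))
print(scrambled)
```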
Medical records of 30K Synthea synthetic patients
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Chen, AJ (2023). Medical records of 30K Synthea synthetic patients [Dataset]. http://doi.org/10.7910/DVN/BWDKXS
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/BWDKXS
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Chen, AJ
Description
The dataset has 2 populations of synthetic patients generated by the Synthea tool. Each population has 15K patients with original medical records in CSV files. Because the total file size is >3GB in each population, the files are compressed into a zip file. Synthea records are in domains similar to those in real EMR, including patients, encounters, conditions (diagnosis), observations, medications, and procedures. The data was first used in building ML models for lung cancer risk prediction. For more information, see the published paper in Nature Scientific Reports (https://www.nature.com/articles/s41598-022-23011-4).
Example (synthetic) electronic health record data
rdr.ucl.ac.uk
application/csv
Updated Apr 24, 2024
Cite
Steve Harris; Wai Shing Lai (2024). Example (synthetic) electronic health record data [Dataset]. http://doi.org/10.5522/04/25676298.v1
Explore at:
Available download formats: application/csv
Unique identifier
https://doi.org/10.5522/04/25676298.v1
Dataset updated
Apr 24, 2024
Dataset provided by
University College London
Authors
Steve Harris; Wai Shing Lai
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
These data are modelled using the OMOP Common Data Model v5.3.

Correlated Data Source

NG tube vocabularies

Generation Rules

The patient’s age should be between 18 and 100 at the moment of the visit.
Ethnicity data is using 2021 census data in England and Wales (Census in England and Wales 2021).
Gender is equally distributed between Male and Female (50% each).
Every person in the record has a link in procedure_occurrence with the concept “Checking the position of nasogastric tube using X-ray”.
2% of person records have a link in procedure_occurrence with the concept of “Plain chest X-ray”.
60% of visit_occurrence has visit concept “Inpatient Visit”, while 40% have “Emergency Room Visit”.

Notes

Version 0
Generated by man-made rule/story generator
Structural correct, all tables linked with the relationship
We used national ethnicity data to generate a realistic distribution (see below)

2011 Race Census figure in England and Wales

Ethnic Group: Population (%)
Asian or Asian British: Bangladeshi - 1.1
Asian or Asian British: Chinese - 0.7
Asian or Asian British: Indian - 3.1
Asian or Asian British: Pakistani - 2.7
Asian or Asian British: any other Asian background - 1.6
Black or African or Caribbean or Black British: African - 2.5
Black or African or Caribbean or Black British: Caribbean - 1
Black or African or Caribbean or Black British: other Black or African or Caribbean background - 0.5
Mixed multiple ethnic groups: White and Asian - 0.8
Mixed multiple ethnic groups: White and Black African - 0.4
Mixed multiple ethnic groups: White and Black Caribbean - 0.9
Mixed multiple ethnic groups: any other Mixed or multiple ethnic background - 0.8
White: English or Welsh or Scottish or Northern Irish or British - 74.4
White: Irish - 0.9
White: Gypsy or Irish Traveller - 0.1
White: any other White background - 6.4
Other ethnic group: any other ethnic group - 1.6
Other ethnic group: Arab - 0.6
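
The generation rules in this description can be sketched as a small rule-based generator. This is a hedged illustration, not the authors' generator: the field names are hypothetical, ethnicity sampling is omitted, and the percentages are applied independently per person, so the shares hold only approximately in any finite sample.

```python
import random

NG_TUBE_CHECK = "Checking the position of nasogastric tube using X-ray"
CHEST_XRAY = "Plain chest X-ray"


def generate_person(rng):
    """Generate one synthetic person following the stated rules (illustrative field names)."""
    procedures = [NG_TUBE_CHECK]  # every person has the NG tube position check
    if rng.random() < 0.02:       # about 2% also have a plain chest X-ray
        procedures.append(CHEST_XRAY)
    return {
        "age_at_visit": rng.randint(18, 100),      # inclusive bounds
        "gender": rng.choice(["Male", "Female"]),  # 50% each in expectation
        "visit_concept": (
            "Inpatient Visit" if rng.random() < 0.60 else "Emergency Room Visit"
        ),
        "procedures": procedures,
    }


rng = random.Random(42)
people = [generate_person(rng) for _ in range(1000)]
inpatient_share = sum(p["visit_concept"] == "Inpatient Visit" for p in people) / len(people)
print(f"inpatient share: {inpatient_share:.2f}")
```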

Lifestyle and Health Risk Prediction

kaggle.com

zip

Updated Oct 19, 2025

Cite

Arif Miah (2025). Lifestyle and Health Risk Prediction [Dataset]. https://www.kaggle.com/datasets/miadul/lifestyle-and-health-risk-prediction

Explore at:

Available download formats: zip (61139 bytes)

Dataset updated

Oct 19, 2025

Authors

Arif Miah

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

📘 Description:

This synthetic health dataset simulates real-world lifestyle and wellness data for individuals. It is designed to help data scientists, machine learning engineers, and students build and test health risk prediction models safely — without using sensitive medical data.

The dataset includes features such as age, weight, height, exercise habits, sleep hours, sugar intake, smoking, alcohol consumption, marital status, and profession, along with a synthetic health_risk label generated using a heuristic rule-based algorithm that mimics realistic risk behavior patterns.

🧾 Columns Description:

Column Name	Description	Type	Example
`age`	Age of the person (years)	Numeric	35
`weight`	Body weight in kilograms	Numeric	70
`height`	Height in centimeters	Numeric	172
`exercise`	Exercise frequency level	Categorical (`none`, `low`, `medium`, `high`)	`medium`
`sleep`	Average hours of sleep per night	Numeric	7
`sugar_intake`	Level of sugar consumption	Categorical (`low`, `medium`, `high`)	`high`
`smoking`	Smoking habit	Categorical (`yes`, `no`)	`no`
`alcohol`	Alcohol consumption habit	Categorical (`yes`, `no`)	`yes`
`married`	Marital status	Categorical (`yes`, `no`)	`yes`
`profession`	Type of work or profession	Categorical (`office_worker`, `teacher`, `doctor`, `engineer`, etc.)	`teacher`
`bmi`	Body Mass Index calculated as weight (kg) / height (m)²	Numeric	24.5
`health_risk`	Target label showing overall health risk	Categorical (`low`, `high`)	`high`
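
BMI is conventionally weight in kilograms divided by the square of height in metres, so with `height` stored in centimetres a unit conversion is needed. A minimal sketch, assuming the column units given in the table; the function name is hypothetical and this is not necessarily how the dataset author computed the column.

```python
def bmi_from_columns(weight_kg, height_cm):
    """Compute BMI from weight in kilograms and height in centimetres."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0  # convert centimetres to metres
    return weight_kg / (height_m ** 2)


# Example using the per-column example values from the table (70 kg, 172 cm).
example_bmi = bmi_from_columns(70, 172)
print(round(example_bmi, 1))
```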

🧩 Use Cases:

Health Risk Prediction: Train classification models (Logistic Regression, RandomForest, XGBoost, CatBoost) to predict health risk (low / high).
Feature Importance Analysis: Identify which lifestyle factors most influence health risk.
Data Preprocessing & EDA Practice: Use this dataset for data cleaning, encoding, and visualization practice.
Model Explainability Projects: Use SHAP or LIME to explain how different lifestyle habits affect predictions.
Streamlit or Flask Web App Development: Build a real-time web app that predicts health risk from user input.

💡 Case Study Example:

Imagine you are a data scientist building a Health Risk Prediction App for a wellness startup. You want to analyze how exercise, sleep, and sugar intake affect overall health risk. This dataset helps you simulate those relationships without handling sensitive medical data.

You could:

Perform EDA to find correlations between age, BMI, and health risk.
Train a model using Random Forest to predict health_risk.
Deploy a Streamlit app where users can input their lifestyle information and get a risk score instantly.

⚙️ Technical Information:

Rows: 5,000 (adjustable, you can create more)
Columns: 12
Target variable: health_risk
Data type: Mixed (Numeric + Categorical)
Source: Fully synthetic, generated using Python (NumPy, Faker)

📈 License:

CC0: Public Domain You are free to use this dataset for research, learning, or commercial projects.

🌍 Author:

Created by Arif Miah Machine Learning Engineer | Kaggle Expert | Data Scientist 📧 arifmiahcse@gmail.com

Synthetic Healthcare Admissions Dataset
kaggle.com
zip
Updated Sep 2, 2025
Cite
Yash (2025). Synthetic Healthcare Admissions Dataset [Dataset]. https://www.kaggle.com/datasets/yashdev01/synthetic-healthcare-admissions
Explore at:
Available download formats: zip (1581549 bytes)
Dataset updated
Sep 2, 2025
Authors
Yash
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
🏥 Synthetic Healthcare Admissions Dataset

The Synthetic Healthcare Admissions dataset is a synthetically generated healthcare dataset that mimics patient hospital admission records. It is designed to provide researchers, data scientists, and machine learning practitioners with realistic healthcare data while preserving patient privacy and avoiding exposure of sensitive information.

📂 Dataset Overview

Type: Tabular / Structured data

Domain: Healthcare, Electronic Health Records (EHR)

Content: Synthetic hospital admission records

Use Cases:

Predictive modeling of patient outcomes

Length of stay estimation

Readmission prediction

Resource allocation & optimization in healthcare

Experimentation with ML models without privacy risks

⚙️ Features (common fields included in admissions data)

Patient Demographics: Age, Gender, Ethnicity

Admission Details: Admission type, Admission date, Discharge date

Clinical Data: Diagnosis codes (ICD-like), Procedures, Comorbidities

Hospital Metrics: Length of stay, Department/Unit info

Synthetic Identifiers: Randomized patient IDs

✅ Why Synthetic?

Real healthcare data is heavily restricted due to HIPAA and GDPR compliance. This dataset provides a privacy-safe alternative, allowing open research while maintaining the structure and statistical properties of real hospital admissions data.

🔬 Applications

Benchmarking healthcare ML models

Developing explainable AI solutions in clinical settings

Testing NLP/ML pipelines for structured EHR data

Teaching and training purposes
Synthetic Synthea patient datasets for lung cancer risk prediction machine...
data.mendeley.com
Updated Oct 31, 2022
Cite
Anjun Chen (2022). Synthetic Synthea patient datasets for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.17632/b24cb4nn8h.1
Explore at:
Unique identifier
https://doi.org/10.17632/b24cb4nn8h.1
Dataset updated
Oct 31, 2022
Authors
Anjun Chen
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) (https://creativecommons.org/licenses/by-nc/3.0/)
License information was derived automatically
Description
These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction and simulation study of learning health systems.

In subfolder "unconverted": Five populations of 30K patients were generated by the Synthea patient generator. About 1100 lung cancer patients and 3000 control patients (without lung cancer) were selected and their electronic health records (EHR) were processed to data table files ready for machine learning using common algorithms like XGBoost.

In root directory: The five 30K-patient datasets were combined sequentially to form 5 different size datasets, from 30K to 150K patients. The new datasets were resampled to keep all lung cancer patients plus about 3x control patients. The ML-ready table files also had the continuous numeric values converted to categorical values.
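
The resampling and discretization steps described above can be sketched as follows. This is an illustration under assumptions, not the published pipeline: the record layout, the `lung_cancer` flag, the bin edges, and both function names are hypothetical.

```python
import bisect
import random


def resample_cases_controls(patients, ratio=3, seed=0):
    """Keep every case and a random sample of about `ratio` controls per case."""
    rng = random.Random(seed)
    cases = [p for p in patients if p["lung_cancer"]]
    controls = [p for p in patients if not p["lung_cancer"]]
    n_controls = min(len(controls), ratio * len(cases))
    return cases + rng.sample(controls, n_controls)


def to_category(value, edges, labels):
    """Map a continuous value to a label; `labels` has one more item than `edges`."""
    return labels[bisect.bisect_right(edges, value)]


# Hypothetical population: 10 cases among 100 patients.
population = [{"id": i, "lung_cancer": i < 10, "age": 30 + i % 50} for i in range(100)]
sample = resample_cases_controls(population)
age_bands = [to_category(p["age"], [40, 65], ["<40", "40-64", "65+"]) for p in sample]
print(len(sample), age_bands[:5])
```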

Because Synthea patients closely resemble real patients, the Synthea patient data can be used to develop and test ML algorithms and pipelines, and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns.

The first LHS simulation study titled "Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data" has been published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
10,000 Synthetic Medicare Patient Records
dataverse.harvard.edu
Updated Nov 4, 2019
Cite
Dylan Hall (2019). 10,000 Synthetic Medicare Patient Records [Dataset]. http://doi.org/10.7910/DVN/QDXLWR
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/QDXLWR
Dataset updated
Nov 4, 2019
Dataset provided by
Harvard Dataverse
Authors
Dylan Hall
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This dataset contains 10,000 synthetic patient records representing a scaled-down US Medicare population. The records were generated by Synthea ( https://github.com/synthetichealth/synthea ) and are completely synthetic and contain no real patient data. This data is presented free of cost and free of restrictions. Each record is stored as one file in HL7 FHIR R4 ( https://www.hl7.org/fhir/ ) containing one Bundle, in JSON. For more information on how this specific population was created, or to generate your own at any scale, see: https://github.com/synthetichealth/populations/tree/master/medicare
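
Since each record is one JSON file holding a single FHIR Bundle, a first look at a file might simply count the resource types it contains. A minimal sketch using only the standard FHIR Bundle layout (`entry[].resource.resourceType`); the tiny sample Bundle below is invented for illustration and the function name is hypothetical.

```python
import json
from collections import Counter

# Invented miniature Bundle; real Synthea files contain many more entries.
SAMPLE_BUNDLE = json.dumps({
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "example-1"}},
        {"resource": {"resourceType": "Encounter", "id": "example-2"}},
        {"resource": {"resourceType": "Encounter", "id": "example-3"}},
    ],
})


def count_resource_types(bundle_json):
    """Count resources by type in one FHIR Bundle given as a JSON string."""
    bundle = json.loads(bundle_json)
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a FHIR Bundle")
    return Counter(entry["resource"]["resourceType"] for entry in bundle.get("entry", []))


counts = count_resource_types(SAMPLE_BUNDLE)
print(counts)
```

For a downloaded record, the JSON string would come from reading the file, for example with `pathlib.Path(path).read_text()`.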
Synthea Generated Synthetic Data in FHIR
console.cloud.google.com
Updated Jul 27, 2023
Cite
The MITRE Corporation (2023). Synthea Generated Synthetic Data in FHIR [Dataset]. https://console.cloud.google.com/marketplace/product/mitre/synthea-fhir?hl=fr
Explore at:
Dataset updated
Jul 27, 2023
Dataset authored and provided by
The MITRE Corporation (https://www.mitre.org/)
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
The Synthea Generated Synthetic Data in FHIR dataset hosts over 1 million synthetic patient records generated using Synthea in FHIR format. The records were exported from the Google Cloud Healthcare API FHIR Store into BigQuery using the analytics schema. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing: each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. The dataset is also available, free to use, in Google Cloud Storage. The URL for the GCS bucket is gs://gcp-public-data--synthea-fhir-data-1m-patients. Please cite Synthea™ as: Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079
Synthetic Suicide Prevention Dataset with SDoH
data.va.gov
datahub.va.gov
+3more
csv, xlsx, xml
Updated Feb 18, 2021
Cite
VHA (2021). Synthetic Suicide Prevention Dataset with SDoH [Dataset]. https://www.data.va.gov/dataset/Synthetic-Suicide-Prevention-Dataset-with-SDoH/h5zp-pekf
Explore at:
Available download formats: csv, xlsx, xml
Dataset updated
Feb 18, 2021
Dataset authored and provided by
VHA
Description
The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.
The Health Gym v2.0 Synthetic Antiretroviral Therapy (ART) for HIV Dataset
figshare.com
txt
Updated May 31, 2023
Cite
Nicholas Kuo (2023). The Health Gym v2.0 Synthetic Antiretroviral Therapy (ART) for HIV Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.22827878.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.22827878.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nicholas Kuo
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This synthetic dataset, centred on ART for HIV, was synthesised employing the model outlined in reference [1], incorporating the techniques of WGAN-GP+G_EOT+VAE+Buffer.

This dataset serves as a principal resource for the Centre for Big Data Research in Health (CBDRH) Datathon (see: CBDRH Health Data Science Datathon 2023 (cbdrh-hds-datathon-2023.github.io)). Its primary purpose is to advance the Health Data Analytics (HDAT) courses at the University of New South Wales (UNSW), providing students with exposure to synthetic yet realistic datasets that simulate real-world data.

The dataset is composed of 534,960 records, distributed over 15 distinct columns, and is preserved in a CSV format with a size of 39.1 MB. It contains information about 8,916 synthetic patients over a period of 60 months, with data summarised on a monthly basis. The total number of records corresponds to the product of the synthetic patient count and the record duration in months, thus equating to 8,916 multiplied by 60.
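
The record count quoted above follows from one row per patient per month. A quick check of that arithmetic, with variable names chosen for illustration:

```python
# Figures taken from the dataset description.
n_patients = 8_916
n_months = 60
reported_records = 534_960

# One record per synthetic patient per monthly time point.
expected_records = n_patients * n_months
print(expected_records, expected_records == reported_records)
```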

The dataset's structure encompasses 15 columns, which include 13 variables pertinent to ART for HIV as delineated in reference [1], a unique patient identifier, and a further variable signifying the specific time point.

This dataset forms part of a continuous series of work, building upon reference [2]. For further details, kindly refer to our papers:

[1] Kuo, Nicholas I., Louisa Jorm, and Sebastiano Barbieri. "Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV." arXiv preprint arXiv:2208.08655 (2022).

[2] Kuo, Nicholas I-Hsien, et al. "The Health Gym: synthetic health-related datasets for the development of reinforcement learning algorithms." Scientific Data 9.1 (2022): 693.

Latest edit: 16th May 2023.
Synthetic Data In Healthcare Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 1, 2025
Cite
Dataintelo (2025). Synthetic Data In Healthcare Market Research Report 2033 [Dataset]. https://dataintelo.com/report/synthetic-data-in-healthcare-market
Explore at:
Available download formats: csv, pdf, pptx
Dataset updated
Oct 1, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Synthetic Data in Healthcare Market Outlook

According to our latest research, the global synthetic data in healthcare market size reached USD 457.8 million in 2024 and is expected to grow at a robust CAGR of 34.2% during the forecast period, reaching USD 5.68 billion by 2033. This remarkable growth is driven by the escalating demand for advanced data solutions that address privacy concerns, enable improved AI model training, and facilitate seamless data sharing across the healthcare ecosystem. The increasing adoption of digital health technologies, stringent data privacy regulations, and rising investments in artificial intelligence are among the key factors fueling the expansion of the synthetic data in healthcare market.
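
A forecast of this kind compounds a base value at a constant annual growth rate. The sketch below shows that arithmetic in both directions. The number of compounding periods (nine, for 2024 to 2033) is an assumption, since the report does not state its base-year convention, so the computed values need not match the quoted figures exactly. Function names are hypothetical.

```python
def project(value, rate, years):
    """Compound `value` at annual growth `rate` for `years` periods."""
    return value * (1.0 + rate) ** years


def implied_cagr(start, end, years):
    """Return the constant annual growth rate that turns `start` into `end`."""
    return (end / start) ** (1.0 / years) - 1.0


# Figures quoted in the report, in USD million; nine periods is an assumption.
base_2024 = 457.8
target_2033 = 5680.0
periods = 9

print(f"projected at 34.2%: {project(base_2024, 0.342, periods):.1f} million")
print(f"implied CAGR: {implied_cagr(base_2024, target_2033, periods):.1%}")
```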

One of the primary growth factors for the synthetic data in healthcare market is the growing need for privacy-preserving data solutions. As healthcare organizations grapple with stringent regulations such as HIPAA and GDPR, the use of real patient data for research, analytics, and AI model training has become increasingly challenging. Synthetic data, which mimics real-world patient information without exposing sensitive personal details, has emerged as a viable alternative. This approach not only ensures compliance with regulatory requirements but also mitigates the risks associated with data breaches and unauthorized access. The ability to generate diverse, high-quality synthetic datasets is empowering healthcare providers, payers, and researchers to drive innovation while maintaining patient confidentiality.

Another significant driver is the rapid advancement of artificial intelligence and machine learning applications within the healthcare sector. AI models require vast and varied datasets to achieve high accuracy and reliability, especially in complex domains such as medical imaging, drug discovery, and predictive analytics. However, access to comprehensive and representative real-world data is often limited by privacy constraints and data silos. Synthetic data bridges this gap by providing scalable, customizable, and bias-free datasets that enhance the performance of AI algorithms. This not only accelerates the development and deployment of AI-driven healthcare solutions but also fosters collaboration among stakeholders by enabling secure data sharing and benchmarking.

The synthetic data in healthcare market is further propelled by the increasing adoption of digital transformation initiatives across the industry. Hospitals, pharmaceutical companies, research institutions, and contract research organizations (CROs) are leveraging synthetic data to streamline clinical trials, improve patient data management, and optimize resource allocation. The integration of synthetic data into electronic health records (EHRs), telemedicine platforms, and health information exchanges is facilitating seamless interoperability and data-driven decision-making. Moreover, the growing emphasis on value-based care, population health management, and personalized medicine is creating new opportunities for synthetic data solutions to enhance healthcare delivery and outcomes.

From a regional perspective, North America continues to dominate the synthetic data in healthcare market, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of advanced healthcare infrastructure, a strong focus on innovation, and proactive regulatory frameworks that support digital health adoption. Europe follows closely, driven by increasing investments in healthcare IT, a collaborative research environment, and robust data protection regulations. The Asia Pacific region is emerging as a high-growth market, fueled by expanding healthcare access, rising government initiatives, and the proliferation of digital health technologies. Latin America and the Middle East & Africa are also witnessing steady growth, supported by improving healthcare infrastructure and growing awareness of the benefits of synthetic data.

Component Analysis

The synthetic data in healthcare market is segmented by component into software and services, each playing a pivotal role in the industry’s ecosystem. The software segment encompasses a wide range of solutions designed to generate, manage, and validate synthetic datasets for various healthcare applications. These software platforms leverage advanced algorithms, machine learning techniques, and data modeling tools to create high-fidelity synthetic data that mimics real-world patient
synthetic-health-dataset
huggingface.co
Cite
Harshita Chhaparia, synthetic-health-dataset [Dataset]. https://huggingface.co/datasets/ieruygfvkihesugdfvx/synthetic-health-dataset
Authors
Harshita Chhaparia
Description
ieruygfvkihesugdfvx/synthetic-health-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Synthetic Dataset of Emergency Healthcare Services
figshare.com
csv
Updated Dec 12, 2024
Cite
Marco Ferreira (2024). Synthetic Dataset of Emergency Healthcare Services [Dataset]. http://doi.org/10.6084/m9.figshare.28012784.v1
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.28012784.v1
Dataset updated
Dec 12, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Marco Ferreira
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was generated using Simio simulation software. The simulations model patient flow in healthcare settings, capturing key metrics such as queue times, length of stay (LOS) for patients, and nurse utilization rates. Each CSV file contains time-series data, with measured variables including patient waiting times, resource utilization percentages, and service durations.

## File Overview

**CheckBloodPressure.csv** - (9 KB): Contains blood pressure Server records of patients.
**CheckPatientType.csv** - (19 KB): Identifies the type of each patient (e.g., 1 or 3).
**Fill_Information.csv** - (2 KB): Fill information records for new patients.
**MedicalRecord1.csv** - (10 KB): Medical record dataset for patient type 1.
**MedicalRecord2.csv** - (4 KB): Medical record dataset for patient type 2.
**MedicalRecord3.csv** - (2 KB): Medical record dataset for patient type 3.
**MedicalRecord4.csv** - (13 KB): Medical record dataset for patient type 4.
**OutPatientDepartment.csv** - (18 KB): Data related to the satisfaction and length of stay of a given patient.
**Triage.csv** - (13 KB): Data related to the triage process.
**README.txt** - (4 KB): Documentation of the dataset, including structure, metadata, and usage.

## Common Fields Across Files

**Patient ID** (Integer): Unique identifier for each patient.
**Patient Type** (Integer): Classification of patient (e.g., 1, 4).
**Medical Records Arrival Time** (DateTime): Timestamp of the patient's first arrival in the medical record department.
**Exiting Time** (DateTime): Timestamp when the patient exits a Server.
**Waiting Time (min)** (Real): Total waiting time before being attended to.
**Resource Used** (String): Resource (e.g., Operator) allocated to the patient.
**Utilization %** (Real): Utilization rate of the resource as a percentage.
**Queue Count Before Processing** (Integer): Number of patients in the queue before processing begins.
**Queue Count After Processing** (Integer): Number of patients in the queue after processing ends.
**Queue Difference** (Integer): Difference between the before and after queue counts.
**Length of Stay (min)** (Real): Total time spent in the simulation by the patient.
**LOS without Queues (min)** (Real): Length of stay excluding any queuing time.
**Satisfaction %** (Real): Patient satisfaction rating based on their experience.
**New Patient?** (String): Indicates if this is a new patient or a returning one.
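
The field list above can be read with a few lines of Python. The following is a minimal sketch, not the dataset author's tooling: it assumes the CSV header row uses the field names exactly as listed (the description does not confirm spelling or delimiter), and the sample rows are invented purely for illustration. The helper name `mean_wait_by_type` is hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: the column names come from the dataset
# description, but every value below is invented for illustration.
SAMPLE_CSV = """Patient ID,Patient Type,Waiting Time (min),Length of Stay (min)
1,1,4.5,30.0
2,3,0.0,12.0
3,1,7.5,40.0
"""


def mean_wait_by_type(csv_file):
    """Return {patient type: mean waiting time in minutes} for one CSV file."""
    waits = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Group each row's waiting time under its integer patient type.
        waits[int(row["Patient Type"])].append(float(row["Waiting Time (min)"]))
    return {ptype: mean(values) for ptype, values in waits.items()}


result = mean_wait_by_type(io.StringIO(SAMPLE_CSV))
print(result)
```

To try this on a downloaded file such as Triage.csv, pass an open file handle (`open("Triage.csv", newline="")`) instead of the in-memory sample, and adjust the column names if the real header differs.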
Synthetic Dataset of Emergency Healthcare Services
datarepositorium.uminho.pt
data-staging.niaid.nih.gov
+1more
csv, txt
Updated Jan 17, 2025
+ more versions
Cite
Repositório de Dados da Universidade do Minho (2025). Synthetic Dataset of Emergency Healthcare Services [Dataset]. http://doi.org/10.34622/datarepositorium/AKSZQG
Available download formats: csv (1259), txt (4064)
Unique identifier
https://doi.org/10.34622/datarepositorium/AKSZQG
Dataset updated
Jan 17, 2025
Dataset provided by
Repositório de Dados da Universidade do Minho
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Synthetic dataset of emergency services comprising several CSV files that we generated using simulation software. This dataset is open for public use; please cite our work if used in research or applications.

File Overview

CheckBloodPressure.csv - (9 KB): Contains blood pressure Server records of patients.
CheckPatientType.csv - (19 KB): Identifies the type of each patient (e.g., 1 or 3).
Fill_Information.csv - (2 KB): Fill information records for new patients.
MedicalRecord1.csv - (10 KB): Medical record dataset for patient type 1.
MedicalRecord2.csv - (4 KB): Medical record dataset for patient type 2.
MedicalRecord3.csv - (2 KB): Medical record dataset for patient type 3.
MedicalRecord4.csv - (13 KB): Medical record dataset for patient type 4.
OutPatientDepartment.csv - (18 KB): Data related to the satisfaction and length of stay of a given patient.
Triage.csv - (13 KB): Data related to the triage process.
README.txt - (4 KB): Documentation of the dataset, including structure, metadata, and usage.

Common Fields Across Files

Patient ID (Integer): Unique identifier for each patient.
Patient Type (Integer): Classification of patient (e.g., 1, 4).
Medical Records Arrival Time (DateTime): Timestamp of the patient's first arrival in the medical record department.
Exiting Time (DateTime): Timestamp when the patient exits a Server.
Waiting Time (min) (Real): Total waiting time before being attended to.
Resource Used (String): Resource (e.g., Operator) allocated to the patient.
Utilization % (Real): Utilization rate of the resource as a percentage.
Queue Count Before Processing (Integer): Number of patients in the queue before processing begins.
Queue Count After Processing (Integer): Number of patients in the queue after processing ends.
Queue Difference (Integer): Difference between the before and after queue counts.
Length of Stay (min) (Real): Total time spent in the simulation by the patient.
LOS without Queues (min) (Real): Length of stay excluding any queuing time.
Satisfaction % (Real): Patient satisfaction rating based on their experience.
New Patient? (String): Indicates if this is a new patient or a returning one.
Healthcare Synthetic-Data Governance Services Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 29, 2025
Cite
Growth Market Reports (2025). Healthcare Synthetic-Data Governance Services Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/healthcare-synthetic-data-governance-services-market
Available download formats: pdf, pptx, csv
Dataset updated
Aug 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Healthcare Synthetic-Data Governance Services Market Outlook

As per our latest research, the global healthcare synthetic-data governance services market size reached USD 1.14 billion in 2024, reflecting robust momentum in the adoption of synthetic data solutions across the healthcare sector. The industry is expanding at a CAGR of 29.3% and is forecast to reach USD 8.71 billion by 2033. This growth is driven primarily by increasing demand for privacy-preserving data solutions, escalating regulatory pressure, and the need for high-quality data to fuel advanced healthcare analytics and artificial intelligence (AI) applications.
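
For readers who want to relate the quoted growth rate to the two endpoint figures, a compound annual growth rate can be checked with a short calculation. This is only a sketch using the standard CAGR formula; the excerpt does not state which year compounding starts from, so the period counts tried below (8 and 9 years) are assumptions, and the function names are illustrative.

```python
def implied_cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


def project(start_value, rate, periods):
    """Value reached after compounding `rate` once per period."""
    return start_value * (1.0 + rate) ** periods


START_USD_BN = 1.14   # 2024 market size quoted in the report excerpt
END_USD_BN = 8.71     # 2033 forecast quoted in the report excerpt
QUOTED_CAGR = 0.293   # 29.3% rate quoted in the report excerpt

# Assumption: 9 periods treats 2024 as the base year; 8 periods treats
# 2025 as the first forecast year. The excerpt does not say which applies.
for years in (8, 9):
    rate = implied_cagr(START_USD_BN, END_USD_BN, years)
    value = project(START_USD_BN, QUOTED_CAGR, years)
    print(f"{years} periods: implied CAGR {rate:.1%}, "
          f"projection at quoted rate USD {value:.2f} billion")
```

Comparing the implied rate with the quoted one under each assumption shows how sensitive such forecasts are to the choice of base year.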

The healthcare synthetic-data governance services market is experiencing exponential growth due to the growing emphasis on data privacy and security in healthcare environments. As healthcare organizations increasingly integrate digital technologies and electronic health records (EHRs), there is a concurrent rise in concerns around patient data confidentiality and compliance with global data protection regulations such as HIPAA, GDPR, and others. Synthetic data, which mimics real patient data without exposing sensitive information, is becoming a preferred solution for training AI models, conducting clinical research, and enabling data sharing across organizations. The market is further propelled by the rising adoption of AI and machine learning in healthcare, which necessitates vast, high-quality datasets that can be safely used without breaching patient privacy. This has led to a surge in demand for robust governance frameworks and services that ensure the ethical and compliant use of synthetic data throughout its lifecycle.

Another significant growth factor is the increasing complexity and volume of healthcare data, which is making traditional data anonymization techniques less effective. As healthcare providers, pharmaceutical companies, and research institutes seek to leverage big data analytics and advanced modeling, they are turning to synthetic data to overcome data scarcity and bias issues. Synthetic-data governance services play a crucial role in standardizing processes, ensuring data quality, and maintaining regulatory compliance while facilitating seamless data sharing and collaboration. The market is also witnessing an upsurge in partnerships between healthcare organizations and technology vendors, aiming to co-develop tailored governance solutions that address specific clinical, operational, and research needs. This collaborative ecosystem is fostering innovation and accelerating the deployment of synthetic-data governance frameworks globally.

Furthermore, the healthcare synthetic-data governance services market is benefiting from increased investments by both public and private sectors in digital health infrastructure. Governments and regulatory bodies are actively supporting initiatives that promote data-driven healthcare innovation while safeguarding patient rights. The proliferation of cloud computing and the emergence of interoperable health information systems are making it easier for organizations to implement synthetic-data governance solutions at scale. Additionally, the COVID-19 pandemic has highlighted the critical need for secure, accessible, and compliant data management practices, further intensifying demand for synthetic-data governance services. These factors collectively position the market for sustained long-term growth.

Synthetic Health Data is revolutionizing the way healthcare organizations approach data privacy and security. By creating realistic but fictional datasets, synthetic health data allows researchers and developers to work with information that mirrors real patient data without exposing sensitive details. This approach not only enhances privacy but also provides a valuable resource for testing new healthcare technologies and methodologies. As the demand for synthetic health data grows, it is becoming an integral part of the healthcare data ecosystem, supporting innovation while ensuring compliance with stringent data protection regulations.

Regionally, North America continues to dominate the healthcare synthetic-data governance services market, owing to its advanced healthcare IT ecosystem, strong regulatory frameworks, and high adoption of AI-driven healthcare solutions. Europe follows closely, with stringent
Synthetic-Medical-Speech-Dataset
huggingface.co
Updated Oct 23, 2025
Cite
Hani. M (2025). Synthetic-Medical-Speech-Dataset [Dataset]. https://huggingface.co/datasets/Hani89/Synthetic-Medical-Speech-Dataset
Dataset updated
Oct 23, 2025
Authors
Hani. M
License
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Synthetic Medical Speech Dataset

Overview

Synthetic Medical Speech Dataset is a synthetic dataset of audio–text pairs designed for developing and evaluating automatic speech recognition (ASR) models in the medical domain. The corpus contains thousands of short audio clips generated from medically relevant text using a text-to-speech (TTS) system. Each clip is paired with its corresponding transcript. Because all content is synthetically produced, the dataset does not contain… See the full description on the dataset page: https://huggingface.co/datasets/Hani89/Synthetic-Medical-Speech-Dataset.

Synthetic Healthcare Database for Research (SyH-DR)

9 scholarly articles cite this dataset (View in Google Scholar)
