100+ datasets found
  1. Synthetic Healthcare Database for Research (SyH-DR)

    • catalog.data.gov
    • healthdata.gov
    +2 more
    Updated Sep 16, 2023
    + more versions
    Cite
    Agency for Healthcare Research and Quality (2023). Synthetic Healthcare Database for Research (SyH-DR) [Dataset]. https://catalog.data.gov/dataset/synthetic-healthcare-database-for-research-syh-dr
    Explore at:
    Dataset updated
    Sep 16, 2023
    Dataset provided by
    Agency for Healthcare Research and Quality (http://www.ahrq.gov/)
    Description

    The Agency for Healthcare Research and Quality (AHRQ) created SyH-DR from eligibility and claims files for Medicare, Medicaid, and commercial insurance plans in calendar year 2016. SyH-DR contains data from a nationally representative sample of insured individuals for the 2016 calendar year. SyH-DR uses synthetic data elements at the claim level to resemble the marginal distribution of the original data elements. SyH-DR person-level data elements are not synthetic, but identifying information is aggregated or masked.

  2. Australian synthetic healthcare data with Synthea

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Jul 4, 2024
    Cite
    John Grimes; Michael Lawley; Roc Reguant Comellas; Sankalp Khanna; David Hansen; Denis Bauer; Parnesh Raniga; Hoa Ngo; Donna Truran; Hamed Hassanzadeh; Mitchell O'Brien; Ibrahima Diouf (2024). Australian synthetic healthcare data with Synthea [Dataset]. http://doi.org/10.25919/EFCW-BM49
    Explore at:
    Available download formats: datadownload
    Dataset updated
    Jul 4, 2024
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    John Grimes; Michael Lawley; Roc Reguant Comellas; Sankalp Khanna; David Hansen; Denis Bauer; Parnesh Raniga; Hoa Ngo; Donna Truran; Hamed Hassanzadeh; Mitchell O'Brien; Ibrahima Diouf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Australia
    Description

    We developed an Australianised version of Synthea. Synthea is synthetic data generation software that uses publicly available aggregate population statistics such as demographics, disease prevalence and incidence rates, and health reports. Synthea generates data from manually curated models of clinical workflows and disease progression that cover a patient’s entire life. It does not use real patient data, which guarantees a completely synthetic dataset. We generated 117,258 synthetic patients from Queensland.

  3. Mental Health Synthetic Dataset

    • kaggle.com
    zip
    Updated Sep 29, 2024
    Cite
    Anwesha (2024). Mental Health Synthetic Dataset [Dataset]. https://www.kaggle.com/datasets/anweshaghosh123/mental-health-synthetic-dataset
    Explore at:
    Available download formats: zip (77135 bytes)
    Dataset updated
    Sep 29, 2024
    Authors
    Anwesha
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The dataset is a synthetic mental health dataset designed for use in predictive analytics, machine learning models, and research purposes. The dataset contains simulated patient information related to mental health conditions, symptoms, therapies, and other factors affecting mental well-being. Given the sensitivity of real-world mental health data, synthetic datasets provide a safe alternative for research and development without risking the privacy of individuals.

    This dataset aims to provide a foundation for developing mental health applications that predict conditions, suggest therapies, and assess factors like stress and mood levels. It's intended to enhance the understanding of patient conditions in clinical or research settings, supporting AI-driven therapeutic solutions.

    The features in this dataset are inspired by real-world factors commonly considered in mental health diagnostics and treatment. For instance:

    Symptoms: Reflects psychological or physical symptoms patients may report during clinical sessions.

    Therapy History: Considers the impact of previous treatments on current conditions.

    Mood and Stress Levels: Important mental health markers that help in evaluating a patient's state of well-being.

    By using synthetic data, this dataset allows for the development and testing of AI models without the ethical concerns tied to real patient data. The dataset could be used for:

    • Predictive analytics in mental health apps.
    • Training chatbots or virtual assistants to provide real-time therapy recommendations.
    • Educational purposes, where students and researchers can explore mental health prediction models.
  4. Synthetic Healthcare Dataset

    • kaggle.com
    zip
    Updated May 15, 2024
    Cite
    Divya bhavana (2024). Synthetic Healthcare Dataset [Dataset]. https://www.kaggle.com/datasets/divyabhavana/synthetic-healthcare-dataset
    Explore at:
    Available download formats: zip (20168 bytes)
    Dataset updated
    May 15, 2024
    Authors
    Divya bhavana
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The "Synthetic Healthcare Dataset: Demographics, Conditions, Treatments, and Outcomes for Research and Analysis" is a complete synthesis of realistic but fictitious data that represents various aspects of healthcare. The dataset contains patient demographics (age, gender, and region), the medical conditions diagnosed, the treatments administered, and the outcomes observed.

    The dataset has been created to resemble actual healthcare situations and can be used for research and analysis in the healthcare field. Researchers, data scientists, and healthcare professionals can use it to discover patterns, trends, and correlations related to disease prevalence, treatment effectiveness, patient outcomes, and other aspects. It is also a good source for creating and testing models designed to enhance healthcare decision-making and patient care.

    Through the collection of a wide variety of data, including patient characteristics, medical conditions, treatments and outcomes, this synthetic dataset provides a multifaceted base for conducting numerous analyses and experiments in the area of healthcare analytics.

    Columns:

    Patient_ID: Unique identifier for each patient.

    Age: Age of the patient.

    Gender: Gender of the patient.

    Medical_Condition: The medical condition the patient is diagnosed with.

    Treatment: The treatment administered to the patient.

    Outcome: The outcome of the treatment (e.g., Improved, Stable, Worsened).

    Insurance_Type: Type of insurance the patient has (e.g., Private, Public, Medicare).

    Income: Annual income of the patient.

    Region: Geographic region where the patient is located.

    Smoking_Status: Smoking status of the patient (e.g., Non-smoker, Former smoker, Current smoker).

    Admission_Type: Type of admission to the hospital (e.g., Elective, Emergency, Urgent).

    Hospital_ID: Unique identifier for the hospital where the patient was treated.

    Length_of_Stay: Length of hospital stay in days.
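
    As a minimal sketch of how the columns above might be used, the snippet below parses a few made-up rows with Python's standard csv module and averages Length_of_Stay per Admission_Type. Only the column names come from the listing; the sample rows and the helper name mean_stay_by_admission are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; only the header names come from the dataset description.
SAMPLE_CSV = """Patient_ID,Age,Gender,Medical_Condition,Treatment,Outcome,Insurance_Type,Income,Region,Smoking_Status,Admission_Type,Hospital_ID,Length_of_Stay
1,54,Female,Diabetes,Medication,Improved,Private,52000,North,Non-smoker,Elective,10,3
2,71,Male,Hypertension,Medication,Stable,Medicare,31000,South,Former smoker,Emergency,11,6
3,39,Female,Asthma,Therapy,Improved,Public,44000,North,Current smoker,Emergency,10,2
"""


def mean_stay_by_admission(csv_text):
    """Return {Admission_Type: mean Length_of_Stay in days}."""
    totals = defaultdict(lambda: [0, 0])  # admission type -> [sum of days, row count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        bucket = totals[row["Admission_Type"]]
        bucket[0] += int(row["Length_of_Stay"])
        bucket[1] += 1
    return {kind: days / count for kind, (days, count) in totals.items()}


result = mean_stay_by_admission(SAMPLE_CSV)
print(result)
```

    With the real download, the in-memory string would be replaced by the opened CSV file; its exact file name is not stated in the listing.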

  5. Synthetic version of anonymized Norway Registry data containing...

    • search.dataone.org
    • dataverse.azure.uit.no
    +2 more
    Updated Sep 25, 2024
    Cite
    Chauhan, Pavitra (2024). Synthetic version of anonymized Norway Registry data containing prescriptions and hospitalization of the patients [Dataset]. http://doi.org/10.18710/YABAGM
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    DataverseNO
    Authors
    Chauhan, Pavitra
    Time period covered
    Jan 1, 2011 - Jan 1, 2013
    Description

    This dataset represents synthetic data derived from anonymized Norwegian Registry Data of patients aged 65 and above from 2011 to 2013. It includes the Norwegian Patient Registry (NPR), which contains hospitalization details, and the Norwegian Prescription Database (NorPD), which contains prescription details. The NPR and NorPD datasets are combined into a single CSV file. This real dataset was part of a project to study medication use in the elderly and its association with hospitalization. The project has ethical approval from the Regional Committees for Medical and Health Research Ethics in Norway (REK-Nord number: 2014/2182). The dataset was anonymized to ensure that the synthetic version could not reasonably be identical to any real-life individuals. The anonymization process was done as follows: first, only relevant information was kept from the original data set. Second, individuals' birth year and gender were replaced with randomly generated values within a plausible range of values. And last, all dates were replaced with randomly generated dates. This dataset was sufficiently scrambled to generate a synthetic dataset and was only used for the current study. The dataset has details related to Patient, Prescriber, Hospitalization, Diagnosis, Location, Medications, Prescriptions, and Prescriptions dispatched. A publication using this data to create a machine learning model for predicting hospitalization risk is under review.
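
    The three anonymization steps described above (keep relevant fields, randomize birth year and gender, randomize dates) can be sketched roughly as follows. This is not the authors' code: the field names, the value ranges, and the function scramble_record are hypothetical and chosen only for illustration.

```python
import random
from datetime import date, timedelta

# Hypothetical field names and ranges; the dataset description does not specify them.
RELEVANT_FIELDS = ("birth_year", "gender", "admission_date", "diagnosis")
STUDY_START = date(2011, 1, 1)
STUDY_END = date(2013, 1, 1)


def random_date(rng, start, end):
    """Pick a uniformly random date in [start, end]."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


def scramble_record(record, rng):
    """Apply the three described steps to one record (a dict)."""
    # Step 1: keep only the relevant fields.
    kept = {key: record[key] for key in RELEVANT_FIELDS}
    # Step 2: replace birth year and gender with random values in an
    # assumed plausible range (patients aged 65 and above in 2011).
    kept["birth_year"] = rng.randint(1915, 1946)
    kept["gender"] = rng.choice(["F", "M"])
    # Step 3: replace the date with a random date in the study window.
    kept["admission_date"] = random_date(rng, STUDY_START, STUDY_END)
    return kept


rng = random.Random(0)  # seeded so the sketch is repeatable
original = {"name": "example", "birth_year": 1930, "gender": "F",
            "admission_date": date(2012, 5, 17), "diagnosis": "I10"}
scrambled = scramble_record(original, rng)
print(scrambled)
```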

  6. Medical records of 30K Synthea synthetic patients

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Medical records of 30K Synthea synthetic patients [Dataset]. http://doi.org/10.7910/DVN/BWDKXS
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    The dataset has 2 populations of synthetic patients generated by the Synthea tool. Each population has 15K patients with original medical records in CSV files. Because the total file size is >3GB in each population, the files are compressed into a zip file. Synthea records are in domains similar to those in real EMRs, including patients, encounters, conditions (diagnoses), observations, medications, and procedures. The data was first used in building ML models for lung cancer risk prediction. For more information, see the published paper in Nature Scientific Reports (https://www.nature.com/articles/s41598-022-23011-4).

  7. Example (synthetic) electronic health record data

    • rdr.ucl.ac.uk
    application/csv
    Updated Apr 24, 2024
    Cite
    Steve Harris; Wai Shing Lai (2024). Example (synthetic) electronic health record data [Dataset]. http://doi.org/10.5522/04/25676298.v1
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Apr 24, 2024
    Dataset provided by
    University College London
    Authors
    Steve Harris; Wai Shing Lai
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These data are modelled using the OMOP Common Data Model v5.3.

    Correlated Data Source

    • NG tube vocabularies

    Generation Rules

    • The patient’s age should be between 18 and 100 at the moment of the visit.
    • Ethnicity data uses 2021 census data in England and Wales (Census in England and Wales 2021).
    • Gender is equally distributed between Male and Female (50% each).
    • Every person in the record has a link in procedure_occurrence with the concept “Checking the position of nasogastric tube using X-ray”.
    • 2% of person records have a link in procedure_occurrence with the concept of “Plain chest X-ray”.
    • 60% of visit_occurrence has visit concept “Inpatient Visit”, while 40% have “Emergency Room Visit”.

    Notes

    • Version 0
    • Generated by man-made rule/story generator
    • Structurally correct, all tables linked with the relationship
    • We used national ethnicity data to generate a realistic distribution (see below)

    2011 Race Census figure in England and Wales

    Ethnic Group : Population (%)
    • Asian or Asian British: Bangladeshi - 1.1
    • Asian or Asian British: Chinese - 0.7
    • Asian or Asian British: Indian - 3.1
    • Asian or Asian British: Pakistani - 2.7
    • Asian or Asian British: any other Asian background - 1.6
    • Black or African or Caribbean or Black British: African - 2.5
    • Black or African or Caribbean or Black British: Caribbean - 1
    • Black or African or Caribbean or Black British: other Black or African or Caribbean background - 0.5
    • Mixed multiple ethnic groups: White and Asian - 0.8
    • Mixed multiple ethnic groups: White and Black African - 0.4
    • Mixed multiple ethnic groups: White and Black Caribbean - 0.9
    • Mixed multiple ethnic groups: any other Mixed or multiple ethnic background - 0.8
    • White: English or Welsh or Scottish or Northern Irish or British - 74.4
    • White: Irish - 0.9
    • White: Gypsy or Irish Traveller - 0.1
    • White: any other White background - 6.4
    • Other ethnic group: any other ethnic group - 1.6
    • Other ethnic group: Arab - 0.6
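
    The generation rules stated in this description can be illustrated with a short rule-based generator. This is a sketch only, not the generator used for the dataset: it assumes one visit per person, omits the ethnicity distribution, and the function name generate_person and the dictionary keys are hypothetical.

```python
import random


def generate_person(rng):
    """Create one synthetic person following the stated generation rules."""
    procedures = ["Checking the position of nasogastric tube using X-ray"]  # every person
    if rng.random() < 0.02:  # about 2% of persons also get a chest X-ray
        procedures.append("Plain chest X-ray")
    return {
        "age_at_visit": rng.randint(18, 100),       # between 18 and 100 inclusive
        "gender": rng.choice(["Male", "Female"]),   # 50% each
        # 60% inpatient visits, 40% emergency room visits
        "visit_concept": ("Inpatient Visit" if rng.random() < 0.60
                          else "Emergency Room Visit"),
        "procedures": procedures,
    }


rng = random.Random(42)  # seeded so the sketch is repeatable
people = [generate_person(rng) for _ in range(10_000)]
inpatient_share = sum(p["visit_concept"] == "Inpatient Visit" for p in people) / len(people)
chest_xray_share = sum("Plain chest X-ray" in p["procedures"] for p in people) / len(people)
print(round(inpatient_share, 2), round(chest_xray_share, 3))
```

    Because the values are sampled, the observed shares only approximate the target percentages; a generator that must hit them exactly would assign categories by count instead.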

  8. Lifestyle and Health Risk Prediction

    • kaggle.com
    zip
    Updated Oct 19, 2025
    Cite
    Arif Miah (2025). Lifestyle and Health Risk Prediction [Dataset]. https://www.kaggle.com/datasets/miadul/lifestyle-and-health-risk-prediction
    Explore at:
    Available download formats: zip (61139 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Arif Miah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📘 Description:

    This synthetic health dataset simulates real-world lifestyle and wellness data for individuals. It is designed to help data scientists, machine learning engineers, and students build and test health risk prediction models safely — without using sensitive medical data.

    The dataset includes features such as age, weight, height, exercise habits, sleep hours, sugar intake, smoking, alcohol consumption, marital status, and profession, along with a synthetic health_risk label generated using a heuristic rule-based algorithm that mimics realistic risk behavior patterns.

    🧾 Columns Description:

    Column Name | Description | Type | Example
    age | Age of the person (years) | Numeric | 35
    weight | Body weight in kilograms | Numeric | 70
    height | Height in centimeters | Numeric | 172
    exercise | Exercise frequency level | Categorical (none, low, medium, high) | medium
    sleep | Average hours of sleep per night | Numeric | 7
    sugar_intake | Level of sugar consumption | Categorical (low, medium, high) | high
    smoking | Smoking habit | Categorical (yes, no) | no
    alcohol | Alcohol consumption habit | Categorical (yes, no) | yes
    married | Marital status | Categorical (yes, no) | yes
    profession | Type of work or profession | Categorical (office_worker, teacher, doctor, engineer, etc.) | teacher
    bmi | Body Mass Index calculated as weight / (height²) | Numeric | 24.5
    health_risk | Target label showing overall health risk | Categorical (low, high) | high
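
    The bmi column is described as weight / (height²). A minimal sketch of that calculation follows, assuming the standard BMI definition in which height is converted from centimeters to metres first; the listing does not state the conversion, and the function name bmi is illustrative. The example values in the column list are per-column illustrations, so they need not come from one consistent row.

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    height_m = height_cm / 100.0  # the height column is given in centimeters
    return weight_kg / (height_m ** 2)


print(round(bmi(70, 172), 1))  # 23.7
```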

    🧩 Use Cases:

    1. Health Risk Prediction: Train classification models (Logistic Regression, RandomForest, XGBoost, CatBoost) to predict health risk (low / high).

    2. Feature Importance Analysis: Identify which lifestyle factors most influence health risk.

    3. Data Preprocessing & EDA Practice: Use this dataset for data cleaning, encoding, and visualization practice.

    4. Model Explainability Projects: Use SHAP or LIME to explain how different lifestyle habits affect predictions.

    5. Streamlit or Flask Web App Development: Build a real-time web app that predicts health risk from user input.

    💡 Case Study Example:

    Imagine you are a data scientist building a Health Risk Prediction App for a wellness startup. You want to analyze how exercise, sleep, and sugar intake affect overall health risk. This dataset helps you simulate those relationships without handling sensitive medical data.

    You could:

    • Perform EDA to find correlations between age, BMI, and health risk.
    • Train a model using Random Forest to predict health_risk.
    • Deploy a Streamlit app where users can input their lifestyle information and get a risk score instantly.

    ⚙️ Technical Information:

    • Rows: 5,000 (adjustable, you can create more)
    • Columns: 12
    • Target variable: health_risk
    • Data type: Mixed (Numeric + Categorical)
    • Source: Fully synthetic, generated using Python (NumPy, Faker)

    📈 License:

    CC0: Public Domain. You are free to use this dataset for research, learning, or commercial projects.

    🌍 Author:

    Created by Arif Miah Machine Learning Engineer | Kaggle Expert | Data Scientist 📧 arifmiahcse@gmail.com

  9. Synthetic Healthcare Admissions Dataset

    • kaggle.com
    zip
    Updated Sep 2, 2025
    Cite
    Yash (2025). Synthetic Healthcare Admissions Dataset [Dataset]. https://www.kaggle.com/datasets/yashdev01/synthetic-healthcare-admissions
    Explore at:
    Available download formats: zip (1581549 bytes)
    Dataset updated
    Sep 2, 2025
    Authors
    Yash
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    🏥 Synthetic Healthcare Admissions Dataset

    The Synthetic Healthcare Admissions dataset is a synthetically generated healthcare dataset that mimics patient hospital admission records. It is designed to provide researchers, data scientists, and machine learning practitioners with realistic healthcare data while preserving patient privacy and avoiding exposure of sensitive information.

    📂 Dataset Overview

    • Type: Tabular / Structured data
    • Domain: Healthcare, Electronic Health Records (EHR)
    • Content: Synthetic hospital admission records
    • Use Cases:
      • Predictive modeling of patient outcomes
      • Length of stay estimation
      • Readmission prediction
      • Resource allocation & optimization in healthcare
      • Experimentation with ML models without privacy risks

    ⚙️ Features (common fields included in admissions data)

    • Patient Demographics: Age, Gender, Ethnicity
    • Admission Details: Admission type, Admission date, Discharge date
    • Clinical Data: Diagnosis codes (ICD-like), Procedures, Comorbidities
    • Hospital Metrics: Length of stay, Department/Unit info
    • Synthetic Identifiers: Randomized patient IDs

    ✅ Why Synthetic?

    Real healthcare data is heavily restricted due to HIPAA and GDPR compliance. This dataset provides a privacy-safe alternative, allowing open research while maintaining the structure and statistical properties of real hospital admissions data.

    🔬 Applications

    • Benchmarking healthcare ML models
    • Developing explainable AI solutions in clinical settings
    • Testing NLP/ML pipelines for structured EHR data
    • Teaching and training purposes
  10. Synthetic Synthea patient datasets for lung cancer risk prediction machine...

    • data.mendeley.com
    Updated Oct 31, 2022
    + more versions
    Cite
    Anjun Chen (2022). Synthetic Synthea patient datasets for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.17632/b24cb4nn8h.1
    Explore at:
    Dataset updated
    Oct 31, 2022
    Authors
    Anjun Chen
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction and simulation study of learning health systems.

    1. In subfolder "unconverted": Five populations of 30K patients were generated by the Synthea patient generator. About 1100 lung cancer patients and 3000 control patients (without lung cancer) were selected and their electronic health records (EHR) were processed to data table files ready for machine learning using common algorithms like XGBoost.

    2. In root directory: The five 30K-patient datasets were combined sequentially to form 5 different size datasets, from 30K to 150K patients. The new datasets were resampled to keep all lung cancer patients plus about 3x control patients. The ML-ready table files also had the continuous numeric values converted to categorical values.
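
    The resampling step described above (keep all lung cancer patients plus about 3x as many controls) can be sketched as a simple case-control downsample. This is an illustration under assumptions, not the authors' pipeline: the record layout, the lung_cancer flag, and the function resample_case_control are hypothetical.

```python
import random


def resample_case_control(patients, control_ratio=3, seed=0):
    """Keep every case and a random sample of about control_ratio x as many controls."""
    cases = [p for p in patients if p["lung_cancer"]]
    controls = [p for p in patients if not p["lung_cancer"]]
    n_controls = min(len(controls), control_ratio * len(cases))
    rng = random.Random(seed)  # seeded so the same subset is drawn each run
    return cases + rng.sample(controls, n_controls)


# Hypothetical toy population: 10 cases among 1,000 patients.
population = [{"id": i, "lung_cancer": i < 10} for i in range(1000)]
subset = resample_case_control(population)
print(len(subset))  # 40
```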

    Because Synthea patients closely resemble real patients, the Synthea patient data can be used to develop and test ML algorithms and pipelines, and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns.

    The first LHS simulation study titled "Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data" has been published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).

  11. 10,000 Synthetic Medicare Patient Records

    • dataverse.harvard.edu
    Updated Nov 4, 2019
    Cite
    Dylan Hall (2019). 10,000 Synthetic Medicare Patient Records [Dataset]. http://doi.org/10.7910/DVN/QDXLWR
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 4, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Dylan Hall
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains 10,000 synthetic patient records representing a scaled-down US Medicare population. The records were generated by Synthea ( https://github.com/synthetichealth/synthea ) and are completely synthetic and contain no real patient data. This data is presented free of cost and free of restrictions. Each record is stored as one file in HL7 FHIR R4 ( https://www.hl7.org/fhir/ ) containing one Bundle, in JSON. For more information on how this specific population was created, or to generate your own at any scale, see: https://github.com/synthetichealth/populations/tree/master/medicare
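
    Since each record is one FHIR Bundle in JSON, a first look at a file often amounts to counting the resource types it contains. The sketch below does that with the standard json module. The sample Bundle is a heavily trimmed, hypothetical stand-in for a real patient file, and the function name count_resource_types is illustrative.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed stand-in for one patient file.
SAMPLE_BUNDLE = """
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"resourceType": "Encounter", "id": "e1"}},
    {"resource": {"resourceType": "Condition", "id": "c1"}},
    {"resource": {"resourceType": "Encounter", "id": "e2"}}
  ]
}
"""


def count_resource_types(bundle_text):
    """Count how many resources of each type a FHIR Bundle contains."""
    bundle = json.loads(bundle_text)
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a FHIR Bundle")
    return Counter(entry["resource"]["resourceType"]
                   for entry in bundle.get("entry", []))


counts = count_resource_types(SAMPLE_BUNDLE)
print(counts)
```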

  12. Synthea Generated Synthetic Data in FHIR

    • console.cloud.google.com
    Updated Jul 27, 2023
    Cite
    The MITRE Corporation (2023). Synthea Generated Synthetic Data in FHIR [Dataset]. https://console.cloud.google.com/marketplace/product/mitre/synthea-fhir?hl=fr
    Explore at:
    Dataset updated
    Jul 27, 2023
    Dataset authored and provided by
    The MITRE Corporation (https://www.mitre.org/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Synthea Generated Synthetic Data in FHIR dataset hosts over 1 million synthetic patient records generated using Synthea in FHIR format. The records were exported from the Google Cloud Healthcare API FHIR Store into BigQuery using the analytics schema. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing: each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. The dataset is also available in Google Cloud Storage and is free to use. The URL for the GCS bucket is gs://gcp-public-data--synthea-fhir-data-1m-patients.

    Please cite Synthea™ as: Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079

  13. Synthetic Suicide Prevention Dataset with SDoH

    • data.va.gov
    • datahub.va.gov
    +3 more
    csv, xlsx, xml
    Updated Feb 18, 2021
    Cite
    VHA (2021). Synthetic Suicide Prevention Dataset with SDoH [Dataset]. https://www.data.va.gov/dataset/Synthetic-Suicide-Prevention-Dataset-with-SDoH/h5zp-pekf
    Explore at:
    Available download formats: csv, xlsx, xml
    Dataset updated
    Feb 18, 2021
    Dataset authored and provided by
    VHA
    Description

    The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.

  14. The Health Gym v2.0 Synthetic Antiretroviral Therapy (ART) for HIV Dataset

    • figshare.com
    txt
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nicholas Kuo (2023). The Health Gym v2.0 Synthetic Antiretroviral Therapy (ART) for HIV Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.22827878.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nicholas Kuo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ===

    This synthetic dataset, centred on ART for HIV, was synthesised employing the model outlined in reference [1], incorporating the techniques of WGAN-GP+G_EOT+VAE+Buffer.

    This dataset serves as a principal resource for the Centre for Big Data Research in Health (CBDRH) Datathon (see: CBDRH Health Data Science Datathon 2023 (cbdrh-hds-datathon-2023.github.io)). Its primary purpose is to advance the Health Data Analytics (HDAT) courses at the University of New South Wales (UNSW), providing students with exposure to synthetic yet realistic datasets that simulate real-world data.

    The dataset is composed of 534,960 records, distributed over 15 distinct columns, and is preserved in a CSV format with a size of 39.1 MB. It contains information about 8,916 synthetic patients over a period of 60 months, with data summarised on a monthly basis. The total number of records corresponds to the product of the synthetic patient count and the record duration in months, thus equating to 8,916 multiplied by 60.
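
    The record count stated above can be checked with a line of arithmetic; the variable names in this sketch are illustrative.

```python
n_patients = 8_916   # synthetic patients, as stated in the description
n_months = 60        # monthly summaries over the record period
n_records = n_patients * n_months
print(n_records)  # 534960
```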

    The dataset's structure encompasses 15 columns, which include 13 variables pertinent to ART for HIV as delineated in reference [1], a unique patient identifier, and a further variable signifying the specific time point.

    ===

    This dataset forms part of a continuous series of work, building upon reference [2]. For further details, kindly refer to our papers: [1] Kuo, Nicholas I., Louisa Jorm, and Sebastiano Barbieri. "Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV." arXiv preprint arXiv:2208.08655 (2022). [2] Kuo, Nicholas I-Hsien, et al. "The Health Gym: synthetic health-related datasets for the development of reinforcement learning algorithms." Scientific Data 9.1 (2022): 693.

    ===

    Latest edit: 16th May 2023.

  15. Synthetic Data In Healthcare Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). Synthetic Data In Healthcare Market Research Report 2033 [Dataset]. https://dataintelo.com/report/synthetic-data-in-healthcare-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Synthetic Data in Healthcare Market Outlook

    According to our latest research, the global synthetic data in healthcare market size reached USD 457.8 million in 2024 and is expected to grow at a robust CAGR of 34.2% during the forecast period, reaching USD 5.68 billion by 2033. This remarkable growth is driven by the escalating demand for advanced data solutions that address privacy concerns, enable improved AI model training, and facilitate seamless data sharing across the healthcare ecosystem. The increasing adoption of digital health technologies, stringent data privacy regulations, and rising investments in artificial intelligence are among the key factors fueling the expansion of the synthetic data in healthcare market.

    One of the primary growth factors for the synthetic data in healthcare market is the growing need for privacy-preserving data solutions. As healthcare organizations grapple with stringent regulations such as HIPAA and GDPR, the use of real patient data for research, analytics, and AI model training has become increasingly challenging. Synthetic data, which mimics real-world patient information without exposing sensitive personal details, has emerged as a viable alternative. This approach not only ensures compliance with regulatory requirements but also mitigates the risks associated with data breaches and unauthorized access. The ability to generate diverse, high-quality synthetic datasets is empowering healthcare providers, payers, and researchers to drive innovation while maintaining patient confidentiality.




    Another significant driver is the rapid advancement of artificial intelligence and machine learning applications within the healthcare sector. AI models require vast and varied datasets to achieve high accuracy and reliability, especially in complex domains such as medical imaging, drug discovery, and predictive analytics. However, access to comprehensive and representative real-world data is often limited by privacy constraints and data silos. Synthetic data bridges this gap by providing scalable, customizable, and bias-free datasets that enhance the performance of AI algorithms. This not only accelerates the development and deployment of AI-driven healthcare solutions but also fosters collaboration among stakeholders by enabling secure data sharing and benchmarking.




    The synthetic data in healthcare market is further propelled by the increasing adoption of digital transformation initiatives across the industry. Hospitals, pharmaceutical companies, research institutions, and contract research organizations (CROs) are leveraging synthetic data to streamline clinical trials, improve patient data management, and optimize resource allocation. The integration of synthetic data into electronic health records (EHRs), telemedicine platforms, and health information exchanges is facilitating seamless interoperability and data-driven decision-making. Moreover, the growing emphasis on value-based care, population health management, and personalized medicine is creating new opportunities for synthetic data solutions to enhance healthcare delivery and outcomes.




    From a regional perspective, North America continues to dominate the synthetic data in healthcare market, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of advanced healthcare infrastructure, a strong focus on innovation, and proactive regulatory frameworks that support digital health adoption. Europe follows closely, driven by increasing investments in healthcare IT, a collaborative research environment, and robust data protection regulations. The Asia Pacific region is emerging as a high-growth market, fueled by expanding healthcare access, rising government initiatives, and the proliferation of digital health technologies. Latin America and the Middle East & Africa are also witnessing steady growth, supported by improving healthcare infrastructure and growing awareness of the benefits of synthetic data.



    Component Analysis



    The synthetic data in healthcare market is segmented by component into software and services, each playing a pivotal role in the industry’s ecosystem. The software segment encompasses a wide range of solutions designed to generate, manage, and validate synthetic datasets for various healthcare applications. These software platforms leverage advanced algorithms, machine learning techniques, and data modeling tools to create high-fidelity synthetic data that mimics real-world patient

  16. synthetic-health-dataset

    • huggingface.co
    Harshita Chhaparia, synthetic-health-dataset [Dataset]. https://huggingface.co/datasets/ieruygfvkihesugdfvx/synthetic-health-dataset
    Authors
    Harshita Chhaparia
    Description

    ieruygfvkihesugdfvx/synthetic-health-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Synthetic Dataset of Emergency Healthcare Services

    • figshare.com
    csv
    Updated Dec 12, 2024
    Marco Ferreira (2024). Synthetic Dataset of Emergency Healthcare Services [Dataset]. http://doi.org/10.6084/m9.figshare.28012784.v1
    Available download formats: csv
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Marco Ferreira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was generated using Simio simulation software. The simulations model patient flow in healthcare settings, capturing key metrics such as queue times, length of stay (LOS) for patients, and nurse utilization rates. Each CSV file contains time-series data, with measured variables including patient waiting times, resource utilization percentages, and service durations.

    ## File Overview

    **CheckBloodPressure.csv** - (9 KB): Contains blood pressure Server records of patients.
    **CheckPatientType.csv** - (19 KB): Identifies the type of each patient (e.g., 1 or 3).
    **Fill_Information.csv** - (2 KB): Fill information records for new patients.
    **MedicalRecord1.csv** - (10 KB): Medical record dataset for patient type 1.
    **MedicalRecord2.csv** - (4 KB): Medical record dataset for patient type 2.
    **MedicalRecord3.csv** - (2 KB): Medical record dataset for patient type 3.
    **MedicalRecord4.csv** - (13 KB): Medical record dataset for patient type 4.
    **OutPatientDepartment.csv** - (18 KB): Data related to the satisfaction and length of stay of a given patient.
    **Triage.csv** - (13 KB): Data related to the triage process.
    **README.txt** - (4 KB): Documentation of the dataset, including structure, metadata, and usage.

    ## Common Fields Across Files

    **Patient ID** (Integer): Unique identifier for each patient.
    **Patient Type** (Integer): Classification of patient (e.g., 1, 4).
    **Medical Records Arrival Time** (DateTime): Timestamp of the patient's first arrival in the medical record department.
    **Exiting Time** (DateTime): Timestamp when the patient exits a Server.
    **Waiting Time (min)** (Real): Total waiting time before being attended to.
    **Resource Used** (String): Resource (e.g., Operator) allocated to the patient.
    **Utilization %** (Real): Utilization rate of the resource as a percentage.
    **Queue Count Before Processing** (Integer): Number of patients in the queue before processing begins.
    **Queue Count After Processing** (Integer): Number of patients in the queue after processing ends.
    **Queue Difference** (Integer): Difference between the before and after queue counts.
    **Length of Stay (min)** (Real): Total time spent in the simulation by the patient.
    **LOS without Queues (min)** (Real): Length of stay excluding any queuing time.
    **Satisfaction %** (Real): Patient satisfaction rating based on their experience.
    **New Patient?** (String): Indicates if this is a new patient or a returning one.
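The field definitions above imply two simple relationships that can be checked when loading the files. The sketch below is illustrative only. The sample rows are invented, only the column names come from the description, and two points are assumptions: that Queue Difference means before minus after, and that queuing time equals Length of Stay minus LOS without Queues. It also assumes these columns appear together in one file, which the description does not confirm.

```python
import csv
import io

# Hypothetical sample rows: the values are invented for illustration.
# Only the column names are taken from the dataset description.
SAMPLE_CSV = """Patient ID,Queue Count Before Processing,Queue Count After Processing,Queue Difference,Length of Stay (min),LOS without Queues (min)
1,3,2,1,42.5,30.0
2,0,0,0,18.0,18.0
3,5,4,2,60.0,41.5
"""


def find_inconsistent_rows(csv_text: str) -> list[str]:
    """Return Patient IDs whose Queue Difference is not (before - after).

    Assumption: the difference is defined as before minus after; the
    dataset description does not state the sign convention.
    """
    bad_ids = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        before = int(row["Queue Count Before Processing"])
        after = int(row["Queue Count After Processing"])
        if int(row["Queue Difference"]) != before - after:
            bad_ids.append(row["Patient ID"])
    return bad_ids


def queue_minutes(csv_text: str) -> dict[str, float]:
    """Map Patient ID to queuing time, taken as LOS minus LOS without queues."""
    return {
        row["Patient ID"]: float(row["Length of Stay (min)"])
        - float(row["LOS without Queues (min)"])
        for row in csv.DictReader(io.StringIO(csv_text))
    }


if __name__ == "__main__":
    print(find_inconsistent_rows(SAMPLE_CSV))
    print(queue_minutes(SAMPLE_CSV))
```

To run the same checks on a downloaded file, the contents of that file can be read as text and passed in place of `SAMPLE_CSV`, after confirming the actual header names against README.txt.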

  18. Synthetic Dataset of Emergency Healthcare Services

    • datarepositorium.uminho.pt
    • data-staging.niaid.nih.gov
    • +1more
    csv, txt
    Updated Jan 17, 2025
    + more versions
    Repositório de Dados da Universidade do Minho (2025). Synthetic Dataset of Emergency Healthcare Services [Dataset]. http://doi.org/10.34622/datarepositorium/AKSZQG
    Available download formats: csv (1259), txt (4064)
    Dataset updated
    Jan 17, 2025
    Dataset provided by
    Repositório de Dados da Universidade do Minho
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Synthetic dataset of emergency services comprising several CSV files that we generated using simulation software. This dataset is open for public use; please cite our work if used in research or applications.

    File Overview

    CheckBloodPressure.csv - (9 KB): Contains blood pressure Server records of patients.
    CheckPatientType.csv - (19 KB): Identifies the type of each patient (e.g., 1 or 3).
    Fill_Information.csv - (2 KB): Fill information records for new patients.
    MedicalRecord1.csv - (10 KB): Medical record dataset for patient type 1.
    MedicalRecord2.csv - (4 KB): Medical record dataset for patient type 2.
    MedicalRecord3.csv - (2 KB): Medical record dataset for patient type 3.
    MedicalRecord4.csv - (13 KB): Medical record dataset for patient type 4.
    OutPatientDepartment.csv - (18 KB): Data related to the satisfaction and length of stay of a given patient.
    Triage.csv - (13 KB): Data related to the triage process.
    README.txt - (4 KB): Documentation of the dataset, including structure, metadata, and usage.

    Common Fields Across Files

    Patient ID (Integer): Unique identifier for each patient.
    Patient Type (Integer): Classification of patient (e.g., 1, 4).
    Medical Records Arrival Time (DateTime): Timestamp of the patient's first arrival in the medical record department.
    Exiting Time (DateTime): Timestamp when the patient exits a Server.
    Waiting Time (min) (Real): Total waiting time before being attended to.
    Resource Used (String): Resource (e.g., Operator) allocated to the patient.
    Utilization % (Real): Utilization rate of the resource as a percentage.
    Queue Count Before Processing (Integer): Number of patients in the queue before processing begins.
    Queue Count After Processing (Integer): Number of patients in the queue after processing ends.
    Queue Difference (Integer): Difference between the before and after queue counts.
    Length of Stay (min) (Real): Total time spent in the simulation by the patient.
    LOS without Queues (min) (Real): Length of stay excluding any queuing time.
    Satisfaction % (Real): Patient satisfaction rating based on their experience.
    New Patient? (String): Indicates if this is a new patient or a returning one.

  19. Healthcare Synthetic-Data Governance Services Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 29, 2025
    Growth Market Reports (2025). Healthcare Synthetic-Data Governance Services Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/healthcare-synthetic-data-governance-services-market
    Available download formats: pdf, pptx, csv
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Healthcare Synthetic-Data Governance Services Market Outlook




    As per our latest research, the global healthcare synthetic-data governance services market size reached USD 1.14 billion in 2024, demonstrating a robust momentum in the adoption of synthetic data solutions across the healthcare sector. The industry is expanding at a CAGR of 29.3% and is forecasted to attain a value of USD 8.71 billion by 2033. This exceptional growth is primarily driven by the increasing demand for privacy-preserving data solutions, escalating regulatory pressures, and the need for high-quality data to fuel advanced healthcare analytics and artificial intelligence (AI) applications.




    The healthcare synthetic-data governance services market is experiencing exponential growth due to the growing emphasis on data privacy and security in healthcare environments. As healthcare organizations increasingly integrate digital technologies and electronic health records (EHRs), there is a concurrent rise in concerns around patient data confidentiality and compliance with global data protection regulations such as HIPAA, GDPR, and others. Synthetic data, which mimics real patient data without exposing sensitive information, is becoming a preferred solution for training AI models, conducting clinical research, and enabling data sharing across organizations. The market is further propelled by the rising adoption of AI and machine learning in healthcare, which necessitates vast, high-quality datasets that can be safely used without breaching patient privacy. This has led to a surge in demand for robust governance frameworks and services that ensure the ethical and compliant use of synthetic data throughout its lifecycle.




    Another significant growth factor is the increasing complexity and volume of healthcare data, which is making traditional data anonymization techniques less effective. As healthcare providers, pharmaceutical companies, and research institutes seek to leverage big data analytics and advanced modeling, they are turning to synthetic data to overcome data scarcity and bias issues. Synthetic-data governance services play a crucial role in standardizing processes, ensuring data quality, and maintaining regulatory compliance while facilitating seamless data sharing and collaboration. The market is also witnessing an upsurge in partnerships between healthcare organizations and technology vendors, aiming to co-develop tailored governance solutions that address specific clinical, operational, and research needs. This collaborative ecosystem is fostering innovation and accelerating the deployment of synthetic-data governance frameworks globally.




    Furthermore, the healthcare synthetic-data governance services market is benefiting from increased investments by both public and private sectors in digital health infrastructure. Governments and regulatory bodies are actively supporting initiatives that promote data-driven healthcare innovation while safeguarding patient rights. The proliferation of cloud computing and the emergence of interoperable health information systems are making it easier for organizations to implement synthetic-data governance solutions at scale. Additionally, the COVID-19 pandemic has highlighted the critical need for secure, accessible, and compliant data management practices, further intensifying demand for synthetic-data governance services. These factors collectively position the market for sustained long-term growth.



    Synthetic Health Data is revolutionizing the way healthcare organizations approach data privacy and security. By creating realistic but fictional datasets, synthetic health data allows researchers and developers to work with information that mirrors real patient data without exposing sensitive details. This approach not only enhances privacy but also provides a valuable resource for testing new healthcare technologies and methodologies. As the demand for synthetic health data grows, it is becoming an integral part of the healthcare data ecosystem, supporting innovation while ensuring compliance with stringent data protection regulations.




    Regionally, North America continues to dominate the healthcare synthetic-data governance services market, owing to its advanced healthcare IT ecosystem, strong regulatory frameworks, and high adoption of AI-driven healthcare solutions. Europe follows closely, with stringent

  20. Synthetic-Medical-Speech-Dataset

    • huggingface.co
    Updated Oct 23, 2025
    Hani. M (2025). Synthetic-Medical-Speech-Dataset [Dataset]. https://huggingface.co/datasets/Hani89/Synthetic-Medical-Speech-Dataset
    Dataset updated
    Oct 23, 2025
    Authors
    Hani. M
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic Medical Speech Dataset

    Overview

    Synthetic Medical Speech Dataset is a synthetic dataset of audio–text pairs designed for developing and evaluating automatic speech recognition (ASR) models in the medical domain. The corpus contains thousands of short audio clips generated from medically relevant text using a text-to-speech (TTS) system. Each clip is paired with its corresponding transcript. Because all content is synthetically produced, the dataset does not contain… See the full description on the dataset page: https://huggingface.co/datasets/Hani89/Synthetic-Medical-Speech-Dataset.

Agency for Healthcare Research and Quality (2023). Synthetic Healthcare Database for Research (SyH-DR) [Dataset]. https://catalog.data.gov/dataset/synthetic-healthcare-database-for-research-syh-dr

Synthetic Healthcare Database for Research (SyH-DR)

9 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Sep 16, 2023
Dataset provided by
Agency for Healthcare Research and Quality: http://www.ahrq.gov/
Description

The Agency for Healthcare Research and Quality (AHRQ) created SyH-DR from eligibility and claims files for Medicare, Medicaid, and commercial insurance plans in calendar year 2016. SyH-DR contains data from a nationally representative sample of insured individuals for the 2016 calendar year. SyH-DR uses synthetic data elements at the claim level to resemble the marginal distribution of the original data elements. SyH-DR person-level data elements are not synthetic, but identifying information is aggregated or masked.
