10 datasets found
  1. Medical records of 30K Synthea synthetic patients

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Medical records of 30K Synthea synthetic patients [Dataset]. http://doi.org/10.7910/DVN/BWDKXS
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    The dataset has 2 populations of synthetic patients generated by the Synthea tool. Each population has 15K patients, with the original medical records in CSV files. Because the total file size is >3GB for each population, the files are compressed into a zip file. Synthea records cover domains similar to those in a real EMR, including patients, encounters, conditions (diagnoses), observations, medications, and procedures. The data was first used to build ML models for lung cancer risk prediction. For more information, see the paper published in Nature Scientific Reports (https://www.nature.com/articles/s41598-022-23011-4).

  2. Medical records of 100 Synthea patients

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Medical records of 100 Synthea patients [Dataset]. http://doi.org/10.7910/DVN/VBEKZO
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    The dataset has 1 population of 100 synthetic patients generated by the Synthea tool. After unzipping, the original medical records are in CSV files. Synthea records cover domains similar to those in a real EMR, including patients, encounters, conditions (diagnoses), observations, medications, and procedures.
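As a rough illustration of how these CSV domains relate to each other, the sketch below groups condition rows under the patient they belong to. The file contents are tiny made-up samples, and the column names (`Id`, `PATIENT`, `DESCRIPTION`) are assumptions based on the Synthea CSV data dictionary; check them against the headers of the unzipped files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up samples standing in for patients.csv and conditions.csv.
# Column names are assumptions; verify them against the real file headers.
PATIENTS_CSV = """Id,BIRTHDATE,GENDER
p1,1960-01-01,M
p2,1985-06-15,F
"""
CONDITIONS_CSV = """START,PATIENT,DESCRIPTION
2010-03-02,p1,Hypertension
2015-07-09,p1,Prediabetes
2018-11-20,p2,Acute bronchitis
"""


def conditions_by_patient(patients_file, conditions_file):
    """Map each patient ID to the condition descriptions recorded for it."""
    grouped = defaultdict(list)
    for row in csv.DictReader(conditions_file):
        grouped[row["PATIENT"]].append(row["DESCRIPTION"])
    # Patients without any condition rows still appear, with an empty list.
    return {
        row["Id"]: grouped.get(row["Id"], [])
        for row in csv.DictReader(patients_file)
    }


result = conditions_by_patient(
    io.StringIO(PATIENTS_CSV), io.StringIO(CONDITIONS_CSV)
)
print(result)
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.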

  3. Synthea synthetic patient generator data in OMOP Common Data Model

    • registry.opendata.aws
    Updated Jan 4, 2023
    Cite
    Amazon Web Services (2023). Synthea synthetic patient generator data in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/synthea-omop/
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    The Synthea-generated data is provided here as 1,000-person (1k), 100,000-person (100k), and 2,800,000-person (2.8m) data sets in the OMOP Common Data Model format. Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated). You can read our first academic paper here: https://doi.org/10.1093/jamia/ocx079

  4. 🩺 Synthetic AL Medical Records

    • kaggle.com
    Updated Oct 5, 2023
    Cite
    mexwell (2023). 🩺 Synthetic AL Medical Records [Dataset]. https://www.kaggle.com/datasets/mexwell/synthetic-al-medical-records/data
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle
    Authors
    mexwell
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains the core data to be used for health analytics instructional activities in the Department of Healthcare Administration and Informatics at Samford University.

    Patient data was generated using Synthea[1]. See https://synthea.mitre.org/ for more information.

    To regenerate this dataset locally, see Getting Started. This dataset was generated with the following options:

    java -jar synthea-with-dependencies.jar -s 51 -p 100 --exporter.csv.export true Alabama
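To regenerate a comparable dataset with a different seed, population size, or state, the command above can be assembled programmatically. This is a minimal sketch: the helper name `synthea_command` is hypothetical, and it uses only the options that appear in the command above (`-s` for the seed, `-p` for the population size, and the CSV exporter switch).

```python
import shlex


def synthea_command(seed, population, state, jar="synthea-with-dependencies.jar"):
    """Build the argument list for a Synthea run with CSV export enabled."""
    return [
        "java", "-jar", jar,
        "-s", str(seed),        # random seed, for a reproducible population
        "-p", str(population),  # number of patients to generate
        "--exporter.csv.export", "true",
        state,                  # US state to simulate, e.g. "Alabama"
    ]


args = synthea_command(seed=51, population=100, state="Alabama")
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)`, assuming Java is installed and the jar file is in the working directory.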

    Citation

    1 - Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079

    Acknowledgement

    Photo by Hush Naidoo Jade Photography on Unsplash

  5. Synthetic Suicide Prevention Dataset with SDoH

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    Updated Jun 2, 2025
    Cite
    Department of Veterans Affairs (2025). Synthetic Suicide Prevention Dataset with SDoH [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/synthetic-suicide-prevention-dataset-with-sdoh
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Description

    The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://res1githubd-o-tcom.vcapture.xyz/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://res1githubd-o-tcom.vcapture.xyz/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.

  6. Synthea Generated Synthetic Data in FHIR

    • console.cloud.google.com
    Updated Jul 15, 2023
    Cite
    The MITRE Corporation (2023). Synthea Generated Synthetic Data in FHIR [Dataset]. https://console.cloud.google.com/marketplace/product/mitre/synthea-fhir?hl=de
    Dataset updated
    Jul 15, 2023
    Dataset authored and provided by
    The MITRE Corporation (https://www.mitre.org/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Synthea Generated Synthetic Data in FHIR dataset hosts over 1 million synthetic patient records generated using Synthea in FHIR format. The records were exported from the Google Cloud Healthcare API FHIR Store into BigQuery using the analytics schema. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. This public dataset is also available in Google Cloud Storage and is free to use. The URL for the GCS bucket is gs://gcp-public-data--synthea-fhir-data-1m-patients. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. Please cite Synthea™ as: Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079

  7. Synthea lung cancer synthetic patient data series for ML

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Synthea lung cancer synthetic patient data series for ML [Dataset]. http://doi.org/10.7910/DVN/Q5LK5A
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    These synthetic patient datasets were created for a machine learning (ML) study of lung cancer risk prediction in a simulation of ML-enabled learning health systems (LHS). Five populations of 30K patients were generated by the Synthea patient generator. They were combined sequentially to form 5 populations of different sizes, from 30K to 150K patients. Patients with and without lung cancer were selected at a ratio of roughly 1:3, and their electronic health records (EHR) were processed into data table files ready for machine learning. The ML-ready table files also have the continuous numeric values converted to categorical values. Because Synthea patients closely resemble real patients, these ML-ready datasets can be used to develop and test ML algorithms and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first use of these datasets was in an LHS simulation study, which was published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
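The sequential combination and the roughly 1:3 selection described above can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: the description does not say how the patients without lung cancer were chosen, so random sampling is assumed here, and the function names and toy records are hypothetical.

```python
import random
from itertools import accumulate


def cumulative_populations(batches):
    """Combine batches sequentially: the first, the first two, the first three, ..."""
    return list(accumulate(batches, lambda combined, batch: combined + batch))


def select_cases_and_controls(patients, is_case, controls_per_case=3, seed=0):
    """Keep every case and randomly sample about controls_per_case controls per case."""
    cases = [p for p in patients if is_case(p)]
    others = [p for p in patients if not is_case(p)]
    n_controls = min(len(others), controls_per_case * len(cases))
    return cases, random.Random(seed).sample(others, n_controls)


# Toy stand-in: 5 batches of 30 records (not 30K), each a (patient_id, has_cancer) pair.
batches = [[(f"b{b}-{i}", i % 10 == 0) for i in range(30)] for b in range(5)]
populations = cumulative_populations(batches)
cases, controls = select_cases_and_controls(populations[-1], is_case=lambda p: p[1])
print([len(p) for p in populations], len(cases), len(controls))
```

Fixing the seed keeps the sampled control group reproducible across runs.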

  8. cnotesum

    • huggingface.co
    Cite
    Georgia Institute of Technology, cnotesum [Dataset]. https://huggingface.co/datasets/GeorgiaTech/cnotesum
    Dataset authored and provided by
    Georgia Institute of Technology
    License

    https://choosealicense.com/licenses/other/

    Description

    Synthetic Clinical Notes based on Synthea and Summary Generated via LLAMA 2

  9. Synthea stroke synthetic patient data series for risk prediction ML

    • dataverse.harvard.edu
    Updated Nov 14, 2022
    Cite
    AJ Chen (2022). Synthea stroke synthetic patient data series for risk prediction ML [Dataset]. http://doi.org/10.7910/DVN/LBD9GU
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    AJ Chen
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/LBD9GU

    Description

    These synthetic patient datasets were created for a machine learning (ML) study of stroke risk prediction. Five populations of 30K patients were generated by the Synthea patient generator. They were combined sequentially to form 5 populations of different sizes, from 30K to 150K patients. Patients with and without stroke were selected at a ratio of roughly 1:3, and their electronic health records (EHR) were processed into data table files ready for machine learning. The ML-ready table files also have the continuous numeric values converted to categorical values. Because Synthea patients closely resemble real patients, these ML-ready datasets can be used to develop and test ML algorithms and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first use of these datasets was in a learning health system (LHS) simulation study, which was published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
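Converting continuous numeric values to categorical values, as mentioned above, usually means assigning each value to a labeled bin. The sketch below shows one way to do that. The cut points and labels are hypothetical examples chosen for illustration; the description does not state which thresholds the dataset actually uses.

```python
from bisect import bisect_right


def to_category(value, cut_points, labels):
    """Return the label of the bin that value falls into.

    cut_points must be sorted in ascending order, and labels needs exactly
    len(cut_points) + 1 entries. A value equal to a cut point goes to the
    higher bin.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("labels must have one more entry than cut_points")
    return labels[bisect_right(cut_points, value)]


# Hypothetical thresholds for a body mass index column (an assumption, for illustration only).
BMI_CUTS = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]

binned = [to_category(v, BMI_CUTS, BMI_LABELS) for v in [17.0, 22.4, 25.0, 31.2]]
print(binned)
```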

  10. Synthetic EHRs for Benchmarking System Performance

    • kaggle.com
    Updated Apr 29, 2025
    Cite
    Ahmed Abbasi (2025). Synthetic EHRs for Benchmarking System Performance [Dataset]. http://doi.org/10.34740/kaggle/dsv/11614066
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmed Abbasi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains FHIR-compatible Electronic Health Records (EHR) generated using the Synthea synthetic patient generator. It is specifically designed to benchmark the performance of a blockchain-based EHR solution, NexaEHR, which utilizes smart contracts and IPFS for data storage and management.

    The dataset includes 1000 EHR records (files), each representing a separate synthetic record for a persona, with varying sizes. The largest record is approximately 80 MB, simulating the average record size a patient might accumulate annually. These records are intended for testing and evaluating the scalability, efficiency, and effectiveness of blockchain technology in managing and securing healthcare data within a decentralized system.
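When benchmarking with records of varying sizes, it can help to profile each file first. The sketch below reports the byte size and per-type resource counts of one record. It assumes each file is a FHIR Bundle in JSON form, which the description does not confirm; the function name and the sample bundle are hypothetical.

```python
import json
from collections import Counter


def summarize_bundle(bundle_text):
    """Return the UTF-8 size and per-type resource counts of a FHIR Bundle."""
    bundle = json.loads(bundle_text)
    counts = Counter(
        entry["resource"]["resourceType"] for entry in bundle.get("entry", [])
    )
    return {"bytes": len(bundle_text.encode("utf-8")), "resources": dict(counts)}


# Minimal made-up bundle standing in for one record file.
sample = json.dumps({
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
        {"resource": {"resourceType": "Encounter", "id": "e2"}},
    ],
})
summary = summarize_bundle(sample)
print(summary["resources"])
```

For an 80 MB record, loading the whole file with `json.loads` should still fit in memory on most machines, although a streaming parser may be preferable when profiling many large files.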

