19 datasets found
  1. Synthea lung cancer synthetic patient data series for ML

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Synthea lung cancer synthetic patient data series for ML [Dataset]. http://doi.org/10.7910/DVN/Q5LK5A
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    These synthetic patient datasets were created for a machine learning (ML) study of lung cancer risk prediction in a simulation of ML-enabled learning health systems (LHS). Five populations of 30K patients were generated by the Synthea patient generator. They were combined sequentially to form 5 populations of different sizes, from 30K to 150K patients. Patients with or without lung cancer were selected at a ratio of roughly 1:3, and their electronic health records (EHR) were processed into data table files ready for machine learning. The ML-ready table files also have the continuous numeric values converted to categorical values. Because Synthea patients closely resemble real patients, these ML-ready datasets can be used to develop and test ML algorithms and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first use of these datasets was in an LHS simulation study, which was published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
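The preparation steps described in this entry (sequentially combined populations, roughly 1:3 selection, and conversion of continuous values to categories) can be sketched as follows. This is only an illustration and not the author's pipeline: the field names, the bin edges, and the reading of 1:3 as one patient with lung cancer per three without are all assumptions.

```python
import random

def cumulative_sizes(batch_size, n_batches):
    """Population sizes obtained by adding one batch at a time (30K, 60K, ...)."""
    return [batch_size * k for k in range(1, n_batches + 1)]

def sample_cases_and_controls(patients, ratio=3, seed=0):
    """Keep every case and draw up to `ratio` controls per case (assumed reading of 1:3)."""
    rng = random.Random(seed)
    cases = [p for p in patients if p["lung_cancer"]]
    controls = [p for p in patients if not p["lung_cancer"]]
    n_controls = min(len(controls), ratio * len(cases))
    return cases + rng.sample(controls, n_controls)

def bin_value(value, edges, labels):
    """Map a continuous value to a categorical label using ascending bin edges."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

# Hypothetical toy population: 10 cases among 100 patients.
patients = [{"id": i, "lung_cancer": i < 10, "age": 30 + i % 50} for i in range(100)]
selected = sample_cases_and_controls(patients)
sizes = cumulative_sizes(30_000, 5)
# Hypothetical age bins; the real datasets may use different cut points.
age_group = bin_value(67, edges=[40, 65], labels=["<40", "40-64", "65+"])
```

Fixing the random seed keeps the control sample reproducible across the five population sizes.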

  2. Medical records of 30K Synthea synthetic patients

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Nov 12, 2022
    Cite
    AJ Chen (2022). Medical records of 30K Synthea synthetic patients [Dataset]. http://doi.org/10.7910/DVN/BWDKXS
    Dataset updated
    Nov 12, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    AJ Chen
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/BWDKXS

    Description

    The dataset has 2 populations of Synthea synthetic patients generated by the Synthea tool. Each population has 15K patients with original medical records in CSV files. Because the total file size is >3GB for each population, the files are compressed into a zip file. Synthea records cover domains similar to those in a real EMR, including patients, encounters, conditions (diagnoses), observations, medications, and procedures. The data was first used in building ML models for lung cancer risk prediction. For more information, see the published paper in Nature Scientific Reports (https://www.nature.com/articles/s41598-022-23011-4).
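Because each population is shipped as a large zip of CSV files, it can help to stream the files directly from the archive rather than extracting everything first. The sketch below assumes nothing about the real archive layout: the file names and columns in the demo archive are hypothetical stand-ins built in memory.

```python
import csv
import io
import zipfile

def count_rows_per_csv(zip_source):
    """Return {csv_name: number_of_data_rows} for every CSV inside a zip archive."""
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Decode and parse row by row so large files are never fully loaded.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                counts[name] = sum(1 for _ in csv.DictReader(text))
    return counts

# Tiny in-memory archive standing in for a real download (hypothetical contents).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("patients.csv", "Id,BIRTHDATE\np1,1970-01-01\np2,1985-06-30\n")
    archive.writestr("conditions.csv", "PATIENT,DESCRIPTION\np1,Hypertension\n")
demo.seek(0)
row_counts = count_rows_per_csv(demo)
```

A path to the downloaded zip file can be passed in place of the in-memory buffer.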

  3. Synthea synthetic patient generator data in OMOP Common Data Model

    • registry.opendata.aws
    Updated Jan 4, 2023
    Cite
    Amazon Web Services (2023). Synthea synthetic patient generator data in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/synthea-omop/
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    The Synthea-generated data is provided here as 1,000-person (1k), 100,000-person (100k), and 2,800,000-person (2.8m) data sets in the OMOP Common Data Model format. Synthea(TM) is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated). You can read our first academic paper here: https://doi.org/10.1093/jamia/ocx079

  4. Synthea Generated Synthetic Data in FHIR

    • console.cloud.google.com
    Updated Jun 10, 2020
    Cite
    The MITRE Corporation (2020). Synthea Generated Synthetic Data in FHIR [Dataset]. https://console.cloud.google.com/marketplace/product/mitre/synthea-fhir
    Dataset updated
    Jun 10, 2020
    Dataset authored and provided by
    The MITRE Corporation (https://www.mitre.org/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Synthea Generated Synthetic Data in FHIR dataset hosts over 1 million synthetic patient records generated using Synthea in FHIR format. The records were exported from the Google Cloud Healthcare API FHIR Store into BigQuery using the analytics schema. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. The dataset is also available in Google Cloud Storage and is free to use. The URL for the GCS bucket is gs://gcp-public-data--synthea-fhir-data-1m-patients. Please cite Synthea(TM) as: Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079

  5. Synthea synthetic patient data for lung cancer risk prediction machine...

    • dataverse.harvard.edu
    Updated Nov 13, 2022
    Cite
    AJ Chen (2022). Synthea synthetic patient data for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.7910/DVN/GD5XWE
    Dataset updated
    Nov 13, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    AJ Chen
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.7910/DVN/GD5XWE

    Description

    This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.

  6. Synthetic Suicide Prevention Dataset with SDoH

    • catalog.data.gov
    • datahub.va.gov
    • +2more
    Updated Jun 2, 2025
    Cite
    Department of Veterans Affairs (2025). Synthetic Suicide Prevention Dataset with SDoH [Dataset]. https://catalog.data.gov/dataset/synthetic-suicide-prevention-dataset-with-sdoh
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Description

    The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.

  7. 10,000 Synthetic Medicare Patient Records

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Hall, Dylan (2023). 10,000 Synthetic Medicare Patient Records [Dataset]. http://doi.org/10.7910/DVN/QDXLWR
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hall, Dylan
    Description

    This dataset contains 10,000 synthetic patient records representing a scaled-down US Medicare population. The records were generated by Synthea ( https://github.com/synthetichealth/synthea ) and are completely synthetic and contain no real patient data. This data is presented free of cost and free of restrictions. Each record is stored as one file in HL7 FHIR R4 ( https://www.hl7.org/fhir/ ) containing one Bundle, in JSON. For more information on how this specific population was created, or to generate your own at any scale, see: https://github.com/synthetichealth/populations/tree/master/medicare
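Since each record is one JSON file holding a single FHIR Bundle, a quick first check on a file is to tally the resource types it contains. The sketch below uses a hand-written miniature Bundle as a stand-in; the resource contents are hypothetical, and only the standard Bundle layout (`entry[].resource.resourceType`) is relied on.

```python
import json
from collections import Counter

def resource_type_counts(bundle):
    """Count resources in a FHIR Bundle by their resourceType."""
    entries = bundle.get("entry", [])
    return Counter(
        entry["resource"]["resourceType"]
        for entry in entries
        if "resource" in entry
    )

# Miniature stand-in for one patient file (hypothetical contents).
sample_bundle = json.loads("""
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"resourceType": "Condition", "id": "c1"}},
    {"resource": {"resourceType": "Condition", "id": "c2"}}
  ]
}
""")
counts = resource_type_counts(sample_bundle)
```

For a real file, the same function can be applied to the result of `json.load` on the opened file.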

  8. Synthea synthetic patient data for stroke risk prediction machine learning

    • dataverse.harvard.edu
    Updated Nov 13, 2022
    Cite
    AJ Chen (2022). Synthea synthetic patient data for stroke risk prediction machine learning [Dataset]. http://doi.org/10.7910/DVN/EXVWQY
    Dataset updated
    Nov 13, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    AJ Chen
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.7910/DVN/EXVWQY

    Description

    This dataset contains Synthea synthetic patient data used in building ML models for stroke risk prediction. The ML models are used to simulate ML-enabled LHS. See the first LHS simulation paper published in Nature Scientific Reports. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data.

  9. Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk...

    • catalog.data.gov
    • data.va.gov
    • +1more
    Updated Apr 25, 2021
    Cite
    Department of Veterans Affairs (2021). Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge [Dataset]. https://catalog.data.gov/dataset/synthetic-cohort-for-vha-innovation-ecosystem-and-precisionfda-covid-19-risk-factor-modeli
    Dataset updated
    Apr 25, 2021
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Description

    The dataset is a synthetic cohort for use in the VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge. The dataset was generated using Synthea, a tool created by MITRE to generate synthetic electronic health records (EHRs) from curated care maps and publicly available statistics. This dataset represents 147,451 patients developed using the COVID-19 module. The dataset format conforms to the CSV file outputs. Below are links to all relevant information.

    • PrecisionFDA Challenge: https://precision.fda.gov/challenges/11
    • Synthea homepage: https://synthetichealth.github.io/synthea/
    • Synthea GitHub repository: https://github.com/synthetichealth/synthea
    • Synthea COVID-19 Module publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7531559/
    • CSV File Format Data Dictionary: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary

  10. PhenoMan Evaluation - Patient FHIR Resources

    • health-atlas.de
    • health-atlas.eu
    application/gzip
    Updated Jul 3, 2020
    Cite
    Alexandr Uciteli; Christoph Beger (2020). PhenoMan Evaluation - Patient FHIR Resources [Dataset]. https://www.health-atlas.de/data_files/284
    Available download formats: application/gzip (2.73 MB)
    Dataset updated
    Jul 3, 2020
    Authors
    Alexandr Uciteli; Christoph Beger
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This data file contains FHIR bundles of patient resources, which were used for the evaluation of the PhenoMan. Originally the patient data were generated with Synthea(TM) and truncated to reduce overall size and import times into a HAPI FHIR JPA Server.

    This data file contains 66,018 patients.

  11. Medical records of 100 Synthea patients

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Medical records of 100 Synthea patients [Dataset]. http://doi.org/10.7910/DVN/VBEKZO
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    The dataset has 1 population of 100 Synthea synthetic patients generated by the Synthea tool. After unzipping, the original medical records are in CSV files. Synthea domain records are similar to those in a real EMR, including patients, encounters, conditions (diagnoses), observations, medications, and procedures.

  12. Semantic Triples from "A Collaborative, Realism-Based, Electronic Healthcare...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Mark Andrew Miller; Chirstian Stoeckert (2020). Semantic Triples from "A Collaborative, Realism-Based, Electronic Healthcare Graph: Public Data, Common Data Models, and Practical Instantiation" [Dataset]. http://doi.org/10.5281/zenodo.3352397
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mark Andrew Miller; Chirstian Stoeckert
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    These RDF triples are the result of modeling electronic health care records synthesized with Synthea software and can be loaded into a triplestore. The following abstract comes from a paper, describing the semantic instantiation process, and submitted to the ICBO 2019 conference.

    ABSTRACT: There is ample literature on the semantic modeling of biomedical data in general, but less has been published on realism-based, semantic instantiation of electronic health records (EHR). Reasons include difficult design decisions and issues of data governance. A collaborative approach can address design and technology utilization issues, but is especially constrained by limited access to the data at hand: protected health information.

    Effective collaboration can be facilitated by public EHR-like data sets, which would ideally include a large variety of datatypes mirroring actual EHRs and enough records to drive a performance assessment. An investment into reading public EHR-like data from a popular common data model (CDM) is preferable over reading each public data set’s native format.

    In addition to identifying suitable public EHR-like data sets and CDMs, this paper addresses instantiation via relational-to-RDF mapping. The completed instantiation is available for download, and a competency question demonstrates fidelity across all discussed formats.

  13. PhenoMan Evaluation - Condition FHIR Resources

    • health-atlas.de
    • health-atlas.eu
    application/gzip
    Updated Jul 3, 2020
    Cite
    Alexandr Uciteli; Christoph Beger (2020). PhenoMan Evaluation - Condition FHIR Resources [Dataset]. https://www.health-atlas.de/data_files/286
    Available download formats: application/gzip (4.26 MB)
    Dataset updated
    Jul 3, 2020
    Authors
    Alexandr Uciteli; Christoph Beger
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This data file contains FHIR bundles of condition resources, which were used for the evaluation of the PhenoMan. Originally the condition data were generated with Synthea(TM) and truncated to reduce overall size and import times into a HAPI FHIR JPA Server. Please import the patient resources prior to the conditions.

    This data file contains 139,763 conditions.

  14. Synthetic Data Generated by initial student interns at WEHI

    • figshare.com
    txt
    Updated Jul 30, 2023
    Cite
    Ya Cho; Shouyi Yuan; Ryan Li (2023). Synthetic Data Generated by initial student interns at WEHI [Dataset]. http://doi.org/10.6084/m9.figshare.23805087.v1
    Available download formats: txt
    Dataset updated
    Jul 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ya Cho; Shouyi Yuan; Ryan Li
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Synthetic Data Generated by initial student interns at WEHI using Synthea

  15. PhenoMan Evaluation - AllergyIntolerance FHIR Resources

    • health-atlas.de
    • health-atlas.eu
    application/gzip
    Updated Jul 3, 2020
    Cite
    Alexandr Uciteli; Christoph Beger (2020). PhenoMan Evaluation - AllergyIntolerance FHIR Resources [Dataset]. https://www.health-atlas.de/data_files/285
    Available download formats: application/gzip (26.1 KB)
    Dataset updated
    Jul 3, 2020
    Authors
    Alexandr Uciteli; Christoph Beger
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This data file contains FHIR bundles of allergy intolerance resources, which were used for the evaluation of the PhenoMan. Originally the allergy intolerance data were generated with Synthea(TM) and truncated to reduce overall size and import times into a HAPI FHIR JPA Server. Please import the patient resources prior to the allergy intolerances.

    This data file contains 563 allergy intolerances.
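The PhenoMan condition and allergy intolerance files both ask that patient resources be imported first. A minimal way to enforce that ordering before loading a batch of files is sketched below; the file names are hypothetical, and matching on the word "patient" in the name is an assumption about how the downloads are labelled.

```python
def import_order(file_names, first_marker="patient"):
    """Put files whose name contains `first_marker` first, keeping the rest in their given order."""
    # sorted() is stable, so files with the same key keep their relative order.
    return sorted(file_names, key=lambda name: first_marker not in name.lower())

# Hypothetical local file names for the three PhenoMan downloads.
files = ["conditions.json.gz", "patients.json.gz", "allergy_intolerances.json.gz"]
ordered = import_order(files)
```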

  16. Cynthia Data - synthetic EHR records

    • kaggle.com
    Updated Jan 24, 2025
    Cite
    Craig Calderone (2025). Cynthia Data - synthetic EHR records [Dataset]. https://www.kaggle.com/datasets/craigcynthiaai/cynthia-data-synthetic-ehr-records/code
    Dataset updated
    Jan 24, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Craig Calderone
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Description: This dataset contains 5 sample PDF Electronic Health Records (EHRs), generated as part of a synthetic healthcare data project. The purpose of this dataset is to assist with sales distribution, offering potential users and stakeholders a glimpse of how synthetic EHRs can look and function. These records have been crafted to mimic realistic admission data while ensuring privacy and compliance with all data protection regulations.

    Key Features:

    1. Synthetic Data: Entirely artificial data created for testing and demonstration purposes.
    2. PDF Format: Records are presented in PDF format, commonly used in healthcare systems.
    3. Diverse Use Cases: Useful for evaluating tools related to data parsing, machine learning in healthcare, or EHR management systems.
    4. Rich Admission Details: Includes admission-related data that highlights the capabilities of synthetic EHR generation.

    Potential Use Cases:

    • Demonstrating EHR-related tools or services.
    • Benchmarking data parsing models for PDF health records.
    • Showcasing synthetic healthcare data in sales or marketing efforts.

    Feel free to use this dataset for non-commercial testing and demonstration purposes. Feedback and suggestions for improvements are always welcome!

  17. synthia-rand-cityscape

    • kaggle.com
    Updated Nov 26, 2021
    Cite
    Pengwei Li (2021). synthia-rand-cityscape [Dataset]. https://www.kaggle.com/datasets/pengweili/synthiarandcityscape/discussion
    Dataset updated
    Nov 26, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pengwei Li
    Description

    This dataset was created by Pengwei Li

  18. SYNTHIA → Cityscapes - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). SYNTHIA → Cityscapes - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/synthia---cityscapes
    Dataset updated
    Dec 2, 2024
    Description

    The SYNTHIA dataset is a synthetic dataset for semantic segmentation, and the Cityscapes dataset is a real-world dataset for semantic segmentation.

  19. Synthia-v1.5-II

    • huggingface.co
    Updated Oct 7, 2024
    Cite
    Migel Tissera (2024). Synthia-v1.5-II [Dataset]. https://huggingface.co/datasets/migtissera/Synthia-v1.5-II
    Dataset updated
    Oct 7, 2024
    Authors
    Migel Tissera
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    migtissera/Synthia-v1.5-II dataset hosted on Hugging Face and contributed by the HF Datasets community
