2 datasets found
  1. d

    Synthea lung cancer synthetic patient data series for ML

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chen, AJ (2023). Synthea lung cancer synthetic patient data series for ML [Dataset]. http://doi.org/10.7910/DVN/Q5LK5A
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction in simulation of ML-enabled learning health systems. Five populations of 30K patients were generated by the Synthea patient generator. They were combined sequentially to form 5 different size populations, from 30K to 150K patients. Patients with or without lung cancer were selected roughly at 1:3 ratio and their electronic health records (EHR) were processed to data table files ready for machine learning. The ML-ready table files also have the continuous numeric values converted to categorical values. Because Synthea patients are closely resemble to real patients, these ML-ready dataset can be used to develop and test ML algorithms, and train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first use of these datasets was in a LHS simulation study, which was published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).

  2. m

    Synthetic Synthea patient datasets for lung cancer risk prediction machine...

    • data.mendeley.com
    Updated Oct 31, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anjun Chen (2022). Synthetic Synthea patient datasets for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.17632/b24cb4nn8h.1
    Explore at:
    Dataset updated
    Oct 31, 2022
    Authors
    Anjun Chen
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction and simulation study of learning health systems.

    1. In subfolder "unconverted": Five populations of 30K patients were generated by the Synthea patient generator. About 1100 lung cancer patients and 3000 control patients (without lung cancer) were selected and their electronic health records (EHR) were processed to data table files ready for machine learning using common algorithms like XGBoost.

    2. In root directory: The five 30K-patient datasets were combined sequentially to form 5 different size datasets, from 30K to 150K patients. The new datasets were resampled to keep all lung cancer patients plus about 3x control patients. The ML-ready table files also had the continuous numeric values converted to categorical values.

    Because Synthea patients are closely resemble real patients, the Synthea patient data can be used to develop and test ML algorithms and pipelines, and train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns.

    The first LHS simulation study titled "Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data" has been published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).

  3. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Chen, AJ (2023). Synthea lung cancer synthetic patient data series for ML [Dataset]. http://doi.org/10.7910/DVN/Q5LK5A

Synthea lung cancer synthetic patient data series for ML

Explore at:
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Chen, AJ
Description

These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction in simulation of ML-enabled learning health systems. Five populations of 30K patients were generated by the Synthea patient generator. They were combined sequentially to form 5 different size populations, from 30K to 150K patients. Patients with or without lung cancer were selected roughly at 1:3 ratio and their electronic health records (EHR) were processed to data table files ready for machine learning. The ML-ready table files also have the continuous numeric values converted to categorical values. Because Synthea patients are closely resemble to real patients, these ML-ready dataset can be used to develop and test ML algorithms, and train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first use of these datasets was in a LHS simulation study, which was published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).

Search
Clear search
Close search
Google apps
Main menu