37 datasets found

Synthea synthetic patient generator data in OMOP Common Data Model
registry.opendata.aws
Updated Jan 4, 2023
Cite
Amazon Web Services (2023). Synthea synthetic patient generator data in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/synthea-omop/
Dataset updated
Jan 4, 2023
Dataset provided by
Amazon.com (http://amazon.com/)
Description
The Synthea generated data is provided here as 1,000-person (1k), 100,000-person (100k), and 2,800,000-person (2.8m) data sets in the OMOP Common Data Model format. Synthea(TM) is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated). You can read our first academic paper here: https://doi.org/10.1093/jamia/ocx079
Synthea Generated Synthetic Data in FHIR
console.cloud.google.com
Updated Jun 10, 2020
Cite
The MITRE Corporation (2020). Synthea Generated Synthetic Data in FHIR [Dataset]. https://console.cloud.google.com/marketplace/product/mitre/synthea-fhir
Dataset updated
Jun 10, 2020
Dataset authored and provided by
The MITRE Corporation (https://www.mitre.org/)
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
The Synthea Generated Synthetic Data in FHIR dataset hosts over 1 million synthetic patient records generated using Synthea in FHIR format. The records were exported from the Google Cloud Healthcare API FHIR Store into BigQuery using the analytics schema. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing: each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. The dataset is also available in Google Cloud Storage and is free to use. The URL for the GCS bucket is gs://gcp-public-data--synthea-fhir-data-1m-patients. Please cite Synthea(TM) as: Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079
Medical records of 30K Synthea synthetic patients
search.dataone.org
Updated Nov 8, 2023
Cite
Chen, AJ (2023). Medical records of 30K Synthea synthetic patients [Dataset]. http://doi.org/10.7910/DVN/BWDKXS
Unique identifier
https://doi.org/10.7910/DVN/BWDKXS
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Chen, AJ
Description
The dataset has 2 populations of Synthea synthetic patients generated by the Synthea tool. Each population has 15K patients with original medical records in CSV files. Because the total file size of each population is >3GB, the files are compressed into a zip file. Synthea records cover domains similar to those in a real EMR, including patients, encounters, conditions (diagnoses), observations, medications, and procedures. The data was first used in building ML models for lung cancer risk prediction. For more information, see the paper published in Nature Scientific Reports (https://www.nature.com/articles/s41598-022-23011-4).
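The per-domain CSV layout described above lends itself to a simple join on the patient identifier. The following is a minimal sketch, assuming the column names of the standard Synthea CSV export (`Id` in the patients file, `PATIENT` and `DESCRIPTION` in the conditions file); the inline rows are made-up placeholders, not records from this dataset, and the helper name `conditions_by_patient` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only; column names are assumed to
# match the standard Synthea CSV export and should be checked against
# the actual files.
PATIENTS_CSV = """Id,BIRTHDATE,GENDER
p1,1950-02-01,M
p2,1984-07-15,F
"""

CONDITIONS_CSV = """START,STOP,PATIENT,CODE,DESCRIPTION
2010-01-05,,p1,0001,Example condition A
2015-03-09,,p1,0002,Example condition B
2019-11-20,,p2,0001,Example condition A
"""


def conditions_by_patient(patients_file, conditions_file):
    """Group condition descriptions by patient Id, keeping known patients only."""
    patients = {row["Id"]: row for row in csv.DictReader(patients_file)}
    grouped = defaultdict(list)
    for row in csv.DictReader(conditions_file):
        if row["PATIENT"] in patients:
            grouped[row["PATIENT"]].append(row["DESCRIPTION"])
    return dict(grouped)


result = conditions_by_patient(
    io.StringIO(PATIENTS_CSV), io.StringIO(CONDITIONS_CSV)
)
```

With the real files, the two `io.StringIO` objects would be replaced by file handles opened on the unzipped patients and conditions CSVs.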
Synthea synthetic patient data for lung cancer risk prediction machine...
dataverse.harvard.edu
Updated Nov 13, 2022
Cite
AJ Chen (2022). Synthea synthetic patient data for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.7910/DVN/GD5XWE
Unique identifier
https://doi.org/10.7910/DVN/GD5XWE
Dataset updated
Nov 13, 2022
Dataset provided by
Harvard Dataverse
Authors
AJ Chen
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.7910/DVN/GD5XWE
Description
This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.
Synthea lung cancer synthetic patient data series for ML
search.dataone.org
Updated Nov 8, 2023
Cite
Chen, AJ (2023). Synthea lung cancer synthetic patient data series for ML [Dataset]. http://doi.org/10.7910/DVN/Q5LK5A
Unique identifier
https://doi.org/10.7910/DVN/Q5LK5A
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Chen, AJ
Description
These synthetic patient datasets were created for a machine learning (ML) study of lung cancer risk prediction in a simulation of ML-enabled learning health systems (LHS). Five populations of 30K patients were generated by the Synthea patient generator. They were combined sequentially to form 5 populations of different sizes, from 30K to 150K patients. Patients with and without lung cancer were selected at a ratio of roughly 1:3, and their electronic health records (EHR) were processed into data table files ready for machine learning. In the ML-ready table files, continuous numeric values have also been converted to categorical values. Because Synthea patients closely resemble real patients, these ML-ready datasets can be used to develop and test ML algorithms and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first use of these datasets was in an LHS simulation study, which was published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
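The two preprocessing steps mentioned in this description, converting continuous values to categories and selecting cases and controls at roughly 1:3, can be sketched as follows. This is only an illustration under stated assumptions: the cut points in `BMI_BINS`, the field name `lung_cancer`, and both function names are hypothetical and are not taken from the dataset or its paper.

```python
import random

# Hypothetical cut points for illustration; the bins actually used to
# build the ML-ready tables are not stated in the description.
BMI_BINS = [(18.5, "underweight"), (25.0, "normal"), (30.0, "overweight")]


def to_category(value, bins, top_label):
    """Return the label of the first bin whose upper bound exceeds value."""
    for upper, label in bins:
        if value < upper:
            return label
    return top_label


def sample_one_to_three(patients, has_condition, seed=0):
    """Keep every case and draw up to three controls per case, without replacement."""
    cases = [p for p in patients if has_condition(p)]
    controls = [p for p in patients if not has_condition(p)]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_controls = min(len(controls), 3 * len(cases))
    return cases + rng.sample(controls, n_controls)


# Toy population: 5 cases among 100 patients.
patients = [{"id": i, "lung_cancer": i < 5} for i in range(100)]
selected = sample_one_to_three(patients, lambda p: p["lung_cancer"])
```

Simple random sampling of controls is an assumption here; the source only says the selection was "roughly" 1:3 and does not describe the method.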
Australian synthetic healthcare data with Synthea
researchdata.edu.au
datadownload
Updated Jul 4, 2024
Cite
John Grimes; Michael Lawley; Roc Reguant Comellas; Sankalp Khanna; David Hansen; Denis Bauer; Parnesh Raniga; Hoa Ngo; Donna Truran; Hamed Hassanzadeh; Mitchell O'Brien; Ibrahima Diouf (2024). Australian synthetic healthcare data with Synthea [Dataset]. https://researchdata.edu.au/australian-synthetic-healthcare-synthea/3378771
Available download formats: datadownload
Dataset updated
Jul 4, 2024
Dataset provided by
CSIRO (http://www.csiro.au/)
Authors
John Grimes; Michael Lawley; Roc Reguant Comellas; Sankalp Khanna; David Hansen; Denis Bauer; Parnesh Raniga; Hoa Ngo; Donna Truran; Hamed Hassanzadeh; Mitchell O'Brien; Ibrahima Diouf
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Australia
Description
We developed an Australianised version of Synthea. Synthea is synthetic data generation software that uses publicly available population aggregate statistics such as demographics, disease prevalence and incidence rates, and health reports. Synthea generates data based on manually curated models of clinical workflows and disease progression that cover a patient’s entire life. It does not use real patient data, which guarantees a completely synthetic dataset. We generated 117,258 synthetic patients from Queensland.
Medical records of 100 Synthea patients
search.dataone.org
Updated Nov 8, 2023
Cite
Chen, AJ (2023). Medical records of 100 Synthea patients [Dataset]. http://doi.org/10.7910/DVN/VBEKZO
Unique identifier
https://doi.org/10.7910/DVN/VBEKZO
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Chen, AJ
Description
The dataset has 1 population of 100 Synthea synthetic patients generated by the Synthea tool. After unzipping, the original medical records are in CSV files. Synthea records cover domains similar to those in a real EMR, including patients, encounters, conditions (diagnoses), observations, medications, and procedures.
Synthetic Suicide Prevention Dataset with SDoH
catalog.data.gov
datahub.va.gov
Updated Jun 2, 2025
Cite
Department of Veterans Affairs (2025). Synthetic Suicide Prevention Dataset with SDoH [Dataset]. https://catalog.data.gov/dataset/synthetic-suicide-prevention-dataset-with-sdoh
Dataset updated
Jun 2, 2025
Dataset provided by
United States Department of Veterans Affairs (http://va.gov/)
Description
The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.
Synthea synthetic patient data for stroke risk prediction machine learning
dataverse.harvard.edu
Updated Nov 13, 2022
Cite
AJ Chen (2022). Synthea synthetic patient data for stroke risk prediction machine learning [Dataset]. http://doi.org/10.7910/DVN/EXVWQY
Unique identifier
https://doi.org/10.7910/DVN/EXVWQY
Dataset updated
Nov 13, 2022
Dataset provided by
Harvard Dataverse
Authors
AJ Chen
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.7910/DVN/EXVWQY
Description
This dataset contains Synthea synthetic patient data used in building ML models for stroke risk prediction. The ML models are used to simulate ML-enabled LHS. See the first LHS simulation paper published in Nature Scientific Reports. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data.
Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk...
catalog.data.gov
data.va.gov
Updated Apr 25, 2021
Cite
Department of Veterans Affairs (2021). Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge [Dataset]. https://catalog.data.gov/dataset/synthetic-cohort-for-vha-innovation-ecosystem-and-precisionfda-covid-19-risk-factor-modeli
Dataset updated
Apr 25, 2021
Dataset provided by
United States Department of Veterans Affairs (http://va.gov/)
Description
The dataset is a synthetic cohort for use in the VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge. The dataset was generated using Synthea, a tool created by MITRE to generate synthetic electronic health records (EHRs) from curated care maps and publicly available statistics. This dataset represents 147,451 patients developed using the COVID-19 module. The dataset format conforms to the Synthea CSV file outputs. Links to all relevant information follow.
PrecisionFDA Challenge: https://precision.fda.gov/challenges/11
Synthea homepage: https://synthetichealth.github.io/synthea/
Synthea GitHub repository: https://github.com/synthetichealth/synthea
Synthea COVID-19 Module publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7531559/
CSV File Format Data Dictionary: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary
Semantic Triples from "A Collaborative, Realism-Based, Electronic Healthcare...
zenodo.org
zip
Updated Jan 24, 2020
Cite
Mark Andrew Miller; Chirstian Stoeckert (2020). Semantic Triples from "A Collaborative, Realism-Based, Electronic Healthcare Graph: Public Data, Common Data Models, and Practical Instantiation" [Dataset]. http://doi.org/10.5281/zenodo.3358854
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.3358854
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mark Andrew Miller; Chirstian Stoeckert
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
These RDF triples (synthea_graph_exportable.nq.zip) are the result of modeling electronic health records (synthea_csv_output_turbo_cannonical.zip) that were synthesized with the Synthea software (https://github.com/synthetichealth/synthea). Anyone who loads them into a triplestore database is encouraged to provide feedback at https://github.com/PennTURBO/EhrGraphCollab/issues. The following abstract comes from a paper that describes the semantic instantiation process and was presented at the ICBO 2019 conference (https://drive.google.com/file/d/1eYXTBl75Wx3XPMmCIOZba-8Cv0DIhlRq/view).

ABSTRACT: There is ample literature on the semantic modeling of biomedical data in general, but less has been published on realism-based, semantic instantiation of electronic health records (EHR). Reasons include difficult design decisions and issues of data governance. A collaborative approach can address design and technology utilization issues, but is especially constrained by limited access to the data at hand: protected health information.

Effective collaboration can be facilitated by public EHR-like data sets, which would ideally include a large variety of datatypes mirroring actual EHRs and enough records to drive a performance assessment. An investment into reading public EHR-like data from a popular common data model (CDM) is preferable over reading each public data set’s native format.

In addition to identifying suitable public EHR-like data sets and CDMs, this paper addresses instantiation via relational-to-RDF mapping. The completed instantiation is available for download, and a competency question demonstrates fidelity across all discussed formats.
Synthea OMOP (CDM) - North East and North Cumbria
healthdatagateway.org
unknown
Updated Jun 17, 2025
Cite
(2025). Synthea OMOP (CDM) - North East and North Cumbria [Dataset]. https://healthdatagateway.org/en/dataset/1351
Available download formats: unknown
Dataset updated
Jun 17, 2025
License
https://northeastnorthcumbria.nhs.uk/our-work/secure-data-environment/
Description
Synthetic Primary Care Data (Synthea) transformed into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM)

Data is sourced from https://synthea.mitre.org/downloads using the 100-sample-patient CSV variant of the available downloads. The data has been transformed using the ETL methods described at https://github.com/OHDSI/ETL-Synthea

This is a patient-level dataset of Primary Care data covering 100 synthetic patients.
10,000 Synthetic Medicare Patient Records
search.dataone.org
Updated Nov 22, 2023
Cite
Hall, Dylan (2023). 10,000 Synthetic Medicare Patient Records [Dataset]. http://doi.org/10.7910/DVN/QDXLWR
Unique identifier
https://doi.org/10.7910/DVN/QDXLWR
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Hall, Dylan
Description
This dataset contains 10,000 synthetic patient records representing a scaled-down US Medicare population. The records were generated by Synthea (https://github.com/synthetichealth/synthea) and are completely synthetic and contain no real patient data. This data is presented free of cost and free of restrictions. Each record is stored as one file in HL7 FHIR R4 (https://www.hl7.org/fhir/) containing one Bundle, in JSON. For more information on how this specific population was created, or to generate your own at any scale, see: https://github.com/synthetichealth/populations/tree/master/medicare
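Since each record is one FHIR Bundle in JSON, a first look at a file usually means walking the Bundle's `entry` array and reading each `resource`. The sketch below counts resource types in one Bundle. The function name `count_resource_types` is hypothetical, and `SAMPLE_BUNDLE` is a tiny hand-written stand-in, not a record from this dataset.

```python
import json
from collections import Counter


def count_resource_types(bundle_json):
    """Count FHIR resource types in one Bundle given as a JSON string."""
    bundle = json.loads(bundle_json)
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a FHIR Bundle")
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
        if "resource" in entry
    )


# Hand-written stand-in Bundle for illustration only.
SAMPLE_BUNDLE = json.dumps({
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
        {"resource": {"resourceType": "Condition", "id": "c1"}},
        {"resource": {"resourceType": "Condition", "id": "c2"}},
    ],
})

counts = count_resource_types(SAMPLE_BUNDLE)
```

For a real file, the JSON string would come from reading one patient file from disk; the rest of the function stays the same.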
PhenoMan Evaluation - Observation FHIR Resources
health-atlas.de
application/gzip
Updated Jul 3, 2020
Cite
Alexandr Uciteli; Christoph Beger (2020). PhenoMan Evaluation - Observation FHIR Resources [Dataset]. https://www.health-atlas.de/data_files/287
Available download formats: application/gzip (124 MB)
Dataset updated
Jul 3, 2020
Authors
Alexandr Uciteli; Christoph Beger
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This data file contains FHIR bundles of observation resources that were used for the evaluation of PhenoMan. The observation data were originally generated with Synthea(TM) and then truncated to reduce overall size and import times into a HAPI FHIR JPA Server. Please import the patient resources prior to the observations.

This data file contains 8,026,380 observations.
Semantic Triples from "A Collaborative, Realism-Based, Electronic Healthcare...
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 24, 2020
Cite
Mark Andrew Miller; Chirstian Stoeckert (2020). Semantic Triples from "A Collaborative, Realism-Based, Electronic Healthcare Graph: Public Data, Common Data Models, and Practical Instantiation" [Dataset]. http://doi.org/10.5281/zenodo.2641233
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.2641233
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mark Andrew Miller; Chirstian Stoeckert
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
These RDF triples are the result of modeling electronic health care records synthesized with the Synthea software; they can be loaded into a triplestore. The following abstract comes from a paper that describes the semantic instantiation process and was submitted to the ICBO 2019 conference.

ABSTRACT: There is ample literature on the semantic modeling of biomedical data in general, but less has been published on realism-based, semantic instantiation of electronic health records (EHR). Reasons include difficult design decisions and issues of data governance. A collaborative approach can address design and technology utilization issues, but is especially constrained by limited access to the data at hand: protected health information.

Effective collaboration can be facilitated by public EHR-like data sets, which would ideally include a large variety of datatypes mirroring actual EHRs and enough records to drive a performance assessment. An investment into reading public EHR-like data from a popular common data model (CDM) is preferable over reading each public data set’s native format.

In addition to identifying suitable public EHR-like data sets and CDMs, this paper addresses instantiation via relational-to-RDF mapping. The completed instantiation is available for download, and a competency question demonstrates fidelity across all discussed formats.
Synthea stroke synthetic patient data series for risk prediction ML
dataone.org
Updated Nov 8, 2023
Cite
Chen, AJ (2023). Synthea stroke synthetic patient data series for risk prediction ML [Dataset]. http://doi.org/10.7910/DVN/LBD9GU
Unique identifier
https://doi.org/10.7910/DVN/LBD9GU
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Chen, AJ
Description
These synthetic patient datasets were created for a machine learning (ML) study of stroke risk prediction. Five populations of 30K patients were generated by the Synthea patient generator. They were combined sequentially to form 5 populations of different sizes, from 30K to 150K patients. Patients with and without stroke were selected at a ratio of roughly 1:3, and their electronic health records (EHR) were processed into data table files ready for machine learning. In the ML-ready table files, continuous numeric values have also been converted to categorical values. Because Synthea patients closely resemble real patients, these ML-ready datasets can be used to develop and test ML algorithms and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first use of these datasets was in an LHS simulation study, which was published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
PhenoMan Evaluation - Condition FHIR Resources
health-atlas.de
application/gzip
Updated Jul 3, 2020
Cite
Alexandr Uciteli; Christoph Beger (2020). PhenoMan Evaluation - Condition FHIR Resources [Dataset]. https://www.health-atlas.de/data_files/286
Available download formats: application/gzip (4.26 MB)
Dataset updated
Jul 3, 2020
Authors
Alexandr Uciteli; Christoph Beger
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This data file contains FHIR bundles of condition resources that were used for the evaluation of PhenoMan. The condition data were originally generated with Synthea(TM) and then truncated to reduce overall size and import times into a HAPI FHIR JPA Server. Please import the patient resources prior to the conditions.

This data file contains 139,763 conditions.
Synthetic Data Generated by initial student interns at WEHI
figshare.com
txt
Updated Jul 30, 2023
Cite
Ya Cho; Shouyi Yuan; Ryan Li (2023). Synthetic Data Generated by initial student interns at WEHI [Dataset]. https://figshare.com/articles/dataset/Synthetic_Data_Generated_by_initial_student_interns_at_WEHI/23805087
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.23805087.v1
Dataset updated
Jul 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ya Cho; Shouyi Yuan; Ryan Li
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Synthetic Data Generated by initial student interns at WEHI using Synthea
PhenoMan Evaluation - AllergyIntolerance FHIR Resources
health-atlas.de
application/gzip
Updated Jul 3, 2020
Cite
Alexandr Uciteli; Christoph Beger (2020). PhenoMan Evaluation - AllergyIntolerance FHIR Resources [Dataset]. https://www.health-atlas.de/data_files/285
Available download formats: application/gzip (26.1 KB)
Dataset updated
Jul 3, 2020
Authors
Alexandr Uciteli; Christoph Beger
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This data file contains FHIR bundles of allergy intolerance resources that were used for the evaluation of PhenoMan. The allergy intolerance data were originally generated with Synthea(TM) and then truncated to reduce overall size and import times into a HAPI FHIR JPA Server. Please import the patient resources prior to the allergy intolerances.

This data file contains 563 allergy intolerances.
Synthetic EHRs for Benchmarking System Performance
kaggle.com
Updated Apr 29, 2025
Cite
Ahmed Abbasi (2025). Synthetic EHRs for Benchmarking System Performance [Dataset]. http://doi.org/10.34740/kaggle/dsv/11614066
Unique identifier
https://doi.org/10.34740/kaggle/dsv/11614066
Dataset updated
Apr 29, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ahmed Abbasi
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset contains FHIR-compatible Electronic Health Records (EHR) generated using the Synthea synthetic patient generator. It is specifically designed to benchmark the performance of a blockchain-based EHR solution, NexaEHR, which utilizes smart contracts and IPFS for data storage and management.

The dataset includes 1000 EHR records (files), each representing a separate synthetic record for a persona, with varying sizes. The largest record is approximately 80 MB, simulating the average record size a patient might accumulate annually. These records are intended for testing and evaluating the scalability, efficiency, and effectiveness of blockchain technology in managing and securing healthcare data within a decentralized system.
