Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Electronic Health Record Predicting dataset was collected from a private hospital in Indonesia. It contains patients' laboratory test results, which are used to determine whether a patient's next treatment is inpatient or outpatient care. The task associated with the dataset is classification.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A database of 100,000 virtual patients, with 361,760 admissions and 107,535,387 lab observations in total.
EHR-RelB is a benchmark dataset for biomedical concept relatedness, consisting of 3630 concept pairs sampled from electronic health records (EHRs). EHR-RelA is a smaller dataset of 111 concept pairs, which are mainly unrelated.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These data are modelled using the OMOP Common Data Model v5.3.
Correlated Data Source
NG tube vocabularies
Generation Rules
* The patient's age should be between 18 and 100 at the moment of the visit.
* Ethnicity data uses 2021 census data in England and Wales (Census in England and Wales 2021).
* Gender is equally distributed between Male and Female (50% each).
* Every person in the record has a link in procedure_occurrence with the concept "Checking the position of nasogastric tube using X-ray".
* 2% of person records have a link in procedure_occurrence with the concept "Plain chest X-ray".
* 60% of visit_occurrence records have the visit concept "Inpatient Visit", while 40% have "Emergency Room Visit".
Notes
* Version 0
* Generated by man-made rule/story generator
* Structurally correct, all tables linked with the relationship
* We used national ethnicity data to generate a realistic distribution (see below)
2011 Race Census figure in England and Wales
Ethnic Group : Population (%)
* Asian or Asian British: Bangladeshi - 1.1
* Asian or Asian British: Chinese - 0.7
* Asian or Asian British: Indian - 3.1
* Asian or Asian British: Pakistani - 2.7
* Asian or Asian British: any other Asian background - 1.6
* Black or African or Caribbean or Black British: African - 2.5
* Black or African or Caribbean or Black British: Caribbean - 1
* Black or African or Caribbean or Black British: other Black or African or Caribbean background - 0.5
* Mixed multiple ethnic groups: White and Asian - 0.8
* Mixed multiple ethnic groups: White and Black African - 0.4
* Mixed multiple ethnic groups: White and Black Caribbean - 0.9
* Mixed multiple ethnic groups: any other Mixed or multiple ethnic background - 0.8
* White: English or Welsh or Scottish or Northern Irish or British - 74.4
* White: Irish - 0.9
* White: Gypsy or Irish Traveller - 0.1
* White: any other White background - 6.4
* Other ethnic group: any other ethnic group - 1.6
* Other ethnic group: Arab - 0.6
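The generation rules for age, gender, visit type, and procedures can be illustrated with a minimal Python sketch. This is not the dataset's actual generator: the function names, the dictionary layout, and the assumption of exactly one visit per person are all hypothetical, and ethnicity sampling is omitted for brevity.

```python
import random

NG_TUBE_CHECK = "Checking the position of nasogastric tube using X-ray"
CHEST_XRAY = "Plain chest X-ray"


def exact_split(n, labels_with_share, rng):
    """Return n labels whose counts match the requested shares, in random order."""
    labels = []
    for label, share in labels_with_share:
        labels.extend([label] * round(n * share))
    # Pad or trim with the first label so rounding never changes the total.
    labels = (labels + [labels_with_share[0][0]] * n)[:n]
    rng.shuffle(labels)
    return labels


def generate_patients(n, seed=0):
    """Hypothetical generator: one person, one visit, rules as listed above."""
    rng = random.Random(seed)
    genders = exact_split(n, [("Male", 0.5), ("Female", 0.5)], rng)
    visits = exact_split(
        n, [("Inpatient Visit", 0.6), ("Emergency Room Visit", 0.4)], rng
    )
    # 2% of persons additionally receive a plain chest X-ray.
    xray_ids = set(rng.sample(range(n), round(n * 0.02)))
    patients = []
    for i in range(n):
        procedures = [NG_TUBE_CHECK]  # every person has this procedure
        if i in xray_ids:
            procedures.append(CHEST_XRAY)
        patients.append({
            "person_id": i + 1,
            "age_at_visit": rng.randint(18, 100),  # inclusive bounds
            "gender": genders[i],
            "visit_concept": visits[i],
            "procedures": procedures,
        })
    return patients


patients = generate_patients(1000, seed=1)
```

Ethnicity could be drawn in the same spirit with `random.choices`, passing the population percentages as relative weights.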
The INSPECT dataset (Integrating Numerous Sources for Prognostic Evaluation of Clinical Timelines) contains de-identified longitudinal electronic health records (EHRs) from a large cohort of pulmonary embolism (PE) patients, along with ground truth labels for multiple outcomes. It includes EHRs for 19,390 patients linked to 23,248 CTPA studies with paired radiology impressions.
1. Overview
INSPECT is a large-scale 3D multimodal medical imaging dataset:
2. CT Scans + Radiology Impression Notes
Imaging data are available for download from the Stanford AIMI Center.
3. EHR Data
EHR data is sourced from Stanford’s STARR-OMOP database. Data are standardized in the OMOP CDM schema and are fully de-identified. Complete technical details are included in the paper, but key highlights:
Please see our GitHub repo to obtain code for loading the dataset, including a full data preprocessing pipeline for reproducibility, and for running a set of pretrained baseline models.
Access to the INSPECT dataset requires the following:
**These data must remain on your encrypted machine. Redistribution of data is FORBIDDEN and will result in immediate termination of access privileges.**
IMPORTANT NOTES:
Please allow 7-10 business days to process applications.
👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.
⚡️ Quickstart
1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab.
2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab.
⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.
1. 📖 Overview
EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks. The dataset contains:
2. 💽 Dataset
EHRSHOT is sourced from Stanford’s STARR-OMOP database.
We provide two versions of the dataset:
To access the raw data, please see the "Tables" and "Files" tabs above:
3. 💽 Data Files and Formats
We provide EHRSHOT in two file formats:
Within the "Tables" tab...
1. EHRSHOT-OMOP
* Dataset Version: EHRSHOT-OMOP
* Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different than the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.
Within the "Files" tab...
1. EHRSHOT_ASSETS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: FEMR 0.1.16
* Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.
2. EHRSHOT_MEDS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: MEDS 0.3.3
* Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.
3. EHRSHOT_OMOP_MEDS.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Converts the dataset from EHRSHOT-OMOP into MEDS format via the `meds_etl_omop` command from MEDS-ETL.
4. EHRSHOT_OMOP_MEDS_Reader.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.
4. 🤖 Model
We also release the full weights of **CLMBR-T-base**, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base
5. 🧑💻 Code
Please see our GitHub repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/
**NOTE: You must authenticate to Redivis using your formal affiliation's email address. If you use Gmail or other personal email addresses, you will not be granted access.**
Access to the EHRSHOT dataset requires the following:
My HealtheVet (www.myhealth.va.gov) is a Personal Health Record portal designed to improve the delivery of health care services to Veterans, to promote health and wellness, and to engage Veterans as more active participants in their health care. The My HealtheVet portal enables Veterans to create and maintain a web-based PHR that provides access to patient health education information and resources, a comprehensive personal health journal, and electronic services such as online VA prescription refill requests and Secure Messaging. Veterans can visit the My HealtheVet website and self-register to create an account, although registration is not required to view the professionally-sponsored health education resources, including topics of special interest to the Veteran population. Once registered, Veterans can create a customized PHR that is accessible from any computer with Internet access.
Problem Statement
Hospitals and healthcare providers faced challenges in ensuring continuous monitoring of patient vitals, especially for high-risk patients. Traditional monitoring methods often lacked real-time data processing and timely alerts, leading to delayed responses and increased hospital readmissions. The healthcare provider needed a solution to monitor patient health continuously and deliver actionable insights for improved care.
Challenge
Implementing an advanced patient monitoring system involved overcoming several challenges:
Collecting and analyzing real-time data from multiple IoT-enabled medical devices.
Ensuring accurate health insights while minimizing false alarms.
Integrating the system seamlessly with hospital workflows and electronic health records (EHR).
Solution Provided
A comprehensive patient monitoring system was developed using IoT-enabled medical devices and AI-based monitoring systems. The solution was designed to:
Continuously collect patient vital data such as heart rate, blood pressure, oxygen levels, and temperature.
Analyze data in real-time to detect anomalies and provide early warnings for potential health issues.
Send alerts to healthcare professionals and caregivers for timely interventions.
Development Steps
Data Collection
Deployed IoT-enabled devices such as wearable monitors, smart sensors, and bedside equipment to collect patient data continuously.
Preprocessing
Cleaned and standardized data streams to ensure accurate analysis and integration with hospital systems.
AI Model Development
Built machine learning models to analyze vital trends and detect abnormalities in real time.
Validation
Tested the system in controlled environments to ensure accuracy and reliability in detecting health issues.
Deployment
Implemented the solution in hospitals and care facilities, integrating it with EHR systems and alert mechanisms for seamless operation.
Continuous Monitoring & Improvement
Established a feedback loop to refine models and algorithms based on real-world data and healthcare provider feedback.
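To make the idea of real-time anomaly detection on vital signs concrete, here is a minimal Python sketch. It is not the system described in this case study: it swaps in a simple rolling z-score rule in place of the machine learning models mentioned above, and the class name, window size, and threshold are hypothetical choices.

```python
from collections import deque
from statistics import mean, stdev


class VitalMonitor:
    """Flags a reading that deviates strongly from the recent window."""

    def __init__(self, window=5, threshold=3.0):
        self.history = deque(maxlen=window)  # most recent readings only
        self.threshold = threshold           # z-score that triggers an alert

    def observe(self, value):
        """Return True if the new reading looks anomalous, then store it."""
        alert = False
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                alert = True
        self.history.append(value)
        return alert


monitor = VitalMonitor(window=5, threshold=3.0)
heart_rates = [72, 74, 73, 75, 74, 73, 140]  # illustrative values
alerts = [monitor.observe(hr) for hr in heart_rates]
```

A fixed threshold like this trades sensitivity against false alarms, which is the balance the challenge section refers to. A production system would also need per-patient baselines and handling of missing readings.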
Results
Enhanced Patient Care
Real-time monitoring and proactive alerts enabled healthcare professionals to provide timely interventions, improving patient outcomes.
Early Detection of Health Issues
The system detected potential health complications early, reducing the severity of conditions and preventing critical events.
Reduced Hospital Readmissions
Continuous monitoring helped manage patient health effectively, leading to a significant decrease in readmission rates.
Improved Operational Efficiency
Automation and real-time insights reduced the burden on healthcare staff, allowing them to focus on critical cases.
Scalable Solution
The system adapted seamlessly to various healthcare settings, including hospitals, clinics, and home care environments.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
http://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Details: https://github.com/theislab/ehrapy-datasets
This is the original diabetic_data.csv file downloaded from the above link on 18 Mar 2024 under a CC BY 4.0 License. It is stored here for convenience of ehrapy users.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Patient-drug-disease (PDD) Graph dataset, utilising electronic medical records (EMRs) and biomedical knowledge graphs. The novel framework to construct the PDD graph is described in the associated publication.
PDD is an RDF graph consisting of PDD facts, where a PDD fact is represented by an RDF triple to indicate that a patient takes a drug or a patient is diagnosed with a disease. For instance, (pdd:274671, pdd:diagnosed, sepsis).
Data files are in .nt N-Triples format, a line-based syntax for an RDF graph. These can be opened with any openly available text editor.
* diagnose_icd_information.nt contains RDF triples mapping patients to diagnoses. For example: (pdd:18740, pdd:diagnosed, icd99592), where pdd:18740 is a patient entity, and icd99592 is the ICD-9 code of sepsis.
* drug_patients.nt contains RDF triples mapping patients to drugs. For example: (pdd:18740, pdd:prescribed, aspirin), where pdd:18740 is a patient entity, and aspirin is the drug's name.
Background:
Electronic medical records contain multi-format electronic medical data that consist of an abundance of medical knowledge. Faced with patients' symptoms, experienced caregivers make the right medical decisions based on their professional knowledge, which accurately grasps relationships between symptoms, diagnoses and corresponding treatments. In the associated paper, we aim to capture these relationships by constructing a large and high-quality heterogeneous graph linking patients, diseases, and drugs (PDD) in EMRs. Specifically, we propose a novel framework to extract important medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) and automatically link them with the existing biomedical knowledge graphs, including ICD-9 ontology and DrugBank.
The PDD graph presented in this paper is accessible on the Web via the SPARQL endpoint as well as in .nt format in this repository, and provides a pathway for medical discovery and applications, such as effective treatment recommendations.
De-identification
It is necessary to mention that MIMIC-III contains clinical information of patients. Although the protected health information was de-identified, researchers who seek to use more clinical data should complete an online training course and then apply for permission to download the complete MIMIC-III dataset: https://mimic.physionet.org/
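As a rough illustration of how line-based N-Triples files such as these could be read, the following Python sketch groups PDD facts by patient. The `http://example.org/...` IRIs are placeholders, since the dataset's real namespaces are not given here, and the regular expression only handles the simplest case of three IRIs per line. A full RDF library would be the safer choice for real use.

```python
import re
from collections import defaultdict

# Matches the simplest N-Triples form: three IRIs followed by a full stop.
TRIPLE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def local_name(iri):
    """Return the part of an IRI after the last '/' or '#'."""
    return re.split(r'[/#]', iri)[-1]


def group_by_patient(lines):
    """Map patient -> predicate -> set of objects (diagnoses or drugs)."""
    facts = defaultdict(lambda: defaultdict(set))
    for line in lines:
        match = TRIPLE.match(line)
        if match is None:
            continue  # skip blank lines, comments, and triples with literals
        subject, predicate, obj = (local_name(x) for x in match.groups())
        facts[subject][predicate].add(obj)
    return facts


# Hypothetical lines; the IRIs are assumptions made for this example.
sample = [
    "<http://example.org/pdd/18740> <http://example.org/pdd/diagnosed> "
    "<http://example.org/icd/99592> .",
    "<http://example.org/pdd/18740> <http://example.org/pdd/prescribed> "
    "<http://example.org/drug/aspirin> .",
]
facts = group_by_patient(sample)
```

Here `facts["18740"]` collects both the diagnosis and the prescription for the same patient entity, mirroring the two example triples in the description.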
The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.
The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
Problem Statement
Healthcare providers often relied on generalized treatment protocols that may not address the unique needs of individual patients. This approach led to variability in treatment outcomes, reduced efficacy, and limited patient satisfaction. A leading hospital sought a solution to develop personalized treatment plans tailored to each patient’s medical history, genetic profile, and current health status.
Challenge
Implementing a personalized healthcare treatment system involved overcoming the following challenges:
Integrating diverse patient data, including medical history, lab results, genetic information, and lifestyle factors.
Developing predictive models capable of identifying optimal treatment plans for individual patients.
Ensuring compliance with privacy regulations and maintaining data security throughout the process.
Solution Provided
An advanced healthcare treatment recommendation system was developed using machine learning models and predictive analytics. The solution was designed to:
Analyze patient data to identify patterns and predict treatment outcomes.
Recommend individualized treatment plans optimized for efficacy and patient preferences.
Continuously learn and adapt to improve recommendations based on new medical insights and patient feedback.
Development Steps
Data Collection
Aggregated data from electronic health records (EHR), genetic testing reports, and patient-provided health information.
Preprocessing
Standardized and anonymized data to ensure accuracy, consistency, and compliance with healthcare privacy regulations.
Model Development
Trained machine learning models to identify correlations between patient characteristics and treatment outcomes. Developed predictive algorithms to recommend personalized treatment plans for conditions like chronic diseases, cancer, and rare disorders.
Validation
Tested the system on historical patient data to evaluate its accuracy in predicting successful treatment outcomes.
Deployment
Integrated the solution into the hospital’s clinical decision support systems, enabling healthcare providers to access personalized treatment recommendations during consultations.
Continuous Monitoring & Improvement
Established a feedback mechanism to refine models using real-world treatment outcomes and patient satisfaction data.
Results
Improved Patient Outcomes
The system delivered personalized treatment recommendations that significantly improved recovery rates and health outcomes.
Increased Treatment Efficacy
Optimized treatment plans reduced trial-and-error approaches, leading to more effective interventions and fewer side effects.
Personalized Healthcare Experiences
Patients reported higher satisfaction levels due to treatment plans tailored to their individual needs and preferences.
Enhanced Decision-Making
Healthcare providers benefited from data-driven insights, enabling more informed and confident decisions.
Scalable and Future-Ready Solution
The system scaled seamlessly to support diverse medical specialties and adapted to incorporate emerging medical research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adaptation of http://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records#
Ready for usage with ehrapy
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adaptation of http://archive.ics.uci.edu/ml/datasets/Dermatology
Ready for usage with ehrapy
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We conducted our experiments on de-identified EHR data from MIMIC-III. This data set contains various clinical data relating to patient admission to ICU, such as disease diagnoses in the form of International Classification of Diseases (ICD)-9 codes, and lab test results as detailed in Supplementary Materials. We collected data for 5,956 patients, extracting lab tests every hour from admission. There are a total of 409 unique lab tests and 3,387 unique disease diagnoses observed. The diagnoses were obtained as ICD-9 codes and represented using one-hot encoding, where one indicates that the patient has the disease and zero indicates that the patient does not. We binned the lab test events into windows of 6, 12, 24, and 48 hours prior to patient death or discharge from ICU. From these data, we performed mortality predictions, evaluated with 10-fold cross-validation.
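The one-hot encoding and time-window binning described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the event tuple layout are hypothetical, the windows are treated as cumulative (an event 5 hours before the end falls in all four windows), and the three ICD-9 codes and lab values are illustrative, not drawn from the study.

```python
WINDOWS_HOURS = (6, 12, 24, 48)


def one_hot(patient_codes, vocabulary):
    """1 where the patient has the diagnosis code, 0 otherwise."""
    return [1 if code in patient_codes else 0 for code in vocabulary]


def bin_lab_events(events, end_hour, windows=WINDOWS_HOURS):
    """Group lab events by how long before death or discharge they occurred.

    events: list of (hour_since_admission, lab_name, value) tuples.
    end_hour: hour of death or ICU discharge, measured from admission.
    """
    bins = {w: [] for w in windows}
    for hour, lab, value in events:
        hours_before_end = end_hour - hour
        for w in windows:
            if 0 <= hours_before_end <= w:
                bins[w].append((lab, value))
    return bins


vocabulary = ["038.9", "428.0", "584.9"]   # illustrative ICD-9 codes
encoded = one_hot({"428.0"}, vocabulary)   # -> [0, 1, 0]

events = [(2, "lactate", 3.1), (40, "lactate", 2.0), (55, "creatinine", 1.4)]
bins = bin_lab_events(events, end_hour=60)
```

With `end_hour=60`, the reading at hour 2 lies 58 hours before the end and falls outside every window, while the reading at hour 55 appears in all four.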
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The Medical Information Mart for Intensive Care (MIMIC)-IV database comprises de-identified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed users. Here, we have provided an openly available demo of MIMIC-IV containing a subset of 100 patients. The dataset includes similar content to MIMIC-IV, but excludes free-text clinical notes. The demo may be useful for running workshops and for assessing whether MIMIC-IV is appropriate for a study before making an access request.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Synthetic Skin Cancer Detection Dataset is designed for educational and research purposes to analyze factors associated with skin cancer types, their diagnosis, and treatment options. The dataset includes anonymized, synthetic data on various clinical and demographic factors for individuals diagnosed with different types of skin cancer.
This dataset can be used for the following applications:
This synthetic dataset is fully anonymized and complies with data privacy standards. It includes a broad set of factors to support diverse research and analysis in the oncology and medical domains, particularly in dermatology.
CC0 (Public Domain)
1_Cofrequency_Counts.tar.gz
See ReadMe.txt
2_Singleton_Frequency_Counts.tar.gz
See ReadMe.txt provided with "1_Cofrequency_Counts.tar.gz"
3_ID_Mappings.tar.gz
See ReadMe.txt provided with "1_Cofrequency_Counts.tar.gz"
4_Scripts.tar.gz
See ReadMe.txt provided with "1_Cofrequency_Counts.tar.gz"
https://aimistanford-web-api.azurewebsites.net/licenses/f1f352a6-243f-4905-8e00-389edbca9e83/view
Synthesizing information from various data sources plays a crucial role in the practice of modern medicine. Current applications of artificial intelligence in medicine often focus on single-modality data due to a lack of publicly available, multimodal medical datasets. To address this limitation, we introduce INSPECT, which contains de-identified longitudinal records from a large cohort of pulmonary embolism (PE) patients, along with ground truth labels for multiple outcomes. INSPECT contains data from 19,438 patients, including CT images, sections of radiology reports, and structured electronic health record (EHR) data (including demographics, diagnoses, procedures, and vitals). Using our provided dataset, we develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE-related tasks. We evaluate image-only, EHR-only, and fused models. Trained models and the de-identified dataset are made available for non-commercial use under a data use agreement. To the best of our knowledge, INSPECT is the largest multimodal dataset for enabling reproducible research on strategies for integrating 3D medical imaging and EHR data. NOTE: this is the first part of the release due to PHI review. This release has 20,078 CT scans and 21,266 impression sections; the EHR modality data will be uploaded to the Stanford Redivis website (https://redivis.com/Stanford).
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
MIMIC-III is a large, freely available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge).
MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors: it is freely available to researchers worldwide; it encompasses a diverse and very large population of ICU patients; and it contains highly granular data, including vital signs, laboratory results, and medications.