The advent of large, open access text databases has driven advances in state-of-the-art model performance in natural language processing (NLP). The relatively limited amount of clinical data available for NLP has been cited as a significant barrier to the field's progress. Here we describe MIMIC-IV-Note: a collection of deidentified free-text clinical notes for patients included in the MIMIC-IV clinical database. MIMIC-IV-Note contains 331,794 deidentified discharge summaries from 145,915 patients admitted to the hospital and emergency department at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The database also contains 2,321,355 deidentified radiology reports for 237,427 patients. All notes have had protected health information removed in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. All notes are linkable to MIMIC-IV, providing important context to the clinical data therein. The database is intended to stimulate research in clinical natural language processing and associated areas.
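Because the notes share patient and admission identifiers with the clinical database, linking them is a simple join. The sketch below assumes the compressed CSV layout of the PhysioNet releases and the usual `subject_id`/`hadm_id` columns; file paths are illustrative.

```python
import pandas as pd

# Illustrative file paths; actual locations depend on where the PhysioNet
# downloads were extracted.
notes = pd.read_csv(
    "mimic-iv-note/discharge.csv.gz",
    usecols=["subject_id", "hadm_id", "text"],
)
admissions = pd.read_csv(
    "mimic-iv/hosp/admissions.csv.gz",
    usecols=["subject_id", "hadm_id", "admittime", "dischtime"],
)

# Attach each discharge summary to its hospital admission.
linked = notes.merge(admissions, on=["subject_id", "hadm_id"], how="inner")
print(linked[["subject_id", "hadm_id", "admittime"]].head())
```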
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.
The Medical Information Mart for Intensive Care (MIMIC)-IV database is comprised of deidentified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed users. Here, we have provided an openly-available demo of MIMIC-IV containing a subset of 100 patients. The dataset includes similar content to MIMIC-IV, but excludes free-text clinical notes. The demo may be useful for running workshops and for assessing whether MIMIC-IV is appropriate for a study before making an access request.
MIMIC-IV ICD-10 contains 122,279 discharge summaries (free-text medical documents) annotated with ICD-10 diagnosis and procedure codes. It contains data for patients admitted to the Beth Israel Deaconess Medical Center emergency department or ICU between 2008 and 2019. All codes with fewer than ten examples have been removed, and the train-val-test split was created using multi-label stratified sampling. The dataset is described further in "Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study", and the code to use the dataset is found here.
The dataset is intended for medical code prediction and was created using MIMIC-IV v2.2 and MIMIC-IV-Note v2.2. Using the two datasets requires credentialed access obtained through PhysioNet; this can take a couple of days.
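The split procedure itself is documented in the cited paper and code; purely as an illustration of multi-label stratified sampling, one might use scikit-multilearn's iterative splitter on a toy label matrix (the real pipeline may differ).

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Toy stand-ins: one row per discharge summary, binary label matrix of
# hypothetical ICD-10 codes (the real dataset has far more of both).
rng = np.random.default_rng(0)
X = np.arange(1000).reshape(-1, 1)          # note indices
Y = rng.integers(0, 2, size=(1000, 20))     # notes x codes indicator matrix

# Stratified test split, then a stratified validation split from the rest.
X_rest, Y_rest, X_test, Y_test = iterative_train_test_split(X, Y, test_size=0.2)
X_train, Y_train, X_val, Y_val = iterative_train_test_split(X_rest, Y_rest, test_size=0.125)
```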
The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.
The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
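Top-level ICD-9 codes are usually obtained by truncating each code to its category prefix. The helper below sketches one common convention; it is illustrative rather than the dataset's official preprocessing.

```python
def icd9_top_level(code: str) -> str:
    """Roll an ICD-9 diagnosis code up to its top-level category.

    Illustrative convention: numeric codes keep their first three characters
    ("428.0" -> "428"); E-codes keep four ("E950.1" -> "E950"); V-codes keep
    three ("V45.81" -> "V45"). Real pipelines may use a different rollup.
    """
    code = code.replace(".", "")
    return code[:4] if code.startswith("E") else code[:3]

assert icd9_top_level("428.0") == "428"
assert icd9_top_level("E950.1") == "E950"
assert icd9_top_level("V45.81") == "V45"
```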
Large language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to that of the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study, we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We then used these questions to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes. To evaluate the resulting models, we created 23 questions resembling eligibility criteria from the apixaban clinical trial and evaluated them on a random sample of 100 patient notes from MIMIC-IV. Notes from MIMIC-IV were restricted to those written after 2012 to ensure no overlap with the MIMIC-III notes used to generate the fine-tuning data. We release the 2,300 resulting question-answer pairs as a dataset here.
Large language models in healthcare can generate informative patient summaries while reducing the documentation workload of healthcare professionals. However, these models are prone to producing hallucinations, that is, generating unsupported information, which is problematic in the sensitive healthcare domain. To better characterize unsupported facts in medical texts, we developed a rigorous labeling protocol. Following this protocol, two medical experts annotated unsupported facts in 100 doctor-written summaries from the MIMIC-IV-Note Discharge Instructions and hallucinations in 100 LLM-generated patient summaries. Here, we are releasing two datasets based on these annotations: Hallucinations-MIMIC-DI and Hallucinations-Generated-DI. We find that using these datasets to train on hallucination-free examples effectively reduces hallucinations for both Llama 2 (2.60 to 1.55 hallucinations per summary) and GPT-4 (0.70 to 0.40). Furthermore, we created a preprocessed version of the MIMIC-IV-Note Discharge Instructions, releasing both a full-context version (MIMIC-IV-Note-Ext-DI) and a version that only uses the Brief Hospital Course for context (MIMIC-IV-Note-Ext-DI-BHC).
MIMIC-IV-ED is a large, freely available database of emergency department (ED) admissions at the Beth Israel Deaconess Medical Center between 2011 and 2019. As of MIMIC-IV-ED v1.0, the database contains 448,972 ED stays. Vital signs, triage information, medication reconciliation, medication administration, and discharge diagnoses are available. All data are deidentified to comply with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. MIMIC-IV-ED is intended to support a diverse range of education initiatives and research studies.
Embeddings for SNOMED CT concepts produced by FastText models trained on different corpora. Each file is a JSON dictionary that links the ID of a SNOMED CT concept to its corresponding embedding. Files ft_mimicN_dict.json contain the embeddings of models trained on subsets of MIMIC-IV, where N denotes the percentage of MIMIC-IV used to train the model, whereas ft_snomed_ct_walks_dict.json contains the embeddings of a FastText model trained on an artificial corpus obtained by performing walks on SNOMED CT (https://doi.org/10.1016/j.jbi.2023.104297).
These embeddings were generated and studied in the paper "Assessing the Effectiveness of Embedding Methods in Capturing Clinical Information from SNOMED CT", and more information can be found in the following repository: https://github.com/JavierCastellD/AssessingSNOMEDEmbeddings.
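A rough sketch of how one of these files might be consumed is shown below; the file name (N=100) and the SNOMED CT concept IDs are placeholders, not guaranteed contents of the release.

```python
import json
import numpy as np

# The file name assumes N=100 (a model trained on all of MIMIC-IV); adjust to
# whichever ft_mimicN_dict.json file is being used.
with open("ft_mimic100_dict.json") as f:
    embeddings = {cid: np.asarray(vec) for cid, vec in json.load(f).items()}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder SNOMED CT concept IDs; substitute IDs present in the file.
sim = cosine(embeddings["22298006"], embeddings["57054005"])
print(f"cosine similarity: {sim:.3f}")
```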
With the rapid development of generative large language models (LLMs) in the field of natural language processing, their potential in medical applications has become increasingly evident. However, most existing studies rely on exam-style questions or artificially designed cases, lacking validation on real patient data. To address this gap, this study leverages the MIMIC-IV database to construct a subset, MIMIC-IV-Ext Cardiac Disease, which includes 4,761 patients diagnosed with cardiac diseases. The dataset covers all relevant clinical examinations from admission to discharge, as well as the final diagnoses. Combined with the multi-turn interaction framework we built, these data can be used to test whether large models can guide patients through in-hospital examinations. Moreover, as a curated modification of MIMIC-IV, our sub-dataset can facilitate researchers in conducting other studies.
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge). MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors: it is freely available to researchers worldwide; it encompasses a diverse and very large population of ICU patients; and it contains highly granular data, including vital signs, laboratory results, and medications.
👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.
⚡️ Quickstart
1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab.
2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab.
⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.
1. 📖 Overview
EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks. The dataset contains de-identified structured EHR data for 6,739 Stanford Medicine patients and labels for 15 clinical prediction tasks.
2. 💽 Dataset
EHRSHOT is sourced from Stanford’s STARR-OMOP database.
We provide two versions of the dataset: EHRSHOT-Original, the cohort as released with the original paper, and EHRSHOT-OMOP, the same patients with the full OMOP CDM schema.
To access the raw data, please see the "Tables" and "Files" tabs above.
3. 💽 Data Files and Formats
We provide EHRSHOT in two file formats, FEMR and MEDS, as detailed in the file listing below.
Within the "Tables" tab...
1. EHRSHOT-OMOP
* Dataset Version: EHRSHOT-OMOP
* Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different from the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.
Within the "Files" tab...
1. EHRSHOT_ASSETS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: FEMR 0.1.16
* Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.
2. EHRSHOT_MEDS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: MEDS 0.3.3
* Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.
3. EHRSHOT_OMOP_MEDS.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Converts the EHRSHOT-OMOP dataset into MEDS format via the `meds_etl_omop` command from MEDS-ETL; a sketch of reading MEDS-formatted files appears after this list.
4. EHRSHOT_OMOP_MEDS_Reader.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.
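For the MEDS-formatted archives, events are stored as Parquet shards. The sketch below shows one way to inspect them with pandas; the shard path and exact column set are assumptions based on the general MEDS convention rather than a documented layout of this release.

```python
import pandas as pd

# Path and schema are assumptions based on the MEDS convention
# (subject_id / time / code / numeric_value); check against the actual files
# after unzipping EHRSHOT_OMOP_MEDS.zip.
shard = pd.read_parquet("EHRSHOT_OMOP_MEDS/data/0.parquet")

print(shard[["subject_id", "time", "code", "numeric_value"]].head())
print(shard.groupby("subject_id").size().describe())  # events per patient
```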
4. 🤖 Model
We also release the full weights of CLMBR-T-base, a 141M-parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base
5. 🧑‍💻 Code
Please see our Github repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/
NOTE: You must authenticate to Redivis using the email address of your formal affiliation. If you use Gmail or another personal email address, you will not be granted access.
Access to the EHRSHOT dataset requires the following:
This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion-parameter Meta Llama 2 model, the dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summary. This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements in real-time decision-making processes in clinical settings. Additionally, the dataset is accompanied by code that facilitates question-and-answer pair generation from any medical or non-medical text. Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6,000 input tokens owing to hardware constraints, and the generated pairs may carry underlying biases or lack diversity and complexity inherited from the large language model. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures, as well as the use of more powerful large language models.
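The generation step can be pictured as a simple prompt template wrapped around each discharge summary; the sketch below is illustrative and does not reproduce the exact prompt or pipeline in the released code.

```python
def build_qa_prompt(discharge_summary: str, n_pairs: int = 5) -> str:
    """Illustrative prompt for QA-pair generation; not the released prompt."""
    return (
        "You are given a hospital discharge summary. Generate "
        f"{n_pairs} question-and-answer pairs that can be answered using "
        "only the information in the summary.\n\n"
        f"Discharge summary:\n{discharge_summary}\n\n"
        "Question-answer pairs:"
    )

prompt = build_qa_prompt("Patient admitted with chest pain ...")
# The prompt would then be sent to a Llama 2 13B chat model (for example via
# the Hugging Face transformers text-generation pipeline) and the completion
# parsed into individual question-answer pairs.
```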
Recent advances in scaling large language models (LLMs) have resulted in significant improvements on a number of natural language processing benchmarks. There has been some work to pretrain these language models on clinical text. These works demonstrate that training a language model using masked language modeling (MLM) on clinical notes is an effective technique for boosting performance on downstream tasks. All of these previous works use encoder-only architectures. We train four different clinical T5 models on the union of MIMIC-III and MIMIC-IV notes. Two of the models are initialized from previous T5 models (T5-Base and SciFive). We additionally train a T5-Base and a T5-Large model from scratch. These models should not be distributed to non-credentialed users. Research has shown that these language models have the potential to leak sensitive information. Due to this potential risk, we release the model weights under PhysioNet credentialed access.
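As a point of reference, T5-style pretraining uses span corruption rather than BERT-style token masking: spans in the input are replaced with sentinel tokens, and the target reconstructs the missing spans. A schematic example on a synthetic, non-MIMIC sentence:

```python
# Schematic span-corruption pair on a synthetic (non-MIMIC) sentence: masked
# spans become sentinel tokens in the input, and the target lists the spans,
# each prefixed by its sentinel and closed by a final sentinel.
source = "The patient was <extra_id_0> to the ICU with <extra_id_1> pneumonia."
target = "<extra_id_0> admitted <extra_id_1> severe <extra_id_2>"
```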
The MIMIC-IV-ECG module contains approximately 800,000 diagnostic electrocardiograms across nearly 160,000 unique patients. These diagnostic ECGs use 12 leads and are 10 seconds in length. They are sampled at 500 Hz. This subset contains all of the ECGs for patients who appear in the MIMIC-IV Clinical Database. When a cardiologist report is available for a given ECG, it is also provided. The patients in MIMIC-IV-ECG have been matched against the MIMIC-IV Clinical Database, making it possible to link to information across the MIMIC-IV modules.
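Assuming the waveforms are distributed in WFDB format, as is typical for PhysioNet ECG releases, a record can be read with the wfdb Python package as sketched below; the record path is illustrative.

```python
import wfdb

# Illustrative record path; actual paths follow the layout of the
# MIMIC-IV-ECG release on PhysioNet.
record_path = "mimic-iv-ecg/files/p1000/p10000032/s40689238/40689238"
signals, fields = wfdb.rdsamp(record_path)

print(signals.shape)        # expected (5000, 12): 10 s at 500 Hz, 12 leads
print(fields["fs"])         # sampling frequency
print(fields["sig_name"])   # lead names
```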
This challenge, sponsored by SNOMED International, seeks to advance the development of Entity Linking models that operate on unstructured clinical texts. Participants in the challenge will train entity linking models using a subset of MIMIC-IV-Note discharge summaries that have been annotated with SNOMED CT concepts by a team of medical professionals. The full dataset (which is comprised of a training set and a test set) consists of approximately 75,000 annotations across nearly 300 discharge summaries. The challenge was originally run as a competition on the DrivenData platform. Now that the competition has concluded, the hidden test set has been added to this dataset.
Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
The EchoNotes Structured Database derived from MIMIC-III (ECHO-NOTE2NUM) is a structured echocardiogram database derived from 43,472 observational notes obtained during echocardiogram studies conducted in the intensive care unit at the Beth Israel Deaconess Medical Center between 2001 and 2012. The database encompasses various aspects of cardiac structure and function, including cavity size, wall thickness, systolic and diastolic function, valve regurgitation and stenosis, as well as pulmonary pressures. To facilitate extensive data analysis, the clinical notes were transformed into a structured numerical format. Within each echocardiogram report sentence, specific words or phrases were identified to describe abnormal findings, and a severity staging system using numeric categories was established. This large publicly-accessible database of structured echocardiogram data holds significant potential as a tool to investigate cardiovascular disease in the intensive care unit.
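The staging idea can be illustrated with a toy keyword-to-category mapping; the actual severity scheme used by ECHO-NOTE2NUM is defined in its documentation and is likely more nuanced.

```python
import re

# Toy severity vocabulary; the actual ECHO-NOTE2NUM staging scheme is defined
# in the database documentation and may use different terms and categories.
SEVERITY = {"trace": 1, "trivial": 1, "mild": 2, "moderate": 3, "severe": 4}

def stage_sentence(sentence: str) -> int:
    """Return a numeric severity stage for one report sentence (0 = none found)."""
    for term, stage in SEVERITY.items():
        if re.search(rf"\b{term}\b", sentence, flags=re.IGNORECASE):
            return stage
    return 0

print(stage_sentence("Moderate mitral regurgitation is present."))  # -> 3
```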
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.
The MIMIC Chest X-ray (MIMIC-CXR) Database v2.0.0 is a large publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. The dataset contains 377,110 images corresponding to 227,835 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA. The dataset is de-identified to satisfy the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements. Protected health information (PHI) has been removed. The dataset is intended to support a wide body of research in medicine including image understanding, natural language processing, and decision support.
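Since the images are distributed as DICOM files, an individual radiograph can be loaded with pydicom as sketched below; the path is illustrative, and compressed pixel data may require an additional pixel-data handler.

```python
import pydicom

# Illustrative path following the patient/study folder layout; compressed
# pixel data may additionally require a handler such as pylibjpeg.
ds = pydicom.dcmread("files/p10/p10000032/s50414267/example.dcm")
pixels = ds.pixel_array  # numpy array of the radiograph

print(pixels.shape, ds.get("ViewPosition", "unknown view"))
```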