The advent of large, open access text databases has driven advances in state-of-the-art model performance in natural language processing (NLP). The relatively limited amount of clinical data available for NLP has been cited as a significant barrier to the field's progress. Here we describe MIMIC-IV-Note: a collection of deidentified free-text clinical notes for patients included in the MIMIC-IV clinical database. MIMIC-IV-Note contains 331,794 deidentified discharge summaries from 145,915 patients admitted to the hospital and emergency department at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The database also contains 2,321,355 deidentified radiology reports for 237,427 patients. All notes have had protected health information removed in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. All notes are linkable to MIMIC-IV, providing important context to the clinical data therein. The database is intended to stimulate research in clinical natural language processing and associated areas.
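Because the notes share patient and admission identifiers with the clinical database, linking them is a simple join. The sketch below assumes the compressed CSV layout of the PhysioNet releases and the usual `subject_id`/`hadm_id` columns; file paths are illustrative.

```python
import pandas as pd

# Illustrative file paths; actual locations depend on where the PhysioNet
# downloads were extracted.
notes = pd.read_csv(
    "mimic-iv-note/discharge.csv.gz",
    usecols=["subject_id", "hadm_id", "text"],
)
admissions = pd.read_csv(
    "mimic-iv/hosp/admissions.csv.gz",
    usecols=["subject_id", "hadm_id", "admittime", "dischtime"],
)

# Attach each discharge summary to its hospital admission.
linked = notes.merge(admissions, on=["subject_id", "hadm_id"], how="inner")
print(linked[["subject_id", "hadm_id", "admittime"]].head())
```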
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.
The Medical Information Mart for Intensive Care (MIMIC)-IV database is comprised of deidentified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed users. Here, we have provided an openly-available demo of MIMIC-IV containing a subset of 100 patients. The dataset includes similar content to MIMIC-IV, but excludes free-text clinical notes. The demo may be useful for running workshops and for assessing whether MIMIC-IV is appropriate for a study before making an access request.
MIMIC-IV ICD-10 contains 122,279 discharge summaries (free-text medical documents) annotated with ICD-10 diagnosis and procedure codes. It contains data for patients admitted to the Beth Israel Deaconess Medical Center emergency department or ICU between 2008 and 2019. All codes with fewer than ten examples have been removed, and the train-val-test split was created using multi-label stratified sampling. The dataset is described further in "Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study", and the code to use the dataset is found here.
The dataset is intended for medical code prediction and was created using MIMIC-IV v2.2 and MIMIC-IV-Note v2.2. Using the two datasets requires credentialed access obtained through PhysioNet; this can take a couple of days.
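The split procedure itself is documented in the cited paper and code; purely as an illustration of multi-label stratified sampling, one might use scikit-multilearn's iterative splitter on a toy label matrix (the real pipeline may differ).

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Toy stand-ins: one row per discharge summary, binary label matrix of
# hypothetical ICD-10 codes (the real dataset has far more of both).
rng = np.random.default_rng(0)
X = np.arange(1000).reshape(-1, 1)          # note indices
Y = rng.integers(0, 2, size=(1000, 20))     # notes x codes indicator matrix

# Stratified test split, then a stratified validation split from the rest.
X_rest, Y_rest, X_test, Y_test = iterative_train_test_split(X, Y, test_size=0.2)
X_train, Y_train, X_val, Y_val = iterative_train_test_split(X_rest, Y_rest, test_size=0.125)
```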
The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.
The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
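Top-level ICD-9 codes are usually obtained by truncating each code to its category prefix. The helper below sketches one common convention; it is illustrative rather than the dataset's official preprocessing.

```python
def icd9_top_level(code: str) -> str:
    """Roll an ICD-9 diagnosis code up to its top-level category.

    Illustrative convention: numeric codes keep their first three characters
    ("428.0" -> "428"); E-codes keep four ("E950.1" -> "E950"); V-codes keep
    three ("V45.81" -> "V45"). Real pipelines may use a different rollup.
    """
    code = code.replace(".", "")
    return code[:4] if code.startswith("E") else code[:3]

assert icd9_top_level("428.0") == "428"
assert icd9_top_level("E950.1") == "E950"
assert icd9_top_level("V45.81") == "V45"
```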
Large language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to that of the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study, we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We then used these questions to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes. To evaluate the resulting models, we created 23 questions resembling eligibility criteria from the apixaban clinical trial and evaluated them on a random sample of 100 patient notes from MIMIC-IV. Notes from MIMIC-IV were restricted to those written after 2012 to ensure no overlap with the MIMIC-III notes used to generate the fine-tuning data. We release the 2,300 resulting question-answer pairs as a dataset here.
Large language models in healthcare can generate informative patient summaries while reducing the documentation workload of healthcare professionals. However, these models are prone to producing hallucinations, that is, generating unsupported information, which is problematic in the sensitive healthcare domain. To better characterize unsupported facts in medical texts, we developed a rigorous labeling protocol. Following this protocol, two medical experts annotated unsupported facts in 100 doctor-written summaries from the MIMIC-IV-Note Discharge Instructions and hallucinations in 100 LLM-generated patient summaries. Here, we are releasing two datasets based on these annotations: Hallucinations-MIMIC-DI and Hallucinations-Generated-DI. We find that using these datasets to train on hallucination-free examples effectively reduces hallucinations for both Llama 2 (2.60 to 1.55 hallucinations per summary) and GPT-4 (0.70 to 0.40). Furthermore, we created a preprocessed version of the MIMIC-IV-Note Discharge Instructions, releasing both a full-context version (MIMIC-IV-Note-Ext-DI) and a version that only uses the Brief Hospital Course for context (MIMIC-IV-Note-Ext-DI-BHC).
MIMIC-IV-ED is a large, freely available database of emergency department (ED) admissions at the Beth Israel Deaconess Medical Center between 2011 and 2019. As of MIMIC-IV-ED v1.0, the database contains 448,972 ED stays. Vital signs, triage information, medication reconciliation, medication administration, and discharge diagnoses are available. All data are deidentified to comply with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. MIMIC-IV-ED is intended to support a diverse range of education initiatives and research studies.
Embeddings for SNOMED CT concepts produced by FastText models trained on different corpora. Each file is a JSON dictionary that links the ID of a SNOMED CT concept to its corresponding embedding. Files ft_mimicN_dict.json contain the embeddings of models trained on subsets of MIMIC-IV, where N denotes the percentage of MIMIC-IV used to train the model, whereas ft_snomed_ct_walks_dict.json contains the embeddings of a FastText model trained on an artificial corpus obtained by performing walks on SNOMED CT (https://doi.org/10.1016/j.jbi.2023.104297).
These embeddings were generated and studied in the paper "Assessing the Effectiveness of Embedding Methods in Capturing Clinical Information from SNOMED CT", and more information can be found in the following repository: https://github.com/JavierCastellD/AssessingSNOMEDEmbeddings.
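A rough sketch of how one of these files might be consumed is shown below; the file name (N=100) and the SNOMED CT concept IDs are placeholders, not guaranteed contents of the release.

```python
import json
import numpy as np

# The file name assumes N=100 (a model trained on all of MIMIC-IV); adjust to
# whichever ft_mimicN_dict.json file is being used.
with open("ft_mimic100_dict.json") as f:
    embeddings = {cid: np.asarray(vec) for cid, vec in json.load(f).items()}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder SNOMED CT concept IDs; substitute IDs present in the file.
sim = cosine(embeddings["22298006"], embeddings["57054005"])
print(f"cosine similarity: {sim:.3f}")
```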
With the rapid development of generative large language models (LLMs) in the field of natural language processing, their potential in medical applications has become increasingly evident. However, most existing studies rely on exam-style questions or artificially designed cases, lacking validation on real patient data. To address this gap, this study leverages the MIMIC-IV database to construct a subset, MIMIC-IV-Ext Cardiac Disease, which includes 4,761 patients diagnosed with cardiac diseases. The dataset covers all relevant clinical examinations from admission to discharge, as well as the final diagnoses. Combined with the multi-turn interaction framework we built, these data can be used to test whether large models can guide patients through in-hospital examinations. Moreover, as a curated modification of MIMIC-IV, our sub-dataset can facilitate researchers in conducting other studies.
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge). MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors: it is freely available to researchers worldwide; it encompasses a diverse and very large population of ICU patients; and it contains highly granular data, including vital signs, laboratory results, and medications.
👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.
⚡️ Quickstart
1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab.
2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab.
⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.
1. 📖 Overview
EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks. The dataset contains de-identified structured EHR data for 6,739 Stanford Medicine patients and labels for 15 clinical prediction tasks.
2. 💽 Dataset
EHRSHOT is sourced from Stanford’s STARR-OMOP database.
We provide two versions of the dataset: EHRSHOT-Original, the cohort as released with the original paper, and EHRSHOT-OMOP, the same patients with the full OMOP CDM schema.
To access the raw data, please see the "Tables" and "Files" tabs above.
3. 💽 Data Files and Formats
We provide EHRSHOT in two file formats, FEMR and MEDS, as detailed in the file listing below.
Within the "Tables" tab...
1. EHRSHOT-OMOP
* Dataset Version: EHRSHOT-OMOP
* Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different from the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.
Within the "Files" tab...
1. EHRSHOT_ASSETS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: FEMR 0.1.16
* Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.
2. EHRSHOT_MEDS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: MEDS 0.3.3
* Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.
3. EHRSHOT_OMOP_MEDS.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Converts the EHRSHOT-OMOP dataset into MEDS format via the `meds_etl_omop` command from MEDS-ETL; a sketch of reading MEDS-formatted files appears after this list.
4. EHRSHOT_OMOP_MEDS_Reader.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.
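For the MEDS-formatted archives, events are stored as Parquet shards. The sketch below shows one way to inspect them with pandas; the shard path and exact column set are assumptions based on the general MEDS convention rather than a documented layout of this release.

```python
import pandas as pd

# Path and schema are assumptions based on the MEDS convention
# (subject_id / time / code / numeric_value); check against the actual files
# after unzipping EHRSHOT_OMOP_MEDS.zip.
shard = pd.read_parquet("EHRSHOT_OMOP_MEDS/data/0.parquet")

print(shard[["subject_id", "time", "code", "numeric_value"]].head())
print(shard.groupby("subject_id").size().describe())  # events per patient
```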
4. 🤖 Model
We also release the full weights of CLMBR-T-base, a 141M-parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base
5. 🧑‍💻 Code
Please see our Github repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/
NOTE: You must authenticate to Redivis using the email address of your formal affiliation. If you use Gmail or another personal email address, you will not be granted access.
Access to the EHRSHOT dataset requires the following:
This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion-parameter Meta Llama 2 model, the dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summary. This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements in real-time decision-making processes in clinical settings. Additionally, the dataset is accompanied by code that facilitates question-and-answer pair generation from any medical or non-medical text. Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6,000 input tokens owing to hardware constraints, and the generated pairs may carry underlying biases or lack diversity and complexity inherited from the large language model. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures, as well as the use of more powerful large language models.
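The generation step can be pictured as a simple prompt template wrapped around each discharge summary; the sketch below is illustrative and does not reproduce the exact prompt or pipeline in the released code.

```python
def build_qa_prompt(discharge_summary: str, n_pairs: int = 5) -> str:
    """Illustrative prompt for QA-pair generation; not the released prompt."""
    return (
        "You are given a hospital discharge summary. Generate "
        f"{n_pairs} question-and-answer pairs that can be answered using "
        "only the information in the summary.\n\n"
        f"Discharge summary:\n{discharge_summary}\n\n"
        "Question-answer pairs:"
    )

prompt = build_qa_prompt("Patient admitted with chest pain ...")
# The prompt would then be sent to a Llama 2 13B chat model (for example via
# the Hugging Face transformers text-generation pipeline) and the completion
# parsed into individual question-answer pairs.
```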
Recent advances in scaling large language models (LLMs) have resulted in significant improvements on a number of natural language processing benchmarks. There has been some work to pretrain these language models on clinical text. These works demonstrate that training a language model using masked language modeling (MLM) on clinical notes is an effective technique for boosting performance on downstream tasks. All of these previous works use encoder-only architectures. We train four different clinical T5 models on the union of MIMIC-III and MIMIC-IV notes. Two of the models are initialized from previous T5 models (T5-Base and SciFive). We additionally train a T5-Base and a T5-Large model from scratch. These models should not be distributed to non-credentialed users. Research has shown that these language models have the potential to leak sensitive information. Due to this potential risk, we release the model weights under PhysioNet credentialed access.
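As a point of reference, T5-style pretraining uses span corruption rather than BERT-style token masking: spans in the input are replaced with sentinel tokens, and the target reconstructs the missing spans. A schematic example on a synthetic, non-MIMIC sentence:

```python
# Schematic span-corruption pair on a synthetic (non-MIMIC) sentence: masked
# spans become sentinel tokens in the input, and the target lists the spans,
# each prefixed by its sentinel and closed by a final sentinel.
source = "The patient was <extra_id_0> to the ICU with <extra_id_1> pneumonia."
target = "<extra_id_0> admitted <extra_id_1> severe <extra_id_2>"
```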
The MIMIC-IV-ECG module contains approximately 800,000 diagnostic electrocardiograms across nearly 160,000 unique patients. These diagnostic ECGs use 12 leads and are 10 seconds in length. They are sampled at 500 Hz. This subset contains all of the ECGs for patients who appear in the MIMIC-IV Clinical Database. When a cardiologist report is available for a given ECG, it is also provided. The patients in MIMIC-IV-ECG have been matched against the MIMIC-IV Clinical Database, making it possible to link to information across the MIMIC-IV modules.
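Assuming the waveforms are distributed in WFDB format, as is typical for PhysioNet ECG releases, a record can be read with the wfdb Python package as sketched below; the record path is illustrative.

```python
import wfdb

# Illustrative record path; actual paths follow the layout of the
# MIMIC-IV-ECG release on PhysioNet.
record_path = "mimic-iv-ecg/files/p1000/p10000032/s40689238/40689238"
signals, fields = wfdb.rdsamp(record_path)

print(signals.shape)        # expected (5000, 12): 10 s at 500 Hz, 12 leads
print(fields["fs"])         # sampling frequency
print(fields["sig_name"])   # lead names
```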
This challenge, sponsored by SNOMED International, seeks to advance the development of Entity Linking models that operate on unstructured clinical texts. Participants in the challenge will train entity linking models using a subset of MIMIC-IV-Note discharge summaries that have been annotated with SNOMED CT concepts by a team of medical professionals. The full dataset (which is comprised of a training set and a test set) consists of approximately 75,000 annotations across nearly 300 discharge summaries. The challenge was originally run as a competition on the DrivenData platform. Now that the competition has concluded, the hidden test set has been added to this dataset.
Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
The EchoNotes Structured Database derived from MIMIC-III (ECHO-NOTE2NUM) is a structured echocardiogram database derived from 43,472 observational notes obtained during echocardiogram studies conducted in the intensive care unit at the Beth Israel Deaconess Medical Center between 2001 and 2012. The database encompasses various aspects of cardiac structure and function, including cavity size, wall thickness, systolic and diastolic function, valve regurgitation and stenosis, as well as pulmonary pressures. To facilitate extensive data analysis, the clinical notes were transformed into a structured numerical format. Within each echocardiogram report sentence, specific words or phrases were identified to describe abnormal findings, and a severity staging system using numeric categories was established. This large publicly-accessible database of structured echocardiogram data holds significant potential as a tool to investigate cardiovascular disease in the intensive care unit.
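The staging idea can be illustrated with a toy keyword-to-category mapping; the actual severity scheme used by ECHO-NOTE2NUM is defined in its documentation and is likely more nuanced.

```python
import re

# Toy severity vocabulary; the actual ECHO-NOTE2NUM staging scheme is defined
# in the database documentation and may use different terms and categories.
SEVERITY = {"trace": 1, "trivial": 1, "mild": 2, "moderate": 3, "severe": 4}

def stage_sentence(sentence: str) -> int:
    """Return a numeric severity stage for one report sentence (0 = none found)."""
    for term, stage in SEVERITY.items():
        if re.search(rf"\b{term}\b", sentence, flags=re.IGNORECASE):
            return stage
    return 0

print(stage_sentence("Moderate mitral regurgitation is present."))  # -> 3
```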
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.
The MIMIC Chest X-ray (MIMIC-CXR) Database v2.0.0 is a large publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. The dataset contains 377,110 images corresponding to 227,835 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA. The dataset is de-identified to satisfy the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements. Protected health information (PHI) has been removed. The dataset is intended to support a wide body of research in medicine including image understanding, natural language processing, and decision support.
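Since the images are distributed as DICOM files, an individual radiograph can be loaded with pydicom as sketched below; the path is illustrative, and compressed pixel data may require an additional pixel-data handler.

```python
import pydicom

# Illustrative path following the patient/study folder layout; compressed
# pixel data may additionally require a handler such as pylibjpeg.
ds = pydicom.dcmread("files/p10/p10000032/s50414267/example.dcm")
pixels = ds.pixel_array  # numpy array of the radiograph

print(pixels.shape, ds.get("ViewPosition", "unknown view"))
```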