35 datasets found
  1. Data from: Medication Extraction Labels for MIMIC-IV-Note Clinical Database

    • physionet.org
    Updated Dec 12, 2023
    Cite
    Akshay Goel; Almog Gueta; Omry Gilon; Sofia Erell; Amir Feder (2023). Medication Extraction Labels for MIMIC-IV-Note Clinical Database [Dataset]. http://doi.org/10.13026/ps1s-ab29
    Dataset updated
    Dec 12, 2023
    Authors
    Akshay Goel; Almog Gueta; Omry Gilon; Sofia Erell; Amir Feder
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This dataset release provides medication extraction labels for a subset of 600 discharge summaries from the 2023 MIMIC-IV-Note dataset. These labels are consistent with the schema from the 2009 i2b2 Workshop on NLP Challenges dataset. We utilized a Large Language Model (LLM) pipeline to generate these labels, achieving performance on par with the average human annotation specialist.

  2. EHRGym-MIMIC-Extract

    • huggingface.co
    Cite
    Yishan Zhong, EHRGym-MIMIC-Extract [Dataset]. https://huggingface.co/datasets/zhongys97/EHRGym-MIMIC-Extract
    Authors
    Yishan Zhong
    Description

    The zhongys97/EHRGym-MIMIC-Extract dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  3. MIMIC-IV

    • physionet.org
    Updated Oct 11, 2024
    Cite
    Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark (2024). MIMIC-IV [Dataset]. http://doi.org/10.13026/kpb9-mt58
    Dataset updated
    Oct 11, 2024
    Authors
    Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.

  4. EHR data from MIMIC-III

    • scidb.cn
    Updated Aug 24, 2021
    Cite
    Tingyi Wanyan; Hossein Honarvar; Ariful Azad; Ying Ding; Benjamin S. Glicksberg (2021). EHR data from MIMIC-III [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00094
    Available in Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Aug 24, 2021
    Dataset provided by
    Science Data Bank
    Authors
    Tingyi Wanyan; Hossein Honarvar; Ariful Azad; Ying Ding; Benjamin S. Glicksberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We conducted our experiments on de-identified EHR data from MIMIC-III. This dataset contains various clinical data relating to patient ICU admissions, such as disease diagnoses in the form of International Classification of Diseases (ICD)-9 codes and lab test results, as detailed in the Supplementary Materials. We collected data for 5,956 patients, extracting lab tests every hour from admission. There are 409 unique lab tests and 3,387 unique disease diagnoses in total. The diagnoses were obtained as ICD-9 codes and represented using one-hot encoding, where one represents patients with the disease and zero those without. We binned the lab test events into 6, 12, 24, and 48 hours prior to patient death or discharge from the ICU. From these data, we performed 10-fold cross-validated mortality predictions.
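
    As an illustration only (not the authors' code), the sketch below shows the setup described above on synthetic stand-in data: a one-hot diagnosis matrix and a 10-fold cross-validated mortality prediction. The model choice and all data are hypothetical.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        # Synthetic stand-in for the extracted cohort: rows are patients,
        # columns are one-hot diagnosis indicators (1 = diagnosed, 0 = not).
        rng = np.random.default_rng(0)
        X = rng.integers(0, 2, size=(200, 50))
        y = rng.integers(0, 2, size=200)  # hypothetical mortality labels

        # 10-fold cross-validated mortality prediction, as in the description.
        scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                                 cv=10, scoring="roc_auc")
        print(scores.mean())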

  5. Data from: CLIP: A Dataset for Extracting Action Items for Physicians from Hospital Discharge Notes

    • physionet.org
    Updated Jun 21, 2021
    + more versions
    Cite
    James Mullenbach; Yada Pruksachatkun; Sean Adler; Jennifer Seale; Jordan Swartz; T Greg McKelvey; Yi Yang; David Sontag (2021). CLIP: A Dataset for Extracting Action Items for Physicians from Hospital Discharge Notes [Dataset]. http://doi.org/10.13026/kw00-z903
    Dataset updated
    Jun 21, 2021
    Authors
    James Mullenbach; Yada Pruksachatkun; Sean Adler; Jennifer Seale; Jordan Swartz; T Greg McKelvey; Yi Yang; David Sontag
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    We created a dataset of clinical action items annotated over MIMIC-III. This dataset, which we call CLIP, is annotated by physicians and covers 718 discharge summaries, representing 107,494 sentences. Annotations were collected as character-level spans to discharge summaries after applying surrogate generation to fill in the anonymized templates from MIMIC-III text with faked data. We release these spans, their aggregation into sentence-level labels, and the sentence tokenizer used to aggregate the spans and label sentences. We also release the surrogate data generator, and the document IDs used for training, validation, and test splits, to enable reproduction. The spans are annotated with 0 or more labels of 7 different types, representing the different actions that may need to be taken: Appointment, Lab, Procedure, Medication, Imaging, Patient Instructions, and Other. We encourage the community to use this dataset to develop methods for automatically extracting clinical action items from discharge summaries.

  6. SQL code.

    • plos.figshare.com
    7z
    Updated Jun 21, 2023
    Cite
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang (2023). SQL code. [Dataset]. http://doi.org/10.1371/journal.pone.0276835.s001
    Available download formats: 7z
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The code shows how to extract data from MIMIC-III. (7Z archive)

  7. Data from: MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing

    • physionet.org
    Updated Apr 28, 2021
    Cite
    Shengpu Tang; Parmida Davarmanesh; Yanmeng Song; Danai Koutra; Michael Sjoding; Jenna Wiens (2021). MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing [Dataset]. http://doi.org/10.13026/2qtg-k467
    Dataset updated
    Apr 28, 2021
    Authors
    Shengpu Tang; Parmida Davarmanesh; Yanmeng Song; Danai Koutra; Michael Sjoding; Jenna Wiens
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This is a preprocessed dataset derived from patient records in MIMIC-III and eICU, two large-scale electronic health record (EHR) databases. It contains features and labels for 5 prediction tasks involving 3 adverse outcomes (prediction times listed in parentheses): in-hospital mortality (48h), acute respiratory failure (4h and 12h), and shock (4h and 12h). We extracted comprehensive, high-dimensional feature representations (up to ~8,000 features) using FIDDLE (FlexIble Data-Driven pipeLinE), an open-source preprocessing pipeline for structured clinical data. These 5 prediction tasks were designed in consultation with a critical care physician for their clinical importance, and were used as part of the proof-of-concept experiments in the original paper to demonstrate FIDDLE's utility in aiding the feature engineering step of machine learning model development. The intent of this release is to share preprocessed MIMIC-III and eICU datasets used in the experiments to support and enable reproducible machine learning research on EHR data.

  8. Structure Annotations of Assessment and Plan Sections from MIMIC-III

    • data.niaid.nih.gov
    Updated Apr 17, 2022
    Cite
    Stupp, Doron (2022). Structure Annotations of Assessment and Plan Sections from MIMIC-III [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6413404
    Dataset updated
    Apr 17, 2022
    Dataset provided by
    Matias, Yossi
    Feder, Amir
    Lee, I-Ching
    Barequet, Ronnie
    Oren, Eyal
    Benjamini, Ayelet
    Rajkomar, Alvin
    Ofek, Eran
    Hassidim, Avinatan
    Stupp, Doron
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Physicians record their detailed thought-processes about diagnoses and treatments as unstructured text in a section of a clinical note called the "assessment and plan". This information is more clinically rich than structured billing codes assigned for an encounter but harder to reliably extract given the complexity of clinical language and documentation habits. To structure these sections we collected a dataset of annotations over assessment and plan sections from the publicly available and de-identified MIMIC-III dataset, and developed deep-learning based models to perform this task, described in the associated paper available as a pre-print at: https://www.medrxiv.org/content/10.1101/2022.04.13.22273438v1

    When using this data please cite our paper:

    @article{Stupp2022.04.13.22273438,
      author    = {Stupp, Doron and Barequet, Ronnie and Lee, I-Ching and Oren, Eyal and Feder, Amir and Benjamini, Ayelet and Hassidim, Avinatan and Matias, Yossi and Ofek, Eran and Rajkomar, Alvin},
      title     = {Structured Understanding of Assessment and Plans in Clinical Documentation},
      year      = {2022},
      doi       = {10.1101/2022.04.13.22273438},
      publisher = {Cold Spring Harbor Laboratory Press},
      url       = {https://www.medrxiv.org/content/early/2022/04/17/2022.04.13.22273438},
      journal   = {medRxiv}
    }

    The dataset, presented here, contains annotations of assessment and plan sections of notes from the publicly available and de-identified MIMIC-III dataset, marking the active problems, their assessment description, and plan action items. Action items are additionally marked as one of 8 categories (listed below). The dataset contains over 30,000 annotations of 579 notes from distinct patients, annotated by 6 medical residents and students.

    The dataset is divided into 4 partitions - a training set (481 notes), validation set (50 notes), test set (48 notes) and an inter-rater set. The inter-rater set contains the annotations of each of the raters over the test set. Rater 1 in the inter-rater set should be regarded as an intra-rater comparison (details in the paper). The labels underwent automatic normalization to capture entire word boundaries and remove flanking non-alphanumeric characters.

    Code for transforming labels into TensorFlow examples and training models as described in the paper will be made available at GitHub: https://github.com/google-research/google-research/tree/master/assessment_plan_modeling

    In order to use these annotations, the user additionally needs to obtain the text of the notes, found in the NOTEEVENTS table from MIMIC-III; access must be acquired independently (https://mimic.mit.edu/).

    Annotations are given as character spans in a CSV file with the following schema:

        partition (categorical; one of [train, val, test, interrater]): The set of ratings the span belongs to.
        rater_id (int): Unique id for each of the raters.
        note_id (int): The note's unique note_id; links to the MIMIC-III notes table (as ROW_ID).
        span_type (categorical; one of [PROBLEM_TITLE, PROBLEM_DESCRIPTION, ACTION_ITEM]): Type of the span as annotated by raters.
        char_start (int): Character offset of the span start, from the note start.
        char_end (int): Character offset of the span end, from the note start.
        action_item_type (categorical; one of [MEDICATIONS, IMAGING, OBSERVATIONS_LABS, CONSULTS, NUTRITION, THERAPEUTIC_PROCEDURES, OTHER_DIAGNOSTIC_PROCEDURES, OTHER]): Type of action item if the span is an action item (empty otherwise) as annotated by raters.
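
    A minimal sketch of using these spans, assuming a local copy of the annotation CSV (filename hypothetical) and of the MIMIC-III NOTEEVENTS table (access acquired separately, as noted above):

        import pandas as pd

        # Hypothetical filenames for the annotation release and MIMIC-III notes.
        spans = pd.read_csv("ap_annotations.csv")
        notes = pd.read_csv("NOTEEVENTS.csv.gz", usecols=["ROW_ID", "TEXT"])

        # note_id links to NOTEEVENTS.ROW_ID; recover span text by character offsets.
        merged = spans.merge(notes, left_on="note_id", right_on="ROW_ID")
        merged["span_text"] = merged.apply(
            lambda r: r["TEXT"][r["char_start"]:r["char_end"]], axis=1)
        print(merged[["note_id", "span_type", "span_text"]].head())
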
  9. Test AU-ROC scores for four models trained with features extracted using PubMedBERT.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson (2023). Test AU-ROC scores for four models trained with features extracted using PubMedBERT. [Dataset]. http://doi.org/10.1371/journal.pone.0262182.t005
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Test AU-ROC scores for four models trained with features extracted using PubMedBERT.

  10. Test AU-ROC scores by four models trained with features extracted using TF-IDF.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson (2023). Test AU-ROC scores by four models trained with features extracted using TF-IDF. [Dataset]. http://doi.org/10.1371/journal.pone.0262182.t003
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Test AU-ROC scores by four models trained with features extracted using TF-IDF.

  11. MIMIC-IV-Ext-Apixaban-Trial-Criteria-Questions

    • physionet.org
    Updated Apr 30, 2025
    Cite
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones (2025). MIMIC-IV-Ext-Apixaban-Trial-Criteria-Questions [Dataset]. http://doi.org/10.13026/4p6q-vb04
    Dataset updated
    Apr 30, 2025
    Authors
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Large-language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study, we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We then used these questions to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes. To evaluate the resulting models, we created 23 questions resembling eligibility criteria from the apixaban clinical trial and evaluated them on a random sample of 100 patient notes from MIMIC-IV. Notes from MIMIC-IV were taken from after 2012 to ensure no overlap with the notes from MIMIC-III that were used to generate the fine-tuning data. We release the 2,300 total question-answer pairs as a dataset here.

  12. Data from: PDD Graph: Bridging Electronic Medical Records and Biomedical Knowledge Graphs via Entity Linking

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Meng Wang; Jiaheng Zhang; Jun Liu; Wei Hu; Sen Wang; Xue Li; Wenqiang Liu (2023). PDD Graph: Bridging Electronic Medical Records and Biomedical Knowledge Graphs via Entity Linking [Dataset]. http://doi.org/10.6084/m9.figshare.5242138
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Meng Wang; Jiaheng Zhang; Jun Liu; Wei Hu; Sen Wang; Xue Li; Wenqiang Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Patient-drug-disease (PDD) Graph dataset, utilising electronic medical records (EMRs) and biomedical knowledge graphs. The novel framework used to construct the PDD graph is described in the associated publication.

    PDD is an RDF graph consisting of PDD facts, where a PDD fact is represented by an RDF triple indicating that a patient takes a drug or is diagnosed with a disease, for instance (pdd:274671, pdd:diagnosed, sepsis).

    Data files are in .nt N-Triples format, a line-based syntax for an RDF graph, and can be opened with openly available text-editing software.

    diagnose_icd_information.nt contains RDF triples mapping patients to diagnoses. For example: (pdd:18740, pdd:diagnosed, icd99592), where pdd:18740 is a patient entity and icd99592 is the ICD-9 code of sepsis.

    drug_patients.nt contains RDF triples mapping patients to drugs. For example: (pdd:18740, pdd:prescribed, aspirin), where pdd:18740 is a patient entity and aspirin is the drug's name.

    Background: Electronic medical records contain multi-format electronic medical data comprising an abundance of medical knowledge. Faced with patients' symptoms, experienced caregivers make the right medical decisions based on professional knowledge that accurately grasps the relationships between symptoms, diagnoses, and corresponding treatments. In the associated paper, we aim to capture these relationships by constructing a large, high-quality heterogeneous graph linking patients, diseases, and drugs (PDD) in EMRs. Specifically, we propose a novel framework to extract important medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) and automatically link them to existing biomedical knowledge graphs, including the ICD-9 ontology and DrugBank. The PDD graph presented in this paper is accessible on the Web via a SPARQL endpoint as well as in .nt format in this repository, and provides a pathway for medical discovery and applications such as effective treatment recommendation.

    De-identification: MIMIC-III contains clinical information of patients. Although the protected health information was de-identified, researchers who seek to use more clinical data should complete an online training course and then apply for permission to download the complete MIMIC-III dataset: https://mimic.physionet.org/
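
    A minimal sketch of loading the N-Triples files named above with rdflib; the exact pdd predicate URIs are an assumption, so inspect the files for the real ones:

        from rdflib import Graph

        # Load the patient-diagnosis triples described above.
        g = Graph()
        g.parse("diagnose_icd_information.nt", format="nt")

        # Print patient-diagnosis pairs; matching by URI suffix, since the full
        # predicate URI is not given in the description.
        for subj, pred, obj in g:
            if str(pred).endswith("diagnosed"):
                print(subj, obj)
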

  13. Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Sebastian Gehrmann; Franck Dernoncourt; Yeran Li; Eric T. Carlson; Joy T. Wu; Jonathan Welt; John Foote Jr.; Edward T. Moseley; David W. Grant; Patrick D. Tyler; Leo A. Celi (2023). Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives [Dataset]. http://doi.org/10.1371/journal.pone.0192360
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sebastian Gehrmann; Franck Dernoncourt; Yeran Li; Eric T. Carlson; Joy T. Wu; Jonathan Welt; John Foote Jr.; Edward T. Moseley; David W. Grant; Patrick D. Tyler; Leo A. Celi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In secondary analysis of electronic health records, a crucial task consists in correctly identifying the patient cohort under investigation. In many cases, the most valuable and relevant information for an accurate classification of medical conditions exist only in clinical narratives. Therefore, it is necessary to use natural language processing (NLP) techniques to extract and evaluate these narratives. The most commonly used approach to this problem relies on extracting a number of clinician-defined medical concepts from text and using machine learning techniques to identify whether a particular patient has a certain condition. However, recent advances in deep learning and NLP enable models to learn a rich representation of (medical) language. Convolutional neural networks (CNN) for text classification can augment the existing techniques by leveraging the representation of language to learn which phrases in a text are relevant for a given medical condition. In this work, we compare concept extraction based methods with CNNs and other commonly used models in NLP in ten phenotyping tasks using 1,610 discharge summaries from the MIMIC-III database. We show that CNNs outperform concept extraction based methods in almost all of the tasks, with an improvement in F1-score of up to 26 and up to 7 percentage points in area under the ROC curve (AUC). We additionally assess the interpretability of both approaches by presenting and evaluating methods that calculate and extract the most salient phrases for a prediction. The results indicate that CNNs are a valid alternative to existing approaches in patient phenotyping and cohort identification, and should be further investigated. Moreover, the deep learning approach presented in this paper can be used to assist clinicians during chart review or support the extraction of billing codes from text by identifying and highlighting relevant phrases for various medical conditions.

  14. Test AU-ROC scores for four models trained with features extracted using FastText.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson (2023). Test AU-ROC scores for four models trained with features extracted using FastText. [Dataset]. http://doi.org/10.1371/journal.pone.0262182.t004
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Test AU-ROC scores for four models trained with features extracted using FastText.

  15. MIMIC-III-Ext-Synthetic-Clinical-Trial-Questions

    • physionet.org
    Updated Apr 22, 2025
    Cite
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones (2025). MIMIC-III-Ext-Synthetic-Clinical-Trial-Questions [Dataset]. http://doi.org/10.13026/30k0-av04
    Dataset updated
    Apr 22, 2025
    Authors
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Large-language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study [1], we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We manually reviewed 1000 of these examples and release them here. These examples can then be used to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes.

  16. Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a...

    • opendatalab.com
    zip
    Updated May 4, 2023
    Cite
    University College London (2023). Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a sample of MIMIC-III clinical notes) [Dataset]. https://opendatalab.com/OpenDataLab/Rare_Diseases_Mentions_in_etc
    Available download formats: zip (637266 bytes)
    Dataset updated
    May 4, 2023
    Dataset provided by
    Health Data Research Uk
    University College London
    University of Edinburgh
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Data annotation

    The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in full_set_RD_ann_MIMIC_III_disch.csv. The data split:

        * the first 400 rows are used for validation (validation_set_RD_ann_MIMIC_III_disch.csv), and
        * the last 673 rows are used for testing (test_set_RD_ann_MIMIC_III_disch.csv).

    The 198 rare disease mention annotations (from 145 MIMIC-III radiology reports) are in test_set_RD_ann_MIMIC_III_rad.csv. Note that radiology reports were only used for testing, not for validation.

    Note: a row can be considered a true phenotype of the patient only when the value of the column "gold mention-to-ORDO label" is 1.

    Data sampling and annotation procedure

        (i) Randomly sampled 500 discharge summaries (and 1,000 radiology reports) from MIMIC-III.
        (ii) 312 of the 500 discharge summaries (and 145 of the 1,000 radiology reports) have at least one positive UMLS mention linked to ORDO, as identified by SemEHR; there are altogether 1,073 UMLS/ORDO mentions (and 198 in radiology reports).
        (iii) 3 medical informatics researchers (staff or PhD students) annotated the 1,073 mentions (and 2 medical informatics researchers annotated the 198 mentions in radiology reports), regarding whether they are correct patient phenotypes matched to UMLS and ORDO. Contradictions in the annotations were then resolved by another research staff member with a biomedical background.

    Data dictionary

        ROW_ID: Identifier unique to each row; see https://mimic.physionet.org/mimictables/noteevents/
        SUBJECT_ID: Identifier unique to a patient; see https://mimic.physionet.org/mimictables/noteevents/
        HADM_ID: Identifier unique to a patient hospital stay; see https://mimic.physionet.org/mimictables/noteevents/
        document structure name: The document structure name of the mention, identified by SemEHR (only for discharge summaries).
        document structure offset in full document: The start and end offsets of the document structure text (or template) in the whole discharge summary; the document structure is parsed by SemEHR with regular expressions (only for discharge summaries).
        mention: The mention identified by SemEHR.
        mention offset in document structure: The start and end offsets of the mention in the document structure (only for discharge summaries).
        mention offset in full document: The start and end offsets of the mention in the whole discharge summary; can be calculated from "document structure offset in full document" and "mention offset in document structure".
        UMLS with desc: The UMLS concept identified by SemEHR, corresponding to the mention.
        ORDO with desc: The ORDO concept matched to the UMLS concept, using the linkage in the ORDO ontology (see https://www.ebi.ac.uk/ols/ontologies/ordo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_3325 as an example).
        gold mention-to-UMLS label: Whether the mention-UMLS pair indicates a correct phenotype of the patient (i.e. a positive mention that correctly matches the UMLS concept); 1 if correct, 0 if not.
        gold UMLS-to-ORDO label: Whether the matching from the UMLS concept to the ORDO concept is correct; 1 if correct, 0 if not.
        gold mention-to-ORDO label: Whether the mention-ORDO triple indicates a correct phenotype of the patient; 1 if correct, 0 if not. This column is 1 if both the mention-to-UMLS label and the UMLS-to-ORDO label are 1, otherwise 0.

    Notes:

        * These manual annotations are by no means perfect. There were hypothetical mentions that were difficult for the annotators to decide on. Also, the annotations are based on the output of SemEHR, which does not have 100% recall, so they may not cover all rare disease mentions in the sampled discharge summaries.
        * In row 323 of the full set or the validation set, the mention "nph" is not in the document structure (due to an error in mention extraction), so the gold mention-to-UMLS label is -1.
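
    Given the data dictionary above, a minimal pandas sketch for keeping only confirmed patient phenotypes (rows whose gold mention-to-ORDO label is 1); the column name is taken from the data dictionary, so verify it against the actual CSV header:

        import pandas as pd

        # Filename as given in the description above.
        ann = pd.read_csv("full_set_RD_ann_MIMIC_III_disch.csv")

        # A row is a true phenotype of the patient only when this column is 1.
        true_phenotypes = ann[ann["gold mention-to-ORDO label"] == 1]
        print(len(true_phenotypes), "confirmed rare-disease mentions")
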

  17. EHR data from patients with AML and AML-Simiar diseases

    • figshare.com
    bin
    Updated Dec 4, 2024
    Cite
    Chang Sun (2024). EHR data from patients with AML and AML-Simiar diseases [Dataset]. http://doi.org/10.6084/m9.figshare.27959760.v1
    Available download formats: bin
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chang Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This describes the datasets used for training and evaluating the Onto-CGAN model, sourced from the MIMIC-IV Clinical Database v2.2. The query script used for data extraction is listed below. Due to the restricted access policies of the MIMIC database, we are unable to publish the extracted subset of MIMIC data. However, researchers with authorized access to the MIMIC-IV database may request the experimental patient data from the corresponding author.

    SQL query in MIMIC (BigQuery):

        SELECT DISTINCT
            diag_table.subject_id,
            diag_table.icd_code,
            diag_table.hadm_id,
            patients_table.gender,
            patients_table.anchor_age,
            omr_table.result_name,
            omr_table.result_value,
            patients_table.dod
        FROM `physionet-data.mimiciv_hosp.diagnoses_icd` AS diag_table
        JOIN `physionet-data.mimiciv_hosp.patients` AS patients_table
            ON diag_table.subject_id = patients_table.subject_id
        JOIN `physionet-data.mimiciv_hosp.omr` AS omr_table
            ON diag_table.subject_id = omr_table.subject_id
        WHERE diag_table.icd_code LIKE '20%'
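
    A minimal sketch of running the query above against the PhysioNet BigQuery mirror. It assumes credentialed PhysioNet access linked to a Google Cloud account; the project name and the local filename holding the query are placeholders:

        from google.cloud import bigquery

        client = bigquery.Client(project="my-billing-project")  # placeholder project
        sql = open("onto_cgan_extract.sql").read()  # the query above, saved locally
        df = client.query(sql).to_dataframe()  # requires pandas and db-dtypes
        print(df.shape)
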

  18. Data from: MIMICEL: MIMIC-IV Event Log for Emergency Department

    • physionet.org
    Updated Jun 16, 2023
    + more versions
    Cite
    Jia Wei; Zhipeng He; Chun Ouyang; Catarina Moreira (2023). MIMICEL: MIMIC-IV Event Log for Emergency Department [Dataset]. http://doi.org/10.13026/c9yj-1t90
    Dataset updated
    Jun 16, 2023
    Authors
    Jia Wei; Zhipeng He; Chun Ouyang; Catarina Moreira
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    In this work, we extract an event log from the MIMIC-IV-ED dataset by adopting a well-established event log generation methodology, and we name this event log MIMICEL. The data tables in the MIMIC-IV-ED dataset relate to each other through the existing relational database schema, and each table records the individual activities of patients along their journey through the emergency department (ED). While the data tables capture snapshots of a patient journey in the ED, the extracted event log MIMICEL aims to capture the end-to-end process of the patient journey. This enables analysis of existing patient flows, thereby improving the efficiency of ED processes.
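
    Since MIMICEL is an event log, process-mining toolkits apply directly. A hedged sketch with pm4py, assuming the log is exported as a CSV; the filename and column names are hypothetical, so check the MIMICEL documentation for the actual schema:

        import pandas as pd
        import pm4py

        df = pd.read_csv("mimicel.csv")  # hypothetical filename
        log = pm4py.format_dataframe(df, case_id="stay_id",
                                     activity_key="activity",
                                     timestamp_key="timestamp")

        # Discover the directly-follows graph of the ED patient journey.
        dfg, start_activities, end_activities = pm4py.discover_dfg(log)
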

  19. The most salient phrases for advanced heart failure and alcohol abuse.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Sebastian Gehrmann; Franck Dernoncourt; Yeran Li; Eric T. Carlson; Joy T. Wu; Jonathan Welt; John Foote Jr.; Edward T. Moseley; David W. Grant; Patrick D. Tyler; Leo A. Celi (2023). The most salient phrases for advanced heart failure and alcohol abuse. [Dataset]. http://doi.org/10.1371/journal.pone.0192360.t003
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sebastian Gehrmann; Franck Dernoncourt; Yeran Li; Eric T. Carlson; Joy T. Wu; Jonathan Welt; John Foote Jr.; Edward T. Moseley; David W. Grant; Patrick D. Tyler; Leo A. Celi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The salient cTAKES CUIs are extracted from the filtered RF model.

  20. Data from: LLaVA-Rad MIMIC-CXR Annotations

    • physionet.org
    • paperswithcode.com
    Updated Jan 24, 2025
    + more versions
    Cite
    Juan Manuel Zambrano Chaves; Shih-Cheng Huang; Yanbo Xu; Hanwen Xu; Naoto Usuyama; Sheng Zhang; Fei Wang; Yujia Xie; Mahmoud Khademi; Ziyi Yang; Hany Awadalla; Julia Gong; Houdong Hu; Jianwei Yang; Chunyuan Li; Jianfeng Gao; Yu Gu; Cliff Wong; Mu-Hsin Wei; Tristan Naumann; Muhao Chen; Matthew Lungren; Akshay Chaudhari; Serena Yeung; Curtis Langlotz; Sheng Wang; Hoifung Poon (2025). LLaVA-Rad MIMIC-CXR Annotations [Dataset]. http://doi.org/10.13026/4ma4-k740
    Dataset updated
    Jan 24, 2025
    Authors
    Juan Manuel Zambrano Chaves; Shih-Cheng Huang; Yanbo Xu; Hanwen Xu; Naoto Usuyama; Sheng Zhang; Fei Wang; Yujia Xie; Mahmoud Khademi; Ziyi Yang; Hany Awadalla; Julia Gong; Houdong Hu; Jianwei Yang; Chunyuan Li; Jianfeng Gao; Yu Gu; Cliff Wong; Mu-Hsin Wei; Tristan Naumann; Muhao Chen; Matthew Lungren; Akshay Chaudhari; Serena Yeung; Curtis Langlotz; Sheng Wang; Hoifung Poon
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    LLaVA-Rad MIMIC-CXR features more accurate section extractions from MIMIC-CXR free-text radiology reports. Traditionally, rule-based methods were used to extract sections such as the reason for exam, findings, and impression. However, these approaches often fail due to inconsistencies in report structure and clinical language. In this work, we leverage GPT-4 to extract these sections more reliably, adding 237,073 image-text pairs to the training split and 1,952 pairs to the validation split. This enhancement afforded the development and fine-tuning of LLaVA-Rad, a multimodal large language model (LLM) tailored for radiology applications, achieving improved performance on report generation tasks.

    This resource is provided to support reproducibility and for the benefit of the research community, enabling further exploration in vision–language modeling. For more details, please refer to the accompanying paper [1].
