35 datasets found
  1. Data from: Medication Extraction Labels for MIMIC-IV-Note Clinical Database

    • physionet.org
    Updated Dec 12, 2023
    Cite
    Akshay Goel; Almog Gueta; Omry Gilon; Sofia Erell; Amir Feder (2023). Medication Extraction Labels for MIMIC-IV-Note Clinical Database [Dataset]. http://doi.org/10.13026/ps1s-ab29
    Dataset updated
    Dec 12, 2023
    Authors
    Akshay Goel; Almog Gueta; Omry Gilon; Sofia Erell; Amir Feder
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This dataset release provides medication extraction labels for a subset of 600 discharge summaries from the 2023 MIMIC-IV-Note dataset. These labels are consistent with the schema from the 2009 i2b2 Workshop on NLP Challenges dataset. We utilized a Large Language Model (LLM) pipeline to generate these labels, achieving performance on par with the average human annotation specialist.

  2. EHRGym-MIMIC-Extract

    • huggingface.co
    Cite
    Yishan Zhong, EHRGym-MIMIC-Extract [Dataset]. https://huggingface.co/datasets/zhongys97/EHRGym-MIMIC-Extract
    Authors
    Yishan Zhong
    Description

    The zhongys97/EHRGym-MIMIC-Extract dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  3. MIMIC-IV

    • physionet.org
    Updated Oct 11, 2024
    Cite
    Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark (2024). MIMIC-IV [Dataset]. http://doi.org/10.13026/kpb9-mt58
    Dataset updated
    Oct 11, 2024
    Authors
    Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.

  4. EHR data from MIMIC-III

    • scidb.cn
    Updated Aug 24, 2021
    Cite
    Tingyi Wanyan; Hossein Honarvar; Ariful Azad; Ying Ding; Benjamin S. Glicksberg (2021). EHR data from MIMIC-III [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00094
    Available in Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Aug 24, 2021
    Dataset provided by
    Science Data Bank
    Authors
    Tingyi Wanyan; Hossein Honarvar; Ariful Azad; Ying Ding; Benjamin S. Glicksberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We conducted our experiments on de-identified EHR data from MIMIC-III. This dataset contains various clinical data relating to patient ICU admissions, such as disease diagnoses in the form of International Classification of Diseases (ICD)-9 codes and lab test results, as detailed in the Supplementary Materials. We collected data for 5,956 patients, extracting lab tests every hour from admission. There are 409 unique lab tests and 3,387 unique disease diagnoses in total. The diagnoses were obtained as ICD-9 codes and represented using one-hot encoding, where one represents patients with the disease and zero those without. We binned the lab test events into 6, 12, 24, and 48 hours prior to patient death or discharge from the ICU. From these data, we performed 10-fold cross-validated mortality predictions.
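
    As an illustration only (not the authors' code), the sketch below shows the setup described above on synthetic stand-in data: a one-hot diagnosis matrix and a 10-fold cross-validated mortality prediction. The model choice and all data are hypothetical.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        # Synthetic stand-in for the extracted cohort: rows are patients,
        # columns are one-hot diagnosis indicators (1 = diagnosed, 0 = not).
        rng = np.random.default_rng(0)
        X = rng.integers(0, 2, size=(200, 50))
        y = rng.integers(0, 2, size=200)  # hypothetical mortality labels

        # 10-fold cross-validated mortality prediction, as in the description.
        scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                                 cv=10, scoring="roc_auc")
        print(scores.mean())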

  5. Data from: CLIP: A Dataset for Extracting Action Items for Physicians from Hospital Discharge Notes

    • physionet.org
    Updated Jun 21, 2021
    + more versions
    Cite
    James Mullenbach; Yada Pruksachatkun; Sean Adler; Jennifer Seale; Jordan Swartz; T Greg McKelvey; Yi Yang; David Sontag (2021). CLIP: A Dataset for Extracting Action Items for Physicians from Hospital Discharge Notes [Dataset]. http://doi.org/10.13026/kw00-z903
    Dataset updated
    Jun 21, 2021
    Authors
    James Mullenbach; Yada Pruksachatkun; Sean Adler; Jennifer Seale; Jordan Swartz; T Greg McKelvey; Yi Yang; David Sontag
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    We created a dataset of clinical action items annotated over MIMIC-III. This dataset, which we call CLIP, is annotated by physicians and covers 718 discharge summaries, representing 107,494 sentences. Annotations were collected as character-level spans to discharge summaries after applying surrogate generation to fill in the anonymized templates from MIMIC-III text with faked data. We release these spans, their aggregation into sentence-level labels, and the sentence tokenizer used to aggregate the spans and label sentences. We also release the surrogate data generator, and the document IDs used for training, validation, and test splits, to enable reproduction. The spans are annotated with 0 or more labels of 7 different types, representing the different actions that may need to be taken: Appointment, Lab, Procedure, Medication, Imaging, Patient Instructions, and Other. We encourage the community to use this dataset to develop methods for automatically extracting clinical action items from discharge summaries.

  6. SQL code.

    • plos.figshare.com
    7z
    Updated Jun 21, 2023
    Cite
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang (2023). SQL code. [Dataset]. http://doi.org/10.1371/journal.pone.0276835.s001
    Available download formats: 7z
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The code shows how to extract data from MIMIC-III. (7Z archive)

  7. Data from: MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing

    • physionet.org
    Updated Apr 28, 2021
    Cite
    Shengpu Tang; Parmida Davarmanesh; Yanmeng Song; Danai Koutra; Michael Sjoding; Jenna Wiens (2021). MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing [Dataset]. http://doi.org/10.13026/2qtg-k467
    Dataset updated
    Apr 28, 2021
    Authors
    Shengpu Tang; Parmida Davarmanesh; Yanmeng Song; Danai Koutra; Michael Sjoding; Jenna Wiens
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This is a preprocessed dataset derived from patient records in MIMIC-III and eICU, two large-scale electronic health record (EHR) databases. It contains features and labels for 5 prediction tasks involving 3 adverse outcomes (prediction times listed in parentheses): in-hospital mortality (48h), acute respiratory failure (4h and 12h), and shock (4h and 12h). We extracted comprehensive, high-dimensional feature representations (up to ~8,000 features) using FIDDLE (FlexIble Data-Driven pipeLinE), an open-source preprocessing pipeline for structured clinical data. These 5 prediction tasks were designed in consultation with a critical care physician for their clinical importance, and were used as part of the proof-of-concept experiments in the original paper to demonstrate FIDDLE's utility in aiding the feature engineering step of machine learning model development. The intent of this release is to share preprocessed MIMIC-III and eICU datasets used in the experiments to support and enable reproducible machine learning research on EHR data.

  8. Structure Annotations of Assessment and Plan Sections from MIMIC-III

    • data.niaid.nih.gov
    Updated Apr 17, 2022
    Cite
    Stupp, Doron (2022). Structure Annotations of Assessment and Plan Sections from MIMIC-III [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6413404
    Dataset updated
    Apr 17, 2022
    Dataset provided by
    Matias, Yossi
    Feder, Amir
    Lee, I-Ching
    Barequet, Ronnie
    Oren, Eyal
    Benjamini, Ayelet
    Rajkomar, Alvin
    Ofek, Eran
    Hassidim, Avinatan
    Stupp, Doron
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Physicians record their detailed thought-processes about diagnoses and treatments as unstructured text in a section of a clinical note called the "assessment and plan". This information is more clinically rich than structured billing codes assigned for an encounter but harder to reliably extract given the complexity of clinical language and documentation habits. To structure these sections we collected a dataset of annotations over assessment and plan sections from the publicly available and de-identified MIMIC-III dataset, and developed deep-learning based models to perform this task, described in the associated paper available as a pre-print at: https://www.medrxiv.org/content/10.1101/2022.04.13.22273438v1

    When using this data please cite our paper:

    @article{Stupp2022.04.13.22273438,
      author    = {Stupp, Doron and Barequet, Ronnie and Lee, I-Ching and Oren, Eyal and Feder, Amir and Benjamini, Ayelet and Hassidim, Avinatan and Matias, Yossi and Ofek, Eran and Rajkomar, Alvin},
      title     = {Structured Understanding of Assessment and Plans in Clinical Documentation},
      year      = {2022},
      doi       = {10.1101/2022.04.13.22273438},
      publisher = {Cold Spring Harbor Laboratory Press},
      url       = {https://www.medrxiv.org/content/early/2022/04/17/2022.04.13.22273438},
      journal   = {medRxiv}
    }

    The dataset, presented here, contains annotations of assessment and plan sections of notes from the publicly available and de-identified MIMIC-III dataset, marking the active problems, their assessment description, and plan action items. Action items are additionally marked as one of 8 categories (listed below). The dataset contains over 30,000 annotations of 579 notes from distinct patients, annotated by 6 medical residents and students.

    The dataset is divided into 4 partitions - a training set (481 notes), validation set (50 notes), test set (48 notes) and an inter-rater set. The inter-rater set contains the annotations of each of the raters over the test set. Rater 1 in the inter-rater set should be regarded as an intra-rater comparison (details in the paper). The labels underwent automatic normalization to capture entire word boundaries and remove flanking non-alphanumeric characters.

    Code for transforming labels into TensorFlow examples and training models as described in the paper will be made available at GitHub: https://github.com/google-research/google-research/tree/master/assessment_plan_modeling

    In order to use these annotations, the user additionally needs to obtain the text of the notes, found in the NOTEEVENTS table from MIMIC-III; access must be acquired independently (https://mimic.mit.edu/).

    Annotations are given as character spans in a CSV file with the following schema:

        partition (categorical; one of [train, val, test, interrater]): The set of ratings the span belongs to.
        rater_id (int): Unique id for each of the raters.
        note_id (int): The note's unique note_id; links to the MIMIC-III notes table (as ROW_ID).
        span_type (categorical; one of [PROBLEM_TITLE, PROBLEM_DESCRIPTION, ACTION_ITEM]): Type of the span as annotated by raters.
        char_start (int): Character offset of the span start, from the note start.
        char_end (int): Character offset of the span end, from the note start.
        action_item_type (categorical; one of [MEDICATIONS, IMAGING, OBSERVATIONS_LABS, CONSULTS, NUTRITION, THERAPEUTIC_PROCEDURES, OTHER_DIAGNOSTIC_PROCEDURES, OTHER]): Type of action item if the span is an action item (empty otherwise) as annotated by raters.
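
    A minimal sketch of using these spans, assuming a local copy of the annotation CSV (filename hypothetical) and of the MIMIC-III NOTEEVENTS table (access acquired separately, as noted above):

        import pandas as pd

        # Hypothetical filenames for the annotation release and MIMIC-III notes.
        spans = pd.read_csv("ap_annotations.csv")
        notes = pd.read_csv("NOTEEVENTS.csv.gz", usecols=["ROW_ID", "TEXT"])

        # note_id links to NOTEEVENTS.ROW_ID; recover span text by character offsets.
        merged = spans.merge(notes, left_on="note_id", right_on="ROW_ID")
        merged["span_text"] = merged.apply(
            lambda r: r["TEXT"][r["char_start"]:r["char_end"]], axis=1)
        print(merged[["note_id", "span_type", "span_text"]].head())
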
  9. Test AU-ROC scores for four models trained with features extracted using PubMedBERT.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson (2023). Test AU-ROC scores for four models trained with features extracted using PubMedBERT. [Dataset]. http://doi.org/10.1371/journal.pone.0262182.t005
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Test AU-ROC scores for four models trained with features extracted using PubMedBERT.

  10. Test AU-ROC scores by four models trained with features extracted using TF-IDF.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson (2023). Test AU-ROC scores by four models trained with features extracted using TF-IDF. [Dataset]. http://doi.org/10.1371/journal.pone.0262182.t003
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Test AU-ROC scores by four models trained with features extracted using TF-IDF.

  11. MIMIC-IV-Ext-Apixaban-Trial-Criteria-Questions

    • physionet.org
    Updated Apr 30, 2025
    Cite
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones (2025). MIMIC-IV-Ext-Apixaban-Trial-Criteria-Questions [Dataset]. http://doi.org/10.13026/4p6q-vb04
    Dataset updated
    Apr 30, 2025
    Authors
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Large-language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study, we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We then used these questions to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes. To evaluate the resulting models, we created 23 questions resembling eligibility criteria from the apixaban clinical trial and evaluated them on a random sample of 100 patient notes from MIMIC-IV. Notes from MIMIC-IV were taken from after 2012 to ensure no overlap with the notes from MIMIC-III that were used to generate the fine-tuning data. We release the 2,300 total question-answer pairs as a dataset here.

  12. Data from: PDD Graph: Bridging Electronic Medical Records and Biomedical Knowledge Graphs via Entity Linking

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Meng Wang; Jiaheng Zhang; Jun Liu; Wei Hu; Sen Wang; Xue Li; Wenqiang Liu (2023). PDD Graph: Bridging Electronic Medical Records and Biomedical Knowledge Graphs via Entity Linking [Dataset]. http://doi.org/10.6084/m9.figshare.5242138
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Meng Wang; Jiaheng Zhang; Jun Liu; Wei Hu; Sen Wang; Xue Li; Wenqiang Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Patient-drug-disease (PDD) Graph dataset, utilising electronic medical records (EMRs) and biomedical knowledge graphs. The novel framework used to construct the PDD graph is described in the associated publication.

    PDD is an RDF graph consisting of PDD facts, where a PDD fact is represented by an RDF triple indicating that a patient takes a drug or is diagnosed with a disease, for instance (pdd:274671, pdd:diagnosed, sepsis).

    Data files are in .nt N-Triples format, a line-based syntax for an RDF graph, and can be opened with openly available text-editing software.

    diagnose_icd_information.nt contains RDF triples mapping patients to diagnoses. For example: (pdd:18740, pdd:diagnosed, icd99592), where pdd:18740 is a patient entity and icd99592 is the ICD-9 code of sepsis.

    drug_patients.nt contains RDF triples mapping patients to drugs. For example: (pdd:18740, pdd:prescribed, aspirin), where pdd:18740 is a patient entity and aspirin is the drug's name.

    Background: Electronic medical records contain multi-format electronic medical data comprising an abundance of medical knowledge. Faced with patients' symptoms, experienced caregivers make the right medical decisions based on professional knowledge that accurately grasps the relationships between symptoms, diagnoses, and corresponding treatments. In the associated paper, we aim to capture these relationships by constructing a large, high-quality heterogeneous graph linking patients, diseases, and drugs (PDD) in EMRs. Specifically, we propose a novel framework to extract important medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) and automatically link them to existing biomedical knowledge graphs, including the ICD-9 ontology and DrugBank. The PDD graph presented in this paper is accessible on the Web via a SPARQL endpoint as well as in .nt format in this repository, and provides a pathway for medical discovery and applications such as effective treatment recommendation.

    De-identification: MIMIC-III contains clinical information of patients. Although the protected health information was de-identified, researchers who seek to use more clinical data should complete an online training course and then apply for permission to download the complete MIMIC-III dataset: https://mimic.physionet.org/
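
    A minimal sketch of loading the N-Triples files named above with rdflib; the exact pdd predicate URIs are an assumption, so inspect the files for the real ones:

        from rdflib import Graph

        # Load the patient-diagnosis triples described above.
        g = Graph()
        g.parse("diagnose_icd_information.nt", format="nt")

        # Print patient-diagnosis pairs; matching by URI suffix, since the full
        # predicate URI is not given in the description.
        for subj, pred, obj in g:
            if str(pred).endswith("diagnosed"):
                print(subj, obj)
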

  13. Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Sebastian Gehrmann; Franck Dernoncourt; Yeran Li; Eric T. Carlson; Joy T. Wu; Jonathan Welt; John Foote Jr.; Edward T. Moseley; David W. Grant; Patrick D. Tyler; Leo A. Celi (2023). Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives [Dataset]. http://doi.org/10.1371/journal.pone.0192360
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sebastian Gehrmann; Franck Dernoncourt; Yeran Li; Eric T. Carlson; Joy T. Wu; Jonathan Welt; John Foote Jr.; Edward T. Moseley; David W. Grant; Patrick D. Tyler; Leo A. Celi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In secondary analysis of electronic health records, a crucial task consists in correctly identifying the patient cohort under investigation. In many cases, the most valuable and relevant information for an accurate classification of medical conditions exist only in clinical narratives. Therefore, it is necessary to use natural language processing (NLP) techniques to extract and evaluate these narratives. The most commonly used approach to this problem relies on extracting a number of clinician-defined medical concepts from text and using machine learning techniques to identify whether a particular patient has a certain condition. However, recent advances in deep learning and NLP enable models to learn a rich representation of (medical) language. Convolutional neural networks (CNN) for text classification can augment the existing techniques by leveraging the representation of language to learn which phrases in a text are relevant for a given medical condition. In this work, we compare concept extraction based methods with CNNs and other commonly used models in NLP in ten phenotyping tasks using 1,610 discharge summaries from the MIMIC-III database. We show that CNNs outperform concept extraction based methods in almost all of the tasks, with an improvement in F1-score of up to 26 and up to 7 percentage points in area under the ROC curve (AUC). We additionally assess the interpretability of both approaches by presenting and evaluating methods that calculate and extract the most salient phrases for a prediction. The results indicate that CNNs are a valid alternative to existing approaches in patient phenotyping and cohort identification, and should be further investigated. Moreover, the deep learning approach presented in this paper can be used to assist clinicians during chart review or support the extraction of billing codes from text by identifying and highlighting relevant phrases for various medical conditions.

  14. Test AU-ROC scores for four models trained with features extracted using FastText.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson (2023). Test AU-ROC scores for four models trained with features extracted using FastText. [Dataset]. http://doi.org/10.1371/journal.pone.0262182.t004
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Maria Mahbub; Sudarshan Srinivasan; Ioana Danciu; Alina Peluso; Edmon Begoli; Suzanne Tamang; Gregory D. Peterson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Test AU-ROC scores for four models trained with features extracted using FastText.

  15. MIMIC-III-Ext-Synthetic-Clinical-Trial-Questions

    • physionet.org
    Updated Apr 22, 2025
    Cite
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones (2025). MIMIC-III-Ext-Synthetic-Clinical-Trial-Questions [Dataset]. http://doi.org/10.13026/30k0-av04
    Dataset updated
    Apr 22, 2025
    Authors
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Large-language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study [1], we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We manually reviewed 1000 of these examples and release them here. These examples can then be used to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes.

  16. Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a...

    • opendatalab.com
    zip
    Updated May 4, 2023
    Cite
    University College London (2023). Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a sample of MIMIC-III clinical notes) [Dataset]. https://opendatalab.com/OpenDataLab/Rare_Diseases_Mentions_in_etc
    Available download formats: zip (637266 bytes)
    Dataset updated
    May 4, 2023
    Dataset provided by
    Health Data Research Uk
    University College London
    University of Edinburgh
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Data annotation

    The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in full_set_RD_ann_MIMIC_III_disch.csv. The data split:

        * the first 400 rows are used for validation (validation_set_RD_ann_MIMIC_III_disch.csv), and
        * the last 673 rows are used for testing (test_set_RD_ann_MIMIC_III_disch.csv).

    The 198 rare disease mention annotations (from 145 MIMIC-III radiology reports) are in test_set_RD_ann_MIMIC_III_rad.csv. Note that radiology reports were only used for testing, not for validation.

    Note: a row can be considered a true phenotype of the patient only when the value of the column "gold mention-to-ORDO label" is 1.

    Data sampling and annotation procedure

        (i) Randomly sampled 500 discharge summaries (and 1,000 radiology reports) from MIMIC-III.
        (ii) 312 of the 500 discharge summaries (and 145 of the 1,000 radiology reports) have at least one positive UMLS mention linked to ORDO, as identified by SemEHR; there are altogether 1,073 UMLS/ORDO mentions (and 198 in radiology reports).
        (iii) 3 medical informatics researchers (staff or PhD students) annotated the 1,073 mentions (and 2 medical informatics researchers annotated the 198 mentions in radiology reports), regarding whether they are correct patient phenotypes matched to UMLS and ORDO. Contradictions in the annotations were then resolved by another research staff member with a biomedical background.

    Data dictionary

        ROW_ID: Identifier unique to each row; see https://mimic.physionet.org/mimictables/noteevents/
        SUBJECT_ID: Identifier unique to a patient; see https://mimic.physionet.org/mimictables/noteevents/
        HADM_ID: Identifier unique to a patient hospital stay; see https://mimic.physionet.org/mimictables/noteevents/
        document structure name: The document structure name of the mention, identified by SemEHR (only for discharge summaries).
        document structure offset in full document: The start and end offsets of the document structure text (or template) in the whole discharge summary; the document structure is parsed by SemEHR with regular expressions (only for discharge summaries).
        mention: The mention identified by SemEHR.
        mention offset in document structure: The start and end offsets of the mention in the document structure (only for discharge summaries).
        mention offset in full document: The start and end offsets of the mention in the whole discharge summary; can be calculated from "document structure offset in full document" and "mention offset in document structure".
        UMLS with desc: The UMLS concept identified by SemEHR, corresponding to the mention.
        ORDO with desc: The ORDO concept matched to the UMLS concept, using the linkage in the ORDO ontology (see https://www.ebi.ac.uk/ols/ontologies/ordo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_3325 as an example).
        gold mention-to-UMLS label: Whether the mention-UMLS pair indicates a correct phenotype of the patient (i.e. a positive mention that correctly matches the UMLS concept); 1 if correct, 0 if not.
        gold UMLS-to-ORDO label: Whether the matching from the UMLS concept to the ORDO concept is correct; 1 if correct, 0 if not.
        gold mention-to-ORDO label: Whether the mention-ORDO triple indicates a correct phenotype of the patient; 1 if correct, 0 if not. This column is 1 if both the mention-to-UMLS label and the UMLS-to-ORDO label are 1, otherwise 0.

    Notes:

        * These manual annotations are by no means perfect. There were hypothetical mentions that were difficult for the annotators to decide on. Also, the annotations are based on the output of SemEHR, which does not have 100% recall, so they may not cover all rare disease mentions in the sampled discharge summaries.
        * In row 323 of the full set or the validation set, the mention "nph" is not in the document structure (due to an error in mention extraction), so the gold mention-to-UMLS label is -1.
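
    Given the data dictionary above, a minimal pandas sketch for keeping only confirmed patient phenotypes (rows whose gold mention-to-ORDO label is 1); the column name is taken from the data dictionary, so verify it against the actual CSV header:

        import pandas as pd

        # Filename as given in the description above.
        ann = pd.read_csv("full_set_RD_ann_MIMIC_III_disch.csv")

        # A row is a true phenotype of the patient only when this column is 1.
        true_phenotypes = ann[ann["gold mention-to-ORDO label"] == 1]
        print(len(true_phenotypes), "confirmed rare-disease mentions")
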

  17. EHR data from patients with AML and AML-Simiar diseases

    • figshare.com
    bin
    Updated Dec 4, 2024
    Cite
    Chang Sun (2024). EHR data from patients with AML and AML-Simiar diseases [Dataset]. http://doi.org/10.6084/m9.figshare.27959760.v1
    Available download formats: bin
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chang Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This describes the datasets used for training and evaluating the Onto-CGAN model, sourced from the MIMIC-IV Clinical Database v2.2. The query script used for data extraction is listed below. Due to the restricted access policies of the MIMIC database, we are unable to publish the extracted subset of MIMIC data. However, researchers with authorized access to the MIMIC-IV database may request the experimental patient data from the corresponding author.

    SQL query in MIMIC (BigQuery):

        SELECT DISTINCT
            diag_table.subject_id,
            diag_table.icd_code,
            diag_table.hadm_id,
            patients_table.gender,
            patients_table.anchor_age,
            omr_table.result_name,
            omr_table.result_value,
            patients_table.dod
        FROM `physionet-data.mimiciv_hosp.diagnoses_icd` AS diag_table
        JOIN `physionet-data.mimiciv_hosp.patients` AS patients_table
            ON diag_table.subject_id = patients_table.subject_id
        JOIN `physionet-data.mimiciv_hosp.omr` AS omr_table
            ON diag_table.subject_id = omr_table.subject_id
        WHERE diag_table.icd_code LIKE '20%'
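
    A minimal sketch of running the query above against the PhysioNet BigQuery mirror. It assumes credentialed PhysioNet access linked to a Google Cloud account; the project name and the local filename holding the query are placeholders:

        from google.cloud import bigquery

        client = bigquery.Client(project="my-billing-project")  # placeholder project
        sql = open("onto_cgan_extract.sql").read()  # the query above, saved locally
        df = client.query(sql).to_dataframe()  # requires pandas and db-dtypes
        print(df.shape)
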

  18. Data from: MIMICEL: MIMIC-IV Event Log for Emergency Department

    • physionet.org
    Updated Jun 16, 2023
    + more versions
    Cite
    Jia Wei; Zhipeng He; Chun Ouyang; Catarina Moreira (2023). MIMICEL: MIMIC-IV Event Log for Emergency Department [Dataset]. http://doi.org/10.13026/c9yj-1t90
    Dataset updated
    Jun 16, 2023
    Authors
    Jia Wei; Zhipeng He; Chun Ouyang; Catarina Moreira
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    In this work, we extract an event log from the MIMIC-IV-ED dataset by adopting a well-established event log generation methodology, and we name this event log MIMICEL. The data tables in the MIMIC-IV-ED dataset relate to each other through the existing relational database schema, and each table records the individual activities of patients along their journey through the emergency department (ED). While the data tables capture snapshots of a patient journey in the ED, the extracted event log MIMICEL aims to capture the end-to-end process of the patient journey. This enables analysis of existing patient flows, thereby improving the efficiency of ED processes.
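
    Since MIMICEL is an event log, process-mining toolkits apply directly. A hedged sketch with pm4py, assuming the log is exported as a CSV; the filename and column names are hypothetical, so check the MIMICEL documentation for the actual schema:

        import pandas as pd
        import pm4py

        df = pd.read_csv("mimicel.csv")  # hypothetical filename
        log = pm4py.format_dataframe(df, case_id="stay_id",
                                     activity_key="activity",
                                     timestamp_key="timestamp")

        # Discover the directly-follows graph of the ED patient journey.
        dfg, start_activities, end_activities = pm4py.discover_dfg(log)
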

  19. The most salient phrases for advanced heart failure and alcohol abuse.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Sebastian Gehrmann; Franck Dernoncourt; Yeran Li; Eric T. Carlson; Joy T. Wu; Jonathan Welt; John Foote Jr.; Edward T. Moseley; David W. Grant; Patrick D. Tyler; Leo A. Celi (2023). The most salient phrases for advanced heart failure and alcohol abuse. [Dataset]. http://doi.org/10.1371/journal.pone.0192360.t003
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sebastian Gehrmann; Franck Dernoncourt; Yeran Li; Eric T. Carlson; Joy T. Wu; Jonathan Welt; John Foote Jr.; Edward T. Moseley; David W. Grant; Patrick D. Tyler; Leo A. Celi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The salient cTAKES CUIs are extracted from the filtered RF model.

  20. Data from: LLaVA-Rad MIMIC-CXR Annotations

    • physionet.org
    • paperswithcode.com
    Updated Jan 24, 2025
    + more versions
    Cite
    Juan Manuel Zambrano Chaves; Shih-Cheng Huang; Yanbo Xu; Hanwen Xu; Naoto Usuyama; Sheng Zhang; Fei Wang; Yujia Xie; Mahmoud Khademi; Ziyi Yang; Hany Awadalla; Julia Gong; Houdong Hu; Jianwei Yang; Chunyuan Li; Jianfeng Gao; Yu Gu; Cliff Wong; Mu-Hsin Wei; Tristan Naumann; Muhao Chen; Matthew Lungren; Akshay Chaudhari; Serena Yeung; Curtis Langlotz; Sheng Wang; Hoifung Poon (2025). LLaVA-Rad MIMIC-CXR Annotations [Dataset]. http://doi.org/10.13026/4ma4-k740
    Dataset updated
    Jan 24, 2025
    Authors
    Juan Manuel Zambrano Chaves; Shih-Cheng Huang; Yanbo Xu; Hanwen Xu; Naoto Usuyama; Sheng Zhang; Fei Wang; Yujia Xie; Mahmoud Khademi; Ziyi Yang; Hany Awadalla; Julia Gong; Houdong Hu; Jianwei Yang; Chunyuan Li; Jianfeng Gao; Yu Gu; Cliff Wong; Mu-Hsin Wei; Tristan Naumann; Muhao Chen; Matthew Lungren; Akshay Chaudhari; Serena Yeung; Curtis Langlotz; Sheng Wang; Hoifung Poon
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    LLaVA-Rad MIMIC-CXR features more accurate section extractions from MIMIC-CXR free-text radiology reports. Traditionally, rule-based methods were used to extract sections such as the reason for exam, findings, and impression. However, these approaches often fail due to inconsistencies in report structure and clinical language. In this work, we leverage GPT-4 to extract these sections more reliably, adding 237,073 image-text pairs to the training split and 1,952 pairs to the validation split. This enhancement afforded the development and fine-tuning of LLaVA-Rad, a multimodal large language model (LLM) tailored for radiology applications, achieving improved performance on report generation tasks.

    This resource is provided to support reproducibility and for the benefit of the research community, enabling further exploration in vision–language modeling. For more details, please refer to the accompanying paper [1].
