30 datasets found
  1. MIMIC-III - Deep Reinforcement Learning

    • kaggle.com
    zip
    Updated Apr 7, 2022
    Cite
    Asjad K (2022). MIMIC-III - Deep Reinforcement Learning [Dataset]. https://www.kaggle.com/datasets/asjad99/mimiciii
    Explore at:
    Available download formats: zip (11100065 bytes)
    Dataset updated
    Apr 7, 2022
    Authors
    Asjad K
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Digitization of healthcare data along with algorithmic breakthroughs in AI will have a major impact on healthcare delivery in the coming years. It is interesting to see the application of AI to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.

    Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL that provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we assume that we learn a policy from a fixed dataset of trajectories without further interaction with the environment (the agent doesn't receive a reward or punishment signal from the environment). It has been shown that such an approach can leverage vast amounts of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies for solving real-world decision-making problems. Offline RL algorithms, when trained on sufficiently large and diverse offline datasets, can produce close-to-optimal policies (the ability to generalize beyond training data).
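
    To make the offline setting concrete, here is a minimal sketch of tabular batch Q-learning: the policy is learned only by replaying a fixed list of logged transitions, with no calls to an environment or simulator. The toy states, actions, rewards, and hyperparameters are invented for illustration and are not taken from the MIMIC-III cohort or the author's repository. Practical offline RL methods add further safeguards that this sketch omits.

```python
from collections import defaultdict

# Hypothetical logged transitions: (state, action, reward, next_state, done).
# In offline RL this fixed dataset is all the learner ever sees.
LOGGED = [
    ("s0", "a0", 0.0, "s1", False),
    ("s0", "a1", 0.0, "s2", False),
    ("s1", "a0", 1.0, "end", True),
    ("s1", "a1", 0.0, "end", True),
    ("s2", "a0", 0.0, "end", True),
    ("s2", "a1", -1.0, "end", True),
]
ACTIONS = ["a0", "a1"]


def fit_q(transitions, gamma=0.99, lr=0.5, sweeps=200):
    """Repeatedly replay the fixed dataset; never query an environment."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            # The bootstrapped target uses only values of logged next states.
            best_next = 0.0 if done else max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += lr * (r + gamma * best_next - q[(s, a)])
    return q


def greedy_policy(q, states):
    """Pick the highest-valued action for each listed state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in states}


q_values = fit_q(LOGGED)
policy = greedy_policy(q_values, ["s0", "s1", "s2"])
print(policy)
```

    The learned policy here is simply the greedy action under the fitted Q-values; a deep RL variant would replace the table with a neural network.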

    As part of my PhD research, I investigated the problem of developing a Clinical Decision Support System for Sepsis Management using Offline Deep Reinforcement Learning.

    MIMIC-III ('Medical Information Mart for Intensive Care') is a large open-access anonymized single-center database which consists of comprehensive clinical data of 61,532 critical care admissions from 2001–2012 collected at a Boston teaching hospital. The dataset consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the sepsis-3 definition criteria.

    We try to answer the following question:

    Given a particular patient’s characteristics and physiological information at each time step as input, can our deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) to the patient at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?

    We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC), and Persistent Advantage Learning (PAL). Using these methods, we can train an RL policy to recommend the optimal treatment path for a given patient.
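
    As a rough illustration of how two of the listed algorithms differ, the sketch below computes the bootstrapped learning target for DQN and for Double DQN. The Q-value lists and the discount factor are made-up numbers, and the function names are hypothetical; this is not code from the linked repository.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-value estimates for two actions in the next state.
next_q_online = [1.0, 2.0]
next_q_target = [3.0, 0.5]

print(dqn_target(0.0, next_q_target, gamma=0.9))
print(double_dqn_target(0.0, next_q_online, next_q_target, gamma=0.9))
```

    Decoupling action selection from evaluation is what lets Double DQN reduce the overestimation that the plain max operator can introduce.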

    Data acquisition, standard pre-processing, and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH

  2. mimic-2-preprocessed

    • figshare.com
    hdf
    Updated Mar 24, 2023
    Cite
    Lukas Heumos (2023). mimic-2-preprocessed [Dataset]. http://doi.org/10.6084/m9.figshare.22331755.v1
    Explore at:
    Available download formats: hdf
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Lukas Heumos
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Preprocessed version of the MIMIC-II dataset. See https://github.com/theislab/ehrapy-datasets/tree/main/mimic_2

  3. Curated CXR report generation dataset

    • kaggle.com
    zip
    Updated Feb 13, 2023
    Cite
    FinanceKim (2023). Curated CXR report generation dataset [Dataset]. https://www.kaggle.com/datasets/financekim/curated-cxr-report-generation-dataset/data
    Explore at:
    Available download formats: zip (8569086344 bytes)
    Dataset updated
    Feb 13, 2023
    Authors
    FinanceKim
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
  4. Structure Annotations of Assessment and Plan Sections from MIMIC-III

    • data.niaid.nih.gov
    Updated Apr 17, 2022
    Cite
    Stupp, Doron; Barequet, Ronnie; Lee, I-Ching; Oren, Eyal; Feder, Amir; Benjamini, Ayelet; Hassidim, Avinatan; Matias, Yossi; Ofek, Eran; Rajkomar, Alvin (2022). Structure Annotations of Assessment and Plan Sections from MIMIC-III [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6413404
    Explore at:
    Dataset updated
    Apr 17, 2022
    Dataset provided by
    Google
    Authors
    Stupp, Doron; Barequet, Ronnie; Lee, I-Ching; Oren, Eyal; Feder, Amir; Benjamini, Ayelet; Hassidim, Avinatan; Matias, Yossi; Ofek, Eran; Rajkomar, Alvin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Physicians record their detailed thought-processes about diagnoses and treatments as unstructured text in a section of a clinical note called the "assessment and plan". This information is more clinically rich than structured billing codes assigned for an encounter but harder to reliably extract given the complexity of clinical language and documentation habits. To structure these sections we collected a dataset of annotations over assessment and plan sections from the publicly available and de-identified MIMIC-III dataset, and developed deep-learning based models to perform this task, described in the associated paper available as a pre-print at: https://www.medrxiv.org/content/10.1101/2022.04.13.22273438v1

    When using this data please cite our paper:

    @article{Stupp2022.04.13.22273438,
      author = {Stupp, Doron and Barequet, Ronnie and Lee, I-Ching and Oren, Eyal and Feder, Amir and Benjamini, Ayelet and Hassidim, Avinatan and Matias, Yossi and Ofek, Eran and Rajkomar, Alvin},
      title = {Structured Understanding of Assessment and Plans in Clinical Documentation},
      year = {2022},
      doi = {10.1101/2022.04.13.22273438},
      publisher = {Cold Spring Harbor Laboratory Press},
      URL = {https://www.medrxiv.org/content/early/2022/04/17/2022.04.13.22273438},
      journal = {medRxiv}
    }

    The dataset, presented here, contains annotations of assessment and plan sections of notes from the publicly available and de-identified MIMIC-III dataset, marking the active problems, their assessment description, and plan action items. Action items are additionally marked as one of 8 categories (listed below). The dataset contains over 30,000 annotations of 579 notes from distinct patients, annotated by 6 medical residents and students.

    The dataset is divided into 4 partitions - a training set (481 notes), validation set (50 notes), test set (48 notes) and an inter-rater set. The inter-rater set contains the annotations of each of the raters over the test set. Rater 1 in the inter-rater set should be regarded as an intra-rater comparison (details in the paper). The labels underwent automatic normalization to capture entire word boundaries and remove flanking non-alphanumeric characters.

    Code for transforming labels into TensorFlow examples and training models as described in the paper will be made available at GitHub: https://github.com/google-research/google-research/tree/master/assessment_plan_modeling

    To use these annotations, the user additionally needs to obtain the text of the notes, which is found in the NOTEEVENTS table of MIMIC-III. Access to MIMIC-III must be acquired independently (https://mimic.mit.edu/).

    Annotations are given as character spans in a CSV file with the following schema:

    Field | Type | Semantics
    partition | categorical (one of [train, val, test, interrater]) | The set of ratings the span belongs to.
    rater_id | int | Unique id for each of the raters.
    note_id | int | The note’s unique note_id, links to the MIMIC-III notes table (as ROW_ID).
    span_type | categorical (one of [PROBLEM_TITLE, PROBLEM_DESCRIPTION, ACTION_ITEM]) | Type of the span as annotated by raters.
    char_start | int | Character offsets from note start.
    char_end | int | Character offsets from note start.
    action_item_type | categorical (one of [MEDICATIONS, IMAGING, OBSERVATIONS_LABS, CONSULTS, NUTRITION, THERAPEUTIC_PROCEDURES, OTHER_DIAGNOSTIC_PROCEDURES, OTHER]) | Type of action item if the span is an action item (empty otherwise) as annotated by raters.
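
    The schema above can be read with a few lines of standard-library Python. The sketch below is only an illustration: the CSV rows and the note text are invented stand-ins (real note text must come from MIMIC-III), the helper name extract_spans is hypothetical, and it assumes char_end is exclusive, which the description does not state.

```python
import csv
import io

# Hypothetical rows following the schema above; real files come from the dataset.
ANNOTATIONS_CSV = """partition,rater_id,note_id,span_type,char_start,char_end,action_item_type
train,1,42,PROBLEM_TITLE,0,6,
train,1,42,ACTION_ITEM,8,25,MEDICATIONS
"""

# Stand-in for note text that would be looked up in MIMIC-III by note_id.
NOTES = {42: "Sepsis: start antibiotics"}


def extract_spans(csv_text, notes):
    """Return (span_type, action_item_type, text) for each annotated span."""
    spans = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        note = notes[int(row["note_id"])]
        start, end = int(row["char_start"]), int(row["char_end"])
        # Assumption: char_end is exclusive, as in Python slicing.
        item_type = row["action_item_type"] or None
        spans.append((row["span_type"], item_type, note[start:end]))
    return spans


spans = extract_spans(ANNOTATIONS_CSV, NOTES)
for span in spans:
    print(span)
```

    If char_end turns out to be inclusive in the released files, the slice would need to be note[start:end + 1] instead.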
    
  5. dexmimicgen_datasets

    • huggingface.co
    Updated Oct 31, 2024
    Cite
    MimicGen (2024). dexmimicgen_datasets [Dataset]. https://huggingface.co/datasets/MimicGen/dexmimicgen_datasets
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 31, 2024
    Dataset authored and provided by
    MimicGen
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    DexMimicGen Datasets

    This repository contains the official dataset release of simulation environments and datasets for the ICRA 2025 paper "DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning". Website: https://dexmimicgen.github.io For business inquiries, please submit this form: NVIDIA Research Licensing

  6. MIMIC-III Clinical Database(Open Access)

    • kaggle.com
    zip
    Updated Jun 2, 2025
    Cite
    Ihssane Ned (2025). MIMIC-III Clinical Database(Open Access) [Dataset]. https://www.kaggle.com/datasets/ihssanened/mimic-iii-clinical-databaseopen-access/versions/2
    Explore at:
    Available download formats: zip (838939 bytes)
    Dataset updated
    Jun 2, 2025
    Authors
    Ihssane Ned
    Description

    Dataset Source

    This dataset is a portion of the MIMIC-III Clinical Database, a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset. The full dataset is available on PhysioNet.

    Dataset Description:

    This dataset contains only 4 tables (extracted from the original dataset); more information about each table can be found at its corresponding link:

    - admissions.csv
    - d_labitems.csv
    - labevents.csv
    - patient.csv

    A visualization of this dataset is also available.
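
    As a hedged illustration of how two of these tables relate, the sketch below attaches a readable label from d_labitems.csv to each row of labevents.csv. It assumes lower-case column names (subject_id, itemid, valuenum, label) as in the MIMIC-III demo; the rows themselves are invented, and the helper name label_lab_events is hypothetical.

```python
import csv
import io

# Tiny stand-ins for labevents.csv and d_labitems.csv; values are invented.
LABEVENTS = """subject_id,itemid,valuenum
10006,50912,1.2
10006,51222,9.8
"""
D_LABITEMS = """itemid,label
50912,Creatinine
51222,Hemoglobin
"""


def label_lab_events(labevents_text, d_labitems_text):
    """Attach a human-readable label to each lab event via itemid."""
    labels = {
        row["itemid"]: row["label"]
        for row in csv.DictReader(io.StringIO(d_labitems_text))
    }
    return [
        {
            "subject_id": row["subject_id"],
            "label": labels.get(row["itemid"], "unknown"),
            "value": float(row["valuenum"]),
        }
        for row in csv.DictReader(io.StringIO(labevents_text))
    ]


events = label_lab_events(LABEVENTS, D_LABITEMS)
for event in events:
    print(event)
```

    With the real files, the two strings would be replaced by the contents of the downloaded CSVs, after checking the actual column names.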

    Future Perspectives:

    This portion of the dataset will be combined to build a comprehensive dataset of simulated medical reports.

  7. MIMIC-IV on FHIR

    • physionet.org
    Updated Feb 20, 2024
    + more versions
    Cite
    Alex Bennett; Joshua Wiedekopf; Hannes Ulrich; Philip van Damme; Alistair Johnson (2024). MIMIC-IV on FHIR [Dataset]. http://doi.org/10.13026/cqt2-0b27
    Explore at:
    Dataset updated
    Feb 20, 2024
    Authors
    Alex Bennett; Joshua Wiedekopf; Hannes Ulrich; Philip van Damme; Alistair Johnson
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Fast Healthcare Interoperability Resources (FHIR) has emerged as a robust standard for healthcare data exchange. To explore the use of FHIR for the process of data harmonization, we converted the Medical Information Mart for Intensive Care IV (MIMIC-IV) and MIMIC-IV Emergency Department (MIMIC-IV-ED) databases into FHIR. We extended base FHIR to encode information in MIMIC-IV and aimed to retain the data in FHIR with minimal additional processing, aligning to US Core v4.0.0 where possible. A total of 24 profiles were created for MIMIC-IV data, and an additional 6 profiles were created for MIMIC-IV-ED data. All MIMIC terminology was converted into code systems and value sets, as necessary. We hope MIMIC-IV in FHIR provides a useful restructuring of the data to support applications around data harmonization, interoperability, and other areas of research.

  8. 2022 Dataset of Butterfly Mimics

    • kaggle.com
    zip
    Updated Jul 30, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    KeithPinson (2022). 2022 Dataset of Butterfly Mimics [Dataset]. https://www.kaggle.com/datasets/keithpinson/butterfly-mimics-2022
    Explore at:
    Available download formats: zip (43412200 bytes)
    Dataset updated
    Jul 30, 2022
    Authors
    KeithPinson
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Image: Images of Monarch and Viceroy butterflies from tiny dataset (https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/monarch_on_daisy-600.png)

    YOYMimics-2022-dataset.pdf

    This 2022 version of the dataset consists of 1028 total images. Each image is a 224x224 pixel jpg showing a single butterfly in the wild. The images are of 6 species of common North American butterflies. Some of the butterflies are toxic and some mimic the looks of the toxic. The images are for education and research purposes only. CSV files contain the labels and no label information is contained in the folder or image names. Additionally, a “tiny” dataset is included.

    The abbreviated tiny dataset for image classification is of just 2 species with an accompanying tiny dataset document.

    Image: Images of Monarch and Viceroy butterflies from tiny dataset (https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/the-monarchs-and-viceroys.png)

    The full version of the dataset for image classification covers 6 butterfly species. It too has an accompanying dataset document.

    Image: Images of Black, Monarch, Pipevine, Spicebush, Tiger and Viceroy butterflies from the dataset (https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/the-butterflies.png)

  9. Supplementary Material for: 'An impedance pneumography signal quality index:...

    • data.niaid.nih.gov
    Updated Aug 17, 2021
    Cite
    Charlton, Peter H (2021). Supplementary Material for: 'An impedance pneumography signal quality index: design, assessment and application to respiratory rate monitoring' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3973770
    Explore at:
    Dataset updated
    Aug 17, 2021
    Dataset provided by
    King's College London
    Authors
    Charlton, Peter H
    Description

    This supplementary material accompanies:

    Charlton P.H. et al., "An impedance pneumography signal quality index for respiratory rate monitoring: design, assessment and application", [under review], 2020

    The Impedance Pneumography Signal Quality Index (SQI) dataset and accompanying scripts (in Matlab format) are provided to facilitate reproduction of the analyses using data from the MIMIC III dataset in this publication.

    Summary of Publication

    In this article we developed and assessed the performance of a signal quality index (SQI) for the impedance pneumography signal. The SQI was developed using data from the Listen dataset, and assessed using data from the Listen and MIMIC III datasets. The SQI was found to accurately classify segments of impedance pneumography signal as either high or low quality. Furthermore, when it was coupled with a high-performance RR algorithm, highly accurate and precise RRs were estimated from those segments deemed to be high quality. In this study performance was assessed in the critical care environment; further work is required to determine whether the SQI is suitable for use with wearable sensors. Both the dataset and code used to perform this study are publicly available.

    Reproducing this Publication

    The work relating to the MIMIC dataset in this publication can be reproduced as follows:

    • Reproducing the analysis: These steps can be used to quickly reproduce the analysis using the curated and annotated dataset.
      • Download the curated and annotated dataset from Zenodo using this direct download link.
      • Run the analysis using the run_imp_sqi_mimic.m script.

    • Full reproduction: These steps include downloading the raw data files, extracting data from these files, collating the dataset, manually annotating the data, and performing the analysis.
      • Use the ImP_SQI_mimic_data_importer.m script to download raw MIMIC data files from PhysioNet, and collate them into a single Matlab file.
      • Prepare the dataset for manual annotation by running the run_imp_sqi_mimic.m script.
      • Manually annotate the signals by running the run_mimic_imp_annotation.m script - the annotations are stored in separate files (the original annotation files are available here).
      • Import the manual annotations into the collated data file by re-running the ImP_SQI_mimic_data_importer.m script.
      • Run run_imp_sqi_mimic.m to perform the analysis described in the publication.

    The scripts (alongside details of how to use them) are also available in the RRest GitHub repository at: https://github.com/peterhcharlton/RRest/tree/master/RRest_v3.0/Publication_Specific_Scripts/ImP_SQI

    License: The dataset (mimic_imp_sqi_data.mat) is distributed under the terms specified in the accompanying LICENSE file. The scripts are distributed under the GNU General Public Licence (as specified towards the start of each file).

    Version 0.1.1: This is the version at the time of initial submission.

  10. MIMIC-IV-Ext Triage Instruction Corpus

    • physionet.org
    Updated Mar 4, 2025
    Cite
    Qingyang Shen; Quan Guo (2025). MIMIC-IV-Ext Triage Instruction Corpus [Dataset]. http://doi.org/10.13026/q1nc-2e47
    Explore at:
    Dataset updated
    Mar 4, 2025
    Authors
    Qingyang Shen; Quan Guo
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Emergency department (ED) overcrowding leads to delayed care, increased patient risk, and inefficient resource use. The MIMIC-IV-Ext Triage Instruction Corpus (MIETIC) addresses this by providing 9,629 structured triage cases from MIMIC-IV, aligned with the Emergency Severity Index (ESI). MIETIC supports large language model (LLM) training for AI-assisted triage, improving accuracy, consistency, and risk assessment. The dataset includes chief complaints, vital signs, demographics, and medical history, ensuring realistic triage decision-making. Developed through automated quality control and expert validation, MIETIC enhances model performance in high-risk and moderate-risk classification. Available in CSV formats, MIETIC enables research in clinical NLP, AI-driven triage, and decision-support tools. The dataset module includes:

    • Structured triage cases with ESI labels.
    • Triage case generation prompts for instruction tuning.
    • Expert-validated samples for quality control.
    • SQL scripts for data extraction and validation, hosted on GitHub.

    MIETIC provides a standardized, reproducible dataset to advance AI-driven emergency triage, optimizing accuracy, efficiency, and resource allocation.

  11. FastText embeddings for SNOMED CT concepts using MIMIC-IV notes and SNOMED...

    • zenodo.org
    bin
    Updated Feb 2, 2026
    Cite
    Javier Castell Díaz; John D. Kelleher; Jesualdo Tomás Fernández Breis; Catalina Martínez Costa (2026). FastText embeddings for SNOMED CT concepts using MIMIC-IV notes and SNOMED CT walks [Dataset]. http://doi.org/10.5281/zenodo.18220275
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 2, 2026
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Javier Castell Díaz; John D. Kelleher; Jesualdo Tomás Fernández Breis; Catalina Martínez Costa
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Embeddings for SNOMED CT concepts produced by models of FastText trained on different corpora. Each NPZ file encodes a dictionary, which links the ID of a SNOMED CT concept to its corresponding embedding.

    Files ft_mimicN_dict.npz contain the embeddings of models trained on subsets of MIMIC-IV, where N denotes the percentage of MIMIC-IV used in the training of the model, whereas ft_snomed_ct_walks_dict.npz contains the embeddings of a FastText model trained on an artificial corpus obtained by performing walks on SNOMED CT (https://doi.org/10.1016/j.jbi.2023.104297).

    These embeddings were generated and studied in the paper Assessing the Effectiveness of Embedding Methods in Capturing Clinical Information from SNOMED CT; more information can also be found in the following repository: https://github.com/JavierCastellD/AssessingSNOMEDEmbeddings.
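
    A minimal sketch of reading such a file with NumPy follows. It assumes each archive key is a SNOMED CT concept ID mapped to one vector; the released files may instead store a single pickled dictionary, in which case loading would differ. The file name, concept IDs, and two-dimensional vectors below are invented stand-ins, and the helper names are hypothetical.

```python
import os
import tempfile

import numpy as np

# Build a tiny stand-in file; the concept IDs and vectors here are invented.
path = os.path.join(tempfile.mkdtemp(), "ft_example_dict.npz")
np.savez(path, **{"22298006": np.array([1.0, 0.0]), "38341003": np.array([1.0, 1.0])})


def load_embeddings(npz_path):
    """Read every array in the archive into a {concept_id: vector} dict."""
    with np.load(npz_path) as archive:
        return {concept_id: archive[concept_id] for concept_id in archive.files}


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


embeddings = load_embeddings(path)
similarity = cosine(embeddings["22298006"], embeddings["38341003"])
print(similarity)
```

    Cosine similarity is used here only as a common way to compare embeddings; the paper's own evaluation may rely on other measures.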

  12. Hematology Complete Blood Count Dataset MIMIC-III

    • kaggle.com
    zip
    Updated Jun 7, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ashlin Darius Govindasamy (2023). Hematology Complete Blood Count Dataset MIMIC-III [Dataset]. https://www.kaggle.com/datasets/ashlingovindasamy/hematology-complete-blood-count-dataset-mimic-iii/code
    Explore at:
    Available download formats: zip (39898 bytes)
    Dataset updated
    Jun 7, 2023
    Authors
    Ashlin Darius Govindasamy
    License

    https://cdla.io/sharing-1-0/

    Description

    Introduction

    This project takes the datasets obtained from MIMIC-III.

    MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012

    This repository contains the code to derive the Hematology Complete Blood Count (CBC) Dataset from the MIMIC-III dataset.

    Our objective is to derive a dataset that can be used to predict the disease of a patient based on the CBC values.

    I have achieved an accuracy of only 0.01 using various machine learning algorithms, which is very low. I am still working on improving it, and I have released the code so that others can contribute or give me suggestions and feedback.

    This project can also help you learn how to derive datasets from MIMIC-III using Pandas.

    I have also built a Streamlit app to take Xk parameters as input and predict the y disease of the patient.

    X in ['Hemoglobin', 'Eosinophils', 'Lymphocytes', 'Monocytes', 'Basophils', 'Neutrophils', 'Red Blood Cells', 'White Blood Cells']

    y in ['Anemia', 'Leukemia', 'Thrombocytopenia', 'Thrombocytosis', 'Normal',....'Other']

    Objectives of this project:

    • Derive the Hematology Complete Blood Count (CBC) Dataset from the MIMIC-III dataset. (Done)

    • Build a machine learning model to predict the disease of a patient based on the CBC values with at least 0.8 accuracy. (Not done yet, but I have achieved 0.01 accuracy)

    • Build a Streamlit app to take Xk parameters as input and predict the y disease of the patient. (Done)

    https://github.com/adgsenpai/HematologyCBCDatasetDerivation

  13. EHRSHOT

    • redivis.com
    application/jsonl +7
    Updated Feb 13, 2025
    Cite
    Shah Lab (2025). EHRSHOT [Dataset]. http://doi.org/10.57761/0gv9-nd83
    Explore at:
    Available download formats: csv, application/jsonl, sas, parquet, stata, spss, arrow, avro
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Shah Lab
    Description

    Abstract

    👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.

    ⚡️ Quickstart
    1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab.
    2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab.

    ⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.

    Methodology

    1. 📖 Overview

    EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks. The dataset contains:

    • 6,739 patients
    • 41.6 million clinical events
    • 921,499 visits
    • 15 prediction tasks

    2. 💽 Dataset

    EHRSHOT is sourced from Stanford’s STARR-OMOP database.

    • Data follows the OMOP CDM and is fully de-identified.
    • Unlike most other EHR research datasets, EHRSHOT is not restricted to ED/ICU visits and instead includes longitudinal patient data for all hospital encounter types.
    • EHRSHOT does not contain clinical notes or images.

    We provide two versions of the dataset:

    • EHRSHOT-Original is the same exact dataset used in the original EHRSHOT paper.
    • EHRSHOT-OMOP is a more complete version of the EHRSHOT dataset which includes all OMOP CDM tables and additional OMOP metadata.

    To access the raw data, please see the "Tables" and "Files" tabs above.

    3. 💽 Data Files and Formats

    We provide EHRSHOT in two file formats:

    • OMOP CDM v5.4
    • Medical Event Data Standard (MEDS)

    Within the "Tables" tab...

    1. EHRSHOT-OMOP

    * Dataset Version: EHRSHOT-OMOP

    * Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different than the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.

    Within the "Files" tab...

    1. EHRSHOT_ASSETS.zip

    * Dataset Version: EHRSHOT-Original

    * Data Format: FEMR 0.1.16

    * Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.

    2. EHRSHOT_MEDS.zip

    * Dataset Version: EHRSHOT-Original

    * Data Format: MEDS 0.3.3

    * Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.

    3. EHRSHOT_OMOP_MEDS.zip

    * Dataset Version: EHRSHOT-OMOP

    * Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8

    * Notes: Converts the dataset from EHRSHOT-OMOP into MEDS format via the `meds_etl_omop` command from MEDS-ETL.

    4. EHRSHOT_OMOP_MEDS_Reader.zip

    * Dataset Version: EHRSHOT-OMOP

    * Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8

    * Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.

    4. 🤖 Model

    We also release the full weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base

    5. 🧑‍💻 Code

    Please see our Github repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/

    Usage

    NOTE: You must authenticate to Redivis using your formal affiliation's email address. If you use Gmail or other personal email addresses, you will not be granted access.

    Access to the EHRSHOT dataset requires the following:

    • Verified Affiliation with an Academic, Government, o
  14. PDD Graph

    • kaggle.com
    zip
    Updated Jun 15, 2017
    Cite
    xjtushilei (2017). PDD Graph [Dataset]. https://www.kaggle.com/xjtushilei/pdd-graph
    Explore at:
    Available download formats: zip (1503 bytes)
    Dataset updated
    Jun 15, 2017
    Authors
    xjtushilei
    Description

    Online

    Website

    Github

    DataHub

    SPARQL endpoint

    You can query some of the data online there. There is also the download link. Of course you can download it here.

    Context

    Electronic medical records (EMRs) contain multi-format electronic medical data that hold an abundance of medical knowledge. Faced with patients' symptoms, experienced caregivers make the right medical decisions based on professional knowledge that accurately captures the relationships between symptoms, diagnoses, and treatments. We aim to capture these relationships by constructing a large, high-quality heterogeneous graph linking patients, diseases, and drugs (PDD) in EMRs.

    Content

    Specifically, we extract important medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) and automatically link them with the existing biomedical knowledge graphs, including ICD-9 ontology and DrugBank. The PDD graph presented is accessible on the Web via the SPARQL endpoint, and provides a pathway for medical discovery and applications, such as effective treatment recommendations.

    A subgraph of PDD is illustrated in the following figure to better understand the PDD graph.

    Image: example PDD subgraph (https://github.com/wangmengsd/pdd-graph/raw/master/example.png)

    Acknowledgements

    Author

    The dataset belongs to Meng Wang, Jiaheng Zhang, Jun Liu, Wei Hu, Sen Wang, Wenqiang Liu, and Lei Shi.

    Their affiliations are: 1. MOEKLINNS Lab, Xi’an Jiaotong University, Xi’an, China 2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 3. Griffith University, Gold Coast Campus, Australia

    Some Email: - Meng Wang:wangmengsd@stu.xjtu.edu.cn - Lei Shi:xjtushilei@foxmail.com - Jun Liu:liukeen@xjtu.edu.cn

    Research

    The paper is under review and cannot yet be disclosed, so it cannot be linked here.

    Inspiration

    If you have any questions, please contact the email address above.

    If you have any suggestions, please send them to one of the email addresses above.

    License

    This work is licensed under a Creative Commons Attribution 4.0 International License.

    If your article needs to reference our work, you can reference our GitHub repository.

  15. robomimic_ph

    • tensorflow.org
    Updated Dec 11, 2024
    Cite
    (2024). robomimic_ph [Dataset]. https://www.tensorflow.org/datasets/catalog/robomimic_ph
    Explore at:
    Dataset updated
    Dec 11, 2024
    Description

    The Robomimic proficient human datasets were collected by 1 proficient operator using the RoboTurk platform (with the exception of Transport, which had 2 proficient operators working together). Each dataset consists of 200 successful trajectories.

    Each task has two versions: one with low dimensional observations (low_dim), and one with images (image).

    The datasets follow the RLDS format to represent steps and episodes.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split of the robomimic_ph dataset.
    ds = tfds.load('robomimic_ph', split='train')
    # Print the first four examples.
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.
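    As a rough illustration of the RLDS layout mentioned above, the sketch below models an episode as a record holding a sequence of steps, using plain Python dicts instead of tensorflow_datasets. The step field names follow the RLDS convention; the toy values and the helpers make_toy_episode and episode_return are hypothetical and are not part of the dataset or of any library.

```python
from typing import Any, Dict, List

Step = Dict[str, Any]
Episode = Dict[str, List[Step]]


def make_toy_episode(rewards: List[float]) -> Episode:
    """Build a toy episode whose steps carry RLDS-style fields (values are made up)."""
    n = len(rewards)
    steps = [
        {
            "observation": [float(t)],  # placeholder low-dimensional observation
            "action": [0.0],            # placeholder action
            "reward": r,
            "is_first": t == 0,         # True only on the first step
            "is_last": t == n - 1,      # True only on the last step
            "is_terminal": t == n - 1,  # assumption: the toy episode ends in a terminal state
        }
        for t, r in enumerate(rewards)
    ]
    return {"steps": steps}


def episode_return(episode: Episode) -> float:
    """Sum the rewards over all steps of one episode."""
    return sum(step["reward"] for step in episode["steps"])


episode = make_toy_episode([0.0, 0.0, 1.0])
print(len(episode["steps"]), episode_return(episode))
```

    In the real dataset the steps are typically exposed as a nested dataset inside each episode rather than a Python list, so the iteration pattern is the same but the container type differs.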

  16. 2012-2014 Dataset [5/7] for the models trained and tested in the paper 'Can...

    • zenodo.org
    zip
    Updated Aug 1, 2024
    + more versions
    Cite
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti (2024). 2012-2014 Dataset [5/7] for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12945050
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains part 5/7 of the full dataset used for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy".

    This dataset comprises 3 years (2012-2014) of normalized hourly data for both the low-resolution predictors [16 km] and the high-resolution target variables [2 km] (2-m temperature and 10-m U and V wind components). The low-resolution data are preprocessed ERA5 data, while the high-resolution data are preprocessed VHR-REA CMCC data. Details of the preprocessing are available in the paper.

    To use the data, clone the corresponding repository, unzip this zip file in the data folder, and download from Zenodo the other parts of the dataset listed in the related works.

  17. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II)

    • healthdata.gov
    • datasets.ai
    • +2more
    csv, xlsx, xml
    Updated Feb 13, 2021
    Cite
    (2021). Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) [Dataset]. https://healthdata.gov/dataset/Multiparameter-Intelligent-Monitoring-in-Intensive/e2km-2dm6
    Explore at:
    Available download formats: xlsx, csv, xml
    Dataset updated
    Feb 13, 2021
    Description

    The objective of this Bioengineering Research Partnership is to focus the resources of a powerful interdisciplinary team from academia (MIT), industry (Philips Medical Systems) and clinical medicine (Beth Israel Deaconess Medical Center) to develop and evaluate advanced ICU patient monitoring systems that will substantially improve the efficiency, accuracy and timeliness of clinical decision making in intensive care.

  18. Raw and post-processing data for using auditory models to mimic human...

    • data-staging.niaid.nih.gov
    Updated May 3, 2023
    Cite
    Osses, Alejandro; Varnet, Léo (2023). Raw and post-processing data for using auditory models to mimic human listeners in reverse correlation experiments from the fastACI toolbox [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7886231
    Explore at:
    Dataset updated
    May 3, 2023
    Authors
    Osses, Alejandro; Varnet, Léo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: The current dataset provides all the stimuli (folder ../01-Stimuli/), raw data (folder ../02-Raw-data/) and post-processed data (../03-Post-proc-data/) used in the Forum Acusticum 2023 paper titled "Using auditory models to mimic human listeners in reverse correlation experiments from the fastACI toolbox" by the same authors. In this paper, we replicated the tone-in-noise experiment by Ahumada et al. (1975), but using an artificial listener instead of collecting data from real participants. The behavioural data were mimicked using an artificial listener based on 'king2019' (King et al., 2019) as a front-end model, with a template-matching decision stage indicating whether or not a 500-Hz tone was present in each of the noisy trials. This study offers a step-by-step guide to how an artificial listener can be integrated into fastACI.
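    To make the idea of a template-matching decision concrete, here is a minimal sketch in plain Python. It is not the king2019 model or fastACI code: the sampling rate, trial length, noise level, and criterion rule are all assumptions chosen for illustration, and the tone level is far above threshold so that the decision is nearly always correct.

```python
import math
import random

FS = 8000      # sampling rate in Hz (assumed)
F_TONE = 500   # tone frequency, taken from the description above
N = 800        # samples per trial, i.e. 100 ms at FS (assumed)


def tone(amplitude: float) -> list:
    """A 500-Hz sinusoid of the given amplitude."""
    return [amplitude * math.sin(2 * math.pi * F_TONE * n / FS) for n in range(N)]


def make_trial(rng: random.Random, tone_present: bool,
               amplitude: float = 1.0, noise_sd: float = 1.0) -> list:
    """Gaussian noise, with the tone added on tone-present trials."""
    noise = [rng.gauss(0.0, noise_sd) for _ in range(N)]
    if not tone_present:
        return noise
    return [x + s for x, s in zip(noise, tone(amplitude))]


def template_decision(trial: list, template: list, criterion: float) -> bool:
    """Answer 'tone present' when the trial's correlation with the template exceeds the criterion."""
    score = sum(x * t for x, t in zip(trial, template))
    return score > criterion


rng = random.Random(0)
template = tone(1.0)
# Criterion halfway between the expected scores of noise-only and tone-plus-noise trials.
criterion = 0.5 * sum(t * t for t in template)
trials = [(present, make_trial(rng, present)) for present in [True, False] * 50]
correct = sum(template_decision(x, template, criterion) == present for present, x in trials)
print(f"proportion correct: {correct / len(trials):.2f}")
```

    A reverse correlation analysis would then relate the trial-by-trial noise to these binary responses; that step is left out here.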

    Use these data: Download all the data and place them in a local directory on your computer. If you have MATLAB and have downloaded a local copy of the fastACI toolbox (open access at: https://github.com/aosses-tue/fastACI), you can recreate the figures of our paper. After downloading and initialising the toolbox (type 'startup_fastACI;', without quotation marks, in MATLAB), run the script g20230501_FA_Artificial_listener_paper_figs.m (provided in this dataset) and follow the instructions on the screen to generate one of the four study figures. This script calls the function publ_osses2023b_FA_figs.m from the toolbox.

  19. NLI4PR

    • huggingface.co
    Updated Mar 19, 2025
    Cite
    Artur Guimarães (2025). NLI4PR [Dataset]. https://huggingface.co/datasets/araag2/NLI4PR
    Explore at:
    Dataset updated
    Mar 19, 2025
    Authors
    Artur Guimarães
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Natural Language Inference for Patient Recruitment (NLI4PR)

      Dataset Description
    

    Links

    Homepage: Github.io

    Repository: Github

    Paper: arXiv

    Contact (Original Authors): Mathilde Aguiar (mathilde.aguiar@lisn.fr)

    Contact (Curator): Artur Guimarães (artur.guimas@gmail.com)

      Dataset Summary
    

    MedQA is a large-scale multiple-choice question-answering dataset designed to mimic the style of professional medical board exams, particularly the USMLE… See the full description on the dataset page: https://huggingface.co/datasets/araag2/NLI4PR.

  20. NIH CheXmask Database: a dataset of anatomical seg

    • kaggle.com
    zip
    Updated Jul 22, 2025
    Cite
    Priyadarshi Mukhopadhyay (2025). NIH CheXmask Database: a dataset of anatomical seg [Dataset]. https://www.kaggle.com/datasets/poeticmage/chexmask-database-a-dataset-of-anatomical-segment
    Explore at:
    Available download formats: zip (944480270 bytes)
    Dataset updated
    Jul 22, 2025
    Authors
    Priyadarshi Mukhopadhyay
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CheXmask Database: a large-scale dataset of anatomical segmentation masks for chest x-ray images

    This particular dataset contains only the segmentations for the NIH dataset, ChestX-ray8.

    Nicolas Gaggion , Candelaria Mosquera , Martina Aineseder , Lucas Mansilla , Diego Milone , Enzo Ferrante

    Published: March 1, 2024. Version: 0.4

    This dataset was downloaded from PhysioNet (https://physionet.org/content/chexmask-cxr-segmentation-data/0.4/OriginalResolution/#files-panel). PhysioNet is a repository of freely available medical research data, managed by the MIT Laboratory for Computational Physiology and supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362.

    Abstract: The CheXmask Database presents a comprehensive, uniformly annotated collection of chest radiographs, constructed from five public databases: ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest and VinDr-CXR. The database aggregates 657,566 anatomical segmentation masks derived from images which have been processed using the HybridGNet model to ensure consistent, high-quality segmentation. To confirm the quality of the segmentations, we include in this database individual Reverse Classification Accuracy (RCA) scores for each of the segmentation masks. This dataset is intended to catalyze further innovation and refinement in the field of semantic chest X-ray analysis, offering a significant resource for researchers in the medical imaging domain.

    Ethics: All publicly available datasets utilized in this study adhered to strict ethical standards and underwent thorough anonymization, with identifiable details removed. The study does not release any part of the original image datasets; it only provides already anonymized image identifiers to allow researchers to match the original images with our annotations. The MIMIC-CXR-JPG dataset required additional ethics training and research courses for access. The study authors fulfilled all ethics courses and data use agreement requirements to ensure ethical data usage.

    Conflicts of Interest: The authors have no conflicts of interest to declare.


Cite
Asjad K (2022). MIMIC-III - Deep Reinforcement Learning [Dataset]. https://www.kaggle.com/datasets/asjad99/mimiciii

MIMIC-III - Deep Reinforcement Learning

Clinical Decision Support System for Sepsis Management in Emergency care

Explore at:
6 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (11100065 bytes)
Dataset updated
Apr 7, 2022
Authors
Asjad K
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Digitization of healthcare data, along with algorithmic breakthroughs in AI, will have a major impact on healthcare delivery in the coming years. It is interesting to see AI applied to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.

Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL that provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we assume that we learn a policy from a fixed dataset of trajectories without further interaction with the environment (the agent does not receive a reward or punishment signal from the environment). It has been shown that such an approach can leverage vast amounts of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies on real-world decision-making problems. Offline RL algorithms, when trained on sufficiently large and diverse offline datasets, can produce close-to-optimal policies (the ability to generalize beyond the training data).
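The offline setting described above can be sketched with a tiny tabular example: Q-values are fitted by sweeping repeatedly over a fixed list of logged transitions, and the environment is never queried. This is only an illustration under simplifying assumptions (tabular states, a hypothetical three-state chain, made-up rewards); it is not the method used for the sepsis cohort, and the function names are invented for this sketch.

```python
from collections import defaultdict


def fit_q_from_logged_data(transitions, n_actions, gamma=0.9, sweeps=200, lr=0.5):
    """Apply Q-learning updates repeatedly to a fixed list of (s, a, r, s_next, done) tuples."""
    q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            # Bootstrap from the next state unless the transition ended the episode.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += lr * (target - q[s][a])
    return q


def greedy_policy(q, state):
    """Pick the action with the highest fitted Q-value in the given state."""
    values = q[state]
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logged transitions: action 1 moves right along a chain, action 0 stays put.
logged = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
]
q = fit_q_from_logged_data(logged, n_actions=2)
print([greedy_policy(q, s) for s in (0, 1)])
```

The learned policy can only be as good as the coverage of the logged data: state-action pairs that never appear in the dataset keep their initial value of zero.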

As part of my PhD research, I investigated the problem of developing a Clinical Decision Support System for Sepsis Management using Offline Deep Reinforcement Learning.

MIMIC-III ('Medical Information Mart for Intensive Care') is a large open-access anonymized single-center database which consists of comprehensive clinical data of 61,532 critical care admissions from 2001–2012 collected at a Boston teaching hospital. The dataset consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.

We try to answer the following question:

Given a particular patient’s characteristics and physiological information at each time step as input, can our deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) to the patient at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?

We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC), and Persistent Advantage Learning (PAL). Using these methods, we can train an RL policy to recommend an optimal treatment path for a given patient.
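Two of the algorithms named above, DQN and Double DQN, differ mainly in how the bootstrap target is computed. The sketch below shows that difference on plain lists of Q-values; the numbers and function names are illustrative assumptions and are not taken from the linked repository.

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """DQN: bootstrap from the maximum target-network value in the next state."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]


# Made-up next-state Q-values for three actions.
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [3.0, 1.5, 0.2]
print(dqn_target(0.0, False, q_target_next))
print(double_dqn_target(0.0, False, q_online_next, q_target_next))
```

In this toy case the DQN target uses the largest target-network value, while the Double DQN target uses the target-network value of the action the online network prefers, which is smaller here. Decoupling selection from evaluation in this way is intended to reduce overestimation of Q-values.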

Data acquisition, standard pre-processing, and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH
