32 datasets found
  1. Mimic4Dataset

    • huggingface.co
    Updated Jul 7, 2023
    Cite
    Thouria (2023). Mimic4Dataset [Dataset]. https://huggingface.co/datasets/thbndi/Mimic4Dataset
    Dataset updated
    Jul 7, 2023
    Authors
    Thouria
    Description

    Dataset for MIMIC-IV data, configured by default for the Mortality task. Available tasks are Mortality, Length of Stay, Readmission, and Phenotype. The data is extracted from the MIMIC-IV database using this pipeline: https://github.com/healthylaife/MIMIC-IV-Data-Pipeline/tree/main. The MIMIC path should have the form "path/to/mimic4data/from/username/mimiciv/2.2". If you choose a Custom task, provide a configuration file for the time series. The dataset currently works with MIMIC-IV ICU data.
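
As a rough illustration of the path convention described above, the sketch below checks that a local path ends in `mimiciv/2.2`. The helper name `looks_like_mimic4_path` is hypothetical and is not part of the dataset's loader; the default version string is simply taken from the example path in the description.

```python
from pathlib import PurePosixPath


def looks_like_mimic4_path(path: str, version: str = "2.2") -> bool:
    """Return True if the path ends with .../mimiciv/<version>.

    Hypothetical helper for illustration only; the real loader may
    validate its input differently (or not at all).
    """
    parts = PurePosixPath(path).parts
    return len(parts) >= 2 and parts[-2] == "mimiciv" and parts[-1] == version


# Example path copied from the description, plus an incomplete one.
print(looks_like_mimic4_path("path/to/mimic4data/from/username/mimiciv/2.2"))
print(looks_like_mimic4_path("path/to/mimic4data"))
```

A check like this only confirms the shape of the path; it says nothing about whether the files under it are complete.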

  2. mimic-2-preprocessed

    • figshare.com
    hdf
    Updated Mar 24, 2023
    Cite
    Lukas Heumos (2023). mimic-2-preprocessed [Dataset]. http://doi.org/10.6084/m9.figshare.22331755.v1
    Explore at:
    Available download formats: hdf
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    figshare
    Authors
    Lukas Heumos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Preprocessed version of the MIMIC-II dataset. See https://github.com/theislab/ehrapy-datasets/tree/main/mimic_2

  3. Data from: MIMIC-IV-Ext Triage Instruction Corpus

    • physionet.org
    Updated Mar 4, 2025
    Cite
    Qingyang Shen; Quan Guo (2025). MIMIC-IV-Ext Triage Instruction Corpus [Dataset]. http://doi.org/10.13026/q1nc-2e47
    Dataset updated
    Mar 4, 2025
    Authors
    Qingyang Shen; Quan Guo
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Emergency department (ED) overcrowding leads to delayed care, increased patient risk, and inefficient resource use. The MIMIC-IV-Ext Triage Instruction Corpus (MIETIC) addresses this by providing 9,629 structured triage cases from MIMIC-IV, aligned with the Emergency Severity Index (ESI). MIETIC supports large language model (LLM) training for AI-assisted triage, improving accuracy, consistency, and risk assessment. The dataset includes chief complaints, vital signs, demographics, and medical history, ensuring realistic triage decision-making. Developed through automated quality control and expert validation, MIETIC enhances model performance in high-risk and moderate-risk classification. Available in CSV formats, MIETIC enables research in clinical NLP, AI-driven triage, and decision-support tools. The dataset module includes:

    • Structured triage cases with ESI labels.
    • Triage case generation prompts for instruction tuning.
    • Expert-validated samples for quality control.
    • SQL scripts for data extraction and validation, hosted on GitHub.

    MIETIC provides a standardized, reproducible dataset to advance AI-driven emergency triage, optimizing accuracy, efficiency, and resource allocation.

  4. MIMIC-III Clinical Database(Open Access)

    • kaggle.com
    Updated Jun 2, 2025
    Cite
    Ihssane Ned (2025). MIMIC-III Clinical Database(Open Access) [Dataset]. https://www.kaggle.com/datasets/ihssanened/mimic-iii-clinical-databaseopen-access/suggestions
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ihssane Ned
    Description

    Dataset Source

    This dataset is a portion of the MIMIC-III Clinical Database, a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset. The full dataset is available on PhysioNet.

    Dataset Description:

    This dataset contains only 4 tables, extracted from the original dataset. More information about each table can be found at its corresponding link:
    - admissions.csv
    - d_labitems.csv
    - labevents.csv
    - patient.csv

    A visualization of this dataset is also available.
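
To make the relationship between two of the listed tables concrete, here is a minimal sketch that attaches the lab item label from d_labitems.csv to each row of labevents.csv. It assumes lowercase column names (`itemid`, `label`, `subject_id`, `value`) and uses small made-up sample rows in place of the real files; check the actual CSV headers before relying on it.

```python
import csv
import io

# Tiny in-memory stand-ins for d_labitems.csv and labevents.csv.
# The rows are illustrative only and are not taken from the dataset.
D_LABITEMS = "itemid,label\n50912,Creatinine\n50971,Potassium\n"
LABEVENTS = (
    "subject_id,itemid,value\n"
    "10006,50912,1.2\n"
    "10006,50971,4.1\n"
    "10011,50912,0.9\n"
)


def label_lab_events(labitems_csv: str, labevents_csv: str) -> list[dict]:
    """Attach the human-readable label from d_labitems to each lab event."""
    labels = {
        row["itemid"]: row["label"]
        for row in csv.DictReader(io.StringIO(labitems_csv))
    }
    events = []
    for row in csv.DictReader(io.StringIO(labevents_csv)):
        # Events whose itemid is missing from the dictionary table are kept
        # and marked as UNKNOWN rather than dropped.
        events.append({**row, "label": labels.get(row["itemid"], "UNKNOWN")})
    return events


for event in label_lab_events(D_LABITEMS, LABEVENTS):
    print(event["subject_id"], event["label"], event["value"])
```

With the real files, the two strings would be replaced by the contents of the downloaded CSVs.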

    Future Perspectives:

    This portion of the dataset will be combined to build a comprehensive dataset of simulated medical reports.

  5. The evolutionary dynamics between viral mimics and host proteins- input data...

    • zenodo.org
    Updated Jul 15, 2025
    Cite
    Rotem Fuchs; Ofir Schor; Bar Naim; Dafna Tussia-Cohen; Alessandra Mozzi; Diego Forni; Sivan Friedman; Zohar Haggai; Manuela Sironi; Tzachi Hagai (2025). The evolutionary dynamics between viral mimics and host proteins- input data for github repository [Dataset]. http://doi.org/10.5281/zenodo.15880296
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rotem Fuchs; Ofir Schor; Bar Naim; Dafna Tussia-Cohen; Alessandra Mozzi; Diego Forni; Sivan Friedman; Zohar Haggai; Manuela Sironi; Tzachi Hagai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the input files necessary for running our analysis pipeline, as described in "The evolutionary dynamics between viral mimics and host proteins" by Fuchs, Schor, Naim et al. The pipeline can be found in our GitHub repository, "domain_mimicry" by HagaiLab: https://github.com/HagaiLab/domain_mimicry

  6. Structure Annotations of Assessment and Plan Sections from MIMIC-III

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Apr 17, 2022
    Cite
    Doron Stupp; Ronnie Barequet; I-Ching Lee; Eyal Oren; Amir Feder; Ayelet Benjamini; Avinatan Hassidim; Yossi Matias; Eran Ofek; Alvin Rajkomar (2022). Structure Annotations of Assessment and Plan Sections from MIMIC-III [Dataset]. http://doi.org/10.5281/zenodo.6413405
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 17, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Doron Stupp; Ronnie Barequet; I-Ching Lee; Eyal Oren; Amir Feder; Ayelet Benjamini; Avinatan Hassidim; Yossi Matias; Eran Ofek; Alvin Rajkomar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Physicians record their detailed thought-processes about diagnoses and treatments as unstructured text in a section of a clinical note called the "assessment and plan". This information is more clinically rich than structured billing codes assigned for an encounter but harder to reliably extract given the complexity of clinical language and documentation habits. To structure these sections we collected a dataset of annotations over assessment and plan sections from the publicly available and de-identified MIMIC-III dataset, and developed deep-learning based models to perform this task, described in the associated paper available as a pre-print at: https://www.medrxiv.org/content/10.1101/2022.04.13.22273438v1

    When using this data please cite our paper:

    @article {Stupp2022.04.13.22273438,
    author = {Stupp, Doron and Barequet, Ronnie and Lee, I-Ching and Oren, Eyal and Feder, Amir and Benjamini, Ayelet and Hassidim, Avinatan and Matias, Yossi and Ofek, Eran and Rajkomar, Alvin},
    title = {Structured Understanding of Assessment and Plans in Clinical Documentation},
    year = {2022},
    doi = {10.1101/2022.04.13.22273438},
    publisher = {Cold Spring Harbor Laboratory Press},
    URL = {https://www.medrxiv.org/content/early/2022/04/17/2022.04.13.22273438},
    journal = {medRxiv}
    }

    The dataset, presented here, contains annotations of assessment and plan sections of notes from the publicly available and de-identified MIMIC-III dataset, marking the active problems, their assessment description, and plan action items. Action items are additionally marked as one of 8 categories (listed below). The dataset contains over 30,000 annotations of 579 notes from distinct patients, annotated by 6 medical residents and students.

    The dataset is divided into 4 partitions - a training set (481 notes), validation set (50 notes), test set (48 notes) and an inter-rater set. The inter-rater set contains the annotations of each of the raters over the test set. Rater 1 in the inter-rater set should be regarded as an intra-rater comparison (details in the paper). The labels underwent automatic normalization to capture entire word boundaries and remove flanking non-alphanumeric characters.

    Code for transforming labels into TensorFlow examples and training models as described in the paper will be made available at GitHub: https://github.com/google-research/google-research/tree/master/assessment_plan_modeling

    To use these annotations, the user additionally needs to obtain the text of the notes, which is found in the NOTEEVENTS table of MIMIC-III; access to MIMIC-III must be acquired independently (https://mimic.mit.edu/).

    Annotations are given as character spans in a CSV file with the following schema:

    | Field | Type | Semantics |
    |---|---|---|
    | partition | categorical (one of [train, val, test, interrater]) | The set of ratings the span belongs to. |
    | rater_id | int | Unique id for each of the raters. |
    | note_id | int | The note's unique note_id; links to the MIMIC-III notes table (as ROW_ID). |
    | span_type | categorical (one of [PROBLEM_TITLE, PROBLEM_DESCRIPTION, ACTION_ITEM]) | Type of the span as annotated by raters. |
    | char_start | int | Character offset from note start. |
    | char_end | int | Character offset from note start. |
    | action_item_type | categorical (one of [MEDICATIONS, IMAGING, OBSERVATIONS_LABS, CONSULTS, NUTRITION, THERAPEUTIC_PROCEDURES, OTHER_DIAGNOSTIC_PROCEDURES, OTHER]) | Type of action item if the span is an action item (empty otherwise), as annotated by raters. |

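
The schema above can be applied to note text roughly as follows. This is a minimal sketch, not the authors' code: the note text and annotation rows are invented for illustration, and it assumes that `char_end` is an exclusive offset (Python slice semantics), which the description does not state and which should be verified against the real data.

```python
import csv
import io

# Hypothetical note text keyed by note_id; the real text must be obtained
# separately from MIMIC-III.
NOTES = {101: "# Sepsis: likely urinary source. Continue ceftriaxone."}

# Invented annotation rows that follow the column names in the schema.
ANNOTATIONS = (
    "partition,rater_id,note_id,span_type,char_start,char_end,action_item_type\n"
    "train,1,101,PROBLEM_TITLE,2,8,\n"
    "train,1,101,ACTION_ITEM,33,53,MEDICATIONS\n"
)


def extract_spans(notes: dict, annotations_csv: str) -> list[tuple]:
    """Return (span_type, span_text, action_item_type) for each annotation row."""
    spans = []
    for row in csv.DictReader(io.StringIO(annotations_csv)):
        text = notes[int(row["note_id"])]
        start, end = int(row["char_start"]), int(row["char_end"])
        # Assumption: char_end is exclusive. Use text[start:end + 1] if the
        # real offsets turn out to be inclusive.
        action_type = row["action_item_type"] or None
        spans.append((row["span_type"], text[start:end], action_type))
    return spans


for span in extract_spans(NOTES, ANNOTATIONS):
    print(span)
```

Printing a handful of extracted spans next to their labels is a quick way to confirm the offset convention before training on them.
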
  7. Curated CXR report generation dataset

    • kaggle.com
    Updated Feb 13, 2023
    Cite
    FinanceKim (2023). Curated CXR report generation dataset [Dataset]. https://www.kaggle.com/datasets/financekim/curated-cxr-report-generation-dataset
    Dataset updated
    Feb 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    FinanceKim
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
  8. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) -...

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 26, 2023
    Cite
    (2023). Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) - e2km-2dm6 - Archive Repository [Dataset]. https://healthdata.gov/dataset/Multiparameter-Intelligent-Monitoring-in-Intensive/wra9-c8dz
    Explore at:
    Available download formats: application/rssxml, xml, csv, tsv, json, application/rdfxml
    Dataset updated
    Jul 26, 2023
    Description

    This dataset tracks the updates made on the dataset "Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II)" as a repository for previous versions of the data and metadata.

  9. Data from: FastText embeddings for SNOMED CT concepts using MIMIC-IV notes...

    • zenodo.org
    • portalinvestigacion.um.es
    • +2more
    json
    Updated Feb 20, 2025
    Cite
    Javier Castell Díaz; John D. Kelleher; Jesualdo Tomás Fernández Breis; Catalina Martínez Costa (2025). FastText embeddings for SNOMED CT concepts using MIMIC-IV notes and SNOMED CT walks [Dataset]. http://doi.org/10.5281/zenodo.14899937
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Javier Castell Díaz; John D. Kelleher; Jesualdo Tomás Fernández Breis; Catalina Martínez Costa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
  10. Dataset corresponding to "Blood Pressure Morphology Assessment from...

    • zenodo.org
    zip
    Updated Mar 13, 2021
    Cite
    Nicolas Aguirre; Edith Grall-Maës; Leandro Javier Cymberknop; Ricardo Luis Armentano (2021). Dataset corresponding to "Blood Pressure Morphology Assessment from Photoplethysmogram and Demographic Information Using Deep Learning with Attention Mechanism" [Dataset]. http://doi.org/10.5281/zenodo.4598938
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 13, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicolas Aguirre; Edith Grall-Maës; Leandro Javier Cymberknop; Ricardo Luis Armentano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset containing arterial blood pressure (ABP) signals and their corresponding finger photoplethysmography (PPG) signals. This dataset is a processed version of the MIMIC-III Waveform Database Matched Subset.

    File names were inherited from MIMIC-III. Files are saved in ".mat" format, and each file contains 2 structures with raw signals and different computed characteristics. Each structure corresponds to 15-second segments sampled at 125 Hz.
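
For orientation, the segment length implied by those figures can be worked out directly. The sketch below only does the arithmetic (15 s at 125 Hz) and builds a matching time axis; the function names are illustrative and nothing here reads the actual ".mat" files.

```python
SAMPLING_RATE_HZ = 125  # from the description
SEGMENT_SECONDS = 15    # from the description


def samples_per_segment(rate_hz: int = SAMPLING_RATE_HZ,
                        seconds: int = SEGMENT_SECONDS) -> int:
    """Number of samples in one segment."""
    return rate_hz * seconds


def sample_times(rate_hz: int = SAMPLING_RATE_HZ,
                 seconds: int = SEGMENT_SECONDS) -> list[float]:
    """Time stamp (in seconds) of every sample, starting at 0."""
    return [i / rate_hz for i in range(samples_per_segment(rate_hz, seconds))]


print(samples_per_segment())  # 1875
times = sample_times()
print(times[0], times[1], times[-1])
```

If a loaded segment does not have 1875 samples per signal, that is a hint the file layout differs from what is assumed here.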

    For more details, please refer to MIMIC-III Waveform Database Matched Subset and the processing source code.

  11. Supplementary Material for: 'An impedance pneumography signal quality index:...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 17, 2021
    Cite
    Charlton, Peter H (2021). Supplementary Material for: 'An impedance pneumography signal quality index: design, assessment and application to respiratory rate monitoring' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3973770
    Dataset updated
    Aug 17, 2021
    Dataset authored and provided by
    Charlton, Peter H
    Description

    This supplementary material accompanies:

    Charlton P.H. et al., "An impedance pneumography signal quality index for respiratory rate monitoring: design, assessment and application", [under review], 2020

    The Impedance Pneumography Signal Quality Index (SQI) dataset and accompanying scripts (in Matlab format) are provided to facilitate reproduction of the analyses using data from the MIMIC III dataset in this publication.

    Summary of Publication

    In this article we developed and assessed the performance of a signal quality index (SQI) for the impedance pneumography signal. The SQI was developed using data from the Listen dataset, and assessed using data from the Listen and MIMIC III datasets. The SQI was found to accurately classify segments of impedance pneumography signal as either high or low quality. Furthermore, when it was coupled with a high performance RR algorithm, highly accurate and precise RRs were estimated from those segments deemed to be high quality. In this study performance was assessed in the critical care environment; further work is required to determine whether the SQI is suitable for use with wearable sensors. Both the dataset and code used to perform this study are publicly available.

    Reproducing this Publication

    The work relating to the MIMIC dataset in this publication can be reproduced as follows:

    • Reproducing the analysis: These steps can be used to quickly reproduce the analysis using the curated and annotated dataset.
      • Download the curated and annotated dataset from Zenodo using this direct download link.
      • Run the analysis using the run_imp_sqi_mimic.m script.

    • Full reproduction: These steps include downloading the raw data files, extracting data from these files, collating the dataset, manually annotating the data, and performing the analysis.
      • Use the ImP_SQI_mimic_data_importer.m script to download raw MIMIC data files from PhysioNet, and collate them into a single Matlab file.
      • Prepare the dataset for manual annotation by running the run_imp_sqi_mimic.m script.
      • Manually annotate the signals by running the run_mimic_imp_annotation.m script. The annotations are stored in separate files (the original annotation files are available here).
      • Import the manual annotations into the collated data file by re-running the ImP_SQI_mimic_data_importer.m script.
      • Run run_imp_sqi_mimic.m to perform the analysis described in the publication.

    The scripts (alongside details of how to use them) are also available in the RRest GitHub repository at: https://github.com/peterhcharlton/RRest/tree/master/RRest_v3.0/Publication_Specific_Scripts/ImP_SQI

    License: The dataset (mimic_imp_sqi_data.mat) is distributed under the terms specified in the accompanying LICENSE file. The scripts are distributed under the GNU General Public Licence (as specified towards the start of each file).

    Version 0.1.1: This is the version at the time of initial submission.

  12. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II)

    • healthdata.gov
    • data.virginia.gov
    • +4more
    application/rdfxml +5
    Updated Feb 13, 2021
    Cite
    (2021). Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) [Dataset]. https://healthdata.gov/dataset/Multiparameter-Intelligent-Monitoring-in-Intensive/e2km-2dm6
    Explore at:
    Available download formats: csv, application/rssxml, application/rdfxml, tsv, json, xml
    Dataset updated
    Feb 13, 2021
    Description

    The objective of this Bioengineering Research Partnership is to focus the resources of a powerful interdisciplinary team from academia (MIT), industry (Philips Medical Systems) and clinical medicine (Beth Israel Deaconess Medical Center) to develop and evaluate advanced ICU patient monitoring systems that will substantially improve the efficiency, accuracy and timeliness of clinical decision making in intensive care.

  13. Ova-sense

    • huggingface.co
    Updated Mar 23, 2025
    Cite
    Soumil (2025). Ova-sense [Dataset]. https://huggingface.co/datasets/SoumilB7/Ova-sense
    Dataset updated
    Mar 23, 2025
    Authors
    Soumil
    Description

    Components

    Datasets:
    • dataset_pre: pre-menopause biomarker levels chip reading
    • dataset_post: post-menopause biomarker levels chip reading
    • dataset_straged: biomarker levels chip reading distributed into cancer stages

    Description

    Synthetic dataset created to mimic biomarker activity on a designed chip for early-stage cancer detection. More details on GitHub.

    Repository

    https://github.com/SoumilB7/Ova-sense

    License: MIT

  14. Perceptual maps of Heliconiini butterflies: images, 3D spaces, 2D maps, and...

    • data.niaid.nih.gov
    Updated Feb 28, 2025
    Cite
    Doré, Maël (2025). Perceptual maps of Heliconiini butterflies: images, 3D spaces, 2D maps, and mimicry ring listings [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10076355
    Dataset updated
    Feb 28, 2025
    Dataset authored and provided by
    Doré, Maël
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    This repository contains images, 3D animated spaces, 2D perceptual maps with GMM, and mimicry ring lists for heliconiine butterflies complementing the analyses presented in this research paper: "Doré et al., 2025 - Perceptual maps reveal rampant convergence in butterfly wing patterns across the Neotropics. in prep.".

    Abstract

    In 1879, Fritz Müller formulated the first mathematical evolutionary model to explain mutualistic mimicry between coexisting defended prey. Yet, the degree to which local mimicry drives the structure of prey aposematic signals at continental scale remains unclear, because the perception of pattern similarity has never been assessed at large spatial scale. Here, we implement a Citizen Science survey to quantify and analyze the structure of perceived variation in the wing patterns of heliconiine butterflies (Nymphalidae: Heliconiini) throughout the entire Neotropics. Despite a continuum of perceived wing patterns at the continental scale, we show that the convergence of sympatric species into discrete mimicry rings is ubiquitous among communities. These results expand Müller’s historical predictions by supporting the rampant convergence of prey signals across an entire continent. 
    

    Contents

    This repository contains three folders:

    "3D_maps" contains the animated 3D perceptual spaces of heliconiine wing patterns for the Citizen Science dataset (N = 432) and the Local reference for the five local communities highlighted in the article.

    "Clustering" contains the 2D perceptual maps and associated lists of mimicry rings built for each of the five local communities, for different levels of clustering from GMM (K from 5 to 10).

    "Images" contains the 432 images of dorsal wing patterns of heliconiine butterflies used in the online survey (https://memometic.cleverapps.io/) designed for this study.

    How to cite

    Please cite this research article as:

    Doré, M., Pérochon, E., Aubier, T.G., Le Poul, Y., Joron, M., Elias, M., 2025. Perceptual maps reveal rampant convergence in butterfly wing patterns across the Neotropics. in prep. https://doi.org/TBA

    Associated resources

    The source codes for the analyses carried out in the study are available on GitHub. The occurrences data and distribution maps used in this study are publicly available from Zenodo: Occurrences data at https://doi.org/10.5281/zenodo.10906853; Distribution maps at https://doi.org/10.5281/zenodo.10903661.

    The online Citizen Science survey on the perception of mimicry in wing color patterns of heliconiine butterflies is temporarily available at https://memometic.cleverapps.io/. Source code for the online Citizen Science survey is accessible on GitHub.

  15. EHRSHOT

    • redivis.com
    • stanford.redivis.com
    application/jsonl +7
    Updated Feb 13, 2025
    Cite
    Shah Lab (2025). EHRSHOT [Dataset]. http://doi.org/10.57761/0gv9-nd83
    Explore at:
    Available download formats: csv, application/jsonl, sas, parquet, stata, spss, arrow, avro
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Shah Lab
    Description

    Abstract

    👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.

    ⚡️ Quickstart
    1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab.
    2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab.

    ⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.

    Methodology

    1. 📖 Overview

    EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks. The dataset contains:

    • 6,739 patients
    • 41.6 million clinical events
    • 921,499 visits
    • 15 prediction tasks
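
To give a feel for the longitudinal depth these counts imply, the following sketch derives rough per-patient averages from the headline figures. It is plain arithmetic on the numbers quoted above; since "41.6 million" is rounded, the events-per-patient figure is only approximate, and averages say nothing about how skewed the real distribution is.

```python
# Headline counts quoted in the overview.
PATIENTS = 6_739
CLINICAL_EVENTS = 41_600_000  # quoted as "41.6 million", so rounded
VISITS = 921_499


def per_patient(total: int, patients: int = PATIENTS) -> float:
    """Average count per patient."""
    return total / patients


print(f"events per patient: {per_patient(CLINICAL_EVENTS):.0f}")
print(f"visits per patient: {per_patient(VISITS):.1f}")
print(f"events per visit:   {CLINICAL_EVENTS / VISITS:.1f}")
```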


    2. 💽 Dataset

    EHRSHOT is sourced from Stanford’s STARR-OMOP database.

    • Data follows the OMOP CDM and is fully de-identified.
    • Unlike most other EHR research datasets, EHRSHOT is not restricted to ED/ICU visits and instead includes longitudinal patient data for all hospital encounter types.
    • EHRSHOT does not contain clinical notes or images.


    We provide two versions of the dataset:

    • EHRSHOT-Original is the same exact dataset used in the original EHRSHOT paper.
    • EHRSHOT-OMOP is a more complete version of the EHRSHOT dataset which includes all OMOP CDM tables and additional OMOP metadata.


    To access the raw data, please see the "Tables" and "Files" tabs above.

    3. 💽 Data Files and Formats

    We provide EHRSHOT in two file formats:

    • OMOP CDM v5.4
    • Medical Event Data Standard (MEDS)


    Within the "Tables" tab...

    1. EHRSHOT-OMOP

    * Dataset Version: EHRSHOT-OMOP

    * Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different than the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.

    Within the "Files" tab...

    1. EHRSHOT_ASSETS.zip

    * Dataset Version: EHRSHOT-Original

    * Data Format: FEMR 0.1.16

    * Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.

    2. EHRSHOT_MEDS.zip

    * Dataset Version: EHRSHOT-Original

    * Data Format: MEDS 0.3.3

    * Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.

    3. EHRSHOT_OMOP_MEDS.zip

    * Dataset Version: EHRSHOT-OMOP

    * Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8

    * Notes: Converts the dataset from EHRSHOT-OMOP into MEDS format via the `meds_etl_omop` command from MEDS-ETL.

    4. EHRSHOT_OMOP_MEDS_Reader.zip

    * Dataset Version: EHRSHOT-OMOP

    * Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8

    * Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.

    4. 🤖 Model

    We also release the full weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base

    5. 🧑‍💻 Code

    Please see our GitHub repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/

    Usage

    NOTE: You must authenticate to Redivis using your formal affiliation's email address. If you use Gmail or other personal email addresses, you will not be granted access.

    Access to the EHRSHOT dataset requires the following:

    • Verified Affiliation with an **Academic, Government, **o
  16. MedQA

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Artur Guimarães (2025). MedQA [Dataset]. https://huggingface.co/datasets/araag2/MedQA
    Dataset updated
    Jul 27, 2025
    Authors
    Artur Guimarães
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    MedQA-USMLE — A Large-scale Open Domain Question Answering Dataset from Medical Exams

      Dataset Description
    

    Links

    Homepage: Github.io

    Repository: Github

    Paper: arXiv

    Leaderboard: Papers with Code

    Contact (Original Authors): Di Jin (jindi15@mit.edu)

    Contact (Curator): Artur Guimarães (artur.guimas@gmail.com)

      Dataset Summary
    

    MedQA is a large-scale multiple-choice question-answering dataset designed to mimic the style of professional… See the full description on the dataset page: https://huggingface.co/datasets/araag2/MedQA.

  17. RaTE-NER

    • huggingface.co
    Updated Jun 21, 2024
    Cite
    Weike Zhao (2024). RaTE-NER [Dataset]. https://huggingface.co/datasets/Angelakeke/RaTE-NER
    Dataset updated
    Jun 21, 2024
    Authors
    Weike Zhao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for RaTE-NER Dataset

    GitHub | Paper

      Dataset Summary
    

    RaTE-NER dataset is a large-scale, radiological named entity recognition (NER) dataset, including 13,235 manually annotated sentences from 1,816 reports within the MIMIC-IV database, that spans 9 imaging modalities and 23 anatomical regions, ensuring comprehensive coverage. Additionally, we further enriched the dataset with 33,605 sentences from the 17,432 reports available on Radiopaedia, by… See the full description on the dataset page: https://huggingface.co/datasets/Angelakeke/RaTE-NER.

  18. Data from: The genomics and evolution of inter-sexual mimicry and...

    • zenodo.org
    application/gzip, bin +2
    Updated Sep 20, 2023
    Cite
    Beatriz Willink; Kalle Tunström; Sofie Nilén; Rayan Chikhi; Téo Lemane; Michihiko Takahashi; Yuma Takahashi; Erik I. Svensson; Chris W. Wheat (2023). Data from: The genomics and evolution of inter-sexual mimicry and female-limited polymorphisms in damselflies [Dataset]. http://doi.org/10.5281/zenodo.8304153
    Explore at:
    Available download formats: application/gzip, bin, csv, txt
    Dataset updated
    Sep 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Beatriz Willink; Kalle Tunström; Sofie Nilén; Rayan Chikhi; Téo Lemane; Michihiko Takahashi; Yuma Takahashi; Erik I. Svensson; Chris W. Wheat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains intermediate output files required to reproduce the figures in the main text and Supporting Material of Willink et al. 2023. The genomics and evolution of inter-sexual mimicry and female-limited polymorphisms in damselflies.

    FILE OVERVIEW:

    1. Morph-specific assemblies
    A. File names: Afem_1354_ragtag.fasta.gz, Ifem_1049_ragtag.fa.gz, Ofem_0081_ragtag.fa.gz, O054_Shasta_run2.PMDV.HAP1.purged.fasta.gz, A059_Shasta_run1.PMDV.HAP1.purged.fa.gz
    B. Description: genome assemblies for different morphs of Ischnura elegans (Afem_1354, Ifem_1049, and Ofem_0081) and Ischnura senegalensis (A059 and O054), generated in this study from long-read Nanopore data using Shasta v 0.7.0 (https://github.com/paoloshasta/shasta).

    2. Assembly statistics
    A. File names: Assembly_statistics.csv, Assembly_statistics_sen.csv
    B. Description: Completeness and quality metrics for de novo genome assemblies of I. elegans and I. senegalensis female morphs. See Fig. S1-S2.

    3. Repetitive content annotation
    A. File names: A1354_ragtag_RED.bed.repeats.bed.gz, Afem_Shasta1_polished_ragtag_UPPER.fa.out.gz, Ifem_Shasta2_polished_ragtag_UPPER.fa.out.gz, ioIscEleg1.1.primary_UPPER.fa.out.gz, ToL_RED.repeats.bed.gz
    B. Description: Annotation of repetitive sequences in morph-specific assemblies. All morph assemblies (A, I and Darwin Tree of Life assemblies) were annotated using RepeatModeler v 2.0.1 and RepeatMasker v 1.0.93 (http://www.repeatmasker.org). The A morph and DToL assemblies were additionally annotated using Red v 0.0.1 (https://github.com/BioinformaticsToolsmith/Red). RepeatMasker annotations were then used to estimate TE coverage. See Extended Data Fig. 4 and Fig. S7.

    4. GWAS output
    A. File names: A1354_ragtag_AvI.assoc_filtered.txt.gz, A1354_ragtag_AvO.assoc_filtered.txt.gz, A1354_ragtag_IvO.assoc_filtered.txt.gz, ToL_AvI.assoc_filtered.txt.gz, ToL_AvO.assoc_filtered.txt.gz, ToL_IvO.assoc_filtered.txt.gz
    B. Description: filtered SNPs in pairwise association tests between morphs (n = 19 resequencing samples per morph) of I. elegans. Analyses were conducted in PLINK v 1.9 (http://pngu.mgh.harvard.edu/purcell/plink/), using either the A morph assembly (Fig. 2a-b), or the Darwin Tree of Life (DToL) reference assembly (Extended Data Figure 8a-b) as mapping reference.

    5. Population statistics
    A. File names: Afem_pixy_30K_fst.txt.gz, A1354_30kb.Tajima.D.gz, Afem_pi_30K_pi.txt.gz, ToL_30K_fst.txt.gz, ToL_30kb.Tajima.D.gz, ToL_30K_onepop_pi.txt.gz
    B. Description: Genetic differentiation (fst) between morphs, Tajima's D statistics, and nucleotide diversity across 30 kb windows of the I. elegans genome. Population statistics were computed using either the A morph assembly (Fig. 2c-e), or the DToL reference assembly (Extended Data Figure 8c-e) as mapping reference.

    6. k-mer based GWAS
    A. File names: AvI_kmers.fa.gz, AvO_kmers.fa.gz, OvAI_kmers.fa.gz, AvI_kmers.fa_v_A1354_Shasta_run1_table.tsv.gz, AvO_kmers.fa_v_A1354_Shasta_run1_table.tsv.gz, OvAI_kmers.fa_v_A1354_Shasta_run1_table.tsv.gz, OvAI_kmers.fa_v_Ifem_1049_ragtag_table.tsv.gz
    B. Description: List of significant k-mers (in fasta format) in three k-mer based association analyses (n = 19 resequencing samples per morph) between morphs of I. elegans. Significant k-mers were then mapped to morph-specific assemblies using Blast v 2.22.28 (https://blast.ncbi.nlm.nih.gov/Blast.cgi) for short sequences. We include mapping results shown in Fig. 3a-b.

    7. Read-depth coverage
    A. File names: reseq_coverage_norepeat_500_window.bed.gz, nano_coverage_norepeat_500_window.bed.gz, Ifem_nano_coverage_norepeat_500_window.bed.gz, Ifem_reseq_coverage_norepeat_500_window_15Mb.bed.gz, poolseq_coverage_norepeat_500_window.bed.gz, morph_coverage_norepeat_diff_500.tsv.gz, SwD_popmap
    B. Description: Read depth coverage of the morph locus and a 15 Mb region used to estimate baseline read depths. Nineteen Illumina resequencing samples and one long-read Nanopore sample of each morph of I. elegans were mapped to both the A and I assemblies to estimate read depth. Two poolseq samples (each pool consisting of 30 females of each morph) of I. senegalensis were mapped to the A assembly of I. elegans to estimate read depth. Read depth was estimated in mosdepth v 0.2.8 (https://github.com/brentp/mosdepth) across 500 bp windows after filtering out windows with more than 10% repetitive content. For poolseq samples, the difference in coverage values between the A and O pools was computed across the entire genome. Sample information for resequencing samples is recorded in the file SwD_popmap. See Fig. 3c-d, 5b, and S8.

    8. Assembly alignment
    A. File names: nucmer_aln_Ifem_1049_ragtag_Afem_1354_ragtag.qr1_filter.reformat.coords.gz, nucmer_aln_Ofem_0081_ragtag_Afem_1354_ragtag.qr1_filter.reformat.coords.gz, nucmer_aln_Afem_Isen_Afem_Iele.qr1_filter.reformat.coords.gz, nucmer_aln_Ofem_Isen_Afem_Iele.qr1_filter.reformat.coords.gz, karyotype_AI_RagTag.csv, karyotype_AO_RagTag.csv, karyotype_AIsen_AIele.cs, karyotype_OIsen_AIele.csv
    B. Description: Assembly alignments using nucmer v 4.0.0 (https://github.com/mummer4/mummer) and contig synteny for plotting using RIdeogram v 0.2.2 (https://cran.r-project.org/web/packages/RIdeogram/vignettes/RIdeogram.html) in R v 4.2.2 (https://www.r-project.org/). The A morph assembly of I. elegans was aligned to the I and O morph assemblies of I. elegans and to the A and O-like assemblies of I. senegalensis. See Fig. 4a, 5c.

    9. Genotyping the Darwin Tree of Life assembly
    A. File names: nucmer_aln_Afem_ragtag_ToL-haplotigs.qr1_filter.reformat.coords.gz, nucmer_aln_Afem_ragtag_ToL-primary.qr1_filter.reformat.coords.gz, ToL_500_norepeat.regions.bed.gz, karyotype_AToL_13_unloc_RagTag.csv, karyotype_AToL_RagTag_haplotigs.csv
    B. Description: To genotype the DToL reference assembly of I. elegans, we estimated read-depth coverage of the DToL long-read Pacbio data mapped to the A morph assembly of I. elegans generated in this study, and aligned the A morph assembly to both the primary DToL assembly and to the purged haplotigs. Read depth was estimated in mosdepth v 0.2.8 (https://github.com/brentp/mosdepth) and assembly alignments were conducted using nucmer v 4.0.0 (https://github.com/mummer4/mummer). See Fig. S3.

    10. SV calling
    A. File names: A_to_A.bam, A_to_A.bam.bai, A_to_I.bam, A_to_I.bam.bai, A_to_O.bam, A_to_O.bam.bai, A_to_ToL_2mb.bam, A_to_ToL_2mb.bam.bai, I_to_A.bam, I_to_A.bam.bai, I_to_I.bam, I_to_I.bam.bai, I_to_O.bam, I_to_O.bam.bai, I_to_ToL_2mb.bam, I_to_ToL_2mb.bam.bai, O_to_A.bam, O_to_A.bam.bai, O_to_I.bam, O_to_I.bam.bai, O_to_O.bam, O_to_O.bam.bai, O_to_ToL_2mb.bam, O_to_ToL_2mb.bam.bai
    B. Description: Merged alignments of resequencing samples (n = 19 per morph) to alternative reference assemblies (A, I, O, and DToL) for I. elegans. The alignments have been filtered by quality and to contain only the unlocalized scaffold 2 of chromosome 13, which includes the morph locus. These files were used to call morph-specific structural variants using samplot v 1.3.0 (https://github.com/ryanlayer/samplot). See Extended Data Figs 2, 7, and Fig. S5-S6.

    11. Mapping of inversion breakpoint reads
    A. File names: AvO_3K.tsv.gz, AvO_22K.tsv.gz, AvO_sen_3K.tsv.gz, AvO_sen_22K.tsv.gz, IvO_3K.tsv.gz
    B. Description: Signatures of an inversion with breakpoints at ~ 3 kb and ~ 22 kb of the unlocalized scaffold 2 of chromosome 13 on the O assembly were found in A and I resequencing samples of I. elegans and in poolseq samples of A females of I. senegalensis. We queried the reads mapping to the inversion breakpoints and then tabulated their mapping locations on the A morph assembly of I. elegans (Fig. 6 and Extended Data Fig. 3, 7b-c). For the first inversion breakpoint, we also mapped reads to the I morph assembly of I. elegans (Fig. S12).

    12. Evidence of translocation in I
    A. File names: Ifem_nano_SUPER_13_unloc_2.bam, Ifem_nano_SUPER_13_unloc_2.bam.bai
    B. Description: Long-read Nanopore data of an I morph female of I. elegans mapped to the A morph assembly of I. elegans and filtered to contain the entire unlocalized scaffold 2 of chromosome 13. Read mapping was conducted in minimap2 v 2.22-r1110 (https://github.com/lh3/minimap2) and used to identify a translocation signature in the I morph, relative to the A morph of I. elegans. See Extended Data Fig. 6.

    13. PCA output
    A. File names: A1354_all.eigenval, A1354_all.eigenvec, I1049_all.eigenval, I1049_all.eigenvec
    B. Description: Eigenvectors and eigenvalues of PCA analyses of population structure between morphs of I. elegans. PCA analyses were conducted on the morph locus in PLINK v 1.9 (http://pngu.mgh.harvard.edu/purcell/plink/), using either the A morph or the I morph assembly as mapping reference. See Fig. S4.

    14. Linkage disequilibrium
    A. File names: A1354_SUPER_1_allr.ld.gz, A1354_SUPER_2_allr.ld.gz, A1354_SUPER_3_allr.ld.gz, A1354_SUPER_4_allr.ld.gz, A1354_SUPER_5_allr.ld.gz, A1354_SUPER_6_allr.ld.gz, A1354_SUPER_7_allr.ld.gz, A1354_SUPER_8_allr.ld.gz, A1354_SUPER_9_allr.ld.gz, A1354_SUPER_10_allr.ld.gz, A1354_SUPER_11_allr.ld.gz, A1354_SUPER_12_allr.ld.gz, A1354_SUPER_13_allr.ld.gz, A1354_SUPER_13_unloc_1_allr.ld.gz, A1354_SUPER_13_unloc_2_allr.ld.gz, A1354_SUPER_13_unloc_3_allr.ld.gz, A1354_SUPER_13_unloc_4_allr.ld.gz, A1354_SUPER_X_allr.ld.gz
    B. Description: Estimates of recombination rate (R2) between SNPs across the first 15 Mb of each chromosome and unlocalized segments of chromosome 13 of

  19. h

    Emotional_Interpretability

    • huggingface.co
    Updated Apr 25, 2025
    Cite
    Soumil (2025). Emotional_Interpretability [Dataset]. https://huggingface.co/datasets/SoumilB7/Emotional_Interpretability
    Explore at:
    Dataset updated
    Apr 25, 2025
    Authors
    Soumil
    Description

    Components

    Dataset : Emotional_perspectives : Response to a given context under 27 emotional lenses

      Description
    

    Synthetic dataset created to mimic emotional responses, primarily made for alignment and interpretability research. More details will be listed on GitHub soon.

      license: mit
    
  20. Z

    Raw and post-processing data for using auditory models to mimic human...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 3, 2023
    Cite
    Osses, Alejandro (2023). Raw and post-processing data for using auditory models to mimic human listeners in reverse correlation experiments from the fastACI toolbox [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7886231
    Explore at:
    Dataset updated
    May 3, 2023
    Dataset provided by
    Osses, Alejandro
    Varnet, Léo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: This dataset provides all the stimuli (folder ../01-Stimuli/), raw data (folder ../02-Raw-data/) and post-processed data (folder ../03-Post-proc-data/) used in the Forum Acusticum 2023 paper titled "Using auditory models to mimic human listeners in reverse correlation experiments from the fastACI toolbox" by the same authors. In this paper, we replicated the tone-in-noise experiment by Ahumada et al. (1975), using an artificial listener instead of collecting data from real participants. The artificial listener used 'king2019' (King et al., 2019) as a front-end model, followed by a template-matching decision stage that indicated whether or not a 500-Hz tone was present in each of the noisy trials. This study offers a step-by-step guide to integrating an artificial listener into fastACI.

    Use these data: Download all these data and place them in a local directory on your computer. If you have MATLAB and a local copy of the fastACI toolbox (open access at: https://github.com/aosses-tue/fastACI), you can recreate the figures of our paper. After downloading and initialising the toolbox (type 'startup_fastACI;' in MATLAB, without the quotation marks), run the script g20230501_FA_Artificial_listener_paper_figs.m (provided in this dataset) and follow the on-screen instructions to generate one of the four study figures. This script calls the function publ_osses2023b_FA_figs.m from the toolbox.
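
    The template-matching decision mentioned in the description can be illustrated with a minimal Python sketch. This is not the fastACI or 'king2019' implementation: it skips the auditory front-end entirely and operates on raw waveforms. The function names, the 8-kHz sampling rate, the 0.1-s duration, the unit-variance Gaussian noise, and the criterion value are all assumptions chosen for illustration.

```python
import math
import random


def make_tone(freq_hz, duration_s, fs_hz, amplitude=1.0):
    """Return a sine tone as a list of samples (hypothetical helper)."""
    n = int(duration_s * fs_hz)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(n)]


def template_match(trial, template):
    """Project the trial onto the template, normalised by the template's norm."""
    dot = sum(x * t for x, t in zip(trial, template))
    norm = math.sqrt(sum(t * t for t in template))
    return dot / norm


def decide_tone_present(trial, template, criterion):
    """Answer 'tone present' when the match statistic exceeds the criterion."""
    return template_match(trial, template) > criterion


# Illustrative trial pair: the same noise with and without a 500-Hz tone.
rng = random.Random(0)  # fixed seed so the example is reproducible
fs = 8000               # assumed sampling rate in Hz
template = make_tone(500, 0.1, fs)
noise = [rng.gauss(0, 1) for _ in template]
tone_trial = [n + s for n, s in zip(noise, template)]

print(decide_tone_present(tone_trial, template, criterion=10.0))
print(decide_tone_present(noise, template, criterion=10.0))
```

    With these assumed settings the template has a norm of 20, so the statistic for a tone-plus-noise trial is centred near 20 and for a noise-only trial near 0. The criterion of 10 sits halfway between the two.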

Thouria (2023). Mimic4Dataset [Dataset]. https://huggingface.co/datasets/thbndi/Mimic4Dataset

Mimic4Dataset

thbndi/Mimic4Dataset

Explore at:
5 scholarly articles cite this dataset (View in Google Scholar)
