License: https://creativecommons.org/publicdomain/zero/1.0/
Digitization of healthcare data, along with algorithmic breakthroughs in AI, will have a major impact on healthcare delivery in the coming years. It is interesting to see AI applied to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.
Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL that provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we assume that a policy is learned from a fixed dataset of trajectories without further interaction with the environment (the agent receives no reward or punishment signal from the environment). It has been shown that such an approach can leverage the vast amount of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies for solving real-world decision-making problems. Offline RL algorithms, when trained on sufficiently large and diverse offline datasets, can produce close-to-optimal policies that generalize beyond the training data.
As part of my PhD research, I investigated the problem of developing a clinical decision support system for sepsis management using offline deep reinforcement learning.
MIMIC-III ('Medical Information Mart for Intensive Care') is a large, open-access, anonymized, single-center database consisting of comprehensive clinical data for 61,532 critical care admissions from 2001–2012, collected at a Boston teaching hospital. The dataset consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.
We try to answer the following question:
Given a particular patient’s characteristics and physiological information at each time step as input, can our deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?
We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC), and Persistent Advantage Learning (PAL). Using these methods, we can train an RL policy to recommend an optimal treatment path for a given patient.
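To make the offline setting concrete, here is a minimal, purely illustrative sketch of batch Q-learning on a synthetic fixed dataset of transitions. This is not the pipeline from the repo: the state/action sizes, rewards, and hyperparameters are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a logged dataset of transitions
# (state, action, reward, next_state, done): 5 discrete states, 2 actions.
N_STATES, N_ACTIONS = 5, 2
batch = [(int(rng.integers(N_STATES)), int(rng.integers(N_ACTIONS)),
          float(rng.random()), int(rng.integers(N_STATES)), False)
         for _ in range(500)]

# Offline Q-learning: sweep the fixed batch repeatedly; the agent never
# interacts with the environment while learning.
Q = np.zeros((N_STATES, N_ACTIONS))
gamma, alpha = 0.99, 0.1
for _ in range(50):
    for s, a, r, s2, done in batch:
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])

# Greedy policy: the recommended action (treatment) for each state
policy = Q.argmax(axis=1)
print(policy)
```

DQN, DDQN, MMC, and PAL replace the tabular Q array with a neural network and modified targets, but the batch-sweep structure is the same.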
Data acquisition, standard pre-processing, and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Preprocessed version of the MIMIC-II dataset. See https://github.com/theislab/ehrapy-datasets/tree/main/mimic_2
License: https://creativecommons.org/publicdomain/zero/1.0/
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Physicians record their detailed thought-processes about diagnoses and treatments as unstructured text in a section of a clinical note called the "assessment and plan". This information is more clinically rich than structured billing codes assigned for an encounter but harder to reliably extract given the complexity of clinical language and documentation habits. To structure these sections we collected a dataset of annotations over assessment and plan sections from the publicly available and de-identified MIMIC-III dataset, and developed deep-learning based models to perform this task, described in the associated paper available as a pre-print at: https://www.medrxiv.org/content/10.1101/2022.04.13.22273438v1
When using this data please cite our paper:
@article {Stupp2022.04.13.22273438, author = {Stupp, Doron and Barequet, Ronnie and Lee, I-Ching and Oren, Eyal and Feder, Amir and Benjamini, Ayelet and Hassidim, Avinatan and Matias, Yossi and Ofek, Eran and Rajkomar, Alvin}, title = {Structured Understanding of Assessment and Plans in Clinical Documentation}, year = {2022}, doi = {10.1101/2022.04.13.22273438}, publisher = {Cold Spring Harbor Laboratory Press}, URL = {https://www.medrxiv.org/content/early/2022/04/17/2022.04.13.22273438}, journal = {medRxiv} }
The dataset presented here contains annotations of assessment and plan sections of notes from the publicly available and de-identified MIMIC-III dataset, marking the active problems, their assessment description, and plan action items. Action items are additionally marked as one of 8 categories (listed below). The dataset contains over 30,000 annotations of 579 notes from distinct patients, annotated by 6 medical residents and students.
The dataset is divided into 4 partitions - a training set (481 notes), validation set (50 notes), test set (48 notes) and an inter-rater set. The inter-rater set contains the annotations of each of the raters over the test set. Rater 1 in the inter-rater set should be regarded as an intra-rater comparison (details in the paper). The labels underwent automatic normalization to capture entire word boundaries and remove flanking non-alphanumeric characters.
Code for transforming labels into TensorFlow examples and training models as described in the paper will be made available at GitHub: https://github.com/google-research/google-research/tree/master/assessment_plan_modeling
In order to use these annotations, the user additionally needs to obtain the text of the notes, which is found in the NOTEEVENTS table from MIMIC-III; access must be acquired independently (https://mimic.mit.edu/).
Annotations are given as character spans in a CSV file with the following schema (field: type. Semantics):

- partition: categorical, one of [train, val, test, interrater]. The set of ratings the span belongs to.
- rater_id: int. Unique ID for each of the raters.
- note_id: int. The note’s unique note_id, which links to the MIMIC-III notes table (as ROW_ID).
- span_type: categorical, one of [PROBLEM_TITLE, PROBLEM_DESCRIPTION, ACTION_ITEM]. Type of the span as annotated by raters.
- char_start: int. Character offset of the span start from the note start.
- char_end: int. Character offset of the span end from the note start.
- action_item_type: categorical, one of [MEDICATIONS, IMAGING, OBSERVATIONS_LABS, CONSULTS, NUTRITION, THERAPEUTIC_PROCEDURES, OTHER_DIAGNOSTIC_PROCEDURES, OTHER]. Type of action item if the span is an action item (empty otherwise), as annotated by raters.
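As a sketch of how spans in this schema can be joined back to note text, here is an illustrative example over a made-up note. With the real release, you would read the annotations CSV and join note_id against the NOTEEVENTS text from MIMIC-III (obtained separately); we assume here that char_end is an exclusive offset.

```python
import csv
import io

# Two made-up rows in the schema above, plus a stand-in note body.
sample = io.StringIO(
    "partition,rater_id,note_id,span_type,char_start,char_end,action_item_type\n"
    "train,1,42,PROBLEM_TITLE,0,6,\n"
    "train,1,42,ACTION_ITEM,8,17,MEDICATIONS\n"
)
note_text = {42: "Sepsis: start abx today"}  # stand-in for NOTEEVENTS text

# Slice each annotated span out of its note by character offsets.
spans = []
for row in csv.DictReader(sample):
    text = note_text[int(row["note_id"])]
    spans.append((row["span_type"], text[int(row["char_start"]):int(row["char_end"])]))

print(spans)
```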
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
DexMimicGen Datasets
This repository contains the official dataset release of simulation environments and datasets for the ICRA 2025 paper "DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning". Website: https://dexmimicgen.github.io. For business inquiries, please submit this form: NVIDIA Research Licensing.
This dataset is a portion of the MIMIC-III Clinical Database, a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset. The full dataset is available on PhysioNet via this link.
This dataset contains only 4 tables (extracted from the original dataset); more information about each table can be found via its corresponding link:
- admissions.csv
- d_labitems.csv
- labevents.csv
- patient.csv
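As an illustration of how these tables link together, here is a sketch of joining lab events to their item definitions. Tiny inline CSVs stand in for the downloaded files, and the column names follow the usual MIMIC-III convention (check the demo files for the exact headers).

```python
import csv
import io

# Stand-ins for d_labitems.csv and labevents.csv.
d_labitems = io.StringIO("ITEMID,LABEL\n50912,Creatinine\n51222,Hemoglobin\n")
labevents = io.StringIO(
    "SUBJECT_ID,ITEMID,VALUENUM\n10006,50912,1.1\n10006,51222,9.8\n"
)

# Join labevents to d_labitems on ITEMID to label each lab measurement.
labels = {r["ITEMID"]: r["LABEL"] for r in csv.DictReader(d_labitems)}
rows = [dict(r, LABEL=labels[r["ITEMID"]]) for r in csv.DictReader(labevents)]

for r in rows:
    print(r["SUBJECT_ID"], r["LABEL"], r["VALUENUM"])
```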
A nice visualization of this dataset can be found here.
This portion of the dataset will be combined to build a comprehensive dataset of simulated medical reports.
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Fast Healthcare Interoperability Resources (FHIR) has emerged as a robust standard for healthcare data exchange. To explore the use of FHIR for the process of data harmonization, we converted the Medical Information Mart for Intensive Care IV (MIMIC-IV) and MIMIC-IV Emergency Department (MIMIC-IV-ED) databases into FHIR. We extended base FHIR to encode information in MIMIC-IV and aimed to retain the data in FHIR with minimal additional processing, aligning to US Core v4.0.0 where possible. A total of 24 profiles were created for MIMIC-IV data, and an additional 6 profiles were created for MIMIC-IV-ED data. All MIMIC terminology was converted into code systems and value sets, as necessary. We hope MIMIC-IV in FHIR provides a useful restructuring of the data to support applications around data harmonization, interoperability, and other areas of research.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
<img src="https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/monarch_on_daisy-600.png" alt="Images of Monarch and Viceroy butterflies from tiny dataset">
This 2022 version of the dataset consists of 1028 total images. Each image is a 224x224 pixel jpg showing a single butterfly in the wild. The images are of 6 species of common North American butterflies. Some of the butterflies are toxic and some mimic the look of the toxic ones. The images are for education and research purposes only. CSV files contain the labels and no label information is contained in the folder or image names. Additionally, a “tiny” dataset is included.
The abbreviated tiny dataset for image classification is of just 2 species, with an accompanying tiny dataset document.
<img src="https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/the-monarchs-and-viceroys.png" alt="Images of Monarch and Viceroy butterflies from tiny dataset">
The full version of the dataset for image classification covers 6 butterfly species. It too has an accompanying dataset document.
<img src="https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/the-butterflies.png" alt="Images of Black, Monarch, Pipevine, Spicebush, Tiger and Viceroy butterflies from the dataset">
This supplementary material accompanies:
Charlton P.H. et al., "An impedance pneumography signal quality index for respiratory rate monitoring: design, assessment and application", [under review], 2020
The Impedance Pneumography Signal Quality Index (SQI) dataset and accompanying scripts (in Matlab format) are provided to facilitate reproduction of the analyses using data from the MIMIC III dataset in this publication.
Summary of Publication
In this article we developed and assessed the performance of a signal quality index (SQI) for the impedance pneumography signal. The SQI was developed using data from the Listen dataset, and assessed using data from the Listen and MIMIC III datasets. The SQI was found to accurately classify segments of impedance pneumography signal as either high or low quality. Furthermore, when it was coupled with a high-performance RR algorithm, highly accurate and precise RRs were estimated from those segments deemed to be high quality. In this study performance was assessed in the critical care environment - further work is required to determine whether the SQI is suitable for use with wearable sensors. Both the dataset and code used to perform this study are publicly available.
Reproducing this Publication
The work relating to the MIMIC dataset in this publication can be reproduced as follows:
Reproducing the analysis: these steps can be used to quickly reproduce the analysis using the curated and annotated dataset.
Download the curated and annotated dataset from Zenodo using this direct download link.
Run the analysis using the run_imp_sqi_mimic.m script.
Use the ImP_SQI_mimic_data_importer.m script to download raw MIMIC data files from PhysioNet, and collate them into a single Matlab file.
Prepare the dataset for manual annotation by running the run_imp_sqi_mimic.m script.
Manually annotate the signals by running the run_mimic_imp_annotation.m script - the annotations are stored in separate files (the original annotation files are available here).
Import the manual annotations into the collated data file by re-running the ImP_SQI_mimic_data_importer.m script.
Run run_imp_sqi_mimic.m to perform the analysis described in the publication.
The scripts (alongside details of how to use them) are also available in the RRest GitHub repository at: https://github.com/peterhcharlton/RRest/tree/master/RRest_v3.0/Publication_Specific_Scripts/ImP_SQI
License: The dataset (mimic_imp_sqi_data.mat) is distributed under the terms specified in the accompanying LICENSE file. The scripts are distributed under the GNU General Public Licence (as specified towards the start of each file).
Version 0.1.1: This is the version at the time of initial submission.
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Emergency department (ED) overcrowding leads to delayed care, increased patient risk, and inefficient resource use. The MIMIC-IV-Ext Triage Instruction Corpus (MIETIC) addresses this by providing 9,629 structured triage cases from MIMIC-IV, aligned with the Emergency Severity Index (ESI). MIETIC supports large language model (LLM) training for AI-assisted triage, improving accuracy, consistency, and risk assessment. The dataset includes chief complaints, vital signs, demographics, and medical history, ensuring realistic triage decision-making. Developed through automated quality control and expert validation, MIETIC enhances model performance in high-risk and moderate-risk classification. Available in CSV formats, MIETIC enables research in clinical NLP, AI-driven triage, and decision-support tools. The dataset module includes:
Structured triage cases with ESI labels. Triage case generation prompts for instruction tuning. Expert-validated samples for quality control. SQL scripts for data extraction and validation, hosted on GitHub.
MIETIC provides a standardized, reproducible dataset to advance AI-driven emergency triage, optimizing accuracy, efficiency, and resource allocation.
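To illustrate what an instruction-tuning record built from a structured triage case might look like, here is a hypothetical example pairing chief complaint, vitals, and demographics with an ESI label. The field names and the instruction/input/output layout are illustrative only, not the dataset's actual schema.

```python
import json

# Hypothetical structured triage case; values are invented.
case = {
    "age": 67, "sex": "F",
    "chief_complaint": "chest pain radiating to left arm",
    "vitals": {"hr": 112, "sbp": 92, "spo2": 94, "temp_c": 37.1},
}

# Instruction-tuning record: prompt + serialized case + target ESI level.
record = {
    "instruction": "Assign an Emergency Severity Index (ESI) level (1-5).",
    "input": json.dumps(case),
    "output": "ESI 2",  # illustrative label for a high-risk presentation
}
print(record["output"])
```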
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Embeddings for SNOMED CT concepts produced by models of FastText trained on different corpora. Each NPZ file encodes a dictionary, which links the ID of a SNOMED CT concept to its corresponding embedding.
Files ft_mimicN_dict.npz contain the embeddings of models trained on subsets of MIMIC-IV, where N denotes the percentage of MIMIC-IV used in training the model, whereas ft_snomed_ct_walks_dict.npz contains the embeddings of a FastText model trained on an artificial corpus obtained by performing walks on SNOMED CT (https://doi.org/10.1016/j.jbi.2023.104297).
These embeddings were generated and studied in the paper Assessing the Effectiveness of Embedding Methods in Capturing Clinical Information from SNOMED CT, and more information can also be found in the following repository: https://github.com/JavierCastellD/AssessingSNOMEDEmbeddings.
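A sketch of working with this release format: an NPZ file encoding a dictionary from concept ID to embedding vector. We build a tiny synthetic file here; the real files (e.g., ft_snomed_ct_walks_dict.npz) can be loaded the same way, though the exact key layout and vector dimensionality are assumptions.

```python
import numpy as np

# Build a tiny synthetic {concept_id: vector} NPZ (IDs and dims invented).
rng = np.random.default_rng(0)
demo = {"22298006": rng.random(300), "38341003": rng.random(300)}
np.savez("demo_dict.npz", **demo)

# Load it back into a plain dict keyed by concept ID.
with np.load("demo_dict.npz") as data:
    emb = {cid: data[cid] for cid in data.files}

# Cosine similarity between two concept embeddings.
a, b = emb["22298006"], emb["38341003"]
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(cos, 3))
```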
License: https://cdla.io/sharing-1-0/
This project uses datasets obtained from MIMIC-III.
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.
This repository contains the code to derive the Hematology Complete Blood Count (CBC) Dataset from the MIMIC-III dataset.
Our objective is to derive a dataset that can be used to predict the disease of a patient based on the CBC values.
I have achieved an accuracy of 0.01 using various machine learning algorithms, which is very low. I am still working to improve it, and I have released the code so that others can contribute or offer suggestions and feedback.
This project can also show you how to derive datasets from MIMIC-III using Pandas.
I have also built a Streamlit app that takes the X parameters as input and predicts the y disease of the patient.
X in ['Hemoglobin', 'Eosinophils', 'Lymphocytes', 'Monocytes', 'Basophils', 'Neutrophils', 'Red Blood Cells', 'White Blood Cells']
y in ['Anemia', 'Leukemia', 'Thrombocytopenia', 'Thrombocytosis', 'Normal', ..., 'Other']
Objectives of this project:
- Derive the Hematology Complete Blood Count (CBC) Dataset from the MIMIC-III dataset. (Done)
- Build a machine learning model to predict the disease of a patient based on the CBC values with at least 0.8 accuracy. (Not done yet; so far I have achieved 0.01 accuracy.)
- Build a Streamlit app to take the X parameters as input and predict the y disease of the patient. (Done)
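To make the X-to-y prediction task concrete, here is a minimal nearest-centroid baseline on synthetic CBC-like data. This is not the project's model; the data, label subset, and classifier choice are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
FEATURES = ['Hemoglobin', 'Eosinophils', 'Lymphocytes', 'Monocytes',
            'Basophils', 'Neutrophils', 'Red Blood Cells', 'White Blood Cells']
CLASSES = ['Anemia', 'Leukemia', 'Normal']  # subset of the y label set

# Synthetic stand-in for the derived CBC table: 8 feature values + a label.
X = rng.normal(size=(300, len(FEATURES)))
y = rng.integers(len(CLASSES), size=300)

# Nearest-centroid baseline: predict the class whose mean CBC profile
# is closest (Euclidean distance) to the patient's values.
centroids = np.stack([X[y == c].mean(axis=0) for c in range(len(CLASSES))])

def predict(x):
    return CLASSES[int(np.argmin(np.linalg.norm(centroids - x, axis=1)))]

print(predict(X[0]))
```

On real CBC data you would replace the synthetic X/y with the derived table and use a stronger classifier.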
👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.
⚡️ Quickstart
1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab
2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab
⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.
1. 📖 Overview
EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks.
2. 💽 Dataset
EHRSHOT is sourced from Stanford’s STARR-OMOP database.
We provide two versions of the dataset.
To access the raw data, please see the "Tables" and "Files" tabs above.
3. 💽 Data Files and Formats
We provide EHRSHOT in two file formats.
Within the "Tables" tab...
1. EHRSHOT-OMOP
* Dataset Version: EHRSHOT-OMOP
* Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different than the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.
Within the "Files" tab...
1. EHRSHOT_ASSETS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: FEMR 0.1.16
* Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.
2. EHRSHOT_MEDS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: MEDS 0.3.3
* Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.
3. EHRSHOT_OMOP_MEDS.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Converts the dataset from EHRSHOT-OMOP into MEDS format via the `meds_etl_omop` command from MEDS-ETL.
4. EHRSHOT_OMOP_MEDS_Reader.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.
4. 🤖 Model
We also release the full weights of **CLMBR-T-base**, a 141M-parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base
5. 🧑‍💻 Code
Please see our Github repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/
**NOTE: You must authenticate to Redivis using your formal affiliation's email address. If you use Gmail or another personal email address, you will not be granted access.**
Access to the EHRSHOT dataset requires the following:
You can query some of the data online there, and a download link is also provided, so you can download it here as well.
Electronic medical records contain multi-format electronic medical data comprising an abundance of medical knowledge. Faced with patients' symptoms, experienced caregivers make the right medical decisions based on professional knowledge that accurately grasps the relationships between symptoms, diagnoses, and treatments. We aim to capture these relationships by constructing a large, high-quality heterogeneous graph linking patients, diseases, and drugs (PDD) in EMRs.
Specifically, we extract important medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) and automatically link them with the existing biomedical knowledge graphs, including ICD-9 ontology and DrugBank. The PDD graph presented is accessible on the Web via the SPARQL endpoint, and provides a pathway for medical discovery and applications, such as effective treatment recommendations.
A subgraph of PDD is illustrated in the following figure to better understand the PDD graph.
<img src="https://github.com/wangmengsd/pdd-graph/raw/master/example.png" alt="enter image description here">
The dataset belongs to Meng Wang, Jiaheng Zhang, Jun Liu, Wei Hu, Sen Wang, Wenqiang Liu, and Lei Shi.
They come from: 1. MOEKLINNS Lab, Xi’an Jiaotong University, Xi’an, China 2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 3. Griffith University, Gold Coast Campus, Australia
Contact emails: - Meng Wang: wangmengsd@stu.xjtu.edu.cn - Lei Shi: xjtushilei@foxmail.com - Jun Liu: liukeen@xjtu.edu.cn
The paper is under review and cannot yet be disclosed, so it is not linked here.
If you have any questions or suggestions, please send them to one of the email addresses above.
This work is licensed under a Creative Commons Attribution 4.0 International License.
### If your article needs to reference our work, you can cite our GitHub repository.
The Robomimic proficient human datasets were collected by 1 proficient operator using the RoboTurk platform (with the exception of Transport, which had 2 proficient operators working together). Each dataset consists of 200 successful trajectories.
Each task has two versions: one with low-dimensional observations (low_dim), and one with images (image).
The datasets follow the RLDS format to represent steps and episodes.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('robomimic_ph', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains part 5/7 of the full dataset used for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy".
This dataset comprises 3 years of normalized hourly data for both low-resolution predictors [16 km] and high-resolution target variables [2km] (2mT and 10-m U and V), from 2012-2014. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.
To use the data, clone the corresponding repository, unzip this zip file in the data folder, and download the other parts of the dataset (listed in the related works) from Zenodo.
The objective of this Bioengineering Research Partnership is to focus the resources of a powerful interdisciplinary team from academia (MIT), industry (Philips Medical Systems), and clinical medicine (Beth Israel Deaconess Medical Center) on developing and evaluating advanced ICU patient monitoring systems that will substantially improve the efficiency, accuracy, and timeliness of clinical decision making in intensive care.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: The current dataset provides all the stimuli (folder ../01-Stimuli/), raw data (folder ../02-Raw-data/) and post-processed data (../03-Post-proc-data/) used in the Forum Acusticum 2013 paper titled "Using auditory models to mimic human listeners in reverse correlation experiments from the fastACI toolbox" by the same authors. In this paper, we replicated the tone-in-noise experiment by Ahumada et al. (1975), but using an artificial listener instead of collecting data from real participants. The behavioural data were mimicked using an artificial listener based on 'king2019' (King et al., 2019) as a front-end model, using a template-matching decision to indicate whether a 500-Hz tone was (or was not) present in each of the noisy trials. This study offers a step-by-step guide to how an artificial listener can be integrated into fastACI.
Use these data: download all these data and place them in a local directory on your computer. If you have MATLAB and have downloaded a local copy of the fastACI toolbox (open access at: https://github.com/aosses-tue/fastACI), you can recreate the figures of our paper. After downloading and initialising the toolbox (type 'startup_fastACI;', without quotation marks, in MATLAB), run the script g20230501_FA_Artificial_listener_paper_figs.m (provided in this dataset) and follow the instructions on the screen to generate one of the four study figures. This script calls the function publ_osses2023b_FA_figs.m from the toolbox.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Natural Language Inference for Patient Recruitment (NLI4PR)
Dataset Description
Links
Homepage: Github.io
Repository: Github
Paper: arXiv
Contact (Original Authors): Mathilde Aguiar (mathilde.aguiar@lisn.fr)
Contact (Curator): Artur Guimarães (artur.guimas@gmail.com)
Dataset Summary
MedQA is a large-scale multiple-choice question-answering dataset designed to mimic the style of professional medical board exams, particularly the USMLE… See the full description on the dataset page: https://huggingface.co/datasets/araag2/NLI4PR.
License: https://creativecommons.org/publicdomain/zero/1.0/
CheXmask Database: a large-scale dataset of anatomical segmentation masks for chest x-ray images
This particular dataset contains only the segmentations for the NIH dataset, ChestX-ray8.
Nicolas Gaggion , Candelaria Mosquera , Martina Aineseder , Lucas Mansilla , Diego Milone , Enzo Ferrante
Published: March 1, 2024. Version: 0.4
This dataset was downloaded from PhysioNet (https://physionet.org/content/chexmask-cxr-segmentation-data/0.4/OriginalResolution/#files-panel). PhysioNet is a repository of freely-available medical research data, managed by the MIT Laboratory for Computational Physiology. Supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362.
Abstract The CheXmask Database presents a comprehensive, uniformly annotated collection of chest radiographs, constructed from five public databases: ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest and VinDr-CXR. The database aggregates 657,566 anatomical segmentation masks derived from images which have been processed using the HybridGNet model to ensure consistent, high-quality segmentation. To confirm the quality of the segmentations, we include in this database individual Reverse Classification Accuracy (RCA) scores for each of the segmentation masks. This dataset is intended to catalyze further innovation and refinement in the field of semantic chest X-ray analysis, offering a significant resource for researchers in the medical imaging domain.
Ethics All publicly available datasets utilized in this study adhered to strict ethical standards and underwent thorough anonymization, with identifiable details removed. The study does not release any part of the original image datasets; it only provides already anonymized image identifiers to allow researchers to match the original images with our annotations. MIMIC-CXR-JPG dataset required additional ethics training and research courses for access. The study authors fulfilled all ethics courses and data use agreement requirements to ensure ethical data usage.
Conflicts of Interest The authors have no conflict of interests to declare.
References Wang R, Chen LC, Moukheiber L, Seastedt KP, Moukheiber M, Moukheiber D, Zaiman Z, Moukheiber S, Litchman T, Trivedi H, Steinberg R. Enabling chronic obstructive pulmonary disease diagnosis through chest X-rays: A multi-site and multi-modality study. International Journal of Medical Informatics. 2023 Oct 1;178:105211. Gaggion N, Mansilla L, Mosquera C, Milone DH, Ferrante E. Improving anatomical plausibility in medical image segmentation via hybrid graph neural networks: applications to chest x-ray analysis. IEEE Trans Med Imaging. 2022. doi:10.1109/TMI.2022.3224660. Wang X, et al. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. Irvin J, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proceedings of the AAAI conference on artificial intelligence. 2019;33(01). Johnson AE, et al. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint. 2019. arXiv:1901.07042. Bustos A, et al. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Med Image Anal. 2020;66:101797. Nguyen HQ, et al. VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. Sci Data. 2022;9(1):429. Valindria VV, et al. Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE Trans Med Imaging. 2017;36:1597–1606. Gaggion N, Vakalopoulou M, Milone DH, Ferrante E. Multi-center anatomical segmentation with heterogeneous labels via landmark-based models. In: 20th IEEE International Symposium on Biomedical Imaging (ISBI). IEEE; 2023. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A, editors. 
Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Cham: Springer; 2015. p. 234-241. (Lecture Notes in Computer Science; vol 9351). Gaggion N. Chest-xray-landmark-dataset [Internet]. GitHub repository. Available from: https://github.com/ngaggion/Chest-xray-landmark-dataset. [Accessed 6/27/2023]