MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies were performed at Beth Israel Deaconess Medical Center in Boston, MA.
itsanmolgupta/mimic-cxr-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
MIMIC-CXR [1] is a large publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. In addition, labels for the presence of 12 different chest-related pathologies, as well as any support devices and overall normal/abnormal status, were made available via the MIMIC Chest X-ray JPG (MIMIC-CXR-JPG) [2] labels, which were generated using the CheXpert and NegBio algorithms.
Based on these labels, we created a visual question answering dataset comprising 224 questions for 48 cases from the official test set and 111 questions for 23 validation cases. The majority (68%) of the questions are closed-ended (answerable with yes or no) and ask about the presence of one of 15 chest pathologies, any support device, or, generically, any abnormality. The remaining open-ended questions ask about the location, size, severity, or type of a pathology or device, where the MIMIC-CXR-JPG labels indicate it is present in the specific case.
For each question and case we also provide a reference answer, authored by a board-certified radiologist (with 17 years of post-residency experience) based on the chest X-ray and the original radiology report.
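As a sketch of how the CheXpert/NegBio-style labels described above can drive closed-ended question answers, the snippet below maps label values to tentative yes/no responses. The value convention (1.0 positive, 0.0 negative, -1.0 uncertain, missing means not mentioned) and the toy study row are assumptions for illustration, not the dataset's exact schema.

```python
# Sketch: turning MIMIC-CXR-JPG CheXpert-style labels into closed-ended
# ("yes"/"no") VQA answers. Assumed label convention: 1.0 = positive,
# 0.0 = negative, -1.0 = uncertain, None/missing = not mentioned.

def label_to_answer(value):
    """Map a single CheXpert/NegBio label value to a tentative VQA answer."""
    if value == 1.0:
        return "yes"
    if value == 0.0:
        return "no"
    if value == -1.0:
        return "uncertain"
    return "not mentioned"  # missing label: the report does not address it

# A toy study row keyed by pathology name (values follow the convention above).
study_labels = {
    "Pleural Effusion": 1.0,
    "Pneumothorax": 0.0,
    "Edema": -1.0,
    "Fracture": None,
}

answers = {p: label_to_answer(v) for p, v in study_labels.items()}
```

In practice the uncertain (-1.0) and missing cases need a policy decision (e.g. treating them as unanswerable) before they can back a yes/no question.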
MIMIC-CXR-LT. We construct a single-label, long-tailed version of MIMIC-CXR in a similar manner. MIMIC-CXR is a multi-label classification dataset with over 200,000 chest X-rays labeled with 13 pathologies and a “No Findings” class. The resulting MIMIC-CXR-LT dataset contains 19 classes, of which 10 are head classes, 6 are medium classes, and 3 are tail classes. MIMIC-CXR-LT contains 111,792 images labeled with one of 18 diseases, with 87,493 training images and 23,550 test set images. The validation and balanced test sets contain 15 and 30 images per class, respectively.
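The head/medium/tail grouping above can be sketched as a simple threshold on per-class training-image counts. The thresholds below (at least 1,000 images for head, at least 100 for medium) and the toy counts are illustrative assumptions, not the values used to build MIMIC-CXR-LT.

```python
# Sketch of grouping classes into head/medium/tail for a long-tailed split
# such as MIMIC-CXR-LT. Thresholds are assumed, not the paper's values.

def shot_groups(class_counts, head_min=1000, medium_min=100):
    """Partition classes by training-image count."""
    groups = {"head": [], "medium": [], "tail": []}
    for cls, n in class_counts.items():
        if n >= head_min:
            groups["head"].append(cls)
        elif n >= medium_min:
            groups["medium"].append(cls)
        else:
            groups["tail"].append(cls)
    return groups

counts = {"Atelectasis": 15000, "Edema": 800, "Pleural Other": 40}
groups = shot_groups(counts)
```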
This dataset was created by Nikesh reddy patlolla
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
CXR-PRO is an adaptation of the MIMIC-CXR dataset that omits references to prior radiology reports. Consisting of 374,139 free-text radiology reports and associated chest radiographs, CXR-PRO addresses the issue of hallucinated references to priors produced by radiology report generation models. By removing nearly all prior references in MIMIC-CXR, CXR-PRO, when used as training data for report generation models, is capable of broadly improving the factual consistency and accuracy of generated reports. More generally, this dataset aims to support a wide body of research in medical image analysis and natural language processing. MIMIC-CXR is a de-identified dataset, so no protected health information (PHI) is included.
MLforHealthcare/mimic-cxr dataset hosted on Hugging Face and contributed by the HF Datasets community
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
LLaVA-Rad MIMIC-CXR features more accurate section extraction from MIMIC-CXR free-text radiology reports. Traditionally, rule-based methods were used to extract sections such as the reason for exam, findings, and impression; however, these approaches often fail due to inconsistencies in report structure and clinical language. In this work, we leverage GPT-4 to extract these sections more reliably, adding 237,073 image-text pairs to the training split and 1,952 pairs to the validation split. This enhancement enabled the development and fine-tuning of LLaVA-Rad, a multimodal large language model (LLM) tailored for radiology applications, achieving improved performance on report generation tasks.
This resource is provided to support reproducibility and for the benefit of the research community, enabling further exploration in vision–language modeling. For more details, please refer to the accompanying paper [1].
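For contrast with the GPT-4 approach, here is a minimal rule-based section extractor of the kind the description says often fails on inconsistently structured reports. The recognized headers ("FINDINGS:", "IMPRESSION:", "INDICATION:") are assumed conventions; real reports use many more variants, which is exactly why rule-based splitting breaks down.

```python
# A minimal rule-based section splitter (the traditional approach that
# LLaVA-Rad's GPT-4 extraction improves on). Header names are assumptions.
import re

SECTION_RE = re.compile(
    r"^(FINDINGS|IMPRESSION|INDICATION):\s*", re.IGNORECASE | re.MULTILINE
)

def split_sections(report: str) -> dict:
    """Split a free-text report on recognized headers.

    Text before the first header is discarded; returns {name_lower: body}.
    """
    sections = {}
    matches = list(SECTION_RE.finditer(report))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[m.group(1).lower()] = report[m.end():end].strip()
    return sections

report = "INDICATION: Cough.\nFINDINGS: Lungs are clear.\nIMPRESSION: No acute process."
parsed = split_sections(report)
```

A report whose impression is embedded in a paragraph without a header, or that spells the header differently, silently yields nothing here, illustrating the failure mode the GPT-4 pipeline addresses.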
ayyuce/mimic-cxr dataset hosted on Hugging Face and contributed by the HF Datasets community
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Generative vision-language models have exciting potential implications for radiology report generation, but unfortunately such models are also known to produce hallucinations and other nonsensical statements. For example, radiology report generation models regularly hallucinate prior exams, making statements such as “The lungs are hyperinflated with emphysematous changes as seen on prior CT” despite not having access to any prior exam. To address this shortcoming, we propose ReXPref-Prior, an adapted version of MIMIC-CXR where GPT-4 has removed references to prior exams from both findings and impression sections of chest X-ray reports. We expect ReXPref-Prior will be useful for training models that hallucinate prior exams less frequently, through techniques such as direct preference optimization. Additionally, ReXPref-Prior’s validation and test sets can be used as a new benchmark for evaluating report generation models.
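ReXPref-Prior uses GPT-4 to remove prior-exam references; the crude sentence filter below only sketches the idea. The trigger phrases are assumptions, and a real LLM pass handles far more phrasings than any regex list can.

```python
# Sketch: dropping sentences that reference prior exams. This is a toy
# stand-in for the GPT-4 removal step; the phrase list is an assumption.
import re

PRIOR_PATTERNS = re.compile(
    r"\b(compared? (to|with) (the )?prior|as seen on prior|since the prior"
    r"|unchanged from (the )?prior|prior (exam|study|radiograph|CT))\b",
    re.IGNORECASE,
)

def drop_prior_sentences(section: str) -> str:
    """Keep only sentences containing no recognized prior-exam phrase."""
    sentences = [s.strip() for s in section.split(".") if s.strip()]
    kept = [s for s in sentences if not PRIOR_PATTERNS.search(s)]
    return ". ".join(kept) + ("." if kept else "")

text = ("The lungs are hyperinflated with emphysematous changes as seen on "
        "prior CT. No pleural effusion.")
cleaned = drop_prior_sentences(text)
```

Note that deleting whole sentences can also delete current findings mentioned in the same sentence, one reason sentence-level rewriting with an LLM is preferable to filtering.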
wza/mimic-cxr-rad-dino dataset hosted on Hugging Face and contributed by the HF Datasets community
1,083 cases from the MIMIC-CXR dataset. For each case, a grayscale X-ray image of approximately 3000×3000 pixels, eye-gaze data, and ground-truth classification labels are provided. These cases are classified into 3 categories: Normal, Congestive Heart Failure (CHF), and Pneumonia.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
This resource generates a multimodal combination of the MIMIC-IV v1.0.0 and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 databases, filtered to include only patients with at least one chest X-ray, with the goal of validating multimodal predictive analytics in healthcare operations. The generated dataset contains 34,540 individual patient files stored as Python "pickle" object structures, covering a total of 7,279 hospitalization stays involving 6,485 unique patients. The repository additionally includes code to extract feature embeddings, as well as the list of pre-processed features.
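Reading one of the per-patient pickle files might look like the sketch below. The key names in the toy object (`haim_id`, `demographics`, `cxr_embeddings`) are assumptions for illustration, not the repository's actual schema; consult its feature list for the real structure.

```python
# Sketch of loading one per-patient "pickle" file from the multimodal
# MIMIC-IV + MIMIC-CXR-JPG resource. Field names below are assumptions.
import io
import pickle

# Stand-in for one patient file on disk; real files would be opened with
# open(path, "rb") instead of an in-memory buffer.
toy_patient = {
    "haim_id": 0,
    "demographics": {"age": 63, "sex": "F"},
    "cxr_embeddings": [[0.12, -0.08, 0.33]],  # one embedding per image
}
buf = io.BytesIO(pickle.dumps(toy_patient))

patient = pickle.load(buf)
n_images = len(patient["cxr_embeddings"])
```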
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by ARPAN.GOSWAMI.0
Released under Apache 2.0
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
Interpreting medical images and writing radiology reports is a critical yet challenging task in healthcare. Despite their importance, both human-written and AI-generated reports are liable to errors, leaving a need for robust and representative datasets that capture the diversity of errors present across different modes of report generation. Thus, we present Chest X-Ray Report Errors (ReXErr-v1), a new dataset based on MIMIC-CXR and constructed using large language models (LLMs) that contains synthetic error reports for the majority of MIMIC-CXR. Developed with input from board-certified radiologists, ReXErr-v1 contains plausible errors that closely mimic those found in real-world scenarios. Furthermore, ReXErr-v1 utilizes a novel sampling methodology that selects three errors to inject into each report from a set of frequent errors made by both humans and AI models. We include errors at both the report and sentence level, improving the versatility of ReXErr-v1. Our dataset can enhance future AI reporting tools by aiding the development and evaluation of report-generation and error-screening algorithms.
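The three-error sampling step can be sketched as a weighted draw without replacement from a categorized error pool. The category names and weights below are illustrative assumptions; ReXErr-v1's actual error taxonomy and frequencies may differ.

```python
# Sketch of ReXErr-style error sampling: draw three distinct error types
# for a report, weighted by assumed frequency. Taxonomy is an assumption.
import random

ERROR_POOL = [
    ("false prediction", 0.4),  # asserting a finding that is absent
    ("omission", 0.3),          # dropping a true finding
    ("laterality swap", 0.1),   # left/right confusion
    ("prior reference", 0.1),   # hallucinated comparison to a prior exam
    ("severity change", 0.1),   # e.g. "mild" described as "severe"
]

def sample_errors(rng, k=3):
    """Draw k distinct error types, weighted by assumed frequency."""
    names = [n for n, _ in ERROR_POOL]
    weights = [w for _, w in ERROR_POOL]
    chosen = []
    while len(chosen) < k:
        pick = rng.choices(names, weights=weights, k=1)[0]
        if pick not in chosen:
            chosen.append(pick)
    return chosen

errors = sample_errors(random.Random(0))
```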
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RPW/mimic-cxr-dataset-findings-impression dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License (https://opensource.org/licenses/MIT)
varun-v-rao/mimic-cxr-dpo-with-metrics dataset hosted on Hugging Face and contributed by the HF Datasets community
itsanmolgupta/mimic-cxr-dataset-cleaned dataset hosted on Hugging Face and contributed by the HF Datasets community
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Clinical management decisions for patients with acutely decompensated heart failure and many other diseases are often based on grades of pulmonary edema severity, rather than its mere absence or presence. Chest radiographs are commonly performed to assess pulmonary edema. The MIMIC-CXR dataset, which consists of 377,110 chest radiographs with free-text radiology reports, offers a tremendous opportunity to study this subject.
This dataset is curated based on MIMIC-CXR, containing 3 metadata files that consist of pulmonary edema severity grades extracted from the MIMIC-CXR dataset through different means: 1) by regular expression (regex) from radiology reports, 2) by expert labeling from radiology reports, and 3) by consensus labeling from chest radiographs.
This dataset aims to support the algorithmic development of pulmonary edema assessment from chest X-ray images and benchmark its performance. The metadata files have subject IDs, study IDs, DICOM IDs, and the numerical grades of pulmonary edema severity. The IDs listed in this dataset have the same mapping structure as in MIMIC-CXR.
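Combining the three metadata sources by study ID might look like the sketch below. The field names and the precedence (consensus image labels over expert report labels over regex-derived labels) are assumptions for illustration, not rules stated by the dataset.

```python
# Sketch: merging the three edema-severity label sources per study.
# Field names and source precedence are assumptions.
regex_grades = {"s500001": 1, "s500002": 3}   # regex over radiology reports
expert_grades = {"s500002": 2}                # expert report labels
consensus_grades = {"s500001": 0}             # consensus image labels

def best_grade(study_id):
    """Prefer consensus, then expert, then regex-derived grades."""
    for source in (consensus_grades, expert_grades, regex_grades):
        if study_id in source:
            return source[study_id]
    return None

grades = {sid: best_grade(sid) for sid in set(regex_grades) | set(expert_grades)}
```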
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Data quality assessment of the MIMIC-CXR dataset (65,379 patients; 227,827 individual reports; 377,100 images): mismatched sex mentions in reports attributed to the same individual, the number (%) of poor-quality images flagged by our poor-quality-image classification model, and the number (%) of wrongly labelled views (in the metadata) flagged by our view classification model. All reports and images indicated above were manually checked, and we provide a spreadsheet in S1 Data with the corrected view labels and the reports likely from different individuals, identified via sex differences with other reports attributed to the same person identifier.