MIMIC-III is a dataset comprising health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Background: Mechanically ventilated patients in the intensive care unit (ICU) have high mortality rates. Multiple prediction scores, such as the Simplified Acute Physiology Score II (SAPS II), Oxford Acute Severity of Illness Score (OASIS), and Sequential Organ Failure Assessment (SOFA), are widely used in the general ICU population. We aimed to establish prediction scores for mechanically ventilated patients by combining these disease severity scores with other features available on the first day of admission.

Methods: A retrospective study of the Medical Information Mart for Intensive Care (MIMIC-III) database was conducted. The exposures of interest consisted of demographics, pre-ICU comorbidity, ICU diagnosis, disease severity scores, vital signs, and laboratory test results on the first day of ICU admission. Hospital mortality was used as the outcome. We used the machine learning methods of k-nearest neighbors (KNN), logistic regression, bagging, decision tree, random forest, Extreme Gradient Boosting (XGBoost), and neural network for model establishment. A sample of 70% of the cohort was used for the training set; the remaining 30% was used for testing. Areas under the receiver operating characteristic curves (AUCs) and calibration plots were constructed to evaluate and compare the models' performance. The significance of the risk factors was identified through the models, and the top factors are reported.

Results: A total of 28,530 subjects were enrolled through screening of the MIMIC-III database. After data preprocessing, 25,659 adult patients with 66 predictors were included in the model analyses. With the training set, the models of KNN, logistic regression, decision tree, random forest, neural network, bagging, and XGBoost were established, and on the testing set they obtained AUCs of 0.806, 0.818, 0.743, 0.819, 0.780, 0.803, and 0.821, respectively. The calibration curves of all the models except the neural network performed well. The XGBoost model performed best among the seven models. The top five predictors were age, respiratory dysfunction, SAPS II score, maximum hemoglobin, and minimum lactate.

Conclusion: This study indicates that models using risk factors available on the first day can be successfully established for predicting mortality in ventilated patients. The XGBoost model performed best among the seven machine learning models.
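As a rough illustration of the workflow described above (70/30 split, model fitting, AUC on the held-out set), the following Python sketch uses XGBoost. It is not the authors' code; the input file name (first_day_features.csv) and the outcome column (hospital_mortality) are hypothetical placeholders for a preprocessed, numeric feature table.

```python
# Minimal sketch of the 70/30 split + XGBoost + AUC workflow described above.
# Assumes one row per patient and a binary "hospital_mortality" column;
# the file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("first_day_features.csv")
X = df.drop(columns=["hospital_mortality"])
y = df["hospital_mortality"]

# 70% training, 30% testing, stratified on the outcome
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Evaluate discrimination on the held-out 30%
pred = model.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, pred):.3f}")
```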
MIT License: https://opensource.org/licenses/MIT
This dataset was created by chan hainguyen
Released under MIT
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
We provide some annotations for the Medical Information Mart for Intensive Care (MIMIC) III Waveform Database Matched Subset. The annotations are for the electrocardiogram recordings and denote atrial fibrillation status. More annotations will be added in the future. Details about the MIMIC-III matched subset can be found at PhysioNet: https://archive.physionet.org/physiobank/database/mimic3wdb/matched/. If you use the annotations, please cite the following paper: Bashar, S.K., Ding, E., Walkey, A.J., McManus, D.D. and Chon, K.H., 2019. Noise Detection in Electrocardiogram Signals for Intensive Care Unit Patients. IEEE Access, 7, pp. 88357-88368.
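A minimal sketch of how such a record could be read with the wfdb Python package and paired with an annotation. The record name, PhysioNet directory, and annotation CSV layout below are placeholders and assumptions, not the published format.

```python
# Sketch: stream a waveform record from the MIMIC-III matched subset with wfdb
# and look up its atrial fibrillation annotation. Record name, directory, and
# the annotation CSV columns ("record", "af_status") are assumptions.
import wfdb
import pandas as pd

record_name = "pXXXXXX-YYYY-MM-DD-hh-mm"        # placeholder record ID
pn_dir = "mimic3wdb-matched/1.0/p00/pXXXXXX"    # placeholder PhysioNet path

record = wfdb.rdrecord(record_name, pn_dir=pn_dir)
print(f"Signals: {record.sig_name}, fs = {record.fs} Hz")

ecg_idx = record.sig_name.index("II")           # assumes a lead-II ECG channel
ecg = record.p_signal[:, ecg_idx]

annotations = pd.read_csv("af_annotations.csv")  # hypothetical file name
af_status = annotations.loc[annotations["record"] == record_name, "af_status"]
print(f"{record_name}: {len(ecg)} ECG samples, AF status: {af_status.values}")
```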
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
The MIMIC PERform datasets are a series of datasets extracted from the MIMIC III Waveform Database. Each dataset contains recordings of physiological signals from critically-ill patients during routine clinical care. Specifically, the datasets contain the following signals:
Further details of the datasets are provided in the documentation accompanying the ppg-beats project, which is available at https://ppg-beats.readthedocs.io/en/latest/. In particular, documentation is provided on the following datasets:
Each dataset is accompanied by a license which acknowledges the source(s) of the data; please see the individual licenses for these acknowledgements.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
The Medical Information Mart for Intensive Care (MIMIC)-IV database comprises deidentified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed users. Here, we have provided an openly available demo of MIMIC-IV containing a subset of 100 patients. The dataset includes similar content to MIMIC-IV, but excludes free-text clinical notes. The demo may be useful for running workshops and for assessing whether MIMIC-IV is appropriate for a study before making an access request.
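A minimal sketch for a first look at the demo after downloading it locally. The download directory name is an assumption; the hosp/ and icu/ table layout follows the published MIMIC-IV module structure.

```python
# Sketch: load a couple of tables from a local copy of the MIMIC-IV demo.
# The download path is an assumption; tables live in hosp/ and icu/
# subdirectories as gzipped CSVs, following the MIMIC-IV module layout.
import pandas as pd

demo_dir = "mimic-iv-clinical-database-demo"   # assumed local download path

patients = pd.read_csv(f"{demo_dir}/hosp/patients.csv.gz")
icustays = pd.read_csv(f"{demo_dir}/icu/icustays.csv.gz")

print(f"{patients['subject_id'].nunique()} patients, {len(icustays)} ICU stays")
print(icustays[["subject_id", "stay_id", "first_careunit", "los"]].head())
```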
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
MIMIC_III_IPI - Discharge Summaries from Medical Information Mart for Intensive Care-III with Indirect Personal Identifiers Annotations
The discharge summaries we use for demonstrating our Indirect Personal Identifiers (IPI) schema are randomly sampled from the Medical Information Mart for Intensive Care (MIMIC-III) dataset. MIMIC-III comprises health-related data from over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. Among other types of data, such as patient demographics, the database also includes various types of textual data, such as diagnostic reports and discharge summaries. We chose discharge summaries for our study, since these are richer in information than other notes in MIMIC-III. Details:
These are the discharge summaries from MIMIC-III with Indirect Personal Identifiers annotations, released as external material for the paper accepted at the PrivateNLP workshop at NAACL 2025; a preprint can be found at:
This repository contains the annotations in a CSV file and the annotation guidelines document. Inspecting the exact annotation texts requires access to the MIMIC-III Clinical Database, see https://physionet.org/content/mimiciii/1.4/. Each row in the CSV file has an ID together with a list of the IPI annotated spans, each in the format {"start": ,"end": ,"label": }. The ID in the ipi_annotations.csv table corresponds to the same ROW_ID in the MIMIC-III NOTEEVENTS.csv table and can be used for merging the tables to inspect the original documents and reconstruct the annotations using the offsets.
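A minimal sketch of that merge-and-reconstruct step in Python. The annotation column names ("ID", "annotations") are assumptions about the CSV layout, and access to NOTEEVENTS.csv from MIMIC-III is required.

```python
# Sketch: merge the IPI annotation spans with MIMIC-III NOTEEVENTS by ROW_ID
# and recover the annotated text from the character offsets. Assumes the span
# list is stored as a JSON string per row; column names are assumptions.
import json
import pandas as pd

annotations = pd.read_csv("ipi_annotations.csv")
notes = pd.read_csv("NOTEEVENTS.csv", usecols=["ROW_ID", "TEXT"])

merged = annotations.merge(notes, left_on="ID", right_on="ROW_ID")

for _, row in merged.iterrows():
    # each span looks like {"start": ..., "end": ..., "label": ...}
    for span in json.loads(row["annotations"]):
        snippet = row["TEXT"][span["start"]:span["end"]]
        print(span["label"], repr(snippet))
```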
Please note that only authenticated users can request access to review and download the annotations and guidelines. If you encounter any issues, feel free to reach out to the contact person.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Background: Sleep disorders are a serious challenge faced by intensive care unit (ICU) patients and an important issue that needs urgent attention. Despite some efforts to reduce sleep disorders by controlling common risk factors, unidentified risk factors remain.

Objectives: This study aimed to develop and validate a risk prediction model for sleep disorders in ICU adults.

Methods: Data were retrieved from the MIMIC-III database. Matching analysis was used to match the patients with and without sleep disorders. A nomogram was developed based on logistic regression, which was used to identify risk factors for sleep disorders. The calibration and discrimination of the nomogram were evaluated with 1,000 bootstrap resamples and the receiver operating characteristic curve (ROC). In addition, decision curve analysis (DCA) was applied to evaluate the clinical utility of the prediction model.

Results: 2,082 patients were included in the analysis; 80% (n = 1,666) were assigned to the training set and the remaining 20% (n = 416) to the validation set. After the multivariate analysis, hemoglobin, diastolic blood pressure, respiratory rate, cardiovascular disease, and delirium were the independent risk predictors for sleep disorders. The nomogram showed high sensitivity and specificity of 75.6% and 72.9% in the ROC analysis. The threshold probability of the net benefit was between 55% and 90% in the DCA.

Conclusion: The model showed high performance in predicting sleep disorders in ICU adults, and its good clinical utility may make it a useful tool for providing clinical decision support to improve sleep quality in the ICU.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
We propose a derivative dataset (derived from the MIMIC-III Waveform Database Matched Subset) composed of 380 hours of the most common biomedical signals, including arterial blood pressure, photoplethysmogram, and electrocardiogram, for 1,524 de-identified subjects, each having 30 segments of 30 seconds of those signals. For more detailed information, please refer to the scientific article at this link: https://www.nature.com/articles/s41597-024-04041-1
The objective of this Bioengineering Research Partnership is to focus the resources of a powerful interdisciplinary team from academia (MIT), industry (Philips Medical Systems) and clinical medicine (Beth Israel Deaconess Medical Center) to develop and evaluate advanced ICU patient monitoring systems that will substantially improve the efficiency, accuracy and timeliness of clinical decision making in intensive care.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Background: Intracerebral hemorrhage (ICH) is a stroke syndrome with an unfavorable prognosis. Currently, there is no comprehensive clinical indicator for mortality prediction of ICH patients. The purpose of our study was to construct and evaluate a nomogram for predicting the 30-day mortality risk of ICH patients.

Methods: ICH patients were extracted from the MIMIC-III database according to the ICD-9 code and randomly divided into training and verification cohorts. The least absolute shrinkage and selection operator (LASSO) method and multivariate logistic regression were applied to determine independent risk factors. These risk factors were used to construct a nomogram model for predicting the 30-day mortality risk of ICH patients. The nomogram was verified by the area under the receiver operating characteristic curve (AUC), integrated discrimination improvement (IDI), net reclassification improvement (NRI), and decision curve analysis (DCA).

Results: A total of 890 ICH patients were included in the study. Logistic regression analysis revealed that age (OR = 1.05, P < 0.001), Glasgow Coma Scale score (OR = 0.91, P < 0.001), creatinine (OR = 1.30, P < 0.001), white blood cell count (OR = 1.10, P < 0.001), temperature (OR = 1.73, P < 0.001), glucose (OR = 1.01, P < 0.001), urine output (OR = 1.00, P = 0.020), and bleeding volume (OR = 1.02, P < 0.001) were independent risk factors for 30-day mortality of ICH patients. The calibration curve indicated that the nomogram was well calibrated. When predicting the 30-day mortality risk, the nomogram exhibited good discrimination in the training and validation cohorts (C-index: 0.782 and 0.778, respectively). The AUCs were 0.778, 0.733, and 0.728 for the nomogram, Simplified Acute Physiology Score II (SAPS II), and Oxford Acute Severity of Illness Score (OASIS), respectively, in the validation cohort. The IDI and NRI calculations and DCA analysis revealed that the nomogram model had a greater net benefit than the SAPS II and OASIS scoring systems.

Conclusion: This study identified independent risk factors for 30-day mortality of ICH patients and constructed a predictive nomogram model, which may help to improve the prognosis of ICH patients.
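For readers who want to reproduce this kind of LASSO-plus-logistic-regression workflow in Python, a hedged sketch follows. It is not the authors' code; the cohort file and outcome column names are hypothetical, and a preprocessed numeric feature table is assumed.

```python
# Sketch of LASSO-based variable selection followed by multivariate logistic
# regression, as described above. File and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("ich_cohort.csv")
X = df.drop(columns=["mortality_30d"])
y = df["mortality_30d"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# L1-penalized (LASSO-style) logistic regression to shrink weak predictors to zero
scaler = StandardScaler().fit(X_train)
lasso = LogisticRegressionCV(
    Cs=10, penalty="l1", solver="liblinear", cv=5, max_iter=1000
).fit(scaler.transform(X_train), y_train)
selected = X.columns[lasso.coef_[0] != 0]

# Multivariate logistic regression on the retained predictors
logit = LogisticRegression(max_iter=1000).fit(X_train[selected], y_train)
auc = roc_auc_score(y_val, logit.predict_proba(X_val[selected])[:, 1])
print(f"Selected predictors: {list(selected)}")
print(f"Validation AUC (C-index): {auc:.3f}")
```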
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The advent of large, open access text databases has driven advances in state-of-the-art model performance in natural language processing (NLP). The relatively limited amount of clinical data available for NLP has been cited as a significant barrier to the field's progress. Here we describe MIMIC-IV-Note: a collection of deidentified free-text clinical notes for patients included in the MIMIC-IV clinical database. MIMIC-IV-Note contains 331,794 deidentified discharge summaries from 145,915 patients admitted to the hospital and emergency department at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The database also contains 2,321,355 deidentified radiology reports for 237,427 patients. All notes have had protected health information removed in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. All notes are linkable to MIMIC-IV providing important context to the clinical data therein. The database is intended to stimulate research in clinical natural language processing and associated areas.
Background: This study aimed to develop and validate a nomogram for predicting mortality in patients with thoracic fractures without neurological compromise and hospitalized in the intensive care unit.

Methods: A total of 298 patients from the Medical Information Mart for Intensive Care III (MIMIC-III) database were included in the study, and 35 clinical indicators were collected within 24 h of patient admission. Risk factors were identified using the least absolute shrinkage and selection operator (LASSO) regression. A multivariate logistic regression model was established, and a nomogram was constructed. Internal validation was performed with 1,000 bootstrap samples; a receiver operating characteristic (ROC) curve was plotted, and the area under the curve (AUC), sensitivity, and specificity were calculated. In addition, the calibration of our model was evaluated by the calibration curve and the Hosmer-Lemeshow goodness-of-fit test (HL test). A decision curve analysis (DCA) was performed, and the nomogram was compared with scoring systems commonly used during clinical practice to assess the net clinical benefit.

Results: Indicators included in the nomogram were age, OASIS score, SAPS II score, respiratory rate, partial thromboplastin time (PTT), cardiac arrhythmias, and fluid-electrolyte disorders. Our model showed satisfactory diagnostic performance, with AUC values of 0.902 in the training set and 0.883 on internal validation. The calibration curve and the HL test exhibited satisfactory concordance between predicted and actual outcomes (P = 0.648). The DCA showed a superior net clinical benefit of our model over previously reported scoring systems.

Conclusion: In summary, we explored the incidence of mortality during the ICU stay of thoracic fracture patients without neurological compromise and developed a prediction model that facilitates clinical decision making. However, external validation will be needed in the future.
👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.
⚡️ Quickstart
1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab.
2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab.
⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.
1. 📖 Overview
EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks. The dataset contains de-identified structured EHR data (e.g., diagnosis and procedure codes, medications, lab values) for 6,739 Stanford Medicine patients, along with labels for 15 prediction tasks.
2. 💽 Dataset
EHRSHOT is sourced from Stanford’s STARR-OMOP database.
We provide two versions of the dataset: EHRSHOT-Original (the dataset as released with the paper) and EHRSHOT-OMOP (the full OMOP CDM tables for the EHRSHOT patients).
To access the raw data, please see the "Tables" and "Files" tabs above:
3. 💽 Data Files and Formats
We provide EHRSHOT in two file formats:
Within the "Tables" tab...
1. EHRSHOT-OMOP
* Dataset Version: EHRSHOT-OMOP
* Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different than the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.
Within the "Files" tab...
1. EHRSHOT_ASSETS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: FEMR 0.1.16
* Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.
2. EHRSHOT_MEDS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: MEDS 0.3.3
* Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.
3. EHRSHOT_OMOP_MEDS.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Converts the dataset from EHRSHOT-OMOP into MEDS format via the `meds_etl_omop` command from MEDS-ETL.
4. EHRSHOT_OMOP_MEDS_Reader.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.
4. 🤖 Model
We also release the full weights of **CLMBR-T-base**, a 141M-parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download it from https://huggingface.co/StanfordShahLab/clmbr-t-base
5. 🧑💻 Code
Please see our Github repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/
**NOTE: You must authenticate to Redivis using your formal affiliation's email address. If you use Gmail or other personal email addresses, you will not be granted access.**
Access to the EHRSHOT dataset requires the following:
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Clinical management decisions for patients with acutely decompensated heart failure and many other diseases are often based on grades of pulmonary edema severity, rather than on its mere absence or presence. Chest radiographs are commonly performed to assess pulmonary edema. The MIMIC-CXR dataset, which consists of 377,110 chest radiographs with free-text radiology reports, offers a tremendous opportunity to study this subject.
This dataset is curated based on MIMIC-CXR, containing 3 metadata files that consist of pulmonary edema severity grades extracted from the MIMIC-CXR dataset through different means: 1) by regular expression (regex) from radiology reports, 2) by expert labeling from radiology reports, and 3) by consensus labeling from chest radiographs.
This dataset aims to support the algorithmic development of pulmonary edema assessment from chest x-ray images and benchmark its performance. The metadata files have subject IDs, study IDs, DICOM IDs, and the numerical grades of pulmonary edema severity. The IDs listed in this dataset have the same mapping structure as in MIMIC-CXR.
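A minimal sketch of joining the severity grades to the MIMIC-CXR record list using those IDs. The metadata file name and the exact column names are assumptions based on the description above, not the published file layout.

```python
# Sketch: join regex-derived edema severity grades with the MIMIC-CXR record
# list so each grade can be matched to its DICOM image. The grades file name
# and column names (subject_id, study_id, dicom_id, severity) are assumptions.
import pandas as pd

grades = pd.read_csv("pulmonary_edema_severity_regex.csv")   # hypothetical name
cxr_records = pd.read_csv("cxr-record-list.csv.gz")          # from MIMIC-CXR

merged = grades.merge(cxr_records, on=["subject_id", "study_id", "dicom_id"])
print(merged["severity"].value_counts())                      # grade distribution
```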
This research studies ensemble learning approaches for the prediction of mortality in cardiac intensive care units. The models are trained and tested with data from the MIMIC-III v1.4 Critical Care database. Code in the PostgreSQL programming language was developed to select the ICU stays.
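A hedged sketch of one possible ICU-stay selection query against a local PostgreSQL build of MIMIC-III (not necessarily the study's exact criteria); the connection string and schema name are assumptions about the local setup.

```python
# Sketch: select cardiac ICU stays (coronary care and cardiac surgery recovery
# units) from a local PostgreSQL build of MIMIC-III v1.4. The connection string
# and schema name are assumptions; the selection criteria are illustrative only.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mimic")

query = """
SELECT icustay_id, subject_id, hadm_id, first_careunit, los
FROM mimiciii.icustays
WHERE first_careunit IN ('CCU', 'CSRU')
"""
cardiac_stays = pd.read_sql(query, engine)
print(f"{len(cardiac_stays)} cardiac ICU stays selected")
```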
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Natural Language Inference (NLI) is the task of determining whether a given hypothesis can be inferred from a given premise. Also known as Recognizing Textual Entailment (RTE), this task has enjoyed popularity among researchers for some time. However, almost all datasets for this task focused on open-domain data such as news texts, blogs, and so on. To address this gap, the MedNLI dataset was created for language inference in the medical domain. MedNLI is a derived dataset with data sourced from MIMIC-III v1.4. To stimulate research on this problem, a shared task on Medical Inference and Question Answering (MEDIQA) was organized at the workshop for biomedical natural language processing (BioNLP) 2019. The dataset provided herein is a test set of 405 premise-hypothesis pairs for the NLI challenge in the MEDIQA shared task. Participants of the shared task were expected to use the MedNLI data to develop their models, and this dataset was used as an unseen test set for scoring each participant's submission.
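A minimal scoring sketch for the test set. The file names are placeholders, and the jsonl fields (sentence1, sentence2, gold_label) are assumed to follow the SNLI-style convention used by MedNLI.

```python
# Sketch: compute accuracy of system predictions against the NLI test set.
# File names are placeholders; the jsonl field names are assumptions based on
# the SNLI-style format used by MedNLI.
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

test_pairs = load_jsonl("mediqa_nli_test.jsonl")    # 405 premise-hypothesis pairs
predictions = load_jsonl("my_predictions.jsonl")    # same order, field "label"

correct = sum(
    p["label"] == t["gold_label"] for p, t in zip(predictions, test_pairs)
)
print(f"Accuracy: {correct / len(test_pairs):.3f}")
```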
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
RegulaTome corpus: this file contains the RegulaTome corpus in BRAT format. The directory "splits" contains the corpus split into the train/dev/test sets used for training the relation extraction system.
RegulaTome annodoc: The annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz. The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/
The tagger software can be found here: https://github.com/larsjuhljensen/tagger. The command used to run tagger before large-scale execution of the RE system is:
gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` \
  | cat dictionary/excluded_documents.txt - \
  | tagger/tagcorpus --threads=16 --autodetect \
      --types=dictionary/curated_types.tsv \
      --entities=dictionary/all_entities.tsv \
      --names=dictionary/all_names_textmining.tsv \
      --groups=dictionary/all_groups.tsv \
      --stopwords=dictionary/all_global.tsv \
      --local-stopwords=dictionary/all_local.tsv \
      --type-pairs=dictionary/all_type_pairs.tsv \
      --out-matches=all_matches.tsv
Input documents: large-scale execution is done on all PubMed (as of March 2024) and PMC Open Access (as of November 2023) articles in BioC format. The files are converted to a tab-delimited format to be compatible with the RE system input (see below).
Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz
Tagger output: we filter the results of the tagger run down to gene/protein hits, and documents with more than 1 hit (since we are doing relation extraction) before feeding it to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger tsv output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.
Relation extraction models. The Transformer-based model used for large-scale relation extraction and prediction on the test set is at relation_extraction_multi-label-best_model.tar.gz
The RoBERTa model pre-trained on PubMed, PMC, and MIMIC-III with a BPE vocabulary learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here.
Relation extraction system output: the tab-delimited outputs of the relation extraction system are found in large_scale_relation_extraction_results.tar.gz. ATTENTION: this file is approximately 1 TB in size, so make sure you have enough space before downloading it to your machine.
The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
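A short sketch of reading one of these output files and picking the highest-scoring class per entity pair. The file name is a placeholder; the column split relies on the header and column order described above (three ID columns, then one score column per class).

```python
# Sketch: read one tab-delimited RE output file and report the top-scoring
# relation class per entity pair. The file name is a placeholder; column names
# beyond the first three (PMID, Entity BRAT ID1, Entity BRAT ID2) come from
# the per-file header.
import pandas as pd

df = pd.read_csv("relation_extraction_output.tsv", sep="\t")

id_cols = list(df.columns[:3])        # PMID, Entity BRAT ID1, Entity BRAT ID2
score_cols = list(df.columns[3:])     # one score column per relation class

df["top_class"] = df[score_cols].idxmax(axis=1)
df["top_score"] = df[score_cols].max(axis=1)
print(df[id_cols + ["top_class", "top_score"]].head())
```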
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Additional file 3.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
We show precision, recall, F1-Score, and AUC.