License: https://creativecommons.org/publicdomain/zero/1.0/
Digitization of healthcare data, along with algorithmic breakthroughs in AI, will have a major impact on healthcare delivery in the coming years. It is interesting to see AI applied to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.
Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL that provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we assume that a policy is learned from a fixed dataset of trajectories without further interaction with the environment (the agent receives no reward or punishment signal from the environment). It has been shown that such an approach can leverage the vast amount of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies for solving real-world decision-making problems. Offline RL algorithms, when trained on sufficiently large and diverse offline datasets, can produce close-to-optimal policies that generalize beyond the training data.
As part of my PhD research, I investigated the problem of developing a clinical decision support system for sepsis management using offline deep reinforcement learning.
MIMIC-III ('Medical Information Mart for Intensive Care') is a large, open-access, anonymized, single-center database consisting of comprehensive clinical data for 61,532 critical care admissions from 2001–2012, collected at a Boston teaching hospital. The dataset consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.
We try to answer the following question:
Given a particular patient’s characteristics and physiological information at each time step as input, can our deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?
We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC), and Persistent Advantage Learning (PAL). Using these methods, we can train an RL policy to recommend an optimal treatment path for a given patient.
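To make the offline setting concrete, here is a minimal, purely illustrative sketch of batch Q-learning on a synthetic fixed dataset of transitions. This is not the pipeline from the repo: the state/action sizes, rewards, and hyperparameters are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a logged dataset of transitions
# (state, action, reward, next_state, done): 5 discrete states, 2 actions.
N_STATES, N_ACTIONS = 5, 2
batch = [(int(rng.integers(N_STATES)), int(rng.integers(N_ACTIONS)),
          float(rng.random()), int(rng.integers(N_STATES)), False)
         for _ in range(500)]

# Offline Q-learning: sweep the fixed batch repeatedly; the agent never
# interacts with the environment while learning.
Q = np.zeros((N_STATES, N_ACTIONS))
gamma, alpha = 0.99, 0.1
for _ in range(50):
    for s, a, r, s2, done in batch:
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])

# Greedy policy: the recommended action (treatment) for each state
policy = Q.argmax(axis=1)
print(policy)
```

DQN, DDQN, MMC, and PAL replace the tabular Q array with a neural network and modified targets, but the batch-sweep structure is the same.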
Data acquisition, standard pre-processing, and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Preprocessed version of the MIMIC-II dataset. See https://github.com/theislab/ehrapy-datasets/tree/main/mimic_2
License: https://creativecommons.org/publicdomain/zero/1.0/
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Physicians record their detailed thought-processes about diagnoses and treatments as unstructured text in a section of a clinical note called the "assessment and plan". This information is more clinically rich than structured billing codes assigned for an encounter but harder to reliably extract given the complexity of clinical language and documentation habits. To structure these sections we collected a dataset of annotations over assessment and plan sections from the publicly available and de-identified MIMIC-III dataset, and developed deep-learning based models to perform this task, described in the associated paper available as a pre-print at: https://www.medrxiv.org/content/10.1101/2022.04.13.22273438v1
When using this data please cite our paper:
@article {Stupp2022.04.13.22273438, author = {Stupp, Doron and Barequet, Ronnie and Lee, I-Ching and Oren, Eyal and Feder, Amir and Benjamini, Ayelet and Hassidim, Avinatan and Matias, Yossi and Ofek, Eran and Rajkomar, Alvin}, title = {Structured Understanding of Assessment and Plans in Clinical Documentation}, year = {2022}, doi = {10.1101/2022.04.13.22273438}, publisher = {Cold Spring Harbor Laboratory Press}, URL = {https://www.medrxiv.org/content/early/2022/04/17/2022.04.13.22273438}, journal = {medRxiv} }
The dataset presented here contains annotations of assessment and plan sections of notes from the publicly available and de-identified MIMIC-III dataset, marking the active problems, their assessment description, and plan action items. Action items are additionally marked as one of 8 categories (listed below). The dataset contains over 30,000 annotations of 579 notes from distinct patients, annotated by 6 medical residents and students.
The dataset is divided into 4 partitions - a training set (481 notes), validation set (50 notes), test set (48 notes) and an inter-rater set. The inter-rater set contains the annotations of each of the raters over the test set. Rater 1 in the inter-rater set should be regarded as an intra-rater comparison (details in the paper). The labels underwent automatic normalization to capture entire word boundaries and remove flanking non-alphanumeric characters.
Code for transforming labels into TensorFlow examples and training models as described in the paper will be made available at GitHub: https://github.com/google-research/google-research/tree/master/assessment_plan_modeling
In order to use these annotations, the user additionally needs to obtain the text of the notes, which is found in the NOTEEVENTS table from MIMIC-III; access must be acquired independently (https://mimic.mit.edu/).
Annotations are given as character spans in a CSV file with the following schema (field: type. Semantics):

- partition: categorical, one of [train, val, test, interrater]. The set of ratings the span belongs to.
- rater_id: int. Unique ID for each of the raters.
- note_id: int. The note’s unique note_id, which links to the MIMIC-III notes table (as ROW_ID).
- span_type: categorical, one of [PROBLEM_TITLE, PROBLEM_DESCRIPTION, ACTION_ITEM]. Type of the span as annotated by raters.
- char_start: int. Character offset of the span start from the note start.
- char_end: int. Character offset of the span end from the note start.
- action_item_type: categorical, one of [MEDICATIONS, IMAGING, OBSERVATIONS_LABS, CONSULTS, NUTRITION, THERAPEUTIC_PROCEDURES, OTHER_DIAGNOSTIC_PROCEDURES, OTHER]. Type of action item if the span is an action item (empty otherwise), as annotated by raters.
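As a sketch of how spans in this schema can be joined back to note text, here is an illustrative example over a made-up note. With the real release, you would read the annotations CSV and join note_id against the NOTEEVENTS text from MIMIC-III (obtained separately); we assume here that char_end is an exclusive offset.

```python
import csv
import io

# Two made-up rows in the schema above, plus a stand-in note body.
sample = io.StringIO(
    "partition,rater_id,note_id,span_type,char_start,char_end,action_item_type\n"
    "train,1,42,PROBLEM_TITLE,0,6,\n"
    "train,1,42,ACTION_ITEM,8,17,MEDICATIONS\n"
)
note_text = {42: "Sepsis: start abx today"}  # stand-in for NOTEEVENTS text

# Slice each annotated span out of its note by character offsets.
spans = []
for row in csv.DictReader(sample):
    text = note_text[int(row["note_id"])]
    spans.append((row["span_type"], text[int(row["char_start"]):int(row["char_end"])]))

print(spans)
```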
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
DexMimicGen Datasets
This repository contains the official dataset release of simulation environments and datasets for the ICRA 2025 paper "DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning". Website: https://dexmimicgen.github.io. For business inquiries, please submit this form: NVIDIA Research Licensing.
This dataset is a portion of the MIMIC-III Clinical Database, a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset. The full dataset is available on PhysioNet via this link.
This dataset contains only 4 tables (extracted from the original dataset); more information about each table can be found via its corresponding link:
- admissions.csv
- d_labitems.csv
- labevents.csv
- patient.csv
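As an illustration of how these tables link together, here is a sketch of joining lab events to their item definitions. Tiny inline CSVs stand in for the downloaded files, and the column names follow the usual MIMIC-III convention (check the demo files for the exact headers).

```python
import csv
import io

# Stand-ins for d_labitems.csv and labevents.csv.
d_labitems = io.StringIO("ITEMID,LABEL\n50912,Creatinine\n51222,Hemoglobin\n")
labevents = io.StringIO(
    "SUBJECT_ID,ITEMID,VALUENUM\n10006,50912,1.1\n10006,51222,9.8\n"
)

# Join labevents to d_labitems on ITEMID to label each lab measurement.
labels = {r["ITEMID"]: r["LABEL"] for r in csv.DictReader(d_labitems)}
rows = [dict(r, LABEL=labels[r["ITEMID"]]) for r in csv.DictReader(labevents)]

for r in rows:
    print(r["SUBJECT_ID"], r["LABEL"], r["VALUENUM"])
```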
A nice visualization of this dataset can be found here.
This portion of the dataset will be combined to build a comprehensive dataset of simulated medical reports.
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Fast Healthcare Interoperability Resources (FHIR) has emerged as a robust standard for healthcare data exchange. To explore the use of FHIR for the process of data harmonization, we converted the Medical Information Mart for Intensive Care IV (MIMIC-IV) and MIMIC-IV Emergency Department (MIMIC-IV-ED) databases into FHIR. We extended base FHIR to encode information in MIMIC-IV and aimed to retain the data in FHIR with minimal additional processing, aligning to US Core v4.0.0 where possible. A total of 24 profiles were created for MIMIC-IV data, and an additional 6 profiles were created for MIMIC-IV-ED data. All MIMIC terminology was converted into code systems and value sets, as necessary. We hope MIMIC-IV in FHIR provides a useful restructuring of the data to support applications around data harmonization, interoperability, and other areas of research.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
<img src="https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/monarch_on_daisy-600.png" alt="Images of Monarch and Viceroy butterflies from tiny dataset">
This 2022 version of the dataset consists of 1028 total images. Each image is a 224x224 pixel jpg showing a single butterfly in the wild. The images are of 6 species of common North American butterflies. Some of the butterflies are toxic and some mimic the look of the toxic ones. The images are for education and research purposes only. CSV files contain the labels and no label information is contained in the folder or image names. Additionally, a “tiny” dataset is included.
The abbreviated tiny dataset for image classification is of just 2 species, with an accompanying tiny dataset document.
<img src="https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/the-monarchs-and-viceroys.png" alt="Images of Monarch and Viceroy butterflies from tiny dataset">
The full version of the dataset for image classification covers 6 butterfly species. It too has an accompanying dataset document.
<img src="https://github.com/KeithPinson/butterfly_mimics_2022_dataset/raw/main/DocResources/the-butterflies.png" alt="Images of Black, Monarch, Pipevine, Spicebush, Tiger and Viceroy butterflies from the dataset">
This supplementary material accompanies:
Charlton P.H. et al., "An impedance pneumography signal quality index for respiratory rate monitoring: design, assessment and application", [under review], 2020
The Impedance Pneumography Signal Quality Index (SQI) dataset and accompanying scripts (in Matlab format) are provided to facilitate reproduction of the analyses using data from the MIMIC III dataset in this publication.
Summary of Publication
In this article we developed and assessed the performance of a signal quality index (SQI) for the impedance pneumography signal. The SQI was developed using data from the Listen dataset, and assessed using data from the Listen and MIMIC III datasets. The SQI was found to accurately classify segments of impedance pneumography signal as either high or low quality. Furthermore, when it was coupled with a high-performance RR algorithm, highly accurate and precise RRs were estimated from those segments deemed to be high quality. In this study performance was assessed in the critical care environment - further work is required to determine whether the SQI is suitable for use with wearable sensors. Both the dataset and code used to perform this study are publicly available.
Reproducing this Publication
The work relating to the MIMIC dataset in this publication can be reproduced as follows:
Reproducing the analysis: these steps can be used to quickly reproduce the analysis using the curated and annotated dataset.
Download the curated and annotated dataset from Zenodo using this direct download link.
Run the analysis using the run_imp_sqi_mimic.m script.
Use the ImP_SQI_mimic_data_importer.m script to download raw MIMIC data files from PhysioNet, and collate them into a single Matlab file.
Prepare the dataset for manual annotation by running the run_imp_sqi_mimic.m script.
Manually annotate the signals by running the run_mimic_imp_annotation.m script - the annotations are stored in separate files (the original annotation files are available here).
Import the manual annotations into the collated data file by re-running the ImP_SQI_mimic_data_importer.m script.
Run run_imp_sqi_mimic.m to perform the analysis described in the publication.
The scripts (alongside details of how to use them) are also available in the RRest GitHub repository at: https://github.com/peterhcharlton/RRest/tree/master/RRest_v3.0/Publication_Specific_Scripts/ImP_SQI
License: The dataset (mimic_imp_sqi_data.mat) is distributed under the terms specified in the accompanying LICENSE file. The scripts are distributed under the GNU General Public Licence (as specified towards the start of each file).
Version 0.1.1: This is the version at the time of initial submission.
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Emergency department (ED) overcrowding leads to delayed care, increased patient risk, and inefficient resource use. The MIMIC-IV-Ext Triage Instruction Corpus (MIETIC) addresses this by providing 9,629 structured triage cases from MIMIC-IV, aligned with the Emergency Severity Index (ESI). MIETIC supports large language model (LLM) training for AI-assisted triage, improving accuracy, consistency, and risk assessment. The dataset includes chief complaints, vital signs, demographics, and medical history, ensuring realistic triage decision-making. Developed through automated quality control and expert validation, MIETIC enhances model performance in high-risk and moderate-risk classification. Available in CSV formats, MIETIC enables research in clinical NLP, AI-driven triage, and decision-support tools. The dataset module includes:
Structured triage cases with ESI labels. Triage case generation prompts for instruction tuning. Expert-validated samples for quality control. SQL scripts for data extraction and validation, hosted on GitHub.
MIETIC provides a standardized, reproducible dataset to advance AI-driven emergency triage, optimizing accuracy, efficiency, and resource allocation.
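To illustrate what an instruction-tuning record built from a structured triage case might look like, here is a hypothetical example pairing chief complaint, vitals, and demographics with an ESI label. The field names and the instruction/input/output layout are illustrative only, not the dataset's actual schema.

```python
import json

# Hypothetical structured triage case; values are invented.
case = {
    "age": 67, "sex": "F",
    "chief_complaint": "chest pain radiating to left arm",
    "vitals": {"hr": 112, "sbp": 92, "spo2": 94, "temp_c": 37.1},
}

# Instruction-tuning record: prompt + serialized case + target ESI level.
record = {
    "instruction": "Assign an Emergency Severity Index (ESI) level (1-5).",
    "input": json.dumps(case),
    "output": "ESI 2",  # illustrative label for a high-risk presentation
}
print(record["output"])
```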
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Embeddings for SNOMED CT concepts produced by models of FastText trained on different corpora. Each NPZ file encodes a dictionary, which links the ID of a SNOMED CT concept to its corresponding embedding.
Files ft_mimicN_dict.npz contain the embeddings of models trained on subsets of MIMIC-IV, where N denotes the percentage of MIMIC-IV used in training the model, whereas ft_snomed_ct_walks_dict.npz contains the embeddings of a FastText model trained on an artificial corpus obtained by performing walks on SNOMED CT (https://doi.org/10.1016/j.jbi.2023.104297).
These embeddings were generated and studied in the paper Assessing the Effectiveness of Embedding Methods in Capturing Clinical Information from SNOMED CT, and more information can also be found in the following repository: https://github.com/JavierCastellD/AssessingSNOMEDEmbeddings.
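A sketch of working with this release format: an NPZ file encoding a dictionary from concept ID to embedding vector. We build a tiny synthetic file here; the real files (e.g., ft_snomed_ct_walks_dict.npz) can be loaded the same way, though the exact key layout and vector dimensionality are assumptions.

```python
import numpy as np

# Build a tiny synthetic {concept_id: vector} NPZ (IDs and dims invented).
rng = np.random.default_rng(0)
demo = {"22298006": rng.random(300), "38341003": rng.random(300)}
np.savez("demo_dict.npz", **demo)

# Load it back into a plain dict keyed by concept ID.
with np.load("demo_dict.npz") as data:
    emb = {cid: data[cid] for cid in data.files}

# Cosine similarity between two concept embeddings.
a, b = emb["22298006"], emb["38341003"]
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(cos, 3))
```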
License: https://cdla.io/sharing-1-0/
This project uses datasets obtained from MIMIC-III.
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.
This repository contains the code to derive the Hematology Complete Blood Count (CBC) Dataset from the MIMIC-III dataset.
Our objective is to derive a dataset that can be used to predict the disease of a patient based on the CBC values.
I have achieved an accuracy of 0.01 using various machine learning algorithms, which is very low. I am still working to improve it, and I have released the code so that others can contribute or offer suggestions and feedback.
This project can also show you how to derive datasets from MIMIC-III using Pandas.
I have also built a Streamlit app that takes the X parameters as input and predicts the y disease of the patient.
X in ['Hemoglobin', 'Eosinophils', 'Lymphocytes', 'Monocytes', 'Basophils', 'Neutrophils', 'Red Blood Cells', 'White Blood Cells']
y in ['Anemia', 'Leukemia', 'Thrombocytopenia', 'Thrombocytosis', 'Normal', ..., 'Other']
Objectives of this project:
- Derive the Hematology Complete Blood Count (CBC) Dataset from the MIMIC-III dataset. (Done)
- Build a machine learning model to predict the disease of a patient based on the CBC values with at least 0.8 accuracy. (Not done yet; so far I have achieved 0.01 accuracy.)
- Build a Streamlit app to take the X parameters as input and predict the y disease of the patient. (Done)
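To make the X-to-y prediction task concrete, here is a minimal nearest-centroid baseline on synthetic CBC-like data. This is not the project's model; the data, label subset, and classifier choice are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
FEATURES = ['Hemoglobin', 'Eosinophils', 'Lymphocytes', 'Monocytes',
            'Basophils', 'Neutrophils', 'Red Blood Cells', 'White Blood Cells']
CLASSES = ['Anemia', 'Leukemia', 'Normal']  # subset of the y label set

# Synthetic stand-in for the derived CBC table: 8 feature values + a label.
X = rng.normal(size=(300, len(FEATURES)))
y = rng.integers(len(CLASSES), size=300)

# Nearest-centroid baseline: predict the class whose mean CBC profile
# is closest (Euclidean distance) to the patient's values.
centroids = np.stack([X[y == c].mean(axis=0) for c in range(len(CLASSES))])

def predict(x):
    return CLASSES[int(np.argmin(np.linalg.norm(centroids - x, axis=1)))]

print(predict(X[0]))
```

On real CBC data you would replace the synthetic X/y with the derived table and use a stronger classifier.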
👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.
⚡️ Quickstart
1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab
2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab
⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.
1. 📖 Overview
EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks.
2. 💽 Dataset
EHRSHOT is sourced from Stanford’s STARR-OMOP database.
We provide two versions of the dataset.
To access the raw data, please see the "Tables" and "Files" tabs above.
3. 💽 Data Files and Formats
We provide EHRSHOT in two file formats.
Within the "Tables" tab...
1. EHRSHOT-OMOP
* Dataset Version: EHRSHOT-OMOP
* Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different than the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.
Within the "Files" tab...
1. EHRSHOT_ASSETS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: FEMR 0.1.16
* Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.
2. EHRSHOT_MEDS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: MEDS 0.3.3
* Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.
3. EHRSHOT_OMOP_MEDS.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Converts the dataset from EHRSHOT-OMOP into MEDS format via the `meds_etl_omop` command from MEDS-ETL.
4. EHRSHOT_OMOP_MEDS_Reader.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.
4. 🤖 Model
We also release the full weights of **CLMBR-T-base**, a 141M-parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base
5. 🧑‍💻 Code
Please see our Github repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/
**NOTE: You must authenticate to Redivis using your formal affiliation's email address. If you use Gmail or another personal email address, you will not be granted access.**
Access to the EHRSHOT dataset requires the following:
You can query some of the data online there, and a download link is also provided, so you can download it here as well.
Electronic medical records contain multi-format electronic medical data comprising an abundance of medical knowledge. Faced with patients' symptoms, experienced caregivers make the right medical decisions based on professional knowledge that accurately grasps the relationships between symptoms, diagnoses, and treatments. We aim to capture these relationships by constructing a large, high-quality heterogeneous graph linking patients, diseases, and drugs (PDD) in EMRs.
Specifically, we extract important medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) and automatically link them with the existing biomedical knowledge graphs, including ICD-9 ontology and DrugBank. The PDD graph presented is accessible on the Web via the SPARQL endpoint, and provides a pathway for medical discovery and applications, such as effective treatment recommendations.
A subgraph of PDD is illustrated in the following figure to better understand the PDD graph.
<img src="https://github.com/wangmengsd/pdd-graph/raw/master/example.png" alt="enter image description here">
The dataset belongs to Meng Wang, Jiaheng Zhang, Jun Liu, Wei Hu, Sen Wang, Wenqiang Liu, and Lei Shi.
They come from: 1. MOEKLINNS Lab, Xi’an Jiaotong University, Xi’an, China 2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 3. Griffith University, Gold Coast Campus, Australia
Contact emails: - Meng Wang: wangmengsd@stu.xjtu.edu.cn - Lei Shi: xjtushilei@foxmail.com - Jun Liu: liukeen@xjtu.edu.cn
The paper is under review and cannot yet be disclosed, so it is not linked here.
If you have any questions or suggestions, please send them to one of the email addresses above.
This work is licensed under a Creative Commons Attribution 4.0 International License.
### If your article needs to reference our work, you can cite our GitHub repository.
The Robomimic proficient human datasets were collected by 1 proficient operator using the RoboTurk platform (with the exception of Transport, which had 2 proficient operators working together). Each dataset consists of 200 successful trajectories.
Each task has two versions: one with low-dimensional observations (low_dim), and one with images (image).
The datasets follow the RLDS format to represent steps and episodes.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('robomimic_ph', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains part 5/7 of the full dataset used for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy".
This dataset comprises 3 years of normalized hourly data for both low-resolution predictors [16 km] and high-resolution target variables [2km] (2mT and 10-m U and V), from 2012-2014. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.
To use the data, clone the corresponding repository, unzip this zip file in the data folder, and download the other parts of the dataset (listed in the related works) from Zenodo.
The objective of this Bioengineering Research Partnership is to focus the resources of a powerful interdisciplinary team from academia (MIT), industry (Philips Medical Systems), and clinical medicine (Beth Israel Deaconess Medical Center) on developing and evaluating advanced ICU patient monitoring systems that will substantially improve the efficiency, accuracy, and timeliness of clinical decision making in intensive care.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: The current dataset provides all the stimuli (folder ../01-Stimuli/), raw data (folder ../02-Raw-data/) and post-processed data (../03-Post-proc-data/) used in the Forum Acusticum 2013 paper titled "Using auditory models to mimic human listeners in reverse correlation experiments from the fastACI toolbox" by the same authors. In this paper, we replicated the tone-in-noise experiment by Ahumada et al. (1975), but using an artificial listener instead of collecting data from real participants. The behavioural data were mimicked using an artificial listener based on 'king2019' (King et al., 2019) as a front-end model, using a template-matching decision to indicate whether a 500-Hz tone was (or was not) present in each of the noisy trials. This study offers a step-by-step guide to how an artificial listener can be integrated into fastACI.
Use these data: download all these data and place them in a local directory on your computer. If you have MATLAB and have downloaded a local copy of the fastACI toolbox (open access at: https://github.com/aosses-tue/fastACI), you can recreate the figures of our paper. After downloading and initialising the toolbox (type 'startup_fastACI;', without quotation marks, in MATLAB), run the script g20230501_FA_Artificial_listener_paper_figs.m (provided in this dataset) and follow the instructions on the screen to generate one of the four study figures. This script calls the function publ_osses2023b_FA_figs.m from the toolbox.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Natural Language Inference for Patient Recruitment (NLI4PR)
Dataset Description
Links
Homepage: Github.io
Repository: Github
Paper: arXiv
Contact (Original Authors): Mathilde Aguiar (mathilde.aguiar@lisn.fr)
Contact (Curator): Artur Guimarães (artur.guimas@gmail.com)
Dataset Summary
MedQA is a large-scale multiple-choice question-answering dataset designed to mimic the style of professional medical board exams, particularly the USMLE… See the full description on the dataset page: https://huggingface.co/datasets/araag2/NLI4PR.
License: https://creativecommons.org/publicdomain/zero/1.0/
CheXmask Database: a large-scale dataset of anatomical segmentation masks for chest x-ray images
This particular dataset contains only the segmentations for the NIH dataset, ChestX-ray8.
Nicolas Gaggion , Candelaria Mosquera , Martina Aineseder , Lucas Mansilla , Diego Milone , Enzo Ferrante
Published: March 1, 2024. Version: 0.4
This dataset was downloaded from PhysioNet (https://physionet.org/content/chexmask-cxr-segmentation-data/0.4/OriginalResolution/#files-panel). PhysioNet is a repository of freely-available medical research data, managed by the MIT Laboratory for Computational Physiology. Supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362.
Abstract The CheXmask Database presents a comprehensive, uniformly annotated collection of chest radiographs, constructed from five public databases: ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest and VinDr-CXR. The database aggregates 657,566 anatomical segmentation masks derived from images which have been processed using the HybridGNet model to ensure consistent, high-quality segmentation. To confirm the quality of the segmentations, we include in this database individual Reverse Classification Accuracy (RCA) scores for each of the segmentation masks. This dataset is intended to catalyze further innovation and refinement in the field of semantic chest X-ray analysis, offering a significant resource for researchers in the medical imaging domain.
Ethics All publicly available datasets utilized in this study adhered to strict ethical standards and underwent thorough anonymization, with identifiable details removed. The study does not release any part of the original image datasets; it only provides already anonymized image identifiers to allow researchers to match the original images with our annotations. MIMIC-CXR-JPG dataset required additional ethics training and research courses for access. The study authors fulfilled all ethics courses and data use agreement requirements to ensure ethical data usage.
Conflicts of Interest The authors have no conflict of interests to declare.
References Wang R, Chen LC, Moukheiber L, Seastedt KP, Moukheiber M, Moukheiber D, Zaiman Z, Moukheiber S, Litchman T, Trivedi H, Steinberg R. Enabling chronic obstructive pulmonary disease diagnosis through chest X-rays: A multi-site and multi-modality study. International Journal of Medical Informatics. 2023 Oct 1;178:105211. Gaggion N, Mansilla L, Mosquera C, Milone DH, Ferrante E. Improving anatomical plausibility in medical image segmentation via hybrid graph neural networks: applications to chest x-ray analysis. IEEE Trans Med Imaging. 2022. doi:10.1109/TMI.2022.3224660. Wang X, et al. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. Irvin J, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proceedings of the AAAI conference on artificial intelligence. 2019;33(01). Johnson AE, et al. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint. 2019. arXiv:1901.07042. Bustos A, et al. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Med Image Anal. 2020;66:101797. Nguyen HQ, et al. VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. Sci Data. 2022;9(1):429. Valindria VV, et al. Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE Trans Med Imaging. 2017;36:1597–1606. Gaggion N, Vakalopoulou M, Milone DH, Ferrante E. Multi-center anatomical segmentation with heterogeneous labels via landmark-based models. In: 20th IEEE International Symposium on Biomedical Imaging (ISBI). IEEE; 2023. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A, editors. 
Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Cham: Springer; 2015. p. 234-241. (Lecture Notes in Computer Science; vol 9351). Gaggion N. Chest-xray-landmark-dataset [Internet]. GitHub repository. Available from: https://github.com/ngaggion/Chest-xray-landmark-dataset. [Accessed 6/27/2023]