Problem Statement
Healthcare providers often rely on generalized treatment protocols that may not address the unique needs of individual patients. This approach leads to variability in treatment outcomes, reduced efficacy, and limited patient satisfaction. A leading hospital sought a solution to develop personalized treatment plans tailored to each patient’s medical history, genetic profile, and current health status.
Challenge
Implementing a personalized healthcare treatment system involved overcoming the following challenges:
Integrating diverse patient data, including medical history, lab results, genetic information, and lifestyle factors.
Developing predictive models capable of identifying optimal treatment plans for individual patients.
Ensuring compliance with privacy regulations and maintaining data security throughout the process.
Solution Provided
An advanced healthcare treatment recommendation system was developed using machine learning models and predictive analytics. The solution was designed to:
Analyze patient data to identify patterns and predict treatment outcomes.
Recommend individualized treatment plans optimized for efficacy and patient preferences.
Continuously learn and adapt to improve recommendations based on new medical insights and patient feedback.
Development Steps
Data Collection
Aggregated data from electronic health records (EHR), genetic testing reports, and patient-provided health information.
Preprocessing
Standardized and anonymized data to ensure accuracy, consistency, and compliance with healthcare privacy regulations.
Model Development
Trained machine learning models to identify correlations between patient characteristics and treatment outcomes. Developed predictive algorithms to recommend personalized treatment plans for conditions like chronic diseases, cancer, and rare disorders.
Validation
Tested the system on historical patient data to evaluate its accuracy in predicting successful treatment outcomes.
Deployment
Integrated the solution into the hospital’s clinical decision support systems, enabling healthcare providers to access personalized treatment recommendations during consultations.
Continuous Monitoring & Improvement
Established a feedback mechanism to refine models using real-world treatment outcomes and patient satisfaction data.
Results
Improved Patient Outcomes
The system delivered personalized treatment recommendations that significantly improved recovery rates and health outcomes.
Increased Treatment Efficacy
Optimized treatment plans reduced trial-and-error approaches, leading to more effective interventions and fewer side effects.
Personalized Healthcare Experiences
Patients reported higher satisfaction levels due to treatment plans tailored to their individual needs and preferences.
Enhanced Decision-Making
Healthcare providers benefited from data-driven insights, enabling more informed and confident decisions.
Scalable and Future-Ready Solution
The system scaled seamlessly to support diverse medical specialties and adapted to incorporate emerging medical research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This page shares the technical validation datasets used to evaluate a Large Dataset of Annotated Incident Reports on Medication Errors and its machine annotator. The files contained in this repository include the IFMIR gold standard dataset (CrossValid_IFMIR_522.xlsx), randomly sampled labeled incident reports from 2010–2020 (InternalValid_JQ2010-20_40.xlsx), randomly sampled labeled incident reports from 2021 (ExternalValid_JQ2021_20.xlsx), and error-free reports (Error_analysis.xlsx).
To use any of these datasets, one should also cite this original data source: Medical Adverse Event Information Collection Project [Iryō jiko jōhō shūshū-tō jigyō]  Japan Council for Quality Health Care; 2022 [Available from: https://www.med-safe.jp/index.html.]
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The researcher tests the QA capability of ChatGPT in the medical field from the following aspects:
1. Test its reserve of medical knowledge
2. Check its ability to read and understand medical literature
3. Test its ability to assist with diagnosis after reading case data
4. Test its error-correction ability for case data
5. Test its ability to standardize medical terms
6. Test its ability to evaluate experts
7. Check its ability to evaluate medical institutions
The conclusions are:
- ChatGPT has great potential in medical and health care applications, and may directly replace human beings, or even professionals, at a certain level in some fields.
- The researcher preliminarily believes that ChatGPT has basic medical knowledge and the ability to conduct multiple rounds of dialogue, and that its ability to understand Chinese is not weak.
- ChatGPT has the ability to read, understand, and correct cases.
- ChatGPT has the ability to extract information and standardize terminology, and performs very well at both.
- ChatGPT has the ability to reason over medical knowledge.
- ChatGPT has the ability to learn continuously. After continuous training, its level improved significantly.
- ChatGPT does not have the ability to evaluate Chinese medical talent academically, and the results are not ideal.
- ChatGPT does not have the ability to evaluate Chinese medical institutions academically, and the results are not ideal.
- ChatGPT is an epoch-making product that can become a useful assistant for medical diagnosis and treatment, knowledge services, literature reading, reviewing, and paper writing.
The CarePrecise U.S. HCP/HCO Collection Dataset includes deep data on all 6.7 million U.S. HIPAA-covered healthcare practitioners and organizations. Monthly full updates. Includes linkages between the individual practitioners and their practice groups, hospitals, and hospital systems. Licensing plans are available for basic (internal use), derivative products, and redistribution. Data updates are delivered quarterly or monthly to suit customer need; FTP push is available, standard delivery is via CDN. Single download for evaluation is available. CarePrecise is a leader in the fields of HCP/HCO data, supplying provider data to the industry since 2008. Note regarding pricing: The Collection price shown in Pricing is separate from email addresses. Email addresses are priced as low as $0.075 per, based on volume. Pricing shown is without derivative product (DP) licensing for use in web applications; DP license ranges in price from $1,900/year to $9,000/year on top of data purchase, based on application and overall exposure estimate. DP license is sold in two-year term and requires a license agreement.
LLM Health Benchmarks Dataset
The Health Benchmarks Dataset is a specialized resource for evaluating large language models (LLMs) in different medical specialties. It provides structured question-answer pairs designed to test the performance of AI models in understanding and generating domain-specific knowledge.
Primary Purpose
This dataset is built to:
- Benchmark LLMs in medical specialties and subfields.
- Assess the accuracy and contextual understanding of AI in healthcare.
- Serve as a standardized evaluation suite for AI systems designed for medical applications.
Key Features
- Covers 50+ medical and health-related topics, including both clinical and non-clinical domains.
- Includes ~7,500 structured question-answer pairs.
- Designed for fine-grained performance evaluation in medical specialties.
Applications
- LLM Evaluation: Benchmarking AI models for domain-specific performance.
- Healthcare AI Research: Standardized testing for AI in healthcare.
- Medical Education AI: Testing AI systems designed for tutoring medical students.
Dataset Structure
The dataset is organized by medical specialties and subfields, each represented as a split. Below is a snapshot:
Specialty | Number of Rows |
---|---|
Lab Medicine | 158 |
Ethics | 174 |
Dermatology | 170 |
Gastroenterology | 163 |
Internal Medicine | 178 |
Oncology | 180 |
Orthopedics | 177 |
General Surgery | 178 |
Pediatrics | 180 |
...(and more) | ... |
Each split contains:
- Questions: The medical questions for the specialty.
- Answers: Corresponding high-quality answers.
Usage Instructions
Here’s how you can load and use the dataset:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("yesilhealth/Health_Benchmarks")

# Access specific specialty splits
oncology = dataset["Oncology"]
internal_medicine = dataset["Internal_Medicine"]

# View sample data
print(oncology[:5])
Evaluation Workflow
1. Model Input: Provide the questions from each split to the LLM.
2. Model Output: Collect the AI-generated answers.
3. Scoring: Compare model answers to ground truth answers using metrics such as:
- Exact Match (EM)
- F1 Score
- Semantic Similarity
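As a rough illustration of the scoring step, the sketch below computes Exact Match and a token-level F1 in plain Python. The normalization rules (lowercasing, stripping punctuation) and all function names are assumptions made for illustration and are not part of the dataset's own tooling. Semantic similarity is left out because it depends on an embedding model of your choice.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Average EM and F1 over paired model answers and ground-truth answers."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    n = len(predictions)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    f1 = sum(token_f1(p, r) for p, r in zip(predictions, references)) / n
    return {"exact_match": em, "f1": f1}
```

Exact Match is strict and tends to under-credit long free-text answers, so token-level F1 is usually reported alongside it.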
Citation
If you use this dataset for research or development, please cite:
@dataset{yesilhealth_health_benchmarks,
  title={Health Benchmarks Dataset},
  author={Yesil Health AI},
  year={2024},
  url={https://huggingface.co/datasets/yesilhealth/Health_Benchmarks}
}
License
This dataset is licensed under the Apache 2.0 License.
Feedback
For questions, suggestions, or feedback, feel free to contact us via email at hello@yesilhealth.com.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
"'https://www.nature.com/articles/s41597-022-01721-8'">MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification https://www.nature.com/articles/s41597-022-01721-8
A large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. The dataset providers benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools.
Figure (MedMNIST landscape): The horizontal axis denotes the base-10 logarithm of the dataset scale, and the vertical axis denotes the base-10 logarithm of imaging resolution. The upward and downward triangles distinguish between 2D datasets and 3D datasets, and the 4 different colors represent different tasks.
Diverse: It covers diverse data modalities, dataset scales (from 100 to 100,000), and tasks (binary/multi-class, multi-label, and ordinal regression). It is as diverse as the VDD and MSD to fairly evaluate the generalizable performance of machine learning algorithms in different settings, but both 2D and 3D biomedical images are provided.
Standardized: Each sub-dataset is pre-processed into the same format, which requires no background knowledge for users. As an MNIST-like dataset collection to perform classification tasks on small images, it primarily focuses on the machine learning part rather than the end-to-end system. Furthermore, we provide standard train-validation-test splits for all datasets in MedMNIST, therefore algorithms could be easily compared.
User-Friendly: The small size of 28×28 (2D) or 28×28×28 (3D) is lightweight and ideal for evaluating machine learning algorithms. We also offer a larger-size version, MedMNIST+: 64x64 (2D), 128x128 (2D), 224x224 (2D), and 64x64x64 (3D). Serving as a complement to the 28-size MedMNIST, this could be a standardized resource for developing medical foundation models. All these datasets are accessible via the same API.
Educational: As an interdisciplinary research area, biomedical image analysis is difficult for researchers from other communities to get started in, as it requires background knowledge from computer vision, machine learning, biomedical imaging, and clinical science. Our data with the Creative Commons (CC) License is easy to use for educational purposes.
Refer to the paper to learn more about data : https://www.nature.com/articles/s41597-022-01721-8
Github Page: https://github.com/MedMNIST/MedMNIST
My Kaggle Starter Notebook: https://www.kaggle.com/code/arashnic/medmnist-download-and-use-data?scriptVersionId=161421937
Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni
Shanghai Jiao Tong University, Shanghai, China; Boston College, Chestnut Hill, MA; RWTH Aachen University, Aachen, Germany; Fudan Institute of Metabolic Diseases, Zhongshan Hospital, Fudan University, Shanghai, China; Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Harvard University, Cambridge, MA
The code is under Apache-2.0 License.
The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)...
Problem Statement
Hospitals and healthcare providers faced challenges in ensuring continuous monitoring of patient vitals, especially for high-risk patients. Traditional monitoring methods often lacked real-time data processing and timely alerts, leading to delayed responses and increased hospital readmissions. The healthcare provider needed a solution to monitor patient health continuously and deliver actionable insights for improved care.
Challenge
Implementing an advanced patient monitoring system involved overcoming several challenges:
Collecting and analyzing real-time data from multiple IoT-enabled medical devices.
Ensuring accurate health insights while minimizing false alarms.
Integrating the system seamlessly with hospital workflows and electronic health records (EHR).
Solution Provided
A comprehensive patient monitoring system was developed using IoT-enabled medical devices and AI-based monitoring systems. The solution was designed to:
Continuously collect patient vital data such as heart rate, blood pressure, oxygen levels, and temperature.
Analyze data in real-time to detect anomalies and provide early warnings for potential health issues.
Send alerts to healthcare professionals and caregivers for timely interventions.
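To make the real-time anomaly detection idea concrete, here is a minimal sketch that flags a vital-sign reading when it deviates strongly from a rolling window of recent readings. It swaps in a simple rolling z-score rule for the machine learning models the case study mentions. The class name, window size, and threshold are illustrative assumptions, not details of the deployed system.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flags readings far from the mean of a recent window (z-score rule)."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.history = deque(maxlen=window)  # recent non-anomalous readings
        self.threshold = threshold           # z-score above which we alert
        self.min_samples = min_samples       # readings needed before alerting

    def update(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the recent window."""
        is_anomaly = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # A perfectly flat window (sigma == 0) never triggers an alert here.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Keep flagged readings out of the baseline so one spike
            # does not inflate the window's spread.
            self.history.append(value)
        return is_anomaly


# Hypothetical heart-rate stream in beats per minute.
detector = RollingAnomalyDetector(window=10, threshold=3.0, min_samples=5)
readings = [72, 74, 73, 75, 71, 74, 140, 73]
flags = [detector.update(r) for r in readings]
```

Excluding flagged readings from the baseline keeps a single spike from masking later ones, at the cost of adapting slowly when a patient's baseline genuinely shifts. A production system would need to handle that case and tune the threshold to balance sensitivity against false alarms.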
Development Steps
Data Collection
Deployed IoT-enabled devices such as wearable monitors, smart sensors, and bedside equipment to collect patient data continuously.
Preprocessing
Cleaned and standardized data streams to ensure accurate analysis and integration with hospital systems.
AI Model Development
Built machine learning models to analyze vital trends and detect abnormalities in real-time.
Validation
Tested the system in controlled environments to ensure accuracy and reliability in detecting health issues.
Deployment
Implemented the solution in hospitals and care facilities, integrating it with EHR systems and alert mechanisms for seamless operation.
Continuous Monitoring & Improvement
Established a feedback loop to refine models and algorithms based on real-world data and healthcare provider feedback.
Results
Enhanced Patient Care
Real-time monitoring and proactive alerts enabled healthcare professionals to provide timely interventions, improving patient outcomes.
Early Detection of Health Issues
The system detected potential health complications early, reducing the severity of conditions and preventing critical events.
Reduced Hospital Readmissions
Continuous monitoring helped manage patient health effectively, leading to a significant decrease in readmission rates.
Improved Operational Efficiency
Automation and real-time insights reduced the burden on healthcare staff, allowing them to focus on critical cases.
Scalable Solution
The system adapted seamlessly to various healthcare settings, including hospitals, clinics, and home care environments.
👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.
⚡️Quickstart
1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab
2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab
⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.
1. 📖 Overview
EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks. The dataset contains:
2. 💽 Dataset
EHRSHOT is sourced from Stanford’s STARR-OMOP database.
We provide two versions of the dataset:
To access the raw data, please see the "Tables" and "Files" tabs above:
3. 💽 Data Files and Formats
We provide EHRSHOT in two file formats:
Within the "Tables" tab...
1. EHRSHOT-OMOP
* Dataset Version: EHRSHOT-OMOP
* Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different than the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.
Within the "Files" tab...
1. EHRSHOT_ASSETS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: FEMR 0.1.16
* Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.
2. EHRSHOT_MEDS.zip
* Dataset Version: EHRSHOT-Original
* Data Format: MEDS 0.3.3
* Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.
3. EHRSHOT_OMOP_MEDS.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Converts the dataset from EHRSHOT-OMOP into MEDS format via the `meds_etl_omop` command from MEDS-ETL.
4. EHRSHOT_OMOP_MEDS_Reader.zip
* Dataset Version: EHRSHOT-OMOP
* Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8
* Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.
4. 🤖 Model
We also release the full weights of **CLMBR-T-base**, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base
5. 🧑‍💻 Code
Please see our Github repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/
**NOTE: You must authenticate to Redivis using your formal affiliation's email address. If you use gmail or other personal email addresses, you will not be granted access.**
Access to the EHRSHOT dataset requires the following:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public health insurance coverage in India before and after PM-JAY: repeated cross-sectional analysis of nationally representative survey data
The National Family Health Survey (NFHS), India, is a publicly available data set that can be accessed on request. It can be downloaded after registration on the Demographic and Health Survey (DHS) website at The DHS Program - Request Access To Datasets. We used data from the fourth and fifth rounds of NFHS, which can be accessed after registration at https://dhsprogram.com/data/dataset/India_Standard-DHS_2015.cfm?flag=0 (NFHS 4) and https://dhsprogram.com/data/dataset/India_Standard-DHS_2020.cfm?flag=0 (NFHS 5). These datasets (HR file) were used to obtain the combined dataset for the paper entitled "Public health insurance coverage in India before and after PM-JAY: repeated cross-sectional analysis of nationally representative survey data", submitted to BMJ Global Health in August 2023.
https://fair.healthdata.be/dataset/12d69eca-4449-47d2-943d-e4448a467292
The MZG is a registration with which all non-psychiatric hospitals in Belgium must make their (anonymised) administrative, medical and nursing data available to the Federal Public Service (FPS) Public Health. The aim of the MZG is to support the government's health policy by
The MZG also aims to support the health policy of hospitals by providing national and individual feedback, so that a hospital can compare itself with other hospitals and adapt its internal policy.
All reports can be found here (in French/Dutch).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table presents the data extraction from the 99 studies included according to the criteria outlined in the main manuscript. It is provided as supplementary material to enhance the readability of the paper while ensuring that all relevant information is preserved and accessible without loss of detail.
The names of the variables and their descriptions are provided in the attached file, along with the following details:
Variable | Sub-variable | Description |
---|---|---|
Ref. | | The citation in the format: First author et al. [Year] (e.g., AuthorA et al. [2022]). This identifies the study's primary citation for easy reference. |
Title | | The title of the paper. |
Standard | | The healthcare data standard used in the study. Possible values are: OMOP, OpenEHR, FHIR. |
Study Location | | The country where the study was conducted. |
Objective for using the standard | Detailed | A comprehensive explanation of the specific objective of using the standard in the study, describing how it supports the study’s goals. |
 | Short | The primary purpose for applying the healthcare standard. Possible values are: Secondary data reuse, Data exchange, Clinical decision support, Vocabulary definition, EHR system design. |
Application domain | Type | The type of application domain in which the healthcare standard is used. Possible values are: Clinical (studies with a direct impact on clinical practice, applying established tools or methods in healthcare settings, e.g., predicting in-hospital mortality for heart attack patients) and Research (studies proposing innovative tools, methodologies, or frameworks still in the design/testing phase, not yet clinically implemented). |
Healthcare Area | | The relevant healthcare domain for the study, such as Cardiovascular, Intensive Care Unit, Emergency Department, Oncology, Biology, etc. |
Cluster | | The healthcare domain, clustered for easier readability. Possible values include: Clinical Medicine, Clinical Services and Diagnostics, Public Health, Health Information Management and Biomedical Sciences. |
Use | | Whether the results of the paper serve a Primary use (direct care) or a Secondary use (repurposing existing data or tools for new objectives). |
Scale | | The scale of the study. Possible values are: Single center (one hospital/clinic), Multi-center (multiple institutions), Regional (specific region), National level (countrywide). |
Dataset magnitude in patients | | The magnitude of the dataset, expressed as a letter category. Possible values are: A (<10 to 99), B (100 to 9,999), C (10,000 to 999,999) and D (1,000,000 and above). |
N° Elements | | The number of input variables in the standardization process. |
Percentage of mapped variables | | The percentage of variables successfully standardized. |
Coverage of the standard | | The standardization methodology, and whether the standard was adapted or not. |
ETL Tools | Data cleaning & extraction | The tools adopted for supporting data cleaning and extraction. |
 | Mapping | The tools adopted for the mapping of the variables. |
 | Validation | The tools adopted for the validation of the standardization process. |
Database | | The database adopted for storing the result of the healthcare data standardization. |
Process efficiency and Economic assessment | | Information about the economic impact, if the consequences are concrete and measured by the authors (e.g., actual cost savings, resource usage reductions). If the authors did not measure the economic impact, this field remains blank. |
Comments by authors | Limitations | The significant limitations or challenges faced during the study concerning the adopted standard, such as issues with data compatibility, scalability, or the need for customization. |
 | Advantages | The benefits of applying the standard model, such as improved data consistency, enhanced clinical outcomes, better interoperability, or more efficient workflows. |
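The "Dataset magnitude in patients" letter categories can be derived from a raw patient count. The helper below is a small sketch that assumes the bin edges exactly as listed in the variable description. The function name is hypothetical and is not part of the attached file.

```python
def magnitude_category(n_patients: int) -> str:
    """Map a patient count to the letter category used in the extraction table.

    A: fewer than 100, B: 100 to 9,999, C: 10,000 to 999,999, D: 1,000,000 and above.
    """
    if n_patients < 0:
        raise ValueError("patient count cannot be negative")
    if n_patients < 100:
        return "A"
    if n_patients < 10_000:
        return "B"
    if n_patients < 1_000_000:
        return "C"
    return "D"
```

For example, a single-center study with 6,739 patients would fall into category B.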
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
We introduce the PolyMed dataset, designed to address the limitations of existing medical case data for Automatic Diagnosis Systems (ADS). ADS assists doctors by predicting diseases based on patients' basic information, such as age, gender, and symptoms. However, these systems face challenges due to imbalanced disease label data and difficulties in accessing or collecting medical data. To tackle these issues, the PolyMed dataset has been developed to improve the evaluation of ADS by incorporating medical knowledge graph data and diagnosis case data. The dataset aims to provide comprehensive evaluation, include diverse disease information, effectively utilize external knowledge, and perform tasks closer to real-world scenarios.
We have also made the data collection tools publicly available to enable researchers and other interested parties to contribute additional data in a standardized format. These tools feature a range of customizable input fields that can be selectively utilized according to the user's specific requirements, ensuring consistency and professionalism in the data collection process.
All training and testing code for our data is available at https://github.com/krchanyang/PolyMed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 1 row and is filtered where the book is The political economy of universal healthcare in Africa : evidence from Ghana. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
The Global Health Expenditure Database (GHED) provides internationally comparable data on health spending for close to 190 countries. The database is open access and supports the goal of Universal Health Coverage by helping monitor the availability of resources for health and the extent to which they are used efficiently and equitably. This, in turn, helps ensure health services are available and affordable when people need them...WHO works collaboratively with Member States and updates the database annually using available data such as government budgets and health accounts studies. Where necessary, modifications and estimates are made to ensure the comprehensiveness and consistency of the data across countries and years. GHED is the source of the health expenditure data republished by the World Bank and the WHO Global Health Observatory. (from website)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data repository for MedMNIST v1 is out of date! Please check the latest version of MedMNIST v2.
Abstract
We present MedMNIST, a collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28x28 images, which requires no background knowledge. Covering the primary data modalities in medical image analysis, it is diverse in data scale (from 100 to 100,000) and tasks (binary/multi-class, ordinal regression and multi-label). MedMNIST could be used for educational purposes, rapid prototyping, multi-modal machine learning or AutoML in medical image analysis. Moreover, the MedMNIST Classification Decathlon is designed to benchmark AutoML algorithms on all 10 datasets. We have compared several baseline methods, including open-source or commercial AutoML tools. The datasets, evaluation code and baseline methods for MedMNIST are publicly available at https://medmnist.github.io/.
Please note that this dataset is NOT intended for clinical use.
We recommend our official code to download, parse and use the MedMNIST dataset:
pip install medmnist
Citation and Licenses
If you find this project useful, please cite our ISBI'21 paper as: Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis," arXiv preprint arXiv:2010.14925, 2020.
or using bibtex: @article{medmnist, title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis}, author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing}, journal={arXiv preprint arXiv:2010.14925}, year={2020} }
Besides, please cite the corresponding paper if you use any subset of MedMNIST. Each subset uses the same license as that of the source dataset.
PathMNIST
Jakob Nikolas Kather, Johannes Krisam, et al., "Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study," PLOS Medicine, vol. 16, no. 1, pp. 1–22, 01 2019.
License: CC BY 4.0
ChestMNIST
Xiaosong Wang, Yifan Peng, et al., "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases," in CVPR, 2017, pp. 3462–3471.
License: CC0 1.0
DermaMNIST
Philipp Tschandl, Cliff Rosendahl, and Harald Kittler, "The ham10000 dataset, a large collection of multisource dermatoscopic images of common pigmented skin lesions," Scientific data, vol. 5, pp. 180161, 2018.
Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, Harald Kittler, and Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018; arXiv:1902.03368.
License: CC BY-NC 4.0
OCTMNIST/PneumoniaMNIST
Daniel S. Kermany, Michael Goldbaum, et al., "Identifying medical diagnoses and treatable diseases by image-based deep learning," Cell, vol. 172, no. 5, pp. 1122 – 1131.e9, 2018.
License: CC BY 4.0
RetinaMNIST
DeepDR Diabetic Retinopathy Image Dataset (DeepDRiD), "The 2nd diabetic retinopathy – grading and image quality estimation challenge," https://isbi.deepdr.org/data.html, 2020.
License: CC BY 4.0
BreastMNIST
Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy, "Dataset of breast ultrasound images," Data in Brief, vol. 28, pp. 104863, 2020.
License: CC BY 4.0
OrganMNIST_{Axial,Coronal,Sagittal}
Patrick Bilic, Patrick Ferdinand Christ, et al., "The liver tumor segmentation benchmark (lits)," arXiv preprint arXiv:1901.04056, 2019.
Xuanang Xu, Fugen Zhou, et al., "Efficient multiple organ localization in ct image using 3d region proposal network," IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.
License: CC BY 4.0
MedAlign is a benchmark dataset of 983 clinician-curated natural language instructions for EHR data, grounded by 275 longitudinal EHRs. It includes reference responses for 303 instructions and supports evaluation of LLMs on healthcare-specific tasks.
**IMPORTANT USAGE NOTE:** MedAlign only includes test set examples. No training examples are provided for fine-tuning models.
1. Overview
MedAlign is a longitudinal EHR benchmark for instruction-following with LLMs. The dataset includes:
2. EHR Data
EHR data are sourced from Stanford’s STARR-OMOP database. Data are standardized in the OMOP CDM schema and scrubbed of identifying protected health information (PHI). Complete technical details are included in the paper, but key highlights:
3. Instruction Following Benchmark
See "medalign_instructions_responses_v1_2.zip" for instructions, responses, and EHR text timelines.
Please see our GitHub repo to obtain code for loading the dataset.
Access to the MedAlign dataset requires the following:
**These data must remain on your encrypted machine. Redistribution of data is FORBIDDEN and will result in immediate termination of access privileges.**
IMPORTANT NOTES:
Please allow 7-10 business days to process applications.
In 2020, the Washington State Legislature enacted Engrossed Substitute Senate Bill (ESSB) 6404 (Chapter 316, Laws of 2020, codified at RCW 48.43.0161), which requires that health carriers with at least one percent of the market share in Washington State annually report certain aggregated and de-identified data related to prior authorization to the Office of the Insurance Commissioner (OIC). Prior authorization is a utilization review tool used by carriers to review the medical necessity of requested health care services for specific health plan enrollees. Carriers choose the services that are subject to prior authorization review.
The reported data includes prior authorization information for the following categories of health services:
• Inpatient medical/surgical
• Outpatient medical/surgical
• Inpatient mental health and substance use disorder
• Outpatient mental health and substance use disorder
• Diabetes supplies and equipment
• Durable medical equipment
The carriers must report the following information for the prior plan year (PY) for their individual and group health plans for each category of services:
• The 10 codes with the highest number of prior authorization requests and the percent of approved requests.
• The 10 codes with the highest percentage of approved prior authorization requests and the total number of requests.
• The 10 codes with the highest percentage of prior authorization requests that were initially denied and then approved on appeal and the total number of such requests.
Carriers also must include the average response time in hours for prior authorization requests and the number of requests for each covered service in the lists above for:
• Expedited decisions.
• Standard decisions.
• Extenuating-circumstances decisions.
Engrossed Second Substitute House Bill 1357 added additional prescription drug prior authorization reporting requirements for health carriers beginning in reporting year 2024.
Carriers were provided the opportunity to submit voluntary prescription drug prior authorization data for the 2023 reporting period. Prescription drug reporting was required for the 2024 reporting period.
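The three top-10 lists described above can be illustrated with a minimal Python sketch. It is not the OIC's reporting method. The record layout (the keys `code`, `approved`, and `appeal_overturned`) and the function name `top10_reports` are hypothetical, and the sketch assumes one record per prior authorization request, with percentages taken over all requests for a code.

```python
from collections import defaultdict


def top10_reports(requests):
    """Build three top-10 lists from per-request records.

    Each record is a dict with hypothetical keys:
      'code'              service code (str)
      'approved'          True if the request was approved
      'appeal_overturned' True if initially denied, then approved on appeal
    """
    stats = defaultdict(lambda: {"total": 0, "approved": 0, "overturned": 0})
    for r in requests:
        s = stats[r["code"]]
        s["total"] += 1
        s["approved"] += bool(r["approved"])
        s["overturned"] += bool(r["appeal_overturned"])

    rows = [
        {
            "code": code,
            "total": s["total"],
            "pct_approved": 100.0 * s["approved"] / s["total"],
            "pct_overturned": 100.0 * s["overturned"] / s["total"],
        }
        for code, s in stats.items()
    ]

    def top(key):
        # Sort descending by the metric; break ties by code for stable output.
        return sorted(rows, key=lambda row: (-row[key], row["code"]))[:10]

    return {
        "most_requested": top("total"),
        "highest_approval": top("pct_approved"),
        "most_overturned": top("pct_overturned"),
    }
```

Each returned row carries both the ranking metric and the request count, so a single pass over the rows yields the paired figures (for example, request volume together with percent approved) that each list calls for.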
The data package contains NPI-related datasets: the NPI numbers of all covered health care professionals, the deactivated NPIs, and the different codes used within the NPI dataset.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
🚑 Clinical Field Mappings for Healthcare Systems
This synthetic dataset provides a wide variety of alternative names for clinical database fields, mapping them to standardized targets for healthcare data normalization.
Using LLMs, we generated and validated thousands of plausible variations, including misspellings, abbreviations, country-specific nuances, and common real-world typos.
This dataset is perfect for training models that need to standardize, clean, or map heterogeneous healthcare data schemas into unified, normalized formats.
✅ Applications include:
- Data cleaning and ETL pipelines for clinical databases
- Fine-tuning LLMs for schema matching
- Clinical data interoperability projects
- Zero-shot field matching research
The dataset is machine-generated and validated with LLM feedback loops to ensure high-quality mappings.
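As a rough illustration of the field-normalization task this dataset targets, the sketch below maps a raw column name to a standardized target, first by exact lookup and then by fuzzy matching with Python's standard `difflib`. The variant table, target names, and the helper `map_field` are hypothetical examples and are not taken from the dataset. A model fine-tuned on the dataset would take the place of the fuzzy-matching step.

```python
import difflib
import re

# Hypothetical variant -> standardized target table (not from the dataset).
KNOWN_VARIANTS = {
    "dob": "date_of_birth",
    "birth_date": "date_of_birth",
    "date_of_birth": "date_of_birth",
    "pt_name": "patient_name",
    "patient_name": "patient_name",
    "bp_systolic": "systolic_blood_pressure",
    "systolic_blood_pressure": "systolic_blood_pressure",
}


def normalize(name):
    """Lower-case and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def map_field(name, cutoff=0.8):
    """Return the standardized target for a raw field name, or None."""
    key = normalize(name)
    if key in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[key]
    # Fall back to the closest known variant to absorb typos.
    close = difflib.get_close_matches(key, list(KNOWN_VARIANTS), n=1, cutoff=cutoff)
    return KNOWN_VARIANTS[close[0]] if close else None
```

For example, `map_field("Patient Name")` resolves through the exact lookup after normalization, while a transposition typo such as `"patinet_name"` is caught by the fuzzy fallback. Names with no sufficiently close variant return `None` rather than a guess.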
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The Canadian Clinical Drug Data Set (CCDD) is a drug terminology and coding system designed to allow the interchange of standardized drug and medical device information between diverse digital health systems. Use cases include electronic prescribing, electronic medical records, medication reconciliation and analytics. It also provides for the classification and identification of defined groups of medications (called special groupings), such as narcotic and controlled drugs. It can be used by knowledge-based vendors, clinicians, researchers, statistical users, government agencies, healthcare organisations and consumers. The data source for the CCDD is the Drug Product Database (DPD), which contains information on drugs approved by Health Canada. However, the data is modeled differently, following the CCDD Editorial Guidelines, which take international terminology standards into consideration. For example, DPD uses the dosage form “tablet (delayed-release)”, whereas CCDD uses the equivalent term “gastro-resistant tablet.” The CCDD does not replace the Health Canada DPD but is published in addition to it. The scope of health products included in CCDD is limited to those classified as human in DPD (veterinary, radiopharmaceutical and disinfectant products are out of scope). Some exclusions apply within the human class but are subject to periodic review. For a full list of exclusions, please see the Scope section in the CCDD Editorial Guidelines. In addition, a limited number of medical devices that are commonly prescribed and dispensed at a community pharmacy are included. This data set was developed in collaboration with Canada Health Infoway and is also available in their Terminology Gateway at https://tgateway.infoway-inforoute.ca/ccdd.html (free login required).
Problem Statement
Healthcare providers often rely on generalized treatment protocols that may not address the unique needs of individual patients. This approach leads to variable treatment outcomes, reduced efficacy, and limited patient satisfaction. A leading hospital sought a solution to develop personalized treatment plans tailored to each patient’s medical history, genetic profile, and current health status.
Challenge
Implementing a personalized healthcare treatment system involved overcoming the following challenges:
Integrating diverse patient data, including medical history, lab results, genetic information, and lifestyle factors.
Developing predictive models capable of identifying optimal treatment plans for individual patients.
Ensuring compliance with privacy regulations and maintaining data security throughout the process.
Solution Provided
An advanced healthcare treatment recommendation system was developed using machine learning models and predictive analytics. The solution was designed to:
Analyze patient data to identify patterns and predict treatment outcomes.
Recommend individualized treatment plans optimized for efficacy and patient preferences.
Continuously learn and adapt to improve recommendations based on new medical insights and patient feedback.
Development Steps
Data Collection
Aggregated data from electronic health records (EHR), genetic testing reports, and patient-provided health information.
Preprocessing
Standardized and anonymized data to ensure accuracy, consistency, and compliance with healthcare privacy regulations.
Model Development
Trained machine learning models to identify correlations between patient characteristics and treatment outcomes. Developed predictive algorithms to recommend personalized treatment plans for conditions like chronic diseases, cancer, and rare disorders.
Validation
Tested the system on historical patient data to evaluate its accuracy in predicting successful treatment outcomes.
Deployment
Integrated the solution into the hospital’s clinical decision support systems, enabling healthcare providers to access personalized treatment recommendations during consultations.
Continuous Monitoring & Improvement
Established a feedback mechanism to refine models using real-world treatment outcomes and patient satisfaction data.
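The case study does not say how the preprocessing or validation steps were implemented. As a minimal sketch under that caveat, the Python below shows one common way to replace patient identifiers with keyed hashes before modeling, and one way to score predictions against historical outcomes. The function names `pseudonymize` and `accuracy` and the choice of HMAC-SHA-256 are assumptions for illustration. Keyed hashing is pseudonymization only, so it would not by itself meet the anonymization the case study describes.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace a patient identifier with a keyed SHA-256 digest.

    The same ID and key always give the same token, so records can still
    be joined, but the original ID cannot be read back from the token.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def accuracy(predicted, actual):
    """Fraction of historical cases where the prediction matched the outcome."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

In practice the key would be stored separately from the data, and accuracy would usually be reported alongside other metrics on a held-out set of historical records that the model never saw during training.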
Results
Improved Patient Outcomes
The system delivered personalized treatment recommendations that significantly improved recovery rates and health outcomes.
Increased Treatment Efficacy
Optimized treatment plans reduced trial-and-error approaches, leading to more effective interventions and fewer side effects.
Personalized Healthcare Experiences
Patients reported higher satisfaction levels due to treatment plans tailored to their individual needs and preferences.
Enhanced Decision-Making
Healthcare providers benefited from data-driven insights, enabling more informed and confident decisions.
Scalable and Future-Ready Solution
The system scaled seamlessly to support diverse medical specialties and adapted to incorporate emerging medical research.