31 datasets found

Transcribed Medical Records datasets for Machine Learning
shaip.com
json
Updated Jun 15, 2025
Cite
Shaip (2025). Transcribed Medical Records datasets for Machine Learning [Dataset]. https://www.shaip.com/offerings/transcribed-medical-records-medical-data-catalog/
Available download formats: json
Dataset updated
Jun 15, 2025
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Get a premium-quality, off-the-shelf transcribed medical records dataset to develop better-performing machine learning models. Deep domain expertise. Fast and cost-effective.
Pixta AI | Imagery Data | Global | High volume | Annotation and Labelling...
datarade.ai
.json, .xml, .csv
Updated Jul 19, 2023
Cite
Pixta AI (2023). Pixta AI | Imagery Data | Global | High volume | Annotation and Labelling Services Provided | Multimodal Medical Images OTS Datasets for AI and ML [Dataset]. https://datarade.ai/data-products/multimodal-medical-image-ots-datasets-pixta-ai
Available download formats: .json, .xml, .csv
Dataset updated
Jul 19, 2023
Dataset authored and provided by
Pixta AI
Area covered
Pitcairn, Uruguay, French Polynesia, Haiti, Malaysia, Guernsey, Serbia, Lebanon, Maldives, Montenegro
Description
Overview

This dataset is a collection of multimodal, high-quality medical image sets that are ready to use for optimizing the accuracy of computer vision models. All contents are sourced from Pixta AI's partner network with high quality and full data compliance.

Data subject

The datasets cover various modalities and tasks:

X-ray datasets

CT datasets

MRI datasets

Mammography datasets

Segmentation datasets

Classification datasets

Regression datasets

Use case

The dataset could be used for various healthcare and medical models:

Medical Image Analysis

Remote Diagnosis

Medical Record Keeping ...

Each dataset is supported by both AI and expert-doctor review processes to ensure labelling consistency and accuracy. Contact us for more custom datasets.

About PIXTA

PIXTASTOCK is the largest Asian-featured stock platform, providing data, content, tools and services since 2005. PIXTA has 15 years of experience integrating advanced AI technology in managing, curating and processing over 100M visual materials and serving leading global brands for their creative and data demands. Visit us at https://www.pixta.ai/ or contact us by email at admin.bi@pixta.co.jp.
Hospital Management Dataset
kaggle.com
Updated May 30, 2025
Cite
Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
Available format: Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 30, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kanak Baghel
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

Dataset Overview

This dataset includes five CSV files:

patients.csv – Patient demographics, contact details, registration info, and insurance data

doctors.csv – Doctor profiles with specializations, experience, and contact information

appointments.csv – Appointment dates, times, visit reasons, and statuses

treatments.csv – Treatment types, descriptions, dates, and associated costs

billing.csv – Billing amounts, payment methods, and status linked to treatments

📁 Files & Column Descriptions

patients.csv

Contains patient demographic and registration details.

Column Description

patient_id -> Unique ID for each patient
first_name -> Patient's first name
last_name -> Patient's last name
gender -> Gender (M/F)
date_of_birth -> Date of birth
contact_number -> Phone number
address -> Address of the patient
registration_date -> Date of first registration at the hospital
insurance_provider -> Insurance company name
insurance_number -> Policy number
email -> Email address

doctors.csv

Details about the doctors working in the hospital.

Column Description

doctor_id -> Unique ID for each doctor
first_name -> Doctor's first name
last_name -> Doctor's last name
specialization -> Medical field of expertise
phone_number -> Contact number
years_experience -> Total years of experience
hospital_branch -> Branch of hospital where doctor is based
email -> Official email address

appointments.csv

Records of scheduled and completed patient appointments.

Column Description

appointment_id -> Unique appointment ID
patient_id -> ID of the patient
doctor_id -> ID of the attending doctor
appointment_date -> Date of the appointment
appointment_time -> Time of the appointment
reason_for_visit -> Purpose of visit (e.g., checkup)
status -> Status (Scheduled, Completed, Cancelled)

treatments.csv

Information about the treatments given during appointments.

Column Description

treatment_id -> Unique ID for each treatment
appointment_id -> Associated appointment ID
treatment_type -> Type of treatment (e.g., MRI, X-ray)
description -> Notes or procedure details
cost -> Cost of treatment
treatment_date -> Date when treatment was given

billing.csv

Billing and payment details for treatments.

Column Description

bill_id -> Unique billing ID
patient_id -> ID of the billed patient
treatment_id -> ID of the related treatment
bill_date -> Date of billing
amount -> Total amount billed
payment_method -> Mode of payment (Cash, Card, Insurance)
payment_status -> Status of payment (Paid, Pending, Failed)
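
The tables link to one another through patient_id, doctor_id, appointment_id and treatment_id. The following is a minimal sketch of a join across billing, treatments and appointments. It assumes the column names listed above, keeps only a subset of the columns, and uses a few invented rows in an in-memory SQLite database rather than the actual CSV files; the ID values are hypothetical.

```python
import sqlite3

# In-memory database; the rows below are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE appointments (appointment_id TEXT PRIMARY KEY, patient_id TEXT,
                           doctor_id TEXT, status TEXT);
CREATE TABLE treatments (treatment_id TEXT PRIMARY KEY, appointment_id TEXT,
                         treatment_type TEXT, cost REAL);
CREATE TABLE billing (bill_id TEXT PRIMARY KEY, patient_id TEXT,
                      treatment_id TEXT, amount REAL, payment_status TEXT);
""")
conn.executemany("INSERT INTO appointments VALUES (?, ?, ?, ?)", [
    ("A1", "P1", "D1", "Completed"),
    ("A2", "P2", "D1", "Completed"),
    ("A3", "P1", "D2", "Cancelled"),  # no treatment, so it drops out of the join
])
conn.executemany("INSERT INTO treatments VALUES (?, ?, ?, ?)", [
    ("T1", "A1", "MRI", 500.0),
    ("T2", "A2", "X-ray", 120.0),
])
conn.executemany("INSERT INTO billing VALUES (?, ?, ?, ?, ?)", [
    ("B1", "P1", "T1", 500.0, "Paid"),
    ("B2", "P2", "T2", 120.0, "Pending"),
])

# Follow billing -> treatments -> appointments and total the billed amount.
query = """
SELECT b.patient_id, a.doctor_id, t.treatment_type, SUM(b.amount) AS total_billed
FROM billing AS b
JOIN treatments AS t ON t.treatment_id = b.treatment_id
JOIN appointments AS a ON a.appointment_id = t.appointment_id
GROUP BY b.patient_id, a.doctor_id, t.treatment_type
ORDER BY b.patient_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the real files, the same query should apply once each CSV is loaded into a table of the same name.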

Possible Use Cases

SQL queries and relational database design

Exploratory data analysis (EDA) and dashboarding

Machine learning projects (e.g., cost prediction, no-show analysis)

Feature engineering and data cleaning practice

End-to-end healthcare analytics workflows

Recommended Tools & Resources

SQL (joins, filters, window functions)

Pandas and Matplotlib/Seaborn for EDA

Scikit-learn for ML models

Pandas Profiling for automated EDA

Plotly for interactive visualizations

Please note:

All data is synthetically generated for educational and project use. No real patient information is included.

If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Dataset for Automated Medical Transcription
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 15, 2023
Cite
Nazmul Kazi; Matt Kuntz; Upulee Kanewala; Indika Kahanda (2023). Dataset for Automated Medical Transcription [Dataset]. http://doi.org/10.5281/zenodo.4279041
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.4279041
Dataset updated
Jan 15, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nazmul Kazi; Matt Kuntz; Upulee Kanewala; Indika Kahanda
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We generated this dataset to train a machine learning model for automatically generating psychiatric case notes from doctor-patient conversations. Since we didn't have access to real doctor-patient conversations, we used transcripts from two different sources to generate audio recordings of enacted conversations between a doctor and a patient. We employed eight students who worked in pairs to generate these recordings. Six of the transcripts that we used to produce these recordings were hand-written by Cheryl Bristow, and the rest were adapted from Alexander Street transcripts, which were generated from real doctor-patient conversations. Our study requires recording the doctor and the patient(s) in separate channels, which is the primary reason for generating our own audio recordings of the conversations.

We used the Google Cloud Speech-to-Text API to transcribe the enacted recordings. These newly generated transcripts are produced entirely by AI-powered automatic speech recognition, whereas the source transcripts are either hand-written or fine-tuned by human transcribers (transcripts from Alexander Street).

We provided the generated transcripts back to the students and asked them to write case notes. The students worked independently using software that we developed earlier for this purpose. The students had past experience writing case notes, and we let them write case notes as they had practiced, without any training or instructions from us.

NOTE: Audio recordings are not included in Zenodo due to large file size but they are available in the GitHub repository.
EMRBots: a 100,000-patient database
figshare.com
zip
Updated Sep 3, 2018
Cite
Uri Kartoun (2018). EMRBots: a 100,000-patient database [Dataset]. http://doi.org/10.6084/m9.figshare.7040198.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.7040198.v1
Dataset updated
Sep 3, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Uri Kartoun
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A 100,000-patient database that contains in total 100,000 virtual patients, 361,760 admissions, and 107,535,387 lab observations.
Data from: MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D...
zenodo.org
bin
Updated Aug 17, 2023
Cite
Jiancheng Yang; Rui Shi; Donglai Wei; Zequan Liu; Lin Zhao; Bilian Ke; Hanspeter Pfister; Bingbing Ni (2023). MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification [Dataset]. http://doi.org/10.5281/zenodo.5208230
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.5208230
Dataset updated
Aug 17, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jiancheng Yang; Rui Shi; Donglai Wei; Zequan Liu; Lin Zhao; Bilian Ke; Hanspeter Pfister; Bingbing Ni
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Note: We recommend downloading from the official Zenodo link, which is integrated with our code. However, if you encounter download problems, you can also use this mirror link from Google Drive.

Abstract

We introduce MedMNIST v2, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of 708,069 2D images and 10,214 3D images in total, could support numerous research / educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST v2, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.

Note: This dataset is NOT intended for clinical use.

We recommend our official code to download, parse and use the MedMNIST dataset:

pip install medmnist

Citation

If you find this project useful, please cite both the v1 and v2 papers as:

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification". arXiv preprint arXiv:2110.14795, 2021.

Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

or using bibtex:

@article{medmnistv2,
  title={MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification},
  author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
  journal={arXiv preprint arXiv:2110.14795},
  year={2021}
}

@inproceedings{medmnistv1,
  title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
  author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
  booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
  pages={191--195},
  year={2021}
}

Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.

License

The dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

The code is under Apache-2.0 License.
Medical Diagnostic Fitness Dataset
kaggle.com
Updated Aug 2, 2025
Cite
Maaz Khan (2025). Medical Diagnostic Fitness Dataset [Dataset]. https://www.kaggle.com/datasets/maazkhan636/medical-diagnostic-fitness-dataset/code
Available format: Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 2, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Maaz Khan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is a synthetic simulation of medical fitness reports collected from routine health check-ups. It contains 5,000 rows representing individuals' vitals, lab test results, physical exam notes, and disease screenings.

Each row is labeled with a FIT or UNFIT outcome to indicate overall health status based on the data. This dataset can be used for machine learning classification tasks, health analytics, and smart form automation.

Use Cases:

Medical classification model (FIT/UNFIT)

EDA & visualization

Data cleaning & preprocessing

Health dashboards

ML feature engineering

Note: All values are randomly generated for educational purposes and do not represent real individuals.
Data from: OpenChart-SE: A corpus of artificial Swedish electronic health...
zenodo.org
data.niaid.nih.gov
bin, csv, pdf, txt
Updated Jul 15, 2024
Cite
Johanna Berg; Carl Ollvik Aasa; Björn Appelgren Thorell; Sonja Aits (2024). OpenChart-SE: A corpus of artificial Swedish electronic health records for imagined emergency care patients written by physicians in a crowd-sourcing project [Dataset]. http://doi.org/10.5281/zenodo.7499831
Available download formats: txt, csv, bin, pdf
Unique identifier
https://doi.org/10.5281/zenodo.7499831
Dataset updated
Jul 15, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Johanna Berg; Carl Ollvik Aasa; Björn Appelgren Thorell; Sonja Aits
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Electronic health records (EHRs) are a rich source of information for medical research and public health monitoring. Information systems based on EHR data could also assist in patient care and hospital management. However, much of the data in EHRs is in the form of unstructured text, which is difficult to process for analysis. Natural language processing (NLP), a form of artificial intelligence, has the potential to enable automatic extraction of information from EHRs and several NLP tools adapted to the style of clinical writing have been developed for English and other major languages. In contrast, the development of NLP tools for less widely spoken languages such as Swedish has lagged behind. A major bottleneck in the development of NLP tools is the restricted access to EHRs due to legitimate patient privacy concerns. To overcome this issue we have generated a citizen science platform for collecting artificial Swedish EHRs with the help of Swedish physicians and medical students. These artificial EHRs describe imagined but plausible emergency care patients in a style that closely resembles EHRs used in emergency departments in Sweden. In the pilot phase, we collected a first batch of 50 artificial EHRs, which has passed review by an experienced Swedish emergency care physician. We make this dataset publicly available as OpenChart-SE corpus (version 1) under an open-source license for the NLP research community. The project is now open for general participation and Swedish physicians and medical students are invited to submit EHRs on the project website (https://github.com/Aitslab/openchart-se), where additional batches of quality-controlled EHRs will be released periodically.

Dataset content

OpenChart-SE, version 1 corpus (txt files and dataset.csv)

The OpenChart-SE corpus, version 1, contains 50 artificial EHRs (note that the numbering starts with 5 as 1-4 were test cases that were not suitable for publication). The EHRs are available in two formats, structured as a .csv file and as separate text files for annotation. Note that flaws in the data were not cleaned up so that it simulates what could be encountered when working with data from different EHR systems. All charts have been checked for medical validity by a resident in Emergency Medicine at a Swedish hospital before publication.

Codebook.xlsx

The codebook contains information about each variable used. It is in XLSForm format, which can be re-used in several different applications for data collection.

suppl_data_1_openchart-se_form.pdf

OpenChart-SE mock emergency care EHR form.

suppl_data_3_openchart-se_dataexploration.ipynb

This jupyter notebook contains the code and results from the analysis of the OpenChart-SE corpus.

More details about the project and information on the upcoming preprint accompanying the dataset can be found on the project website (https://github.com/Aitslab/openchart-se).
Synthetic Synthea patient datasets for lung cancer risk prediction machine...
data.mendeley.com
Updated Oct 31, 2022
Cite
Anjun Chen (2022). Synthetic Synthea patient datasets for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.17632/b24cb4nn8h.1
Unique identifier
https://doi.org/10.17632/b24cb4nn8h.1
Dataset updated
Oct 31, 2022
Authors
Anjun Chen
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction and simulation study of learning health systems.

In subfolder "unconverted": Five populations of 30K patients were generated by the Synthea patient generator. About 1100 lung cancer patients and 3000 control patients (without lung cancer) were selected and their electronic health records (EHR) were processed to data table files ready for machine learning using common algorithms like XGBoost.

In root directory: The five 30K-patient datasets were combined sequentially to form 5 different size datasets, from 30K to 150K patients. The new datasets were resampled to keep all lung cancer patients plus about 3x control patients. The ML-ready table files also had the continuous numeric values converted to categorical values.
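
The resampling step described above (keep every lung cancer patient plus about 3x as many controls) can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: the `lung_cancer` field, the `resample_controls` helper and the toy population are all hypothetical.

```python
import random

def resample_controls(patients, ratio=3, seed=0):
    """Keep every case and a random sample of about `ratio` times as many controls."""
    cases = [p for p in patients if p["lung_cancer"]]
    controls = [p for p in patients if not p["lung_cancer"]]
    # Never ask for more controls than exist.
    n_controls = min(len(controls), ratio * len(cases))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return cases + rng.sample(controls, n_controls)

# Toy population: 10 cases among 1,000 synthetic patients (made-up numbers).
population = [{"id": i, "lung_cancer": i < 10} for i in range(1000)]
subset = resample_controls(population)
n_cases = sum(p["lung_cancer"] for p in subset)
print(n_cases, len(subset) - n_cases)
```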

Because Synthea patients closely resemble real patients, the Synthea patient data can be used to develop and test ML algorithms and pipelines, and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns.

The first LHS simulation study titled "Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data" has been published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
A machine learning approach for master patient index record linkage and...
esango.cput.ac.za
csv
Updated Aug 11, 2025
Cite
Dane Hollenbach (2025). A machine learning approach for master patient index record linkage and deduplication [Dataset]. http://doi.org/10.25381/cput.28593101.v1
Available download formats: csv
Unique identifier
https://doi.org/10.25381/cput.28593101.v1
Dataset updated
Aug 11, 2025
Dataset provided by
Cape Peninsula University of Technology
Authors
Dane Hollenbach
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Ethics Reference No: 209113723/2023/1

Source code is available on GitHub. The datasets are used to reproduce the same results: https://github.com/DHollenbach/record-linkage-and-deduplication/blob/main/README.md

Abstract:

The research emphasised the vital role of a Master Patient Index (MPI) solution in addressing the challenges public healthcare facilities face in eliminating duplicate patient records and improving record linkage. The study recognised that traditional MPI systems may have limitations in terms of efficiency and accuracy. To address this, the study focused on utilising machine learning techniques to enhance the effectiveness of MPI systems, aiming to support the growing record linkage healthcare ecosystem.

It was essential to highlight that integrating machine learning into MPI systems is crucial for optimising their capabilities. The study aimed to improve data linking and deduplication processes within MPI systems by leveraging machine learning techniques. This emphasis on machine learning represented a significant shift towards more sophisticated and intelligent healthcare technologies. Ultimately, the goal was to ensure safe and efficient patient care, benefiting individuals and the broader healthcare industry.

This research investigated the performance of five machine learning classification algorithms (random forests, extreme gradient boosting, logistic regression, stacking ensemble, and deep multilayer perceptron) for data linkage and deduplication on four datasets. These techniques improved data linking and deduplication for use in an MPI system.

The findings demonstrate the applicability of machine learning models for effective data linkage and deduplication of electronic health records. The random forest algorithm achieved the best performance (identifying duplicates correctly) based on accuracy, F1-Score, and AUC-score for three datasets (Electronic Practice-Based Research Network (ePBRN): Acc = 99.83%, F1-score = 81.09%, AUC = 99.98%; Freely Extensible Biomedical Record Linkage (FEBRL) 3: Acc = 99.55%, F1-score = 96.29%, AUC = 99.77%; Custom-synthetic: Acc = 99.98%, F1-score = 99.18%, AUC = 99.99%). In contrast, the experimentation on the FEBRL4 dataset revealed that the Multi-Layer Perceptron Artificial Neural Network (MLP-ANN) and logistic regression algorithms outperformed the random forest algorithm. The performance results for the MLP-ANN were (FEBRL4: Acc = 99.93%, F1-score = 96.95%, AUC = 99.97%). For the logistic regression algorithm, the results were (FEBRL4: Acc = 99.99%, F1 = 96.91%, AUC = 99.97%).

In conclusion, the results of this research have significant implications for the healthcare industry, as they are expected to enhance the utilisation of MPI systems and improve their effectiveness in the record linkage healthcare ecosystem. By improving patient record linking and deduplication, healthcare providers can ensure safer and more efficient care, ultimately benefiting patients and the industry.
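
The results above report accuracy and F1-score side by side, and the two can diverge sharply when true duplicates are rare among candidate record pairs. The sketch below shows how both are computed from confusion-matrix counts. The counts are hypothetical and are not taken from the study; they only illustrate how a very high accuracy can coexist with a noticeably lower F1-score.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical counts: 100 true duplicate pairs among 100,000 candidate pairs.
acc, f1 = accuracy_and_f1(tp=80, fp=20, fn=20, tn=99_880)
print(f"accuracy={acc:.4f}, F1={f1:.4f}")
```

Because true negatives dominate the total, accuracy barely moves when a few duplicates are missed, while F1 ignores true negatives and reacts directly.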
Electronic Health Records (EHR) Datasets
shaip.com
json
Updated Apr 8, 2022
Cite
Shaip (2022). Electronic Health Records (EHR) Datasets [Dataset]. https://www.shaip.com/offerings/electronic-health-records-ehr-medical-data-catalog/
Available download formats: json
Dataset updated
Apr 8, 2022
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Get premium quality off-the-shelf EHR dataset to develop better performing machine learning models. Speak to our experts for Electronic Health Records data needs.
Multi Cancer Dataset
kaggle.com
Updated Oct 3, 2024
Cite
Obuli Sai Naren (2024). Multi Cancer Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/9537604
Available format: Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/9537604
Dataset updated
Oct 3, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Obuli Sai Naren
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
🩺 Multi Cancer Dataset - 8 Types of Cancer Images

Overview

This dataset contains images of various cancer types, compiled for research and analysis purposes. It includes 8 main cancer classes and 26 subclasses, providing a rich resource for medical image classification and machine learning applications.

📝 Citation

If you use this dataset in your research or project, please make sure to cite it appropriately. Thanks! ❤️ You can check DOI Citation section at the bottom.

APA

Obuli Sai Naren. (2022). Multi Cancer Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/3415848

📊 Dataset Details

Cancer                        Classes  Images
Acute Lymphoblastic Leukemia  4        20,000
Brain Cancer                  3        15,000
Breast Cancer                 2        10,000
Cervical Cancer               5        25,000
Kidney Cancer                 2        10,000
Lung and Colon Cancer         5        25,000
Lymphoma                      3        15,000
Oral Cancer                   2        10,000

Total Images: 130,000
Format: JPEG
Dimensions: 512px × 512px
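
As a quick consistency check, the per-cancer figures in the table can be summed and compared with the stated totals (26 subclasses, 130,000 images, 5,000 images per subclass). The sketch below copies the numbers from the table by hand; it does not read the dataset itself.

```python
# (cancer type, subclasses, images) as listed in the table above.
table = [
    ("Acute Lymphoblastic Leukemia", 4, 20_000),
    ("Brain Cancer", 3, 15_000),
    ("Breast Cancer", 2, 10_000),
    ("Cervical Cancer", 5, 25_000),
    ("Kidney Cancer", 2, 10_000),
    ("Lung and Colon Cancer", 5, 25_000),
    ("Lymphoma", 3, 15_000),
    ("Oral Cancer", 2, 10_000),
]

total_subclasses = sum(n for _, n, _ in table)
total_images = sum(images for _, _, images in table)
# Set of distinct images-per-subclass values; a single value means a balanced dataset.
per_subclass = {images // n for _, n, images in table}
print(total_subclasses, total_images, per_subclass)
```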

📂 Folder Structure & Class Names

Each subclass folder contains 5,000 images. The datasets referenced for each cancer type are linked below.

📄 Notes on Images

All subclass folders contain 5,000 images each.

Each image follows the naming format <subclass>_<serial_number>.jpg for easy reference.
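
A minimal sketch of parsing that naming format is shown below. It assumes that the serial number is the part after the last underscore and that subclass names may themselves contain underscores; the example file name and the `parse_image_name` helper are hypothetical.

```python
from pathlib import PurePosixPath

def parse_image_name(filename):
    """Split a 'subclass_serial.jpg' file name into (subclass, serial number)."""
    stem = PurePosixPath(filename).stem
    # Split on the LAST underscore so subclass names can contain underscores.
    subclass, _, serial = stem.rpartition("_")
    if not subclass or not serial.isdigit():
        raise ValueError(f"unexpected file name: {filename!r}")
    return subclass, int(serial)

# Hypothetical file name; real subclass names may differ.
print(parse_image_name("brain_glioma_0042.jpg"))
```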

For more detailed information on the dataset structure, preprocessing, and usage, please refer to the README.md file included in the dataset's main directory.

Feel free to download, analyze, and contribute! 📊💻
Table1_Generalizability of machine learning methods in detecting adverse...
frontiersin.figshare.com
docx
Updated Jul 12, 2023
Cite
Md Muntasir Zitu; Shijun Zhang; Dwight H. Owen; Chienwei Chiang; Lang Li (2023). Table1_Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records.DOCX [Dataset]. http://doi.org/10.3389/fphar.2023.1218679.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fphar.2023.1218679.s001
Dataset updated
Jul 12, 2023
Dataset provided by
Frontiers
Authors
Md Muntasir Zitu; Shijun Zhang; Dwight H. Owen; Chienwei Chiang; Lang Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We assessed the generalizability of machine learning methods using natural language processing (NLP) techniques to detect adverse drug events (ADEs) from clinical narratives in electronic medical records (EMRs). We constructed a new corpus correlating drugs with adverse drug events using 1,394 clinical notes of 47 randomly selected patients who received immune checkpoint inhibitors (ICIs) from 2011 to 2018 at The Ohio State University James Cancer Hospital, annotating 189 drug-ADE relations in single sentences within the medical records. We also used data from Harvard’s publicly available 2018 National Clinical Challenge (n2c2), which includes 505 discharge summaries with annotations of 1,355 single-sentence drug-ADE relations. We applied classical machine learning (support vector machine (SVM)), deep learning (convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM)), and state-of-the-art transformer-based (bidirectional encoder representations from transformers (BERT) and ClinicalBERT) methods trained and tested in the two different corpora and compared performance among them to detect drug–ADE relationships. ClinicalBERT detected drug–ADE relationships better than the other methods when trained using our dataset and tested in n2c2 (ClinicalBERT F-score, 0.78; other methods, F-scores, 0.61–0.73) and when trained using the n2c2 dataset and tested in ours (ClinicalBERT F-score, 0.74; other methods, F-scores, 0.55–0.72). Comparison among several machine learning methods demonstrated the superior performance and, therefore, the greatest generalizability of findings of ClinicalBERT for the detection of drug–ADE relations from clinical narratives in electronic medical records.
Health Care Analytics
kaggle.com
Updated Jan 10, 2022
Cite
Abishek Sudarshan (2022). Health Care Analytics [Dataset]. https://www.kaggle.com/datasets/abisheksudarshan/health-care-analytics
Available format: Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 10, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Abishek Sudarshan
Description
Context

Part of Janatahack Hackathon in Analytics Vidhya

Content

The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.

MedCamp organizes health camps in several cities with poor work-life balance. They reach out to working people and ask them to register for these health camps. Those who attend can undergo health checks or increase their awareness by visiting various stalls (depending on the format of the camp).

MedCamp has conducted 65 such events over a period of 4 years and sees a high drop-off between “Registration” and the number of people taking tests at the camps. Over these 4 years, they have stored data on ~110,000 registrations.

One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less inventory than required for conducting these medical checks, people end up having a bad experience.

The Process:

MedCamp employees / volunteers reach out to people and drive registrations. During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.

Other things to note:

Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people. For a few camps, there was a hardware failure, so some information about the date and time of registration is lost. MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

Favorable outcome:

For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall. You need to predict the chances (probability) of having a favourable outcome.
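The outcome definition above can be sketched as a small labelling function. This is only an illustration: the argument names (`camp_format`, `health_score`, `stalls_visited`) are hypothetical and are not the column names of the actual files.

```python
from typing import Optional


def favourable_outcome(camp_format: int,
                       health_score: Optional[float],
                       stalls_visited: int) -> int:
    """Return 1 for a favourable outcome and 0 otherwise.

    All argument names are hypothetical placeholders for whatever
    the real dataset columns are called.
    """
    if camp_format in (1, 2):
        # Formats 1 and 2: favourable if the attendee received a health score.
        return int(health_score is not None)
    if camp_format == 3:
        # Format 3: favourable if the attendee visited at least one stall.
        return int(stalls_visited >= 1)
    raise ValueError(f"unknown camp format: {camp_format}")
```

A model for this task would then be trained to predict the probability that this label equals 1 for each registration.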

Train / Test split:

Camps started on or before 31st March 2006 are considered in Train. Test data is for all camps conducted on or after 1st April 2006.
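A date-based split like the one described can be sketched as follows. The record layout (a dict with a `start` date) is an assumption made for illustration; the real files may store camp dates differently.

```python
from datetime import date

# Last camp start date that belongs to the training set, per the description.
TRAIN_CUTOFF = date(2006, 3, 31)


def split_camps(camps):
    """Split camp records into (train, test) by start date.

    `camps` is assumed to be a list of dicts with a `start` key holding
    a datetime.date; this layout is hypothetical.
    """
    train = [c for c in camps if c["start"] <= TRAIN_CUTOFF]
    test = [c for c in camps if c["start"] > TRAIN_CUTOFF]
    return train, test
```

Splitting by time rather than at random keeps later camps out of training, which matches how the predictions would be used on future camps.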

Acknowledgements

Credits to AV

Inspiration

To share with the data science community to jump start their journey in Healthcare Analytics
Getranscribeerde medische dossiers datasets voor machine learning (Transcribed medical records datasets for machine learning)
nl.shaip.com
json
Updated Dec 7, 2022
Cite
Shaip (2022). Getranscribeerde medische dossiers datasets voor machine learning [Dataset]. https://nl.shaip.com/offerings/transcribed-medical-records-medical-data-catalog/
Explore at:
Available download formats: json
Dataset updated
Dec 7, 2022
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Get high-quality, off-the-shelf transcribed medical datasets to develop better-performing machine learning models. In-depth domain expertise. Fast and cost-effective.
Hand Washing Video Dataset Annotated According to the World Health...
data.niaid.nih.gov
Updated Jan 3, 2022
Cite
Sabelnikovs, Olegs (2022). Hand Washing Video Dataset Annotated According to the World Health Organization's Handwashing Guidelines - METC Subset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5808788
Explore at:
Dataset updated
Jan 3, 2022
Dataset provided by
Elsts, Atis
Sabelnikovs, Olegs
Zemlanuhina, Olga
Slavinska, Andreta
Vilde, Aija
Melbārde-Kelmere, Agita
Ivanovs, Maksims
Lulla, Martins
Rutkovskis, Aleksejs
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Overview: This is a lab-based dataset with videos recording volunteers (medical students) washing their hands as part of a hand-washing monitoring and feedback experiment. The dataset is collected in the Medical Education Technology Center (METC) of Riga Stradins University, Riga, Latvia. In total, 72 participants took part in the experiments, each washing their hands three times, in a randomized order, going through three different hand-washing feedback approaches (user interfaces of a mobile app). The data was annotated in real time by a human operator, in order to give the experiment participants real-time feedback on their performance. There are 212 hand washing episodes in total, each of which is annotated by a single person. The annotations classify the washing movements according to the World Health Organization's (WHO) guidelines by marking each frame in each video with a certain movement code.

This dataset is part of a series of three datasets, all following the same format:

https://zenodo.org/record/4537209 - data collected in Pauls Stradins Clinical University Hospital

https://zenodo.org/record/5808764 - data collected in Jurmala Hospital

https://zenodo.org/record/5808789 - data collected in the Medical Education Technology Center (METC) of Riga Stradins University

Note #1: we recommend that when using this dataset for machine learning, allowances are made for the reaction speed of the human operator labeling the data. For example, the annotations can be expected to be incorrect a short while after the person in the video switches their washing movements.

Application: The intention of this dataset is to serve as a basis for training machine learning classifiers for automated hand washing movement recognition and quality control.

Statistics:

Frame rate: ~16 FPS (slightly variable, as the videos are reconstructed from a sequence of JPG images taken at the maximum frame rate supported by the capturing devices).

Resolution: 640x480

Number of videos: 212

Number of annotation files: 212

Movement codes (in JSON files):

1: Hand washing movement — Palm to palm

2: Hand washing movement — Palm over dorsum, fingers interlaced

3: Hand washing movement — Palm to palm, fingers interlaced

4: Hand washing movement — Backs of fingers to opposing palm, fingers interlocked

5: Hand washing movement — Rotational rubbing of the thumb

6: Hand washing movement — Fingertips to palm

0: Other hand washing movement
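As a rough sketch of how the movement codes and the operator-delay caveat in Note #1 might be handled in code: the mapping below restates the codes listed above, and `reliable_mask` flags the frames just before each annotated label change as unreliable, since a late operator reaction leaves the old label on frames where the new movement has already started. The per-frame list of integer labels and the `delay_frames` value are assumptions; the actual JSON layout is not described here.

```python
# Movement codes as listed in the dataset description.
MOVEMENT_CODES = {
    0: "Other hand washing movement",
    1: "Palm to palm",
    2: "Palm over dorsum, fingers interlaced",
    3: "Palm to palm, fingers interlaced",
    4: "Backs of fingers to opposing palm, fingers interlocked",
    5: "Rotational rubbing of the thumb",
    6: "Fingertips to palm",
}


def reliable_mask(labels, delay_frames):
    """Return one bool per frame; False marks frames that may be mislabeled.

    `labels` is assumed to be a per-frame list of movement codes and
    `delay_frames` a guess at the operator's reaction time in frames
    (both hypothetical). Frames in the window before each annotated
    change are masked, because the real switch happened earlier.
    """
    mask = [True] * len(labels)
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            for j in range(max(0, i - delay_frames), i):
                mask[j] = False
    return mask
```

At roughly 16 FPS, a `delay_frames` of 8 would correspond to about half a second of reaction time; the right value is a judgment call and is not given in the dataset description.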

Note #2: The original dataset of JPG images is available upon request. There are 13 annotation classes in the original dataset: for each of the six washing movements defined by the WHO, "correct" and "incorrect" execution is marked with two different labels. In this published dataset, all incorrect executions are marked with code 0, as "other" washing movement.

Acknowledgments: The dataset collection was funded by the Latvian Council of Science project: "Automated hand washing quality control and quality evaluation system with real-time feedback", No: lzp - Nr. 2020/2-0309.

References: For more detailed information, see this article, describing a similar dataset collected in a different project:

M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Gromova, M. Ivanovs, A. Skadins, R. Kadikis, A. Elsts. Hand-Washing Video Dataset Annotated According to the World Health Organization’s Hand-Washing Guidelines. Data. 2021; 6(4):38. https://doi.org/10.3390/data6040038

Contact information: atis.elsts@edi.lv
Data from: MedMNIST-C: Comprehensive benchmark and improved classifier...
zenodo.org
data.niaid.nih.gov
zip
Updated Jul 31, 2024
Cite
Francesco Di Salvo; Sebastian Doerrich; Christian Ledig (2024). MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions [Dataset]. http://doi.org/10.5281/zenodo.11471504
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.11471504
Dataset updated
Jul 31, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Francesco Di Salvo; Sebastian Doerrich; Christian Ledig
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Abstract: The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection, covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at github.com/francescodisalvo05/medmnistc-api.
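To make the idea of corruptions of varying severity concrete, here is a generic sketch that adds Gaussian noise to a grayscale image at one of five severity levels. It is not the medmnistc library API and does not reproduce the benchmark's actual corruptions; the function name, the severity-to-sigma table, and the list-of-lists image format are all assumptions chosen for illustration.

```python
import random

# Hypothetical mapping from severity level to noise standard deviation.
SIGMA_BY_SEVERITY = {1: 4.0, 2: 8.0, 3: 16.0, 4: 32.0, 5: 64.0}


def add_gaussian_noise(image, severity, seed=0):
    """Return a noisy copy of a grayscale image given as rows of 0..255 ints.

    Higher severity means a larger noise standard deviation. Pixel values
    are rounded and clipped back to the 0..255 range.
    """
    sigma = SIGMA_BY_SEVERITY[severity]
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    return [
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in image
    ]
```

Applied to test images, a function like this simulates a distribution shift for evaluating robustness; applied to training images with a randomly chosen severity, it acts as a data augmentation.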

This work has been accepted at the Workshop on Advancing Data Solutions in Medical Imaging AI @ MICCAI 2024 [preprint].

Note: Due to space constraints, we have uploaded all datasets except TissueMNIST-C. However, it can be reproduced via our APIs.

Usage: We recommend using the demo code and tutorials available on our GitHub repository.

Citation: If you find this work useful, please consider citing us:

@article{disalvo2024medmnist,
  title={MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions},
  author={Di Salvo, Francesco and Doerrich, Sebastian and Ledig, Christian},
  journal={arXiv preprint arXiv:2406.17536},
  year={2024}
}

Disclaimer: This repository is inspired by MedMNIST APIs and the ImageNet-C repository. Thus, please also consider citing MedMNIST, the respective source datasets (described here), and ImageNet-C.
Data from: MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital...
physionet.org
Updated Feb 3, 2025
Cite
Asad Aali; Dave Van Veen; Yamin Arefeen; Jason Hom; Christian Bluethgen; Eduardo Pontes Reis; Sergios Gatidis; Namuun Clifford; Joseph Daws; Arash Tehrani; Jangwon Kim; Akshay Chaudhari (2025). MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization [Dataset]. http://doi.org/10.13026/5gte-bv70
Explore at:
Unique identifier
https://doi.org/10.13026/5gte-bv70
Dataset updated
Feb 3, 2025
Authors
Asad Aali; Dave Van Veen; Yamin Arefeen; Jason Hom; Christian Bluethgen; Eduardo Pontes Reis; Sergios Gatidis; Namuun Clifford; Joseph Daws; Arash Tehrani; Jangwon Kim; Akshay Chaudhari
License
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Description
This dataset presents a curated collection of preprocessed and labeled clinical notes derived from the MIMIC-IV-Note database. The primary aim of this resource is to facilitate the development and training of machine learning models focused on summarizing brief hospital courses (BHC) from clinical discharge notes.

The dataset contains 270,033 cleaned and standardized clinical notes with an average length of 2,267 tokens, ensuring usability for machine learning (ML) applications. Each clinical note is paired with a corresponding BHC summary, providing a robust foundation for supervised learning tasks. The preprocessing pipeline uses regular expressions to address common issues in the raw clinical text, such as special characters, extraneous whitespace, inconsistent formatting, and irrelevant text, to produce a high-quality, structured dataset in which clinical note sections are separated by appropriate headings.
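The kind of regular-expression cleaning described above might look like the following sketch. These patterns are illustrative guesses at the general approach, not the authors' actual pipeline, and `clean_note` is a hypothetical name.

```python
import re


def clean_note(text: str) -> str:
    """Normalize whitespace and strip non-printable characters from a note.

    Illustrative only: the real preprocessing rules are not given in the
    dataset description.
    """
    text = text.replace("\r\n", "\n")            # unify line endings
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)  # drop tabs, control and non-ASCII chars
    text = re.sub(r" {2,}", " ", text)           # collapse runs of spaces
    text = re.sub(r" *\n *", "\n", text)         # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)       # at most one blank line in a row
    return text.strip()
```

Note that the second pattern also removes legitimate non-ASCII characters, which may or may not be acceptable depending on the notes; a real pipeline would likely be more selective.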

By offering this resource, we aim to support healthcare professionals and researchers in their efforts to enhance patient care through the automation of BHC summarization. This dataset is ideal for exploring various NLP techniques, developing predictive models, and improving the efficiency and accuracy of clinical documentation practices. We invite the research community to utilize this dataset to advance the field of medical informatics and contribute to better health outcomes.
An Applicable Dataset of Electronic Health Records with a Focus on CTA...
figshare.com
xlsx
Updated Oct 27, 2022
Cite
Ahmad Ebrahimi; Soroor Laffafchi; samira kafan (2022). An Applicable Dataset of Electronic Health Records with a Focus on CTA Results in Pulmonary Embolism Disease [Dataset]. http://doi.org/10.6084/m9.figshare.21308463.v3
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.21308463.v3
Dataset updated
Oct 27, 2022
Dataset provided by
figshare
Authors
Ahmad Ebrahimi; Soroor Laffafchi; samira kafan
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
A new dataset of EHR data covering patients with suspected pulmonary embolism (PE), including PE-positive patients, healthy patients, and COVID-19 patients suspected of PE. The dataset includes PE diagnosis based on CTA imaging results, biographic data, vital signs, laboratory test results, past medical history, and medications. This dataset can be used to advance PE studies based on machine learning and artificial intelligence.
Data from: Clinical Dataset
kaggle.com
Updated Oct 5, 2023
Cite
Mohamadreza Momeni (2023). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset/discussion
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Oct 5, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mohamadreza Momeni
Description
The purest type of electronic clinical data is that obtained at the point of care at a medical facility, hospital, clinic, or practice. It is often referred to as the electronic medical record (EMR), and the EMR is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories for eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

About Dataset:

333 scholarly articles cite this dataset.

Unique identifier: DOI

Dataset updated: 2023

Authors: Haoyang Mi

This dataset contains two datasets:

1- Clinical Data_Discovery_Cohort: Name of columns: Patient ID Specimen date Dead or Alive Date of Death Date of last Follow Sex Race Stage Event Time

2- Clinical_Data_Validation_Cohort Name of columns: Patient ID Survival time (days) Event Tumor size Grade Stage Age Sex Cigarette Pack per year Type Adjuvant Batch EGFR KRAS

Feel free to share your thoughts and analysis of these datasets in a notebook. You can also create some interesting and valuable ML projects with them. Thanks for your attention.

Cancer	Classes	Images
Acute Lymphoblastic Leukemia	4	20,000
Brain Cancer	3	15,000
Breast Cancer	2	10,000
Cervical Cancer	5	25,000
Kidney Cancer	2	10,000
Lung and Colon Cancer	5	25,000
Lymphoma	3	15,000
Oral Cancer	2	10,000
