31 datasets found
  1. Transcribed Medical Records datasets for Machine Learning

    • shaip.com
    json
    Updated Jun 15, 2025
    Cite
    Shaip (2025). Transcribed Medical Records datasets for Machine Learning [Dataset]. https://www.shaip.com/offerings/transcribed-medical-records-medical-data-catalog/
    Explore at:
    Available download formats: json
    Dataset updated
    Jun 15, 2025
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Get premium quality Off-the-shelf transcribed medical records dataset to develop better performing machine learning models. Deep domain expertise. Fast & Cost-effective.

  2. Pixta AI | Imagery Data | Global | High volume | Annotation and Labelling...

    • datarade.ai
    .json, .xml, .csv
    Updated Jul 19, 2023
    Cite
    Pixta AI (2023). Pixta AI | Imagery Data | Global | High volume | Annotation and Labelling Services Provided | Multimodal Medical Images OTS Datasets for AI and ML [Dataset]. https://datarade.ai/data-products/multimodal-medical-image-ots-datasets-pixta-ai
    Explore at:
    Available download formats: .json, .xml, .csv
    Dataset updated
    Jul 19, 2023
    Dataset authored and provided by
    Pixta AI
    Area covered
    Pitcairn, Uruguay, French Polynesia, Haiti, Malaysia, Guernsey, Serbia, Lebanon, Maldives, Montenegro
    Description
    1. Overview
    This dataset is a collection of multimodal high quality image sets of medical data that are ready to use for optimizing the accuracy of computer vision models. All of the contents are sourced from Pixta AI's partner network with high quality & full data compliance.

    2. Data subject
    The datasets consist of various models:
    • X-ray datasets
    • CT datasets
    • MRI datasets
    • Mammography datasets
    • Segmentation datasets
    • Classification datasets
    • Regression datasets

    3. Use case
    The dataset could be used for various Healthcare & Medical models:
    • Medical Image Analysis
    • Remote Diagnosis
    • Medical Record Keeping
    ...
    Each data set is supported by both AI and expert doctors review process to ensure labelling consistency and accuracy. Contact us for more custom datasets.

    4. About PIXTA
    PIXTASTOCK is the largest Asian-featured stock platform providing data, contents, tools and services since 2005. PIXTA experiences 15 years of integrating advanced AI technology in managing, curating, processing over 100M visual materials and serving global leading brands for their creative and data demands. Visit us at https://www.pixta.ai/ or contact via our email admin.bi@pixta.co.jp.

  3. Hospital Management Dataset

    • kaggle.com
    Updated May 30, 2025
    Cite
    Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanak Baghel
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

    Dataset Overview

    This dataset includes five CSV files:

    1. patients.csv – Patient demographics, contact details, registration info, and insurance data

    2. doctors.csv – Doctor profiles with specializations, experience, and contact information

    3. appointments.csv – Appointment dates, times, visit reasons, and statuses

    4. treatments.csv – Treatment types, descriptions, dates, and associated costs

    5. billing.csv – Billing amounts, payment methods, and status linked to treatments

    📁 Files & Column Descriptions

    patients.csv

    Contains patient demographic and registration details.

    Column -> Description

    patient_id -> Unique ID for each patient
    first_name -> Patient's first name
    last_name -> Patient's last name
    gender -> Gender (M/F)
    date_of_birth -> Date of birth
    contact_number -> Phone number
    address -> Address of the patient
    registration_date -> Date of first registration at the hospital
    insurance_provider -> Insurance company name
    insurance_number -> Policy number
    email -> Email address

    doctors.csv

    Details about the doctors working in the hospital.

    Column -> Description

    doctor_id -> Unique ID for each doctor
    first_name -> Doctor's first name
    last_name -> Doctor's last name
    specialization -> Medical field of expertise
    phone_number -> Contact number
    years_experience -> Total years of experience
    hospital_branch -> Branch of hospital where doctor is based
    email -> Official email address

    appointments.csv

    Records of scheduled and completed patient appointments.

    Column -> Description

    appointment_id -> Unique appointment ID
    patient_id -> ID of the patient
    doctor_id -> ID of the attending doctor
    appointment_date -> Date of the appointment
    appointment_time -> Time of the appointment
    reason_for_visit -> Purpose of visit (e.g., checkup)
    status -> Status (Scheduled, Completed, Cancelled)

    treatments.csv

    Information about the treatments given during appointments.

    Column -> Description

    treatment_id -> Unique ID for each treatment
    appointment_id -> Associated appointment ID
    treatment_type -> Type of treatment (e.g., MRI, X-ray)
    description -> Notes or procedure details
    cost -> Cost of treatment
    treatment_date -> Date when treatment was given

    billing.csv

    Billing and payment details for treatments.

    Column -> Description

    bill_id -> Unique billing ID
    patient_id -> ID of the billed patient
    treatment_id -> ID of the related treatment
    bill_date -> Date of billing
    amount -> Total amount billed
    payment_method -> Mode of payment (Cash, Card, Insurance)
    payment_status -> Status of payment (Paid, Pending, Failed)
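
    The ID columns described above (patient_id, appointment_id, treatment_id) are what link the five files together. The following is a minimal sketch, not part of the dataset, of joining billing.csv to treatments.csv and totalling paid amounts per patient and treatment type with Python's standard csv module. The column names come from the descriptions above; the sample rows, the ID formats (P001, T001, ...) and the helper names are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows standing in for the real files.
# With the actual dataset you would open "treatments.csv" and "billing.csv" instead.
TREATMENTS_CSV = """treatment_id,appointment_id,treatment_type,description,cost,treatment_date
T001,A001,MRI,Brain scan,3000.00,2023-01-10
T002,A002,X-ray,Chest image,500.00,2023-01-12
T003,A003,MRI,Knee scan,2800.00,2023-02-01
"""

BILLING_CSV = """bill_id,patient_id,treatment_id,bill_date,amount,payment_method,payment_status
B001,P001,T001,2023-01-10,3000.00,Insurance,Paid
B002,P002,T002,2023-01-12,500.00,Cash,Pending
B003,P001,T003,2023-02-01,2800.00,Card,Paid
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def billed_per_patient(billing_rows, treatment_rows, status="Paid"):
    """Sum billed amounts per (patient_id, treatment_type) for one payment status."""
    # Index treatments by their key so each bill can look up its treatment.
    treatments = {row["treatment_id"]: row for row in treatment_rows}
    totals = defaultdict(float)
    for bill in billing_rows:
        if bill["payment_status"] != status:
            continue
        treatment = treatments.get(bill["treatment_id"])
        if treatment is None:
            continue  # bill refers to a treatment that is not in treatments.csv
        key = (bill["patient_id"], treatment["treatment_type"])
        totals[key] += float(bill["amount"])
    return dict(totals)


paid = billed_per_patient(read_rows(BILLING_CSV), read_rows(TREATMENTS_CSV))
print(paid)
```

    The same lookup pattern extends to appointments.csv (via appointment_id) and doctors.csv (via doctor_id); in SQL each lookup would be a join on the shared key.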

    Possible Use Cases

    SQL queries and relational database design

    Exploratory data analysis (EDA) and dashboarding

    Machine learning projects (e.g., cost prediction, no-show analysis)

    Feature engineering and data cleaning practice

    End-to-end healthcare analytics workflows

    Recommended Tools & Resources

    SQL (joins, filters, window functions)

    Pandas and Matplotlib/Seaborn for EDA

    Scikit-learn for ML models

    Pandas Profiling for automated EDA

    Plotly for interactive visualizations

    Please note:

    All data is synthetically generated for educational and project use. No real patient information is included.

    If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.

  4. Dataset for Automated Medical Transcription

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 15, 2023
    Cite
    Nazmul Kazi; Matt Kuntz; Upulee Kanewala; Indika Kahanda (2023). Dataset for Automated Medical Transcription [Dataset]. http://doi.org/10.5281/zenodo.4279041
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nazmul Kazi; Matt Kuntz; Upulee Kanewala; Indika Kahanda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We generated this dataset to train a machine learning model for automatically generating psychiatric case notes from doctor-patient conversations. Since we didn't have access to real doctor-patient conversations, we used transcripts from two different sources to generate audio recordings of enacted conversations between a doctor and a patient. We employed eight students who worked in pairs to generate these recordings. Six of the transcripts that we used to produce these recordings were hand-written by Cheryl Bristow, and the rest were adapted from Alexander Street transcripts, which were generated from real doctor-patient conversations. Our study requires recording the doctor and the patient(s) in separate channels, which is the primary reason for generating our own audio recordings of the conversations.

    We used Google Cloud Speech-To-Text API to transcribe the enacted recordings. These newly generated transcripts are auto-generated entirely using AI powered automatic speech recognition whereas the source transcripts are either hand-written or fine-tuned by human transcribers (transcripts from Alexander Street).

    We provided the generated transcripts back to the students and asked them to write case notes. The students worked independently using software that we developed earlier for this purpose. The students had past experience writing case notes, and we let them write case notes as they practiced, without any training or instructions from us.

    NOTE: Audio recordings are not included in Zenodo due to large file size but they are available in the GitHub repository.

  5. EMRBots: a 100,000-patient database

    • figshare.com
    zip
    Updated Sep 3, 2018
    + more versions
    Cite
    Uri Kartoun (2018). EMRBots: a 100,000-patient database [Dataset]. http://doi.org/10.6084/m9.figshare.7040198.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 3, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Uri Kartoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A 100,000-patient database that contains in total 100,000 virtual patients, 361,760 admissions, and 107,535,387 lab observations.

  6. Data from: MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D...

    • zenodo.org
    bin
    Updated Aug 17, 2023
    Cite
    Jiancheng Yang; Rui Shi; Donglai Wei; Zequan Liu; Lin Zhao; Bilian Ke; Hanspeter Pfister; Bingbing Ni (2023). MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification [Dataset]. http://doi.org/10.5281/zenodo.5208230
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jiancheng Yang; Rui Shi; Donglai Wei; Zequan Liu; Lin Zhao; Bilian Ke; Hanspeter Pfister; Bingbing Ni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Note: We recommend downloading from the official Zenodo link, which is integrated with our code. If you run into download problems, you can also use the mirror link on Google Drive.

    Abstract

    We introduce MedMNIST v2, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of 708,069 2D images and 10,214 3D images in total, could support numerous research / educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST v2, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.

    Note: This dataset is NOT intended for clinical use.

    We recommend our official code to download, parse and use the MedMNIST dataset:

    pip install medmnist

    Citation

    If you find this project useful, please cite both the v1 and v2 papers as:

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification". arXiv preprint arXiv:2110.14795, 2021.
    
    Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

    or using bibtex:

    @article{medmnistv2,
      title={MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification},
      author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
      journal={arXiv preprint arXiv:2110.14795},
      year={2021}
    }
    
    @inproceedings{medmnistv1,
      title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
      author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
      booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
      pages={191--195},
      year={2021}
    }

    Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.

    License

    The dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

    The code is under Apache-2.0 License.

  7. Medical Diagnostic Fitness Dataset

    • kaggle.com
    Updated Aug 2, 2025
    Cite
    Maaz Khan (2025). Medical Diagnostic Fitness Dataset [Dataset]. https://www.kaggle.com/datasets/maazkhan636/medical-diagnostic-fitness-dataset/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Maaz Khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a synthetic simulation of medical fitness reports collected from routine health check-ups. It contains 5,000 rows representing individuals' vitals, lab test results, physical exam notes, and disease screenings.

    Each row is labeled with a FIT or UNFIT outcome to indicate overall health status based on the data. This dataset can be used for machine learning classification tasks, health analytics, and smart form automation.

    Use Cases:

    Medical classification model (FIT/UNFIT)

    EDA & visualization

    Data cleaning & preprocessing

    Health dashboards

    ML feature engineering

    Note: All values are randomly generated for educational purposes and do not represent real individuals.

  8. Data from: OpenChart-SE: A corpus of artificial Swedish electronic health...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, txt
    Updated Jul 15, 2024
    Cite
    Johanna Berg; Carl Ollvik Aasa; Björn Appelgren Thorell; Sonja Aits (2024). OpenChart-SE: A corpus of artificial Swedish electronic health records for imagined emergency care patients written by physicians in a crowd-sourcing project [Dataset]. http://doi.org/10.5281/zenodo.7499831
    Explore at:
    Available download formats: txt, csv, bin, pdf
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johanna Berg; Carl Ollvik Aasa; Björn Appelgren Thorell; Sonja Aits
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) are a rich source of information for medical research and public health monitoring. Information systems based on EHR data could also assist in patient care and hospital management. However, much of the data in EHRs is in the form of unstructured text, which is difficult to process for analysis. Natural language processing (NLP), a form of artificial intelligence, has the potential to enable automatic extraction of information from EHRs and several NLP tools adapted to the style of clinical writing have been developed for English and other major languages. In contrast, the development of NLP tools for less widely spoken languages such as Swedish has lagged behind. A major bottleneck in the development of NLP tools is the restricted access to EHRs due to legitimate patient privacy concerns. To overcome this issue we have generated a citizen science platform for collecting artificial Swedish EHRs with the help of Swedish physicians and medical students. These artificial EHRs describe imagined but plausible emergency care patients in a style that closely resembles EHRs used in emergency departments in Sweden. In the pilot phase, we collected a first batch of 50 artificial EHRs, which has passed review by an experienced Swedish emergency care physician. We make this dataset publicly available as OpenChart-SE corpus (version 1) under an open-source license for the NLP research community. The project is now open for general participation and Swedish physicians and medical students are invited to submit EHRs on the project website (https://github.com/Aitslab/openchart-se), where additional batches of quality-controlled EHRs will be released periodically.

    Dataset content

    OpenChart-SE, version 1 corpus (txt files and dataset.csv)

    The OpenChart-SE corpus, version 1, contains 50 artificial EHRs (note that the numbering starts with 5 as 1-4 were test cases that were not suitable for publication). The EHRs are available in two formats, structured as a .csv file and as separate textfiles for annotation. Note that flaws in the data were not cleaned up so that it simulates what could be encountered when working with data from different EHR systems. All charts have been checked for medical validity by a resident in Emergency Medicine at a Swedish hospital before publication.

    Codebook.xlsx

    The codebook contains information about each variable used. It is in XLSForm format, which can be re-used in several different applications for data collection.

    suppl_data_1_openchart-se_form.pdf

    OpenChart-SE mock emergency care EHR form.

    suppl_data_3_openchart-se_dataexploration.ipynb

    This jupyter notebook contains the code and results from the analysis of the OpenChart-SE corpus.

    More details about the project and information on the upcoming preprint accompanying the dataset can be found on the project website (https://github.com/Aitslab/openchart-se).

  9. Synthetic Synthea patient datasets for lung cancer risk prediction machine...

    • data.mendeley.com
    Updated Oct 31, 2022
    + more versions
    Cite
    Anjun Chen (2022). Synthetic Synthea patient datasets for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.17632/b24cb4nn8h.1
    Explore at:
    Dataset updated
    Oct 31, 2022
    Authors
    Anjun Chen
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction and simulation study of learning health systems.

    1. In subfolder "unconverted": Five populations of 30K patients were generated by the Synthea patient generator. About 1100 lung cancer patients and 3000 control patients (without lung cancer) were selected and their electronic health records (EHR) were processed to data table files ready for machine learning using common algorithms like XGBoost.

    2. In root directory: The five 30K-patient datasets were combined sequentially to form 5 different size datasets, from 30K to 150K patients. The new datasets were resampled to keep all lung cancer patients plus about 3x control patients. The ML-ready table files also had the continuous numeric values converted to categorical values.
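
    The two processing steps described above (keeping every lung cancer patient plus about 3x as many controls, and converting continuous values to categories) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the field names (lung_cancer, age), the bin edges and the random population are hypothetical.

```python
import random


def resample_case_control(patients, ratio=3, seed=0):
    """Keep every case and draw about `ratio` controls per case at random."""
    cases = [p for p in patients if p["lung_cancer"]]
    controls = [p for p in patients if not p["lung_cancer"]]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n_controls = min(len(controls), ratio * len(cases))
    return cases + rng.sample(controls, n_controls)


def to_category(value, edges, labels):
    """Map a continuous value to a label.

    `edges` holds the exclusive upper bound of every bin except the last,
    so `labels` must have one more entry than `edges`.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


# Hypothetical population: 1,000 patients, 40 of them with lung cancer.
population = [
    {"id": i, "lung_cancer": i < 40, "age": 30 + i % 50}
    for i in range(1000)
]

sample = resample_case_control(population)
for patient in sample:
    # Assumed age bins, for illustration only.
    patient["age_group"] = to_category(
        patient["age"], edges=[45, 65], labels=["<45", "45-64", "65+"]
    )

print(len(sample), sum(p["lung_cancer"] for p in sample))
```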

    Because Synthea patients closely resemble real patients, the Synthea patient data can be used to develop and test ML algorithms and pipelines, and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns.

    The first LHS simulation study titled "Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data" has been published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).

  10. A machine learning approach for master patient index record linkage and...

    • esango.cput.ac.za
    csv
    Updated Aug 11, 2025
    Cite
    Dane Hollenbach (2025). A machine learning approach for master patient index record linkage and deduplication [Dataset]. http://doi.org/10.25381/cput.28593101.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    Cape Peninsula University of Technology
    Authors
    Dane Hollenbach
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Ethics Reference No: 209113723/2023/1

    Source Code is available on Github. The datasets are used to reproduce the same results: https://github.com/DHollenbach/record-linkage-and-deduplication/blob/main/README.md

    Abstract:

    The research emphasised the vital role of a Master Patient Index (MPI) solution in addressing the challenges public healthcare facilities face in eliminating duplicate patient records and improving record linkage. The study recognised that traditional MPI systems may have limitations in terms of efficiency and accuracy. To address this, the study focused on utilising machine learning techniques to enhance the effectiveness of MPI systems, aiming to support the growing record linkage healthcare ecosystem.

    It was essential to highlight that integrating machine learning into MPI systems is crucial for optimising their capabilities. The study aimed to improve data linking and deduplication processes within MPI systems by leveraging machine learning techniques. This emphasis on machine learning represented a significant shift towards more sophisticated and intelligent healthcare technologies. Ultimately, the goal was to ensure safe and efficient patient care, benefiting individuals and the broader healthcare industry.

    This research investigated the performance of five machine learning classification algorithms (random forests, extreme gradient boosting, logistic regression, stacking ensemble, and deep multilayer perceptron) for data linkage and deduplication on four datasets. These techniques improved data linking and deduplication for use in an MPI system.

    The findings demonstrate the applicability of machine learning models for effective data linkage and deduplication of electronic health records. The random forest algorithm achieved the best performance (identifying duplicates correctly) based on accuracy, F1-Score, and AUC-score for three datasets (Electronic Practice-Based Research Network (ePBRN): Acc = 99.83%, F1-score = 81.09%, AUC = 99.98%; Freely Extensible Biomedical Record Linkage (FEBRL) 3: Acc = 99.55%, F1-score = 96.29%, AUC = 99.77%; Custom-synthetic: Acc = 99.98%, F1-score = 99.18%, AUC = 99.99%). In contrast, the experimentation on the FEBRL4 dataset revealed that the Multi-Layer Perceptron Artificial Neural Network (MLP-ANN) and logistic regression algorithms outperformed the random forest algorithm. The performance results for the MLP-ANN were (FEBRL4: Acc = 99.93%, F1-score = 96.95%, AUC = 99.97%). For the logistic regression algorithm, the results were (FEBRL4: Acc = 99.99%, F1 = 96.91%, AUC = 99.97%).

    In conclusion, the results of this research have significant implications for the healthcare industry, as they are expected to enhance the utilisation of MPI systems and improve their effectiveness in the record linkage healthcare ecosystem. By improving patient record linking and deduplication, healthcare providers can ensure safer and more efficient care, ultimately benefiting patients and the industry.
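
    Accuracy and F1-score can diverge sharply when true duplicates are rare among the candidate record pairs, which is consistent with a result such as Acc = 99.83% alongside F1-score = 81.09%. The sketch below shows the standard definitions of both metrics; the confusion-matrix counts are hypothetical and are not taken from the study.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all candidate pairs that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall on the 'duplicate' class.

    Assumes at least one predicted and one actual duplicate,
    otherwise the divisions below are undefined.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 100,000 candidate pairs, of which only 102 are true duplicates.
tp, fp, fn, tn = 90, 30, 12, 99_868

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy = {acc:.4f}")
print(f"F1-score = {f1:.4f}")
```

    Because the many true non-matches dominate the accuracy figure, the F1-score is usually the more informative of the two for deduplication tasks.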

  11. Electronic Health Records (EHR) Datasets

    • shaip.com
    json
    Updated Apr 8, 2022
    Cite
    Shaip (2022). Electronic Health Records (EHR) Datasets [Dataset]. https://www.shaip.com/offerings/electronic-health-records-ehr-medical-data-catalog/
    Explore at:
    Available download formats: json
    Dataset updated
    Apr 8, 2022
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Get premium quality off-the-shelf EHR dataset to develop better performing machine learning models. Speak to our experts for Electronic Health Records data needs.

  12. Multi Cancer Dataset

    • kaggle.com
    Updated Oct 3, 2024
    Cite
    Obuli Sai Naren (2024). Multi Cancer Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/9537604
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Obuli Sai Naren
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    🩺 Multi Cancer Dataset - 8 Types of Cancer Images

    Overview

    This dataset contains images of various cancer types, compiled for research and analysis purposes. It includes 8 main cancer classes and 26 subclasses, providing a rich resource for medical image classification and machine learning applications.

    📝 Citation

    If you use this dataset in your research or project, please make sure to cite it appropriately. Thanks! ❤️ You can check DOI Citation section at the bottom.

    APA

    Obuli Sai Naren. (2022). Multi Cancer Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/3415848

    📊 Dataset Details

    Cancer                          Classes   Images
    Acute Lymphoblastic Leukemia    4         20,000
    Brain Cancer                    3         15,000
    Breast Cancer                   2         10,000
    Cervical Cancer                 5         25,000
    Kidney Cancer                   2         10,000
    Lung and Colon Cancer           5         25,000
    Lymphoma                        3         15,000
    Oral Cancer                     2         10,000

    Total Images: 130,000
    Format: JPEG
    Dimensions: 512px × 512px

    📂 Folder Structure & Class Names

    Each subclass folder contains 5,000 images. The datasets referenced for each cancer type are linked below.

    📄 Notes on Images

    • All subclass folders contain 5,000 images each.
    • Each image follows the naming format <subclass>_<serial_number>.jpg for easy reference.
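
    As a rough sketch of working with this layout, the snippet below parses the <subclass>_<serial_number>.jpg naming format and checks that the per-cancer subclass counts (read from the dataset details as 4, 3, 2, 5, 2, 5, 3, 2) agree with the stated totals. The example path and the subclass name brain_glioma are hypothetical; the real folder and subclass names are given in the dataset's README.md.

```python
from pathlib import PurePosixPath

IMAGES_PER_SUBCLASS = 5_000

# Subclass counts per cancer type, as read from the dataset details.
SUBCLASSES = {
    "Acute Lymphoblastic Leukemia": 4,
    "Brain Cancer": 3,
    "Breast Cancer": 2,
    "Cervical Cancer": 5,
    "Kidney Cancer": 2,
    "Lung and Colon Cancer": 5,
    "Lymphoma": 3,
    "Oral Cancer": 2,
}


def parse_image_name(path):
    """Split '<subclass>_<serial_number>.jpg' into (subclass, serial_number).

    The split is on the last underscore, so subclass names that contain
    underscores themselves are kept intact.
    """
    stem = PurePosixPath(path).stem
    subclass, sep, serial = stem.rpartition("_")
    if not sep or not subclass or not serial.isdigit():
        raise ValueError(f"unexpected file name: {path}")
    return subclass, int(serial)


total_subclasses = sum(SUBCLASSES.values())
total_images = total_subclasses * IMAGES_PER_SUBCLASS
print(total_subclasses, total_images)

# Hypothetical path, for illustration only.
print(parse_image_name("Brain Cancer/brain_glioma/brain_glioma_0001.jpg"))
```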

    For more detailed information on the dataset structure, preprocessing, and usage, please refer to the README.md file included in the dataset's main directory.

    Feel free to download, analyze, and contribute! 📊💻

  13. Table1_Generalizability of machine learning methods in detecting adverse...

    • frontiersin.figshare.com
    docx
    Updated Jul 12, 2023
    Cite
    Md Muntasir Zitu; Shijun Zhang; Dwight H. Owen; Chienwei Chiang; Lang Li (2023). Table1_Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records.DOCX [Dataset]. http://doi.org/10.3389/fphar.2023.1218679.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    Frontiers
    Authors
    Md Muntasir Zitu; Shijun Zhang; Dwight H. Owen; Chienwei Chiang; Lang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We assessed the generalizability of machine learning methods using natural language processing (NLP) techniques to detect adverse drug events (ADEs) from clinical narratives in electronic medical records (EMRs). We constructed a new corpus correlating drugs with adverse drug events using 1,394 clinical notes of 47 randomly selected patients who received immune checkpoint inhibitors (ICIs) from 2011 to 2018 at The Ohio State University James Cancer Hospital, annotating 189 drug-ADE relations in single sentences within the medical records. We also used data from Harvard’s publicly available 2018 National Clinical Challenge (n2c2), which includes 505 discharge summaries with annotations of 1,355 single-sentence drug-ADE relations. We applied classical machine learning (support vector machine (SVM)), deep learning (convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM)), and state-of-the-art transformer-based (bidirectional encoder representations from transformers (BERT) and ClinicalBERT) methods trained and tested in the two different corpora and compared performance among them to detect drug–ADE relationships. ClinicalBERT detected drug–ADE relationships better than the other methods when trained using our dataset and tested in n2c2 (ClinicalBERT F-score, 0.78; other methods, F-scores, 0.61–0.73) and when trained using the n2c2 dataset and tested in ours (ClinicalBERT F-score, 0.74; other methods, F-scores, 0.55–0.72). Comparison among several machine learning methods demonstrated the superior performance and, therefore, the greatest generalizability of findings of ClinicalBERT for the detection of drug–ADE relations from clinical narratives in electronic medical records.

  14. Health Care Analytics

    • kaggle.com
    Updated Jan 10, 2022
    Cite
    Abishek Sudarshan (2022). Health Care Analytics [Dataset]. https://www.kaggle.com/datasets/abisheksudarshan/health-care-analytics
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abishek Sudarshan
    Description

    Context

    Part of Janatahack Hackathon in Analytics Vidhya

    Content

    The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.

    MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. Those who attend can undergo health checks or learn about health issues by visiting various stalls (depending on the format of the camp).

    MedCamp has conducted 65 such events over a period of 4 years and sees a high drop-off between “Registration” and the number of people taking tests at the camps. Over those 4 years, it has stored data on ~110,000 registrations.

    One of the major costs in arranging these camps is the inventory that must be carried. Carrying more inventory than required incurs unnecessarily high costs. On the other hand, carrying less than is required to conduct the medical checks leaves people with a bad experience.

    The Process:

    MedCamp employees / volunteers reach out to people and drive registrations.
    During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
    

    Other things to note:

    Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
    For a few camps, there was hardware failure, so some information about date and time of registration is lost.
    MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
    

    Favorable outcome:

    For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
    You need to predict the chances (probability) of having a favourable outcome.
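    The outcome definition above can be sketched as a small labeling function. This is only an illustration: the function name, the argument names and the integer encoding of the three camp formats are assumptions, not the dataset's actual schema.

```python
from typing import Optional


def favourable_outcome(camp_format: int,
                       health_score: Optional[float],
                       stalls_visited: int) -> int:
    """Return 1 for a favourable outcome, 0 otherwise.

    camp_format is assumed to be 1, 2 or 3 (hypothetical encoding).
    """
    if camp_format in (1, 2):
        # Formats 1 and 2: favourable means a health score was obtained.
        return int(health_score is not None)
    if camp_format == 3:
        # Format 3: favourable means at least one stall was visited.
        return int(stalls_visited >= 1)
    raise ValueError(f"unknown camp format: {camp_format}")
```

    A model would then be trained to predict the probability that this label equals 1 for each registration.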
    

    Train / Test split:

    Camps that started on or before 31st March 2006 make up the training set.
    The test set covers all camps conducted on or after 1st April 2006.
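    A date-based split like the one described above might look as follows. This is a sketch under assumptions: each camp is represented as a hypothetical (camp_id, start_date) pair, and the real files may store dates differently.

```python
from datetime import date

# Cutoff stated in the description: camps up to 31 March 2006 are training data.
TRAIN_CUTOFF = date(2006, 3, 31)


def split_camps(camps):
    """Split (camp_id, start_date) pairs into train and test lists by start date."""
    train = [camp for camp in camps if camp[1] <= TRAIN_CUTOFF]
    test = [camp for camp in camps if camp[1] > TRAIN_CUTOFF]
    return train, test


# Hypothetical camps, for illustration only.
camps = [
    ("camp_a", date(2005, 11, 2)),
    ("camp_b", date(2006, 3, 31)),
    ("camp_c", date(2006, 4, 1)),
]
train, test = split_camps(camps)
```

    Splitting by time rather than at random keeps later camps out of training, which mirrors how the model would be used on future camps.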
    

    Acknowledgements

    Credits to AV

    Inspiration

    To share with the data science community to jump start their journey in Healthcare Analytics

  15. Transcribed Medical Records Datasets for Machine Learning

    • nl.shaip.com
    json
    Updated Dec 7, 2022
    Cite
    Shaip (2022). Transcribed Medical Records Datasets for Machine Learning [Dataset]. https://nl.shaip.com/offerings/transcribed-medical-records-medical-data-catalog/
    Explore at:
    Available download formats: json
    Dataset updated
    Dec 7, 2022
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Get high-quality, off-the-shelf transcribed medical datasets to develop better-performing machine learning models. In-depth domain expertise. Fast and cost-effective.

  16. Hand Washing Video Dataset Annotated According to the World Health...

    • data.niaid.nih.gov
    Updated Jan 3, 2022
    Cite
    Sabelnikovs, Olegs (2022). Hand Washing Video Dataset Annotated According to the World Health Organization's Handwashing Guidelines - METC Subset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5808788
    Explore at:
    Dataset updated
    Jan 3, 2022
    Dataset provided by
    Elsts, Atis
    Sabelnikovs, Olegs
    Zemlanuhina, Olga
    Slavinska, Andreta
    Vilde, Aija
    Melbārde-Kelmere, Agita
    Ivanovs, Maksims
    Lulla, Martins
    Rutkovskis, Aleksejs
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This is a lab-based dataset with videos recording volunteers (medical students) washing their hands as part of a hand-washing monitoring and feedback experiment. The dataset was collected in the Medical Education Technology Center (METC) of Riga Stradins University, Riga, Latvia. In total, 72 participants took part in the experiments, each washing their hands three times, in a randomized order, going through three different hand-washing feedback approaches (user interfaces of a mobile app). The data was annotated in real time by a human operator, in order to give the experiment participants real-time feedback on their performance. There are 212 hand washing episodes in total, each of which is annotated by a single person. The annotations classify the washing movements according to the World Health Organization's (WHO) guidelines by marking each frame in each video with a certain movement code.

    This dataset is part of a series of three datasets, all following the same format:

    https://zenodo.org/record/4537209 - data collected in Pauls Stradins Clinical University Hospital

    https://zenodo.org/record/5808764 - data collected in Jurmala Hospital

    https://zenodo.org/record/5808789 - data collected in the Medical Education Technology Center (METC) of Riga Stradins University

    Note #1: we recommend that when using this dataset for machine learning, allowances are made for the reaction speed of the human operator labeling the data. For example, the annotations can be expected to be incorrect a short while after the person in the video switches their washing movements.
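    One simple way to make such an allowance is to shift the per-frame labels earlier in time by an assumed operator reaction delay. The sketch below is only an illustration: the function names and the 0.5 s delay are assumptions, and the dataset authors do not prescribe a specific correction.

```python
def delay_in_frames(delay_seconds: float, fps: float = 16.0) -> int:
    """Convert an assumed reaction delay to a whole number of frames."""
    return round(delay_seconds * fps)


def shift_labels(labels, delay_frames: int):
    """Move every label delay_frames earlier; pad the tail with the last label."""
    if delay_frames < 0:
        raise ValueError("delay_frames must be non-negative")
    if not labels:
        return []
    shifted = list(labels[delay_frames:])
    # Frames at the end have no later label to borrow, so repeat the final one.
    shifted += [labels[-1]] * (len(labels) - len(shifted))
    return shifted


# Hypothetical per-frame movement codes: the operator switches from 1 to 2 late.
labels = [1, 1, 1, 2, 2, 2]
corrected = shift_labels(labels, delay_frames=1)
```

    An alternative with the same intent is to drop or down-weight the frames immediately after each label change instead of relabeling them.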

    Application: The intention of this dataset is to serve as a basis for training machine learning classifiers for automated hand washing movement recognition and quality control.

    Statistics:

    Frame rate: ~16 FPS (slightly variable, as the videos are reconstructed from a sequence of JPG images taken at the maximum frame rate supported by the capturing devices).

    Resolution: 640x480

    Number of videos: 212

    Number of annotation files: 212

    Movement codes (in JSON files):

    1: Hand washing movement — Palm to palm

    2: Hand washing movement — Palm over dorsum, fingers interlaced

    3: Hand washing movement — Palm to palm, fingers interlaced

    4: Hand washing movement — Backs of fingers to opposing palm, fingers interlocked

    5: Hand washing movement — Rotational rubbing of the thumb

    6: Hand washing movement — Fingertips to palm

    0: Other hand washing movement
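    The movement codes listed above can be kept in a lookup table and used to summarize a video. This is a minimal sketch: it assumes the per-frame codes have already been read into a Python list (the layout of the JSON annotation files is not described here), and it uses the nominal 16 FPS frame rate.

```python
from collections import Counter

# Movement codes as listed in the dataset description.
MOVEMENT_CODES = {
    0: "Other hand washing movement",
    1: "Palm to palm",
    2: "Palm over dorsum, fingers interlaced",
    3: "Palm to palm, fingers interlaced",
    4: "Backs of fingers to opposing palm, fingers interlocked",
    5: "Rotational rubbing of the thumb",
    6: "Fingertips to palm",
}


def seconds_per_movement(frame_codes, fps: float = 16.0):
    """Approximate time spent on each movement, given one code per frame."""
    counts = Counter(frame_codes)
    return {MOVEMENT_CODES[code]: n / fps for code, n in counts.items()}


# Hypothetical annotation: 16 frames of code 1, then 8 frames of code 0.
summary = seconds_per_movement([1] * 16 + [0] * 8)
```

    Because the real frame rate varies slightly, durations computed this way are approximate.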

    Note #2: The original dataset of JPG images is available upon request. There are 13 annotation classes in the original dataset: for each of the six washing movements defined by the WHO, "correct" and "incorrect" execution is marked with two different labels. In this published dataset, all incorrect executions are marked with code 0, as "other" washing movement.

    Acknowledgments: The dataset collection was funded by the Latvian Council of Science project: "Automated hand washing quality control and quality evaluation system with real-time feedback", No: lzp - Nr. 2020/2-0309.

    References: For more detailed information, see this article, describing a similar dataset collected in a different project:

    M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Gromova, M. Ivanovs, A. Skadins, R. Kadikis, A. Elsts. Hand-Washing Video Dataset Annotated According to the World Health Organization’s Hand-Washing Guidelines. Data. 2021; 6(4):38. https://doi.org/10.3390/data6040038

    Contact information: atis.elsts@edi.lv

  17. Data from: MedMNIST-C: Comprehensive benchmark and improved classifier...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 31, 2024
    Cite
    Francesco Di Salvo; Sebastian Doerrich; Christian Ledig (2024). MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions [Dataset]. http://doi.org/10.5281/zenodo.11471504
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Di Salvo; Sebastian Doerrich; Christian Ledig
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection, covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at github.com/francescodisalvo05/medmnistc-api.
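    To make the idea of a corruption applied at varying severity concrete, here is a generic sketch using additive Gaussian noise on an 8-bit grayscale image. It does not use the authors' medmnistc library and does not reproduce its corruptions; the function name and the sigma value for each severity level are assumptions chosen for illustration.

```python
import random

# Assumed noise standard deviation per severity level (1 = mild, 5 = strong).
SEVERITY_SIGMA = {1: 4.0, 2: 8.0, 3: 16.0, 4: 24.0, 5: 32.0}


def gaussian_noise(image, severity: int, seed: int = 0):
    """Add Gaussian noise to a 2D list of 0-255 pixel values.

    Returns a new image of the same shape with values clipped to 0-255.
    """
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    sigma = SEVERITY_SIGMA[severity]
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in image
    ]


# Hypothetical 8x8 mid-gray image corrupted at the mildest and strongest level.
clean = [[128] * 8 for _ in range(8)]
mild = gaussian_noise(clean, severity=1)
strong = gaussian_noise(clean, severity=5)
```

    Evaluating a classifier on the same test images at each severity level then shows how quickly its accuracy degrades as the corruption gets stronger.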

    This work has been accepted at the Workshop on Advancing Data Solutions in Medical Imaging AI @ MICCAI 2024 [preprint].

    Note: Due to space constraints, we have uploaded all datasets except TissueMNIST-C. However, it can be reproduced via our APIs.

    Usage: We recommend using the demo code and tutorials available on our GitHub repository.

    Citation: If you find this work useful, please consider citing us:

    @article{disalvo2024medmnist,
     title={MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions},
     author={Di Salvo, Francesco and Doerrich, Sebastian and Ledig, Christian},
     journal={arXiv preprint arXiv:2406.17536},
     year={2024}
    }

    Disclaimer: This repository is inspired by MedMNIST APIs and the ImageNet-C repository. Thus, please also consider citing MedMNIST, the respective source datasets (described here), and ImageNet-C.

  18. Data from: MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital...

    • physionet.org
    Updated Feb 3, 2025
    Cite
    Asad Aali; Dave Van Veen; Yamin Arefeen; Jason Hom; Christian Bluethgen; Eduardo Pontes Reis; Sergios Gatidis; Namuun Clifford; Joseph Daws; Arash Tehrani; Jangwon Kim; Akshay Chaudhari (2025). MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization [Dataset]. http://doi.org/10.13026/5gte-bv70
    Explore at:
    Dataset updated
    Feb 3, 2025
    Authors
    Asad Aali; Dave Van Veen; Yamin Arefeen; Jason Hom; Christian Bluethgen; Eduardo Pontes Reis; Sergios Gatidis; Namuun Clifford; Joseph Daws; Arash Tehrani; Jangwon Kim; Akshay Chaudhari
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This dataset presents a curated collection of preprocessed and labeled clinical notes derived from the MIMIC-IV-Note database. The primary aim of this resource is to facilitate the development and training of machine learning models focused on summarizing brief hospital courses (BHC) from clinical discharge notes.

    The dataset contains 270,033 cleaned and standardized clinical notes with an average length of 2,267 tokens, ensuring usability for machine learning (ML) applications. Each clinical note is paired with a corresponding BHC summary, providing a foundation for supervised learning tasks. The preprocessing pipeline uses regular expressions to address common issues in the raw clinical text, such as special characters, extraneous whitespace, inconsistent formatting, and irrelevant text, and produces a structured dataset in which clinical note sections are separated by appropriate headings.
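    A regular-expression cleanup of the kind described above could look roughly like this. It is a sketch only: the function name and the specific patterns are assumptions and are not the authors' actual preprocessing pipeline.

```python
import re


def clean_note(text: str) -> str:
    """Normalize whitespace and strip non-printable characters from a note."""
    text = text.replace("\r\n", "\n")             # unify line endings
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)   # replace tabs and non-printable characters
    text = re.sub(r" +", " ", text)               # collapse runs of spaces
    text = re.sub(r" *\n *", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


# Hypothetical raw snippet, for illustration only.
raw = "Chief  Complaint:\t chest pain \r\n\r\n\r\n\r\nHistory:\x00 none  "
cleaned = clean_note(raw)
```

    Note that the character filter above keeps only printable ASCII, which would also remove legitimate accented characters; a production pipeline would likely need a more careful rule.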

    By offering this resource, we aim to support healthcare professionals and researchers in their efforts to enhance patient care through the automation of BHC summarization. This dataset is ideal for exploring various NLP techniques, developing predictive models, and improving the efficiency and accuracy of clinical documentation practices. We invite the research community to utilize this dataset to advance the field of medical informatics and contribute to better health outcomes.

  19. An Applicable Dataset of Electronic Health Records with a Focus on CTA...

    • figshare.com
    xlsx
    Updated Oct 27, 2022
    Cite
    Ahmad Ebrahimi; Soroor Laffafchi; Samira Kafan (2022). An Applicable Dataset of Electronic Health Records with a Focus on CTA Results in Pulmonary Embolism Disease [Dataset]. http://doi.org/10.6084/m9.figshare.21308463.v3
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Oct 27, 2022
    Dataset provided by
    figshare
    Authors
    Ahmad Ebrahimi; Soroor Laffafchi; Samira Kafan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A new dataset of EHR data covering patients suspected of having pulmonary embolism (PE), including PE-positive patients, healthy patients, and COVID-19 patients suspected of having PE. The dataset includes PE diagnosis based on CTA imaging results, biographic data, vital signs, laboratory test results, past medical history, and medications. This dataset can be used to advance PE studies based on machine learning and artificial intelligence.

  20. Data from: Clinical Dataset

    • kaggle.com
    Updated Oct 5, 2023
    Cite
    Mohamadreza Momeni (2023). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohamadreza Momeni
    Description

    The purest type of electronic clinical data is that obtained at the point of care at a medical facility, hospital, clinic or practice. Often referred to as the electronic medical record (EMR), it is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

    Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories for eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

    About Dataset:

    333 scholarly articles cite this dataset.

    Unique identifier: DOI

    Dataset updated: 2023

    Authors: Haoyang Mi

    This dataset contains two datasets:

    1- Clinical Data_Discovery_Cohort. Column names: Patient ID Specimen date Dead or Alive Date of Death Date of last Follow Sex Race Stage Event Time

    2- Clinical_Data_Validation_Cohort. Column names: Patient ID Survival time (days) Event Tumor size Grade Stage Age Sex Cigarette Pack per year Type Adjuvant Batch EGFR KRAS

    Feel free to share your thoughts and analysis in a notebook for these datasets. You can also build interesting and valuable ML projects on them. Thanks for your attention.
