66 datasets found
  1. Equity in Healthcare Clean DataSets

    • kaggle.com
    Updated Feb 21, 2024
    Cite
    Anopsy (2024). Equity in Healthcare Clean DataSets [Dataset]. https://www.kaggle.com/datasets/anopsy/equity-in-healthcare-clean-datasets
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anopsy
    Description

    This dataset is based on the train and test datasets from this competition: https://www.kaggle.com/competitions/widsdatathon2024-challenge1 .

    What did I change?
    1. I dropped 2 columns that contained too little data.
    2. Using machine learning, I imputed "payer_type", "patient_race" and "bmi".
    3. Using "patient_zip3", I filled missing values in "patient_state", "Region" and "Division".
    4. Using SimpleImputer, I imputed a few missing numeric values in "Ozone", "PM2.5" and other columns.
    5. I created some new features, based on demographic features, that may be a bit more informative.
    6. I tokenized the 'breast_cancer_diagnosis_desc' column.
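
    Steps 3 and 4 above can be illustrated with a minimal standard-library sketch. It is not the author's notebook code: the toy rows, the helper names `fill_from_key` and `impute_median`, and the choice of the median as the fill value are all assumptions made for illustration (the description does not say which strategy was used with SimpleImputer).

```python
from statistics import median

# Hypothetical toy rows: column names follow the description, values are made up.
rows = [
    {"patient_zip3": "100", "patient_state": "NY", "Ozone": 38.0},
    {"patient_zip3": "100", "patient_state": None, "Ozone": None},
    {"patient_zip3": "900", "patient_state": "CA", "Ozone": 42.0},
    {"patient_zip3": "900", "patient_state": None, "Ozone": 40.0},
]


def fill_from_key(rows, key, target):
    """Fill missing `target` values from other rows that share the same `key`."""
    lookup = {}
    for row in rows:
        if row[target] is not None:
            # Keep the first value seen for each key.
            lookup.setdefault(row[key], row[target])
    for row in rows:
        if row[target] is None:
            # Stays None when the key never appears with a known value.
            row[target] = lookup.get(row[key])
    return rows


def impute_median(rows, column):
    """Replace missing numeric values in `column` with the column median (assumed strategy)."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = median(observed)
    for row in rows:
        if row[column] is None:
            row[column] = fill_value
    return rows


fill_from_key(rows, "patient_zip3", "patient_state")
impute_median(rows, "Ozone")
```

    The lookup-based fill only works when at least one row per zip prefix carries the target value; rows with an unseen prefix are left missing rather than guessed.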

    If you're interested in how I did that, check this notebook: https://www.kaggle.com/code/anopsy/ml-for-missing-values . For "bmi" and the new features, check this one: https://www.kaggle.com/code/anopsy/fe-and-xgb-on-clean-data

    According to the description of the original dataset, it's a "39k record dataset (split into training and test sets) representing patients and their characteristics (age, race, BMI, zip code), their diagnosis and treatment information (breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, … etc.), their geo (zip-code level) demographic data (income, education, rent, race, poverty, …etc), as well as toxic air quality data (Ozone, PM25 and NO2)."

  2. Health Insurance Dataset

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    Mohamadreza Momeni (2025). Health Insurance Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/health-insurance-dataset
    Explore at:
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kaggle
    Authors
    Mohamadreza Momeni
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Medical Insurance Expenses & Premium Dataset

    This dataset captures demographic and financial information related to medical insurance policyholders. It includes key features such as age, gender, BMI, number of children, discount eligibility status, and the geographic region of the insured. The dataset also provides the actual medical expenses incurred (expenses) and the insurance premium charged (premium).

    The purpose of this dataset is to support research and development of machine learning models for predicting healthcare costs, optimizing pricing strategies, and understanding factors that influence insurance expenses and premiums.

    Columns

    age: Age of the policyholder

    gender: Gender (male/female)

    bmi: Body Mass Index

    children: Number of children covered by the insurance

    discount_eligibility: Whether the policyholder is eligible for a discount (yes/no)

    region: Geographic region (e.g., southeast, northwest)

    expenses: Actual medical costs incurred by the policyholder (Target number 1)

    premium: Insurance premium charged (Target number 2)
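
    The column list above maps naturally onto a typed record. The sketch below is one possible way to parse such a file with the standard library; the `Policyholder` class, the `parse_row` helper, and the two sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Policyholder:
    age: int
    gender: str
    bmi: float
    children: int
    discount_eligibility: bool
    region: str
    expenses: float  # target number 1
    premium: float   # target number 2


def parse_row(row):
    """Convert one CSV row (a dict of strings) into a typed Policyholder."""
    return Policyholder(
        age=int(row["age"]),
        gender=row["gender"].strip().lower(),
        bmi=float(row["bmi"]),
        children=int(row["children"]),
        # The description gives yes/no values for this column.
        discount_eligibility=row["discount_eligibility"].strip().lower() == "yes",
        region=row["region"].strip().lower(),
        expenses=float(row["expenses"]),
        premium=float(row["premium"]),
    )


# Made-up sample rows, used only to exercise the parser.
SAMPLE_CSV = """age,gender,bmi,children,discount_eligibility,region,expenses,premium
19,female,27.9,0,yes,southwest,16884.92,168.85
33,male,22.7,2,no,northwest,4449.46,88.99
"""

records = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

    Converting the yes/no flag to a boolean at load time keeps later filtering and model code free of string comparisons.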

    Example Use Cases

    Predicting insurance expenses for new applicants

    Analyzing which demographic factors contribute most to higher premiums

    Exploring correlations between BMI, age, and healthcare costs

    Developing regression and classification models for pricing optimization

  3. AI medical chatbot

    • kaggle.com
    Updated Aug 15, 2024
    Cite
    Yousef Saeedian (2024). AI medical chatbot [Dataset]. https://www.kaggle.com/datasets/yousefsaeedian/ai-medical-chatbot
    Explore at:
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yousef Saeedian
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Description:

    This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.

    Key Features:

    • Doctor and Patient Roles: Each conversation is annotated with the role of the speaker (doctor or patient), making it easy to analyze communication patterns.
    • Medical Context: The dataset includes diverse scenarios, from routine check-ups to more complex medical discussions, offering a broad spectrum of healthcare dialogues.
    • Natural Language: The conversations are presented in natural language, allowing for the development and testing of NLP models focused on healthcare communication.
    • Applications: This dataset can be used for various applications, such as building dialogue systems, analyzing communication efficacy, developing medical NLP models, and enhancing patient care through better understanding of doctor-patient interactions.

    Potential Use Cases:

    • NLP Model Training: Train models to understand and generate medical dialogues.
    • Healthcare Communication Studies: Analyze communication strategies between doctors and patients to improve healthcare delivery.
    • Medical Chatbots: Develop intelligent medical chatbots that can simulate doctor-patient conversations.
    • Patient Experience Enhancement: Identify common patient concerns and doctor responses to enhance patient care strategies.

    This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.

  4. Sepsis Dataset –

    • kaggle.com
    Updated May 31, 2025
    Cite
    Fatolu Peter (2025). Sepsis Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/sepsis-dataset
    Explore at:
    Dataset updated
    May 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fatolu Peter
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    📝 Dataset Overview: This dataset focuses on early warning detection for sepsis, a critical and potentially fatal medical condition. It includes anonymized vital signs, lab results, and clinical indicators of patients admitted to the hospital, structured for real-time monitoring and predictive modeling.

    It’s ideal for clinical data analysts, healthcare data scientists, and AI practitioners aiming to develop decision support tools, early warning dashboards, or predictive health models.

    🔍 Dataset Features:

    Column Name | Description
    Patient_ID | Unique anonymized identifier
    Admission_Date | Patient’s hospital admission date
    Temperature_C | Body temperature in degrees Celsius
    BP_Systolic | Systolic blood pressure (mmHg)
    BP_Diastolic | Diastolic blood pressure (mmHg)
    Heart_Rate | Beats per minute
    WBC_Count | White blood cell count (x10⁹/L)
    Lactate_mmol_L | Lactate level in mmol/L
    Sepsis_Flag | Binary indicator (1 = Suspected Sepsis, 0 = Normal)
    Ward | Hospital ward/unit
    Doctor_On_Duty | Attending physician name (anonymized)

    🎯 Use Cases: Build Power BI dashboards for hospital early warning systems

    Train ML classification models to detect early signs of sepsis

    Create patient monitoring tools with Python or R

    Explore the relationship between vitals & sepsis onset

    Perform feature engineering for risk scoring systems

    📌 Clinical Relevance: Sepsis is one of the leading causes of in-hospital mortality worldwide. Early detection is crucial to reducing death rates and improving outcomes. This dataset empowers developers and analysts to make a meaningful impact in the healthcare sector.

    👤 Created By: Fatolu Peter (Emperor Analytics) A passionate healthcare analyst leveraging data to drive innovation in public health across Nigeria. This is Project 12 in my data-for-good series.

    ✅ LinkedIn Post: 🚨 New Dataset: Sepsis Early Warning System Data – Now on Kaggle 📊 Clinical vital signs + lab markers + sepsis risk flags 🔗 Explore the dataset here

    This dataset enables healthcare data scientists to: ✅ Build real-time hospital dashboards ✅ Predict sepsis risk with machine learning ✅ Explore vitals like BP, lactate, WBC, and temperature ✅ Support early intervention using data insights

    Whether you're into: 🧠 Predictive modeling 📈 Power BI clinical dashboards 📉 Risk analytics in healthcare This is for you.

    Join me in using data to save lives — one insight at a time. If you build something, tag me. I’ll gladly share it! 💡

    #HealthcareAnalytics #SepsisAwareness #EarlyWarningSystems #KaggleDataset #PowerBI #DataForGood #FatoluPeter #EmperorAnalytics #PublicHealth #Project12 #RealWorldData

  5. Synthetic Stroke Prediction Dataset

    • data.mendeley.com
    • kaggle.com
    Updated May 2, 2025
    Cite
    Mohammed Borhan Uddin (2025). Synthetic Stroke Prediction Dataset [Dataset]. http://doi.org/10.17632/s2nh6fm925.1
    Explore at:
    Dataset updated
    May 2, 2025
    Authors
    Mohammed Borhan Uddin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset, generated using GPT-4o, contains 50,000 records and 12 features. The target variable is stroke, a binary label where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, so preprocessing is required before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
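
    As a starting point for the missing data analysis mentioned above, the following sketch counts missing entries per column using only the standard library. The `missing_counts` helper, the set of tokens treated as missing, and every column name in the sample except `stroke` are assumptions; the sample rows are made up.

```python
import csv
import io

# Strings treated as "missing" in this sketch (an assumption, adjust to the real file).
MISSING_TOKENS = {"", "na", "n/a", "nan", "null"}


def missing_counts(csv_text):
    """Return ({column: number of missing entries}, total number of rows)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in reader.fieldnames:
            value = row[name]
            # DictReader yields None for fields absent from a short row.
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[name] += 1
    return counts, total


# Made-up sample with two deliberately missing cells.
SAMPLE_CSV = """age,bmi,smoking_status,stroke
67,36.6,formerly smoked,1
61,,never smoked,0
80,32.5,,0
"""

counts, total = missing_counts(SAMPLE_CSV)
for name, count in counts.items():
    print(f"{name}: {count}/{total} missing")
```

    Knowing the per-column share of missing values helps decide between dropping a column, imputing it, or keeping the gaps as their own category.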

  6. Patient Doctor Q&A TR 321179

    • zenodo.org
    csv
    Updated Jul 23, 2024
    Cite
    Muhammed Kayra Bulut; Muhammed Kayra Bulut (2024). Patient Doctor Q&A TR 321179 [Dataset]. http://doi.org/10.5281/zenodo.12798934
    Explore at:
    Available download formats: csv
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Muhammed Kayra Bulut
    Authors
    Muhammed Kayra Bulut; Muhammed Kayra Bulut
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Time period covered
    Jul 18, 2024
    Description

    # Patient Doctor Q&A TR 321179 Dataset
    The Patient Doctor Q&A TR 321179 dataset is a combined and shuffled version of the [**Patient Doctor Q&A TR 19583**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-qa-dataset-tr), [**Patient Doctor Q&A TR 167732**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-tr-167732), [**Patient Doctor Q&A TR 5695**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-translated-from-id-to-tr), and [**Patient Doctor Q&A TR 95588**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-tr-95588) datasets.

    ## Main Features:
    * Content: Patient questions and doctor answers covering various medical topics.
    * Structure: Contains 2 columns: Question, Answer.
    * Language: Turkish.
    ## Potential Uses:
    * Medical research
    * Natural Language Processing (NLP)
    * Medical education
    ## Limitations:
    * Data privacy concerns
    * Variability in answer quality
    * Potential biases
    ## General Assessment:
    The Patient Doctor Q&A TR 321179 dataset is a valuable resource for understanding real-world medical communication and information exchange. This dataset, translated into Turkish, is an important resource for medical research and education, and can be used to analyze communication between patients and doctors. However, limitations such as data privacy and variability in answer quality should be considered.

    This dataset offers researchers and educators the opportunity to conduct more in-depth analyses and apply natural language processing techniques using Turkish medical communication data.

  7. Cardiovascular Disease Dataset

    • ieee-dataport.org
    Updated Oct 25, 2022
    Cite
    Rajib Kumar Halder Halder (2022). Cardiovascular Disease Dataset [Dataset]. https://ieee-dataport.org/documents/cardiovascular-disease-dataset
    Explore at:
    Dataset updated
    Oct 25, 2022
    Authors
    Rajib Kumar Halder Halder
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This heart disease dataset is curated by combining 3 popular heart disease datasets. The first dataset (collected from Kaggle) contains 70000 records with 11 independent features, which makes it the largest heart disease dataset available so far for research purposes. These data were collected at the moment of medical examination and from information given by the patient. The second and third datasets contain 303 and 293 instances respectively, with 13 common features. The three datasets used for its curation are: Cardio Data (Kaggle Dataset)

  8. Cardiovascular Disease dataset

    • kaggle.com
    zip
    Updated Jan 20, 2019
    Cite
    Svetlana Ulianova (2019). Cardiovascular Disease dataset [Dataset]. https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
    Explore at:
    Available download formats: zip (760066 bytes)
    Dataset updated
    Jan 20, 2019
    Authors
    Svetlana Ulianova
    Description

    Data description

    There are 3 types of input features:

    • Objective: factual information;
    • Examination: results of medical examination;
    • Subjective: information given by the patient.

    Features:

    1. Age | Objective Feature | age | int (days) |
    2. Height | Objective Feature | height | int (cm) |
    3. Weight | Objective Feature | weight | float (kg) |
    4. Gender | Objective Feature | gender | categorical code |
    5. Systolic blood pressure | Examination Feature | ap_hi | int |
    6. Diastolic blood pressure | Examination Feature | ap_lo | int |
    7. Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
    8. Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
    9. Smoking | Subjective Feature | smoke | binary |
    10. Alcohol intake | Subjective Feature | alco | binary |
    11. Physical activity | Subjective Feature | active | binary |
    12. Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
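
    Because `age` is stored in days and `height` and `weight` are separate columns, two small conversions are commonly applied before modeling. The sketch below is illustrative only: the helper names, the 365.25-day year, and the sample record are assumptions, not part of the dataset description.

```python
DAYS_PER_YEAR = 365.25  # average year length, an assumption for the conversion


def age_in_years(age_days):
    """Convert the `age` column (days) to years."""
    return age_days / DAYS_PER_YEAR


def body_mass_index(height_cm, weight_kg):
    """Compute BMI in kg/m^2 from the `height` (cm) and `weight` (kg) columns."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Made-up record using the dataset's column names.
record = {"age": 18393, "height": 168, "weight": 62.0}

years = age_in_years(record["age"])
bmi = body_mass_index(record["height"], record["weight"])
print(f"age: {years:.1f} years, BMI: {bmi:.1f}")
```

    Derived features like these are easier to interpret than raw days or centimetres, and they make implausible height or weight entries easier to spot.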

    All of the dataset values were collected at the moment of medical examination.

  9. Maternal Health Dataset

    • kaggle.com
    Updated Aug 4, 2025
    Cite
    JANANI D (2025). Maternal Health Dataset [Dataset]. https://www.kaggle.com/datasets/hgfdrtyu/maternal-health-dataset
    Explore at:
    Dataset updated
    Aug 4, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JANANI D
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    This dataset contains health indicators related to pregnant women to help assess maternal health risk levels. The primary goal is to classify the level of maternal health risk into categories such as Low, Mid, or High, based on various medical features.

    The dataset can be used for classification tasks in machine learning, data visualization, healthcare analytics, and predictive modeling.

  10. Symptom-Based Disease Labeling Dataset

    • kaggle.com
    Updated Sep 7, 2023
    Cite
    Krish Sharma (2023). Symptom-Based Disease Labeling Dataset [Dataset]. https://www.kaggle.com/datasets/krish0202/symptom-based-disease-labeling-dataset
    Explore at:
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Krish Sharma
    Description

    The "Symptom-Based Disease Labeling Dataset" is a valuable resource for the development and evaluation of Natural Language Processing (NLP) models aimed at disease diagnosis and healthcare applications. This dataset contains a diverse collection of English sentences that describe various symptoms and corresponding labels specifying the associated disease.

    Key Features :

    Diverse Symptoms: The dataset includes a rich variety of symptom descriptions, encompassing both common and rare health issues, making it suitable for comprehensive disease diagnosis tasks.

    Accurate Labels: Each sentence is meticulously labeled with the corresponding disease, ensuring high-quality training and evaluation data for NLP models.

  11. Hospital Patient Treatment Dataset

    • kaggle.com
    Updated Jun 26, 2025
    Cite
    Slidescope (2025). Hospital Patient Treatment Dataset [Dataset]. https://www.kaggle.com/datasets/slidescope/hospital-patient-treatment-dataset
    Explore at:
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Slidescope
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This Hospital Patient Treatment Dataset contains simulated data representing patient treatments in a multi-specialty hospital. It includes 200 records with both categorical and numerical fields, designed for data visualization and analysis using tools like Power BI, Tableau, or Python.

    The dataset features 3–4 categorical fields: Department (e.g., Cardiology, Neurology, Orthopedics), Treatment Type (Surgery, Medication, Therapy, Observation), Doctor Name, and Gender. It also includes numerical fields such as Age, Treatment Cost, Hospital Stay (Days), and Recovery Score (ranging from 0 to 100).

    To get more information about Columns visit: https://colorstech.net/practice-datasets/hospital-patient-treatment-dataset-for-analysis/

    This dataset is ideal for healthcare analysts and data enthusiasts who want to practice analyzing treatment efficiency, patient demographics, cost effectiveness, and healthcare outcomes. Potential analyses include cost comparisons by department, gender-based treatment patterns, doctor performance based on recovery scores, and identifying which treatments lead to faster recovery.

    It can help simulate real-world healthcare reporting scenarios, such as understanding hospital load, cost optimization, or patient well-being tracking. The inclusion of unique Patient ID allows for easy referencing and segmentation.

    This dataset is well-suited for creating KPIs, dashboards, and advanced visualizations to gain insights into hospital operations and patient care outcomes. No real patient data is used—this is a synthetic dataset for educational use only.

  12. Environmental settings of the proposed system.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jul 17, 2025
    Cite
    Shuvo Biswas; Rafid Mostafiz; Mohammad Shorif Uddin; Muhammad Shahin Uddin (2025). Environmental settings of the proposed system. [Dataset]. http://doi.org/10.1371/journal.pone.0324957.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Shuvo Biswas; Rafid Mostafiz; Mohammad Shorif Uddin; Muhammad Shahin Uddin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Pneumonia, a severe lung infection caused by various viruses, presents significant challenges in diagnosis and treatment due to its similarities with other respiratory conditions. Additionally, the need to protect patient privacy complicates the sharing of sensitive clinical data. This study introduces FLPneXAINet, an effective framework that combines federated learning (FL) with deep learning (DL) and explainable AI (XAI) to securely and accurately predict pneumonia using chest X-ray (CXR) images. We utilized a benchmark dataset from Kaggle, comprising 8,402 CXR images (3,904 normal and 4,498 pneumonia). The dataset was preprocessed and augmented using a cycle-consistent generative adversarial (CycleGAN) network to increase the volume of training data. Three pre-trained DL models named VGG16, NASNetMobile, and MobileNet were employed to extract features from the augmented dataset. Further, four ensemble DL (EDL) models were used to enhance feature extraction. Feature optimization was performed using recursive feature elimination (RFE), analysis of variance (ANOVA), and random forest (RF) to select the most relevant features. These optimized features were then inputted into machine learning (ML) models, including K-nearest neighbor (KNN), naive bayes (NB), support vector machine (SVM), and RF, for pneumonia prediction. The performance of the models was evaluated in a FL environment, with the EDL network achieving the best results: accuracy 97.61%, F1 score 98.36%, recall 98.13%, and precision 98.59%. The framework’s predictions were further validated using two XAI techniques—Local Interpretable Model-Agnostic Explanations (LIME) and Grad-CAM. FLPneXAINet offers a robust solution for healthcare professionals to accurately diagnose pneumonia, ensuring timely treatment while safeguarding patient privacy.
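
    The reported metrics can be cross-checked against each other. Assuming the F1 score is the usual harmonic mean of precision and recall for a single positive class (the abstract does not say how the metrics were averaged), the sketch below recomputes it from the reported precision and recall. The function name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the best EDL network.
reported_precision = 0.9859
reported_recall = 0.9813

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9836"
```

    Under that assumption the recomputed value agrees with the reported F1 score of 98.36%.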

  13. Data from: OCT5k: A dataset of multi-disease and multi-graded annotations...

    • rdr.ucl.ac.uk
    zip
    Updated Nov 7, 2024
    Cite
    Mustafa Arikan (2024). OCT5k: A dataset of multi-disease and multi-graded annotations for retinal layers [Dataset]. http://doi.org/10.5522/04/22128671.v4
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    University College London
    Authors
    Mustafa Arikan
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The thickness and appearance of retinal layers are essential markers for diagnosing and studying eye diseases. Despite the increasing availability of imaging devices to scan and store large amounts of data, analyzing retinal images and generating trial endpoints has remained a manual, error-prone, and time-consuming task. In particular, the lack of large amounts of high-quality labels for different diseases hinders the development of automated algorithms. Therefore, we have compiled 5016 pixel-wise manual labels for 1672 optical coherence tomography (OCT) scans featuring two different diseases as well as healthy subjects to help democratize the process of developing novel automatic techniques. We also collected 4698 bounding box annotations for a subset of 566 scans across 9 classes of disease biomarker. Due to variations in retinal morphology, intensity range, and changes in contrast and brightness, designing segmentation and detection methods that can generalize to different disease types is challenging. While machine learning-based methods can overcome these challenges, high-quality expert annotations are necessary for training. Publicly available annotated image datasets typically contain few images and/or only cover a single type of disease, and most are only annotated by a single grader. To address this gap, we present a comprehensive multi-grader and multi-disease dataset for training machine learning-based algorithms. The proposed dataset covers three subsets of scans (Age-related Macular Degeneration, Diabetic Macular Edema, and healthy) and annotations for two types of tasks (semantic segmentation and object detection).

  14. LUTH Hospital Enhanced Dataset –

    • kaggle.com
    Updated May 31, 2025
    Cite
    Fatolu Peter (2025). LUTH Hospital Enhanced Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/luth-hospital-enhanced-dataset
    Explore at:
    Dataset updated
    May 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fatolu Peter
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    📝 Dataset Overview: This enhanced dataset captures the real-world operational and clinical performance data from a major hospital in Nigeria — Lagos University Teaching Hospital (LUTH). It includes detailed information on admissions, patient care, medical services, billing, and staff activities, ideal for healthcare analytics, hospital management dashboards, and machine learning projects.

    🔍 Dataset Features (Suggested Columns):

    Column Name | Description
    Patient_ID | Unique anonymized patient ID
    Admission_Date | Date of admission
    Discharge_Date | Date of discharge
    Gender | Patient’s gender
    Age | Patient’s age
    Department | Medical department involved
    Diagnosis | Primary diagnosis
    Doctor | Attending physician (anonymized)
    Treatment_Provided | Type of treatment/procedure
    Lab_Tests | Count of lab tests conducted
    Medications_Administered | Total medications given
    Surgery_Cost (₦) | If applicable, cost of surgery
    Bill_Amount (₦) | Total bill charged to patient
    Ward | Hospital ward assigned
    Length_of_Stay (days) | Duration of hospitalization

    🎯 Use Cases:

    • Build hospital operations dashboards in Power BI
    • Analyze billing and cost patterns across departments
    • Predict length of stay or discharge outcomes
    • Explore departmental workload and performance
    • Use as a base for AI in hospital management systems

    🏥 Clinical & Operational Value: This dataset empowers analysts and healthcare professionals to:

    • Track patient outcomes and billing efficiency
    • Reduce operational bottlenecks
    • Improve patient care with data-driven recommendations
    • Benchmark departmental performance
    • Train predictive models for resource allocation

    👤 Created By: Fatolu Peter (Emperor Analytics) Dedicated to transforming public healthcare using analytics and real-world data across Nigerian hospitals. This is Project 14 in my growing health-tech analytics journey.

    ✅ LinkedIn Post: 🚑 New Kaggle Dataset: LUTH Hospital Enhanced Clinical & Operations Data 📊 Real hospital data on admissions, billing, treatments, and care metrics 🔗 Access the dataset now on Kaggle

    This dataset gives you: ✅ Real hospital operations data ✅ Billing and medication insights ✅ Doctor and ward-level activity ✅ A perfect base for building Power BI dashboards or training ML models

    Whether you're a data scientist, health analyst, or Power BI pro — this is real-world data to make real impact. Let’s build something powerful together. 💡

    #HealthcareAnalytics #LUTH #HospitalData #KaggleDataset #PowerBI #FatoluPeter #EmperorAnalytics #DataForGood #Project14 #PublicHealth #NigeriaHealthData

  15. Virtual Patient Model Assessment

    • kaggle.com
    Updated Feb 18, 2023
    Cite
    The Devastator (2023). Virtual Patient Model Assessment [Dataset]. https://www.kaggle.com/datasets/thedevastator/virtual-patient-model-assessment
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Virtual Patient Model Assessment

    Tracking Physical, Psychological and Cognitive Performance in Older Adults

    By [source]

    About this dataset

    This dataset provides a comprehensive overview of the physical, psychological and cognitive health of a cohort of older adults. It contains data collected by medical experts during clinical assessments, covering physical activity, nutrition, activity limitations, balance, depression and cognition. It also includes parameters extracted from the devices used, such as average heart rate per day and average gait speed. These are coupled with detailed information on falls, fractures and loss of orientation within the group studied, which can add further insight into overall health trends for those aged 55 and above.

    The dataset includes various scores capturing different aspects of participants' lifestyles, alongside statistics that describe them: not only basic attributes like gender or age but also complex measures like exhaustion or grip strength for each individual in the cohort. Exploring nutrition measures (e.g., Body Mass Index), social interaction (e.g., phone calls) or leisure activities (e.g., clubs) could help uncover correlations among them and inform strategies for improving well-being among older adults.


    How to use the dataset

    This dataset provides a comprehensive overview of the physical, psychological and cognitive health of a cohort of older adults. It includes parameters related to physical activity, nutrition, activity limitations, balance, depression, cognition and more. Through this dataset you can gain insights into the various factors affecting the health of older adults in your population, which could help researchers or practitioners develop interventions to promote their health.

    Before using this dataset, get familiar with the variables and fields provided. There are two kinds of fields: descriptive information, such as gender and age group, and scores related to various aspects, such as heart rate per day or average gait speed per month. You may also find coupled events, like falls or fractures, that can affect the assessment scores over time.

    Once you have gone through all the variables in the dataset, you can apply simple statistical methods, such as computing the mean of several key indicators (such as balance score or BMI) across different characteristics (such as age group). Comparing these values lets researchers identify trends among different groups within a population that would show up as differences at the individual level.
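
    The group-wise comparison described above can be sketched with the Python standard library alone. The field names (age_group, balance_score) and the sample rows are hypothetical placeholders, not the dataset's actual column names or values.

```python
from collections import defaultdict
from statistics import fmean


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each group_key, skipping missing values."""
    grouped = defaultdict(list)
    for row in rows:
        value = row.get(value_key)
        if value is not None:  # ignore participants with no recorded score
            grouped[row[group_key]].append(value)
    return {group: fmean(values) for group, values in grouped.items()}


# Made-up participants; real column names and values will differ.
participants = [
    {"age_group": "55-64", "balance_score": 8.0},
    {"age_group": "55-64", "balance_score": 6.0},
    {"age_group": "65-74", "balance_score": 5.0},
    {"age_group": "65-74", "balance_score": None},
]

means = mean_by_group(participants, "age_group", "balance_score")
```

    Comparing the resulting per-group means is the simplest form of the analysis; the same helper can be reused for any other indicator and grouping column.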

    Other techniques include clustering, to observe patterns across several indicators at once; logistic regression, to identify which predictors best explain certain outcomes among older adults; and propensity-matching approaches, which suggest what kind of intervention should be given depending on each person’s characteristics, based on accumulated research data from the older population in this dataset. The usefulness of this dataset is not limited to statistics; it might also benefit from theoretical forms such as narrative geometry used for subjective analysis by placing story-telling elements along with formative assessments onto conceptual frameworks between inside natural ecosystems already running smoothly (between concepts) before disruption/disequilibrium happens due to external stressors, ecomorphonologically speaking. This will eventually help clinicians addressing psychological conditions verifying objective status via outcomes from metrics established earlier preferably prior experiments where involuntary independent behavior was detected influencing vital organ systems at homeostasis levels either causing positive adaptations / fitness, or increasing vulnerability that when added up together shift towards severe distress turn proximally considering also other segments elsewhere varying across multiple networks simultaneous injections cumulated/integrated effects starting sometimes after take off periods way before ill health seems obviously concrete therefore important details concerning risk factors sometimes overlooked got noticed while capturing evidence based prospective by cross validated means completed longitudinal surveys taking advantage into being able understanding potentially confounding conditions sparedly manifested either forgotten beca...

  16. Features_corected

    • kaggle.com
    zip
    Updated May 13, 2021
    Cite
    TUSHAR SARKAR (2021). Features_corected [Dataset]. https://www.kaggle.com/tusharsarkar/features-corected
    Explore at:
    Available download formats: zip (14633271 bytes)
    Dataset updated
    May 13, 2021
    Authors
    TUSHAR SARKAR
    Description

    Dataset

    This dataset was created by TUSHAR SARKAR

    Contents

    It contains the following files:

  17. Human Vital Sign Dataset

    • kaggle.com
    Updated Jul 19, 2024
    Cite
    DatasetEngineer (2024). Human Vital Sign Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/8992827
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DatasetEngineer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    The Human Vital Signs Dataset is a comprehensive collection of key physiological parameters recorded from patients. This dataset is designed to support research in medical diagnostics, patient monitoring, and predictive analytics. It includes both original attributes and derived features to provide a holistic view of patient health.

    Attributes

    • Patient ID: A unique identifier assigned to each patient. Type: Integer. Example: 1, 2, 3, ...
    • Heart Rate: The number of heartbeats per minute. Type: Integer. Range: 60-100 bpm (for this dataset). Example: 72, 85, 90.
    • Respiratory Rate: The number of breaths taken per minute. Type: Integer. Range: 12-20 breaths per minute (for this dataset). Example: 16, 18, 15.
    • Timestamp: The exact time at which the vital signs were recorded. Type: Datetime. Format: YYYY-MM-DD HH:MM. Example: 2023-07-19 10:15:30.
    • Body Temperature: The body temperature measured in degrees Celsius. Type: Float. Range: 36.0-37.5°C (for this dataset). Example: 36.7, 37.0, 36.5.
    • Oxygen Saturation: The percentage of oxygen-bound hemoglobin in the blood. Type: Float. Range: 95-100% (for this dataset). Example: 98.5, 97.2, 99.1.
    • Systolic Blood Pressure: The pressure in the arteries when the heart beats (systolic pressure). Type: Integer. Range: 110-140 mmHg (for this dataset). Example: 120, 130, 115.
    • Diastolic Blood Pressure: The pressure in the arteries when the heart rests between beats (diastolic pressure). Type: Integer. Range: 70-90 mmHg (for this dataset). Example: 80, 75, 85.
    • Age: The age of the patient. Type: Integer. Range: 18-90 years (for this dataset). Example: 25, 45, 60.
    • Gender: The gender of the patient. Type: Categorical. Categories: Male, Female. Example: Male, Female.
    • Weight (kg): The weight of the patient in kilograms. Type: Float. Range: 50-100 kg (for this dataset). Example: 70.5, 80.3, 65.2.
    • Height (m): The height of the patient in meters. Type: Float. Range: 1.5-2.0 m (for this dataset). Example: 1.75, 1.68, 1.82.

    Derived Features

    • Derived_HRV (Heart Rate Variability): A measure of the variation in time between heartbeats. Type: Float. Formula: $\mathrm{HRV} = \dfrac{\text{Standard Deviation of Heart Rate over a Period}}{\text{Mean Heart Rate over the Same Period}}$. Example: 0.10, 0.12, 0.08.
    • Derived_Pulse_Pressure (Pulse Pressure): The difference between systolic and diastolic blood pressure. Type: Integer. Formula: $\mathrm{PP} = \text{Systolic Blood Pressure} - \text{Diastolic Blood Pressure}$. Example: 40, 45, 30.
    • Derived_BMI (Body Mass Index): A measure of body fat based on weight and height. Type: Float. Formula: $\mathrm{BMI} = \dfrac{\text{Weight (kg)}}{(\text{Height (m)})^{2}}$. Example: 22.8, 25.4, 20.3.
    • Derived_MAP (Mean Arterial Pressure): An average blood pressure in an individual during a single cardiac cycle. Type: Float. Formula: $\mathrm{MAP} = \text{Diastolic Blood Pressure} + \dfrac{1}{3}\,(\text{Systolic Blood Pressure} - \text{Diastolic Blood Pressure})$. Example: 93.3, 100.0, 88.7.

    Target Feature

    • Risk Category: Classification of patients into "High Risk" or "Low Risk" based on their vital signs. Type: Categorical. Categories: High Risk, Low Risk. Example: High Risk, Low Risk.

    Criteria

    High Risk: any of the following conditions
    • Heart Rate: > 90 bpm or < 60 bpm
    • Respiratory Rate: > 20 breaths per minute or < 12 breaths per minute
    • Body Temperature: > 37.5°C or < 36.0°C
    • Oxygen Saturation: < 95%
    • Systolic Blood Pressure: > 140 mmHg or < 110 mmHg
    • Diastolic Blood Pressure: > 90 mmHg or < 70 mmHg
    • BMI: > 30 or < 18.5

    Low Risk: none of the above conditions

    This dataset, with a total of 200,000 samples, provides a robust foundation for various machine learning and statistical analysis tasks aimed at understanding and predicting patient health outcomes based on vital signs. The inclusion of both original attributes and derived features enhances the richness and utility of the dataset.
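
    The derived features and the risk criteria listed in this description can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the dataset author's code: the Vitals class and its field names are hypothetical, and HRV uses the population standard deviation because the description does not say which variant is meant.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass
class Vitals:
    """One reading; field names are illustrative, not the dataset's column names."""
    heart_rate: int            # bpm
    respiratory_rate: int      # breaths per minute
    body_temperature: float    # degrees Celsius
    oxygen_saturation: float   # percent
    systolic_bp: int           # mmHg
    diastolic_bp: int          # mmHg
    weight_kg: float
    height_m: float


def hrv(heart_rates):
    """Standard deviation of heart rate over a period divided by its mean (population SD assumed)."""
    return pstdev(heart_rates) / fmean(heart_rates)


def pulse_pressure(v: Vitals) -> int:
    return v.systolic_bp - v.diastolic_bp


def bmi(v: Vitals) -> float:
    return v.weight_kg / v.height_m ** 2


def mean_arterial_pressure(v: Vitals) -> float:
    return v.diastolic_bp + (v.systolic_bp - v.diastolic_bp) / 3


def risk_category(v: Vitals) -> str:
    """High Risk if any listed criterion is met, otherwise Low Risk."""
    body_mass_index = bmi(v)
    high = (
        v.heart_rate > 90 or v.heart_rate < 60
        or v.respiratory_rate > 20 or v.respiratory_rate < 12
        or v.body_temperature > 37.5 or v.body_temperature < 36.0
        or v.oxygen_saturation < 95
        or v.systolic_bp > 140 or v.systolic_bp < 110
        or v.diastolic_bp > 90 or v.diastolic_bp < 70
        or body_mass_index > 30 or body_mass_index < 18.5
    )
    return "High Risk" if high else "Low Risk"


# Illustrative reading built from the first example value of each attribute.
sample = Vitals(72, 16, 36.7, 98.5, 120, 80, 70.5, 1.75)
category = risk_category(sample)
```

    Note that a single out-of-range value is enough to flip the label, so the target is a deterministic function of the listed attributes.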

  18. 3500 hospitals data of USA

    • kaggle.com
    Updated Oct 15, 2023
    Cite
    Aman Garg 0303 (2023). 3500 hospitals data of USA [Dataset]. https://www.kaggle.com/datasets/amangarg0303/3500-hospitals-data-of-usa/suggestions
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Garg 0303
    Area covered
    United States
    Description

    The dataset is taken from the communitybenefitinsight website, which has data on around 3500 hospitals in the USA, including the hospital name, address, city, state, zipcode, bed_count, and many more features. This data is for practice and educational purposes.

  19. CellNet

    • kaggle.com
    Updated Jul 2, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    John Capocyan (2024). CellNet [Dataset]. http://doi.org/10.34740/kaggle/dsv/8839178
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    John Capocyan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CellNet: A Large-Scale Comprehensive Medical Imaging Database Research Effort

    Overview:

    Inspired by ImageNet, the database from Stanford's leading AI research group that is beloved in computer vision, CellNet is a meticulously curated dataset featuring over 120,000 high-quality medical images representing over 20 organ/cancer classes. The idea started in June 2024 during research for the paper "Enhancement Framework for Vision Transformers in Data Limited Cancer Detection" [presented at the IEEE's 2024 MIT Undergraduate Research Technology Conference (URTC)]. The study's future-work discussion proposed that, given enough data and computation, pre-trained CellNet weights transferred to other models for various medical imaging tasks can outperform models pre-trained on ImageNet. More importantly, CellNet relies on the study's theory that models pre-trained on large datasets specifically tailored to the field of their task will perform better than general pre-trained models. In CellNet's case specifically, pre-trained weights derived from medical tasks can potentially provide a more sound and ethical base for medical AI models in future applications and research on cancer diagnosis and other related fields.

    Data Collection:

    The data for CellNet has been aggregated from numerous reputable online repositories and medical labs, ensuring a diverse and comprehensive collection of histopathology images. In collaboration with a team of physicians, the lab-collected images included in the set have been meticulously verified and annotated to maintain high standards of accuracy and reliability.

    Continuous Updates:

    One of the key features of CellNet is its dynamic and evolving nature. CellNet is committed to updating monthly, integrating new data and reorganizing existing entries based on the latest research and clinical findings. This continuous update cycle ensures that CellNet remains a relevant and up-to-date resource for the clinical and medical AI research communities.

    Applications:

    CellNet is designed to facilitate various machine learning model applications, including but not limited to:

    • Image Classification
    • Image Segmentation
    • Transfer Learning

    However, it's important to note that as this dataset grows, weights obtained from training on this set are primarily meant to be imported to other models for transfer learning on smaller datasets.

    Community Engagement:

    The CellNet team encourages the Kaggle community to engage with CellNet by exploring the dataset, creating notebooks, and providing feedback. Your contributions and insights are invaluable in helping CellNet improve and expand its resource and weights.

    License:

    This dataset is licensed under the MIT License. Please cite the dataset if you use it in your research.

    Acknowledgements:

    I would like to thank all the contributors and the medical professionals who helped produce all the images curated in this dataset.

    Contact:

    For any questions or support, please contact [johncapocyan@gmail.com].

    Dive into the dataset today and join the mission to revolutionize medical research through the power of data and AI!

  20. UK Hospitals list

    • kaggle.com
    Updated Aug 5, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Subhankar Sinha (2024). UK Hospitals list [Dataset]. https://www.kaggle.com/datasets/subhankarsinha1985/uk-hospitals-list/discussion?sort=undefined
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Subhankar Sinha
    Area covered
    United Kingdom
    Description

    This is a list of UK hospitals containing details about 299 hospitals. The dataset contains no null values and has 7 features altogether.
