66 datasets found

Equity in Healthcare Clean DataSets
kaggle.com
Updated Feb 21, 2024
Cite
Anopsy (2024). Equity in Healthcare Clean DataSets [Dataset]. https://www.kaggle.com/datasets/anopsy/equity-in-healthcare-clean-datasets
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 21, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Anopsy
Description
This dataset is based on the train and test datasets from this competition: https://www.kaggle.com/competitions/widsdatathon2024-challenge1

What did I change?
1. I dropped 2 columns that contained too little data.
2. Using machine learning, I imputed "payer_type", "patient_race" and "bmi".
3. Using "patient_zip3", I filled missing values in "patient_state", "Region" and "Division".
4. Using SimpleImputer, I imputed the few missing numeric values in "Ozone", "PM2.5" and other columns.
5. I created some new features, based on demographic features, that may be a bit more informative.
6. I tokenized the 'breast_cancer_diagnosis_desc' column.
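
The zip-based fill in step 3 can be sketched as follows. This is a minimal illustration, not the author's notebook code: it assumes rows are plain dictionaries, the helper name `fill_state_from_zip3` is hypothetical, and the sample values are made up. It learns a zip3-to-state lookup from rows where both fields are present and uses it to fill rows where the state is missing.

```python
from collections import Counter, defaultdict


def fill_state_from_zip3(rows):
    """Fill missing 'patient_state' with the most common state seen for the same 'patient_zip3'."""
    # Count which states co-occur with each zip3 among rows that have a state.
    seen = defaultdict(Counter)
    for row in rows:
        if row.get("patient_state"):
            seen[row["patient_zip3"]][row["patient_state"]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        zip3 = new_row["patient_zip3"]
        if not new_row.get("patient_state") and zip3 in seen:
            new_row["patient_state"] = seen[zip3].most_common(1)[0][0]
        filled.append(new_row)
    return filled


# Hypothetical sample rows; values are for illustration only.
sample = [
    {"patient_zip3": "100", "patient_state": "NY"},
    {"patient_zip3": "100", "patient_state": None},
    {"patient_zip3": "900", "patient_state": "CA"},
    {"patient_zip3": "555", "patient_state": None},  # no known state for this zip3
]
result = fill_state_from_zip3(sample)
```

Rows whose zip3 never appears with a known state stay missing, so a second strategy would still be needed for them.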

If you're interested in how I did that, check this notebook: https://www.kaggle.com/code/anopsy/ml-for-missing-values
For "bmi" and the new features, check this one: https://www.kaggle.com/code/anopsy/fe-and-xgb-on-clean-data

According to the description of the original dataset, it's a "39k record dataset (split into training and test sets) representing patients and their characteristics (age, race, BMI, zip code), their diagnosis and treatment information (breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, … etc.), their geo (zip-code level) demographic data (income, education, rent, race, poverty, …etc), as well as toxic air quality data (Ozone, PM25 and NO2)."
Health Insurance Dataset
kaggle.com
Updated Jul 5, 2025
Cite
Mohamadreza Momeni (2025). Health Insurance Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/health-insurance-dataset
Explore at:
Croissant
Dataset updated
Jul 5, 2025
Dataset provided by
Kaggle
Authors
Mohamadreza Momeni
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Medical Insurance Expenses & Premium Dataset

This dataset captures demographic and financial information related to medical insurance policyholders. It includes key features such as age, gender, BMI, number of children, discount eligibility status, and the geographic region of the insured. The dataset also provides the actual medical expenses incurred (expenses) and the insurance premium charged (premium).

The purpose of this dataset is to support research and development of machine learning models for predicting healthcare costs, optimizing pricing strategies, and understanding factors that influence insurance expenses and premiums.

Columns

age: Age of the policyholder

gender: Gender (male/female)

bmi: Body Mass Index

children: Number of children covered by the insurance

discount_eligibility: Whether the policyholder is eligible for a discount (yes/no)

region: Geographic region (e.g., southeast, northwest)

expenses: Actual medical costs incurred by the policyholder (Target number 1)

premium: Insurance premium charged (Target number 2)

Example Use Cases

Predicting insurance expenses for new applicants

Analyzing which demographic factors contribute most to higher premiums

Exploring correlations between BMI, age, and healthcare costs

Developing regression and classification models for pricing optimization
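
As a rough sketch of the correlation use case above, the snippet below computes a Pearson correlation between `bmi` and `expenses` using only the standard library. The rows are invented for illustration (they reuse the column names listed above but are not real records), and the helper name `pearson` is an assumption of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column gives zero variance and raises ZeroDivisionError.
    return cov / math.sqrt(var_x * var_y)


# Made-up rows using the column names from the list above.
rows = [
    {"age": 25, "bmi": 22.0, "expenses": 2100.0},
    {"age": 41, "bmi": 27.5, "expenses": 6400.0},
    {"age": 56, "bmi": 31.2, "expenses": 11800.0},
    {"age": 33, "bmi": 24.8, "expenses": 3900.0},
]
bmi_vs_expenses = pearson(
    [r["bmi"] for r in rows],
    [r["expenses"] for r in rows],
)
```

A value near 1 or -1 suggests a strong linear relationship, but it says nothing about causation or about non-linear effects.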
AI medical chatbot
kaggle.com
Updated Aug 15, 2024
Cite
Yousef Saeedian (2024). AI medical chatbot [Dataset]. https://www.kaggle.com/datasets/yousefsaeedian/ai-medical-chatbot
Explore at:
Croissant
Dataset updated
Aug 15, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Yousef Saeedian
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset Description:

This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.

Key Features:

Doctor and Patient Roles: Each conversation is annotated with the role of the speaker (doctor or patient), making it easy to analyze communication patterns.

Medical Context: The dataset includes diverse scenarios, from routine check-ups to more complex medical discussions, offering a broad spectrum of healthcare dialogues.

Natural Language: The conversations are presented in natural language, allowing for the development and testing of NLP models focused on healthcare communication.

Applications: This dataset can be used for various applications, such as building dialogue systems, analyzing communication efficacy, developing medical NLP models, and enhancing patient care through better understanding of doctor-patient interactions.

Potential Use Cases:

NLP Model Training: Train models to understand and generate medical dialogues.

Healthcare Communication Studies: Analyze communication strategies between doctors and patients to improve healthcare delivery.

Medical Chatbots: Develop intelligent medical chatbots that can simulate doctor-patient conversations.

Patient Experience Enhancement: Identify common patient concerns and doctor responses to enhance patient care strategies.

This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.
Sepsis Dataset –
kaggle.com
Updated May 31, 2025
Cite
Fatolu Peter (2025). Sepsis Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/sepsis-dataset
Explore at:
Croissant
Dataset updated
May 31, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Fatolu Peter
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
📝 Dataset Overview: This dataset focuses on early warning detection for sepsis, a critical and potentially fatal medical condition. It includes anonymized vital signs, lab results, and clinical indicators of patients admitted to the hospital, structured for real-time monitoring and predictive modeling.

It’s ideal for clinical data analysts, healthcare data scientists, and AI practitioners aiming to develop decision support tools, early warning dashboards, or predictive health models.

🔍 Dataset Features:

Column Name | Description
Patient_ID | Unique anonymized identifier
Admission_Date | Patient’s hospital admission date
Temperature_C | Body temperature in degrees Celsius
BP_Systolic | Systolic blood pressure (mmHg)
BP_Diastolic | Diastolic blood pressure (mmHg)
Heart_Rate | Beats per minute
WBC_Count | White blood cell count (x10⁹/L)
Lactate_mmol_L | Lactate level in mmol/L
Sepsis_Flag | Binary indicator (1 = Suspected Sepsis, 0 = Normal)
Ward | Hospital ward/unit
Doctor_On_Duty | Attending physician name (anonymized)

🎯 Use Cases:

Build Power BI dashboards for hospital early warning systems

Train ML classification models to detect early signs of sepsis

Create patient monitoring tools with Python or R

Explore the relationship between vitals & sepsis onset

Perform feature engineering for risk scoring systems
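
One possible starting point for the feature-engineering use case is to derive blood-pressure features from the `BP_Systolic` and `BP_Diastolic` columns. The sketch below is an illustration only: the function names and the sample row are hypothetical, and the mean arterial pressure formula is the common bedside approximation rather than anything specified by the dataset.

```python
def pulse_pressure(systolic, diastolic):
    """Difference between systolic and diastolic pressure (mmHg)."""
    return systolic - diastolic


def mean_arterial_pressure(systolic, diastolic):
    """Common approximation: diastolic plus one third of the pulse pressure (mmHg)."""
    return diastolic + (systolic - diastolic) / 3.0


def add_bp_features(row):
    """Return a copy of the row with two derived blood-pressure features added."""
    enriched = dict(row)
    enriched["Pulse_Pressure"] = pulse_pressure(row["BP_Systolic"], row["BP_Diastolic"])
    enriched["MAP"] = mean_arterial_pressure(row["BP_Systolic"], row["BP_Diastolic"])
    return enriched


# Invented row using the column names from the feature list above.
row = {"Patient_ID": "P001", "BP_Systolic": 120, "BP_Diastolic": 80, "Heart_Rate": 78}
features = add_bp_features(row)
```

The derived column names `Pulse_Pressure` and `MAP` are not part of the dataset; they are just labels chosen for this sketch.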

📌 Clinical Relevance: Sepsis is one of the leading causes of in-hospital mortality worldwide. Early detection is crucial to reducing death rates and improving outcomes. This dataset empowers developers and analysts to make a meaningful impact in the healthcare sector.

👤 Created By: Fatolu Peter (Emperor Analytics) A passionate healthcare analyst leveraging data to drive innovation in public health across Nigeria. This is Project 12 in my data-for-good series.

✅ LinkedIn Post: 🚨 New Dataset: Sepsis Early Warning System Data – Now on Kaggle 📊 Clinical vital signs + lab markers + sepsis risk flags 🔗 Explore the dataset here

This dataset enables healthcare data scientists to: ✅ Build real-time hospital dashboards ✅ Predict sepsis risk with machine learning ✅ Explore vitals like BP, lactate, WBC, and temperature ✅ Support early intervention using data insights

Whether you're into: 🧠 Predictive modeling 📈 Power BI clinical dashboards 📉 Risk analytics in healthcare This is for you.

Join me in using data to save lives — one insight at a time. If you build something, tag me. I’ll gladly share it! 💡

#HealthcareAnalytics #SepsisAwareness #EarlyWarningSystems #KaggleDataset #PowerBI #DataForGood #FatoluPeter #EmperorAnalytics #PublicHealth #Project12 #RealWorldData

Synthetic Stroke Prediction Dataset
data.mendeley.com
kaggle.com
Updated May 2, 2025
+ more versions
Cite
Mohammed Borhan Uddin (2025). Synthetic Stroke Prediction Dataset [Dataset]. http://doi.org/10.17632/s2nh6fm925.1
Explore at:
Unique identifier
https://doi.org/10.17632/s2nh6fm925.1
Dataset updated
May 2, 2025
Authors
Mohammed Borhan Uddin
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset, generated using GPT-4o, contains 50,000 records and 12 features. The target variable is stroke, a binary label where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, which require preprocessing before analysis. A small portion of the entries includes intentionally introduced missing values so that users can practice data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
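
A first step in the missing-data analysis mentioned above is to count empty cells per column. The sketch below uses only the standard library; the CSV excerpt is invented, and every column name except the target `stroke` is an assumption rather than the file's real header.

```python
import csv
import io

# Hypothetical excerpt: apart from `stroke`, the column names and all values are assumptions.
CSV_TEXT = """age,bmi,smoking_status,stroke
67,36.6,formerly smoked,1
54,,never smoked,0
41,27.3,,0
80,,smokes,1
"""


def missing_counts(csv_text):
    """Count empty or absent cells per column in a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if name not in counts:
                continue  # ignore surplus cells beyond the header
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


counts = missing_counts(CSV_TEXT)
```

With per-column counts in hand, it becomes easier to decide which columns to impute and which to leave alone.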
Patient Doctor Q&A TR 321179
zenodo.org
csv
Updated Jul 23, 2024
Cite
Muhammed Kayra Bulut (2024). Patient Doctor Q&A TR 321179 [Dataset]. http://doi.org/10.5281/zenodo.12798934
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.12798934
Dataset updated
Jul 23, 2024
Dataset provided by
Muhammed Kayra Bulut
Authors
Muhammed Kayra Bulut
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Time period covered
Jul 18, 2024
Description
# Patient Doctor Q&A TR 321179 Dataset
The Patient Doctor Q&A TR 321179 dataset is a combined and shuffled version of the [**Patient Doctor Q&A TR 19583**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-qa-dataset-tr), [**Patient Doctor Q&A TR 167732**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-tr-167732), [**Patient Doctor Q&A TR 5695**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-translated-from-id-to-tr), and [**Patient Doctor Q&A TR 95588**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-tr-95588) datasets.

## Main Features:
* Content: Patient questions and doctor answers covering various medical topics.
* Structure: Contains 2 columns: Question, Answer.
* Language: Turkish.
## Potential Uses:
* Medical research
* Natural Language Processing (NLP)
* Medical education
## Limitations:
* Data privacy concerns
* Variability in answer quality
* Potential biases
## General Assessment:
The Patient Doctor Q&A TR 321179 dataset is a valuable resource for understanding real-world medical communication and information exchange. This dataset, translated into Turkish, is an important resource for medical research and education, and can be used to analyze communication between patients and doctors. However, limitations such as data privacy and variability in answer quality should be considered.

This dataset offers researchers and educators the opportunity to conduct more in-depth analyses and apply natural language processing techniques using Turkish medical communication data.
Cardiovascular Disease Dataset
ieee-dataport.org
Updated Oct 25, 2022
Cite
Rajib Kumar Halder Halder (2022). Cardiovascular Disease Dataset [Dataset]. https://ieee-dataport.org/documents/cardiovascular-disease-dataset
Explore at:
Dataset updated
Oct 25, 2022
Authors
Rajib Kumar Halder Halder
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This heart disease dataset is curated by combining 3 popular heart disease datasets. The first dataset (collected from Kaggle) contains 70000 records with 11 independent features, which makes it the largest heart disease dataset available so far for research purposes. These data were collected at the moment of medical examination and from information given by the patient. The second and third datasets contain 303 and 293 instances respectively, with 13 common features. The three datasets used for its curation are: Cardio Data (Kaggle Dataset)
Cardiovascular Disease dataset
kaggle.com
zip
Updated Jan 20, 2019
Cite
Svetlana Ulianova (2019). Cardiovascular Disease dataset [Dataset]. https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
Explore at:
Available download formats: zip (760066 bytes)
Dataset updated
Jan 20, 2019
Authors
Svetlana Ulianova
Description
Data description

There are 3 types of input features:

Objective: factual information;

Examination: results of medical examination;

Subjective: information given by the patient.

Features:

Feature | Feature type | Column | Values
Age | Objective Feature | age | int (days)
Height | Objective Feature | height | int (cm)
Weight | Objective Feature | weight | float (kg)
Gender | Objective Feature | gender | categorical code
Systolic blood pressure | Examination Feature | ap_hi | int
Diastolic blood pressure | Examination Feature | ap_lo | int
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal
Smoking | Subjective Feature | smoke | binary
Alcohol intake | Subjective Feature | alco | binary
Physical activity | Subjective Feature | active | binary
Presence or absence of cardiovascular disease | Target Variable | cardio | binary

All of the dataset values were collected at the moment of medical examination.
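
Because `age` is stored in days and `height` in centimetres, two small conversions are usually needed before modelling. The sketch below is illustrative only: the helper names and the sample record are invented, and dividing by 365.25 is an assumed average year length.

```python
def age_in_years(age_days):
    """Convert the `age` column (stored in days) to years, assuming 365.25 days per year."""
    return age_days / 365.25


def body_mass_index(height_cm, weight_kg):
    """BMI = weight in kilograms divided by the square of height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Invented record using the column names from the feature list above.
record = {"age": 18393, "height": 168, "weight": 62.0, "ap_hi": 110, "ap_lo": 80}
years = age_in_years(record["age"])
bmi = body_mass_index(record["height"], record["weight"])
```

Derived values like these can be added as new columns next to the original ones rather than replacing them.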
Maternal Health Dataset
kaggle.com
Updated Aug 4, 2025
Cite
JANANI D (2025). Maternal Health Dataset [Dataset]. https://www.kaggle.com/datasets/hgfdrtyu/maternal-health-dataset
Explore at:
Croissant
Dataset updated
Aug 4, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
JANANI D
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
This dataset contains health indicators related to pregnant women to help assess maternal health risk levels. The primary goal is to classify the level of maternal health risk into categories such as Low, Mid, or High, based on various medical features.

The dataset can be used for classification tasks in machine learning, data visualization, healthcare analytics, and predictive modeling.
Symptom-Based Disease Labeling Dataset
kaggle.com
Updated Sep 7, 2023
Cite
Krish Sharma (2023). Symptom-Based Disease Labeling Dataset [Dataset]. https://www.kaggle.com/datasets/krish0202/symptom-based-disease-labeling-dataset
Explore at:
Croissant
Dataset updated
Sep 7, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Krish Sharma
Description
The "Symptom-Based Disease Labeling Dataset" is a valuable resource for the development and evaluation of Natural Language Processing (NLP) models aimed at disease diagnosis and healthcare applications. This dataset contains a diverse collection of English sentences that describe various symptoms and corresponding labels specifying the associated disease.

Key Features:

Diverse Symptoms: The dataset includes a rich variety of symptom descriptions, encompassing both common and rare health issues, making it suitable for comprehensive disease diagnosis tasks.

Accurate Labels: Each sentence is meticulously labeled with the corresponding disease, ensuring high-quality training and evaluation data for NLP models.
Hospital Patient Treatment Dataset
kaggle.com
Updated Jun 26, 2025
Cite
Slidescope (2025). Hospital Patient Treatment Dataset [Dataset]. https://www.kaggle.com/datasets/slidescope/hospital-patient-treatment-dataset
Explore at:
Croissant
Dataset updated
Jun 26, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Slidescope
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
This Hospital Patient Treatment Dataset contains simulated data representing patient treatments in a multi-specialty hospital. It includes 200 records with both categorical and numerical fields, designed for data visualization and analysis using tools like Power BI, Tableau, or Python.

The dataset features 3–4 categorical fields: Department (e.g., Cardiology, Neurology, Orthopedics), Treatment Type (Surgery, Medication, Therapy, Observation), Doctor Name, and Gender. It also includes numerical fields such as Age, Treatment Cost, Hospital Stay (Days), and Recovery Score (ranging from 0 to 100).

For more information about the columns, visit: https://colorstech.net/practice-datasets/hospital-patient-treatment-dataset-for-analysis/

This dataset is ideal for healthcare analysts and data enthusiasts who want to practice analyzing treatment efficiency, patient demographics, cost effectiveness, and healthcare outcomes. Potential analyses include cost comparisons by department, gender-based treatment patterns, doctor performance based on recovery scores, and identifying which treatments lead to faster recovery.

It can help simulate real-world healthcare reporting scenarios, such as understanding hospital load, cost optimization, or patient well-being tracking. The inclusion of a unique Patient ID allows for easy referencing and segmentation.

This dataset is well-suited for creating KPIs, dashboards, and advanced visualizations to gain insights into hospital operations and patient care outcomes. No real patient data is used—this is a synthetic dataset for educational use only.
Environmental settings of the proposed system.
figshare.com
plos.figshare.com
xls
Updated Jul 17, 2025
Cite
Shuvo Biswas; Rafid Mostafiz; Mohammad Shorif Uddin; Muhammad Shahin Uddin (2025). Environmental settings of the proposed system. [Dataset]. http://doi.org/10.1371/journal.pone.0324957.t005
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0324957.t005
Dataset updated
Jul 17, 2025
Dataset provided by
PLOS ONE
Authors
Shuvo Biswas; Rafid Mostafiz; Mohammad Shorif Uddin; Muhammad Shahin Uddin
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Pneumonia, a severe lung infection caused by various viruses, presents significant challenges in diagnosis and treatment due to its similarities with other respiratory conditions. Additionally, the need to protect patient privacy complicates the sharing of sensitive clinical data. This study introduces FLPneXAINet, an effective framework that combines federated learning (FL) with deep learning (DL) and explainable AI (XAI) to securely and accurately predict pneumonia using chest X-ray (CXR) images. We utilized a benchmark dataset from Kaggle, comprising 8,402 CXR images (3,904 normal and 4,498 pneumonia). The dataset was preprocessed and augmented using a cycle-consistent generative adversarial (CycleGAN) network to increase the volume of training data. Three pre-trained DL models named VGG16, NASNetMobile, and MobileNet were employed to extract features from the augmented dataset. Further, four ensemble DL (EDL) models were used to enhance feature extraction. Feature optimization was performed using recursive feature elimination (RFE), analysis of variance (ANOVA), and random forest (RF) to select the most relevant features. These optimized features were then inputted into machine learning (ML) models, including K-nearest neighbor (KNN), naive bayes (NB), support vector machine (SVM), and RF, for pneumonia prediction. The performance of the models was evaluated in a FL environment, with the EDL network achieving the best results: accuracy 97.61%, F1 score 98.36%, recall 98.13%, and precision 98.59%. The framework’s predictions were further validated using two XAI techniques—Local Interpretable Model-Agnostic Explanations (LIME) and Grad-CAM. FLPneXAINet offers a robust solution for healthcare professionals to accurately diagnose pneumonia, ensuring timely treatment while safeguarding patient privacy.
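
As a quick consistency check on the figures quoted above (not part of the paper itself), the F1 score can be recomputed from the reported precision and recall using its standard definition as their harmonic mean. The function name `f1_score` is just a label for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract above.
precision = 0.9859
recall = 0.9813
f1 = f1_score(precision, recall)
print(f"F1 = {f1 * 100:.2f}%")  # prints "F1 = 98.36%"

# The two class counts quoted above should add up to the stated total of 8,402 images.
total_images = 3904 + 4498
```

Both checks agree with the abstract: the recomputed F1 rounds to the reported 98.36%, and the class counts sum to 8,402.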
Data from: OCT5k: A dataset of multi-disease and multi-graded annotations...
rdr.ucl.ac.uk
zip
Updated Nov 7, 2024
+ more versions
Cite
Mustafa Arikan (2024). OCT5k: A dataset of multi-disease and multi-graded annotations for retinal layers [Dataset]. http://doi.org/10.5522/04/22128671.v4
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5522/04/22128671.v4
Dataset updated
Nov 7, 2024
Dataset provided by
University College London
Authors
Mustafa Arikan
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The thickness and appearance of retinal layers are essential markers for diagnosing and studying eye diseases. Despite the increasing availability of imaging devices to scan and store large amounts of data, analyzing retinal images and generating trial endpoints has remained a manual, error-prone, and time-consuming task. In particular, the lack of large amounts of high-quality labels for different diseases hinders the development of automated algorithms. Therefore, we have compiled 5016 pixel-wise manual labels for 1672 optical coherence tomography (OCT) scans featuring two different diseases as well as healthy subjects to help democratize the process of developing novel automatic techniques. We also collected 4698 bounding box annotations for a subset of 566 scans across 9 classes of disease biomarker. Due to variations in retinal morphology, intensity range, and changes in contrast and brightness, designing segmentation and detection methods that can generalize to different disease types is challenging. While machine learning-based methods can overcome these challenges, high-quality expert annotations are necessary for training. Publicly available annotated image datasets typically contain few images and/or only cover a single type of disease, and most are only annotated by a single grader. To address this gap, we present a comprehensive multi-grader and multi-disease dataset for training machine learning-based algorithms. The proposed dataset covers three subsets of scans (Age-related Macular Degeneration, Diabetic Macular Edema, and healthy) and annotations for two types of tasks (semantic segmentation and object detection).
LUTH Hospital Enhanced Dataset –
kaggle.com
Updated May 31, 2025
Cite
Fatolu Peter (2025). LUTH Hospital Enhanced Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/luth-hospital-enhanced-dataset
Explore at:
Croissant
Dataset updated
May 31, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Fatolu Peter
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
📝 Dataset Overview: This enhanced dataset captures the real-world operational and clinical performance data from a major hospital in Nigeria — Lagos University Teaching Hospital (LUTH). It includes detailed information on admissions, patient care, medical services, billing, and staff activities, ideal for healthcare analytics, hospital management dashboards, and machine learning projects.

🔍 Dataset Features (Suggested Columns):

Column Name | Description
Patient_ID | Unique anonymized patient ID
Admission_Date | Date of admission
Discharge_Date | Date of discharge
Gender | Patient’s gender
Age | Patient’s age
Department | Medical department involved
Diagnosis | Primary diagnosis
Doctor | Attending physician (anonymized)
Treatment_Provided | Type of treatment/procedure
Lab_Tests | Count of lab tests conducted
Medications_Administered | Total medications given
Surgery_Cost (₦) | If applicable, cost of surgery
Bill_Amount (₦) | Total bill charged to patient
Ward | Hospital ward assigned
Length_of_Stay (days) | Duration of hospitalization

🎯 Use Cases:

Build hospital operations dashboards in Power BI

Analyze billing and cost patterns across departments

Predict length of stay or discharge outcomes

Explore departmental workload and performance

Use as a base for AI in hospital management systems
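
For the length-of-stay use case, one sanity check is to recompute the stay from `Admission_Date` and `Discharge_Date` and compare it with the `Length_of_Stay (days)` column. The sketch below is an illustration under assumptions: the rows are invented, the dates are assumed to be ISO formatted (YYYY-MM-DD), and a same-day discharge counts as 0 days.

```python
from datetime import date


def length_of_stay(admission, discharge):
    """Whole days between admission and discharge, given ISO dates (YYYY-MM-DD)."""
    start = date.fromisoformat(admission)
    end = date.fromisoformat(discharge)
    if end < start:
        raise ValueError("discharge date precedes admission date")
    return (end - start).days


# Invented rows; the ISO date format is an assumption about the file.
rows = [
    {"Patient_ID": "P001", "Admission_Date": "2025-01-03", "Discharge_Date": "2025-01-10"},
    {"Patient_ID": "P002", "Admission_Date": "2025-02-27", "Discharge_Date": "2025-03-02"},
]
stays = {
    r["Patient_ID"]: length_of_stay(r["Admission_Date"], r["Discharge_Date"])
    for r in rows
}
```

If the file counts the admission day as day 1, the recomputed values will be one lower than the stored column, so the convention is worth checking before modelling.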

🏥 Clinical & Operational Value: This dataset empowers analysts and healthcare professionals to:

Track patient outcomes and billing efficiency

Reduce operational bottlenecks

Improve patient care with data-driven recommendations

Benchmark departmental performance

Train predictive models for resource allocation

👤 Created By: Fatolu Peter (Emperor Analytics) Dedicated to transforming public healthcare using analytics and real-world data across Nigerian hospitals. This is Project 14 in my growing health-tech analytics journey.

✅ LinkedIn Post: 🚑 New Kaggle Dataset: LUTH Hospital Enhanced Clinical & Operations Data 📊 Real hospital data on admissions, billing, treatments, and care metrics 🔗 Access the dataset now on Kaggle

This dataset gives you: ✅ Real hospital operations data ✅ Billing and medication insights ✅ Doctor and ward-level activity ✅ A perfect base for building Power BI dashboards or training ML models

Whether you're a data scientist, health analyst, or Power BI pro — this is real-world data to make real impact. Let’s build something powerful together. 💡

#HealthcareAnalytics #LUTH #HospitalData #KaggleDataset #PowerBI #FatoluPeter #EmperorAnalytics #DataForGood #Project14 #PublicHealth #NigeriaHealthData
Virtual Patient Model Assessment
kaggle.com
Updated Feb 18, 2023
Cite
The Devastator (2023). Virtual Patient Model Assessment [Dataset]. https://www.kaggle.com/datasets/thedevastator/virtual-patient-model-assessment
Explore at:
Croissant
Dataset updated
Feb 18, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
Virtual Patient Model Assessment

Tracking Physical, Psychological and Cognitive Performance in Older Adults

About this dataset

This dataset provides a comprehensive overview of the physical, psychological and cognitive health of a cohort of older adults. It contains data collected from medical experts during clinical assessments covering physical activity, nutrition, activity limitations, balance, depression and cognition. Additionally, it includes parameters extracted from the devices used, such as average heart rate per day and average gait speed. Coupled with this is detailed information on falls, fractures and loss of orientation within the group studied, which can add further insight into the overall health trends for those aged 55 and above.

The dataset includes various scores capturing different aspects of health, alongside statistics that better represent participants' lifestyles. It features not only basic metrics like gender or age but also complex measures like exhaustion or grip strength for each individual in the cohort. Furthermore, an analytical exploration of nutrition measures (e.g., Body Mass Index), social interaction (e.g., phone calls) or leisure activities (e.g., clubs) could help uncover correlations among them, which may inform strategies for improving well-being among elderly population groups.

How to use the dataset

This dataset provides a comprehensive overview of the physical, psychological and cognitive health of a cohort of older adults. It includes parameters related to physical activity, nutrition, activity limitations, balance, depression, cognition and more. Through this dataset you can gain insights into the various factors affecting the health of older adults in a population, which could help researchers or practitioners develop interventions to promote elderly health.

Before using this dataset it is advised to get familiar with the variables and fields provided. There are two sections within each variable: descriptive information such as gender and age group; and scores related to various aspects such as heart rate per day or average gait speed per month. You may also find additional coupled events like falls or fractures that can impact the assessment scores over time.

Once you have gone through all the variables available in the dataset, you can use simple statistical methods, such as computing the mean values of several key indicators (such as balance score or BMI score) across different characteristics (such as age group). Comparing these values allows researchers to identify trends among different groups within a population that would show differences on an individual level.
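
The group-wise comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `age_group` and `balance_score`, the helper `mean_by_group`, and the sample records are hypothetical and are not taken from the dataset's actual files.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Return {group: mean of value_key}, skipping rows whose value is missing."""
    buckets = defaultdict(list)
    for row in rows:
        value = row.get(value_key)
        if value is None:
            continue  # ignore missing measurements instead of counting them as 0
        buckets[row[group_key]].append(float(value))
    return {group: mean(values) for group, values in buckets.items()}


# Hypothetical records; the real column names and values may differ.
records = [
    {"age_group": "55-64", "balance_score": 8.0},
    {"age_group": "55-64", "balance_score": 6.0},
    {"age_group": "65-74", "balance_score": 5.0},
    {"age_group": "65-74", "balance_score": None},
    {"age_group": "75+", "balance_score": 4.0},
]

group_means = mean_by_group(records, "age_group", "balance_score")
for group, value in sorted(group_means.items()):
    print(f"{group}: {value:.2f}")
```

With real data, the rows could come from `csv.DictReader` once the actual column names are known; empty strings would then need to be treated as missing values as well.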

Other techniques that could be used include clustering, to observe patterns relating several indicators at once on comparative models; logistic regression, to identify which predictors best explain certain outcomes among older adults; or propensity-matching approaches, which suggest what kind of intervention should be given depending on each person's characteristics, based on data accumulated from elderly population research using this dataset. The usefulness of this dataset is not limited by stats only but it might also benefit from theoretical forms such as narrative geometry used for subjective analysis by placing story-telling elements along with formative assessments onto conceptual frameworks between inside natural ecosystems already running smoothly (between concepts) before disruption/disequilibrium happens due external stressors ecomorphonologically speaking. This will eventually help clinicians addressing psychological conditions verifying objective status via outcomes from metrics established earlier preferably prior experiments where involuntary independent behavior was detected influencing vital organ systems at homeostasis levels either causing positive adaptations / fitness, or increasing vulnerability that when added up together shift towards severe distress turn proximally considering also other segments elsewhere varying across multiple networks simultaneous injections cumulated/integrated effects starting sometimes after take off periods way before ill health seems obviously concrete therefore important details concerning risk factors sometimes overlooked got noticed while capturing evidence based prospective by cross validated means completed longitudinal surveys taking advantage into being able understanding potentially confounding conditions sparedly manifested either forgotten beca...
Features_corected
kaggle.com
zip
Updated May 13, 2021
TUSHAR SARKAR (2021). Features_corected [Dataset]. https://www.kaggle.com/tusharsarkar/features-corected
Available download formats: zip (14633271 bytes)
Dataset updated
May 13, 2021
Authors
TUSHAR SARKAR
Description
Dataset

This dataset was created by TUSHAR SARKAR

Contents

It contains the following files:
Human Vital Sign Dataset
kaggle.com
Updated Jul 19, 2024
DatasetEngineer (2024). Human Vital Sign Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/8992827
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8992827
Dataset updated
Jul 19, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
DatasetEngineer
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

The Human Vital Signs Dataset is a comprehensive collection of key physiological parameters recorded from patients. This dataset is designed to support research in medical diagnostics, patient monitoring, and predictive analytics. It includes both original attributes and derived features to provide a holistic view of patient health.

Attributes

Patient ID
Description: A unique identifier assigned to each patient.
Type: Integer
Example: 1, 2, 3, ...

Heart Rate
Description: The number of heartbeats per minute.
Type: Integer
Range: 60-100 bpm (for this dataset)
Example: 72, 85, 90

Respiratory Rate
Description: The number of breaths taken per minute.
Type: Integer
Range: 12-20 breaths per minute (for this dataset)
Example: 16, 18, 15

Timestamp
Description: The exact time at which the vital signs were recorded.
Type: Datetime
Format: YYYY-MM-DD HH:MM
Example: 2023-07-19 10:15:30

Body Temperature
Description: The body temperature measured in degrees Celsius.
Type: Float
Range: 36.0-37.5°C (for this dataset)
Example: 36.7, 37.0, 36.5

Oxygen Saturation
Description: The percentage of oxygen-bound hemoglobin in the blood.
Type: Float
Range: 95-100% (for this dataset)
Example: 98.5, 97.2, 99.1

Systolic Blood Pressure
Description: The pressure in the arteries when the heart beats (systolic pressure).
Type: Integer
Range: 110-140 mmHg (for this dataset)
Example: 120, 130, 115

Diastolic Blood Pressure
Description: The pressure in the arteries when the heart rests between beats (diastolic pressure).
Type: Integer
Range: 70-90 mmHg (for this dataset)
Example: 80, 75, 85

Age
Description: The age of the patient.
Type: Integer
Range: 18-90 years (for this dataset)
Example: 25, 45, 60

Gender
Description: The gender of the patient.
Type: Categorical
Categories: Male, Female
Example: Male, Female

Weight (kg)
Description: The weight of the patient in kilograms.
Type: Float
Range: 50-100 kg (for this dataset)
Example: 70.5, 80.3, 65.2

Height (m)
Description: The height of the patient in meters.
Type: Float
Range: 1.5-2.0 m (for this dataset)
Example: 1.75, 1.68, 1.82

Derived Features

Derived_HRV (Heart Rate Variability)
Description: A measure of the variation in time between heartbeats.
Type: Float
Formula: $\mathrm{HRV} = \dfrac{\text{Standard Deviation of Heart Rate over a Period}}{\text{Mean Heart Rate over the Same Period}}$
Example: 0.10, 0.12, 0.08

Derived_Pulse_Pressure (Pulse Pressure)
Description: The difference between systolic and diastolic blood pressure.
Type: Integer
Formula: $\mathrm{PP} = \text{Systolic Blood Pressure} - \text{Diastolic Blood Pressure}$
Example: 40, 45, 30

Derived_BMI (Body Mass Index)
Description: A measure of body fat based on weight and height.
Type: Float
Formula: $\mathrm{BMI} = \dfrac{\text{Weight (kg)}}{(\text{Height (m)})^{2}}$
Example: 22.8, 25.4, 20.3

Derived_MAP (Mean Arterial Pressure)
Description: An average blood pressure in an individual during a single cardiac cycle.
Type: Float
Formula: $\mathrm{MAP} = \text{Diastolic Blood Pressure} + \dfrac{1}{3}\,(\text{Systolic Blood Pressure} - \text{Diastolic Blood Pressure})$
Example: 93.3, 100.0, 88.7

Target Feature

Risk Category
Description: Classification of patients into "High Risk" or "Low Risk" based on their vital signs.
Type: Categorical
Categories: High Risk, Low Risk
Criteria:
High Risk: Any of the following conditions
Heart Rate: > 90 bpm or < 60 bpm
Respiratory Rate: > 20 breaths per minute or < 12 breaths per minute
Body Temperature: > 37.5°C or < 36.0°C
Oxygen Saturation: < 95%
Systolic Blood Pressure: > 140 mmHg or < 110 mmHg
Diastolic Blood Pressure: > 90 mmHg or < 70 mmHg
BMI: > 30 or < 18.5
Low Risk: None of the above conditions
Example: High Risk, Low Risk

This dataset, with a total of 200,000 samples, provides a robust foundation for various machine learning and statistical analysis tasks aimed at understanding and predicting patient health outcomes based on vital signs. The inclusion of both original attributes and derived features enhances the richness and utility of the dataset.
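
The pulse pressure, BMI, and MAP formulas and the risk criteria listed above can be sketched in Python as follows. This is a minimal illustration, not the dataset author's code: the function and parameter names (`derive_features`, `risk_category`, and so on) are assumptions chosen for readability, and HRV is left out because it needs a series of heart-rate readings rather than a single record.

```python
def derive_features(systolic, diastolic, weight_kg, height_m):
    """Compute the single-record derived features from the stated formulas."""
    pulse_pressure = systolic - diastolic               # PP = systolic - diastolic
    bmi = weight_kg / height_m ** 2                     # BMI = weight / height^2
    mean_arterial = diastolic + (systolic - diastolic) / 3  # MAP = DBP + (SBP - DBP) / 3
    return {
        "pulse_pressure": pulse_pressure,
        "bmi": bmi,
        "map": mean_arterial,
    }


def risk_category(heart_rate, respiratory_rate, body_temp_c,
                  oxygen_saturation, systolic, diastolic, bmi):
    """Return "High Risk" if any listed threshold is breached, else "Low Risk"."""
    high_risk = (
        heart_rate > 90 or heart_rate < 60
        or respiratory_rate > 20 or respiratory_rate < 12
        or body_temp_c > 37.5 or body_temp_c < 36.0
        or oxygen_saturation < 95
        or systolic > 140 or systolic < 110
        or diastolic > 90 or diastolic < 70
        or bmi > 30 or bmi < 18.5
    )
    return "High Risk" if high_risk else "Low Risk"


# Hypothetical patient record, for illustration only.
features = derive_features(systolic=120, diastolic=80, weight_kg=70.0, height_m=1.75)
label = risk_category(72, 16, 36.7, 98.5, 120, 80, features["bmi"])
print(features, label)
```

Note that the stated heart-rate range for the dataset is 60-100 bpm while the High Risk threshold is 90 bpm, so readings between 91 and 100 bpm fall inside the range and are still labeled High Risk under these criteria.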
3500 hospitals data of USA
kaggle.com
Updated Oct 15, 2023
Aman Garg 0303 (2023). 3500 hospitals data of USA [Dataset]. https://www.kaggle.com/datasets/amangarg0303/3500-hospitals-data-of-usa/suggestions
Dataset updated
Oct 15, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aman Garg 0303
Area covered
United States
Description
The dataset is taken from the communitybenefitinsight website, which has data on around 3500 hospitals in the USA, including the hospital name, address, city, state, zipcode, bed_count, and many more features. This data is for practice and educational purposes.
CellNet
kaggle.com
Updated Jul 2, 2024
John Capocyan (2024). CellNet [Dataset]. http://doi.org/10.34740/kaggle/dsv/8839178
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8839178
Dataset updated
Jul 2, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
John Capocyan
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
CellNet: A Large-Scale Comprehensive Medical Imaging Database Research Effort

Overview:

Inspired by ImageNet, the database from Stanford's leading AI research group that is beloved in computer vision, CellNet is a meticulously curated dataset featuring over 120,000 high-quality medical images representing over 20 organ/cancer classes. The idea started in June 2024 during research for the paper "Enhancement Framework for Vision Transformers in Data Limited Cancer Detection" [Presented at the IEEE's 2024 MIT Undergraduate Research Technology Conference (URTC)]. The study's future work proposes that, given enough data and computation, pre-trained CellNet weights transferred to other models working on various medical imaging tasks can outperform models pre-trained on ImageNet. More importantly, CellNet relies on the study's theory that models pre-trained on large datasets tailored to the field of their task will perform better than general pre-trained models. In CellNet's case specifically, pre-trained weights derived from medical tasks can potentially provide a more sound and ethical base for medical AI models in future applications and research on cancer diagnosis and related fields.

Data Collection:

The data for CellNet has been aggregated from numerous reputable online repositories and medical labs, ensuring a diverse and comprehensive collection of histopathology images. In collaboration with a team of physicians, the lab-collected images included in the set have been meticulously verified and annotated to maintain high standards of accuracy and reliability.

Continuous Updates:

One of the key features of CellNet is its dynamic and evolving nature. CellNet is committed to updating monthly, integrating new data and reorganizing existing entries based on the latest research and clinical findings. This continuous update cycle ensures that CellNet remains a relevant and up-to-date resource for the clinical and medical AI research communities.

Applications:

CellNet is designed to facilitate various machine learning model applications, including but not limited to:

Image Classification

Image Segmentation

Transfer Learning

However, it's important to note that as this dataset grows, weights obtained from training on this set are primarily meant to be imported to other models for transfer learning on smaller datasets.

Community Engagement:

The CellNet team encourages the Kaggle community to engage with CellNet by exploring the dataset, creating notebooks, and providing feedback. Your contributions and insights are invaluable in helping CellNet improve and expand its resource and weights.

License:

This dataset is licensed under the MIT License. Please cite the dataset if you use it in your research.

Acknowledgements:

I would like to thank all the contributors and the medical professionals who helped produce all the images curated in this dataset.

Contact:

For any questions or support, please contact [johncapocyan@gmail.com].

Dive into the dataset today and join the mission to revolutionize medical research through the power of data and AI!
UK Hospitals list
kaggle.com
Updated Aug 5, 2024
Subhankar Sinha (2024). UK Hospitals list [Dataset]. https://www.kaggle.com/datasets/subhankarsinha1985/uk-hospitals-list/discussion?sort=undefined
Dataset updated
Aug 5, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Subhankar Sinha
Area covered
United Kingdom
Description
This is a list of UK hospitals containing details about 299 hospitals. The dataset contains no null values and has 7 features altogether.
