This dataset is based on the train and test datasets from this competition: https://www.kaggle.com/competitions/widsdatathon2024-challenge1
What did I change?
1. I dropped 2 columns that contained too little data.
2. Using machine learning, I imputed "payer_type", "patient_race" and "bmi".
3. Using "patient_zip3", I filled missing values in "patient_state", "Region" and "Division".
4. Using SimpleImputer, I imputed the few missing numeric values in "Ozone", "PM2.5" and other columns.
5. I created some new features, based on demographic features, that may be a bit more informative.
6. I tokenized the 'breast_cancer_diagnosis_desc' column.
If you're interested in how I did that, check these notebooks: https://www.kaggle.com/code/anopsy/ml-for-missing-values for "bmi", and https://www.kaggle.com/code/anopsy/fe-and-xgb-on-clean-data for the new features.
According to the description of the original dataset, it's a "39k record dataset (split into training and test sets) representing patients and their characteristics (age, race, BMI, zip code), their diagnosis and treatment information (breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, … etc.), their geo (zip-code level) demographic data (income, education, rent, race, poverty, …etc), as well as toxic air quality data (Ozone, PM25 and NO2)."
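The two simpler filling steps above (filling a categorical column from "patient_zip3", and imputing a few numeric gaps) can be sketched roughly as follows. This is a standard-library illustration, not the author's notebook code: the rows are made up, the helper names `fill_from_key` and `fill_with_median` are hypothetical, and the median is an assumed strategy (the source only says SimpleImputer was used, without naming the strategy).

```python
from collections import Counter
from statistics import median

# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    {"patient_zip3": "100", "patient_state": "NY", "Ozone": 38.0},
    {"patient_zip3": "100", "patient_state": None, "Ozone": None},
    {"patient_zip3": "900", "patient_state": "CA", "Ozone": 42.0},
    {"patient_zip3": "900", "patient_state": None, "Ozone": 40.0},
]


def fill_from_key(rows, key, target):
    """Fill missing `target` values with the most common value seen for the same `key`."""
    seen = {}
    for r in rows:
        if r[target] is not None:
            seen.setdefault(r[key], Counter())[r[target]] += 1
    for r in rows:
        if r[target] is None and r[key] in seen:
            r[target] = seen[r[key]].most_common(1)[0][0]


def fill_with_median(rows, column):
    """Replace missing numeric values with the median of the observed values."""
    m = median(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = m


fill_from_key(rows, "patient_zip3", "patient_state")
fill_with_median(rows, "Ozone")
```

Rows whose zip prefix never appears with a known state are left untouched, so they would still need a separate lookup table.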
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Medical Insurance Expenses & Premium Dataset
This dataset captures demographic and financial information related to medical insurance policyholders. It includes key features such as age, gender, BMI, number of children, discount eligibility status, and the geographic region of the insured. The dataset also provides the actual medical expenses incurred (expenses) and the insurance premium charged (premium).
The purpose of this dataset is to support research and development of machine learning models for predicting healthcare costs, optimizing pricing strategies, and understanding factors that influence insurance expenses and premiums.
Columns
age: Age of the policyholder
gender: Gender (male/female)
bmi: Body Mass Index
children: Number of children covered by the insurance
discount_eligibility: Whether the policyholder is eligible for a discount (yes/no)
region: Geographic region (e.g., southeast, northwest)
expenses: Actual medical costs incurred by the policyholder (Target number 1)
premium: Insurance premium charged (Target number 2)
Example Use Cases
Predicting insurance expenses for new applicants
Analyzing which demographic factors contribute most to higher premiums
Exploring correlations between BMI, age, and healthcare costs
Developing regression and classification models for pricing optimization
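As a starting point for the correlation use case listed above, here is a minimal sketch that computes the Pearson correlation between a feature and `expenses` using only the standard library. The records are invented for illustration and the function name `pearson` is my own; with the real file you would read the rows from the CSV instead.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented records using the column names described above.
records = [
    {"age": 25, "bmi": 22.0, "expenses": 2500.0},
    {"age": 40, "bmi": 27.5, "expenses": 6200.0},
    {"age": 55, "bmi": 31.0, "expenses": 11800.0},
    {"age": 33, "bmi": 24.3, "expenses": 4100.0},
]

for feature in ("age", "bmi"):
    r = pearson([rec[feature] for rec in records],
                [rec["expenses"] for rec in records])
    print(f"{feature} vs expenses: r = {r:.3f}")
```

The function does not guard against a constant column, which would cause a division by zero.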
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description:
This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.
Key Features:
Potential Use Cases:
This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📝 Dataset Overview: This dataset focuses on early warning detection for sepsis, a critical and potentially fatal medical condition. It includes anonymized vital signs, lab results, and clinical indicators of patients admitted to the hospital, structured for real-time monitoring and predictive modeling.
It’s ideal for clinical data analysts, healthcare data scientists, and AI practitioners aiming to develop decision support tools, early warning dashboards, or predictive health models.
🔍 Dataset Features:
Patient_ID: Unique anonymized identifier
Admission_Date: Patient’s hospital admission date
Temperature_C: Body temperature in degrees Celsius
BP_Systolic: Systolic blood pressure (mmHg)
BP_Diastolic: Diastolic blood pressure (mmHg)
Heart_Rate: Beats per minute
WBC_Count: White blood cell count (x10⁹/L)
Lactate_mmol_L: Lactate level in mmol/L
Sepsis_Flag: Binary indicator (1 = Suspected Sepsis, 0 = Normal)
Ward: Hospital ward/unit
Doctor_On_Duty: Attending physician name (anonymized)
🎯 Use Cases:
Build Power BI dashboards for hospital early warning systems
Train ML classification models to detect early signs of sepsis
Create patient monitoring tools with Python or R
Explore the relationship between vitals & sepsis onset
Perform feature engineering for risk scoring systems
📌 Clinical Relevance: Sepsis is one of the leading causes of in-hospital mortality worldwide. Early detection is crucial to reducing death rates and improving outcomes. This dataset empowers developers and analysts to make a meaningful impact in the healthcare sector.
👤 Created By: Fatolu Peter (Emperor Analytics) A passionate healthcare analyst leveraging data to drive innovation in public health across Nigeria. This is Project 12 in my data-for-good series.
✅ LinkedIn Post: 🚨 New Dataset: Sepsis Early Warning System Data – Now on Kaggle 📊 Clinical vital signs + lab markers + sepsis risk flags 🔗 Explore the dataset here
This dataset enables healthcare data scientists to: ✅ Build real-time hospital dashboards ✅ Predict sepsis risk with machine learning ✅ Explore vitals like BP, lactate, WBC, and temperature ✅ Support early intervention using data insights
Whether you're into: 🧠 Predictive modeling 📈 Power BI clinical dashboards 📉 Risk analytics in healthcare This is for you.
Join me in using data to save lives — one insight at a time. If you build something, tag me. I’ll gladly share it! 💡
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset, generated using GPT-4o, contains 50,000 records and 12 features. The target variable is stroke, a binary label where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, which require preprocessing before analysis. A small portion of the entries includes intentionally introduced missing values so that users can practice data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
# Patient Doctor Q&A TR 321179 Dataset
The Patient Doctor Q&A TR 321179 dataset is a combined and shuffled version of the [**Patient Doctor Q&A TR 19583**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-qa-dataset-tr), [**Patient Doctor Q&A TR 167732**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-tr-167732), [**Patient Doctor Q&A TR 5695**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-translated-from-id-to-tr), and [**Patient Doctor Q&A TR 95588**](https://www.kaggle.com/datasets/kaayra2000/patient-doctor-q-and-a-tr-95588) datasets.
## Main Features:
* Content: Patient questions and doctor answers covering various medical topics.
* Structure: Contains 2 columns: Question, Answer.
* Language: Turkish.
## Potential Uses:
* Medical research
* Natural Language Processing (NLP)
* Medical education
## Limitations:
* Data privacy concerns
* Variability in answer quality
* Potential biases
## General Assessment:
The Patient Doctor Q&A TR 321179 dataset is a valuable resource for understanding real-world medical communication and information exchange. This dataset, translated into Turkish, is an important resource for medical research and education, and can be used to analyze communication between patients and doctors. However, limitations such as data privacy and variability in answer quality should be considered.
This dataset offers researchers and educators the opportunity to conduct more in-depth analyses and apply natural language processing techniques using Turkish medical communication data.
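A combined and shuffled set like the one described above could be assembled roughly as follows. This is only a sketch under stated assumptions: the function name `combine_and_shuffle`, the fixed seed, and the placeholder (Question, Answer) pairs are mine, and the source does not describe how the author actually merged the four datasets.

```python
import random


def combine_and_shuffle(parts, seed=42):
    """Concatenate several lists of (question, answer) pairs and shuffle them reproducibly."""
    combined = [pair for part in parts for pair in part]
    random.Random(seed).shuffle(combined)  # local RNG, so the global random state is untouched
    return combined


# Placeholder rows standing in for the four source datasets.
parts = [
    [("question 1", "answer 1")],
    [("question 2", "answer 2"), ("question 3", "answer 3")],
    [("question 4", "answer 4")],
    [("question 5", "answer 5")],
]

merged = combine_and_shuffle(parts)
print(len(merged), "pairs after merging")
```

Using a seeded `random.Random` instance keeps the shuffle repeatable, which helps when the merged file needs to be regenerated.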
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset is curated by combining 3 popular heart disease datasets. The first dataset (collected from Kaggle) contains 70,000 records with 11 independent features, which makes it the largest heart disease dataset available so far for research purposes. These data were collected at the moment of medical examination and from information given by the patient. The second and third datasets contain 303 and 293 instances respectively, with 13 common features. The three datasets used for its curation are: Cardio Data (Kaggle Dataset)
There are 3 types of input features:
Features:
All of the dataset values were collected at the moment of medical examination.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains health indicators related to pregnant women to help assess maternal health risk levels. The primary goal is to classify the level of maternal health risk into categories such as Low, Mid, or High, based on various medical features.
The dataset can be used for classification tasks in machine learning, data visualization, healthcare analytics, and predictive modeling.
The "Symptom-Based Disease Labeling Dataset" is a valuable resource for the development and evaluation of Natural Language Processing (NLP) models aimed at disease diagnosis and healthcare applications. This dataset contains a diverse collection of English sentences that describe various symptoms and corresponding labels specifying the associated disease.
Key Features:
Diverse Symptoms: The dataset includes a rich variety of symptom descriptions, encompassing both common and rare health issues, making it suitable for comprehensive disease diagnosis tasks.
Accurate Labels: Each sentence is meticulously labeled with the corresponding disease, ensuring high-quality training and evaluation data for NLP models.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This Hospital Patient Treatment Dataset contains simulated data representing patient treatments in a multi-specialty hospital. It includes 200 records with both categorical and numerical fields, designed for data visualization and analysis using tools like Power BI, Tableau, or Python.
The dataset features 3–4 categorical fields: Department (e.g., Cardiology, Neurology, Orthopedics), Treatment Type (Surgery, Medication, Therapy, Observation), Doctor Name, and Gender. It also includes numerical fields such as Age, Treatment Cost, Hospital Stay (Days), and Recovery Score (ranging from 0 to 100).
For more information about the columns, visit: https://colorstech.net/practice-datasets/hospital-patient-treatment-dataset-for-analysis/
This dataset is ideal for healthcare analysts and data enthusiasts who want to practice analyzing treatment efficiency, patient demographics, cost effectiveness, and healthcare outcomes. Potential analyses include cost comparisons by department, gender-based treatment patterns, doctor performance based on recovery scores, and identifying which treatments lead to faster recovery.
It can help simulate real-world healthcare reporting scenarios, such as understanding hospital load, cost optimization, or patient well-being tracking. The inclusion of a unique Patient ID allows for easy referencing and segmentation.
This dataset is well-suited for creating KPIs, dashboards, and advanced visualizations to gain insights into hospital operations and patient care outcomes. No real patient data is used—this is a synthetic dataset for educational use only.
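A cost comparison by department, one of the analyses suggested above, might look like the sketch below. The records are invented and the helper name `mean_by` is hypothetical; the field names follow the description but may not match the CSV headers exactly.

```python
from collections import defaultdict

# Invented records; field names follow the description above.
records = [
    {"Department": "Cardiology", "Treatment Cost": 12000.0, "Recovery Score": 80},
    {"Department": "Cardiology", "Treatment Cost": 10000.0, "Recovery Score": 70},
    {"Department": "Neurology", "Treatment Cost": 9000.0, "Recovery Score": 65},
]


def mean_by(records, group_field, value_field):
    """Average `value_field` within each distinct value of `group_field`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for rec in records:
        entry = totals[rec[group_field]]
        entry[0] += rec[value_field]
        entry[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


cost_by_department = mean_by(records, "Department", "Treatment Cost")
score_by_department = mean_by(records, "Department", "Recovery Score")
```

The same helper works for the other suggested breakdowns, for example grouping by Gender or Doctor Name.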
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pneumonia, a severe lung infection caused by various viruses, presents significant challenges in diagnosis and treatment due to its similarities with other respiratory conditions. Additionally, the need to protect patient privacy complicates the sharing of sensitive clinical data. This study introduces FLPneXAINet, an effective framework that combines federated learning (FL) with deep learning (DL) and explainable AI (XAI) to securely and accurately predict pneumonia using chest X-ray (CXR) images. We utilized a benchmark dataset from Kaggle, comprising 8,402 CXR images (3,904 normal and 4,498 pneumonia). The dataset was preprocessed and augmented using a cycle-consistent generative adversarial (CycleGAN) network to increase the volume of training data. Three pre-trained DL models named VGG16, NASNetMobile, and MobileNet were employed to extract features from the augmented dataset. Further, four ensemble DL (EDL) models were used to enhance feature extraction. Feature optimization was performed using recursive feature elimination (RFE), analysis of variance (ANOVA), and random forest (RF) to select the most relevant features. These optimized features were then inputted into machine learning (ML) models, including K-nearest neighbor (KNN), naive bayes (NB), support vector machine (SVM), and RF, for pneumonia prediction. The performance of the models was evaluated in a FL environment, with the EDL network achieving the best results: accuracy 97.61%, F1 score 98.36%, recall 98.13%, and precision 98.59%. The framework’s predictions were further validated using two XAI techniques—Local Interpretable Model-Agnostic Explanations (LIME) and Grad-CAM. FLPneXAINet offers a robust solution for healthcare professionals to accurately diagnose pneumonia, ensuring timely treatment while safeguarding patient privacy.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The thickness and appearance of retinal layers are essential markers for diagnosing and studying eye diseases. Despite the increasing availability of imaging devices to scan and store large amounts of data, analyzing retinal images and generating trial endpoints has remained a manual, error-prone, and time-consuming task. In particular, the lack of large amounts of high-quality labels for different diseases hinders the development of automated algorithms. Therefore, we have compiled 5016 pixel-wise manual labels for 1672 optical coherence tomography (OCT) scans featuring two different diseases as well as healthy subjects to help democratize the process of developing novel automatic techniques. We also collected 4698 bounding box annotations for a subset of 566 scans across 9 classes of disease biomarker. Due to variations in retinal morphology, intensity range, and changes in contrast and brightness, designing segmentation and detection methods that can generalize to different disease types is challenging. While machine learning-based methods can overcome these challenges, high-quality expert annotations are necessary for training. Publicly available annotated image datasets typically contain few images and/or only cover a single type of disease, and most are only annotated by a single grader. To address this gap, we present a comprehensive multi-grader and multi-disease dataset for training machine learning-based algorithms. The proposed dataset covers three subsets of scans (Age-related Macular Degeneration, Diabetic Macular Edema, and healthy) and annotations for two types of tasks (semantic segmentation and object detection).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📝 Dataset Overview: This enhanced dataset captures the real-world operational and clinical performance data from a major hospital in Nigeria — Lagos University Teaching Hospital (LUTH). It includes detailed information on admissions, patient care, medical services, billing, and staff activities, ideal for healthcare analytics, hospital management dashboards, and machine learning projects.
🔍 Dataset Features (Suggested Columns):
Patient_ID: Unique anonymized patient ID
Admission_Date: Date of admission
Discharge_Date: Date of discharge
Gender: Patient’s gender
Age: Patient’s age
Department: Medical department involved
Diagnosis: Primary diagnosis
Doctor: Attending physician (anonymized)
Treatment_Provided: Type of treatment/procedure
Lab_Tests: Count of lab tests conducted
Medications_Administered: Total medications given
Surgery_Cost (₦): If applicable, cost of surgery
Bill_Amount (₦): Total bill charged to patient
Ward: Hospital ward assigned
Length_of_Stay (days): Duration of hospitalization
🎯 Use Cases:
Build hospital operations dashboards in Power BI
Analyze billing and cost patterns across departments
Predict length of stay or discharge outcomes
Explore departmental workload and performance
Use as a base for AI in hospital management systems
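For the length-of-stay use case, the stay can be recomputed from Admission_Date and Discharge_Date and checked against the Length_of_Stay column. The sketch below assumes the dates are ISO-formatted strings (YYYY-MM-DD), which the description does not confirm, and the function name `length_of_stay` is hypothetical.

```python
from datetime import date


def length_of_stay(admission: str, discharge: str) -> int:
    """Whole days between two ISO dates (YYYY-MM-DD is an assumed format)."""
    return (date.fromisoformat(discharge) - date.fromisoformat(admission)).days


# Invented example stay, not a row from the dataset.
stay_days = length_of_stay("2024-03-01", "2024-03-05")
print(f"Length of stay: {stay_days} days")
```

If the file stores dates in another format, `datetime.strptime` with the matching format string would replace `date.fromisoformat`.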
🏥 Clinical & Operational Value: This dataset empowers analysts and healthcare professionals to:
Track patient outcomes and billing efficiency
Reduce operational bottlenecks
Improve patient care with data-driven recommendations
Benchmark departmental performance
Train predictive models for resource allocation
👤 Created By: Fatolu Peter (Emperor Analytics) Dedicated to transforming public healthcare using analytics and real-world data across Nigerian hospitals. This is Project 14 in my growing health-tech analytics journey.
✅ LinkedIn Post: 🚑 New Kaggle Dataset: LUTH Hospital Enhanced Clinical & Operations Data 📊 Real hospital data on admissions, billing, treatments, and care metrics 🔗 Access the dataset now on Kaggle
This dataset gives you: ✅ Real hospital operations data ✅ Billing and medication insights ✅ Doctor and ward-level activity ✅ A perfect base for building Power BI dashboards or training ML models
Whether you're a data scientist, health analyst, or Power BI pro — this is real-world data to make real impact. Let’s build something powerful together. 💡
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive overview of the physical, psychological and cognitive health of a cohort of older adults. It contains data collected from medical experts during clinical assessments such as physical activity, nutrition, activity limitations, balance, depression and cognition. Additionally it includes parameters extracted from used devices such as average heart rate per day and average gait speed. Carefully coupled with this is detailed information relating to falls, fractures and loss of orientation within the group studied which can add even further insight into the overall trends in health for those aged 55 and above.
The dataset includes various scores capturing different aspects, alongside statistics that better represent participants' lifestyles. It features not only basic metrics like gender or age but also complex measures like exhaustion or grip strength for each individual in the cohort. Furthermore, an analytical exploration of nutrition measures (e.g., Body Mass Index), social interaction (e.g., phone calls) or leisure activities (clubs) could help uncover correlations among them, which may suggest strategies for improving well-being among older population groups.
This dataset provides a comprehensive overview of the physical, psychological and cognitive health of a cohort of older adults. It includes parameters related to physical activity, nutrition, activity limitations, balance, depression, cognition and more. Through this dataset you can gain insights into the various factors affecting the health of older adults in your population, which could help researchers or practitioners develop interventions to promote elderly health.
Before using this dataset it is advised to get familiar with the variables and fields provided. There are two sections within each variable: descriptive information such as gender and age group; and scores related to various aspects such as heart rate per day or average gait speed per month. You may also find additional coupled events like falls or fractures that can impact the assessment scores over time.
Once you have gone through all variables available in the dataset you may use simple statistical methods like measuring mean values of several key indicators (such as balance score or bmi score) across different characteristics (such as age group). Comparing these values allows researchers to identify trends amongst different groups within a population that would show differences on an individual level.
Other techniques that could be used include clustering, to observe patterns across several indicators at once; logistic regression, to identify which predictors best explain certain outcomes among older adults; and propensity-matching-based approaches, which suggest what kind of intervention to offer depending on each person’s characteristics, based on accumulated data from research on older adults using this dataset. The usefulness of this dataset is not limited to statistics: it might also benefit from theoretical forms such as narrative geometry used for subjective analysis by placing story-telling elements along with formative assessments onto conceptual frameworks between inside natural ecosystems already running smoothly (between concepts) before disruption/disequilibrium happens due to external stressors, ecomorphonologically speaking. This will eventually help clinicians addressing psychological conditions verifying objective status via outcomes from metrics established earlier preferably prior experiments where involuntary independent behavior was detected influencing vital organ systems at homeostasis levels either causing positive adaptations/fitness, or increasing vulnerability that when added up together shift towards severe distress turn proximally considering also other segments elsewhere varying across multiple networks simultaneous injections cumulated/integrated effects starting sometimes after take off periods way before ill health seems obviously concrete therefore important details concerning risk factors sometimes overlooked got noticed while capturing evidence based prospective by cross validated means completed longitudinal surveys taking advantage into being able understanding potentially confounding conditions sparedly manifested either forgotten beca...
This dataset was created by TUSHAR SARKAR
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Overview
The Human Vital Signs Dataset is a comprehensive collection of key physiological parameters recorded from patients. This dataset is designed to support research in medical diagnostics, patient monitoring, and predictive analytics. It includes both original attributes and derived features to provide a holistic view of patient health.

Attributes

Patient ID
Description: A unique identifier assigned to each patient.
Type: Integer
Example: 1, 2, 3, ...

Heart Rate
Description: The number of heartbeats per minute.
Type: Integer
Range: 60-100 bpm (for this dataset)
Example: 72, 85, 90

Respiratory Rate
Description: The number of breaths taken per minute.
Type: Integer
Range: 12-20 breaths per minute (for this dataset)
Example: 16, 18, 15

Timestamp
Description: The exact time at which the vital signs were recorded.
Type: Datetime
Format: YYYY-MM-DD HH:MM
Example: 2023-07-19 10:15:30

Body Temperature
Description: The body temperature measured in degrees Celsius.
Type: Float
Range: 36.0-37.5°C (for this dataset)
Example: 36.7, 37.0, 36.5

Oxygen Saturation
Description: The percentage of oxygen-bound hemoglobin in the blood.
Type: Float
Range: 95-100% (for this dataset)
Example: 98.5, 97.2, 99.1

Systolic Blood Pressure
Description: The pressure in the arteries when the heart beats (systolic pressure).
Type: Integer
Range: 110-140 mmHg (for this dataset)
Example: 120, 130, 115

Diastolic Blood Pressure
Description: The pressure in the arteries when the heart rests between beats (diastolic pressure).
Type: Integer
Range: 70-90 mmHg (for this dataset)
Example: 80, 75, 85

Age
Description: The age of the patient.
Type: Integer
Range: 18-90 years (for this dataset)
Example: 25, 45, 60

Gender
Description: The gender of the patient.
Type: Categorical
Categories: Male, Female
Example: Male, Female

Weight (kg)
Description: The weight of the patient in kilograms.
Type: Float
Range: 50-100 kg (for this dataset)
Example: 70.5, 80.3, 65.2

Height (m)
Description: The height of the patient in meters.
Type: Float
Range: 1.5-2.0 m (for this dataset)
Example: 1.75, 1.68, 1.82

Derived Features

Derived_HRV (Heart Rate Variability)
Description: A measure of the variation in time between heartbeats.
Type: Float
Formula: $\mathrm{HRV} = \dfrac{\text{Standard Deviation of Heart Rate over a Period}}{\text{Mean Heart Rate over the Same Period}}$
Example: 0.10, 0.12, 0.08

Derived_Pulse_Pressure (Pulse Pressure)
Description: The difference between systolic and diastolic blood pressure.
Type: Integer
Formula: $\mathrm{PP} = \text{Systolic Blood Pressure} - \text{Diastolic Blood Pressure}$
Example: 40, 45, 30

Derived_BMI (Body Mass Index)
Description: A measure of body fat based on weight and height.
Type: Float
Formula: $\mathrm{BMI} = \dfrac{\text{Weight (kg)}}{(\text{Height (m)})^2}$
Example: 22.8, 25.4, 20.3

Derived_MAP (Mean Arterial Pressure)
Description: An average blood pressure in an individual during a single cardiac cycle.
Type: Float
Formula: $\mathrm{MAP} = \text{Diastolic Blood Pressure} + \dfrac{1}{3}\left(\text{Systolic Blood Pressure} - \text{Diastolic Blood Pressure}\right)$
Example: 93.3, 100.0, 88.7

Target Feature

Risk Category
Description: Classification of patients into "High Risk" or "Low Risk" based on their vital signs.
Type: Categorical
Categories: High Risk, Low Risk
Criteria:
High Risk: Any of the following conditions
Heart Rate: > 90 bpm or < 60 bpm
Respiratory Rate: > 20 breaths per minute or < 12 breaths per minute
Body Temperature: > 37.5°C or < 36.0°C
Oxygen Saturation: < 95%
Systolic Blood Pressure: > 140 mmHg or < 110 mmHg
Diastolic Blood Pressure: > 90 mmHg or < 70 mmHg
BMI: > 30 or < 18.5
Low Risk: None of the above conditions
Example: High Risk, Low Risk

This dataset, with a total of 200,000 samples, provides a robust foundation for various machine learning and statistical analysis tasks aimed at understanding and predicting patient health outcomes based on vital signs. The inclusion of both original attributes and derived features enhances the richness and utility of the dataset.
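The derived features and the risk rule described above are simple enough to recompute, which is a useful sanity check on the provided columns. The sketch below follows the formulas and thresholds stated in the description; the function and parameter names are my own, and HRV is left out because it needs a series of heart-rate readings rather than a single row.

```python
def pulse_pressure(sbp, dbp):
    """Pulse pressure: systolic minus diastolic blood pressure (mmHg)."""
    return sbp - dbp


def body_mass_index(weight_kg, height_m):
    """BMI: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def mean_arterial_pressure(sbp, dbp):
    """MAP: diastolic pressure plus one third of the pulse pressure (mmHg)."""
    return dbp + (sbp - dbp) / 3


def risk_category(hr, rr, temp_c, spo2, sbp, dbp, bmi):
    """Return "High Risk" if any stated threshold is violated, else "Low Risk"."""
    high = (
        hr > 90 or hr < 60
        or rr > 20 or rr < 12
        or temp_c > 37.5 or temp_c < 36.0
        or spo2 < 95
        or sbp > 140 or sbp < 110
        or dbp > 90 or dbp < 70
        or bmi > 30 or bmi < 18.5
    )
    return "High Risk" if high else "Low Risk"


# Illustrative patient, not a row from the dataset.
bmi = body_mass_index(70.0, 1.75)
label = risk_category(hr=72, rr=16, temp_c=36.7, spo2=98.5, sbp=120, dbp=80, bmi=bmi)
print(pulse_pressure(120, 80), round(mean_arterial_pressure(120, 80), 1), round(bmi, 1), label)
```

Because the label is a deterministic function of the inputs, a model trained on this target can in principle learn the rule exactly, so very high accuracy should not be read as clinical predictive power.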
The dataset is taken from the communitybenefitinsight website, which has data on around 3,500 hospitals in the USA, including the hospital name, address, city, state, zipcode, bed_count, and many more features. This data is for practice and educational purposes.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Inspired by ImageNet, the database from Stanford's leading AI research group that is beloved in computer vision, CellNet is a meticulously curated dataset featuring over 120,000 high-quality medical images representing over 20 organ/cancer classes. The idea started in June 2024 during research for the paper "Enhancement Framework for Vision Transformers in Data Limited Cancer Detection" (presented at the IEEE 2024 MIT Undergraduate Research Technology Conference (URTC)). The study's future-work discussion proposed that, given enough data and computation, pre-trained CellNet weights transferred to other models for various medical imaging tasks could outperform models pre-trained on ImageNet. More importantly, CellNet relies on the study's theory that models pre-trained on large datasets specifically tailored to their task's field will perform better than generally pre-trained models. In CellNet's case, pre-trained weights derived from medical tasks can potentially provide a more sound and ethical base for medical AI models in future applications and research on cancer diagnosis and related fields.
The data for CellNet has been aggregated from numerous reputable online repositories and medical labs, ensuring a diverse and comprehensive collection of histopathology images. In collaboration with a team of physicians, the lab-collected images included in the set have been meticulously verified and annotated to maintain high standards of accuracy and reliability.
One of the key features of CellNet is its dynamic and evolving nature. CellNet is committed to updating monthly, integrating new data and reorganizing existing entries based on the latest research and clinical findings. This continuous update cycle ensures that CellNet remains a relevant and up-to-date resource for the clinical and medical AI research communities.
CellNet is designed to facilitate various machine learning model applications, including but not limited to:
However, it's important to note that as this dataset grows, weights obtained from training on this set are primarily meant to be imported to other models for transfer learning on smaller datasets.
The CellNet team encourages the Kaggle community to engage with CellNet by exploring the dataset, creating notebooks, and providing feedback. Your contributions and insights are invaluable in helping CellNet improve and expand its resource and weights.
This dataset is licensed under the MIT License. Please cite the dataset if you use it in your research.
I would like to thank all the contributors and the medical professionals who helped produce the images curated in this dataset.
For any questions or support, please contact [johncapocyan@gmail.com].
Dive into the dataset today and join the mission to revolutionize medical research through the power of data and AI!
This is a list of UK hospitals, with details about 299 hospitals. The dataset contains no null values. It has 7 features in total.