37 datasets found

‘DISEASE PREDICTION USING MACHINE LEARNING WITH GUI’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘DISEASE PREDICTION USING MACHINE LEARNING WITH GUI’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-disease-prediction-using-machine-learning-with-gui-5ad4/latest
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘DISEASE PREDICTION USING MACHINE LEARNING WITH GUI’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/neelima98/disease-prediction-using-machine-learning on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Due to big data progress in biomedical and healthcare communities, accurate study of medical data benefits early disease recognition, patient care and community services. When medical data is incomplete, the accuracy of the study is reduced. Moreover, different regions exhibit unique characteristics of certain regional diseases, which may weaken the prediction of disease outbreaks. This project proposes machine learning algorithms (Decision Tree, Naive Bayes, and Random Forest) using structured and unstructured data from hospitals. It also uses a machine learning algorithm for partitioning the data. To the best of our knowledge, no current work focuses on both data types in the area of medical big data analytics. Compared to several typical prediction algorithms, the prediction accuracy of the proposed algorithm reaches 94.8%, with a speed that is faster than that of the unimodal disease risk prediction algorithm, and it produces a report.

--- Original source retains full ownership of the source dataset ---
Disease Prediction Using Machine Learning
dataandsons.com
csv, zip
Updated Oct 31, 2022
Cite
test test (2022). Disease Prediction Using Machine Learning [Dataset]. https://www.dataandsons.com/categories/machine-learning/disease-prediction-using-machine-learning
Explore at:
csv, zip (available download formats)
Dataset updated
Oct 31, 2022
Dataset provided by
Authors
test test
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
About this Dataset

This dataset will help you put your existing knowledge to good use. It has 132 parameters from which 42 different types of diseases can be predicted. It consists of 2 CSV files: one for training and the other for testing your model. Each CSV file has 133 columns: 132 of these columns are symptoms that a person experiences and the last column is the prognosis. These symptoms are mapped to 42 diseases, so you can classify sets of symptoms into diseases. You are required to train your model on the training data and test it on the testing data.
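As a rough illustration of the column layout described above (symptom columns followed by a final prognosis column), the following sketch separates features from labels using only the Python standard library. The three symptom names and the sample rows are placeholders, not taken from the actual files, and the 0/1 encoding of symptoms is an assumption.

```python
import csv
import io

# Tiny stand-in for the training CSV. The real files are described as having
# 132 symptom columns; the three names and all rows here are illustrative.
SAMPLE_CSV = """itching,skin_rash,high_fever,prognosis
1,1,0,Fungal infection
0,0,1,Malaria
1,0,0,Fungal infection
"""

def load_symptom_table(text):
    """Return (symptom_names, feature_rows, labels) from CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    symptom_names = header[:-1]  # every column except the last one
    features, labels = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        features.append([int(value) for value in row[:-1]])
        labels.append(row[-1])  # last column is the prognosis
    return symptom_names, features, labels

names, X, y = load_symptom_table(SAMPLE_CSV)
print(len(names), "symptom columns,", len(set(y)), "distinct prognoses")
```

The same function could be applied to the training and testing files in turn, so that both are parsed with an identical column convention.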

Category

Machine Learning

Keywords

medicine,disease,Healthcare,ML,Machine Learning

Row Count

4962

Price

$109.00
Data from: Disease Prediction Dataset
ieee-dataport.org
Updated Feb 20, 2025
Cite
Ayush Nautiyal (2025). Disease Prediction Dataset [Dataset]. https://ieee-dataport.org/documents/disease-prediction-dataset
Dataset updated
Feb 20, 2025
Authors
Ayush Nautiyal
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains symptom and disease information. It contains a total of 1325 symptoms covering 391 diseases. The dataset is referenced from the MedlinePlus website. It includes training and testing sets and can be used to train a disease prediction algorithm. It was created independently for a disease prediction project and does not involve any funding or promotional terms.
👨‍🦯 Parkinson's Disease Detection Dataset 👨‍⚕️
kaggle.com
Updated Jul 10, 2023
Cite
Kancharla Naveen Kumar (2023). 👨‍🦯 Parkinson's Disease Detection Dataset 👨‍⚕️ [Dataset]. https://www.kaggle.com/datasets/naveenkumar20bps1137/parkinsons-disease-detection
Dataset updated
Jul 10, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kancharla Naveen Kumar
Description
Parkinson's data set

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.

Citation:

Little,Max. (2008). Parkinsons. UCI Machine Learning Repository. https://doi.org/10.24432/C59C74.

Matrix column entries (attributes):

name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency

Five measures of variation in frequency:
MDVP:Jitter(%) - Percentage of cycle-to-cycle variability of the period duration
MDVP:Jitter(Abs) - Absolute value of cycle-to-cycle variability of the period duration
MDVP:RAP - Relative measure of the pitch disturbance
MDVP:PPQ - Pitch perturbation quotient
Jitter:DDP - Average absolute difference of differences between jitter cycles

Six measures of variation in amplitude:
MDVP:Shimmer - Variations in the voice amplitude
MDVP:Shimmer(dB) - Variations in the voice amplitude in dB
Shimmer:APQ3 - Three point amplitude perturbation quotient measured against the average of the three amplitude
Shimmer:APQ5 - Five point amplitude perturbation quotient measured against the average of the three amplitude
MDVP:APQ - Amplitude perturbation quotient from MDVP
Shimmer:DDA - Average absolute difference between the amplitudes of consecutive periods

Two measures of ratio of noise to tonal components in the voice:
NHR - Noise-to-harmonics Ratio
HNR - Harmonics-to-noise Ratio

status - Health status of the subject: (one) - Parkinson's, (zero) - healthy

Two nonlinear dynamical complexity measures:
RPDE - Recurrence period density entropy
D2 - Correlation dimension

DFA - Signal fractal scaling exponent

Three nonlinear measures of fundamental frequency variation:
spread1 - Discrete probability distribution of occurrence of relative semitone variations
spread2 - Three nonlinear measures of fundamental frequency variation
PPE - Entropy of the discrete probability distribution of occurrence of relative semitone variations
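Because each person contributes around six recordings, a random row-level split could place recordings of the same person in both training and test sets. The sketch below shows one possible subject-level split. It assumes, hypothetically, that the "name" value ends in an underscore followed by the recording number; the sample rows are synthetic and the helper names are illustrative.

```python
import random
from collections import defaultdict

def subject_id(name):
    # Assumption: the recording number is the final "_"-separated token,
    # e.g. "subjA_3" -> "subjA". Adjust to the actual naming scheme.
    return name.rsplit("_", 1)[0]

def split_by_subject(rows, test_fraction=0.25, seed=0):
    """Split rows so that no subject appears in both train and test."""
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[subject_id(row["name"])].append(row)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)  # reproducible shuffle
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects, train_subjects = subjects[:n_test], subjects[n_test:]
    train = [r for s in train_subjects for r in by_subject[s]]
    test = [r for s in test_subjects for r in by_subject[s]]
    return train, test

# Synthetic stand-in: 8 subjects with 6 recordings each.
rows = [{"name": f"subj{s}_{k}", "status": s % 2}
        for s in range(8) for k in range(1, 7)]
train, test = split_by_subject(rows)
print(len(train), "training rows,", len(test), "test rows")
```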
Heart_Dataset
kaggle.com
Updated Mar 29, 2021
Cite
Reddy_Nitin (2021). Heart_Dataset [Dataset]. https://www.kaggle.com/reddynitin/heart-dataset/metadata
Dataset updated
Mar 29, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Reddy_Nitin
Description
This notebook will introduce some foundation machine learning and data science concepts by exploring the problem of heart disease classification.

The original data came from the Cleveland database from UCI Machine Learning Repository.

The original database contains 76 attributes, but here only 14 attributes will be used. Attributes (also called features) are the variables that we'll use to predict our target
Cardiovascular Disease Dataset
ieee-dataport.org
Updated Oct 25, 2022
Cite
Rajib Kumar Halder Halder (2022). Cardiovascular Disease Dataset [Dataset]. https://ieee-dataport.org/documents/cardiovascular-disease-dataset
Dataset updated
Oct 25, 2022
Authors
Rajib Kumar Halder Halder
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This heart disease dataset is curated by combining 3 popular heart disease datasets. The first dataset (collected from Kaggle) contains 70000 records with 11 independent features, which makes it the largest heart disease dataset available so far for research purposes. These data were collected at the moment of medical examination and from information given by the patient. The second and third datasets contain 303 and 293 instances respectively, with 13 common features. The three datasets used for its curation are: Cardio Data (Kaggle Dataset)
Kidney Disease Dataset
kaggle.com
Updated Apr 14, 2025
Cite
amanik (2025). Kidney Disease Dataset [Dataset]. https://www.kaggle.com/datasets/amanik000/kidney-disease-dataset
Dataset updated
Apr 14, 2025
Dataset provided by
Kaggle
Authors
amanik
Description
The Kidney Disease Dataset is a rich collection of clinical and laboratory data from patients, curated to support the analysis, diagnosis, and prediction of chronic kidney disease (CKD). It includes 43 diverse features encompassing demographic details, vital signs, urine and blood test results, medical history, lifestyle factors, and biomarkers such as eGFR, serum creatinine, and Cystatin C. This dataset is ideal for building machine learning models, conducting statistical analysis, and exploring correlations between health indicators and kidney function. It provides a valuable resource for researchers and healthcare professionals working on early detection and management of kidney-related disorders. This dataset consists of detailed clinical information related to kidney health, intended for machine learning applications, statistical analysis, and healthcare research.
Heart Disease Prediction
kaggle.com
Updated Aug 10, 2024
Cite
Falah Gatea (2024). Heart Disease Prediction [Dataset]. https://www.kaggle.com/datasets/falahgatea/heart-disease-prediction
Dataset updated
Aug 10, 2024
Dataset provided by
Kaggle
Authors
Falah Gatea
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
About Dataset

Context: The leading cause of death in the developed world is heart disease. Therefore, work needs to be done to help prevent the risk of having a heart attack or stroke.

Content: Use this dataset to predict which patients are most likely to suffer from a heart disease in the near future using the features given.

Acknowledgement: This data comes from the University of California Irvine's Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Heart+Disease.
Medical Data for Disease Prediction
kaggle.com
Updated Feb 28, 2025
Cite
Mohamed Elshoraky (2025). Medical Data for Disease Prediction [Dataset]. https://www.kaggle.com/datasets/mhmdelshoraky/medical-data-for-disease-prediction/suggestions
Dataset updated
Feb 28, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mohamed Elshoraky
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview

This dataset is a synthetic collection of medical attributes designed for educational and research purposes. It provides structured health-related data, including patient demographics, vital signs, and electrocardiogram (ECG) measurements, along with a predicted disease classification.

The dataset is intended to support machine learning practitioners and students in developing classification models for disease prediction. It allows users to explore patterns in health-related data and apply machine learning techniques in a controlled, educational setting.

Dataset Details

Total Records: 695,551 entries
Target Variable: Predicted_Disease (Categorical: ‘Arrhythmia’, ‘Heart Failure’, ‘Coronary Artery Disease’, ‘Good’)

Features:
- Age
- Gender
- Weight
- Height
- Heart_Rate
- Oxygen_Saturation
- Temperature
- ECG_QT_Interval
- ECG_ST_Segment
- Predicted_Disease

This dataset was generated by a script with predefined parameter ranges and is not derived from real-world medical data. It should not be considered reliable for medical or clinical decision-making.
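To make "generated by a script with predefined parameter ranges" concrete, here is a minimal sketch of how such records could be produced. The source does not publish its generator, so the ranges, the gender values, and the uniformly random label assignment below are all assumptions chosen purely for illustration.

```python
import random

# Hypothetical parameter ranges; the ones actually used are not published.
RANGES = {
    "Age": (18, 90),
    "Heart_Rate": (40, 180),
    "Oxygen_Saturation": (85, 100),
}
# Class names as listed for the Predicted_Disease target variable.
LABELS = ["Arrhythmia", "Heart Failure", "Coronary Artery Disease", "Good"]

def make_record(rng):
    """Draw one synthetic record with each value inside its range."""
    record = {name: rng.randint(low, high) for name, (low, high) in RANGES.items()}
    record["Gender"] = rng.choice(["Male", "Female"])  # assumed encoding
    # Assumption: the label is drawn at random, independent of the features.
    record["Predicted_Disease"] = rng.choice(LABELS)
    return record

rng = random.Random(42)  # fixed seed for reproducibility
records = [make_record(rng) for _ in range(5)]
for record in records:
    print(record)
```

Data produced this way carries no clinical signal, which is consistent with the warning that the dataset is for educational use only.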

It is intended for educational purposes only and should not be used in real-world healthcare applications. The accuracy of the generated values is not guaranteed.

I'm not responsible for any incorrect use, misinterpretation, or unintended consequences of this dataset.
Breast Cancer Prediction Dataset
kaggle.com
Updated Sep 26, 2018
Cite
Merishna Singh Suwal (2018). Breast Cancer Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/merishnasuwal/breast-cancer-prediction-dataset
Dataset updated
Sep 26, 2018
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Merishna Singh Suwal
Description
Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.

This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.
Heart Disease Dataset (Comprehensive)
ieee-dataport.org
Updated Oct 24, 2019
Cite
MANU SIDDHARTHA (2019). Heart Disease Dataset (Comprehensive) [Dataset]. https://ieee-dataport.org/open-access/heart-disease-dataset-comprehensive
Dataset updated
Oct 24, 2019
Authors
MANU SIDDHARTHA
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This heart disease dataset is curated by combining 5 popular heart disease datasets already available independently but not combined before. In this dataset
Cardiovascular_Disease_Dataset
data.mendeley.com
Updated Apr 16, 2021
Cite
Bhanu Prakash Doppala (2021). Cardiovascular_Disease_Dataset [Dataset]. http://doi.org/10.17632/dzz48mvjht.1
Explore at:
Unique identifier
https://doi.org/10.17632/dzz48mvjht.1
Dataset updated
Apr 16, 2021
Authors
Bhanu Prakash Doppala
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This heart disease dataset was acquired from one of the multispecialty hospitals in India. It has over 14 common features, which makes it one of the heart disease datasets available so far for research purposes. This dataset consists of 1000 subjects with 12 features. It will be useful for building early-stage heart disease detection as well as for generating predictive machine learning models.
‘Heart Failure Prediction’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Heart Failure Prediction’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-heart-failure-prediction-c926/1b358936/?iid=010-637&v=presentation
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Heart Failure Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/andrewmvd/heart-failure-clinical-data on 28 January 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

How to use this dataset

Create a model for predicting mortality caused by Heart Failure.

Acknowledgements

If you use this dataset in your research, please credit the authors

Citation

Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020).

License

CC BY 4.0

Splash icon

Icon by Freepik, available on Flaticon.

Splash banner

Wallpaper by jcomp, available on Freepik.

--- Original source retains full ownership of the source dataset ---
Health Care Analytics
kaggle.com
Updated Jan 10, 2022
Cite
Abishek Sudarshan (2022). Health Care Analytics [Dataset]. https://www.kaggle.com/datasets/abisheksudarshan/health-care-analytics
Dataset updated
Jan 10, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Abishek Sudarshan
Description
Context

Part of Janatahack Hackathon in Analytics Vidhya

Content

The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.

MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or to increase awareness by visiting various stalls (depending on the format of the camp).

MedCamp has conducted 65 such events over a period of 4 years, and they see a high drop-off between “Registration” and the number of people taking tests at the camps. In the last 4 years, they have stored data on ~110,000 registrations.

One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less inventory than required for conducting these medical checks, people end up having a bad experience.

The Process:

MedCamp employees / volunteers reach out to people and drive registrations. During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.

Other things to note:

Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people. For a few camps, there was a hardware failure, so some information about the date and time of registration is lost. MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

Favorable outcome:

For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall. You need to predict the chances (probability) of having a favourable outcome.
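The outcome definition above can be expressed as a small labelling function. This is only a sketch: the parameter names `camp_format`, `health_score`, and `stalls_visited` are hypothetical and may not match the column names in the actual files.

```python
def favourable_outcome(camp_format, health_score=None, stalls_visited=0):
    """Return True when a registration counts as a favourable outcome.

    Formats 1 and 2: the attendee received a health score.
    Format 3: the attendee visited at least one stall.
    """
    if camp_format in (1, 2):
        return health_score is not None
    if camp_format == 3:
        return stalls_visited >= 1
    raise ValueError(f"unknown camp format: {camp_format!r}")

print(favourable_outcome(1, health_score=0.7))   # scored attendee, format 1
print(favourable_outcome(3, stalls_visited=0))   # no stall visited, format 3
```

A model would then be trained to predict the probability of this binary label for each registration.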

Train / Test split:

Camps started on or before 31st March 2006 are considered in Train. Test data is for all camps conducted on or after 1st April 2006.
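The date rule above amounts to a time-based split rather than a random one. A minimal sketch, assuming each camp has a known start date (the function and constant names are illustrative):

```python
from datetime import date

# Last start date that still belongs to the training period.
TRAIN_CUTOFF = date(2006, 3, 31)

def assign_split(camp_start):
    """Assign a camp to 'train' or 'test' based on its start date."""
    return "train" if camp_start <= TRAIN_CUTOFF else "test"

camp_starts = [date(2005, 11, 2), date(2006, 3, 31), date(2006, 4, 1)]
print([assign_split(d) for d in camp_starts])
```

Splitting by time keeps later camps out of training, which mirrors how the model would be used on future camps.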

Acknowledgements

Credits to AV

Inspiration

To share with the data science community to jump start their journey in Healthcare Analytics
‘Dementia Prediction Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Aug 14, 2021
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Dementia Prediction Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-dementia-prediction-dataset-8ab0/latest
Dataset updated
Aug 14, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Dementia Prediction Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shashwatwork/dementia-prediction-dataset on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Context

Dementia is a syndrome – usually of a chronic or progressive nature – in which there is deterioration in cognitive function (i.e. the ability to process thought) beyond what might be expected from normal aging. It affects memory, thinking, orientation, comprehension, calculation, learning capacity, language, and judgment. Consciousness is not affected. The impairment in cognitive function is commonly accompanied, and occasionally preceded, by deterioration in emotional control, social behaviour, or motivation.

Dementia results from a variety of diseases and injuries that primarily or secondarily affect the brain, such as Alzheimer's disease or stroke.

Dementia is one of the major causes of disability and dependency among older people worldwide. It can be overwhelming, not only for the people who have it, but also for their carers and families. There is often a lack of awareness and understanding of dementia, resulting in stigmatization and barriers to diagnosis and care. The impact of dementia on carers, family, and society at large can be physical, psychological, social, and economic.

Content

This set consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit

Acknowledgements

Battineni, Gopi; Amenta, Francesco; Chintalapudi, Nalini (2019), “Data for: MACHINE LEARNING IN MEDICINE: CLASSIFICATION AND PREDICTION OF DEMENTIA BY SUPPORT VECTOR MACHINES (SVM)”, Mendeley Data, V1, doi: 10.17632/tsy6rbc5d4.1 * Dataset is available here.

--- Original source retains full ownership of the source dataset ---
Medical Diagnosis Prediction Dataset
opendatabay.com
Updated Jul 7, 2025
Cite
The citation is currently not available for this dataset.
Dataset updated
Jul 7, 2025
Dataset authored and provided by
Datasimple
Area covered
Public Safety & Security
Description
This dataset is designed for preliminary diagnosis prediction, supporting patient flow logistics and the second opinion concept during patient interactions through dialogue systems. It is part of a project initiated at ITMO University in 2022. The dataset maps symptoms to diseases, offering a valuable resource for developing AI and LLM-based diagnostic tools. It comprises two main columns, detailing symptoms and their corresponding diagnoses, with 132 unique symptoms and 40 unique diagnoses identified.

Columns

симптомы (symptoms as list): Contains information regarding various patient symptoms, often provided as a list.

диагноз (disease name): Specifies the corresponding disease name or diagnosis associated with the listed symptoms.

Distribution

The dataset is typically provided in a CSV format. It structures information across two columns: symptoms and disease names. While the exact total number of rows or records is not specified, the dataset includes 132 unique symptoms and 40 unique diagnoses. This is a Version 1.0 dataset.
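A symptom list per record is usually converted into a fixed-length vector before training a classifier. The sketch below shows one common approach, multi-hot encoding. It assumes the symptom lists have already been parsed into Python lists; the symptom and diagnosis strings are invented placeholders, since the source does not show sample rows.

```python
def build_vocabulary(symptom_lists):
    """Collect every distinct symptom, in sorted order for stable indexing."""
    return sorted({symptom for symptoms in symptom_lists for symptom in symptoms})

def multi_hot(symptoms, vocabulary):
    """Encode one symptom list as a 0/1 vector over the vocabulary."""
    present = set(symptoms)
    return [1 if symptom in present else 0 for symptom in vocabulary]

# Placeholder records: (symptom list, diagnosis). Not taken from the dataset.
records = [
    (["fever", "cough"], "flu"),
    (["fever", "rash"], "measles"),
]
vocab = build_vocabulary(symptoms for symptoms, _ in records)
vectors = [multi_hot(symptoms, vocab) for symptoms, _ in records]
print(vocab)
print(vectors)
```

With the full dataset, the vocabulary would be expected to contain the 132 unique symptoms mentioned above, giving 132-dimensional input vectors.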

Usage

This dataset is ideally suited for:
* Developing and training preliminary diagnosis prediction models.
* Enhancing patient flow logistics in healthcare settings.
* Supporting second opinion concepts through automated systems.
* Building and refining dialogue systems for patient interactions.
* Training AI and machine learning models for symptom-disease mapping.

Coverage

The dataset's scope is global, indicating its potential applicability across different regions. The project that developed these datasets has been active since 2022, suggesting the data reflects contemporary medical terminology and contexts. The dataset was listed on 26/06/2025.

License

CC-BY-NC

Who Can Use It

AI/LLM developers: For training and fine-tuning models in medical diagnostics and conversational AI.

Medical researchers: To analyse symptom-disease correlations and develop predictive tools.

Healthcare technology developers: For creating applications that assist with patient intake, preliminary diagnoses, and medical information systems.

Academic institutions: For educational and research purposes in health informatics and AI in medicine.

Dataset Name Suggestions

Patient Symptom-Disease Mapping Data

Medical Diagnosis Prediction Dataset

Healthcare Dialogue System Training Data

Symptom-Disease NLP Data

Clinical Symptom-Diagnosis Dataset

Attributes

Original Data Source: Patient Disease Dataset
Disease Symptom Classifier Dataset
opendatabay.com
Updated Jul 2, 2025
Cite
Datasimple (2025). Disease Symptom Classifier Dataset [Dataset]. https://www.opendatabay.com/data/healthcare/1df74ad4-cc10-46c0-9cbb-309f5922d042
Dataset updated
Jul 2, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Healthcare Insurance & Costs
Description
This dataset provides a curated collection of disease labels paired with natural language descriptions of symptoms. Its primary purpose is to facilitate the development of language models capable of accurately predicting potential diseases based on user-provided symptom descriptions. Such models hold significant potential for enabling early disease identification, allowing individuals to seek prompt medical attention and treatment. Furthermore, it supports the creation of applications for remote diagnosis and treatment recommendations, particularly useful in situations where in-person consultations may not be feasible or desirable.

Columns

The dataset consists of two main columns:
* label: This column contains the specific disease labels associated with each symptom description.
* text: This column provides the natural language descriptions of the symptoms experienced.

Distribution

The dataset is typically provided in a CSV file format. It comprises a total of 1200 datapoints. These datapoints are structured around 24 distinct diseases, with each disease having 50 corresponding symptom descriptions.
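The stated figures are consistent with each other (24 diseases × 50 descriptions = 1200 datapoints), and the same check can be run on the loaded labels. A minimal sketch using placeholder label names in place of the real `label` column:

```python
from collections import Counter

def is_balanced(labels, expected_classes, expected_per_class):
    """Check that every class appears exactly expected_per_class times."""
    counts = Counter(labels)
    return (len(counts) == expected_classes
            and all(n == expected_per_class for n in counts.values()))

# Placeholder labels standing in for the dataset's label column.
labels = [f"disease_{i}" for i in range(24) for _ in range(50)]
print(len(labels), is_balanced(labels, expected_classes=24, expected_per_class=50))
```

A perfectly balanced label distribution like this means plain accuracy is a reasonable first metric, since no class dominates.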

Usage

This dataset is ideal for various applications and use cases, including:
* Developing and training natural language processing (NLP) models for disease prediction.
* Creating AI-powered tools for early identification of health conditions.
* Building virtual assistants or telemedicine platforms that offer remote diagnostic support.
* Researching classification algorithms in the medical and healthcare domain.
* Analysing disease patterns and symptom correlations.

Coverage

The dataset's coverage is global, making it suitable for a wide range of applications without regional limitations. It specifically includes 24 different diseases: Psoriasis, Varicose Veins, Typhoid, Chicken pox, Impetigo, Dengue, Fungal infection, Common Cold, Pneumonia, Dimorphic Hemorrhoids, Arthritis, Acne, Bronchial Asthma, Hypertension, Migraine, Cervical spondylosis, Jaundice, Malaria, urinary tract infection, allergy, gastroesophageal reflux disease, drug reaction, peptic ulcer disease, and diabetes. Information on specific time ranges or demographic scopes is not available in the provided details.

License

CC0

Who Can Use It

This dataset is intended for a variety of users, including:
* Data Scientists and Machine Learning Engineers: To build and refine models for medical diagnostics and NLP tasks.
* Healthcare Technology Developers: To integrate symptom analysis capabilities into healthcare applications and platforms.
* Researchers: To conduct studies on disease prediction, language understanding in a medical context, and the application of deep learning to health data.
* Students: As a valuable resource for learning and practicing data science and AI skills within the healthcare domain.

Dataset Name Suggestions

Symptom2Disease Dataset

Disease Symptom Classifier

Medical Symptom Description Data

Healthcare NLP Diagnostic Dataset

Attributes

Original Data Source: Symptom2Disease
‘In Hospital Mortality Prediction’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘In Hospital Mortality Prediction’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-in-hospital-mortality-prediction-41fd/latest
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘In Hospital Mortality Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/saurabhshahane/in-hospital-mortality-prediction on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

The predictors of in-hospital mortality for heart failure (HF) patients admitted to the intensive care unit (ICU) remain poorly characterized. We aimed to develop and validate a prediction model for all-cause in-hospital mortality among ICU-admitted HF patients.

Content

Demographic characteristics, vital signs, and laboratory values were extracted with Structured Query Language (SQL) queries (PostgreSQL, version 9.6) from the following tables in the MIMIC-III dataset: ADMISSIONS, PATIENTS, ICUSTAYS, D_ICD_DIAGNOSES, DIAGNOSES_ICD, LABEVENTS, D_LABITEMS, CHARTEVENTS, D_ITEMS, NOTEEVENTS, and OUTPUTEVENTS. Based on previous studies [7-9, 13-15], clinical relevance, and general availability at the time of presentation, we extracted the following data: demographic characteristics (age at the time of hospital admission, sex, ethnicity, weight, and height); vital signs (heart rate [HR], systolic blood pressure [SBP], diastolic blood pressure [DBP], mean blood pressure, respiratory rate, body temperature, saturation pulse oxygen [SPO2], urine output [first 24 h]); comorbidities (hypertension, atrial fibrillation, ischemic heart disease, diabetes mellitus, depression, hypoferric anemia, hyperlipidemia, chronic kidney disease [CKD], and chronic obstructive pulmonary disease [COPD]); and laboratory variables (hematocrit, red blood cells, mean corpuscular hemoglobin [MCH], mean corpuscular hemoglobin concentration [MCHC], mean corpuscular volume [MCV], red blood cell distribution width [RDW], platelet count, white blood cells, neutrophils, basophils, lymphocytes, prothrombin time [PT], international normalized ratio [INR], NT-proBNP, creatine kinase, creatinine, blood urea nitrogen [BUN], glucose, potassium, sodium, calcium, chloride, magnesium, the anion gap, bicarbonate, lactate, hydrogen ion concentration [pH], partial pressure of CO2 in arterial blood, and left ventricular ejection fraction [LVEF]). Demographic characteristics and vital signs were recorded during the first 24 hours of each admission, and laboratory variables were measured during the entire ICU stay. Comorbidities were identified using ICD-9 codes.
For variable data with multiple measurements, the calculated mean value was included for analysis. The primary outcome of the study was in-hospital mortality, defined as the vital status at the time of hospital discharge in survivors and non-survivors.
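The averaging step described above (one mean value per variable when a stay has several measurements) can be sketched as follows. This is an illustration with hypothetical identifiers and readings, not the authors' extraction code, which ran as SQL against MIMIC-III.

```python
from collections import defaultdict
from statistics import mean


def mean_per_variable(measurements):
    """Collapse repeated (stay_id, variable, value) readings to one mean each."""
    grouped = defaultdict(list)
    for stay_id, variable, value in measurements:
        if value is not None:  # skip missing readings instead of averaging them
            grouped[(stay_id, variable)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical readings for illustration only.
readings = [
    (1, "heart_rate", 80.0),
    (1, "heart_rate", 90.0),
    (1, "lactate", 2.0),
    (2, "heart_rate", 100.0),
    (2, "lactate", None),
]

features = mean_per_variable(readings)
print(features[(1, "heart_rate")])
```

Dropping missing readings before averaging is an assumption of this sketch. The source does not say how missing values were handled.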

Acknowledgements

Zhou, Jingmin et al. (2021), Prediction model of in-hospital mortality in intensive care unit patients with heart failure: machine learning-based, retrospective analysis of the MIMIC-III database, Dryad, Dataset, https://doi.org/10.5061/dryad.0p2ngf1zd

LICENSE - CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

Target variable: Outcome (0 = alive, 1 = death)

--- Original source retains full ownership of the source dataset ---
Heart Disease Risk Prediction Dataset
kaggle.com
Updated Apr 3, 2025
Cite
Şahide ŞEKER (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/sahideseker/heart-disease-risk-prediction-dataset/discussion
Dataset updated
Apr 3, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Şahide ŞEKER
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
🇬🇧 English:

This synthetic dataset helps build machine learning models to predict whether a patient is at risk of heart disease. It includes patient attributes such as age, cholesterol, blood pressure, sex, and diabetes history.

Use this dataset to:

Train classification models (e.g., XGBoost, Decision Tree)

Analyze the relationship between health metrics and heart disease

Practice healthcare-related ML without privacy concerns

Features:

age: Age of the patient

cholesterol: Cholesterol level (mg/dL)

bp: Blood pressure (mmHg)

sex: Biological sex (Male/Female)

diabetes: Diabetes status (Yes/No)

heart_disease: Presence of heart disease (1 = Yes, 0 = No)
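Before training a classifier on these columns, the two categorical fields need a numeric encoding. The sketch below shows one way to do that with the field names listed above. The 0/1 assignments for `sex` and `diabetes` and the example record are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age: int
    cholesterol: float   # mg/dL
    bp: float            # mmHg
    sex: str             # "Male" or "Female"
    diabetes: str        # "Yes" or "No"
    heart_disease: int   # 1 = Yes, 0 = No


def to_features(p):
    """Return (feature_vector, label) with categorical fields mapped to 0/1."""
    sex = {"Male": 1, "Female": 0}[p.sex]          # arbitrary but fixed coding
    diabetes = {"Yes": 1, "No": 0}[p.diabetes]
    return [p.age, p.cholesterol, p.bp, sex, diabetes], p.heart_disease


# Hypothetical record, not taken from the dataset.
x, y = to_features(Patient(54, 220.0, 130.0, "Female", "Yes", 1))
print(x, y)
```

The dictionary lookups raise `KeyError` on any unexpected category value, which surfaces dirty rows early rather than silently encoding them.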

Data set presentation.
plos.figshare.com
xls
Updated Sep 30, 2024
Cite
Wenguang Li; Yan Peng; Ke Peng (2024). Data set presentation. [Dataset]. http://doi.org/10.1371/journal.pone.0311222.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0311222.t001
Dataset updated
Sep 30, 2024
Dataset provided by
PLOS ONE
Authors
Wenguang Li; Yan Peng; Ke Peng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes, as an incurable lifelong chronic disease, has profound and far-reaching effects on patients. Given this, early intervention is particularly crucial, as it can not only significantly improve the prognosis of patients but also provide valuable reference information for clinical treatment. This study selected the BRFSS (Behavioral Risk Factor Surveillance System) dataset, which is publicly available on the Kaggle platform, as the research object, aiming to provide a scientific basis for the early diagnosis and treatment of diabetes through advanced machine learning techniques. Firstly, the dataset was balanced using various sampling methods; secondly, a Stacking model based on GA-XGBoost (XGBoost model optimized by genetic algorithm) was constructed for the risk prediction of diabetes; finally, the interpretability of the model was deeply analyzed using Shapley values. The results show: (1) Random oversampling, ADASYN, SMOTE, and SMOTEENN were used for data balance processing, among which SMOTEENN showed better efficiency and effect in dealing with data imbalance. (2) The GA-XGBoost model optimized the hyperparameters of the XGBoost model through a genetic algorithm to improve the model’s predictive accuracy. Combined with the better-performing LightGBM model and random forest model, a two-layer Stacking model was constructed. This model not only outperforms single machine learning models in predictive effect but also provides a new idea and method in the field of model integration. (3) Shapley value analysis identified features that have a significant impact on the prediction of diabetes, such as age and body mass index. This analysis not only enhances the transparency of the model but also provides more precise treatment decision support for doctors and patients. 
In summary, this study has not only improved the accuracy of predicting the risk of diabetes by adopting advanced machine learning techniques and model integration strategies but also provided a powerful tool for the early diagnosis and personalized treatment of diabetes.
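Of the balancing methods named above, random oversampling is the simplest to illustrate. The sketch below is a plain standard-library version that duplicates minority-class rows until every class matches the largest one. It is not the study's implementation and it does not cover ADASYN, SMOTE, or SMOTEENN. The function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw with replacement from the class's own rows to fill the gap.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: four rows of class 0 and one row of class 1.
rows = [[1], [2], [3], [4], [5]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.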

