Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘DISEASE PREDICTION USING MACHINE LEARNING WITH GUI’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/neelima98/disease-prediction-using-machine-learning on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Thanks to the growth of big data in the biomedical and healthcare communities, accurate analysis of medical data benefits early disease detection, patient care, and community services. When medical data are incomplete, the accuracy of the analysis is reduced. Moreover, different regions exhibit unique characteristics of certain regional diseases, which may weaken the prediction of disease outbreaks. This project proposes machine learning Decision Tree, Naive Bayes, and Random Forest algorithms using structured and unstructured data from a hospital. It also uses a machine learning algorithm for partitioning the data. To the best of our knowledge, no existing work has focused on both data types in the area of medical big data analytics. Compared with several typical prediction algorithms, the prediction accuracy of the proposed algorithm reaches 94.8%, with a speed that is faster than that of the unimodal disease risk prediction algorithm, and it produces a report.
--- Original source retains full ownership of the source dataset ---
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset will help you put your existing knowledge to good use. It has 132 parameters from which 42 different types of diseases can be predicted. The dataset consists of 2 CSV files: one for training and one for testing your model. Each CSV file has 133 columns; 132 of these are symptoms that a person experiences, and the last column is the prognosis. The symptoms are mapped to 42 diseases, so you can classify these sets of symptoms. You are required to train your model on the training data and test it on the testing data.
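The column layout described above (132 symptom columns followed by one prognosis column) can be sketched as follows. This is a minimal illustration, not the dataset author's code: the column names `symptom_0` ... `symptom_131`, the 0/1 encoding of symptoms, and the two demo rows are all assumptions made so the example is self-contained.

```python
import csv
import io

N_SYMPTOMS = 132  # stated in the description; the 133rd column is the label


def split_features_and_label(csv_text):
    """Return (X, y): X holds the symptom rows, y the prognosis strings."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if len(header) != N_SYMPTOMS + 1:
        raise ValueError(f"expected {N_SYMPTOMS + 1} columns, got {len(header)}")
    X, y = [], []
    for row in reader:
        X.append([int(v) for v in row[:-1]])  # assumed 0/1 symptom indicators
        y.append(row[-1])                     # prognosis is the last column
    return X, y


# Hypothetical two-row file standing in for the real training CSV.
header = [f"symptom_{i}" for i in range(N_SYMPTOMS)] + ["prognosis"]
row_a = ["1"] + ["0"] * (N_SYMPTOMS - 1) + ["Disease A"]
row_b = ["0"] * (N_SYMPTOMS - 1) + ["1"] + ["Disease B"]
demo_csv = "\n".join(",".join(r) for r in [header, row_a, row_b])

X, y = split_features_and_label(demo_csv)
print(len(X), len(X[0]), y)
```

With the real files, the same function could be applied to the text of the training CSV and the testing CSV separately, provided the symptom columns are in fact numeric.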
Machine Learning
medicine,disease,Healthcare,ML,Machine Learning
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains symptom and disease information. It contains a total of 1325 symptoms covering 391 diseases. The dataset is referenced from the website MedlinePlus. It has training and testing sets and can be used to train a disease prediction algorithm. It was created independently for a disease prediction project and does not involve any funding or promotional terms.
This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.
The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).
Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
Five measures of variation in frequency:
MDVP:Jitter(%) - Percentage of cycle-to-cycle variability of the period duration
MDVP:Jitter(Abs) - Absolute value of cycle-to-cycle variability of the period duration
MDVP:RAP - Relative measure of the pitch disturbance
MDVP:PPQ - Pitch perturbation quotient
Jitter:DDP - Average absolute difference of differences between jitter cycles
Six measures of variation in amplitude:
MDVP:Shimmer - Variations in the voice amplitude
MDVP:Shimmer(dB) - Variations in the voice amplitude in dB
Shimmer:APQ3 - Three point amplitude perturbation quotient measured against the average of the three amplitude
Shimmer:APQ5 - Five point amplitude perturbation quotient measured against the average of the three amplitude
MDVP:APQ - Amplitude perturbation quotient from MDVP
Shimmer:DDA - Average absolute difference between the amplitudes of consecutive periods
Two measures of the ratio of noise to tonal components in the voice:
NHR - Noise-to-harmonics ratio
HNR - Harmonics-to-noise ratio
status - Health status of the subject: 1 - Parkinson's, 0 - healthy
Two nonlinear dynamical complexity measures:
RPDE - Recurrence period density entropy
D2 - Correlation dimension
DFA - Signal fractal scaling exponent
Three nonlinear measures of fundamental frequency variation:
spread1 - Discrete probability distribution of occurrence of relative semitone variations
spread2 - Three nonlinear measures of fundamental frequency variation
PPE - Entropy of the discrete probability distribution of occurrence of relative semitone variations
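Because each subject contributes around six recordings, one may want to keep all recordings of a subject on the same side of a train/test split so that the same voice never appears in both. The sketch below shows one way to do that with the standard library. It is an illustration only: the `<subject>_<recording number>` name format, the helper names, and the demo rows are assumptions, and the real "name" values may need a different parsing rule.

```python
import random
from collections import defaultdict


def subject_of(name):
    """Assumed format '<subject>_<recording number>'; drop the final part."""
    return name.rsplit("_", 1)[0]


def split_by_subject(rows, test_fraction=0.25, seed=0):
    """Split rows so that no subject appears in both train and test."""
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[subject_of(row["name"])].append(row)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [r for s in subjects if s not in test_subjects for r in by_subject[s]]
    test = [r for s in subjects if s in test_subjects for r in by_subject[s]]
    return train, test


# Hypothetical data: 8 subjects with 6 recordings each.
rows = [{"name": f"S{subj:02d}_{rec}", "status": subj % 2}
        for subj in range(1, 9) for rec in range(1, 7)]
train, test = split_by_subject(rows)
print(len(train), len(test))
```

Splitting at the recording level instead would let recordings from one person land in both sets, which can make a classifier look better than it is on unseen people.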
This notebook introduces some foundational machine learning and data science concepts by exploring the problem of heart disease classification.
The original data came from the Cleveland database from UCI Machine Learning Repository.
The original database contains 76 attributes, but only 14 are used here. Attributes (also called features) are the variables that we'll use to predict our target.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset is curated by combining 3 popular heart disease datasets. The first dataset (collected from Kaggle) contains 70000 records with 11 independent features, which makes it the largest heart disease dataset available so far for research purposes. These data were collected at the moment of medical examination and from information given by the patient. The second and third datasets contain 303 and 293 instances respectively, with 13 common features. The three datasets used for its curation are: Cardio Data (Kaggle Dataset)
The Kidney Disease Dataset is a rich collection of clinical and laboratory data from patients, curated to support the analysis, diagnosis, and prediction of chronic kidney disease (CKD). It includes 43 diverse features encompassing demographic details, vital signs, urine and blood test results, medical history, lifestyle factors, and biomarkers such as eGFR, serum creatinine, and Cystatin C. This dataset is ideal for building machine learning models, conducting statistical analysis, and exploring correlations between health indicators and kidney function. It provides a valuable resource for researchers and healthcare professionals working on early detection and management of kidney-related disorders. This dataset consists of detailed clinical information related to kidney health, intended for machine learning applications, statistical analysis, and healthcare research.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Context: The leading cause of death in the developed world is heart disease. Therefore, work needs to be done to help prevent the risks of having a heart attack or stroke.
Content: Use this dataset to predict which patients are most likely to suffer from a heart disease in the near future using the features given.
Acknowledgement: This data comes from the University of California Irvine's Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Heart+Disease.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Overview: This dataset is a synthetic collection of medical attributes designed for educational and research purposes. It provides structured health-related data, including patient demographics, vital signs, and electrocardiogram (ECG) measurements, along with a predicted disease classification.
The dataset is intended to support machine learning practitioners and students in developing classification models for disease prediction. It allows users to explore patterns in health-related data and apply machine learning techniques in a controlled, educational setting.
Dataset Details
Total Records: 695,551 entries
Target Variable: Predicted_Disease (Categorical: ‘Arrhythmia’, ‘Heart Failure’, ‘Coronary Artery Disease’, ‘Good’)
Features:
- Age
- Gender
- Weight
- Height
- Heart_Rate
- Oxygen_Saturation
- Temperature
- ECG_QT_Interval
- ECG_ST_Segment
- Predicted_Disease
This dataset was generated with a script using predefined parameter ranges and is not derived from real-world medical data. It should not be considered reliable for medical or clinical decision-making.
It is intended for educational purposes only and should not be used in real-world healthcare applications. The accuracy of the generated values is not guaranteed.
I'm not responsible for any incorrect use, misinterpretation, or unintended consequences of this dataset.
Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.
This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset is curated by combining 5 popular heart disease datasets already available independently but not combined before. In this dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset was acquired from one of the multispecialty hospitals in India. It has over 14 common features, which makes it one of the heart disease datasets available so far for research purposes. The dataset consists of 1000 subjects with 12 features. It will be useful for building early-stage heart disease detection as well as for generating predictive machine learning models.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Heart Failure Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/andrewmvd/heart-failure-clinical-data on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.
Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.
People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.
- Create a model for predicting mortality caused by Heart Failure.
If you use this dataset in your research, please credit the authors
Citation
Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020).
License
CC BY 4.0
--- Original source retains full ownership of the source dataset ---
Part of Janatahack Hackathon in Analytics Vidhya
The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.
MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or to increase their awareness by visiting various stalls (depending on the format of the camp).
MedCamp has conducted 65 such events over a period of 4 years and sees a high drop-off between “Registration” and the number of people taking tests at the camps. Over the last 4 years, it has stored data on ~110,000 registrations.
One of the major costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less inventory than required for conducting these medical checks, people end up having a bad experience.
The Process:
MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
Other things to note:
Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was hardware failure, so some information about date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
Favorable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favourable outcome.
Train / Test split:
Camps started on or before 31st March 2006 are considered in Train
Test data is for all camps conducted on or after 1st April 2006.
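The outcome definition and the date-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the field names `camp_format`, `health_score`, `stalls_visited`, and `camp_start` are hypothetical and are not taken from the competition files, and the three demo records are made up.

```python
from datetime import date

TRAIN_CUTOFF = date(2006, 3, 31)  # last camp start date counted as Train


def favourable(record):
    """Return 1 if the registration ended in a favourable outcome, else 0."""
    if record["camp_format"] in (1, 2):
        # Formats 1 and 2: favourable means a health score was obtained.
        return int(record["health_score"] is not None)
    # Format 3: favourable means at least one stall was visited.
    return int(record["stalls_visited"] >= 1)


def split_name(camp_start):
    """Camps starting on or before the cutoff go to train, later ones to test."""
    return "train" if camp_start <= TRAIN_CUTOFF else "test"


# Hypothetical registrations, one per camp format.
records = [
    {"camp_format": 1, "health_score": 0.42, "stalls_visited": 0,
     "camp_start": date(2005, 6, 1)},
    {"camp_format": 2, "health_score": None, "stalls_visited": 0,
     "camp_start": date(2006, 3, 31)},
    {"camp_format": 3, "health_score": None, "stalls_visited": 2,
     "camp_start": date(2006, 4, 1)},
]
labels = [favourable(r) for r in records]
splits = [split_name(r["camp_start"]) for r in records]
print(labels, splits)
```

The task asks for a probability of a favourable outcome, so a label built this way would serve as the binary training target for a probabilistic classifier.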
Credits to AV
To share with the data science community to jump start their journey in Healthcare Analytics
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Dementia Prediction Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shashwatwork/dementia-prediction-dataset on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Dementia is a syndrome – usually of a chronic or progressive nature – in which there is deterioration in cognitive function (i.e. the ability to process thought) beyond what might be expected from normal aging. It affects memory, thinking, orientation, comprehension, calculation, learning capacity, language, and judgment. Consciousness is not affected. The impairment in cognitive function is commonly accompanied, and occasionally preceded, by deterioration in emotional control, social behaviour, or motivation.
Dementia results from a variety of diseases and injuries that primarily or secondarily affect the brain, such as Alzheimer's disease or stroke.
Dementia is one of the major causes of disability and dependency among older people worldwide. It can be overwhelming, not only for the people who have it, but also for their carers and families. There is often a lack of awareness and understanding of dementia, resulting in stigmatization and barriers to diagnosis and care. The impact of dementia on carers, family, and society at large can be physical, psychological, social, and economic.
This set consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year, for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit.
Battineni, Gopi; Amenta, Francesco; Chintalapudi, Nalini (2019), “Data for: MACHINE LEARNING IN MEDICINE: CLASSIFICATION AND PREDICTION OF DEMENTIA BY SUPPORT VECTOR MACHINES (SVM)”, Mendeley Data, V1, doi: 10.17632/tsy6rbc5d4.1
--- Original source retains full ownership of the source dataset ---
This dataset is designed for preliminary diagnosis prediction, supporting patient flow logistics and the second opinion concept during patient interactions through dialogue systems. It is part of a project initiated at ITMO University in 2022. The dataset maps symptoms to diseases, offering a valuable resource for developing AI and LLM-based diagnostic tools. It comprises two main columns, detailing symptoms and their corresponding diagnoses, with 132 unique symptoms and 40 unique diagnoses identified.
The dataset is typically provided in a CSV format. It structures information across two columns: symptoms and disease names. While the exact total number of rows or records is not specified, the dataset includes 132 unique symptoms and 40 unique diagnoses. This is a Version 1.0 dataset.
This dataset is ideally suited for: * Developing and training preliminary diagnosis prediction models. * Enhancing patient flow logistics in healthcare settings. * Supporting second opinion concepts through automated systems. * Building and refining dialogue systems for patient interactions. * Training AI and machine learning models for symptom-disease mapping.
The dataset's scope is global, indicating its potential applicability across different regions. The project that developed these datasets has been active since 2022, suggesting the data reflects contemporary medical terminology and contexts. The dataset was listed on 26/06/2025.
CC-BY-NC
Original Data Source: Patient Disease Dataset
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a curated collection of disease labels paired with natural language descriptions of symptoms. Its primary purpose is to facilitate the development of language models capable of accurately predicting potential diseases based on user-provided symptom descriptions. Such models hold significant potential for enabling early disease identification, allowing individuals to seek prompt medical attention and treatment. Furthermore, it supports the creation of applications for remote diagnosis and treatment recommendations, particularly useful in situations where in-person consultations may not be feasible or desirable.
The dataset consists of two main columns: * label: This column contains the specific disease labels associated with each symptom description. * text: This column provides the natural language descriptions of the symptoms experienced.
The dataset is typically provided in a CSV file format. It comprises a total of 1200 datapoints. These datapoints are structured around 24 distinct diseases, with each disease having 50 corresponding symptom descriptions.
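The stated size follows from 24 diseases × 50 descriptions = 1200 datapoints. A quick sanity check of that kind of class balance might look like the sketch below; the placeholder labels `disease_0` ... `disease_23` are assumptions standing in for the real `label` column.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts and whether every class has the same count."""
    counts = Counter(labels)
    balanced = len(set(counts.values())) == 1
    return counts, balanced


# Placeholder labels: 24 classes with 50 examples each.
labels = [f"disease_{i}" for i in range(24) for _ in range(50)]
counts, balanced = class_balance(labels)
print(len(counts), sum(counts.values()), balanced)
```

Running the same check on the real `label` column would confirm whether the file matches the description before any model is trained.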
This dataset is ideal for various applications and use cases, including: * Developing and training natural language processing (NLP) models for disease prediction. * Creating AI-powered tools for early identification of health conditions. * Building virtual assistants or telemedicine platforms that offer remote diagnostic support. * Researching classification algorithms in the medical and healthcare domain. * Analysing disease patterns and symptom correlations.
The dataset's coverage is global, making it suitable for a wide range of applications without regional limitations. It specifically includes 24 different diseases: Psoriasis, Varicose Veins, Typhoid, Chicken pox, Impetigo, Dengue, Fungal infection, Common Cold, Pneumonia, Dimorphic Hemorrhoids, Arthritis, Acne, Bronchial Asthma, Hypertension, Migraine, Cervical spondylosis, Jaundice, Malaria, urinary tract infection, allergy, gastroesophageal reflux disease, drug reaction, peptic ulcer disease, and diabetes. Information on specific time ranges or demographic scopes is not available in the provided details.
CC0
This dataset is intended for a variety of users, including: * Data Scientists and Machine Learning Engineers: To build and refine models for medical diagnostics and NLP tasks. * Healthcare Technology Developers: To integrate symptom analysis capabilities into healthcare applications and platforms. * Researchers: To conduct studies on disease prediction, language understanding in a medical context, and the application of deep learning to health data. * Students: As a valuable resource for learning and practicing data science and AI skills within the healthcare domain.
Original Data Source: Symptom2Disease
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘In Hospital Mortality Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/saurabhshahane/in-hospital-mortality-prediction on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The predictors of in-hospital mortality for intensive care unit (ICU)-admitted heart failure (HF) patients remain poorly characterized. We aimed to develop and validate a prediction model for all-cause in-hospital mortality among ICU-admitted HF patients.
Using Structured Query Language (SQL) queries (PostgreSQL, version 9.6), demographic characteristics, vital signs, and laboratory values were extracted from the following tables in the MIMIC-III dataset: ADMISSIONS, PATIENTS, ICUSTAYS, D_ICD_DIAGNOSES, DIAGNOSES_ICD, LABEVENTS, D_LABITEMS, CHARTEVENTS, D_ITEMS, NOTEEVENTS, and OUTPUTEVENTS. Based on previous studies (7-9, 13-15), clinical relevance, and general availability at the time of presentation, we extracted the following data: demographic characteristics (age at the time of hospital admission, sex, ethnicity, weight, and height); vital signs (heart rate [HR], systolic blood pressure [SBP], diastolic blood pressure [DBP], mean blood pressure, respiratory rate, body temperature, saturation pulse oxygen [SPO2], urine output [first 24 h]); comorbidities (hypertension, atrial fibrillation, ischemic heart disease, diabetes mellitus, depression, hypoferric anemia, hyperlipidemia, chronic kidney disease [CKD], and chronic obstructive pulmonary disease [COPD]); and laboratory variables (hematocrit, red blood cells, mean corpuscular hemoglobin [MCH], mean corpuscular hemoglobin concentration [MCHC], mean corpuscular volume [MCV], red blood cell distribution width [RDW], platelet count, white blood cells, neutrophils, basophils, lymphocytes, prothrombin time [PT], international normalized ratio [INR], NT-proBNP, creatine kinase, creatinine, blood urea nitrogen [BUN], glucose, potassium, sodium, calcium, chloride, magnesium, the anion gap, bicarbonate, lactate, hydrogen ion concentration [pH], partial pressure of CO2 in arterial blood, and LVEF). Demographic characteristics and vital signs were recorded during the first 24 hours of each admission, and laboratory variables were measured during the entire ICU stay. Comorbidities were identified using ICD-9 codes.
For variable data with multiple measurements, the calculated mean value was included for analysis. The primary outcome of the study was in-hospital mortality, defined as the vital status at the time of hospital discharge in survivors and non-survivors.
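The aggregation step mentioned above, taking the mean when a variable has several measurements, could look roughly like this. It is a minimal sketch, not the study's extraction code: the `(patient_id, variable, value)` tuple layout, the function name, and the demo values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_per_patient(measurements):
    """Average repeated measurements per (patient_id, variable) pair.

    measurements: iterable of (patient_id, variable, value) tuples.
    """
    grouped = defaultdict(list)
    for patient_id, variable, value in measurements:
        grouped[(patient_id, variable)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical long-format measurements.
demo = [
    (1, "heart_rate", 80.0),
    (1, "heart_rate", 90.0),
    (1, "creatinine", 1.0),
    (2, "heart_rate", 70.0),
]
features = mean_per_patient(demo)
print(features[(1, "heart_rate")])
```

The result is one value per patient and variable, which matches the one-row-per-admission shape a tabular mortality model would expect.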
Zhou, Jingmin et al. (2021), Prediction model of in-hospital mortality in intensive care unit patients with heart failure: machine learning-based, retrospective analysis of the MIMIC-III database, Dryad, Dataset, https://doi.org/10.5061/dryad.0p2ngf1zd
Target variable: Outcome (0 - Alive, 1 - Death)
--- Original source retains full ownership of the source dataset ---
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🇬🇧 English:
This synthetic dataset helps build machine learning models to predict whether a patient is at risk of heart disease. It includes patient attributes such as age, cholesterol, blood pressure, sex, and diabetes history.
Use this dataset to:
Features:
🇹🇷 Turkish (translated):
This synthetic dataset is designed for developing machine learning models that predict whether patients are at risk of heart disease. It includes features such as age, cholesterol, blood pressure, sex, and diabetes information.
With this dataset:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diabetes, as an incurable lifelong chronic disease, has profound and far-reaching effects on patients. Given this, early intervention is particularly crucial, as it can not only significantly improve the prognosis of patients but also provide valuable reference information for clinical treatment. This study selected the BRFSS (Behavioral Risk Factor Surveillance System) dataset, which is publicly available on the Kaggle platform, as the research object, aiming to provide a scientific basis for the early diagnosis and treatment of diabetes through advanced machine learning techniques. Firstly, the dataset was balanced using various sampling methods; secondly, a Stacking model based on GA-XGBoost (XGBoost model optimized by genetic algorithm) was constructed for the risk prediction of diabetes; finally, the interpretability of the model was deeply analyzed using Shapley values. The results show: (1) Random oversampling, ADASYN, SMOTE, and SMOTEENN were used for data balance processing, among which SMOTEENN showed better efficiency and effect in dealing with data imbalance. (2) The GA-XGBoost model optimized the hyperparameters of the XGBoost model through a genetic algorithm to improve the model’s predictive accuracy. Combined with the better-performing LightGBM model and random forest model, a two-layer Stacking model was constructed. This model not only outperforms single machine learning models in predictive effect but also provides a new idea and method in the field of model integration. (3) Shapley value analysis identified features that have a significant impact on the prediction of diabetes, such as age and body mass index. This analysis not only enhances the transparency of the model but also provides more precise treatment decision support for doctors and patients. 
In summary, this study has not only improved the accuracy of predicting the risk of diabetes by adopting advanced machine learning techniques and model integration strategies but also provided a powerful tool for the early diagnosis and personalized treatment of diabetes.
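Of the balancing methods named in the abstract, random oversampling is the simplest to illustrate: minority-class rows are duplicated at random until every class matches the largest one. The sketch below uses only the standard library and is not the study's implementation; the function name, the seed, and the toy data are assumptions. The abstract reports that SMOTEENN worked better for this task, and it is not reproduced here.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the majority class size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy imbalanced data: 8 negatives, 2 positives.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Oversampling of this kind is normally applied to the training portion only, so that duplicated rows do not leak into the evaluation data.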