100+ datasets found
  1. Comprehensive Medical Q&A Dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    The Devastator (2023). Comprehensive Medical Q&A Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset
    Explore at:
    zip (5126941 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Comprehensive Medical Q&A Dataset

    Unlocking Healthcare Data with Natural Language Processing

    By Huggingface Hub [source]

    About this dataset

    The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!

    How to use the dataset

    In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.
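
    The example query above can be tried without a database server. The following is a minimal sketch using Python's built-in sqlite3 module. The table name MedQuad and the column names come from the text; the three sample rows and their placeholder answers are invented for illustration and are not taken from train.csv. It also assumes the qtype value is spelled 'Treatment' as in the example query, so check the actual values in the file first, because the equality match is case-sensitive.

```python
import sqlite3

# Hypothetical sample rows (qtype, Question, Answer); the real data lives in train.csv.
SAMPLE_ROWS = [
    ("Treatment", "How is chronic back pain treated?", "Placeholder answer A."),
    ("Symptoms", "What are the symptoms of chest pain?", "Placeholder answer B."),
    ("Treatment", "How is asthma treated?", "Placeholder answer C."),
]


def query_answers(rows, qtype, keyword):
    """Return answers whose qtype matches exactly and whose question contains keyword."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE MedQuad (qtype TEXT, Question TEXT, Answer TEXT)")
        conn.executemany("INSERT INTO MedQuad VALUES (?, ?, ?)", rows)
        # Parameterized version of the query quoted in the text.
        cursor = conn.execute(
            "SELECT Answer FROM MedQuad WHERE qtype = ? AND Question LIKE ?",
            (qtype, f"%{keyword}%"),
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


if __name__ == "__main__":
    for answer in query_answers(SAMPLE_ROWS, "Treatment", "pain"):
        print(answer)
```

    Passing the filter values as parameters rather than formatting them into the SQL string keeps the sketch safe to reuse with arbitrary search terms.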

    Once you have obtained new insights about healthcare from the answers in this dynamic dataset, it is time for action. Use that understanding of patient needs to develop educational materials and implement any suggested changes. If you need more criteria for querying, check whether MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis, so look out for notifications when this happens.

    Finally, once your use case has made an impact, don't forget proper citation etiquette: give credit where credit is due!

    Research Ideas

    • Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
    • Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
    • Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | qtype | The type of medical question. (String) |
    | Question | The medical question posed by the patient. (String) |
    | Answer | The expert response to the medical question. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit Huggingface Hub.

  2. ❤️‍🩹 Medical Condition Prediction Dataset

    • kaggle.com
    Updated Sep 13, 2024
    Ciobanu Marius (2024). ❤️‍🩹 Medical Condition Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/marius2303/medical-condition-prediction-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ciobanu Marius
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    About Dataset

    This dataset provides information about various medical conditions (Cancer, Pneumonia, and Diabetic) based on demographic, lifestyle, and health-related features. It contains randomly generated user data with multiple missing values and an imbalanced target, making it suitable for practicing imbalanced classification and missing-data handling.

    Features

    • id: Unique identifier for each user.
    • full_name: Randomly generated user name.
    • age: Age of the user (ranging from 18 to 90 years), with some missing values.
    • gender: The gender of the user (categorized as Male, Female, or Non-Binary).
    • smoking_status: Indicates the smoking status of the user (Smoker, Non-Smoker, Former-Smoker).
    • bmi: Body Mass Index (BMI) of the user (ranging from 15 to 40), with some missing values.
    • blood_pressure: Blood pressure levels of the user (ranging from 90 to 180 mmHg), with some missing values.
    • glucose_levels: Blood glucose levels of the user (ranging from 70 to 200 mg/dL), with some missing values.
    • condition: The target label indicating the medical condition of the user (Cancer, Pneumonia, or Diabetic), with imbalanced distribution (15% Cancer, 25% Pneumonia, 60% Diabetic).

    Goal

    The objective of this dataset is to predict the medical condition (Cancer, Pneumonia, Diabetic) of a user based on their demographic, lifestyle, and health-related features. This dataset can be used to explore strategies for dealing with imbalanced classes and missing data in healthcare applications. ​
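
    The two difficulties named above, missing values and imbalanced classes, can be sketched with the standard library alone. This is a minimal illustration, not the dataset author's method: the four sample rows are hypothetical, median imputation is just one common choice, and the weights use the usual inverse-frequency rule applied to the class shares quoted in the feature list (15% Cancer, 25% Pneumonia, 60% Diabetic).

```python
from statistics import median

# Hypothetical rows using column names from the feature list; None marks a missing value.
users = [
    {"age": 34, "bmi": 22.5, "condition": "Diabetic"},
    {"age": None, "bmi": 31.0, "condition": "Cancer"},
    {"age": 58, "bmi": None, "condition": "Pneumonia"},
    {"age": 46, "bmi": 27.5, "condition": "Diabetic"},
]


def impute_median(rows, column):
    """Return copies of rows with missing values in `column` replaced by the column median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = median(observed)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


def inverse_frequency_weights(class_shares):
    """Weight each class by 1 / (number_of_classes * share), so rare classes count more."""
    n_classes = len(class_shares)
    return {label: 1.0 / (n_classes * share) for label, share in class_shares.items()}


if __name__ == "__main__":
    cleaned = impute_median(impute_median(users, "age"), "bmi")
    print(cleaned)
    # Class shares as stated in the dataset description.
    print(inverse_frequency_weights({"Cancer": 0.15, "Pneumonia": 0.25, "Diabetic": 0.60}))
```

    With these shares the rarest class (Cancer) receives the largest weight, which is the intended effect when such weights are passed to a classifier's loss.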

  3. 11000 Medicine details

    • kaggle.com
    zip
    Updated Jun 3, 2024
    Navjot Singh (2024). 11000 Medicine details [Dataset]. https://www.kaggle.com/datasets/singhnavjot2062001/11000-medicine-details
    Explore at:
    zip (781809 bytes)
    Dataset updated
    Jun 3, 2024
    Authors
    Navjot Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a valuable resource for healthcare professionals, data scientists, and enthusiasts interested in exploring the world of medicines and healthcare products. It contains a rich repository of information scraped from 1mg, a popular online pharmacy and healthcare platform, covering over 11,000 medicines.

  4. Fever Diagnosis and Medicine Dataset

    • kaggle.com
    Updated Dec 4, 2024
    Ziya (2024). Fever Diagnosis and Medicine Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/fever-diagnosis-and-medicine-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Kaggle
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset is designed to assist in predicting recommended medications for patients based on their fever condition, symptoms, medical history, and other relevant factors. It incorporates a mix of patient health data, environmental variables, and lifestyle choices to improve model accuracy and better simulate real-world scenarios.

    Dataset Characteristics:

    • Total Samples: 1000 (modifiable based on user needs).
    • Number of Features: 19 features + 1 target column.
    • File Format: CSV (enhanced_fever_medicine_recommendation.csv).

    Features Description:

    | Column Name | Description | Data Type |
    |:------------|:------------|:----------|
    | Temperature | Body temperature of the patient in Celsius (e.g., 36.5 - 40.0). | Float |
    | Fever_Severity | Categorized fever severity: Normal, Mild Fever, High Fever. | Categorical |
    | Age | Age of the patient (1-100 years). | Integer |
    | Gender | Gender of the patient: Male or Female. | Categorical |
    | BMI | Body Mass Index of the patient (e.g., 18.0 - 35.0). | Float |
    | Headache | Whether the patient has a headache: Yes or No. | Categorical |
    | Body_Ache | Whether the patient has body aches: Yes or No. | Categorical |
    | Fatigue | Whether the patient feels fatigued: Yes or No. | Categorical |
    | Chronic_Conditions | If the patient has any chronic conditions (e.g., diabetes, asthma): Yes or No. | Categorical |
    | Allergies | If the patient has any allergies to medications: Yes or No. | Categorical |
    | Smoking_History | If the patient has a history of smoking: Yes or No. | Categorical |
    | Alcohol_Consumption | If the patient consumes alcohol: Yes or No. | Categorical |
    | Humidity | Current humidity level in the patient’s area (e.g., 30-90%). | Float |
    | AQI | Current Air Quality Index in the patient’s area (e.g., 0-500). | Integer |
    | Physical_Activity | Daily physical activity level: Sedentary, Moderate, Active. | Categorical |
    | Diet_Type | Diet preference: Vegetarian, Non-Vegetarian, or Vegan. | Categorical |
    | Heart_Rate | Resting heart rate of the patient in beats per minute (e.g., 60-100). | Integer |
    | Blood_Pressure | Blood pressure category: Normal, High, or Low. | Categorical |
    | Previous_Medication | Medication previously taken by the patient: Paracetamol, Ibuprofen, Aspirin, or None. | Categorical |
    | Recommended_Medication | Target variable indicating the recommended medicine: Paracetamol or Ibuprofen. | Categorical |

  5. healthcare dataset

    • kaggle.com
    zip
    Updated Feb 20, 2024
    Zeinab Aladly (2024). healthcare dataset [Dataset]. https://www.kaggle.com/datasets/zeinabaladly/healthcare-dataset
    Explore at:
    zip (494700 bytes)
    Dataset updated
    Feb 20, 2024
    Authors
    Zeinab Aladly
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Zeinab Aladly

    Released under CC0: Public Domain

    Contents

  6. Data from: Medicine Recommendation System Dataset

    • kaggle.com
    zip
    Updated Jan 10, 2024
    Noor Saeed (2024). Medicine Recommendation System Dataset [Dataset]. https://www.kaggle.com/datasets/noorsaeed/medicine-recommendation-system-dataset
    Explore at:
    zip (61254 bytes)
    Dataset updated
    Jan 10, 2024
    Authors
    Noor Saeed
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Noor Saeed

    Released under Apache 2.0

    Contents

  7. MedQuAD: Medical Question-Answer Dataset

    • kaggle.com
    zip
    Updated Sep 7, 2024
    Afroz (2024). MedQuAD: Medical Question-Answer Dataset [Dataset]. https://www.kaggle.com/datasets/pythonafroz/medquad-medical-question-answer-for-ai-research
    Explore at:
    zip (5188686 bytes)
    Dataset updated
    Sep 7, 2024
    Authors
    Afroz
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Medical Questions: Unveiling the MedQuAD Dataset

    Have you ever wondered where medical chatbots or intelligent search engines for health information get their knowledge? The answer lies in large datasets like MedQuAD! This rich resource provides a treasure trove of real-world medical questions and informative answers, paving the way for advancements in Natural Language Processing (NLP) and Information Retrieval (IR) within the healthcare domain.

    What is MedQuAD?

    MedQuAD, short for Medical Question Answering Dataset, is a collection of question-answer pairs meticulously curated from 12 trusted National Institutes of Health (NIH) websites. These websites cover a wide range of health topics, from cancer.gov to GARD (Genetic and Rare Diseases Information Resource).

    What makes MedQuAD unique?

    Beyond the sheer volume of data, MedQuAD offers unique features that empower researchers and developers:

    1. Diversity of Questions: MedQuAD encompasses a spectrum of 37 question types, ranging from treatment options and diagnosis inquiries to understanding side effects. This variety reflects the diverse needs of individuals seeking medical information.
    2. Focus on Specific Entities: MedQuAD goes beyond just questions and answers. It delves deeper by associating each question with the entity it focuses on, such as diseases, drugs, or other medical tests. This targeted approach facilitates more focused research and NLP applications.
    3. Rich Annotations: While the answers from MedlinePlus collections are excluded due to copyright restrictions, MedQuAD retains valuable annotations within its XML files. These annotations include question type, synonyms, unique identifiers (CUI) for medical concepts, and semantic types. This additional information opens doors for more sophisticated NLP tasks.
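
    To make the annotation structure described in the list above concrete, here is a minimal sketch that reads question type, focus, CUI, and semantic type from one XML document using Python's standard xml.etree.ElementTree module. The element and attribute names (Document, Focus, CUI, SemanticType, QAPair, qtype) and every value in SAMPLE_XML are assumptions made for illustration; inspect an actual MedQuAD file and adjust the names before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; tag names, attribute names, and all values are placeholders.
SAMPLE_XML = """
<Document id="0000001" source="ExampleSource">
  <Focus>Example Condition</Focus>
  <FocusAnnotations>
    <UMLS>
      <CUIs><CUI>C0000000</CUI></CUIs>
      <SemanticTypes><SemanticType>T000</SemanticType></SemanticTypes>
    </UMLS>
  </FocusAnnotations>
  <QAPairs>
    <QAPair pid="1">
      <Question qid="0000001-1" qtype="treatment">How is Example Condition treated?</Question>
      <Answer>Placeholder answer text.</Answer>
    </QAPair>
  </QAPairs>
</Document>
"""


def parse_qa_document(xml_text):
    """Return one dict per question-answer pair, carrying the document-level annotations."""
    root = ET.fromstring(xml_text.strip())
    focus = root.findtext("Focus", default="")
    cuis = [el.text for el in root.iter("CUI")]
    semantic_types = [el.text for el in root.iter("SemanticType")]
    pairs = []
    for pair in root.iter("QAPair"):
        question = pair.find("Question")
        pairs.append(
            {
                "focus": focus,
                "cuis": cuis,
                "semantic_types": semantic_types,
                "qtype": question.get("qtype"),
                "question": question.text,
                # Answers may be absent (for example, where excluded for copyright reasons).
                "answer": pair.findtext("Answer", default=""),
            }
        )
    return pairs


if __name__ == "__main__":
    for record in parse_qa_document(SAMPLE_XML):
        print(record["qtype"], "|", record["focus"], "|", record["question"])
```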

    The Power of MedQuAD

    MedQuAD serves as a valuable springboard for various applications in the medical NLP and IR field. Here are some potential uses:

    1. Training Chatbots and Virtual Assistants: AI-powered medical chatbots can leverage MedQuAD to learn how to respond accurately and informatively to a wide range of health inquiries from users.
    2. Developing Intelligent Search Engines: Search engines can be enhanced to provide more relevant and accurate health information by drawing insights from the question types and focuses presented in MedQuAD.
    3. Studying User Concerns in Healthcare: Analyzing the types of questions within MedQuAD can reveal valuable insights into what information users are most interested in and what areas require clearer explanations.

    In essence, MedQuAD is a powerful tool for unlocking the potential of NLP and IR in the medical domain. By leveraging this rich dataset, researchers and developers are paving the way for a future where individuals can access accurate and comprehensive health information with increasing ease and efficiency.

    Reference:

    If you use the MedQuAD dataset or the associated QA test collection, please cite the following paper: Ben Abacha, A., & Demner-Fushman, D. (2019). A Question-Entailment Approach to Question Answering. BMC Bioinformatics, 20(1), 511. https://doi.org/10.1186/s12859-019-3119-4

  8. Data from: Clinical Dataset

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Mohamadreza Momeni (2023). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset
    Explore at:
    zip (16220 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Mohamadreza Momeni
    Description

    The purest type of electronic clinical data is obtained at the point of care at a medical facility, hospital, clinic, or practice. Often referred to as the electronic medical record (EMR), this data is generally not available to outside researchers. It includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

    Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories for eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

    About Dataset:

    333 scholarly articles cite this dataset.

    Unique identifier: DOI

    Dataset updated: 2023

    Authors: Haoyang Mi

    This collection contains two datasets:

    1- Clinical Data_Discovery_Cohort: Name of columns: Patient ID Specimen date Dead or Alive Date of Death Date of last Follow Sex Race Stage Event Time

    2- Clinical_Data_Validation_Cohort Name of columns: Patient ID Survival time (days) Event Tumor size Grade Stage Age Sex Cigarette Pack per year Type Adjuvant Batch EGFR KRAS

    Feel free to share your thoughts and analysis in a notebook for these datasets. You can create some interesting and valuable ML projects with them. Thanks for your attention.

  9. Multilingual Healthcare Text Dataset (Hi, En, Pu)

    • kaggle.com
    zip
    Updated Feb 13, 2025
    Kajol Bagga (2025). Multilingual Healthcare Text Dataset (Hi, En, Pu) [Dataset]. https://www.kaggle.com/datasets/kajolagga/multilingual-healthcare-text-dataset-hi-en-pu
    Explore at:
    zip (421647 bytes)
    Dataset updated
    Feb 13, 2025
    Authors
    Kajol Bagga
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains three healthcare datasets in Hindi and Punjabi, translated from English. The datasets cover medical diagnoses, disease names, and related healthcare information. The data has been carefully cleaned and formatted to ensure accuracy and usability for various applications, including machine learning, NLP, and healthcare analysis.

    • Diagnosis: Description of the medical condition or disease.
    • Symptoms: List of symptoms associated with the diagnosis.
    • Treatment: Common treatments or recommended procedures.
    • Severity: Severity level of the disease (e.g., mild, moderate, severe).
    • Risk Factors: Known risk factors associated with the condition.
    • Language: Specifies the language of the dataset (Hindi, Punjabi, or English).

    The purpose of these datasets is to facilitate research and development in regional language processing, especially in the healthcare sector.

    Column Descriptions:

    Original Data Columns:

    • patient_id: Unique identifier for each patient.
    • age: Age of the patient.
    • gender: Gender of the patient (e.g., Male/Female/Other).
    • Diagnosis: The diagnosed medical condition or disease.
    • Remarks: Additional notes or comments from the doctor.
    • doctor_id: Unique identifier for the doctor treating the patient.
    • Patient History: Medical history of the patient, including previous conditions.
    • age_group: Categorized age group (e.g., Child, Adult, Senior).
    • gender_numeric: Numeric encoding for gender (e.g., 0 = Female, 1 = Male).
    • symptoms: List of symptoms reported by the patient.
    • treatment: Recommended treatment or medication.
    • timespan: Duration of the illness or treatment period.
    • Diagnosis Category: General category of the diagnosis (e.g., Cardiovascular, Neurological).

    Pseudonymized Data Columns: These columns replace personally identifiable information with anonymized versions for privacy compliance:

    • Pseudonymized_patient_id: An anonymized patient identifier.
    • Pseudonymized_age: Anonymized age value.
    • Pseudonymized_gender: Anonymized gender field.
    • Pseudonymized_Diagnosis: Diagnosis field with anonymized identifiers.
    • Pseudonymized_Remarks: Anonymized doctor notes.
    • Pseudonymized_doctor_id: Anonymized doctor identifier.
    • Pseudonymized_Patient History: Anonymized version of patient history.
    • Pseudonymized_age_group: Anonymized version of age groups.
    • Pseudonymized_gender_numeric: Anonymized numeric encoding of gender.
    • Pseudonymized_symptoms: Anonymized symptom descriptions.
    • Pseudonymized_treatment: Anonymized treatment descriptions.
    • Pseudonymized_timespan: Anonymized illness/treatment duration.
    • Pseudonymized_Diagnosis Category: Anonymized category of diagnosis.
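
    The description does not say how the pseudonymized columns were produced. As one possible approach, the sketch below derives a Pseudonymized_ counterpart for selected columns with a keyed hash (HMAC-SHA256) from Python's standard library. The secret key, the 16-character truncation, and the sample record are all assumptions made for illustration, not the dataset author's procedure.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value, secret_key=SECRET_KEY):
    """Map a value to a stable token; the same input and key always give the same token."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability (an arbitrary choice)


def add_pseudonymized_columns(record, columns, secret_key=SECRET_KEY):
    """Return a copy of record with a Pseudonymized_<column> entry for each listed column."""
    result = dict(record)
    for column in columns:
        result[f"Pseudonymized_{column}"] = pseudonymize(record[column], secret_key)
    return result


if __name__ == "__main__":
    # Hypothetical record using column names from the list above.
    record = {"patient_id": "P-001", "doctor_id": "D-042", "age": 37}
    print(add_pseudonymized_columns(record, ["patient_id", "doctor_id"]))
```

    A keyed hash keeps identifiers linkable across rows while making them hard to reverse without the key. Free-text fields such as Remarks need different treatment, which this sketch does not attempt.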

  10. Medical Datasets

    • kaggle.com
    zip
    Updated Aug 3, 2024
    krishnandan sah (2024). Medical Datasets [Dataset]. https://www.kaggle.com/datasets/krishnandansah/medical-datasets
    Explore at:
    zip (61254 bytes)
    Dataset updated
    Aug 3, 2024
    Authors
    krishnandan sah
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by krishnandan sah

    Released under Apache 2.0

    Contents

  11. Medical Appointment No Shows Dataset

    • kaggle.com
    Updated Jul 9, 2022
    Marwan Diab (2022). Medical Appointment No Shows Dataset [Dataset]. https://www.kaggle.com/datasets/marwandiab/medical-appointment-no-shows-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 9, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marwan Diab
    Description

    Medical Appointment No Shows Project

    This analysis is part of the Udacity Data Analysis Nanodegree program and explores a dataset of approximately 100k medical appointments from the Brazilian public health system. It focuses on the question of whether or not patients show up for their appointments; the insights found are reviewed and communicated.

    Data Dictionary

    01 - PatientId: Identification of a patient.
    02 - AppointmentID: Identification of each appointment.
    03 - Gender: Male or Female.
    04 - ScheduledDay: The day someone called or registered the appointment; this is before the appointment.
    05 - Appointment day: The day of the actual appointment.
    06 - Age: How old the patient is.
    07 - Neighbourhood: Where the appointment takes place.
    08 - Scholarship: True or False.
    09 - Hipertension: True or False.
    10 - Diabetes: True or False.
    11 - Alcoholism: True or False.
    12 - Handcap: True or False.
    13 - SMS_received: 1 or more messages sent to the patient.
    14 - No-show: True or False.
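
    A common first step with the fields in this data dictionary is to derive the waiting time between scheduling and the appointment, then look at the no-show rate. The sketch below does this with the standard library only. The three sample rows, the key name AppointmentDay, and the ISO-style timestamp format are assumptions made for illustration; check the actual column names and date format in the file.

```python
from datetime import datetime

# Assumed timestamp layout; adjust if the file stores dates differently.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Hypothetical rows; No-show is True when the patient missed the appointment.
appointments = [
    {"ScheduledDay": "2016-04-27T08:36:51Z", "AppointmentDay": "2016-04-29T00:00:00Z", "No-show": False},
    {"ScheduledDay": "2016-04-29T10:05:12Z", "AppointmentDay": "2016-04-29T00:00:00Z", "No-show": False},
    {"ScheduledDay": "2016-04-01T15:20:00Z", "AppointmentDay": "2016-04-29T00:00:00Z", "No-show": True},
]


def waiting_days(row):
    """Whole days between the scheduling date and the appointment date."""
    scheduled = datetime.strptime(row["ScheduledDay"], TIMESTAMP_FORMAT).date()
    appointment = datetime.strptime(row["AppointmentDay"], TIMESTAMP_FORMAT).date()
    return (appointment - scheduled).days


def no_show_rate(rows):
    """Fraction of appointments that were missed."""
    return sum(1 for row in rows if row["No-show"]) / len(rows)


if __name__ == "__main__":
    for row in appointments:
        print(waiting_days(row), row["No-show"])
    print(f"no-show rate: {no_show_rate(appointments):.2f}")
```

    Comparing dates rather than full timestamps avoids negative waits for same-day appointments whose appointment time is recorded as midnight.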

  12. Healthcare Diabetes Dataset

    • kaggle.com
    zip
    Updated Aug 23, 2023
    Nandita Pore (2023). Healthcare Diabetes Dataset [Dataset]. https://www.kaggle.com/datasets/nanditapore/healthcare-diabetes
    Explore at:
    zip (27316 bytes)
    Dataset updated
    Aug 23, 2023
    Authors
    Nandita Pore
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Description: Welcome to the Diabetes Prediction Dataset, a valuable resource for researchers, data scientists, and medical professionals interested in the field of diabetes risk assessment and prediction. This dataset contains a diverse range of health-related attributes, meticulously collected to aid in the development of predictive models for identifying individuals at risk of diabetes. By sharing this dataset, we aim to foster collaboration and innovation within the data science community, leading to improved early diagnosis and personalized treatment strategies for diabetes.

    Columns:

    1. Id: Unique identifier for each data entry.
    2. Pregnancies: Number of times pregnant.
    3. Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test.
    4. BloodPressure: Diastolic blood pressure (mm Hg).
    5. SkinThickness: Triceps skinfold thickness (mm).
    6. Insulin: 2-Hour serum insulin (mu U/ml).
    7. BMI: Body mass index (weight in kg / height in m^2).
    8. DiabetesPedigreeFunction: Diabetes pedigree function, a genetic score of diabetes.
    9. Age: Age in years.
    10. Outcome: Binary classification indicating the presence (1) or absence (0) of diabetes.

    Utilize this dataset to explore the relationships between various health indicators and the likelihood of diabetes. You can apply machine learning techniques to develop predictive models, feature selection strategies, and data visualization to uncover insights that may contribute to more accurate risk assessments. As you embark on your journey with this dataset, remember that your discoveries could have a profound impact on diabetes prevention and management.

    Please ensure that you adhere to ethical guidelines and respect the privacy of individuals represented in this dataset. Proper citation and recognition of this dataset's source are appreciated to promote collaboration and knowledge sharing.

    Start your exploration of the Diabetes Prediction Dataset today and contribute to the ongoing efforts to combat diabetes through data-driven insights and innovations.

  13. Health Metrics Dataset

    • kaggle.com
    zip
    Updated Jul 22, 2024
    Abhay Ayare (2024). Health Metrics Dataset [Dataset]. https://www.kaggle.com/datasets/abhayayare/health-metrics-dataset
    Explore at:
    zip (46175 bytes)
    Dataset updated
    Jul 22, 2024
    Authors
    Abhay Ayare
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is synthetic, created with the Python faker library. It simulates health metrics for 1,000 individuals, including information on blood pressure, cholesterol levels, BMI, smoking status, and diabetes status. The data was generated randomly, with certain constraints to mimic real-world distributions.

    • Data Generation Date: July 22, 2024
    • Generated by: Abhay Ayare
    • Data Source: Synthetic data generated using Python scripts.
    • Purpose: The dataset is intended for educational and research purposes, allowing users to perform health-related data analysis and machine learning experiments without concerns about privacy and ethical issues related to real patient data.

    Columns Description:

    • Name: Randomly generated names of individuals.
    • Gender: Gender of the individuals (Male/Female).
    • Age: Age of the individuals (18-80 years).
    • Systolic BP: Systolic blood pressure.
    • Diastolic BP: Diastolic blood pressure.
    • Cholesterol: Cholesterol levels.
    • Height (cm): Height of the individuals in centimeters.
    • Weight (kg): Weight of the individuals in kilograms.
    • BMI: Body Mass Index calculated from height and weight.
    • Smoker: Smoking status (True/False).
    • Diabetes: Diabetes status (True/False).
    • Health: Overall health assessment based on combined metrics (Good/Fair/Bad).
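
    The BMI column above is described as calculated from height and weight. A minimal sketch of that calculation follows, using the standard formula (weight in kilograms divided by the square of height in metres) and the units from the Height (cm) and Weight (kg) columns. The function name and the sample values are hypothetical, and the rule behind the Health column is not stated in the description, so it is not reproduced here.

```python
def body_mass_index(weight_kg, height_cm):
    """Return BMI = weight (kg) / height (m)^2, converting height from centimetres."""
    if height_cm <= 0:
        raise ValueError("height_cm must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


if __name__ == "__main__":
    # Hypothetical individual: 70 kg, 175 cm.
    print(round(body_mass_index(70, 175), 1))
```
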
  14. Data from: Clinical Dataset

    • kaggle.com
    Updated Aug 22, 2025
    Laksika Tharmalingam (2025). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/uom190346a/synthetic-clinical-tabular-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Laksika Tharmalingam
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    🏥 Clinical Tabular Dataset (Non-PII, Realistic, Research-Ready)

    Subtitle

    Dataset mimicking real-world patient records for AI research.

    Overview

    This dataset is a synthetically generated clinical tabular dataset designed to closely mimic real-world patient health records while ensuring zero personally identifiable information (PII). It was created using statistical distributions, clinical guidelines, and publicly available medical references to replicate patterns typically observed in hospital and outpatient settings.

    Unlike real EHR datasets, this synthetic dataset is free from privacy restrictions, making it safe to use for AI/ML model training, benchmarking, academic research, and prototyping healthcare applications.

    Dataset Details

    • Number of Records: 10000 patients (scalable to millions)
    • Data Type: Structured, tabular
    • Format: CSV (Comma-separated values)
    • Domain: Healthcare / Clinical informatics
    • PII Free: ✅ (No names, IDs, or sensitive personal details)

    🔍 Columns & Clinical Context

    • Basic demographics: Age, Sex, BMI
    • Vitals: Systolic/Diastolic BP, Glucose, Cholesterol, Creatinine
    • Comorbidities: Diabetes, Hypertension
    • Diagnosis: Normal, Pneumonia, Heart Failure, Sepsis
    • Outcomes: 30-day Readmission, Mortality

    Applications

    This dataset can be used for:

    • Machine Learning: Classification, clustering, regression models
    • Healthcare AI: Predictive modeling for risk factors and disease detection
    • Data Science Education: Hands-on exercises for students
    • Synthetic Data Research: Benchmarking synthetic data generation approaches
    • Fairness & Bias Testing: Evaluating ML models across age, gender, and lifestyle groups

    Why This Dataset?

    • Realistic: Matches clinical ranges and distributions found in actual healthcare data
    • Safe to Share: 100% synthetic, no HIPAA/GDPR concerns
    • Flexible: Can be scaled, modified, or extended with more medical variables
    • High Impact: Fills a major gap in openly available clinical tabular datasets

    Disclaimer

    This dataset is synthetic and for research/educational purposes only. It should not be used for medical decision-making or clinical care.

    Citation

    If you use this dataset, please cite as:

    Synthetic Clinical Tabular Dataset (2025). Generated for ML research and benchmarking.

  15. Arabic Medical Q&A Dataset

    • kaggle.com
    zip
    Updated Dec 8, 2023
    Yassin Abdulmahdi (2023). Arabic Medical Q&A Dataset [Dataset]. https://www.kaggle.com/datasets/yassinabdulmahdi/arabic-medical-q-and-a-dataset
    Explore at:
    zip (20375710 bytes)
    Dataset updated
    Dec 8, 2023
    Authors
    Yassin Abdulmahdi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This comprehensive dataset contains 87,930 medical questions and answers, meticulously compiled from the "medical" website. It offers a unique focus on Arabic language, catering specifically to research and development in medical natural language processing and AI in Arabic-speaking regions.

    Arabic Language Focus: As an Arabic dataset, it offers a valuable resource for developing and testing AI models in a language that is underrepresented in medical NLP research.

    Structured for Machine Learning: The data is organized into three distinct sets:

    • Training Data: The largest portion, designed for AI models to learn and identify patterns.
    • Validation Data: A separate set for fine-tuning and optimizing model parameters.
    • Test Data: A final set to evaluate the performance and accuracy of models in a realistic setting.

  16. Medical Insurance Payout

    • kaggle.com
    zip
    Updated Jun 5, 2022
    Cite
    Harsh Singh (2022). Medical Insurance Payout [Dataset]. https://www.kaggle.com/datasets/harshsingh2209/medical-insurance-payout
    Explore at:
    Available download formats: zip (16423 bytes)
    Dataset updated
    Jun 5, 2022
    Authors
    Harsh Singh
    Description

    ACME Insurance Inc. offers affordable health insurance to thousands of customers all over the United States. You're tasked with creating an automated system to estimate the annual medical expenditure for new customers, using information such as their age, sex, BMI, children, smoking habits and region of residence.

    Estimates from your system will be used to determine the annual insurance premium (amount paid every month) offered to the customer.

  17. FitLife: Health & Fitness Tracking Dataset

    • kaggle.com
    Updated Dec 31, 2024
    Cite
    Mojgan Taheri (2024). FitLife: Health & Fitness Tracking Dataset [Dataset]. https://www.kaggle.com/datasets/jijagallery/fitlife-health-and-fitness-tracking-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mojgan Taheri
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Overview

    FitLife360 is a synthetic dataset that simulates real-world health and fitness tracking data from 3,000 participants over a one-year period. The dataset captures daily activities, vital health metrics, and lifestyle factors, making it valuable for health analytics and predictive modeling.

    Features Description

    Demographic Information

    • participant_id: Unique identifier for each participant
    • age: Age of participant (18-65 years)
    • gender: Gender (M/F/Other)
    • height_cm: Height in centimeters
    • weight_kg: Weight in kilograms
    • bmi: Body Mass Index calculated from height and weight
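The bmi column is described as derived from height and weight. A minimal sketch of the standard BMI formula (weight in kilograms divided by height in metres squared) follows; whether the dataset used exactly this formula or any rounding is an assumption, and the function name `bmi_from_height_weight` is hypothetical.

```python
def bmi_from_height_weight(weight_kg: float, height_cm: float) -> float:
    """Standard BMI: weight (kg) divided by height (m) squared."""
    if height_cm <= 0:
        raise ValueError("height_cm must be positive")
    height_m = height_cm / 100.0  # the dataset stores height in centimetres
    return weight_kg / (height_m ** 2)


# Illustrative values only, not taken from the dataset.
example_bmi = round(bmi_from_height_weight(weight_kg=70.0, height_cm=175.0), 1)
print(example_bmi)
```

Recomputing the value this way and comparing it with the stored bmi column is one way to check a synthetic dataset for internal consistency.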

    Activity Metrics

    • activity_type: Type of exercise (Running, Swimming, Cycling, etc.)
    • duration_minutes: Length of activity session
    • intensity: Exercise intensity (Low/Medium/High)
    • calories_burned: Estimated calories burned during activity
    • daily_steps: Daily step count

    Health Indicators

    • avg_heart_rate: Average heart rate during activity
    • resting_heart_rate: Resting heart rate
    • blood_pressure_systolic: Systolic blood pressure
    • blood_pressure_diastolic: Diastolic blood pressure
    • health_condition: Presence of health conditions
    • smoking_status: Smoking history (Never/Former/Current)

    Lifestyle Metrics

    • hours_sleep: Hours of sleep per night
    • stress_level: Daily stress level (1-10)
    • hydration_level: Daily water intake in liters
    • fitness_level: Calculated fitness score based on cumulative activity

    Potential Use Cases

    1. Health Outcome Prediction

    • Predict risk of health conditions based on activity patterns
    • Forecast potential life expectancy based on health metrics
    • Identify early warning signs of health issues

    2. Weight Management Analysis

    • Develop personalized weight loss prediction models
    • Analyze effectiveness of different activities for weight loss
    • Study the relationship between sleep, stress, and weight management

    3. Fitness Progress Tracking

    • Track fitness level progression over time
    • Analyze the impact of consistent exercise on health metrics
    • Study recovery patterns and optimal training frequencies

    4. Healthcare Analytics

    • Analyze the relationship between lifestyle choices and health outcomes
    • Study the impact of smoking on fitness performance
    • Investigate correlations between sleep patterns and health metrics

    5. Personal Training Applications

    • Develop personalized exercise recommendations
    • Optimize workout intensity based on individual characteristics
    • Create targeted fitness programs based on health conditions

    6. Research Applications

    • Study seasonal patterns in exercise behavior
    • Analyze the relationship between stress and physical activity
    • Research the impact of hydration on exercise performance

  18. Medical Conversation Corpus (100k+)

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Medical Conversation Corpus (100k+) [Dataset]. https://www.kaggle.com/datasets/thedevastator/medical-conversation-corpus-100k
    Explore at:
    Available download formats: zip (46487525 bytes)
    Dataset updated
    Nov 26, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Medical Conversation Corpus (100k+)

    Generative Language Modeling for Medical Applications

    By Huggingface Hub [source]

    About this dataset

    This comprehensive and open-source dataset of 100k+ conversations and instructions that include medical terminologies is perfect for training Generative Language Models for various medical applications. With samples collected from human conversations, this dataset contains a variety of options and suggestions to assist in creating useful language models. From prescribed medications to home remedies such as yoga exercises, breathing exercises, and natural remedies—this collection has it all! Only if you trust the language model you build with the right data can you use it to make decisions that matter in real life. This data is sure to give your project the boost it needs with legitimate information power-packed into every sample!

    How to use the dataset

    • Download the dataset. The dataset can be downloaded by clicking on the “Download” button located at the top of this page and following the prompts.
    • Unzip and save the file in a location of your choice on your computer or device.
    • Open up the ‘train’ or ‘test’ CSV file, depending on whether you would like to use it for training or testing purposes respectively. Both contain conversations and instructions utilizing medical terminologies which can be used to train a generative language model for medical applications.
    • Read through the conversations and instructions provided in each row of the data frame column labeled 'Conversation'. They provide examples of exchanges between doctors, patients, pharmacists, and others on topics such as health advice, natural home remedies, and prescriptions, as well as conversations involving diagnosis, symptoms, medication side effects, and health concerns related to specific medical conditions.
    • Note that the conversations vary in complexity and emphasize effective communication within a healthcare environment, either directly with patients or among colleagues discussing cases through verbal or written exchanges that use medical terminology.
    • Use natural language processing (NLP) techniques, such as BERT embeddings or word embeddings for different domains of medicine, to relate and sort these conversations into categories of interest identified by domain experts, either for mathematical and statistical research or for broader understanding in other languages such as Chinese, Spanish, Portuguese, and French.
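The loading steps above can be sketched with Python's standard `csv` module. This is a minimal illustration that assumes the single 'Conversation' column described for train.csv and test.csv; the two sample rows are made up for the example and are not taken from the dataset, and `load_conversations` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the documented layout (a single 'Conversation' column).
SAMPLE_CSV = (
    "Conversation\n"
    '"Patient: I have a mild headache. Doctor: Please rest and drink water."\n'
    '"Instruction: Explain what a resting heart rate is."\n'
)


def load_conversations(handle):
    """Return the non-empty 'Conversation' values from an open CSV handle."""
    reader = csv.DictReader(handle)
    return [row["Conversation"].strip() for row in reader if row["Conversation"].strip()]


conversations = load_conversations(io.StringIO(SAMPLE_CSV))
for text in conversations:
    # Print a word count and a short preview of each sample.
    print(len(text.split()), "words:", text[:40])
```

For the real file, the same function could be called with `open("train.csv", newline="", encoding="utf-8")` in place of the in-memory sample, assuming the file has been unzipped into the working directory.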

    Research Ideas

    • Natural language processing applications such as automated medical transcription.
    • Feature extraction and detection of health-related keywords for predictive analytics in healthcare applications.
    • Automated diagnostics utilizing the language models trained on this dataset to identify diseases and illnesses based on user inputs, either through symptoms or other risk factors (e.g., age, lifestyle etc.)

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No Copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name  | Description |
    |:-------------|:------------|
    | Conversation | The conversation between two or more people or an instruction utilizing medical terminologies. (String) |

    File: test.csv

    | Column name  | Description |
    |:-------------|:------------|
    | Conversation | The conversation between two or more people or an instruction utilizing medical terminologies. (String) |

  19. Healthcare Insurance

    • kaggle.com
    zip
    Updated Oct 12, 2023
    + more versions
    Cite
    willian oliveira (2023). Healthcare Insurance [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance
    Explore at:
    Available download formats: zip (16425 bytes)
    Dataset updated
    Oct 12, 2023
    Authors
    willian oliveira
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains information on the relationship between personal attributes (age, gender, BMI, family size, smoking habits), geographic factors, and their impact on medical insurance charges. It can be used to study how these features influence insurance costs and develop predictive models for estimating healthcare expenses.

    Age: The insured person's age.

    Sex: Gender (male or female) of the insured.

    BMI (Body Mass Index): A measure of body fat based on height and weight.

    Children: The number of dependents covered.

    Smoker: Whether the insured is a smoker (yes or no).

    Region: The geographic area of coverage.

    Charges: The medical insurance costs incurred by the insured person.

  20. Medical Insurance Cost Dataset

    • kaggle.com
    Updated Aug 24, 2025
    + more versions
    Cite
    Mosap Abdel-Ghany (2025). Medical Insurance Cost Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12853160
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 24, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mosap Abdel-Ghany
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains medical insurance cost information for 1338 individuals. It includes demographic and health-related variables such as age, sex, BMI, number of children, smoking status, and residential region in the US. The target variable is charges, which represents the medical insurance cost billed to the individual.

    The dataset is commonly used for:

    Regression modeling

    Health economics research

    Insurance pricing analysis

    Machine learning education and tutorials

    Columns

    age: Age of primary beneficiary (int)

    sex: Gender of beneficiary (male, female)

    bmi: Body Mass Index, a measure of body fat based on height and weight (float)

    children: Number of children covered by health insurance (int)

    smoker: Smoking status of the beneficiary (yes, no)

    region: Residential region in the US (northeast, northwest, southeast, southwest)

    charges: Medical insurance cost billed to the beneficiary (float)

    Potential Uses

    • Build predictive models for medical costs
    • Explore how smoking and BMI impact charges
    • Teach students about regression and feature engineering
    • Analyze healthcare affordability trends
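As a rough illustration of the feature engineering and exploration mentioned above, the sketch below encodes the documented columns as numbers and compares mean charges by smoking status. The four rows are made up for the example and are not taken from the dataset; `encode_row` and `mean_charges_by_smoker` are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the documented column layout.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,22.0,0,no,northeast,3000.0
45,male,31.5,2,yes,southeast,30000.0
52,female,27.0,1,no,southwest,9000.0
28,male,35.0,0,yes,northwest,36000.0
"""

REGIONS = ["northeast", "northwest", "southeast", "southwest"]


def encode_row(row):
    """Turn one CSV row into (numeric feature list, charges target)."""
    features = [
        float(row["age"]),
        1.0 if row["sex"] == "male" else 0.0,    # binary flag for sex
        float(row["bmi"]),
        float(row["children"]),
        1.0 if row["smoker"] == "yes" else 0.0,  # binary flag for smoker
    ]
    # One-hot encode the region column.
    features += [1.0 if row["region"] == name else 0.0 for name in REGIONS]
    return features, float(row["charges"])


def mean_charges_by_smoker(rows):
    """Average the charges column separately for smokers and non-smokers."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        totals[row["smoker"]][0] += float(row["charges"])
        totals[row["smoker"]][1] += 1
    return {status: total / count for status, (total, count) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
first_features, first_target = encode_row(rows[0])
smoker_means = mean_charges_by_smoker(rows)
print(first_features, first_target)
print(smoker_means)
```

The encoded feature lists could then be passed to any regression library; a linear model with an intercept would normally drop one of the four region indicators to avoid redundant columns.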

Cite
The Devastator (2023). Comprehensive Medical Q&A Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset
Comprehensive Medical Q&A Dataset

Unlocking Healthcare Data with Natural Language Processing

Explore at:
5 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (5126941 bytes)
Dataset updated
Nov 24, 2023
Authors
The Devastator
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

By Huggingface Hub [source]

About this dataset

The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!

How to use the dataset

In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.
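The SQL-style query above can be approximated in plain Python with the standard `csv` module. This is a minimal sketch that assumes the train.csv layout described for this dataset (qtype, Question, Answer); the sample rows are made up, the 'Treatment' label is taken from the query above and may not match the exact spelling or case of the real qtype values, and `select_answers` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the documented train.csv layout.
SAMPLE_CSV = """qtype,Question,Answer
Treatment,What are the treatments for chronic pain?,Example answer A
Symptoms,What are the symptoms of chronic pain?,Example answer B
Treatment,What are the treatments for asthma?,Example answer C
"""


def select_answers(rows, qtype, keyword):
    """Return Answer values whose qtype matches and whose Question contains keyword."""
    keyword = keyword.lower()
    return [
        row["Answer"]
        for row in rows
        if row["qtype"] == qtype and keyword in row["Question"].lower()
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
pain_treatment_answers = select_answers(rows, "Treatment", "pain")
print(pain_treatment_answers)
```

The keyword match here ignores case, whereas the case sensitivity of SQL `LIKE` depends on the database, so results may differ slightly from the query as written.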

Once you have obtained new insights about healthcare from the answers in this dynamic dataset, it is time to act on them. Use that understanding of patient needs to develop educational materials and implement any suggested changes. If you need more criteria for querying the dataset, check whether MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis, so look out for notifications when this happens.

Finally, once your use case has made an impact, remember proper citation etiquette and give credit where credit is due.

Research Ideas

  • Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
  • Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
  • Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No Copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

| Column name | Description |
|:------------|:------------|
| qtype       | The type of medical question. (String) |
| Question    | The medical question posed by the patient. (String) |
| Answer      | The expert response to the medical question. (String) |

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.
