100+ datasets found
  1. Health Care Analytics

    • kaggle.com
    Updated Jan 10, 2022
    Cite
    Abishek Sudarshan (2022). Health Care Analytics [Dataset]. https://www.kaggle.com/datasets/abisheksudarshan/health-care-analytics
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abishek Sudarshan
    Description

    Context

    Part of Janatahack Hackathon in Analytics Vidhya

    Content

    The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.

    MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides a facility to undergo health checks or raises awareness through various stalls (depending on the format of the camp).

    MedCamp has conducted 65 such events over a period of 4 years and sees a high drop-off between “Registration” and the number of people taking tests at the camps. Over the last 4 years, they have stored data for ~110,000 registrations.

    One of the largest costs in arranging these camps is the amount of inventory that must be carried. Carrying more inventory than required incurs unnecessarily high costs. On the other hand, carrying less inventory than the medical checks require leaves people with a bad experience.

    The Process:

    MedCamp employees / volunteers reach out to people and drive registrations.
    During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
    

    Other things to note:

    Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
    For a few camps, there was hardware failure, so some information about date and time of registration is lost.
    MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
    

    Favorable outcome:

    For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
    You need to predict the chances (probability) of having a favourable outcome.
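
    The outcome definition above can be expressed as a small labelling function. This is a minimal sketch, not the hackathon's reference code: the function name, the integer format codes 1 to 3, and the `stalls_visited` argument are assumptions made for illustration; only `health_score` is named in the description.

```python
from typing import Optional


def favourable_outcome(camp_format: int,
                       health_score: Optional[float],
                       stalls_visited: int) -> int:
    """Return 1 for a favourable outcome, else 0.

    Formats 1 and 2: favourable when a health score was recorded.
    Format 3: favourable when at least one stall was visited.
    The format codes and argument names are hypothetical.
    """
    if camp_format in (1, 2):
        return int(health_score is not None)
    if camp_format == 3:
        return int(stalls_visited >= 1)
    raise ValueError(f"unknown camp format: {camp_format}")
```

    A model would then be trained to predict the probability that this label equals 1 for each registration.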
    

    Train / Test split:

    Camps started on or before 31st March 2006 are considered in Train
    Test data is for all camps conducted on or after 1st April 2006.
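
    The date-based split can be sketched as follows. The cutoff date comes from the description; the function name and the idea of passing a camp start date as a `datetime.date` are assumptions, since the actual column names are not given here.

```python
from datetime import date

# Last camp start date that still counts as Train (from the description).
TRAIN_CUTOFF = date(2006, 3, 31)


def assign_split(camp_start: date) -> str:
    """Label a camp as 'train' or 'test' by its start date."""
    return "train" if camp_start <= TRAIN_CUTOFF else "test"
```

    Splitting on camp date rather than at random keeps every test camp later in time than every training camp, which mirrors how the predictions would be used.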
    

    Acknowledgements

    Credits to AV

    Inspiration

    To share with the data science community to jump start their journey in Healthcare Analytics

  2. Synthetic Healthcare Database for Research (SyH-DR)

    • catalog.data.gov
    • healthdata.gov
    • +2more
    Updated Sep 16, 2023
    Cite
    Agency for Healthcare Research and Quality (2023). Synthetic Healthcare Database for Research (SyH-DR) [Dataset]. https://catalog.data.gov/dataset/synthetic-healthcare-database-for-research-syh-dr
    Dataset updated
    Sep 16, 2023
    Dataset provided by
    Agency for Healthcare Research and Quality (http://www.ahrq.gov/)
    Description

    The Agency for Healthcare Research and Quality (AHRQ) created SyH-DR from eligibility and claims files for Medicare, Medicaid, and commercial insurance plans in calendar year 2016. SyH-DR contains data from a nationally representative sample of insured individuals for the 2016 calendar year. SyH-DR uses synthetic data elements at the claim level to resemble the marginal distribution of the original data elements. SyH-DR person-level data elements are not synthetic, but identifying information is aggregated or masked.

  3. Data from: Generating Heterogeneous Big Data Set for Healthcare and...

    • data.mendeley.com
    Updated Jan 23, 2023
    Cite
    Omar Al-Obidi (2023). Generating Heterogeneous Big Data Set for Healthcare and Telemedicine Research Based on ECG, Spo2, Blood Pressure Sensors, and Text Inputs: Data set classified, Analyzed, Organized, And Presented in Excel File Format. [Dataset]. http://doi.org/10.17632/gsmjh55sfy.1
    Dataset updated
    Jan 23, 2023
    Authors
    Omar Al-Obidi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This work presents a heterogeneous big dataset comprising electrocardiogram (ECG) signals, blood pressure signals, oxygen saturation (SpO2) signals, and text inputs. It extends the dataset formulation presented in [1]; the signals were acquired from PhysioNet [2], a trustworthy and relevant medical dataset library. The dataset includes medical features from heterogeneous sources (sensory and non-sensory data). First, ECG sensor signals, which contain QRS width, ST elevation, peak numbers, and cycle interval. Second, the SpO2 level from SpO2 sensor signals. Third, blood pressure sensor signals, which contain high (systolic) and low (diastolic) values. Finally, text input, which is considered non-sensory data. The text inputs were formulated based on doctors' diagnostic procedures for chronic heart diseases. A Python software environment was used, and the simulated big data is presented along with analyses.

  4. Healthcare Dataset

    • cubig.ai
    Updated May 2, 2025
    Cite
    CUBIG (2025). Healthcare Dataset [Dataset]. https://cubig.ai/store/products/176/healthcare-dataset
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
    Description

    1) Data Introduction
    • The Healthcare Dataset is a synthetic dataset designed to mimic real-world healthcare data for data science, machine learning, and data analysis purposes. It includes patient information, medical conditions, admission details, and healthcare services provided. This dataset is ideal for developing and testing healthcare predictive models, practicing data manipulation techniques, and creating data visualizations.

    2) Data Utilization
    (1) Characteristics of the healthcare data:
    • It includes detailed patient information such as age, gender, blood type, medical condition, and admission details. This information can be used to analyze healthcare trends, patient demographics, and the effectiveness of medical treatments.
    (2) Uses of the healthcare data:
    • Predictive Modeling: Helps in developing models to predict patient outcomes, treatment success rates, and disease progression.
    • Healthcare Analytics: Assists in analyzing patient data to identify patterns, improve patient care, and optimize resource allocation.
    • Educational Purposes: Supports learning and teaching data science concepts in a healthcare context, providing realistic data for experimentation and practice.

  5. Minimum Hospital Data Set

    • healthinformationportal.eu
    html
    Updated Mar 4, 2022
    Cite
    Federal Public Service (FPS) Health, Food Chain Safety, and Environment (2022). Minimum Hospital Data Set [Dataset]. https://www.healthinformationportal.eu/health-information-sources/minimum-hospital-data-set
    Available download formats: html
    Dataset updated
    Mar 4, 2022
    Dataset authored and provided by
    Federal Public Service (FPS) Health, Food Chain Safety, and Environment
    License

    https://fair.healthdata.be/dataset/12d69eca-4449-47d2-943d-e4448a467292

    Variables measured
    sex, title, topics, acronym, country, language, data_owners, description, contact_name, geo_coverage, and 14 more
    Measurement technique
    Hospital resources & Healthcare administrative area resources
    Description

    The MZG is a registration with which all non-psychiatric hospitals in Belgium must make their (anonymised) administrative, medical and nursing data available to the Federal Public Service (FPS) Public Health. The aim of the MZG is to support the government's health policy by

    • Determining the needs for hospital facilities;
    • Describing the qualitative and quantitative accreditation standards of hospitals and their services;
    • Organising the financing of hospitals;
    • Determining policy for the practice of medicine;
    • Outlining epidemiological policy.

    The MZG also aims to support the health policy of hospitals by providing national and individual feedback so that a hospital can compare itself with other hospitals and adapt its internal policy.

    All reports can be found here (in French/Dutch).

  6. Office-based Health Care Providers Database

    • catalog.data.gov
    • data.virginia.gov
    • +3more
    Updated Jul 11, 2025
    Cite
    Office of the National Coordinator for Health Information Technology (2025). Office-based Health Care Providers Database [Dataset]. https://catalog.data.gov/dataset/office-based-health-care-providers-database
    Dataset updated
    Jul 11, 2025
    Description

    ONC uses the SK&A Office-based Provider Database to calculate the counts of medical doctors, doctors of osteopathy, nurse practitioners, and physician assistants at the state and county level from 2011 through 2013. These counts are grouped as a total, segmented by each provider type, and reported separately as counts of primary care providers.

  7. CarePrecise Collection U.S. HCP/HCO Dataset

    • datarade.ai
    .csv
    Updated Oct 27, 2021
    Cite
    CarePrecise (2021). CarePrecise Collection U.S. HCP/HCO Dataset [Dataset]. https://datarade.ai/data-products/careprecise-collection-u-s-hcp-hco-dataset-careprecise
    Available download formats: .csv
    Dataset updated
    Oct 27, 2021
    Dataset authored and provided by
    CarePrecise
    Area covered
    United States of America
    Description

    The CarePrecise U.S. HCP/HCO Collection Dataset includes deep data on all 6.7 million U.S. HIPAA-covered healthcare practitioners and organizations. Monthly full updates. Includes linkages between the individual practitioners and their practice groups, hospitals, and hospital systems. Licensing plans are available for basic (internal use), derivative products, and redistribution. Data updates are delivered quarterly or monthly to suit customer need; FTP push is available, standard delivery is via CDN. Single download for evaluation is available. CarePrecise is a leader in the fields of HCP/HCO data, supplying provider data to the industry since 2008. Note regarding pricing: The Collection price shown in Pricing is separate from email addresses. Email addresses are priced as low as $0.075 per, based on volume. Pricing shown is without derivative product (DP) licensing for use in web applications; DP license ranges in price from $1,900/year to $9,000/year on top of data purchase, based on application and overall exposure estimate. DP license is sold in two-year term and requires a license agreement.

  8. Medical_cost_dataset

    • kaggle.com
    Updated Aug 19, 2023
    Cite
    Nandita Pore (2023). Medical_cost_dataset [Dataset]. https://www.kaggle.com/datasets/nanditapore/medical-cost-dataset
    Dataset updated
    Aug 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nandita Pore
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore the intricacies of medical costs and healthcare expenses with our meticulously curated Medical Cost Dataset. This dataset offers valuable insights into the factors influencing medical charges, enabling researchers, analysts, and healthcare professionals to gain a deeper understanding of the dynamics within the healthcare industry.

    Columns:
    1. ID: A unique identifier assigned to each individual record, facilitating efficient data management and analysis.
    2. Age: The age of the patient, providing a crucial demographic factor that often correlates with medical expenses.
    3. Sex: The gender of the patient, offering insights into potential cost variations based on biological differences.
    4. BMI: The Body Mass Index (BMI) of the patient, indicating the relative weight status and its potential impact on healthcare costs.
    5. Children: The number of children or dependents covered under the medical insurance, influencing family-related medical expenses.
    6. Smoker: A binary indicator of whether the patient is a smoker or not, as smoking habits can significantly impact healthcare costs.
    7. Region: The geographic region of the patient, helping to understand regional disparities in healthcare expenditure.
    8. Charges: The medical charges incurred by the patient, serving as the target variable for analysis and predictions.
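
    Given the columns listed above, separating the predictors from the Charges target might look like the sketch below. The exact header spellings, the value encodings, and the two sample rows are made up for illustration and are not taken from the dataset; the function name is hypothetical.

```python
import csv
import io

# Column names follow the description; exact header spelling is an assumption.
FEATURES = ["Age", "Sex", "BMI", "Children", "Smoker", "Region"]
TARGET = "Charges"

# Made-up rows for illustration only, not real records from the dataset.
SAMPLE = """ID,Age,Sex,BMI,Children,Smoker,Region,Charges
1,30,female,25.0,0,no,southwest,1000.0
2,50,male,31.5,2,yes,northeast,20000.0
"""


def split_features_target(csv_text: str):
    """Return (X, y): feature dicts without ID, and Charges as floats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    X = [{name: row[name] for name in FEATURES} for row in rows]
    y = [float(row[TARGET]) for row in rows]
    return X, y
```

    The ID column is dropped from the features because, per the description, it is only a record identifier.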

    Whether you're aiming to uncover patterns in medical billing, predict future healthcare costs, or explore the relationships between different variables and charges, our Medical Cost Dataset provides a robust foundation for your research. Researchers can utilize this dataset to develop data-driven models that enhance the efficiency of healthcare resource allocation, insurers can refine pricing strategies, and policymakers can make informed decisions to improve the overall healthcare system.

    Unlock the potential of healthcare data with our comprehensive Medical Cost Dataset. Gain insights, make informed decisions, and contribute to the advancement of healthcare economics and policy. Start your analysis today and pave the way for a healthier future.

  9. AI Training Dataset In Healthcare Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 20, 2025
    Cite
    Archive Market Research (2025). AI Training Dataset In Healthcare Market Report [Dataset]. https://www.archivemarketresearch.com/reports/ai-training-dataset-in-healthcare-market-5352
    Available download formats: pdf, ppt, doc
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    global
    Variables measured
    Market Size
    Description

    The AI Training Dataset In Healthcare Market was valued at USD 341.8 million in 2023 and is projected to reach USD 1464.13 million by 2032, exhibiting a CAGR of 23.1% during the forecast period. The growth is attributed to the rising adoption of AI in healthcare, increasing demand for accurate and reliable training datasets, government initiatives to promote AI in healthcare, and technological advancements in data collection and annotation. Healthcare AI training datasets are vital for building effective algorithms and for enhancing patient care and diagnosis in the industry. These datasets include large volumes of electronic health records, images such as X-ray and MRI scans, and genomics data, all thoroughly labeled. They help AI systems identify trends, make forecasts, and even develop unique approaches to treating disease. However, patient privacy and the ethical use of patient information are of the utmost importance, requiring high levels of anonymization and compliance with laws such as HIPAA. Ongoing expansion and diversification of datasets are crucial to address existing bias and to improve the efficiency of AI for different populations and diseases, providing safer solutions for global health.
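
    The growth figures can be cross-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below is only arithmetic on the numbers quoted above; the function and variable names are assumptions. Under this formula, compounding from 2023 to 2032 (nine years) implies a rate of roughly 17.5%, whereas a 23.1% rate reaches the projected value in about seven years. The description does not say which window the 23.1% refers to.

```python
import math


def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate over the given number of years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def years_to_reach(start_value: float, end_value: float, rate: float) -> float:
    """Years of compounding at `rate` needed to grow start into end."""
    return math.log(end_value / start_value) / math.log(1.0 + rate)


START_USD_M = 341.8    # 2023 valuation quoted in the description
END_USD_M = 1464.13    # 2032 projection quoted in the description

implied_rate = cagr(START_USD_M, END_USD_M, 2032 - 2023)
implied_years = years_to_reach(START_USD_M, END_USD_M, 0.231)
```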

  10. Healthcare Payments Data Snapshot

    • data.ca.gov
    • data.chhs.ca.gov
    • +2more
    csv, pdf, zip
    Updated Jul 29, 2025
    Cite
    Department of Health Care Access and Information (2025). Healthcare Payments Data Snapshot [Dataset]. https://data.ca.gov/dataset/healthcare-payments-data-snapshot
    Available download formats: pdf, csv, zip
    Dataset updated
    Jul 29, 2025
    Dataset authored and provided by
    Department of Health Care Access and Information
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data for the Healthcare Payments Data (HPD) Snapshot visualization. The Enrollment data file contains counts of claims and encounter data collected for California's statewide HPD Program. It includes counts of enrollment records, service records from medical and pharmacy claims, and the number of individuals represented across these records. Aggregate counts are grouped by payer type (Commercial, Medi-Cal, or Medicare), product type, and year. The Medical data file contains counts of medical procedures from medical claims and encounter data in HPD. Procedures are categorized using claim line procedure codes and grouped by year, type of setting (e.g., outpatient, laboratory, ambulance), and payer type. The Pharmacy data file contains counts of drug prescriptions from pharmacy claims and encounter data in HPD. Prescriptions are categorized by name and drug class using the reported National Drug Code (NDC) and grouped by year, payer type, and whether the drug dispensed is branded or a generic.

  11. AHD: Arabic Healthcare Dataset

    • data.mendeley.com
    Updated Sep 4, 2024
    + more versions
    Cite
    Hezam Gawbah (2024). AHD: Arabic Healthcare Dataset [Dataset]. http://doi.org/10.17632/mgj29ndgrk.6
    Dataset updated
    Sep 4, 2024
    Authors
    Hezam Gawbah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Language-centric research on healthcare is growing steadily. To address shortcomings of Arabic natural language generation models, we introduce a large Arabic Healthcare Dataset of textual data, which we named ‘AHD’.
    • To our knowledge, AHD is the largest Arabic healthcare dataset; it was collected from the altibbi website.

    • The AHD consists of more than 808k question-and-answer pairs in 90 categories. The AHD contains one file of actual data, which is in Arabic; the file descriptions follow.

      • AHD.xlsx file contains the dataset in Excel format, which includes the question, answer, and category in Arabic.

      • AHD_english.xlsx file contains the dataset in Excel format, which includes the question, answer, and category translated to English.

    • Distribution of Question and Answer per category.xlsx shows the distribution of the dataset by category.

  12. Open Database of Healthcare Facilities

    • open.canada.ca
    • catalogue.arctic-sdi.org
    • +1more
    csv, esri rest +4
    Updated Mar 2, 2022
    + more versions
    Cite
    Statistics Canada (2022). Open Database of Healthcare Facilities [Dataset]. https://open.canada.ca/data/en/dataset/a1bcd4ee-8e57-499b-9c6f-94f6902fdf32
    Available download formats: fgdb/gdb, esri rest, csv, html, pdf, wms
    Dataset updated
    Mar 2, 2022
    Dataset provided by
    Statistics Canada
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    The Open Database of Healthcare Facilities (ODHF) is a collection of open data containing the names, types, and locations of health facilities across Canada. It is released under the Open Government License - Canada. The ODHF compiles open, publicly available, and directly-provided data on health facilities across Canada. Data sources include regional health authorities, provincial, territorial and municipal governments, and public health and professional healthcare bodies. This database aims to provide enhanced access to a harmonized listing of health facilities across Canada by making them available as open data. This database is a component of the Linkable Open Data Environment (LODE).

  13. Healthcare Dataset

    • gts.ai
    json
    Updated Oct 19, 2024
    Cite
    GTS (2024). Healthcare Dataset [Dataset]. https://gts.ai/dataset-download/healthcare-dataset/
    Available download formats: json
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Explore our synthetic healthcare dataset designed for machine learning, data science, and healthcare analytics.

  14. Reduced Access to Care During COVID-19

    • catalog.data.gov
    • cloud.csiss.gmu.edu
    • +3more
    Updated Apr 23, 2025
    + more versions
    Cite
    Centers for Disease Control and Prevention (2025). Reduced Access to Care During COVID-19 [Dataset]. https://catalog.data.gov/dataset/reduced-access-to-care-during-covid-19
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Description

    The Research and Development Survey (RANDS) is a platform designed for conducting survey question evaluation and statistical research. RANDS is an ongoing series of surveys from probability-sampled commercial survey panels used for methodological research at the National Center for Health Statistics (NCHS). RANDS estimates are generated using an experimental approach that differs from the survey design approaches generally used by NCHS, including possible biases from different response patterns and sampling frames as well as increased variability from lower sample sizes. Use of the RANDS platform allows NCHS to produce more timely data than would be possible using traditional data collection methods. RANDS is not designed to replace NCHS’ higher quality, core data collections. Below are experimental estimates of reduced access to healthcare for three rounds of RANDS during COVID-19. Data collection for the three rounds of RANDS during COVID-19 occurred between June 9, 2020 and July 6, 2020, August 3, 2020 and August 20, 2020, and May 17, 2021 and June 30, 2021. Information needed to interpret these estimates can be found in the Technical Notes. RANDS during COVID-19 included questions about unmet care in the last 2 months during the coronavirus pandemic. Unmet needs for health care are often the result of cost-related barriers. The National Health Interview Survey, conducted by NCHS, is the source for high-quality data to monitor cost-related health care access problems in the United States. For example, in 2018, 7.3% of persons of all ages reported delaying medical care due to cost and 4.8% reported needing medical care but not getting it due to cost in the past year. However, cost is not the only reason someone might delay or not receive needed medical care. 
As a result of the coronavirus pandemic, people also may not get needed medical care due to cancelled appointments, cutbacks in transportation options, fear of going to the emergency room, or an altruistic desire to not be a burden on the health care system, among other reasons. The Household Pulse Survey (https://www.cdc.gov/nchs/covid19/pulse/reduced-access-to-care.htm), an online survey conducted in response to the COVID-19 pandemic by the Census Bureau in partnership with other federal agencies including NCHS, also reports estimates of reduced access to care during the pandemic (beginning in Phase 1, which started on April 23, 2020). The Household Pulse Survey reports the percentage of adults who delayed medical care in the last 4 weeks or who needed medical care at any time in the last 4 weeks for something other than coronavirus but did not get it because of the pandemic. The experimental estimates on this page are derived from RANDS during COVID-19 and show the percentage of U.S. adults who were unable to receive medical care (including urgent care, surgery, screening tests, ongoing treatment, regular checkups, prescriptions, dental care, vision care, and hearing care) in the last 2 months. Technical Notes: https://www.cdc.gov/nchs/covid19/rands/reduced-access-to-care.htm#limitations

  15. Data from: Health Care Cost Growth

    • catalog.data.gov
    • data.ok.gov
    • +3more
    Updated Nov 22, 2024
    Cite
    OKStateStat (2024). Health Care Cost Growth [Dataset]. https://catalog.data.gov/dataset/health-care-cost-growth
    Dataset updated
    Nov 22, 2024
    Dataset provided by
    OKStateStat
    Description

    Limit state-purchased health care cost growth to 2% less than the projected national health expenditures average every year through 2019.

  16. Arabic Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Arabic Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/arabic-healthcare-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Arabic Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Arabic-speaking regions.

    Participant & Chat Overview

    Participants: 150+ native Arabic speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of Arabic healthcare communication and includes:

    Authentic Naming Patterns: Arabic personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Arabic formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Arabic-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines
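
    A conversation record with the components listed above could be read as in the sketch below. The actual schema is not given in this description, so every field name here (`conversation_id`, `metadata`, `messages`, `speaker`, `text`) and the record itself are hypothetical placeholders.

```python
import json
from collections import Counter

# Hypothetical record shape; the real dataset's field names may differ.
RECORD = json.loads("""
{
  "conversation_id": "example-001",
  "metadata": {"topic": "appointment_scheduling",
               "region": "example",
               "sentiment": "neutral"},
  "messages": [
    {"speaker": "customer", "text": "..."},
    {"speaker": "agent", "text": "..."},
    {"speaker": "customer", "text": "..."}
  ]
}
""")


def turns_per_speaker(record: dict) -> Counter:
    """Count dialogue turns for each speaker label in one conversation."""
    return Counter(message["speaker"] for message in record["messages"])
```

    Counting turns per speaker is one simple sanity check before feeding conversations into an NLP pipeline.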

    Applications


  17. National Health Care Practitioner Database (NHCPD)

    • catalog.data.gov
    • datahub.va.gov
    • +2more
    Updated Apr 26, 2021
    + more versions
    Cite
    Department of Veterans Affairs (2021). National Health Care Practitioner Database (NHCPD) [Dataset]. https://catalog.data.gov/dataset/national-health-care-practitioner-database-nhcpd
    Dataset updated
    Apr 26, 2021
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Description

    This database is part of the National Medical Information System (NMIS). The National Health Care Practitioner Database (NHCPD) supports Veterans Health Administration Privacy Act requirements by segregating personal information about health care practitioners such as name and social security number from patient information recorded in the National Patient Care Database for Ambulatory Care Reporting and Primary Care Management Module.

  18. Healthcare Payments Data (HPD) Healthcare Measures

    • catalog.data.gov
    • data.ca.gov
    • +3more
    Updated Jul 23, 2025
    + more versions
    Cite
    Department of Health Care Access and Information (2025). Healthcare Payments Data (HPD) Healthcare Measures [Dataset]. https://catalog.data.gov/dataset/healthcare-payments-data-hpd-healthcare-measures-9f673
    Dataset updated
    Jul 23, 2025
    Dataset provided by
    Department of Health Care Access and Information
    Description

    This dataset contains data for the Healthcare Payments Data (HPD) Healthcare Measures report. The data cover three measurement categories: Health conditions, Utilization, and Demographics. The health condition measurements quantify the prevalence of long-term illnesses and major medical events prominent in California’s communities like diabetes and heart failure. Utilization measures convey rates of healthcare system use through visits to the emergency department and different categories of inpatient stays, such as maternity or surgical stays. The demographic measures describe the health coverage and other characteristics (e.g., age) of the Californians included in the data and represented in the other measures. The data include both a count or sum of each measure and a count of the base population so that data users can calculate the percentages, rates, and averages in the visualization. Measures are grouped by year, age band, sex (assigned sex at birth), payer type, Covered California Region, and county.
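
The description says each measure ships with both a count (or sum) and a base-population count so that users can derive percentages and rates. A minimal sketch of that arithmetic follows. The column names `measure_count` and `base_population` and the numbers are made up for illustration and are not taken from the published files.

```python
def rate(measure_count, base_population, per=100):
    """Return measure_count per `per` members of the base population.

    Returns None when the base population is zero or negative, since the
    rate is undefined in that case.
    """
    if base_population <= 0:
        return None
    return per * measure_count / base_population


# Illustrative rows only: values and column names are assumptions.
rows = [
    {"group": "A", "measure_count": 50, "base_population": 1000},
    {"group": "B", "measure_count": 5, "base_population": 0},
]

percentages = {row["group"]: rate(row["measure_count"], row["base_population"]) for row in rows}
```

When combining groups (for example several age bands), sum the counts and the base populations first and divide afterwards, rather than averaging the per-group rates.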

  19. Statistics of the ORBDA source database content at the dataset and patient...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Douglas Teodoro; Erik Sundvall; Mario João Junior; Patrick Ruch; Sergio Miranda Freire (2023). Statistics of the ORBDA source database content at the dataset and patient levels. [Dataset]. http://doi.org/10.1371/journal.pone.0190028.t002
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Douglas Teodoro; Erik Sundvall; Mario João Junior; Patrick Ruch; Sergio Miranda Freire
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Statistics of the ORBDA source database content at the dataset and patient levels.

  20. Dataplex: All CMS Data Feeds | Access 1519 Reports & 26B+ Rows of Contact...

    • datarade.ai
    .csv
    Updated Aug 29, 2024
    + more versions
    Cite
    Dataplex (2024). Dataplex: All CMS Data Feeds | Access 1519 Reports & 26B+ Rows of Contact Data | Perfect for Historical Analysis & Easy Ingestion [Dataset]. https://datarade.ai/data-products/dataplex-all-cms-data-feeds-access-1519-reports-26b-row-dataplex-3b76
    Available download formats: .csv
    Dataset updated
    Aug 29, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    United States of America
    Description

    The All CMS Data Feeds dataset is an expansive resource offering access to 119 unique report feeds, providing in-depth insights into various aspects of the U.S. healthcare system including nursing facility owners and accountable care organization participants contact data. With over 25.8 billion rows of data meticulously collected since 2007, this dataset is invaluable for healthcare professionals, analysts, researchers, and businesses seeking to understand and analyze healthcare trends, performance metrics, and demographic shifts over time. The dataset is updated monthly, ensuring that users always have access to the most current and relevant data available.

    Dataset Overview:

    118 Report Feeds:

    • The dataset includes a wide array of report feeds, each providing unique insights into different dimensions of healthcare. Topics include Medicare and Medicaid service metrics, patient demographics, provider information, financial data, and much more. The breadth of information ensures that users can find relevant data for nearly any healthcare-related analysis.
    • As CMS releases new report feeds, they are automatically added to this dataset, keeping it current and expanding its utility for users.

    25.8 Billion Rows of Data:

    • With over 25.8 billion rows of data, this dataset provides a comprehensive view of the U.S. healthcare system. This extensive volume of data allows for granular analysis, enabling users to uncover insights that might be missed in smaller datasets. The data is also meticulously cleaned and aligned, ensuring accuracy and ease of use.

    Historical Data Since 2007:

    • The dataset spans from 2007 to the present, offering a rich historical perspective that is essential for tracking long-term trends and changes in healthcare delivery, policy impacts, and patient outcomes. This historical data is particularly valuable for conducting longitudinal studies and evaluating the effects of various healthcare interventions over time.

    Monthly Updates:

    • To ensure that users have access to the most current information, the dataset is updated monthly. These updates include new reports as well as revisions to existing data, making the dataset a continuously evolving resource that stays relevant and accurate.

    Data Sourced from CMS:

    • The data in this dataset is sourced directly from the Centers for Medicare & Medicaid Services (CMS). After collection, the data is meticulously cleaned and its attributes are aligned, ensuring consistency, accuracy, and ease of use for any application. Furthermore, any new updates or releases from CMS are automatically integrated into the dataset, keeping it comprehensive and current.

    Use Cases:

    Market Analysis:

    • The dataset is ideal for market analysts who need to understand the dynamics of the healthcare industry. The extensive historical data allows for detailed segmentation and analysis, helping users identify trends, market shifts, and growth opportunities. The comprehensive nature of the data enables users to perform in-depth analyses of specific market segments, making it a valuable tool for strategic decision-making.

    Healthcare Research:

    • Researchers will find the All CMS Data Feeds dataset to be a robust foundation for academic and commercial research. The historical data, combined with the breadth of coverage across various healthcare metrics, supports rigorous, in-depth analysis. Researchers can explore the effects of healthcare policies, study patient outcomes, analyze provider performance, and more, all within a single, comprehensive dataset.

    Performance Tracking:

    • Healthcare providers and organizations can use the dataset to track performance metrics over time. By comparing data across different periods, organizations can identify areas for improvement, monitor the effectiveness of initiatives, and ensure compliance with regulatory standards. The dataset provides the detailed, reliable data needed to track and analyze key performance indicators.

    Compliance and Regulatory Reporting:

    • The dataset is also an essential tool for compliance officers and those involved in regulatory reporting. With detailed data on provider performance, patient outcomes, and healthcare utilization, the dataset helps organizations meet regulatory requirements, prepare for audits, and ensure adherence to best practices. The accuracy and comprehensiveness of the data make it a trusted resource for regulatory compliance.

    Data Quality and Reliability:

    The All CMS Data Feeds dataset is designed with a strong emphasis on data quality and reliability. Each row of data is meticulously cleaned and aligned, ensuring that it is both accurate and consistent. This attention to detail makes the dataset a trusted resource for high-stakes applications, where data quality is critical.

    Integration and Usability:

    Ease of Integration:

    • The dataset is provided in a CSV format, which is widely compatible with most data analysis too...
Cite
Abishek Sudarshan (2022). Health Care Analytics [Dataset]. https://www.kaggle.com/datasets/abisheksudarshan/health-care-analytics

Health Care Analytics

Predicting Patient Outcome

Dataset updated
Jan 10, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Abishek Sudarshan
Description

Context

Part of Janatahack Hackathon in Analytics Vidhya

Content

The healthcare sector has long been an early adopter of technological advances and has benefited greatly from them. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.

MedCamp organizes health camps in several cities with poor work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or to increase awareness by visiting various stalls (depending on the format of the camp).

MedCamp has conducted 65 such events over a period of 4 years, and they see a high drop-off between “Registration” and the number of people taking tests at the camps. Over these 4 years, they have stored data on ~110,000 registrations.

One of the largest costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less inventory than required for conducting these medical checks, people end up having a bad experience.

The Process:

MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.

Other things to note:

Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was hardware failure, so some information about date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

Favorable outcome:

For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favourable outcome.
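
The outcome definition above can be sketched as a small labelling function. This is only an illustration under stated assumptions: the argument names `camp_format`, `health_score`, and `stalls_visited` are hypothetical, and the actual competition files may encode these fields differently.

```python
def favourable_outcome(camp_format, health_score=None, stalls_visited=0):
    """Return True if an attendance record counts as a favourable outcome.

    Formats 1 and 2: favourable when a health score was recorded.
    Format 3: favourable when at least one stall was visited.
    Argument names are assumptions, not the dataset's column names.
    """
    if camp_format in (1, 2):
        return health_score is not None
    if camp_format == 3:
        return stalls_visited >= 1
    raise ValueError(f"unknown camp format: {camp_format}")


labels = [
    favourable_outcome(1, health_score=0.42),  # score recorded
    favourable_outcome(2),                     # registered, no score
    favourable_outcome(3, stalls_visited=2),   # visited stalls
    favourable_outcome(3),                     # visited no stall
]
```

The binary label produced this way is the target whose probability the task asks you to predict.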

Train / Test split:

Camps started on or before 31st March 2006 are considered in Train
Test data is for all camps conducted on or after 1st April 2006.
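
A minimal sketch of this date-based split, assuming each camp is represented as a dictionary with a hypothetical `start_date` field. The camp identifiers and dates below are invented for illustration.

```python
from datetime import date

# Last start date that still belongs to the training period, per the rule above.
TRAIN_CUTOFF = date(2006, 3, 31)


def split_camps(camps):
    """Split camps into (train, test) by start date.

    `start_date` is an assumed field name; the real files may differ.
    """
    train = [camp for camp in camps if camp["start_date"] <= TRAIN_CUTOFF]
    test = [camp for camp in camps if camp["start_date"] > TRAIN_CUTOFF]
    return train, test


# Invented example camps.
camps = [
    {"camp_id": "c1", "start_date": date(2005, 11, 2)},
    {"camp_id": "c2", "start_date": date(2006, 3, 31)},
    {"camp_id": "c3", "start_date": date(2006, 4, 1)},
]

train_camps, test_camps = split_camps(camps)
```

Splitting by time rather than at random keeps later camps out of training, which matches how the model would be used to forecast attendance for future camps.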

Acknowledgements

Credits to AV

Inspiration

To share with the data science community to jump start their journey in Healthcare Analytics
