Part of Janatahack Hackathon in Analytics Vidhya
The healthcare sector has long been an early adopter of technological advances and has benefited greatly from them. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.
MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides facilities to undergo health checks or to increase awareness by visiting various stalls (depending on the format of the camp).
MedCamp has conducted 65 such events over a period of 4 years, and they see a high drop-off between “Registration” and the number of people taking tests at the camps. Over the last 4 years, they have stored data on ~110,000 registrations.
One of the largest costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less inventory than required for conducting these medical checks, people end up having a bad experience.
The Process:
MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
Other things to note:
Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was a hardware failure, so some information about the date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
Favorable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favourable outcome.
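The outcome definition above can be sketched as a small labelling function. This is only an illustration: the argument names (`camp_format`, `health_score`, `stalls_visited`) are hypothetical, and the actual column names in the competition files may differ.

```python
from typing import Optional


def favourable_outcome(camp_format: int,
                       health_score: Optional[float],
                       stalls_visited: int) -> int:
    """Return 1 for a favourable outcome, 0 otherwise.

    Formats 1 and 2: favourable if a health score was recorded.
    Format 3: favourable if at least one stall was visited.
    All argument names are illustrative assumptions.
    """
    if camp_format in (1, 2):
        return int(health_score is not None)
    if camp_format == 3:
        return int(stalls_visited >= 1)
    raise ValueError(f"unknown camp format: {camp_format}")
```

A model would then be trained to predict the probability that this label equals 1 for each registration.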
Train / Test split:
Camps that started on or before 31st March 2006 make up the Train set.
The Test set covers all camps conducted on or after 1st April 2006.
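A minimal sketch of this time-based split, assuming each camp record carries a start date under a hypothetical `camp_start` key (the real files may name and format this field differently):

```python
from datetime import date

# Last camp start date that still belongs to Train, per the description.
TRAIN_CUTOFF = date(2006, 3, 31)


def split_camps(camps):
    """Split camp records into (train, test) by camp start date.

    Each record is a dict with a hypothetical 'camp_start' key holding a date.
    """
    train = [c for c in camps if c["camp_start"] <= TRAIN_CUTOFF]
    test = [c for c in camps if c["camp_start"] > TRAIN_CUTOFF]
    return train, test


# Invented camps, for illustration only.
camps = [
    {"camp_id": "A", "camp_start": date(2005, 11, 2)},
    {"camp_id": "B", "camp_start": date(2006, 3, 31)},
    {"camp_id": "C", "camp_start": date(2006, 4, 1)},
]
train, test = split_camps(camps)
```

Splitting by date rather than at random keeps later camps out of training, which mirrors how the model would be used on future camps.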
Credits to AV
To share with the data science community to jump start their journey in Healthcare Analytics
https://creativecommons.org/publicdomain/zero/1.0/
Attrition of nurses in the US Healthcare system is at an all-time high. It is a major area of focus, especially for hospitals.
This dataset contains employee and company data useful for supervised ML, unsupervised ML, and analytics. Attrition - whether an employee left or not - is included and can be used as the target variable.
The data is synthetic and based on the IBM Watson dataset for attrition. Employee roles and departments were changed to reflect the healthcare domain. Also, known outcomes for some employees were changed to help increase the performance of ML models.
Here's an app I use as a demo based on this dataset and an ML classification model.
https://i.imgur.com/Aft3t1E.png
https://i.imgur.com/QNRX2LA.png
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Explore the intricacies of medical costs and healthcare expenses with our meticulously curated Medical Cost Dataset. This dataset offers valuable insights into the factors influencing medical charges, enabling researchers, analysts, and healthcare professionals to gain a deeper understanding of the dynamics within the healthcare industry.
Columns:
1. ID: A unique identifier assigned to each individual record, facilitating efficient data management and analysis.
2. Age: The age of the patient, providing a crucial demographic factor that often correlates with medical expenses.
3. Sex: The gender of the patient, offering insights into potential cost variations based on biological differences.
4. BMI: The Body Mass Index (BMI) of the patient, indicating the relative weight status and its potential impact on healthcare costs.
5. Children: The number of children or dependents covered under the medical insurance, influencing family-related medical expenses.
6. Smoker: A binary indicator of whether the patient is a smoker or not, as smoking habits can significantly impact healthcare costs.
7. Region: The geographic region of the patient, helping to understand regional disparities in healthcare expenditure.
8. Charges: The medical charges incurred by the patient, serving as the target variable for analysis and predictions.
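As a first look at how one column relates to the target, a grouped mean of Charges can be computed as sketched below. The rows are invented for illustration and are not taken from the dataset, and the `"yes"`/`"no"` encoding of Smoker is an assumption, since the description only says the column is a binary indicator.

```python
from collections import defaultdict
from statistics import mean

# Toy rows, invented for illustration only; not taken from the dataset.
rows = [
    {"Age": 25, "Smoker": "yes", "Charges": 300.0},
    {"Age": 40, "Smoker": "no", "Charges": 100.0},
    {"Age": 52, "Smoker": "yes", "Charges": 500.0},
    {"Age": 33, "Smoker": "no", "Charges": 200.0},
]


def mean_charges_by(rows, column):
    """Group rows by `column` and return the mean of Charges per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["Charges"])
    return {key: mean(values) for key, values in groups.items()}


by_smoker = mean_charges_by(rows, "Smoker")
```

The same helper works for any categorical column, for example `mean_charges_by(rows, "Region")` once that field is present in the rows.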
Whether you're aiming to uncover patterns in medical billing, predict future healthcare costs, or explore the relationships between different variables and charges, our Medical Cost Dataset provides a robust foundation for your research. Researchers can utilize this dataset to develop data-driven models that enhance the efficiency of healthcare resource allocation, insurers can refine pricing strategies, and policymakers can make informed decisions to improve the overall healthcare system.
Unlock the potential of healthcare data with our comprehensive Medical Cost Dataset. Gain insights, make informed decisions, and contribute to the advancement of healthcare economics and policy. Start your analysis today and pave the way for a healthier future.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Explore our synthetic healthcare dataset designed for machine learning, data science, and healthcare analytics.
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Healthcare Dataset is a synthetic dataset designed to mimic real-world healthcare data for data science, machine learning, and data analysis purposes. It includes patient information, medical conditions, admission details, and healthcare services provided. This dataset is ideal for developing and testing healthcare predictive models, practicing data manipulation techniques, and creating data visualizations.
2) Data Utilization
(1) Characteristics of the healthcare data:
• It includes detailed patient information such as age, gender, blood type, medical condition, and admission details. This information can be used to analyze healthcare trends, patient demographics, and the effectiveness of medical treatments.
(2) Uses of the healthcare data:
• Predictive Modeling: Helps in developing models to predict patient outcomes, treatment success rates, and disease progression.
• Healthcare Analytics: Assists in analyzing patient data to identify patterns, improve patient care, and optimize resource allocation.
• Educational Purposes: Supports learning and teaching data science concepts in a healthcare context, providing realistic data for experimentation and practice.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Description:
This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.
Key Features:
Potential Use Cases:
This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.
https://choosealicense.com/licenses/creativeml-openrail-m/
AI Medical Dataset
Introduction
The AI Medical General Dataset is an experimental dataset designed to build a general chatbot with a strong foundation in medical knowledge. This dataset provides a large corpus of medical data, consisting of approximately 27 million rows, specifically adapted for training Large Language Models (LLMs) in the medical domain.
Data Sources
Our dataset is comprised of three primary sources:
Source Number of Words… See the full description on the dataset page: https://huggingface.co/datasets/ruslanmv/ai-medical-dataset.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
To our knowledge, the Arabic Healthcare Dataset (AHD) is the largest of its kind. It was collected from the altibbi website.
The AHD consists of more than 808k question-and-answer pairs across 90 categories. The files are described below; the main file holds the actual data, which is in Arabic.
The AHD.xlsx file contains the dataset in Excel format, with the question, answer, and category in Arabic.
The AHD_english.xlsx file contains the dataset in Excel format, with the question, answer, and category translated to English.
The file Distribution of Question and Answer per category.xlsx shows the distribution of the dataset by category.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This work presents a heterogeneous big dataset comprising electrocardiogram (ECG) signals, blood pressure signals, oxygen saturation (SpO2) signals, and text input. It extends our earlier dataset formulation presented in [1], and the signals were acquired from PhysioNet [2], a trustworthy and relevant medical dataset library. The dataset includes medical features from heterogeneous sources, both sensory and non-sensory. First, the ECG sensor signals provide QRS width, ST elevation, peak count, and cycle interval. Second, the SpO2 sensor signals provide the SpO2 level. Third, the blood pressure sensor signals provide high (systolic) and low (diastolic) values. Finally, the text input constitutes the non-sensory data; it was formulated based on doctors' diagnostic procedures for chronic heart diseases. A Python software environment was used, and the simulated big data is presented along with analyses.
This dataset contains demographic and personal health information for individuals, along with the corresponding medical insurance charges billed to them. It is commonly used to build predictive models for insurance costs and to explore relationships between factors such as age, BMI, smoking status, and region on medical expenses.
Features:
- age: Age of the primary beneficiary (integer)
- sex: Gender of the individual (male, female)
- bmi: Body mass index, providing a measure of body fat based on height and weight (float)
- children: Number of children/dependents covered by the insurance (integer)
- smoker: Smoking status of the individual (yes, no)
- region: Residential area in the US (northeast, northwest, southeast, southwest)
- charges: Individual medical costs billed by health insurance (float, in USD)
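Before fitting a regression model, the categorical features listed above need a numeric encoding. The sketch below shows one possible encoding (binary flags plus one indicator per region); the feature order and the example row are assumptions made for illustration, not part of the dataset documentation.

```python
REGIONS = ("northeast", "northwest", "southeast", "southwest")


def encode_row(row):
    """Turn one raw CSV row (a dict of strings) into (features, target).

    Feature order (an illustrative choice): age, bmi, children,
    sex == male, smoker == yes, then one indicator per region.
    """
    features = [
        float(int(row["age"])),
        float(row["bmi"]),
        float(int(row["children"])),
        1.0 if row["sex"] == "male" else 0.0,
        1.0 if row["smoker"] == "yes" else 0.0,
    ]
    features += [1.0 if row["region"] == r else 0.0 for r in REGIONS]
    return features, float(row["charges"])


# Invented example row, for illustration only.
raw = {"age": "30", "sex": "female", "bmi": "24.5", "children": "1",
       "smoker": "no", "region": "southeast", "charges": "4000.0"}
x, y = encode_row(raw)
```

Rows read with `csv.DictReader` arrive as dicts of strings, which is why the function converts each field explicitly.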
Applications: This dataset is frequently used in regression modeling, cost prediction, and data visualization tasks. It is ideal for learning how lifestyle and demographic factors impact healthcare expenses and serves as a foundational dataset for applied machine learning in health economics.
gowthamrvc/healthcare-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
This service provides web services used to obtain clinical data for patients. Three service methods allow write functionality: signNote, writeNote and writeSimpleOrder. All other functionality exposed by this service is read-only. The service supports data access across multiple Vista sites. The intended users of this service are healthcare providers.
Overview This dataset is a collection of multimodal high quality image sets of medical data that are ready to use for optimizing the accuracy of computer vision models. All of the contents are sourced from Pixta AI's partner network with high quality & full data compliance.
Data subject The datasets consist of various models
X-ray datasets
CT datasets
MRI datasets
Mammography datasets
Segmentation datasets
Classification datasets
Regression datasets
Use case The dataset could be used for various Healthcare & Medical models:
Medical Image Analysis
Remote Diagnosis
Medical Record Keeping ...
Each dataset goes through both an AI and an expert doctor review process to ensure labelling consistency and accuracy. Contact us for more custom datasets.
About PIXTA PIXTASTOCK is the largest Asian-featured stock platform, providing data, content, tools and services since 2005. PIXTA has 15 years of experience integrating advanced AI technology in managing, curating, and processing over 100M visual materials and serving leading global brands for their creative and data demands. Visit us at https://www.pixta.ai/ or contact us via email at admin.bi@pixta.co.jp.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Arabic Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Arabic speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Arabic Healthcare interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages,
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
smart city
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Synthetic dataset of emergency services, comprising several CSV files that we generated using simulation software. This dataset is open for public use; please cite our work if used in research or applications.
File Overview
CheckBloodPressure.csv (9 KB): Contains blood pressure Server records of patients.
CheckPatientType.csv (19 KB): Identifies the type of each patient (e.g., 1 or 3).
Fill_Information.csv (2 KB): Fill information records for new patients.
MedicalRecord1.csv (10 KB): Medical record dataset for patient type 1.
MedicalRecord2.csv (4 KB): Medical record dataset for patient type 2.
MedicalRecord3.csv (2 KB): Medical record dataset for patient type 3.
MedicalRecord4.csv (13 KB): Medical record dataset for patient type 4.
OutPatientDepartment.csv (18 KB): Data related to the satisfaction and length of stay of a given patient.
Triage.csv (13 KB): Data related to the triage process.
README.txt (4 KB): Documentation of the dataset, including structure, metadata, and usage.
Common Fields Across Files
Patient ID (Integer): Unique identifier for each patient.
Patient Type (Integer): Classification of patient (e.g., 1, 4).
Medical Records Arrival Time (DateTime): Timestamp of the patient's first arrival in the medical record department.
Exiting Time (DateTime): Timestamp when the patient exits a Server.
Waiting Time (min) (Real): Total waiting time before being attended to.
Resource Used (String): Resource (e.g., Operator) allocated to the patient.
Utilization % (Real): Utilization rate of the resource as a percentage.
Queue Count Before Processing (Integer): Number of patients in the queue before processing begins.
Queue Count After Processing (Integer): Number of patients in the queue after processing ends.
Queue Difference (Integer): Difference between the before and after queue counts.
Length of Stay (min) (Real): Total time spent in the simulation by the patient.
LOS without Queues (min) (Real): Length of stay excluding any queuing time.
Satisfaction % (Real): Patient satisfaction rating based on their experience.
New Patient? (String): Indicates if this is a new patient or a returning one.
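Two of the fields above appear to be derived from others, which makes them useful for sanity-checking a loaded file. The sketch below recomputes them under two assumptions that the description does not confirm: that Queue Difference is the before count minus the after count, and that LOS without Queues equals Length of Stay minus Waiting Time.

```python
def derived_fields(record):
    """Recompute two derived columns from the raw ones.

    Assumptions (not confirmed by the dataset description):
      Queue Difference   = before count - after count
      LOS without Queues = Length of Stay - Waiting Time
    """
    queue_diff = (record["Queue Count Before Processing"]
                  - record["Queue Count After Processing"])
    los_no_queue = record["Length of Stay (min)"] - record["Waiting Time (min)"]
    return queue_diff, los_no_queue


# Invented values, for illustration only.
record = {
    "Queue Count Before Processing": 5,
    "Queue Count After Processing": 3,
    "Length of Stay (min)": 42.5,
    "Waiting Time (min)": 12.5,
}
queue_diff, los_no_queue = derived_fields(record)
```

If the recomputed values disagree with the stored columns, the sign convention or the definition of queuing time in the simulation likely differs from these assumptions.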
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This synthetic healthcare dataset serves as a valuable resource for data science, machine learning, and data analysis enthusiasts. It is designed to mimic real-world healthcare data, enabling users to practise, develop, and showcase their data manipulation and analysis skills within the healthcare industry. The inspiration behind this dataset stems from the need for practical and diverse healthcare data for educational and research purposes, addressing the challenge of accessing sensitive real-world healthcare information. Generated using Python's Faker library, it mirrors the structure and attributes commonly found in healthcare records, aiming to foster innovation, learning, and knowledge sharing in healthcare analytics.
This dataset is typically provided as a data file in CSV format. It is structured with columns providing specific information about the patient, their admission, and the healthcare services received. While the exact number of rows or records is not specified, it is designed to be a synthetic dataset suitable for various data analysis and modelling tasks in the healthcare domain.
This dataset is ideal for a wide range of applications, including:
* Developing and testing healthcare predictive models.
* Practising data cleaning, transformation, and analysis techniques.
* Creating data visualisations to gain insights into healthcare trends.
* Learning and teaching data science and machine learning concepts in a healthcare context.
It can specifically be treated as a Multi-Class Classification Problem for predicting 'Test Results', which contains three categories: Normal, Abnormal, and Inconclusive.
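When framing 'Test Results' as a three-class problem, a majority-class baseline gives a floor that any trained model should beat. The sketch below is one simple way to compute it; the label lists are invented for illustration and do not reflect the dataset's actual class balance.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, truths):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


# Invented labels, for illustration only.
train_labels = ["Normal", "Abnormal", "Normal", "Inconclusive", "Normal"]
test_labels = ["Normal", "Abnormal", "Normal", "Inconclusive"]

guess = majority_baseline(train_labels)
score = accuracy([guess] * len(test_labels), test_labels)
```

Because the data is synthetic, a model that scores close to this baseline may simply indicate that the labels carry little learnable signal.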
The dataset has a global geographic region. The time range for admissions and discharges, as indicated by the 'Date of Admission' and 'Discharge Date' columns, spans across several years, with examples observed from 2019 to 2024. Demographic scope is covered by patient 'Name', 'Age', 'Gender', and 'Blood Type' information. As this is a synthetic dataset, it does not contain real patient information and is created to mirror common healthcare record structures.
CC0
This dataset is intended for data science, machine learning, and data analysis enthusiasts. It is particularly useful for those looking to engage in learning and experimentation within the healthcare analytics domain. The dataset encourages exploration, analysis, and sharing of findings within communities like Kaggle.
Original Data Source: Healthcare Dataset
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
🩺 Recurv-Clinical-Dataset:
The Recurv Clinical Dataset is a comprehensive resource containing 12,631 high-quality question-answer pairs specifically designed for training and fine-tuning medical AI models. Curated from trusted medical sources, this dataset focuses on real-world scenarios, including patient history, diagnostics, and treatment recommendations. It sets a new benchmark for advancing conversational AI in the healthcare field.
📈 Dataset Statistics… See the full description on the dataset page: https://huggingface.co/datasets/RecurvAI/Recurv-Clinical-Dataset.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The dataset is a JSON file containing the following intents:
1. Greetings
2. Introduction to the Diseases
3. Types of Diseases
4. Symptoms of Disease
5. Prevention of Disease
The MedVidCL dataset contains a collection of 6,617 videos annotated into ‘medical instructional’, ‘medical non-instructional’ and ‘non-medical’ classes. A two-step approach is used to construct the MedVidCL dataset. In the first step, the videos annotated by health informatics experts are used to train a machine learning model that assigns a given video to one of the three aforementioned classes. In the second step, only the high-confidence videos are used, and health informatics experts assess the model’s predicted video category and update the category wherever needed.
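The second step of this construction can be sketched as a confidence filter followed by an expert override. This is an illustration only: the 0.9 threshold, the function names, and the data layout are assumptions, since the description does not state how high confidence was defined.

```python
CLASSES = ("medical instructional", "medical non-instructional", "non-medical")


def select_high_confidence(predictions, threshold=0.9):
    """Keep videos whose top class probability meets the threshold.

    `predictions` maps a video id to three probabilities in CLASSES order.
    The 0.9 default is an assumption; the source does not state a value.
    """
    selected = {}
    for video_id, probs in predictions.items():
        best = max(range(len(CLASSES)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            selected[video_id] = CLASSES[best]
    return selected


def apply_expert_review(selected, corrections):
    """Override predicted categories with expert corrections where given."""
    reviewed = dict(selected)
    reviewed.update({v: c for v, c in corrections.items() if v in selected})
    return reviewed


# Invented model outputs, for illustration only.
predictions = {
    "v1": (0.95, 0.03, 0.02),
    "v2": (0.50, 0.30, 0.20),
    "v3": (0.05, 0.02, 0.93),
}
selected = select_high_confidence(predictions)
reviewed = apply_expert_review(selected, {"v3": "medical non-instructional"})
```

Here `v2` is dropped for low confidence, and the expert correction changes the category of `v3` while leaving `v1` as predicted.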