Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Patients Table:
This table stores information about individual patients, including their names and contact details.
Doctors Table:
This table contains details about healthcare providers, including their names, specializations, and contact information.
Appointments Table:
This table records scheduled appointments, linking patients to doctors.
MedicalProcedure Table:
This table stores details about medical procedures associated with specific appointments.
Billing Table:
This table maintains records of billing transactions, associating them with specific patients.
demo Table:
This table appears to be a demonstration or testing table, possibly unrelated to the healthcare management system.
This dataset schema is designed to capture comprehensive information about patients, doctors, appointments, medical procedures, and billing transactions in a healthcare management system. Adjustments can be made based on specific requirements, and additional attributes can be included as needed.
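The relationships described above (appointments link patients to doctors, procedures hang off appointments, billing points at patients) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the description names the tables but not their columns, so every column name and sample value below is a hypothetical placeholder.

```python
import sqlite3

# Hypothetical columns: the dataset description lists the tables only.
SCHEMA = """
CREATE TABLE Patients (patient_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
CREATE TABLE Doctors (doctor_id INTEGER PRIMARY KEY, name TEXT,
                      specialization TEXT, phone TEXT);
CREATE TABLE Appointments (
    appointment_id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES Patients(patient_id),
    doctor_id INTEGER REFERENCES Doctors(doctor_id),
    scheduled_at TEXT
);
CREATE TABLE MedicalProcedure (
    procedure_id INTEGER PRIMARY KEY,
    appointment_id INTEGER REFERENCES Appointments(appointment_id),
    procedure_name TEXT
);
CREATE TABLE Billing (
    invoice_id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES Patients(patient_id),
    amount REAL
);
"""


def build_demo_db():
    """Create an in-memory database with one made-up row per table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Patients VALUES (1, 'Ada Example', '555-0100')")
    conn.execute("INSERT INTO Doctors VALUES (1, 'Dr. Sample', 'Cardiology', '555-0101')")
    conn.execute("INSERT INTO Appointments VALUES (1, 1, 1, '2024-01-15')")
    conn.execute("INSERT INTO MedicalProcedure VALUES (1, 1, 'ECG')")
    conn.execute("INSERT INTO Billing VALUES (1, 1, 120.0)")
    return conn


def procedures_for_patient(conn, patient_id):
    """Return (patient, doctor, procedure) rows reached through Appointments."""
    return conn.execute(
        """
        SELECT p.name, d.name, m.procedure_name
        FROM Appointments a
        JOIN Patients p ON p.patient_id = a.patient_id
        JOIN Doctors d ON d.doctor_id = a.doctor_id
        JOIN MedicalProcedure m ON m.appointment_id = a.appointment_id
        WHERE a.patient_id = ?
        """,
        (patient_id,),
    ).fetchall()
```

Calling `procedures_for_patient(build_demo_db(), 1)` walks the Appointments table to join a patient, a doctor, and a procedure, which is the linking role the description assigns to that table.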
The Agency for Healthcare Research and Quality (AHRQ) created SyH-DR from eligibility and claims files for Medicare, Medicaid, and commercial insurance plans in calendar year 2016. SyH-DR contains data from a nationally representative sample of insured individuals for the 2016 calendar year. SyH-DR uses synthetic data elements at the claim level to resemble the marginal distribution of the original data elements. SyH-DR person-level data elements are not synthetic, but identifying information is aggregated or masked.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The heart attack datasets were collected at Zheen hospital in Erbil, Iraq, from January 2019 to May 2019. The attributes of this dataset are: age, gender, heart rate, systolic blood pressure, diastolic blood pressure, blood sugar, ck-mb and troponin with negative or positive output. According to the provided information, the medical dataset classifies either heart attack or none. The gender column in the data is normalized: the male is set to 1 and the female to 0. The glucose column is set to 1 if it is > 120; otherwise, 0. As for the output, positive is set to 1 and negative to 0.
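The encoding rules stated above can be written out as a small function. This is a minimal sketch: the raw labels ("male"/"female", "positive"/"negative") and the function name are assumptions made for illustration, since the description only gives the encoded values.

```python
# Assumed raw labels; the description only states the encoded values.
GENDER_CODES = {"male": 1, "female": 0}
OUTPUT_CODES = {"positive": 1, "negative": 0}


def encode_record(gender, blood_sugar, output):
    """Apply the stated rules: male=1/female=0, glucose flag=1 if > 120, positive=1."""
    return {
        "gender": GENDER_CODES[gender.strip().lower()],
        "glucose": 1 if blood_sugar > 120 else 0,
        "output": OUTPUT_CODES[output.strip().lower()],
    }
```

Note that the threshold is strict: a blood sugar value of exactly 120 is encoded as 0.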
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The Open Database of Healthcare Facilities (ODHF) is a listing of health facilities across Canada. Facilities are classified into one of three types: ambulatory health care services, hospitals, and nursing and residential care facilities. The listing contains the names, addresses, and geo coordinates of facilities, as well as the facility type as assigned in the data source. The ODHF is based on data from authoritative sources that include among them all levels of government and public health and professional healthcare bodies. The ODHF is released as open data under the Open Government License - Canada and provided as a zipped comma-separated values (.csv) file.
The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified, and publicly available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.
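To illustrate the split between top-level codes and sub-codes, here is a minimal sketch that collapses full codes to their top-level part. It assumes codes are written in dotted notation (for example `428.0`), where the part before the decimal point is the top-level category; if a distribution stores codes without the decimal point, this approach would not apply as written. The function names are hypothetical.

```python
def top_level_code(code):
    """Return the part of a dotted ICD-9 code before the decimal point."""
    return code.strip().split(".")[0]


def top_level_labels(codes):
    """Collapse a report's full codes to a sorted list of distinct top-level codes."""
    return sorted({top_level_code(c) for c in codes})
```

For a report labeled `["428.0", "428.21", "250.00"]`, `top_level_labels` yields two top-level labels, since both `428.x` sub-codes share one parent.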
The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mental Health reports the past-year prevalence of mental illness by age range.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This product presents comparable time-series data for a range of health indicators from a number of sources including the Canadian Community Health Survey, Vital Statistics, and Canadian Cancer Registry.
The Health Statistics and Health Research Database is Estonia's largest set of health-related statistics and survey results, administered by the National Institute for Health Development. Use of the database is free of charge.
The database consists of eight main areas divided into sub-areas. The data tables included in the sub-areas are assigned unique codes. The data tables in the database can be viewed online and downloaded in several file formats (.px, .xlsx, .csv, .json). You can download the detailed database user manual here (.pdf).
The database is constantly updated with new data. Dates of updating the existing data tables and adding new data are provided in the release calendar. The date of the last update to each table is provided after the title of the table in the list of data tables.
A contact person for each sub-area is listed under the "Definitions and Methodology" link of that sub-area, so you can ask for additional information about the data published in the database. Contact this person with any further questions and data requests.
Read more about publication of health statistics by National Institute for Health Development in Health Statistics Dissemination Principles.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Synthetic Metabolic Syndrome Dataset is designed for educational and research purposes in healthcare, focusing on metabolic syndrome and related health parameters. The dataset contains demographic, anthropometric, and biochemical information that can be used to analyze and predict the presence of metabolic syndrome in individuals.
[Image: Synthetic Metabolic Syndrome Data]
This dataset is well-suited for applications in healthcare analytics, public health, and data science:
CC0 (Public Domain)
Who Can Use It
- Healthcare Professionals: To study metabolic syndrome trends and tailor interventions.
- Data Scientists: For practicing classification, regression, and clustering techniques in healthcare analytics.
- Public Health Analysts: To assess population-level metabolic health and inform policies.
- Researchers: To simulate the impact of lifestyle changes on metabolic health outcomes.
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy.
The Medical Information Mart for Intensive Care (MIMIC)-III database provided critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC). Importantly, MIMIC-III was deidentified, and patient identifiers were removed according to the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. MIMIC-III has been integral in driving large amounts of research in clinical informatics, epidemiology, and machine learning. Here we present MIMIC-IV, an update to MIMIC-III, which incorporates contemporary data and improves on numerous aspects of MIMIC-III. MIMIC-IV adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Medically Validated, Age-Accurate, and Balanced
Samples: 35,000 | Features: 16 | Targets: 2 (Binary + Regression)
This dataset is designed for predicting stroke risk using symptoms, demographics, and medical literature-inspired risk modeling. Version 2 significantly improves upon Version 1 by incorporating age-dependent symptom probabilities, gender-specific risk modifiers, and medically validated feature engineering.
Age-Accurate Risk Modeling:
Gender-Specific Risk:
Balanced and Expanded Data:
Column | Type | Description |
---|---|---|
age | Integer | Age (18–90) |
gender | String | Male/Female |
chest_pain | Binary | 1 = Present, 0 = Absent |
shortness_of_breath | Binary | 1 = Present, 0 = Absent |
irregular_heartbeat | Binary | 1 = Present, 0 = Absent |
fatigue_weakness | Binary | 1 = Present, 0 = Absent |
dizziness | Binary | 1 = Present, 0 = Absent |
swelling_edema | Binary | 1 = Present, 0 = Absent |
neck_jaw_pain | Binary | 1 = Present, 0 = Absent |
excessive_sweating | Binary | 1 = Present, 0 = Absent |
persistent_cough | Binary | 1 = Present, 0 = Absent |
nausea_vomiting | Binary | 1 = Present, 0 = Absent |
high_blood_pressure | Binary | 1 = Present, 0 = Absent |
chest_discomfort | Binary | 1 = Present, 0 = Absent |
cold_hands_feet | Binary | 1 = Present, 0 = Absent |
snoring_sleep_apnea | Binary | 1 = Present, 0 = Absent |
anxiety_doom | Binary | 1 = Present, 0 = Absent |
at_risk | Binary | Target for classification (1 = At Risk, 0 = Not At Risk) |
stroke_risk_percentage | Float | Target for regression (0–100%) |
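Because the table defines two targets, a row has to be split into features and targets before training either a classifier or a regressor. The sketch below shows one way to do that using the column names from the table; the function name and the sample row are hypothetical.

```python
# Target column names are taken from the table above.
TARGETS = ("at_risk", "stroke_risk_percentage")


def split_row(row):
    """Separate a row (dict of column -> value) into features and the two targets."""
    features = {k: v for k, v in row.items() if k not in TARGETS}
    targets = {k: row[k] for k in TARGETS}
    return features, targets
```

Keeping both targets out of the feature set matters here, since `stroke_risk_percentage` would otherwise leak information about `at_risk`.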
Age distribution in Version 2 vs. Version 1
This dataset is grounded in peer-reviewed medical literature, with symptom probabilities, risk weights, and demographic relationships directly derived from clinical guidelines and epidemiological studies. Below is a detailed breakdown of how medical knowledge was translated into dataset parameters:
The prevalence of symptoms increases with age, reflecting real-world clinical observations. Probabilities are calibrated using population-level data from medical literature:
https://creativecommons.org/publicdomain/zero/1.0/
Federated learning builds machine learning models from datasets that are distributed across multiple devices while preventing data leakage (Q. Yang et al. 2019).
source:
smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain
heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain
water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain
customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain
insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain
credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain
income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain
machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license = CC0: Public Domain
skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)
score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain
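To make the federated learning definition above concrete, here is a minimal sketch of one aggregation step in which a server combines model weights trained separately on each device, so that raw data never leaves the device. The source does not say which aggregation method these datasets are meant for; sample-weighted averaging (often called federated averaging) is used here purely as an illustration, and the function name is hypothetical.

```python
def federated_average(client_updates):
    """Combine per-client weight vectors, weighting each by its sample count.

    client_updates: list of (weights, n_samples) pairs, where weights is a
    list of floats of the same length for every client.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total
        for i in range(dim)
    ]
```

For example, `federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)])` gives the second client three times the influence of the first, because it contributed three times as many samples.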
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adaptation of http://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records# Ready for usage with ehrapy
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a publication of the CoAID dataset, originally dedicated to fake news detection. Here we changed the purpose of this dataset in order to use it in the context of event tracking in press documents.
Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset." arXiv:2006.00885 [cs], November. http://arxiv.org/abs/2006.00885.
In this dataset, we provide multiple features extracted from the text itself. Please note the text is missing from the dataset published in the CSV format for copyright reasons. You can download the original datasets and manually add the missing texts from the original publications.
Features are extracted using:
- A corpus of reference articles in multiple languages for TF-IDF weighting. (features_news) [1]
- A corpus of tweets reporting news for TF-IDF weighting. (features_tweets) [1]
- An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use) [3]
- An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet) [4]
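The first two feature sets above rely on TF-IDF weights computed against a separate reference corpus. The sketch below shows that general idea with a textbook formulation (relative term frequency times the log of inverse document frequency). The exact weighting variant, tokenization, and smoothing used to produce features_news and features_tweets are not stated here, so treat this as an assumption-laden illustration; the function names are hypothetical.

```python
import math
from collections import Counter


def idf_from_corpus(reference_docs):
    """Compute log(N / document frequency) for every term in a reference corpus.

    reference_docs: list of documents, each a list of tokens.
    """
    n_docs = len(reference_docs)
    doc_freq = Counter()
    for doc in reference_docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


def tfidf(doc, idf):
    """Weight a tokenized document; terms unseen in the reference corpus get 0.0."""
    term_freq = Counter(doc)
    return {
        term: (count / len(doc)) * idf.get(term, 0.0)
        for term, count in term_freq.items()
    }
```

Terms that appear in every reference document receive a weight of zero, which is why the choice of reference corpus (news articles versus tweets) changes the resulting features.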
References:
[1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406
[2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.
[3]: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1
[4]: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
Provides basic information for general acute care hospital buildings such as height, number of stories, the building code used to design the building, and the year it was completed. The data is sorted by counties and cities. Structural Performance Categories (SPC ratings) are also provided. SPC ratings range from 1 to 5 with SPC 1 assigned to buildings that may be at risk of collapse during a strong earthquake and SPC 5 assigned to buildings reasonably capable of providing services to the public following a strong earthquake. Where SPC ratings have not been confirmed by the Department of Health Care Access and Information (HCAI) yet, the rating index is followed by 's'. A URL for the building webpage in HCAI/OSHPD eServices Portal is also provided to view projects related to any building.
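The rating convention above (a number from 1 to 5, followed by 's' when HCAI has not yet confirmed it) can be parsed as sketched below. This assumes the rating arrives as a short string such as "3" or "3s"; the exact column format in the published file is not shown here, and the function name is hypothetical.

```python
def parse_spc(rating):
    """Split an SPC rating string into (level, confirmed).

    A trailing 's' marks a rating not yet confirmed by HCAI.
    """
    text = rating.strip()
    confirmed = not text.lower().endswith("s")
    level = int(text.rstrip("sS"))
    if not 1 <= level <= 5:
        raise ValueError(f"SPC level out of range: {rating!r}")
    return level, confirmed
```

Returning the confirmation flag separately lets an analysis filter on confirmed ratings without losing the numeric level of unconfirmed ones.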
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A synthetic heart disease dataset has been generated to serve as an educational resource for data science, machine learning, and data analysis applications in the healthcare industry. It simulates patient records related to heart disease, allowing users to practice data manipulation and develop analytical skills in a healthcare context.
[Images: Synthetic Heart Disease Data; Synthetic Heart Disease Patient Records Dataset; Synthetic Heart Disease Statistics; Synthetic Heart Disease Data Distribution; Synthetic Heart Disease Dataset Heatmap and Correlation]
This dataset can be used for:
- Healthcare research: To explore trends and patterns in cardiovascular health, treatment efficacy, and patient demographics.
- Educational training: To teach data cleaning, transformation, and visualisation techniques specific to healthcare data.
- Predictive modelling: To develop models that predict heart disease risk based on various patient and demographic factors.
This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising real patient privacy.
CC0 (Public Domain)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These BRFSS datasets were downloaded prior to them being taken offline on January 31st, 2025. Special thanks to James Bailey & Doug Livingston who made earlier years of BRFSS data available!
Data 2000-2023 are provided in SAS, Stata, and R formats. Data for 1987-1999 are provided in CSV format.
This repository has a DOI assigned if you need to cite it.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A beginner-friendly version of the MIT-BIH Arrhythmia Database, which contains 48 electrocardiograms (EKGs) from 47 patients that were at Beth Israel Deaconess Medical Center in Boston, MA in 1975-1979.
There are 48 CSVs, each of which is a 30-minute electrocardiogram (EKG) from a single patient (records 201 and 202 are from the same patient). Data was collected at 360 Hz, meaning that 360 data points equal 1 second of time.
Banner photo by Joshua Chehov on Unsplash.
EKGs, or electrocardiograms, measure the heart's function by looking at its electrical activity. The electrical activity in each part of the heart is supposed to happen in a particular order and intensity, creating that classic "heartbeat" line (or "QRS complex") you see on monitors in medical TV shows.
There are a few types of EKGs (4-lead, 5-lead, 12-lead, etc.), which give us varying detail about the heart. A 12-lead is one of the most detailed types of EKGs, as it allows us to get 12 different outputs or graphs, all looking at different, specific parts of the heart muscles.
This dataset only publishes two leads from each patient's 12-lead EKG, since that is all that the original MIT-BIH database provided.
Check out Ninja Nerd's EKG Basics tutorial on YouTube to understand what each part of the QRS complex (or heartbeat) means from an electrical standpoint.
Each file's name is the ID of the patient (except for 201 and 202, which are the same person).
The two leads are often lead MLII and another lead such as V1, V2, or V5, though some datasets do not use MLII at all. MLII is the lead most often associated with the classic QRS Complex (the medical name for a single heartbeat).
Milliseconds were calculated and added as a secondary index to each dataset. Calculations were made by dividing the index by 360 Hz and then multiplying by 1000 (index / 360 * 1000). The original index was preserved, since calculating milliseconds while digital signal processing (e.g. filtering) takes place may cause issues with the correlation and merging of data. You are encouraged to try whichever index is most suitable for your analysis and/or recalculate a time index with Pandas' to_timedelta().
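The millisecond calculation described above (index divided by the 360 Hz sampling rate, times 1000) can be sketched in plain Python as follows. The function and constant names are hypothetical, and the inverse conversion is an added convenience rather than something the dataset provides.

```python
SAMPLE_RATE_HZ = 360  # sampling rate stated in the dataset description


def index_to_ms(index):
    """Convert a sample index to elapsed milliseconds: index / 360 * 1000."""
    return index / SAMPLE_RATE_HZ * 1000


def ms_to_index(ms):
    """Convert elapsed milliseconds back to the nearest sample index."""
    return round(ms * SAMPLE_RATE_HZ / 1000)
```

Sample 360 therefore falls at the one-second mark, and a full 30-minute record spans 30 * 60 * 360 = 648,000 samples.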
Info about each of the 47 patients is available here, including age, gender, medications, diagnoses, etc.
Physionet has some online tutorials and tips for analyzing EKGs and other time series / digital signals.
Check out our notebook for opening and visualizing the data.
A write-up on how the data was converted from .dat to .csv files is available on Medium.com. Data was downloaded from the MIT-BIH Arrhythmia Database then converted to CSV.
Moody GB, Mark RG. The impact of the MIT-BIH Arrhythmia Database. IEEE Eng in Med and Biol 20(3):45-50 (May-June 2001). (PMID: 11446209)
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Tracking United HealthCare Stock Performance Since IPO
This dataset provides historical stock data for UnitedHealth Group (UHG), one of the largest healthcare and insurance companies in the world. It covers stock prices, market capitalization, and trading volumes from the company's IPO to the present. As a Fortune 500 company with a significant market presence, analyzing UHG's stock performance can provide valuable insights into healthcare market trends, investment opportunities, and economic indicators.
This dataset is useful for:
CC0 (Public Domain): This dataset is freely available for public and commercial use.