100+ datasets found

Chronic Disease Indicators
kaggle.com
zip
Updated Aug 17, 2017
Cite
Centers for Disease Control and Prevention (2017). Chronic Disease Indicators [Dataset]. https://www.kaggle.com/datasets/cdc/chronic-disease
Explore at:
Available download formats: zip (8401214 bytes)
Dataset updated
Aug 17, 2017
Dataset authored and provided by
Centers for Disease Control and Prevention (http://www.cdc.gov/)
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context:

CDC's Division of Population Health provides a cross-cutting set of 124 indicators that were developed by consensus and that allow states, territories, and large metropolitan areas to uniformly define, collect, and report chronic disease data that are important to public health practice and available for those jurisdictions. In addition to providing access to state-specific indicator data, the CDI web site serves as a gateway to additional information and data resources.

Content:

A variety of health-related questions were assessed at various times and places across the US over the past 15 years. Data is provided with confidence intervals and demographic stratification.

Acknowledgements:

Data was compiled by the CDC.

Inspiration:

Any interesting trends in certain groups?

Any correlation between disease indicators and locality hospital spending?
U.S. Healthcare Data
kaggle.com
zip
Updated Dec 22, 2017
Cite
BuryBuryZymon (2017). U.S. Healthcare Data [Dataset]. https://www.kaggle.com/maheshdadhich/us-healthcare-data
Explore at:
Available download formats: zip (37547642 bytes)
Dataset updated
Dec 22, 2017
Authors
BuryBuryZymon
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
United States
Description
Context

Health care in the United States is provided by many distinct organizations. Health care facilities are largely owned and operated by private sector businesses. 58% of US community hospitals are non-profit, 21% are government owned, and 21% are for-profit. According to the World Health Organization (WHO), the United States spent more on health care per capita ($9,403), and more on health care as a percentage of its GDP (17.1%), than any other nation in 2014. Many different datasets are needed to portray different aspects of health care in the US, such as disease prevalence, pharmaceuticals and drugs, and nutritional data for food products available in the US. Such data is collected through surveys (or otherwise) conducted by the Centers for Disease Control and Prevention (CDC), the Food and Drug Administration (FDA), the Centers for Medicare & Medicaid Services (CMS), and the Agency for Healthcare Research and Quality (AHRQ). These datasets can be used to review demographics and diseases, determine star ratings of health care providers, and examine different drugs and their compositions, as well as package information for different diseases and for food quality. We often want such information, and finding and scraping such data can be a huge hurdle. This dataset is therefore an attempt to make all US health care data available in one place, to download as CSV files.

Content

NHANES Survey (National Health and Nutrition Examination Survey) - The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation. The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel. The diseases, medical conditions, and health indicators to be studied include: Anemia, Cardiovascular disease, Diabetes, Environmental exposures, Eye diseases, Hearing loss, Infectious diseases, Kidney disease, Nutrition, Obesity, Oral health, Osteoporosis, Physical fitness and physical functioning, Reproductive history and sexual behavior, Respiratory disease (asthma, chronic bronchitis, emphysema), Sexually transmitted diseases, Vision. 10,000 individuals are surveyed to represent US statistics. Five files in this dataset contain recent NHANES data:
Nhanes_2005_2006.csv
Nhanes_2007_2008.csv
Nhanes_2009_2010.csv
Nhanes_2011_2012.csv
Nhanes_2013_2014.csv

Data fields' description -

Nhanes_2005_2006.csv - Demographic, Dietary, Examinations, Laboratory

Nhanes_2007_2008.csv - Demographic, Dietary, Examinations, Laboratory

Nhanes_2009_2010.csv - Demographic, Dietary, Examinations, Laboratory

Nhanes_2011_2012.csv - Demographic, Dietary, [Examinations](http://https://wwwn.cdc.gov/nchs/nhanes/search/variab...
Infectious Diseases by Disease, County, Year, and Sex
data.chhs.ca.gov
data.ca.gov
csv, zip
Updated Nov 7, 2025
Cite
California Department of Public Health (2025). Infectious Diseases by Disease, County, Year, and Sex [Dataset]. https://data.chhs.ca.gov/dataset/infectious-disease
Explore at:
Available download formats: zip, csv (12953665)
Dataset updated
Nov 7, 2025
Dataset authored and provided by
California Department of Public Health (https://www.cdph.ca.gov/)
Description
These data contain case counts and rates for selected communicable diseases (listed in the data dictionary) that met the surveillance case definition for that disease and were reported for California residents, by disease, county, year, and sex. The data represent cases with an estimated illness onset date from 2001 through the last year indicated from California Confidential Morbidity Reports and/or Laboratory Reports. Data captured represent reportable case counts as of the date indicated in the “Temporal Coverage” section below, so the data presented may differ from previous publications due to delays inherent to case reporting, laboratory reporting, and epidemiologic investigation.
Disease and symptoms dataset 2023
data.mendeley.com
Updated Mar 3, 2025
Cite
Bran Stark (2025). Disease and symptoms dataset 2023 [Dataset]. http://doi.org/10.17632/2cxccsxydc.1
Explore at:
Unique identifier
https://doi.org/10.17632/2cxccsxydc.1
Dataset updated
Mar 3, 2025
Authors
Bran Stark
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains disease names along with the symptoms faced by the respective patient. There are a total of 773 unique diseases and 377 symptoms, with ~246,000 rows. The dataset was artificially generated, preserving Symptom Severity and Disease Occurrence Possibility. Several distinct groups of symptoms might all be indicators of the same disease. There may even be one single symptom contributing to a disease in a row or sample. This is an indicator of a very high correlation between the symptom and that particular disease. A larger number of rows for a particular disease corresponds to its higher probability of occurrence in the real world. Similarly, in a row, if the feature vector has the occurrence of a single symptom, it implies that this symptom has more correlation to classify the disease than any one symptom of a feature vector with multiple symptoms in another sample.
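The claim that row counts reflect how likely a disease is to occur can be checked with a short sketch like the one below. It assumes the data is a table with one row per sample and a column holding the disease name; the function name and the sample labels are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def disease_frequencies(labels):
    """Return each disease's share of all rows (its empirical frequency)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {disease: n / total for disease, n in counts.items()}


# Hypothetical labels standing in for the dataset's disease column.
sample_labels = ["flu", "flu", "flu", "asthma"]
freqs = disease_frequencies(sample_labels)
```

Under the dataset's stated design, a disease with a larger share here would be one that is more common in the real world.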
PFAS and multimorbidity among a random sample of patients from the...
catalog.data.gov
s.cnmilf.com
Updated Oct 28, 2022
Cite
U.S. EPA Office of Research and Development (ORD) (2022). PFAS and multimorbidity among a random sample of patients from the University of North Carolina Healthcare System [Dataset]. https://catalog.data.gov/dataset/pfas-and-multimorbidity-among-a-random-sample-of-patients-from-the-university-of-north-car
Explore at:
Dataset updated
Oct 28, 2022
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
This dataset contains electronic health records used to study associations between PFAS occurrence and multimorbidity in a random sample of UNC Healthcare system patients. The dataset contains the medical record number to uniquely identify each individual as well as information on PFAS occurrence at the zip code level, the zip code of residence for each individual, chronic disease diagnoses, patient demographics, and neighborhood socioeconomic information from the 2010 US Census.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Because this data has PII from electronic health records, the data can only be accessed with an approved IRB application. Project analytic code is available at L:/PRIV/EPHD_CRB/Cavin/CARES/Project Analytic Code/Cavin Ward/PFAS Chronic Disease and Multimorbidity.

Format: This data is formatted as an R data frame and an associated comma-delimited flat text file. The data has the medical record number to uniquely identify each individual (which also serves as the primary key for the dataset), as well as information on the occurrence of PFAS contamination at the zip code level, socioeconomic data at the census tract level from the 2010 US Census, demographics, and the presence of chronic disease as well as multimorbidity (the presence of two or more chronic diseases).

This dataset is associated with the following publication: Ward-Caviness, C., J. Moyer, A. Weaver, R. Devlin, and D. Diazsanchez. Associations between PFAS occurrence and multimorbidity as observed in an electronic health record cohort. Environmental Epidemiology. Wolters Kluwer, Alphen aan den Rijn, NETHERLANDS, 6(4): p e217, (2022).
Cardiovascular_Disease_Dataset
data.mendeley.com
kaggle.com
Updated Apr 16, 2021
Cite
Bhanu Prakash Doppala (2021). Cardiovascular_Disease_Dataset [Dataset]. http://doi.org/10.17632/dzz48mvjht.1
Explore at:
Unique identifier
https://doi.org/10.17632/dzz48mvjht.1
Dataset updated
Apr 16, 2021
Authors
Bhanu Prakash Doppala
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This heart disease dataset was acquired from one of the multispecialty hospitals in India. It has over 14 common features, which makes it one of the heart disease datasets available so far for research purposes. This dataset consists of 1000 subjects with 12 features. It will be useful for building early-stage heart disease detection systems as well as for generating predictive machine learning models.
Coronary heart disease (in persons of all ages): England
data.catchmentbasedapproach.org
hub.arcgis.com
Updated Apr 7, 2021
Cite
The Rivers Trust (2021). Coronary heart disease (in persons of all ages): England [Dataset]. https://data.catchmentbasedapproach.org/items/832de0122e4b4bba9ff69cadc1bf53c4
Explore at:
Dataset updated
Apr 7, 2021
Dataset authored and provided by
The Rivers Trust
Area covered

Description
SUMMARY

This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of coronary heart disease (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.

ANALYSIS METHODOLOGY

The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to coronary heart disease (in persons of all ages).

This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.

The percentage of each MSOA’s population (all ages) with coronary heart disease was estimated. This was achieved by calculating a weighted average based on:
• The percentage of the MSOA area that was covered by each GP practice’s catchment area
• Of the GPs that covered part of that MSOA: the percentage of registered patients that have that illness

The estimated percentage of each MSOA’s population with coronary heart disease was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with coronary heart disease, within the relevant age range.

Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:
A) the PERCENTAGE of the population within that MSOA who are estimated to have coronary heart disease
B) the NUMBER of people within that MSOA who are estimated to have coronary heart disease

An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have coronary heart disease, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from coronary heart disease, and where those people make up a large percentage of the population, indicating there is a real issue with coronary heart disease within the population and the investment of resources to address that issue could have the greatest benefits.

LIMITATIONS

1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).

2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.

3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of coronary heart disease, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of coronary heart disease.

TO BE VIEWED IN COMBINATION WITH:

This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:
• Health and wellbeing statistics (GP-level, England): Missing data and potential outliers
• Levels of obesity, inactivity and associated illnesses (England): Missing data

DOWNLOADING THIS DATA

To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

DATA SOURCES

This dataset was produced using:
• Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
• GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.

COPYRIGHT NOTICE

The reproduction of this data must be accompanied by the following statement:
© Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

CaBA HEALTH & WELLBEING EVIDENCE BASE

This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
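The weighted-average and relative-score steps in the analysis methodology above can be illustrated with a rough sketch. The description does not say how the "relative score" was computed, so min-max rescaling is an assumption here, as are all function names and the made-up input figures; this is not the Ribble Rivers Trust's actual code.

```python
def weighted_prevalence(coverage_and_rates):
    """Area-weighted average of GP-level prevalence for one MSOA.

    coverage_and_rates: list of (share of MSOA area covered by the GP
    catchment, percentage of that GP's registered patients with the illness).
    """
    total_weight = sum(weight for weight, _ in coverage_and_rates)
    return sum(weight * rate for weight, rate in coverage_and_rates) / total_weight


def min_max(values):
    """Rescale values to 0..1 (assumed form of the 'relative score')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def msoa_scores(msoas):
    """msoas: dict of name -> (list of (coverage, rate) pairs, population)."""
    names = list(msoas)
    pct = [weighted_prevalence(msoas[name][0]) for name in names]
    num = [p / 100 * msoas[name][1] for p, name in zip(pct, names)]
    score_a = min_max(pct)  # score A: percentage of population affected
    score_b = min_max(num)  # score B: number of people affected
    averaged = [(a + b) / 2 for a, b in zip(score_a, score_b)]
    return dict(zip(names, min_max(averaged)))


# Made-up inputs for three hypothetical MSOAs.
example = {
    "MSOA_1": ([(0.5, 2.0), (0.5, 4.0)], 8000),
    "MSOA_2": ([(1.0, 5.0)], 10000),
    "MSOA_3": ([(0.25, 1.0), (0.75, 1.0)], 6000),
}
scores = msoa_scores(example)
```

In this toy input, the MSOA with both the highest percentage and the highest number of estimated cases ends up with a score of 1, and the one with the lowest of both ends up with 0.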
Data from: URDD: An open dataset for urban roadway disease detection and...
data.mendeley.com
Updated Feb 20, 2025
Cite
Shuaiqi Liu (2025). URDD: An open dataset for urban roadway disease detection and classification [Dataset]. http://doi.org/10.17632/r7pnxpr2bb.2
Explore at:
Unique identifier
https://doi.org/10.17632/r7pnxpr2bb.2
Dataset updated
Feb 20, 2025
Authors
Shuaiqi Liu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present two urban road disease datasets: DURDD for road disease detection and CURDD for road disease classification. DURDD includes four main types of underground road diseases: cavity, detachment, water-rich, and looseness. It also contains disease detection datasets in three base formats: COCO, Pascal VOC, and YOLO. In CURDD, the dataset is divided into two levels: level 0 and level 1, corresponding to the "Cls0" and "Cls1" catalogs, respectively. Level 1 includes cavity, detachment, water-rich, looseness, and background. Level 0 categories combine the four main disease types mentioned earlier into a single "diseases" category, with the other category being "background." This dataset was jointly published by Hebei University and the 519 Team of North China Geological Exploration Bureau. We support individuals or teams using the data for research purposes. We also welcome collaboration for commercial use. For commercial inquiries, please contact us for authorization.
Indicators of Heart Disease (2022 UPDATE)
kaggle.com
zip
Updated Oct 12, 2023
Cite
Kamil Pytlak (2023). Indicators of Heart Disease (2022 UPDATE) [Dataset]. https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease/discussion
Explore at:
Available download formats: zip (22474335 bytes)
Dataset updated
Oct 12, 2023
Authors
Kamil Pytlak
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Key Indicators of Heart Disease

2022 annual CDC survey data of 400k+ adults related to their health status

What subject does the dataset cover?

According to the CDC, heart disease is a leading cause of death for people of most races in the U.S. (African Americans, American Indians and Alaska Natives, and whites). About half of all Americans (47%) have at least 1 of 3 major risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetes status, obesity (high BMI), not getting enough physical activity, or drinking too much alcohol. Identifying and preventing the factors that have the greatest impact on heart disease is very important in healthcare. In turn, developments in computing allow the application of machine learning methods to detect "patterns" in the data that can predict a patient's condition.

Where did the data set come from and what treatments has it undergone?

The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to collect data on the health status of U.S. residents. As described by the CDC: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states, the District of Columbia, and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world." The most recent dataset includes data from 2023. In this dataset, I noticed many factors (questions) that directly or indirectly influence heart disease, so I decided to select the most relevant variables from it. I also decided to share with you two versions of the most recent dataset: with NaNs and without them.

What can you do with this data set?

As described above, the original dataset of nearly 300 variables was reduced to 40 variables. In addition to classical EDA, this dataset can be used to apply a number of machine learning methods, especially classifier models (logistic regression, SVM, random forest, etc.). You should treat the variable "HadHeartAttack" as binary ("Yes" - respondent had heart disease; "No" - respondent did not have heart disease). Note, however, that the classes are unbalanced, so the classic approach of applying a model is not advisable. Fixing the weights/undersampling should yield much better results. Based on the data set, I built a logistic regression model and embedded it in an application that might inspire you: https://share.streamlit.io/kamilpytlak/heart-condition-checker/main/app.py. Can you indicate which variables have a significant effect on the likelihood of heart disease?
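The two remedies mentioned for class imbalance, reweighting and undersampling, can be sketched in plain Python as below. This is only an illustration: the weight formula is the common "balanced" heuristic (n_samples / (n_classes * class_count)), the helper names are hypothetical, and the toy rows are made up rather than drawn from the BRFSS data.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def undersample(rows, label_of, seed=0):
    """Randomly keep as many rows per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, smallest))
    rng.shuffle(sampled)
    return sampled


# Made-up toy rows: 1 positive and 9 negatives for "HadHeartAttack".
rows = [{"HadHeartAttack": "Yes"}] + [{"HadHeartAttack": "No"} for _ in range(9)]
labels = [row["HadHeartAttack"] for row in rows]

weights = balanced_class_weights(labels)
balanced_rows = undersample(rows, lambda row: row["HadHeartAttack"])
```

The weights could then be passed to any classifier that accepts per-class weights, while the undersampled rows can be used with a classifier that does not.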

What steps did you use to convert the dataset?

Check out this notebook in my GitHub repository: https://github.com/kamilpytlak/data-science-projects/blob/main/heart-disease-prediction/2022/notebooks/data_processing.ipynb
Creutzfeldt-Jakob Disease Lookback Dataset (CJDLD)
catalog.data.gov
datahub.va.gov
Updated Aug 2, 2025
Cite
Department of Veterans Affairs (2025). Creutzfeldt-Jakob Disease Lookback Dataset (CJDLD) [Dataset]. https://catalog.data.gov/dataset/creutzfeldt-jakob-disease-lookback-dataset-cjdld
Explore at:
Dataset updated
Aug 2, 2025
Dataset provided by
United States Department of Veterans Affairs (http://va.gov/)
Description
The tracking system is for patients identified in the Creutzfeldt-Jakob Disease (CJD) lookback notification initiative established in January 1995 as part of the lookback notification of all Department of Veterans Affairs (VA) patients who may have received certain lot numbers of blood derivatives or blood components produced from donors with CJD. Even though the Centers for Disease Control and Prevention characterized the risk of transmission of CJD from blood derivative products as 'small and immeasurable' and 'theoretical', VA believed it had an ethical obligation to inform patients of the exposure to potentially contaminated blood components or plasma derivative products while under VA's care. The patients were notified. The Veterans Health Administration (VHA) established a tracking system for individuals who received these products to determine if there was an increase in VA CJD cases. Every two years, the VHA National Infectious Diseases Service updates the status of patients who had previously been identified through the VA CJD lookback notification initiative. The Creutzfeldt-Jakob Disease Lookback Dataset (CJDLD) is a prospective collection of data; requests for individual reports are not accepted.
PERU MIGRANT Study | Baseline and 5yr follow-up dataset
figshare.com
datasetcatalog.nlm.nih.gov
bin
Updated May 30, 2023
Cite
J. Jaime Miranda; Antonio Bernabe-Ortiz; Rodrigo Carrillo Larco (2023). PERU MIGRANT Study | Baseline and 5yr follow-up dataset [Dataset]. http://doi.org/10.6084/m9.figshare.4832612.v4
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.4832612.v4
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
J. Jaime Miranda; Antonio Bernabe-Ortiz; Rodrigo Carrillo Larco
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Peru
Description
This is an update of a prior dataset publication containing baseline and 5-year follow-up data from the PERU MIGRANT Study (PEru's Rural to Urban MIGRANTs Study).

The PERU MIGRANT Study was designed to investigate the magnitude of differences between rural-to-urban migrant and non-migrant groups in specific cardiovascular risk factors. Three groups were selected: i) Rural, people who have always lived in a rural environment; ii) Rural-urban, people who migrated from rural to urban areas; and iii) Urban, people who have always lived in an urban environment.

PERU MIGRANT Study protocol, instruments and variables are described in full in:
Miranda JJ, Gilman RH, García HH, Smeeth L. The effect on cardiovascular risk factors of migration from rural to urban areas in Peru: PERU MIGRANT Study. BMC Cardiovasc Disord 2009;9:23.

PERU MIGRANT Study baseline dataset is available at:
https://figshare.com/articles/PERU_MIGRANT_Study_Baseline_dataset/3125005

Main findings of the baseline study:
Miranda JJ, Gilman RH, Smeeth L. Differences in cardiovascular risk factors in rural, urban and rural-to-urban migrants in Peru. Heart 2011;97(10):787-96.

Main findings of the 5-yr follow-up study:
Carrillo-Larco RM, Bernabé-Ortiz A, Pillay TD, Gilman RH, Sanchez JF, Poterico JA, Quispe R, Smeeth L, Miranda JJ. Obesity risk in rural, urban and rural-to-urban migrants: prospective results of the PERU MIGRANT study. Int J Obes (Lond) 2016;40(1):181-5.
Bernabe-Ortiz A, Sanchez JF, Carrillo-Larco RM, Gilman RH, Poterico JA, Quispe R, Smeeth L, Miranda JJ. Rural-to-urban migration and risk of hypertension: longitudinal results of the PERU MIGRANT study. J Hum Hypertens 2017;31(1):22-28.
Lazo-Porras M, Bernabe-Ortiz A, Málaga G, Gilman RH, Acuña-Villaorduña A, Cardenas-Montero D, Smeeth L, Miranda JJ. Low HDL cholesterol as a cardiovascular risk factor in rural, urban, and rural-urban migrants: PERU MIGRANT cohort study. Atherosclerosis 2016;246:36-43.
Burroughs Pena MS, Bernabé-Ortiz A, Carrillo-Larco RM, Sánchez JF, Quispe R, Pillay TD, Málaga G, Gilman RH, Smeeth L, Miranda JJ. Migration, urbanisation and mortality: 5-year longitudinal analysis of the PERU MIGRANT study. J Epidemiol Community Health 2015;69(7):715-8.
Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...
data.mendeley.com
Updated Jul 25, 2022
Cite
Nirmalya Thakur (2022). MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak [Dataset]. http://doi.org/10.17632/xmcg82mx9k.3
Explore at:
Unique identifier
https://doi.org/10.17632/xmcg82mx9k.3
Dataset updated
Jul 25, 2022
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2

Abstract

The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

Data Description

The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.
• Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
• Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
• Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
• Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
• Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
• Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)

The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
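As a rough sketch of how the Tweet ID files might be prepared for hydration, the snippet below reads IDs from text files and groups them into batches. It assumes one Tweet ID per line, which the description does not state, and the batch size is a placeholder rather than a documented API limit; the hydration call itself is deliberately left out because it depends on the tool and API access being used.

```python
from pathlib import Path


def load_tweet_ids(paths):
    """Read Tweet IDs from text files, assuming one ID per line."""
    ids = []
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:
                ids.append(line)
    return ids


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Per-file counts as listed in the data description above.
listed_counts = [13926, 17705, 17585, 19718, 47718, 138711]
total_listed = sum(listed_counts)  # matches the stated total of 255,363
```

A call such as `list(batches(load_tweet_ids(["TweetIDs_Part1.txt"]), 100))` would then give lists of IDs ready to hand to a hydration tool.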
Zoonotic Disease
rochester.figshare.com
txt
Updated Sep 17, 2025
Cite
Aabha Pandit; Alois Romanowski; Heather Owen (2025). Zoonotic Disease [Dataset]. http://doi.org/10.60593/ur.d.26462428.v1
Available download formats: txt
Unique identifier
https://doi.org/10.60593/ur.d.26462428.v1
Dataset updated
Sep 17, 2025
Dataset provided by
University of Rochester
Authors
Aabha Pandit; Alois Romanowski; Heather Owen
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Zoonotic Disease Dataset

Zoonotic diseases are infections that spread between people and animals. This dataset contains information to investigate the correlation between climate variables (temperature, precipitation) and zoonotic disease in different countries across different years. The data is clean and does not have missing values.

Dataset Variables:
Country: region from where data was collected
Year: year when data was collected
Temperature: collected in degrees Celsius
Precipitation: collected in millimeters (mm)
Zoonotic Cases: number of zoonotic infections
Population Density: number of people per square kilometer of country
Urbanization Rate: percentage of country's population living in urban areas
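The correlation this dataset is meant to support could be measured, for example, with a Pearson coefficient between a climate variable and the case counts. The following is a minimal sketch, assuming the Temperature and Zoonotic Cases columns have already been loaded into two equal-length lists; the numbers in the usage line are made-up toy values, not values from the dataset.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    temperature_c = [18.0, 21.5, 24.0, 27.5]  # toy values for illustration only
    zoonotic_cases = [120, 150, 170, 210]     # toy values for illustration only
    print(round(pearson(temperature_c, zoonotic_cases), 3))
```

A coefficient near +1 or -1 only indicates a linear association; it says nothing about causation, and country or year effects would need to be handled separately.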
Mortality Rates
catalog.data.gov
datasets.ai
+4 more
Updated Nov 22, 2024
Cite
Lake County Illinois GIS (2024). Mortality Rates [Dataset]. https://catalog.data.gov/dataset/mortality-rates-6fb72
Dataset updated
Nov 22, 2024
Dataset provided by
Lake County Illinois GIS
Description
Mortality Rates for Lake County, Illinois. Explanation of field attributes:
Average Age of Death – The average age at which people in the given zip code die.
Cancer Deaths – Cancer deaths refers to individuals who have died of cancer as the underlying cause. This is a rate per 100,000.
Heart Disease Related Deaths – Heart Disease Related Deaths refers to individuals who have died of heart disease as the underlying cause. This is a rate per 100,000.
COPD Related Deaths – COPD Related Deaths refers to individuals who have died of chronic obstructive pulmonary disease (COPD) as the underlying cause. This is a rate per 100,000.
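To make the "rate per 100,000" unit concrete, here is a minimal sketch of a crude rate calculation. The description does not say whether the published rates are crude or age-adjusted, so the crude formula and the function name `rate_per_100k` are assumptions for illustration.

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Crude death rate expressed per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to keep the arithmetic exact for whole numbers.
    return deaths * 100_000 / population


if __name__ == "__main__":
    # Toy example: 25 deaths in a zip code of 50,000 residents.
    print(rate_per_100k(25, 50_000))
```

Expressing counts this way lets zip codes of very different sizes be compared on the same scale.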
Data from: Coronary Heart Disease Mortality
data.lacounty.gov
arc-gis-hub-home-arcgishub.hub.arcgis.com
+1 more
Updated Dec 19, 2023
Cite
County of Los Angeles (2023). Coronary Heart Disease Mortality [Dataset]. https://data.lacounty.gov/datasets/coronary-heart-disease-mortality/about
Dataset updated
Dec 19, 2023
Dataset authored and provided by
County of Los Angeles
Description
Death rate has been age-adjusted to the 2000 U.S. standard population. Single-year data are only available for Los Angeles County overall, Service Planning Areas, Supervisorial Districts, City of Los Angeles overall, and City of Los Angeles Council Districts.

Coronary heart disease is a type of heart disease in which the arteries of the heart cannot deliver enough oxygen-rich blood to the heart muscles. Over time, this can weaken the heart muscle and may lead to heart attack or heart failure. It is the most common type of heart disease in the US and has been the leading cause of death in Los Angeles County for the last two decades. Poor diet, sedentary lifestyle, tobacco exposure, and chronic stress are all important risk factors for coronary heart disease. Cities and communities can mitigate these risks by improving local food environments and encouraging physical activity by making communities safer and more walkable.

For more information about the Community Health Profiles Data Initiative, please see the initiative homepage.
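Age adjustment is commonly done by direct standardization: each age group's death rate is weighted by that group's share of a standard population, and the weighted rates are summed. The sketch below illustrates the idea under that assumption; the description does not state the exact method used, and the two age groups and weights in the usage lines are toy values, not the actual 2000 U.S. standard population weights.

```python
def age_adjusted_rate(deaths: list, population: list, std_weights: list) -> float:
    """Directly standardized death rate per 100,000.

    deaths, population and std_weights are parallel lists, one entry per age group.
    std_weights are standard-population counts or shares; they are normalized here.
    """
    if not (len(deaths) == len(population) == len(std_weights)):
        raise ValueError("all inputs must have one entry per age group")
    total_weight = sum(std_weights)
    if total_weight <= 0:
        raise ValueError("standard weights must sum to a positive value")
    adjusted = 0.0
    for d, p, w in zip(deaths, population, std_weights):
        age_specific_rate = d * 100_000 / p  # rate per 100,000 in this age group
        adjusted += age_specific_rate * (w / total_weight)
    return adjusted


if __name__ == "__main__":
    # Toy inputs: two age groups with hypothetical standard weights.
    print(age_adjusted_rate([10, 40], [10_000, 20_000], [0.75, 0.25]))
```

Because the same standard weights are applied everywhere, areas with older or younger populations can be compared without the age mix driving the difference.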
Sexually Transmitted Diseases in Females for Data Security and Privacy in 3D...
data.mendeley.com
Updated Nov 10, 2023
+ more versions
Cite
ankita gupta (2023). Sexually Transmitted Diseases in Females for Data Security and Privacy in 3D Modeling for Healthcare [Dataset]. http://doi.org/10.17632/ty68672dnz.1
Unique identifier
https://doi.org/10.17632/ty68672dnz.1
Dataset updated
Nov 10, 2023
Authors
ankita gupta
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview: This dataset comprises information collected from 450 females and is utilized for research in the field of Data Security and Privacy in 3D Modeling for Healthcare. The dataset is focused on the prevalence and analysis of sexually transmitted diseases (STDs) in females. It provides valuable insights into the demographics, behaviors, and health status of the study participants.

Data Fields: The dataset includes the following key fields for each of the 450 females:

S.no: A unique identifier assigned to each participant.
Age: The age of the female participant, ranging from young adults to middle-aged individuals.
Intimate Partners: The number of intimate partners the female has had, indicating their level of sexual activity.
Protection Usage: A three-level categorical variable (0: Never, 1: Sometimes, 2: Always) representing the usage of protection during sexual activity.
Symptoms: A binary variable (0: No symptoms, 1: Symptoms) indicating the presence or absence of symptoms related to STDs.
Location: The location of the participant, categorized into general city/district areas, which can provide geographical context.
Education: A binary variable (0: Low education, 1: High education) representing the education level of the participant.
STD Testing History: A binary variable (0: No, 1: Yes) indicating whether the participant has a history of undergoing anonymous STD testing.
STD Status: A binary variable (0: Uninfected, 1: Infected) reflecting the STD status of the female participants.

Usage: This dataset serves as a valuable resource for researchers in the fields of healthcare, data security, and 3D modeling. Researchers can leverage this dataset to explore the relationship between demographic factors, behaviors, and STD prevalence among females. It is particularly relevant for studies that aim to enhance data security and privacy while utilizing 3D modeling techniques for healthcare applications.

Data Privacy and Ethics: The collection of this dataset adheres to ethical and privacy considerations, with a focus on ensuring the anonymity and confidentiality of the study participants. Personal identifiers have been removed to protect the privacy of the individuals.

Citation: If you intend to use this dataset in your research, please consider citing the source and acknowledging the data collection process. Proper citation helps maintain transparency and credit the researchers and institutions involved in data collection.
Heart Failure Prediction - Dataset - CKAN
data.poltekkes-smg.ac.id
Updated Oct 8, 2024
+ more versions
Cite
(2024). Heart Failure Prediction - Dataset - CKAN [Dataset]. https://data.poltekkes-smg.ac.id/dataset/heart-failure-prediction
Dataset updated
Oct 8, 2024
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure. Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies. People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management, for which a machine learning model can be of great help.
COVID-19 Case Surveillance Public Use Data
data.cdc.gov
data.virginia.gov
+7 more
csv, xlsx, xml
Updated Jul 9, 2024
+ more versions
Cite
CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/widgets/vbim-akqf
Available download formats: xml, xlsx, csv
Dataset updated
Jul 9, 2024
Dataset provided by
Centers for Disease Control and Prevention http://www.cdc.gov/
Authors
CDC Data, Analytics and Visualization Task Force
License
https://www.usa.gov/government-works
Description
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

CDC has three COVID-19 case surveillance datasets:
COVID-19 Case Surveillance Public Use Data with Geography: Public use, patient-level dataset with clinical data (including symptoms), demographics, and county and state of residence. (19 data elements)
COVID-19 Case Surveillance Public Use Data: Public use, patient-level dataset with clinical and symptom data and demographics, with no geographic data. (12 data elements)
COVID-19 Case Surveillance Restricted Access Detailed Data: Restricted access, patient-level dataset with clinical and symptom data, demographics, and state and county of residence. Access requires a registration process and a data use agreement. (33 data elements)
The following apply to all three datasets:
Data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
Data are considered provisional by CDC and are subject to change until the data are reconciled and verified with the state and territorial data providers.
Some data cells are suppressed to protect individual privacy.
The datasets will include all cases with the earliest date available in each record (date received by CDC or date related to illness/specimen collection) at least 14 days prior to the creation of the current datasets. This 14-day lag allows case reporting to be stabilized and ensures that time-dependent outcome data are accurately captured.
Datasets are updated monthly.
Datasets are created using CDC’s Policy on Public Health Research and Nonresearch Data Management and Access and include protections designed to protect individual privacy.
For more information about data collection and reporting, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/about-us-cases-deaths.html.
For more information about the COVID-19 case surveillance data, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/faq-surveillance.html
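The 14-day lag described in the list above amounts to a date filter: a record is included only if its earliest available date falls at least 14 days before the dataset creation date. A minimal sketch follows, assuming each record is a dictionary with a hypothetical `earliest_date` field; the field name and function name are illustrative and not taken from the CDC schema.

```python
from datetime import date, timedelta


def eligible_records(records: list, creation_date: date, lag_days: int = 14) -> list:
    """Keep records whose earliest date is at least lag_days before creation_date."""
    cutoff = creation_date - timedelta(days=lag_days)
    return [r for r in records if r["earliest_date"] <= cutoff]


if __name__ == "__main__":
    sample = [
        {"id": 1, "earliest_date": date(2024, 6, 1)},
        {"id": 2, "earliest_date": date(2024, 6, 25)},  # too recent, excluded
    ]
    kept = eligible_records(sample, creation_date=date(2024, 7, 1))
    print([r["id"] for r in kept])
```

Holding back the most recent two weeks gives late-arriving reports and outcome updates time to reach CDC before the records are published.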

Overview

The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

For more information: NNDSS Supports the COVID-19 Response | CDC.

The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

COVID-19 Case Reports

COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.

Data are Considered Provisional

The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

Data Limitations

To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

Data Quality Assurance Procedures

CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.
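The first two cleaning steps above can be sketched as small recoding functions: blank answers become "Missing", and a symptom onset date in the future is set to null. This is an illustration under stated assumptions, not CDC's actual pipeline; the function names and the use of `None` to represent null are choices made for this example.

```python
from datetime import date
from typing import Optional


def recode_blank(answer: Optional[str]) -> str:
    """Recode an unanswered (blank) question to 'Missing'; keep other answers as-is."""
    if answer is None or not answer.strip():
        return "Missing"
    return answer


def null_future_date(onset: Optional[date], today: date) -> Optional[date]:
    """Return None for an onset date later than today; otherwise return it unchanged."""
    if onset is not None and onset > today:
        return None
    return onset


if __name__ == "__main__":
    print(recode_blank(""), recode_blank("Unknown"))
    print(null_future_date(date(2030, 1, 1), today=date(2024, 7, 1)))
```

Note that "Unknown" is a real answer choice on the form and is left untouched; only a blank is treated as missing.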

Data Suppression

To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
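One way to picture the suppression rule is as a frequency check on combinations of demographic fields: if a combination appears in fewer than 5 records, those fields are recoded to "NA" while the record itself is kept. The sketch below is only an approximation under that assumption; the field names, the `suppress_rare` helper, and the choice to recode all three fields together are hypothetical and may differ from CDC's procedure.

```python
from collections import Counter

# Hypothetical names for the demographic fields mentioned in the description.
DEMOGRAPHIC_KEYS = ("sex", "age_group", "race_ethnicity")


def suppress_rare(records: list, threshold: int = 5) -> list:
    """Recode demographic fields to 'NA' when their combination is rare.

    Records are never dropped; a new list of (possibly recoded) copies is returned.
    """
    combos = Counter(tuple(r[k] for k in DEMOGRAPHIC_KEYS) for r in records)
    result = []
    for r in records:
        combo = tuple(r[k] for k in DEMOGRAPHIC_KEYS)
        if combos[combo] < threshold:
            # Copy the record and overwrite only the demographic fields.
            r = {**r, **{k: "NA" for k in DEMOGRAPHIC_KEYS}}
        result.append(r)
    return result


if __name__ == "__main__":
    common = {"sex": "F", "age_group": "30-39", "race_ethnicity": "A"}
    rare = {"sex": "M", "age_group": "80+", "race_ethnicity": "B"}
    cleaned = suppress_rare([dict(common) for _ in range(5)] + [rare])
    print(len(cleaned), cleaned[-1]["sex"])
```

Keeping the suppressed records in place preserves total case counts while hiding the rare attribute combinations that could identify someone.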

For questions, please contact Ask SRRG (eocevent394@cdc.gov).

Additional COVID-19 Data

COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
Variables for Alzheimer's analysis (without PII data)
catalog.data.gov
datasets.ai
+1 more
Updated Dec 13, 2021
Cite
U.S. EPA Office of Research and Development (ORD) (2021). Variables for Alzheimer's analysis (without PII data) [Dataset]. https://catalog.data.gov/dataset/variables-for-alzheimers-analysis-without-pii-data
Dataset updated
Dec 13, 2021
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
Organized by zip code: rates of Alzheimer's disease, percent of land cover types, modelled PM2.5, and socioeconomic variables. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Lucas Neas (CPHEA/PHESD/EB) is the owner of the copy of this dataset that was used. Format: Medicare database. This dataset is associated with the following publication: Wu, J., and L. Jackson. Greenspace inversely associated with the risk of Alzheimer’s disease in the mid-Atlantic United States. Earth. MDPI AG, Basel, SWITZERLAND, 2(1): 140-150, (2021).
U.S. Chronic Disease Indicators
kaggle.com
zip
Updated Jun 23, 2024
+ more versions
Cite
Sahir Maharaj (2024). U.S. Chronic Disease Indicators [Dataset]. https://www.kaggle.com/datasets/sahirmaharajj/u-s-chronic-disease-indicators
Available download formats: zip (8271208 bytes)
Dataset updated
Jun 23, 2024
Authors
Sahir Maharaj
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Area covered
United States
Description
CDC's Division of Population Health provides a cross-cutting set of 115 indicators developed by consensus among CDC, the Council of State and Territorial Epidemiologists, and the National Association of Chronic Disease Directors. These indicators allow states and territories to uniformly define, collect, and report chronic disease data that are important to public health practice in their area.

This dataset is extremely useful for public health data science as it enables the study of prevalence and distribution of chronic diseases across different demographics and geographical areas. Analysts can assess health outcomes, identify risk factors, and measure the impact of public health interventions.

Some analyses that can be performed include:

Epidemiological Studies: Understand the distribution and determinants of health-related states or events.

Geospatial Analysis: Map disease prevalence to geographic locations to identify hotspots or areas in need of targeted interventions.

Longitudinal Analysis: Study trends over time to see how health indicators change in response to public health policies or other external factors.

