Facebook
TwitterBy Saloni Dattani, Lucas Rodés-Guirao, Edouard Mathieu, Hannah Ritchie and Max Roser.
Data description:
Disease outbreaks may be inevitable, but large-scale pandemics are not. The world can respond swiftly and effectively to pandemic risks in the future with better understanding, resources, and effort.
To avoid suffering through another large pandemic, we have to take the risk of pandemics seriously. Despite warnings that another one was likely, the COVID-19 pandemic killed more than 27 million people.1
We must build the capacity to test for pathogens and understand them: which pathogens put us at the greatest risk, how they spread, and how to tackle them.
We know it is possible to greatly reduce the risk of infectious disease. We’ve learned over history how to reduce their impact with vaccines, public health efforts, and medicine.
In addition to the old risks, we face new threats from factory farming, genetic modification, climate change, and antimicrobial resistance. With more attention and effort, we can reduce their risks too.
Good luck in your analysis.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BackgroundFoodborne diseases are important worldwide, resulting in considerable morbidity and mortality. To our knowledge, we present the first global and regional estimates of the disease burden of the most important foodborne bacterial, protozoal, and viral diseases.Methods and FindingsWe synthesized data on the number of foodborne illnesses, sequelae, deaths, and Disability Adjusted Life Years (DALYs), for all diseases with sufficient data to support global and regional estimates, by age and region. The data sources included varied by pathogen and included systematic reviews, cohort studies, surveillance studies and other burden of disease assessments. We sought relevant data circa 2010, and included sources from 1990–2012. The number of studies per pathogen ranged from as few as 5 studies for bacterial intoxications through to 494 studies for diarrheal pathogens. To estimate mortality for Mycobacterium bovis infections and morbidity and mortality for invasive non-typhoidal Salmonella enterica infections, we excluded cases attributed to HIV infection. We excluded stillbirths in our estimates. We estimate that the 22 diseases included in our study resulted in two billion (95% uncertainty interval [UI] 1.5–2.9 billion) cases, over one million (95% UI 0.89–1.4 million) deaths, and 78.7 million (95% UI 65.0–97.7 million) DALYs in 2010. To estimate the burden due to contaminated food, we then applied proportions of infections that were estimated to be foodborne from a global expert elicitation. Waterborne transmission of disease was not included. We estimate that 29% (95% UI 23–36%) of cases caused by diseases in our study, or 582 million (95% UI 401–922 million), were transmitted by contaminated food, resulting in 25.2 million (95% UI 17.5–37.0 million) DALYs. Norovirus was the leading cause of foodborne illness causing 125 million (95% UI 70–251 million) cases, while Campylobacter spp. caused 96 million (95% UI 52–177 million) foodborne illnesses. Of all foodborne diseases, diarrheal and invasive infections due to non-typhoidal S. enterica infections resulted in the highest burden, causing 4.07 million (95% UI 2.49–6.27 million) DALYs. Regionally, DALYs per 100,000 population were highest in the African region followed by the South East Asian region. Considerable burden of foodborne disease is borne by children less than five years of age. Major limitations of our study include data gaps, particularly in middle- and high-mortality countries, and uncertainty around the proportion of diseases that were foodborne.ConclusionsFoodborne diseases result in a large disease burden, particularly in children. Although it is known that diarrheal diseases are a major burden in children, we have demonstrated for the first time the importance of contaminated food as a cause. There is a need to focus food safety interventions on preventing foodborne diseases, particularly in low- and middle-income settings.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Users can find data on a range of global health topics like mortality, the burden of disease, infectious diseases, risk factors and health expenditures. Background The Global Health Observatory (GHO) database is the World Health Organization's main health statistics repository. Data is available for 193 World Health Organization member states on topics including but not limited to: Health related millennium goals, mortality, immunization, nutrition, infectious disease, non- communicable disease, tobacco control, violence, injuries, alcohol, HIV/AIDS, tuberculosis, malaria, water and sanitation, maternal and reproductive health, cho lera, child health, child nutrition, and road safety. User FunctionalityUsers can generate tables and charts according to country or region, health indicator, and time period. Data can also be compared across countries. Data can be filtered, tabulated, charted, and downloaded into Excel statistical software. These data are also published in statistical reports covering topics including: Alcohol and health, Child health, Cholera, HIV/AIDS, Malaria, Maternal and reproductive heal th, Non-communicable diseases, Public health and environment, Road safety, Tuberculosis, Tobacco control. Data Notes Data are derived from surveillance and household surveys. Years in which data were collected is indicated with these health statistics. Information is available for each WHO member country and international region. The most recent data is available from 2009.
Facebook
TwitterBy Health [source]
This dataset provides comprehensive information on the number and rate of infectious diseases in California. Focusing on counties, sexes, and various diseases between 2001-2014, it offers powerful insights into the health status of its citizens. Its data also reveals trends in the spread of common illnesses in this state. Whether you are an epidemiologist looking to inform public health policy or a researcher seeking to investigate particular illnesses within certain populations, this dataset contains all the necessary information to answer your questions. Explore it today and discover hidden stories waiting to be uncovered!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This dataset contains counts and rates of infectious diseases in California by county, disease, sex, and year. This dataset can be used to generate trends to understand the changes in incidence of different types of diseases over time and across counties or between sexes.
To use this dataset: - Select the columns you are interested in exploring - these could include Disease, County, Sex or Year. - Filter out the rows that do not relate to your question - for example filtering by a specific county or disease. - Examine the average rate per 100000 people for each group you selected as well as its lower and upper confidence intervals (CI). - Use Rate as your dependent variable for analysis; Population is likely also important determining factors. Make sure to check if any Rates have 'unstable' flags.
- Visualise or statistically analyse your data using suitable methods such as descriptive statistics (means/medians/mode etc.)for comparison between 2+ groups or correlation/regression based models when comparing one variable to another over time etc.
- Analyzing the geographic spread of infectious diseases over time to identify areas in need of increased education, resources, and care.
- Comparing rates of disease by sex to identify and understand any gender-based differences in infectious disease cases.
- Using the Unstable column to determine whether a particular county or region needs further study of a certain type of infectious disease due to unusual spikes or drops in rate or count during a specific year
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even commercially. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate if changes were made. - ShareAlike - You must distribute your contributions under the same license as the original. - Keep intact - all notices that refer to this license, including copyright notices.
File: Infectious_Disease_Cases_by_County_Year_and_Sex_2001-2014.csv | Column name | Description | |:---------------|:---------------------------------------------------------------------------------------------------------------| | Disease | The type of infectious disease reported. (String) | | County | The county in California where the cases were reported. (String) | | Year | The year in which the cases were reported. (Integer) | | Sex | The gender of the individuals who contracted the disease. (String) | | Population | The population size of the county in which the cases were reported. (Integer) | | Rate | The rate of infection per 100 thousand people living in the county. (Float) | | CI.lower | The lower confidence interval associated with the rate of infection. (Float) | | CI.upper | The upper confidence interval associated with the rate of infection. (Float) ...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This formatted dataset (AnalysisDatabaseGBD) originates from raw data files from the Institute of Health Metrics and Evaluation (IHME) Global Burden of Disease Study (GBD2017) affiliated with the University of Washington. We are volunteer collaborators with IHME and not employed by IHME or the University of Washington.
The population weighted GBD2017 data are on male and female cohorts ages 15-69 years including noncommunicable diseases (NCDs), body mass index (BMI), cardiovascular disease (CVD), and other health outcomes and associated dietary, metabolic, and other risk factors. The purpose of creating this population-weighted, formatted database is to explore the univariate and multiple regression correlations of health outcomes with risk factors. Our research hypothesis is that we can successfully model NCDs, BMI, CVD, and other health outcomes with their attributable risks.
These Global Burden of disease data relate to the preprint: The EAT-Lancet Commission Planetary Health Diet compared with Institute of Health Metrics and Evaluation Global Burden of Disease Ecological Data Analysis.
The data include the following:
1. Analysis database of population weighted GBD2017 data that includes over 40 health risk factors, noncommunicable disease deaths/100k/year of male and female cohorts ages 15-69 years from 195 countries (the primary outcome variable that includes over 100 types of noncommunicable diseases) and over 20 individual noncommunicable diseases (e.g., ischemic heart disease, colon cancer, etc).
2. A text file to import the analysis database into SAS
3. The SAS code to format the analysis database to be used for analytics
4. SAS code for deriving Tables 1, 2, 3 and Supplementary Tables 5 and 6
5. SAS code for deriving the multiple regression formula in Table 4.
6. SAS code for deriving the multiple regression formula in Table 5
7. SAS code for deriving the multiple regression formula in Supplementary Table 7
8. SAS code for deriving the multiple regression formula in Supplementary Table 8
9. The Excel files that accompanied the above SAS code to produce the tables
For questions, please email davidkcundiff@gmail.com. Thanks.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains disease names along with the symptoms faced by the respective patient. There are a total of 773 unique diseases and 377 symptoms, with ~246,000 rows. The dataset was artificially generated, preserving Symptom Severity and Disease Occurrence Possibility. Several distinct groups of symptoms might all be indicators of the same disease. There may even be one single symptom contributing to a disease in a row or sample. This is an indicator of a very high correlation between the symptom and that particular disease. A larger number of rows for a particular disease corresponds to its higher probability of occurrence in the real world. Similarly, in a row, if the feature vector has the occurrence of a single symptom, it implies that this symptom has more correlation to classify the disease than any one symptom of a feature vector with multiple symptoms in another sample.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Users can view statistics and generate cross-country comparisons pertaining to infectious diseases and health indicators in 193 WHO member states. Background The Global Health Atlas is a database maintained by the World Health Organization (WHO) that provides information regarding infectious diseases in WHO member states. Health conditions include: malaria, HIV/AIDS, cholera, STIs, meningitis, and polio, among others. User Functionality Users can generate statistics regarding infectious diseases and health systems indicators by country or region, or generate cross-country comparisons. In addition, users can v iew maps showing the distribution of various health indicators and diseases by geographic region or individual country. Data Notes Statistics are available for all WHO member states. Data are available from 1949 to 2009.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Health care in the United States is provided by many distinct organizations. Health care facilities are largely owned and operated by private sector businesses. 58% of US community hospitals are non-profit, 21% are government owned, and 21% are for-profit. According to the World Health Organization (WHO), the United States spent more on healthcare per capita ($9,403), and more on health care as percentage of its GDP (17.1%), than any other nation in 2014. Many different datasets are needed to portray different aspects of healthcare in US like disease prevalences, pharmaceuticals and drugs, Nutritional data of different food products available in US. Such data is collected by surveys (or otherwise) conducted by Centre of Disease Control and Prevention (CDC), Foods and Drugs Administration, Center of Medicare and Medicaid Services and Agency for Healthcare Research and Quality (AHRQ). These datasets can be used to properly review demographics and diseases, determining start ratings of healthcare providers, different drugs and their compositions as well as package informations for different diseases and for food quality. We often want such information and finding and scraping such data can be a huge hurdle. So, Here an attempt is made to make available all US healthcare data at one place to download from in csv files.
Facebook
TwitterProject Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically from national or international health authorities, such as the US Centers for Disease Control or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source and no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability. We also formatted the data into a standard data format.
Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. The United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.
Depending on the intended use of a dataset, we recommend a few data processing steps before analysis: - Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete, due to incompleteness of source documents) and users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported. - Separate cumulative from non-cumulative time interval series. Case count time series in Project Tycho datasets can be "cumulative" or "fixed-intervals". Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicated this with an attribute for each count value, named "PartOfCumulativeCountSeries".
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data on causes of death (COD) provide information on mortality patterns and form a major element of public health information.
The COD data refer to the underlying cause which - according to the World Health Organisation (WHO) - is "the disease or injury which initiated the train of morbid events leading directly to death, or the circumstances of the accident or violence which produced the fatal injury".
The data are derived from the medical certificate of death, which is obligatory in the Member States. The information recorded in the death certificate is according to the rules specified by the WHO.
Data published in Eurostat's dissemination database are broken down by sex, 5-year age groups, cause of death and by residency and country of occurrence. For stillbirths and neonatal deaths additional breakdowns might include age of mother and parity.
Data are available for Member States, Iceland, Norway, Liechtenstein, Switzerland, United Kingdom, Serbia, Turkey, North Macedonia and Albania. Regional data (NUTS level 2) are available for all of the countries having NUTS2 regions except Albania.
Annual national data are available in Eurostat's dissemination database in absolute number, crude death rates and standardised death rates. At regional level the same is provided in form of 3-years averages (the average of year, year -1 and year -2). Annual crude and standardised death rates are also available at NUTS2 level. Monthly national data are available for 21 EU Member States from reference year 2019 and in 24 Member States from reference year 2022 in absolute numbers and standardised death rates.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this day and age, people face a lot of stress due to the fast pace of life. Due to this, people in today's digital age, suffer from a plethora of ailments. It is universally accepted that a greater awareness of ailments and their corresponding symptoms leads to an increased lifespan and better quality of life. Early detection and screening can help doctors nip diseases in their natal stages. However, not everyone is aware of them, which makes it a global issue. The study of the degree of disease awareness amongst people belonging to different nations and continents is a matter of great interest. One method that is suitable for this purpose is using clinical data. But, this data is not readily available. However, today a plethora of platforms are available to people to share their thoughts and experiences. People post about many of the important events in their lives on social media. Their posts offer a microscopic view into their lives and thought processes. Based on this intuition, twitter data pertaining to various chronic and acute diseases has been collected. Tweets for 30 deadly ailments have been collected over a period of 3 months amounting to a total of 19 million. A feature extraction approach is proposed which is used to identify the disease awareness levels across different nations. Deriving the global awareness landscape for ailments can help to identify regions which are well aware and also those that need to get aware. Clustering has been used for this purpose.
Facebook
TwitterProject Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically from national or international health authorities, such as the US Centers for Disease Control or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source and no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability. We also formatted the data into a standard data format.
Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. The United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.
Depending on the intended use of a dataset, we recommend a few data processing steps before analysis: - Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete, due to incompleteness of source documents) and users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported. - Separate cumulative from non-cumulative time interval series. Case count time series in Project Tycho datasets can be "cumulative" or "fixed-intervals". Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicated this with an attribute for each count value, named "PartOfCumulativeCountSeries".
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2
Abstract The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files. • Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022) • Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022) • Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022) • Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022) • Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022) • Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)
The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Originally, the dataset come from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.". The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns. The vast majority of columns are questions asked to respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]".
To improve the efficiency and relevance of our analysis, we removed certain attributes from the original BRFSS dataset. Many of the 279 original attributes included administrative codes, metadata, or survey-specific variables that do not contribute meaningfully to heart disease prediction—such as respondent IDs, timestamps, state-level identifiers, and detailed lifestyle questions unrelated to cardiovascular health. By focusing on a carefully selected subset of 18 attributes directly linked to medical, behavioral, and demographic factors known to influence heart health, we streamlined the dataset. This not only reduced computational complexity but also improved model interpretability and performance by eliminating noise and irrelevant information. All predicting variables could be divided into 4 broad categories:
Demographic factors: sex, age category (14 levels), race, BMI (Body Mass Index)
Diseases: weather respondent ever had such diseases as asthma, skin cancer, diabetes, stroke or kidney disease (not including kidney stones, bladder infection or incontinence)
Unhealthy habits:
General Health:
Below is a description of the features collected for each patient:
| # | Feature | Coded Variable Name | Description |
|---|---|---|---|
| 1 | HeartDisease | CVDINFR4 | Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) |
| 2 | BMI | _BMI5CAT | Body Mass Index (BMI) |
| 3 | Smoking | _SMOKER3 | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] |
| 4 | AlcoholDrinking | _RFDRHV7 | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week |
| 5 | Stroke | CVDSTRK3 | (Ever told) (you had) a stroke? |
| 6 | PhysicalHealth | PHYSHLTH | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 |
| 7 | MentalHealth | MENTHLTH | Thinking about your mental health, for how many days during the past 30 days was your mental health not good? |
| 8 | DiffWalking | DIFFWALK | Do you have serious difficulty walking or climbing stairs? |
| 9 | Sex | SEXVAR | Are you male or female? |
| 10 | AgeCategory | _AGE_G, | Fourteen-level age category |
| 11 | Race | _IMPRACE | Imputed race/ethnicity value |
| 12 | Diabetic | DIABETE4 | (Ever told) (you had) diabetes? |
| 13 | PhysicalActivity | EXERANY2 | Adults who reported doing physical activity or exercise during the past 30 days other than their regular job |
| 14 | GenHealth | GENHLTH | Would you say that in general your health is... |
| 15 | SleepTime | SLEPTIM1 | On average, how many hours of sleep do you get in a 24-hour period? |
| 16 | Asthma | CHASTHMA | (Ever told) (you had) asthma? |
| 17 | KidneyDisease | CHCKDNY2 | Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? |
| 18 | SkinCancer | CHCSCNCR | (Ever told) (you had) skin cancer? |
Facebook
TwitterData for deaths by leading cause of death categories are now available in the death profiles dataset for each geographic granularity. The cause of death categories are based solely on the underlying cause of death as coded by the International Classification of Diseases. The underlying cause of death is defined by the World Health Organization (WHO) as "the disease or injury which initiated the train of events leading directly to death, or the circumstances of the accident or violence which produced the fatal injury." It is a single value assigned to each death based on the details as entered on the death certificate. When more than one cause is listed, the order in which they are listed can affect which cause is coded as the underlying cause. This means that similar events could be coded with different underlying causes of death depending on variations in how they were entered. Consequently, while underlying cause of death provides a convenient comparison between cause of death categories, it may not capture the full impact of each cause of death as it does not always take into account all conditions contributing to the death. Cause of death categories for years 1999 and later are based on tenth revision of International Classification of Diseases (ICD-10) codes. Comparable categories are provided for years 1979 through 1998 based on ninth revision (ICD-9) codes. For more information on the comparability of cause of death classification between ICD revisions see Comparability of Cause-of-death Between ICD Revisions.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Facebook
TwitterAs a member of the World Organisation for Animal Health (OIE), and the reporting authority for the United States, the USGS National Wildlife Health Center (NWHC) is responsible for reporting wildlife disease outbreaks that involve diseases which are not OIE-listed (https://www.oie.int/wahis_2/public/wahidwild.php# ). These outbreaks are to be reported on a semesterly basis via OIE’s WAHIS-Wild reporting system. The data fields described within are based on those in WAHIS-Wild. Since OIE’s reporting mechanism is based primarily on domestic and agricultural animals, several of the variables are not applicable to wildlife (i.e. vaccination, slaughtered, etc.). In an effort to use a consistent data source that is broad in scope and captures information from around the country, from various natural resource management authorities, NWHC will use the Wildlife Health Information Sharing Partnership - Event Reporting System (WHISPers - https://whispers.usgs.gov/home ) as the sole source to generate and supply the requested information to OIE. Data supplied to OIE have been restricted to publicly available information on wildlife morbidity/mortality and surveillance events in WHISPers.
Facebook
TwitterProject Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically from national or international health authorities, such as the US Centers for Disease Control or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source and no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability. We also formatted the data into a standard data format.
Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. The United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.
Depending on the intended use of a dataset, we recommend a few data processing steps before analysis: - Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete, due to incompleteness of source documents) and users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported. - Separate cumulative from non-cumulative time interval series. Case count time series in Project Tycho datasets can be "cumulative" or "fixed-intervals". Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicated this with an attribute for each count value, named "PartOfCumulativeCountSeries".
Facebook
TwitterNNDSS - TABLE 1O. Hansen's disease to Hantavirus pulmonary syndrome - 2020. In this Table, provisional cases* of notifiable diseases are displayed for United States, U.S. territories, and Non-U.S. residents. Note: This table contains provisional cases of national notifiable diseases from the National Notifiable Diseases Surveillance System (NNDSS). NNDSS data from the 50 states, New York City, the District of Columbia and the U.S. territories are collated and published weekly on the NNDSS Data and Statistics web page (https://wwwn.cdc.gov/nndss/data-and-statistics.html). Cases reported by state health departments to CDC for weekly publication are provisional because of the time needed to complete case follow-up. Therefore, numbers presented in later weeks may reflect changes made to these counts as additional information becomes available. The national surveillance case definitions used to define a case are available on the NNDSS web site at https://wwwn.cdc.gov/nndss/. Information about the weekly provisional data and guides to interpreting data are available at: https://wwwn.cdc.gov/nndss/infectious-tables.html. Footnotes: U: Unavailable — The reporting jurisdiction was unable to send the data to CDC or CDC was unable to process the data. -: No reported cases — The reporting jurisdiction did not submit any cases to CDC. N: Not reportable — The disease or condition was not reportable by law, statute, or regulation in the reporting jurisdiction. NN: Not nationally notifiable — This condition was not designated as being nationally notifiable. NP: Nationally notifiable but not published. NC: Not calculated — There is insufficient data available to support the calculation of this statistic. Cum: Cumulative year-to-date counts. Max: Maximum — Maximum case count during the previous 52 weeks. * Case counts for reporting years 2019 and 2020 are provisional and subject to change. Cases are assigned to the reporting jurisdiction submitting the case to NNDSS, if the case's country of usual residence is the U.S., a U.S. territory, unknown, or null (i.e. country not reported); otherwise, the case is assigned to the 'Non-U.S. Residents' category. Country of usual residence is currently not reported by all jurisdictions or for all conditions. For further information on interpretation of these data, see https://wwwn.cdc.gov/nndss/document/Users_guide_WONDER_tables_cleared_final.pdf. †Previous 52 week maximum and cumulative YTD are determined from periods of time when the condition was reportable in the jurisdiction (i.e., may be less than 52 weeks of data or incomplete YTD data). § Includes data for old world hantavirus infections, such as Seoul virus infections. Prior to 2015, this condition was not nationally notifiable and data for this condition was not submitted to CDC's National Notifiable Diseases Surveillance System (NNDSS). ¶ Includes data for Andes virus infections.
Facebook
TwitterBy Saloni Dattani, Lucas Rodés-Guirao, Edouard Mathieu, Hannah Ritchie and Max Roser.
Data description:
Disease outbreaks may be inevitable, but large-scale pandemics are not. The world can respond swiftly and effectively to pandemic risks in the future with better understanding, resources, and effort.
To avoid suffering through another large pandemic, we have to take the risk of pandemics seriously. Despite warnings that another one was likely, the COVID-19 pandemic killed more than 27 million people.1
We must build the capacity to test for pathogens and understand them: which pathogens put us at the greatest risk, how they spread, and how to tackle them.
We know it is possible to greatly reduce the risk of infectious disease. We’ve learned over history how to reduce their impact with vaccines, public health efforts, and medicine.
In addition to the old risks, we face new threats from factory farming, genetic modification, climate change, and antimicrobial resistance. With more attention and effort, we can reduce their risks too.
Good luck in your analysis.