By Saloni Dattani, Lucas Rodés-Guirao, Edouard Mathieu, Hannah Ritchie and Max Roser.
Data description:
Disease outbreaks may be inevitable, but large-scale pandemics are not. The world can respond swiftly and effectively to pandemic risks in the future with better understanding, resources, and effort.
To avoid suffering through another large pandemic, we have to take the risk of pandemics seriously. Despite warnings that another one was likely, the COVID-19 pandemic killed more than 27 million people.
We must build the capacity to test for pathogens and understand them: which pathogens put us at the greatest risk, how they spread, and how to tackle them.
We know it is possible to greatly reduce the risk of infectious disease. We’ve learned over history how to reduce their impact with vaccines, public health efforts, and medicine.
In addition to the old risks, we face new threats from factory farming, genetic modification, climate change, and antimicrobial resistance. With more attention and effort, we can reduce their risks too.
Good luck in your analysis.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Users can find data on a range of global health topics such as mortality, the burden of disease, infectious diseases, risk factors, and health expenditures.

Background: The Global Health Observatory (GHO) database is the World Health Organization's main health statistics repository. Data are available for 193 World Health Organization member states on topics including, but not limited to: health-related millennium goals, mortality, immunization, nutrition, infectious disease, non-communicable disease, tobacco control, violence, injuries, alcohol, HIV/AIDS, tuberculosis, malaria, water and sanitation, maternal and reproductive health, cholera, child health, child nutrition, and road safety.

User Functionality: Users can generate tables and charts according to country or region, health indicator, and time period. Data can also be compared across countries. Data can be filtered, tabulated, charted, and downloaded into Excel statistical software. These data are also published in statistical reports covering topics including: Alcohol and health, Child health, Cholera, HIV/AIDS, Malaria, Maternal and reproductive health, Non-communicable diseases, Public health and environment, Road safety, Tuberculosis, and Tobacco control.

Data Notes: Data are derived from surveillance and household surveys. The years in which data were collected are indicated alongside these health statistics. Information is available for each WHO member country and international region. The most recent data available are from 2009.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
By Health [source]
This dataset provides comprehensive information on the number and rate of infectious diseases in California. Broken down by county, sex, and disease for the years 2001 to 2014, it offers insight into the health status of California residents and reveals trends in the spread of common illnesses in the state. Whether you are an epidemiologist looking to inform public health policy or a researcher investigating particular illnesses within certain populations, this dataset is a useful starting point.
This dataset contains counts and rates of infectious diseases in California by county, disease, sex, and year. This dataset can be used to generate trends to understand the changes in incidence of different types of diseases over time and across counties or between sexes.
To use this dataset:
- Select the columns you are interested in exploring, such as Disease, County, Sex, or Year.
- Filter out the rows that do not relate to your question, for example by keeping only a specific county or disease.
- Examine the average rate per 100,000 people for each group you selected, along with its lower and upper confidence interval (CI) bounds.
- Use Rate as your dependent variable for analysis; Population is likely also an important determining factor. Check whether any Rates carry 'unstable' flags.
- Visualise or statistically analyse your data using suitable methods, such as descriptive statistics (means, medians, modes, etc.) for comparisons between two or more groups, or correlation/regression-based models when comparing one variable to another over time.
- Analyzing the geographic spread of infectious diseases over time to identify areas in need of increased education, resources, and care.
- Comparing rates of disease by sex to identify and understand any gender-based differences in infectious disease cases.
- Using the Unstable column to determine whether a particular county or region needs further study of a certain type of infectious disease due to unusual spikes or drops in rate or count during a specific year
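The filtering and grouping steps described for this dataset can be sketched with the Python standard library. This is a minimal illustration, not the dataset author's method: the sample rows are made up, the helper name `mean_rate_by` is hypothetical, and the encoding of the Unstable column (assumed here to be `*` for unstable and blank otherwise) is an assumption that should be checked against the real file.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows that reuse the documented column names.
# The values are illustrative only and are NOT taken from the real file.
# ASSUMPTION: an unstable rate is flagged with "*" and a stable one is blank.
SAMPLE = """Disease,County,Year,Sex,Population,Rate,Unstable
Salmonellosis,Alameda,2001,Male,700000,12.0,
Salmonellosis,Alameda,2001,Female,720000,10.0,
Salmonellosis,Alameda,2002,Male,705000,14.0,
Salmonellosis,Alameda,2002,Female,725000,11.0,*
Pertussis,Alameda,2001,Male,700000,2.0,*
"""


def mean_rate_by(rows, disease, county, group_key="Sex", skip_unstable=True):
    """Average Rate per group for one disease in one county."""
    groups = defaultdict(list)
    for row in rows:
        # Filter out rows that do not relate to the question.
        if row["Disease"] != disease or row["County"] != county:
            continue
        # Optionally drop rates that carry an 'unstable' flag.
        if skip_unstable and row["Unstable"].strip():
            continue
        groups[row[group_key]].append(float(row["Rate"]))
    return {group: mean(rates) for group, rates in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = mean_rate_by(rows, "Salmonellosis", "Alameda")
print(result)
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an opened handle to the CSV and confirm how the Unstable column is actually coded before filtering on it.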
If you use this dataset in your research, please credit the original authors.
License: Dataset copyright by authors
- You are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit: provide a link to the license, and indicate if changes were made.
  - ShareAlike: you must distribute your contributions under the same license as the original.
  - Keep intact: all notices that refer to this license, including copyright notices.
File: Infectious_Disease_Cases_by_County_Year_and_Sex_2001-2014.csv

| Column name | Description |
|:---|:---|
| Disease | The type of infectious disease reported. (String) |
| County | The county in California where the cases were reported. (String) |
| Year | The year in which the cases were reported. (Integer) |
| Sex | The gender of the individuals who contracted the disease. (String) |
| Population | The population size of the county in which the cases were reported. (Integer) |
| Rate | The rate of infection per 100 thousand people living in the county. (Float) |
| CI.lower | The lower confidence interval associated with the rate of infection. (Float) |
| CI.upper | The upper confidence interval associated with the rate of infection. (Float) |

...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This formatted dataset (AnalysisDatabaseGBD) originates from raw data files from the Institute for Health Metrics and Evaluation (IHME) Global Burden of Disease Study (GBD2017), affiliated with the University of Washington. We are volunteer collaborators with IHME and are not employed by IHME or the University of Washington.
The population-weighted GBD2017 data cover male and female cohorts ages 15-69 years and include noncommunicable diseases (NCDs), body mass index (BMI), cardiovascular disease (CVD), and other health outcomes, together with associated dietary, metabolic, and other risk factors. The purpose of creating this population-weighted, formatted database is to explore the univariate and multiple regression correlations of health outcomes with risk factors. Our research hypothesis is that we can successfully model NCDs, BMI, CVD, and other health outcomes with their attributable risks.
These Global Burden of Disease data relate to the preprint: The EAT-Lancet Commission Planetary Health Diet compared with Institute of Health Metrics and Evaluation Global Burden of Disease Ecological Data Analysis.
The data include the following:
1. Analysis database of population-weighted GBD2017 data from 195 countries for male and female cohorts ages 15-69 years. It includes over 40 health risk factors, noncommunicable disease deaths/100k/year (the primary outcome variable, which covers over 100 types of noncommunicable diseases), and over 20 individual noncommunicable diseases (e.g., ischemic heart disease, colon cancer).
2. A text file to import the analysis database into SAS
3. The SAS code to format the analysis database to be used for analytics
4. SAS code for deriving Tables 1, 2, 3 and Supplementary Tables 5 and 6
5. SAS code for deriving the multiple regression formula in Table 4.
6. SAS code for deriving the multiple regression formula in Table 5
7. SAS code for deriving the multiple regression formula in Supplementary Table 7
8. SAS code for deriving the multiple regression formula in Supplementary Table 8
9. The Excel files that accompanied the above SAS code to produce the tables
For questions, please email davidkcundiff@gmail.com. Thanks.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Foodborne diseases are important worldwide, resulting in considerable morbidity and mortality. To our knowledge, we present the first global and regional estimates of the disease burden of the most important foodborne bacterial, protozoal, and viral diseases.

Methods and Findings: We synthesized data on the number of foodborne illnesses, sequelae, deaths, and Disability Adjusted Life Years (DALYs), for all diseases with sufficient data to support global and regional estimates, by age and region. The data sources included varied by pathogen and included systematic reviews, cohort studies, surveillance studies and other burden of disease assessments. We sought relevant data circa 2010, and included sources from 1990–2012. The number of studies per pathogen ranged from as few as 5 studies for bacterial intoxications through to 494 studies for diarrheal pathogens. To estimate mortality for Mycobacterium bovis infections and morbidity and mortality for invasive non-typhoidal Salmonella enterica infections, we excluded cases attributed to HIV infection. We excluded stillbirths in our estimates. We estimate that the 22 diseases included in our study resulted in two billion (95% uncertainty interval [UI] 1.5–2.9 billion) cases, over one million (95% UI 0.89–1.4 million) deaths, and 78.7 million (95% UI 65.0–97.7 million) DALYs in 2010. To estimate the burden due to contaminated food, we then applied proportions of infections that were estimated to be foodborne from a global expert elicitation. Waterborne transmission of disease was not included. We estimate that 29% (95% UI 23–36%) of cases caused by diseases in our study, or 582 million (95% UI 401–922 million), were transmitted by contaminated food, resulting in 25.2 million (95% UI 17.5–37.0 million) DALYs. Norovirus was the leading cause of foodborne illness causing 125 million (95% UI 70–251 million) cases, while Campylobacter spp. caused 96 million (95% UI 52–177 million) foodborne illnesses. Of all foodborne diseases, diarrheal and invasive infections due to non-typhoidal S. enterica infections resulted in the highest burden, causing 4.07 million (95% UI 2.49–6.27 million) DALYs. Regionally, DALYs per 100,000 population were highest in the African region followed by the South East Asian region. Considerable burden of foodborne disease is borne by children less than five years of age. Major limitations of our study include data gaps, particularly in middle- and high-mortality countries, and uncertainty around the proportion of diseases that were foodborne.

Conclusions: Foodborne diseases result in a large disease burden, particularly in children. Although it is known that diarrheal diseases are a major burden in children, we have demonstrated for the first time the importance of contaminated food as a cause. There is a need to focus food safety interventions on preventing foodborne diseases, particularly in low- and middle-income settings.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains disease names along with the symptoms faced by the respective patient. There are a total of 773 unique diseases and 377 symptoms, with ~246,000 rows. The dataset was artificially generated, preserving Symptom Severity and Disease Occurrence Possibility. Several distinct groups of symptoms might all be indicators of the same disease. There may even be one single symptom contributing to a disease in a row or sample. This is an indicator of a very high correlation between the symptom and that particular disease. A larger number of rows for a particular disease corresponds to its higher probability of occurrence in the real world. Similarly, in a row, if the feature vector has the occurrence of a single symptom, it implies that this symptom has more correlation to classify the disease than any one symptom of a feature vector with multiple symptoms in another sample.
Project Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically from national or international health authorities, such as the US Centers for Disease Control or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source and no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability. We also formatted the data into a standard data format.
Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. The United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.
Depending on the intended use of a dataset, we recommend a few data processing steps before analysis:
- Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete, due to incompleteness of source documents) and users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported.
- Separate cumulative from non-cumulative time interval series. Case count time series in Project Tycho datasets can be "cumulative" or "fixed-intervals". Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicated this with an attribute for each count value, named "PartOfCumulativeCountSeries".
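The two recommended processing steps can be sketched as follows. This is a minimal illustration under stated assumptions: only the attribute name `PartOfCumulativeCountSeries` comes from the description above, while the field names `PeriodStartDate`, `PeriodEndDate`, and `CountValue`, the 0/1 coding of the flag, the weekly interval length, and the helper names are all assumed for the example. The records are made up.

```python
from datetime import date, timedelta

# Made-up records. ASSUMPTIONS: field names other than
# PartOfCumulativeCountSeries, and the 0/1 coding of that flag.
records = [
    {"PeriodStartDate": date(2020, 1, 1), "PeriodEndDate": date(2020, 1, 7),
     "CountValue": 5, "PartOfCumulativeCountSeries": 0},
    {"PeriodStartDate": date(2020, 1, 15), "PeriodEndDate": date(2020, 1, 21),
     "CountValue": 0, "PartOfCumulativeCountSeries": 0},
    {"PeriodStartDate": date(2020, 1, 1), "PeriodEndDate": date(2020, 1, 14),
     "CountValue": 9, "PartOfCumulativeCountSeries": 1},
]


def split_series(rows):
    """Separate cumulative intervals from fixed-interval ones."""
    cumulative = [r for r in rows if r["PartOfCumulativeCountSeries"] == 1]
    fixed = [r for r in rows if r["PartOfCumulativeCountSeries"] != 1]
    return cumulative, fixed


def fill_weekly_gaps(fixed):
    """Return (start_date, count) pairs for every week between the first
    and last reported interval. None marks a week with no reported count,
    which is different from a reported count of zero."""
    if not fixed:
        return []
    by_start = {r["PeriodStartDate"]: r["CountValue"] for r in fixed}
    current, last = min(by_start), max(by_start)
    filled = []
    while current <= last:
        filled.append((current, by_start.get(current)))
        current += timedelta(days=7)
    return filled


cumulative, fixed = split_series(records)
weekly = fill_weekly_gaps(fixed)
for start, count in weekly:
    print(start.isoformat(), count)
```

Keeping unreported weeks as `None` rather than 0 preserves the distinction the description draws between a missing interval and a reported zero.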
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2
Abstract: The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.
Data Description: The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.
- Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
- Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
- Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
- Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
- Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
- Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)
The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
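Before hydrating, it can help to confirm the files are complete and to split the IDs into request-sized batches. The sketch below is an illustration only: it assumes each .txt file holds one Tweet ID per line, the helper names are hypothetical, and the batch size of 100 is a placeholder to be replaced with whatever limit the chosen hydration tool or API documents. It does not perform the hydration itself.

```python
from pathlib import Path

# Per-file ID counts as listed in the data description.
EXPECTED_COUNTS = {
    "TweetIDs_Part1.txt": 13926,
    "TweetIDs_Part2.txt": 17705,
    "TweetIDs_Part3.txt": 17585,
    "TweetIDs_Part4.txt": 19718,
    "TweetIDs_Part5.txt": 47718,
    "TweetIDs_Part6.txt": 138711,
}


def read_tweet_ids(path):
    """Read Tweet IDs from a file. ASSUMPTION: one ID per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def batches(ids, size=100):
    """Yield consecutive chunks of at most `size` IDs.
    The default of 100 is a placeholder, not a documented limit."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# The per-file counts add up to the stated dataset total.
total = sum(EXPECTED_COUNTS.values())
print(total)  # 255363
```

After downloading, comparing `len(read_tweet_ids(name))` with `EXPECTED_COUNTS[name]` for each file is a quick completeness check.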
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world." The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns. The vast majority of columns are questions asked to respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]".
To improve the efficiency and relevance of our analysis, we removed certain attributes from the original BRFSS dataset. Many of the 279 original attributes included administrative codes, metadata, or survey-specific variables that do not contribute meaningfully to heart disease prediction—such as respondent IDs, timestamps, state-level identifiers, and detailed lifestyle questions unrelated to cardiovascular health. By focusing on a carefully selected subset of 18 attributes directly linked to medical, behavioral, and demographic factors known to influence heart health, we streamlined the dataset. This not only reduced computational complexity but also improved model interpretability and performance by eliminating noise and irrelevant information. All predicting variables could be divided into 4 broad categories:
Demographic factors: sex, age category (14 levels), race, BMI (Body Mass Index)
Diseases: whether the respondent ever had such diseases as asthma, skin cancer, diabetes, stroke or kidney disease (not including kidney stones, bladder infection or incontinence)
Unhealthy habits:
General Health:
Below is a description of the features collected for each patient:
| # | Feature | Coded Variable Name | Description |
|---|---|---|---|
| 1 | HeartDisease | CVDINFR4 | Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) |
| 2 | BMI | _BMI5CAT | Body Mass Index (BMI) |
| 3 | Smoking | _SMOKER3 | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] |
| 4 | AlcoholDrinking | _RFDRHV7 | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) |
| 5 | Stroke | CVDSTRK3 | (Ever told) (you had) a stroke? |
| 6 | PhysicalHealth | PHYSHLTH | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? |
| 7 | MentalHealth | MENTHLTH | Thinking about your mental health, for how many days during the past 30 days was your mental health not good? |
| 8 | DiffWalking | DIFFWALK | Do you have serious difficulty walking or climbing stairs? |
| 9 | Sex | SEXVAR | Are you male or female? |
| 10 | AgeCategory | _AGE_G | Fourteen-level age category |
| 11 | Race | _IMPRACE | Imputed race/ethnicity value |
| 12 | Diabetic | DIABETE4 | (Ever told) (you had) diabetes? |
| 13 | PhysicalActivity | EXERANY2 | Adults who reported doing physical activity or exercise during the past 30 days other than their regular job |
| 14 | GenHealth | GENHLTH | Would you say that in general your health is... |
| 15 | SleepTime | SLEPTIM1 | On average, how many hours of sleep do you get in a 24-hour period? |
| 16 | Asthma | CHASTHMA | (Ever told) (you had) asthma? |
| 17 | KidneyDisease | CHCKDNY2 | Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? |
| 18 | SkinCancer | CHCSCNCR | (Ever told) (you had) skin cancer? |
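The attribute reduction described above amounts to keeping 18 coded columns and giving them readable names. The sketch below shows one way to do that with the mapping from the table; it is not the authors' actual preprocessing code. The helper name `select_features`, the sample row, and the `SEQNO` administrative column are hypothetical, and no recoding of answer values is attempted.

```python
# Feature name -> coded BRFSS variable name, copied from the table above.
FEATURE_TO_CODE = {
    "HeartDisease": "CVDINFR4",
    "BMI": "_BMI5CAT",
    "Smoking": "_SMOKER3",
    "AlcoholDrinking": "_RFDRHV7",
    "Stroke": "CVDSTRK3",
    "PhysicalHealth": "PHYSHLTH",
    "MentalHealth": "MENTHLTH",
    "DiffWalking": "DIFFWALK",
    "Sex": "SEXVAR",
    "AgeCategory": "_AGE_G",
    "Race": "_IMPRACE",
    "Diabetic": "DIABETE4",
    "PhysicalActivity": "EXERANY2",
    "GenHealth": "GENHLTH",
    "SleepTime": "SLEPTIM1",
    "Asthma": "CHASTHMA",
    "KidneyDisease": "CHCKDNY2",
    "SkinCancer": "CHCSCNCR",
}


def select_features(raw_row):
    """Keep only the 18 listed variables from one raw survey record and
    rename them; every other column in the record is dropped."""
    return {feature: raw_row[code] for feature, code in FEATURE_TO_CODE.items()}


# Made-up raw record: placeholder values plus one hypothetical
# administrative column (SEQNO) that should not survive the selection.
raw = {code: index for index, code in enumerate(FEATURE_TO_CODE.values())}
raw["SEQNO"] = "placeholder"

selected = select_features(raw)
print(len(selected), "features kept")
```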
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter, along with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Users can view statistics and generate cross-country comparisons pertaining to infectious diseases and health indicators in 193 WHO member states.

Background: The Global Health Atlas is a database maintained by the World Health Organization (WHO) that provides information regarding infectious diseases in WHO member states. Health conditions include: malaria, HIV/AIDS, cholera, STIs, meningitis, and polio, among others.

User Functionality: Users can generate statistics regarding infectious diseases and health systems indicators by country or region, or generate cross-country comparisons. In addition, users can view maps showing the distribution of various health indicators and diseases by geographic region or individual country.

Data Notes: Statistics are available for all WHO member states. Data are available from 1949 to 2009.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This formatted dataset originates from raw data files from the Institute for Health Metrics and Evaluation Global Burden of Disease (GBD2017). It is population-weighted worldwide data on male and female cohorts ages 15-69 years, including cardiovascular disease early death and associated dietary, metabolic and other risk factors. The purpose of creating this formatted database is to explore the univariate and multiple regression correlations of cardiovascular early deaths and other health outcomes with risk factors. Our research hypothesis is that we can successfully apply artificial intelligence to model cardiovascular disease outcomes with risk factors.

We found that fat-soluble vitamin containing foods (animal products) and added fats are negatively correlated with CVD early deaths worldwide but positively correlated with CVD early deaths in high fat-soluble vitamin cohorts. We interpret this as showing that optimal cardiovascular outcomes come with moderate (not low and not high) intakes of animal foods and added fats.

You are invited to download the dataset, the associated SAS code to access the dataset, and the tables that have resulted from the analysis. Please comment on the article by indicating what you found by exploring the dataset with the provided SAS codes. Please say whether or not you found that the outputs from the SAS codes accurately reflected the tables provided and the tables in the published article. If you use our data to reproduce our findings, comment on your findings on the medRxiv website (https://www.medrxiv.org/content/10.1101/2021.04.17.21255675v4), and would like to be recognized, we will be happy to list you as a contributor when the article is submitted to JAMA. For questions, please email davidkcundiff@gmail.com. Thanks.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data on causes of death (COD) provide information on mortality patterns and form a major element of public health information.
The COD data refer to the underlying cause which - according to the World Health Organisation (WHO) - is "the disease or injury which initiated the train of morbid events leading directly to death, or the circumstances of the accident or violence which produced the fatal injury".
The data are derived from the medical certificate of death, which is obligatory in the Member States. The information recorded in the death certificate is according to the rules specified by the WHO.
Data published in Eurostat's dissemination database are broken down by sex, 5-year age groups, cause of death and by residency and country of occurrence. For stillbirths and neonatal deaths additional breakdowns might include age of mother and parity.
Data are available for Member States, Iceland, Norway, Liechtenstein, Switzerland, United Kingdom, Serbia, Turkey, North Macedonia and Albania. Regional data (NUTS level 2) are available for all of the countries having NUTS2 regions except Albania.
Annual national data are available in Eurostat's dissemination database as absolute numbers, crude death rates and standardised death rates. At regional level the same data are provided in the form of 3-year averages (the average of year, year -1 and year -2). Annual crude and standardised death rates are also available at NUTS2 level. Monthly national data are available for 21 EU Member States from reference year 2019 and for 24 Member States from reference year 2022, in absolute numbers and standardised death rates.
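The 3-year averaging described above can be sketched as follows. This is a minimal illustration, not Eurostat's own procedure: the function names and the figures are hypothetical, and only the crude rate is shown (standardised rates additionally need a standard population, which is not covered here).

```python
def crude_death_rate(deaths, population, per=100_000):
    """Deaths per `per` inhabitants (multiply first to keep the arithmetic exact)."""
    return deaths * per / population


def three_year_average(values_by_year, year):
    """Average of year, year-1 and year-2; None if any of the three is missing."""
    window = [values_by_year.get(y) for y in (year, year - 1, year - 2)]
    if any(v is None for v in window):
        return None
    return sum(window) / 3


# Hypothetical figures for one region (not real Eurostat data).
deaths = {2019: 1200, 2020: 1500, 2021: 1350}
population = {2019: 1_000_000, 2020: 1_000_000, 2021: 1_000_000}

rates = {year: crude_death_rate(deaths[year], population[year]) for year in deaths}
avg_2021 = three_year_average(rates, 2021)   # uses 2021, 2020 and 2019
avg_2020 = three_year_average(rates, 2020)   # None: 2018 is not available
```

Returning None for an incomplete window is a design choice of this sketch; it avoids silently averaging over fewer than three years.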
Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.
April 9, 2020
April 20, 2020
April 29, 2020
September 1st, 2020
February 12, 2021
new_deaths column.
February 16, 2021
The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.
The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.
This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.
The AP is updating this dataset hourly at 45 minutes past the hour.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic:
Filter cases by state here
Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac
Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases per capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true
Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.
Pull the 100 counties with the highest per-capita confirmed cases here
Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take into account raw caseload figures as well.
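The hotspot queries above rank areas by a 7-day rolling average of new cases per capita. The sketch below shows one way to compute that figure from cumulative counts. It is not the AP's actual query; the function name, the input layout, and the sample numbers are assumptions made for illustration.

```python
def rolling_new_cases_per_100k(cumulative_cases, population, window=7):
    """Rolling mean of daily new cases, scaled per 100,000 residents.

    `cumulative_cases` holds daily cumulative counts, oldest first.
    One value is returned per day once a full window of daily changes exists.
    """
    # Daily new cases are the differences between consecutive cumulative counts.
    new_cases = [b - a for a, b in zip(cumulative_cases, cumulative_cases[1:])]
    rates = []
    for end in range(window, len(new_cases) + 1):
        mean_new = sum(new_cases[end - window:end]) / window
        rates.append(mean_new * 100_000 / population)
    return rates


# Hypothetical county: 7 new cases a day for a week, then 14 on the last day.
cumulative = [0, 7, 14, 21, 28, 35, 42, 49, 63]
rates = rolling_new_cases_per_100k(cumulative, population=50_000)
```

With a small population the same handful of cases produces a large rate, which is why raw caseloads should be read alongside the per-capita ranking.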
The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.
https://datawrapper.dwcdn.net/nRyaf/15/
Johns Hopkins timeseries data
- Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count.
- Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here.
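Because the timeseries holds cumulative snapshots, a downward revision by a source shows up as a negative day-over-day change. The sketch below flags those days so they can be reviewed. The function names and sample counts are hypothetical and not part of the Johns Hopkins or AP tooling.

```python
def daily_changes(cumulative):
    """Return (day_index, change) pairs between consecutive cumulative snapshots."""
    pairs = zip(cumulative, cumulative[1:])
    return [(i, cur - prev) for i, (prev, cur) in enumerate(pairs, start=1)]


def revised_days(cumulative):
    """Day indices where the cumulative count dropped, which suggests a revision."""
    return [i for i, change in daily_changes(cumulative) if change < 0]


# Hypothetical snapshots: the source lowered its total on day 2.
snapshots = [100, 120, 115, 130]
changes = daily_changes(snapshots)
flagged = revised_days(snapshots)
```

Flagging rather than clipping negative changes keeps the decision about how to treat revised data with the analyst.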
This data should be credited to the Johns Hopkins University COVID-19 tracking project.
Project Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically from national or international health authorities, such as the US Centers for Disease Control or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source and no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability. We also formatted the data into a standard data format.
Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. the United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.
Depending on the intended use of a dataset, we recommend a few data processing steps before analysis:
- Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete, due to incompleteness of source documents) and users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported.
- Separate cumulative from non-cumulative time interval series. Case count time series in Project Tycho datasets can be "cumulative" or "fixed-intervals". Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicated this with an attribute for each count value, named "PartOfCumulativeCountSeries".
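The two processing steps above can be sketched as follows. Only the attribute name "PartOfCumulativeCountSeries" comes from the description. The other field names (PeriodStartDate, PeriodEndDate, CountValue), the 0/1 encoding of the attribute, the weekly interval length, and the sample rows are assumptions made for this illustration.

```python
from datetime import date, timedelta


def split_series(rows):
    """Split rows into (cumulative, fixed_interval) lists.

    Assumes the attribute is encoded as 1 (cumulative) or 0 (fixed interval).
    """
    cumulative = [r for r in rows if r["PartOfCumulativeCountSeries"] == 1]
    fixed = [r for r in rows if r["PartOfCumulativeCountSeries"] == 0]
    return cumulative, fixed


def fill_missing_weeks(fixed_rows):
    """Add placeholder rows (CountValue None) for unreported weekly intervals."""
    if not fixed_rows:
        return []
    by_start = {r["PeriodStartDate"]: r for r in fixed_rows}
    current, last = min(by_start), max(by_start)
    filled = []
    while current <= last:
        placeholder = {
            "PeriodStartDate": current,
            "PeriodEndDate": current + timedelta(days=6),
            "CountValue": None,  # unknown, which is different from a reported zero
            "PartOfCumulativeCountSeries": 0,
        }
        filled.append(by_start.get(current, placeholder))
        current += timedelta(days=7)
    return filled


# Hypothetical rows: two weekly counts with a gap, plus two cumulative counts.
rows = [
    {"PeriodStartDate": date(2000, 1, 2), "PeriodEndDate": date(2000, 1, 8),
     "CountValue": 5, "PartOfCumulativeCountSeries": 0},
    {"PeriodStartDate": date(2000, 1, 16), "PeriodEndDate": date(2000, 1, 22),
     "CountValue": 0, "PartOfCumulativeCountSeries": 0},
    {"PeriodStartDate": date(2000, 1, 1), "PeriodEndDate": date(2000, 1, 7),
     "CountValue": 3, "PartOfCumulativeCountSeries": 1},
    {"PeriodStartDate": date(2000, 1, 1), "PeriodEndDate": date(2000, 1, 14),
     "CountValue": 9, "PartOfCumulativeCountSeries": 1},
]
cumulative, fixed = split_series(rows)
filled = fill_missing_weeks(fixed)
```

The placeholder keeps a missing interval (None) distinct from a reported zero, which matches the distinction the description draws.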
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public health-related decision-making on policies aimed at controlling the COVID-19 pandemic outbreak depends on complex epidemiological models that must be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources, with corrections for systematic measurement errors and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes, such as daily mortality and fatality rates, to the normal attributes of official data sources. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), World Health Organization (WHO) and European Centre for Disease Prevention and Control (ECDC). The data were collected by using text mining techniques and reviewing PDF reports, metadata, and reference data. The combined dataset includes complete spatial data such as country area, international country number, Alpha-2 code, Alpha-3 code, latitude, and longitude, and some additional attributes such as population. The improved dataset benefits from major corrections to the referenced data sets and official reports: adjusting the reporting dates, which suffered from a one- to two-day lag; removing negative values; detecting unreasonable changes to historical data in new reports; and correcting systematic measurement errors, which have been increasing as the pandemic outbreak spreads and more countries contribute data to the official repositories. Additionally, the root mean square error of attributes in the paired comparison of datasets was used to identify the main data problems. The data for China is presented separately and in more detail, and it has been extracted from the attached reports available on the main page of the CCDC website.
This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline of confirmed cases, long-term predictions of deaths or hospital utilization, the effects of quarantine, stay-at-home orders and other social distancing measures, and the pandemic’s turning point. It can also be used in economic and social impact analysis, helping to inform national and local authorities on how to implement an adaptive response approach to re-open the economy and schools, ease business and social distancing restrictions, design economic programs, or allow sports events to resume.
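The paired comparison by root mean square error mentioned in this description can be illustrated with a short sketch. It is not the authors' procedure: the function names, the idea of choosing the date shift with the lowest error, and the sample series are assumptions used only to show how a one- to two-day reporting lag might surface in such a comparison.

```python
import math
from datetime import date, timedelta


def rmse(series_a, series_b):
    """Root mean square error over the dates present in both series."""
    shared = sorted(set(series_a) & set(series_b))
    if not shared:
        raise ValueError("no overlapping dates")
    squared = sum((series_a[d] - series_b[d]) ** 2 for d in shared)
    return math.sqrt(squared / len(shared))


def shift_back(series, days):
    """Move every observation `days` earlier."""
    return {d - timedelta(days=days): v for d, v in series.items()}


def best_lag(reference, other, max_lag=2):
    """Lag (in days) of `other` behind `reference` that minimises the RMSE."""
    return min(range(max_lag + 1),
               key=lambda lag: rmse(reference, shift_back(other, lag)))


# Hypothetical daily counts: source B reports the same values one day late.
source_a = {date(2020, 1, d): 10 * d for d in range(1, 5)}
source_b = {day + timedelta(days=1): v for day, v in source_a.items()}

error_unaligned = rmse(source_a, source_b)
lag = best_lag(source_a, source_b)
error_aligned = rmse(source_a, shift_back(source_b, lag))
```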
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four datasets are presented here. The original dataset is a collection of the COVID-19 data maintained by Our World in Data. It includes data on confirmed cases and deaths, as well as other variables of potential interest, for ten countries: Australia, Brazil, Canada, China, Denmark, France, Israel, Italy, the United Kingdom, and the United States. The original dataset covers 31 December 2019 to 31 May 2020, with a total of 1,530 instances and 19 features. It is collected from a variety of sources (the European Centre for Disease Prevention and Control, United Nations, World Bank, Global Burden of Disease, Blavatnik School of Government, etc.). The original dataset is first pre-processed by cleaning it and removing unnecessary and blank data. Then, all strings are converted to numeric values, and new features such as continent, hemisphere, year, month, and day are derived from the original features. Finally, the processed dataset is organized for predicting the number of new COVID-19 cases with offsets of 1 day, 3 days, and 10 days, and three datasets (Dataset-1, 2, 3) are created for that purpose.
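The feature extraction and the three offset datasets described above might be built along these lines. This is a sketch under stated assumptions: the record layout, the function names, and the sample values are hypothetical, and pairing each day's features with the new-case count observed a given number of days later is one reading of the 1-, 3-, and 10-day offsets, not a confirmed description of how Dataset-1, 2, and 3 were made.

```python
from datetime import date, timedelta


def date_features(d):
    """Derive year, month and day features from a date."""
    return {"year": d.year, "month": d.month, "day": d.day}


def make_offset_dataset(records, offset_days):
    """Pair each day's features with new_cases observed `offset_days` later.

    `records` must be sorted by date with one record per consecutive day.
    """
    rows = []
    for i in range(len(records) - offset_days):
        features = {**date_features(records[i]["date"]),
                    "new_cases": records[i]["new_cases"]}
        target = records[i + offset_days]["new_cases"]
        rows.append((features, target))
    return rows


# Hypothetical series for one country: 12 consecutive days.
start = date(2020, 3, 1)
records = [{"date": start + timedelta(days=i), "new_cases": 10 * i}
           for i in range(12)]

# One dataset per offset, mirroring the 1-, 3- and 10-day setups.
datasets = {offset: make_offset_dataset(records, offset) for offset in (1, 3, 10)}
```

Longer offsets leave fewer usable rows, because the last days of the series have no target value yet.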
As a member of the World Organisation for Animal Health (OIE), and the reporting authority for the United States, the USGS National Wildlife Health Center (NWHC) is responsible for reporting wildlife disease outbreaks that involve diseases which are not OIE-listed (https://www.oie.int/wahis_2/public/wahidwild.php#). These outbreaks are to be reported on a semiannual basis via OIE’s WAHIS-Wild reporting system. The data fields described within are based on those in WAHIS-Wild. Since OIE’s reporting mechanism is based primarily on domestic and agricultural animals, several of the variables are not applicable to wildlife (e.g. vaccination, slaughtered). In an effort to use a consistent data source that is broad in scope and captures information from around the country, from various natural resource management authorities, NWHC will use the Wildlife Health Information Sharing Partnership - Event Reporting System (WHISPers - https://whispers.usgs.gov/home) as the sole source to generate and supply the requested information to OIE. Data supplied to OIE have been restricted to publicly available information on wildlife morbidity/mortality and surveillance events in WHISPers.