https://github.com/nytimes/covid-19-data/blob/master/LICENSE
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1 Introduction
Several works apply Natural Language Processing to newspaper reports. Rameshbhai et al. [ 1 ] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al. [ 2 ] created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [ 3 ] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, not much NLP research has been invested in studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus [ 5 ][ 6 ][ 7 ] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [ 8 ] to understand the fear persisting in people due to the virus. Similar work has been done by Jelodar et al. [ 9 ], using an LSTM network to classify sentiments from online discussion forums. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, and it contributes to awareness of the virus.
2 Data-set Introduction
2.1 Data Collection
We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We named this dataset “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named that dataset “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in their respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferred over automation to ensure the news reports were highly relevant to the subject. The newspapers' online sites had dynamic content with advertisements in no particular order, so there was a high chance that online scrapers would collect inaccurate news reports. One of the challenges in collecting the data was the subscription requirement: each newspaper required $1 per subscription. Some criteria provided as guidelines to the human data-collectors were as follows:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic. Political, social, economical genres are to be more prioritized.
Avoid taking duplicate reports.
Maintain a time frame for the above mentioned newspapers.
To collect these data we used a Google Form for USA and BD. Two human editors went through each entry to check for spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows:
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any remaining presence of the items listed above.
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100
Table 2: Covid-News-BD-NNK data statistics
No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500
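The first three pre-processing steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the script used for the dataset: the stop-word list is a deliberately tiny hypothetical one, lowercasing is an added assumption, and lemmatization is left out because it needs an external lexical resource.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
NON_ALNUM_PATTERN = re.compile(r"[^A-Za-z0-9\s]")


def preprocess(text):
    """Remove hyperlinks, non-English alphanumeric characters and stop words."""
    text = URL_PATTERN.sub(" ", text)        # step 1: drop hyperlinks
    text = NON_ALNUM_PATTERN.sub(" ", text)  # step 2: keep A-Z, a-z, 0-9, whitespace
    tokens = text.lower().split()            # lowercasing is an assumption
    return [t for t in tokens if t not in STOP_WORDS]  # step 3: drop stop words


sample = "The virus is spreading in Minnesota: see https://example.com/report for details!"
print(preprocess(sample))
```

Keeping each step as a separate substitution makes it easy to switch one off when less aggressive cleaning is wanted.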
2.3 Dataset Repository
We used GitHub as our primary data repository, under the account name NKK^1. There, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome outside collaboration to enrich the dataset.
3 Literature Review
Natural Language Processing (NLP) deals with text data in computer science, using diverse methods such as one-hot encoding and word embeddings that transform text into numerical representations, which can then be fed to machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [ 10 ], authorship attribution in fallback authentication systems [ 11 ], intelligent conversational agents or chatbots [ 12 ], and machine translation as used by Google Translate [ 13 ]. While these are all downstream tasks, several notable developments have been made in models designed specifically for Natural Language Processing tasks. Two of the most prominent are BERT [ 14 ], which uses a bidirectional Transformer encoder and performs strongly on classification and masked-word prediction tasks, and the GPT-3 models released by OpenAI [ 15 ], which can generate almost human-like text. However, these models are typically used pre-trained, since training them carries a huge computational cost. Information extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [ 16 ]. Information extraction from text could be identifying named entities, locations or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation, or LDA [ 17 ].
Keyword extraction is a process of information extraction and sub-task of NLP to extract essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and pick the words with more weight to it.
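To make the graph-based weighting concrete, here is a minimal sketch of a TextRank-style ranking in plain Python. It is an illustration under stated assumptions, not the implementation used in the experiments: the co-occurrence window, damping factor, iteration count and the function name `textrank_keywords` are all hypothetical choices, and the token list is a toy example.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank tokens with PageRank over an undirected co-occurrence graph."""
    # Link every token to the `window` tokens that follow it.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


toy_tokens = ["virus", "economy", "virus", "masks", "virus", "election", "virus", "jobs"]
print(textrank_keywords(toy_tokens, window=1, top_k=1))
```

In the toy input every other word co-occurs only with "virus", so the graph is a star and the central word collects the highest weight.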
Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.
4 Our experiments and Result analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2 and 3, we can note the following:
In February, both newspapers talked about China and the source of the outbreak.
StarTribune emphasized Minnesota as the state of most concern. In April, that concern appeared to increase.
Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations and markets.
The Washington Post discussed global issues more than StarTribune.
In February, StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread in the United States is highlighted more from March through May, displaying the critical impact caused by the virus.
We used a script to extract all numbers related to certain keywords such as ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports and created a series of case numbers for both newspapers. Figure 4 shows the statistics of this series. From this extraction, we can observe that April was the peak month for COVID cases, which rose gradually from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.
We used VADER Sentiment Analysis to extract the sentiment of the headlines and the body. On average, the sentiment scores ranged from -0.5 to -0.9. The VADER sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic.
We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, ’Response’. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains informative data related to the COVID-19 pandemic, in particular the First Case and First Death information for every country. The datasets focus on two major fields. The first is First Case, which consists of the Date of First Case(s), Number of Confirmed Case(s) on the First Day, Age of the Patient(s) of the First Case, and Last Visited Country. The other is First Death, which consists of the Date of First Death and the Age of the Patient who died first, for every country, with the corresponding continent. The datasets also contain a binary matrix of the spread chain among different countries and regions.
*This is not a country. This is a ship. The name of the cruise ship was not given by the government.
"N+": the age is not specified but greater than N
“No Trace”: some data was not found
“Unspecified”: not available from the authority
“N/A”: in the “Last Visited Country(s) of Confirmed Case(s)” column, “N/A” indicates that the confirmed case(s) in those countries had no travel history in the recent past; in the “Age of First Death(s)” column, “N/A” indicates that those countries did not have any death case up to May 16, 2020.
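When loading the age columns, the markers listed above need to be separated from real numbers. A minimal sketch, assuming each cell holds either a single integer, an "N+" value, or one of the three text markers (the actual files may hold several ages in one cell, which this hypothetical `parse_age` helper does not handle):

```python
MISSING_MARKERS = {"No Trace", "Unspecified", "N/A"}


def parse_age(value):
    """Return (age, is_lower_bound); age is None for the non-numeric markers."""
    text = value.strip()
    if text in MISSING_MARKERS:
        return None, False
    if text.endswith("+"):
        # "N+": the age is not specified but greater than N.
        return int(text[:-1]), True
    return int(text), False


print(parse_age("60+"))
print(parse_age("Unspecified"))
```

Returning the lower-bound flag separately keeps "60+" distinguishable from an exact age of 60 in later analysis.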
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
In the current response to the pandemic caused by the SARS-CoV-2 coronavirus, which causes the disease COVID-19, it is necessary to join forces to minimize the effects of this disease.
Therefore, the intention of this dataset is to save data scientists time:
This dataset is not intended to be static, so suggestions for expanding it are welcome. If someone considers it important to add information, please let me know.
The data contained in this dataset comes mainly from the following sources:
Source: Center for Systems Science and Engineering (CSSE) at Johns Hopkins University https://github.com/CSSEGISandData/COVID-19 Provided by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE): https://systems.jhu.edu/
Source: OXFORD COVID-19 GOVERNMENT RESPONSE TRACKER https://www.bsg.ox.ac.uk/research/research-projects/oxford-covid-19-government-response-tracker Hale, Thomas and Samuel Webster (2020). Oxford COVID-19 Government Response Tracker. Data use policy: Creative Commons Attribution CC BY standard.
The original data is updated daily.
The features it includes are:
Country Name
Country Code ISO 3166 Alpha 3
Date
Incidence data:
Daily increments:
Empirical Contagion Rate - ECR
GOVERNMENT RESPONSE TRACKER - GRTStringencyIndex
OXFORD COVID-19 GOVERNMENT RESPONSE TRACKER - Stringency Index
Indices from Start Contagion
Percentages over the country's population:
The method of obtaining the data and its transformations can be seen in the notebook:
Notebook COVID-19 Data by country with Government Response
Photo by Markus Spiske on Unsplash
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"COVID-19 mortality correlation with cloudiness, sunlight, latitude in European countries"
Dataset for article titled "COVID-19 mortality: positive correlation with cloudiness, sunlight and no correlation with latitude in Europe"
by SECIL OMER, ADRIAN IFTIME, VICTOR BURCEA
Corresponding author: A. Iftime, University of Medicine and Pharmacy "Carol Davila", Biophysics Department, 8 Blvd. Eroii Sanitari, 050474 Bucharest, Romania. Email address: adrian.iftime [at] umfcd.ro.
Preprint corresponding to this dataset: https://doi.org/10.1101/2021.01.27.21250658
===========
Dataset file: 1.0.0.COVID-19_Mortality_Cloudiness_Insolation_EUROPE_March_August_2020.csv
Dataset graphical preview: 1.0.0.INFOGRAFIC_CloudFraction_vs_COVID-19_mortality_Europe_March-August_2020.png
DATASET fields:
"Country" : Country name; 37 European countries included.
"Date" : Date stamp at the collection time. Data collection was performed on the last day of every month. Date format: YYYY-MM-DD
"Month_Key" : Date stamp at the collection time, formatted for easier monthly time series analysis. Date format: YYYY-MM
"Month_Fct2020" : Date stamp at the collection time, formatted for easier graphing, as a string with the names of the months (in English).
"Deaths_per_1Mpop" : Monthly mortality from COVID-19 reported in the country, expressed as the number of COVID-19 deaths per 1 million population of the country, in that particular month / country. NB: it is reported per million population, not per million patients.
"LogDeaths_per_1Mpop" : Log10 transformation of "Deaths_per_1Mpop"
"Insolation_Average" : Insolation average (solar irradiance at ground level), in that particular month / country. It is expressed in Watt / square meter of the ground surface. Data derived from data available at NASA Langley Research Center, NASA’s Earth Observatory, CERES / FLASHFlux team, 2020, https://neo.sci.gsfc.nasa.gov/view.php?datasetId=CERES_INSOL_M
"Cloud_Fraction" : Cloudiness (also known as cloud fraction, cloud cover, cloud amount or sky cover), as decimal fraction of the sky obscured by clouds, in that particular month / country. Data derived from NASA Goddard Space Flight Center, NASA’s Earth Observatory, MODIS Atmosphere Science Team, 2020, https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MODAL2_M_CLD_FR
"CENTR_latitude" and "CENTR_longitude" : Latitude and longitude of the country centroid, for each country. Data derived from Google LLC, "Dataset publishing language: country centroids", https://developers.google.com/public-data/docs/canonical/countries_csv
NOTE: This is identical in every month (obviously); it is redundantly included for easier monthly sectional analysis of the data.
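The two derived fields, "Month_Key" and "LogDeaths_per_1Mpop", can be recomputed from "Date" and "Deaths_per_1Mpop". The sketch below is illustrative only: the function names are hypothetical, and returning None for a non-positive value is an assumption, since the description does not say how months with zero deaths are stored.

```python
import math
from datetime import datetime


def month_key(date_stamp):
    """Convert a YYYY-MM-DD collection date into the YYYY-MM month key."""
    return datetime.strptime(date_stamp, "%Y-%m-%d").strftime("%Y-%m")


def log_deaths_per_1mpop(deaths_per_1mpop):
    """Log10 of monthly deaths per million; None where the log is undefined."""
    if deaths_per_1mpop <= 0:
        return None  # assumption: the source does not state how zeros are handled
    return math.log10(deaths_per_1mpop)


print(month_key("2020-03-31"))
print(log_deaths_per_1mpop(100.0))
```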
===========
Versioning: 1.0.0.COVID-19_Mortality_Cloudiness_Insolation_EUROPE_March_August_2020.csv
MAJOR: changes yearly; 1 = 2020
MINOR: changes if new monthly data is added in that particular year.
PATCH: changes only if errors or minor edits were performed.
DOI for this version: 10.5281/zenodo.4266758
Dataset file source for this version (internal analysis source file): db_covid_all-ANALYSIS.2020-09-22_r10.csv
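The version prefix of the file name can be read programmatically. A minimal sketch, assuming the three leading dot-separated fields are always integers; mapping MAJOR to a calendar year as 2019 + MAJOR extrapolates from the single stated case "1 = 2020" and is an assumption, as is the helper name.

```python
def parse_dataset_version(filename):
    """Split the leading MAJOR.MINOR.PATCH prefix off a dataset file name."""
    major, minor, patch = (int(part) for part in filename.split(".")[:3])
    return {
        "major": major,
        "minor": minor,
        "patch": patch,
        "year": 2019 + major,  # assumption extrapolated from "1 = 2020"
    }


name = "1.0.0.COVID-19_Mortality_Cloudiness_Insolation_EUROPE_March_August_2020.csv"
print(parse_dataset_version(name))
```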
https://creativecommons.org/publicdomain/zero/1.0/
From World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people.
So daily level information on the affected people can give some interesting insights when it is made available to the broader data science community.
The European CDC publishes daily statistics on the COVID-19 pandemic. Not just for Europe, but for the entire world. We rely on the ECDC as they collect and harmonize data from around the world which allows us to compare what is happening in different countries.
This dataset has daily level information on the number of cases, deaths, recoveries, etc. from coronavirus. It also contains various other parameters such as average life expectancy, population density and smoking population, which users may find useful in further predictions that they need to make.
The data is available from 31 Dec, 2019.
Give people weekly data so that they can use it to make accurate predictions.
https://www.ontario.ca/page/open-government-licence-ontario
This dataset reports, for each day, the 7-day moving average rate of deaths involving COVID-19, by vaccination status and by age group.
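For readers unfamiliar with the measure, a 7-day moving average replaces each daily value with the mean of a 7-day window. The sketch below uses a trailing window (the current day and the six before it); whether the published rates use a trailing or a centred window is not stated here, so that choice and the function name are assumptions.

```python
def trailing_moving_average(values, window=7):
    """Trailing mean over `window` points; None until a full window exists."""
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        averages.append(running / window if i >= window - 1 else None)
    return averages


daily_rates = [7, 7, 7, 7, 7, 7, 7, 14]  # placeholder numbers, not real data
print(trailing_moving_average(daily_rates))
```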
Effective November 14, 2024 this page will no longer be updated. Information about COVID-19 and other respiratory viruses is available on Public Health Ontario’s interactive respiratory virus tool: https://www.publichealthontario.ca/en/Data-and-Analysis/Infectious-Disease/Respiratory-Virus-Tool
Data includes:
As of June 16, all COVID-19 datasets will be updated weekly on Thursdays by 2pm.
As of January 12, 2024, data from the date of January 1, 2024 onwards reflect updated population estimates. This update specifically impacts data for the 'not fully vaccinated' category.
On November 30, 2023 the count of COVID-19 deaths was updated to include missing historical deaths from January 15, 2020 to March 31, 2023.
CCM is a dynamic disease reporting system which allows ongoing update to data previously entered. As a result, data extracted from CCM represents a snapshot at the time of extraction and may differ from previous or subsequent results. Public Health Units continually clean up COVID-19 data, correcting for missing or overcounted cases and deaths. These corrections can result in data spikes and current totals being different from previously reported cases and deaths. Observed trends over time should be interpreted with caution for the most recent period due to reporting and/or data entry lags.
The data does not include vaccination data for people who did not provide consent for vaccination records to be entered into the provincial COVaxON system. This includes individual records as well as records from some Indigenous communities where those communities have not consented to including vaccination information in COVaxON.
“Not fully vaccinated” category includes people with no vaccine and one dose of double-dose vaccine. “People with one dose of double-dose vaccine” category has a small and constantly changing number. The combination will stabilize the results.
Spikes, negative numbers and other data anomalies: Due to ongoing data entry and data quality assurance activities in the Case and Contact Management system (CCM) file, Public Health Units continually clean up COVID-19 data, correcting for missing or overcounted cases and deaths. These corrections can result in data spikes, negative numbers and current totals being different from previously reported case and death counts.
Public Health Units report cause of death in the CCM based on information available to them at the time of reporting and in accordance with definitions provided by Public Health Ontario. The medical certificate of death is the official record and the cause of death could be different.
Deaths are defined per the outcome field in CCM marked as “Fatal”. Deaths in COVID-19 cases identified as unrelated to COVID-19 are not included in the Deaths involving COVID-19 reported.
Rates for the most recent days are subject to reporting lags.
All data reflects totals from 8 p.m. the previous day.
This dataset is subject to change.
https://dataverse.no/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.18710/VMUP44
The dataset is a cross-sectional dataset covering social and public health data pertaining to the Covid-19 outbreak in 199 countries. The dataset was compiled from public registers and other openly available sources. Data on Covid-19 cases and related fatalities is current as of mid-July 2020. Data on other variables is mainly from the last three years, depending on what was available at the time of collection. Standardized unique unit identifiers (ISO-3166-1 Alpha-3) are included, enabling merging with other data. The dataset was assembled concurrently with a similar one at the Norwegian municipal level, as part of the project «Ressurs for studentaktiv læring i undervisning i statistisk og romlig analyse for samfunnsfag», at the Department of Social Science and The Norwegian College of Fishery Science, UiT.
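Because every row carries an ISO-3166-1 Alpha-3 code, joining this dataset with another country-level table reduces to a key lookup. A minimal sketch with plain dictionaries follows; the column name `iso3`, the helper name and the example values are hypothetical placeholders, not taken from the dataset.

```python
def merge_by_iso3(left_rows, right_rows, key="iso3"):
    """Inner-join two lists of dict rows on a shared ISO 3166-1 alpha-3 code."""
    right_index = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_index.get(row[key])
        if match is not None:
            merged.append({**row, **match})  # combine the columns of both rows
    return merged


# Placeholder rows for illustration only; the numbers are not real figures.
covid_rows = [{"iso3": "NOR", "cases": 1}, {"iso3": "SWE", "cases": 2}]
other_rows = [{"iso3": "NOR", "region": "Europe"}]
merged_rows = merge_by_iso3(covid_rows, other_rows)
print(merged_rows)
```

Rows without a counterpart are dropped here; a left join that keeps them would be an equally reasonable choice depending on the analysis.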
Coronavirus (COVID-19) pandemic
Our complete COVID-19 dataset is a collection of the COVID-19 data maintained by Our World in Data. It includes data on confirmed cases, deaths, hospitalizations, and testing. Data is collected from multiple sources that update at different times and may not always align. Some locations may not provide complete information.
Regarding all vaccination data: the date of last update is 4/21/2023. Additionally, on 4/27/2023 several COVID-19 datasets were retired and are no longer included in public COVID-19 data dissemination. See this link for more information: https://imap.maryland.gov/pages/covid-data

Summary
The cumulative number of COVID-19 vaccinations percent age group population: 16-17; 18-49; 50-64; 65 Plus.

Description
COVID-19 - Vaccination Percent Age Group Population data layer is a collection of COVID-19 vaccinations that have been reported each day into ImmuNet. COVID-19 is a disease caused by a respiratory virus first identified in Wuhan, Hubei Province, China in December 2019. COVID-19 is a new virus that hasn't caused illness in humans before. Worldwide, COVID-19 has resulted in thousands of infections, causing illness and in some cases death. Cases have spread to countries throughout the world, with more cases reported daily. The Maryland Department of Health reports daily on COVID-19 cases by county.

Terms of Use
The Spatial Data, and the information therein, (collectively the Data) is provided as is without warranty of any kind, either expressed, implied, or statutory. The user assumes the entire risk as to quality and performance of the Data. No guarantee of accuracy is granted, nor is any responsibility for reliance thereon assumed. In no event shall the State of Maryland be liable for direct, indirect, incidental, consequential or special damages of any kind. The State of Maryland does not accept liability for any damages or misrepresentation caused by inaccuracies in the Data or as a result to changes to the Data, nor is there responsibility assumed to maintain the Data in any manner or form. The Data can be freely distributed as long as the metadata entry is not modified or deleted. Any data derived from the Data must acknowledge the State of Maryland in the metadata. This map is for planning purposes only. MEMA does not guarantee the accuracy of any forecast or predictive elements.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides values for CORONAVIRUS DEATHS reported in several countries. The data includes current values, previous releases, historical highs and record lows, release frequency, reported unit and currency.
JHU Coronavirus COVID-19 Global Cases, by country
PHS is updating the Coronavirus Global Cases dataset weekly, Monday, Wednesday and Friday from Cloud Marketplace.
This data comes from the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). This database was created in response to the Coronavirus public health emergency to track reported cases in real-time. The data include the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries, aggregated at the appropriate province or state. It was developed to enable researchers, public health authorities and the general public to track the outbreak as it unfolds. Additional information is available in the blog post.
Visual Dashboard (desktop): https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
Included Data Sources are:
**Terms of Use:**
This GitHub repo and its contents herein, including all data, mapping, and analysis, copyright 2020 Johns Hopkins University, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources, that do not always agree. The Johns Hopkins University hereby disclaims any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.
**U.S. county-level characteristics relevant to COVID-19**
Chin, Kahn, Krieger, Buckee, Balsari and Kiang (forthcoming) show that counties differ significantly in biological, demographic and socioeconomic factors that are associated with COVID-19 vulnerability. A range of publicly available county-specific data identifying these key factors, guided by international experiences and consideration of epidemiological parameters of importance, have been combined by the authors and are available for use:
https://creativecommons.org/publicdomain/zero/1.0/
This dataset has been collected from multiple sources provided by MVCR on their websites. It contains daily summarized statistics as well as detailed statistics down to the age and sex level.
Date - Calendar date when the data were collected
Daily tested - Sum of tests performed
Daily infected - Sum of confirmed cases that were positive
Daily cured - Sum of cured people who no longer have Covid-19
Daily deaths - Sum of people who died of Covid-19
Daily cum tested - Cumulative sum of tests performed
Daily cum infected - Cumulative sum of confirmed cases that were positive
Daily cum cured - Cumulative sum of cured people who no longer have Covid-19
Daily cum deaths - Cumulative sum of people who died of Covid-19
Region - Region of the Czech Republic
Sub-Region - Sub-region of the Czech Republic
Region accessories qty - Quantity of health care accessories delivered to the region over the whole period
Age - Age of the person
Sex - Sex of the person
Infected - Sum of infected people for a specific date, region, sub-region, age and sex
Cured - Sum of cured people for a specific date, region, sub-region, age and sex
Death - Sum of people who died of Covid-19 for a specific date, region, sub-region, age and sex
Infected abroad - Identifies whether the person was infected by Covid-19 in the Czech Republic or abroad
Infected in country - Code of the country the person came from (origin country of the Covid-19 infection)
The dataset contains data at different levels of granularity. Make sure you do not mix different granularities. Suppose you have loaded the data into a pandas DataFrame called df.
import pandas as pd

# Daily level: one row per date, built from the daily summary columns.
daily_cols = ['daily_tested','daily_infected','daily_cured','daily_deaths','daily_cum_tested','daily_cum_infected','daily_cum_cured','daily_cum_deaths']
df_daily = df.groupby('date')[daily_cols].max().reset_index()
# Region level: one row per region, skipping rows with an empty region.
df_region = df[df['region'] != ''].groupby('region').agg(
    region_accessories_qty=pd.NamedAgg(column='region_accessories_qty', aggfunc='max'),
    infected=pd.NamedAgg(column='infected', aggfunc='sum'),
    cured=pd.NamedAgg(column='cured', aggfunc='sum'),
    death=pd.NamedAgg(column='death', aggfunc='sum')
).reset_index()
# Detail level: date, region, sub-region, age and sex.
df_detail = df[['date','region','sub_region','age','sex','infected','cured','death','infected_abroad','infected_in_country']].reset_index(drop=True)
Thanks to websites of MVCR for sharing such great information.
Can you see a relation between the health care accessories delivered to a region and the number of cured/infected in that region? Why does the Czech Republic rank among the fairly safe countries with regard to the Covid-19 pandemic? Can you find out how the evolution of the pandemic in the Czech Republic differs from that in surrounding countries, such as Germany or Slovakia?
Summary
The number of cases interviewed who had a completed answer to the question asking if they attended any gatherings of more than 10 people in the 14 days before they became ill (or had a positive test) during their covidLINK interviews.

Description
MD COVID-19 - Contact Tracing Cases Social Gatherings of More than 10 People layer reflects the number of cases interviewed who had a completed answer to the question asking if they attended any gatherings of more than 10 people in the 14 days before they became ill (or had a positive test) during their covidLINK interviews. Respondents may indicate that they attended more than one category of social gathering. For a variety of reasons, some individuals choose not to answer particular questions during the course of their interview.

Events and locations where there is prolonged exposure to other people, including weddings, parties, stores, restaurants, etc., are considered “high risk” for COVID-19 transmission. The more interaction at a gathering or location, the more likely a person may be to transmit or become infected with the virus. More information about considerations for events and gatherings, including how to assess risk levels and promote healthy behaviors that reduce spread, is available from the Centers for Disease Control and Prevention.

Answers to interview questions do not provide evidence of cause and effect. Due to the nature of COVID-19 and the wide range of scenarios in which a person can become infected, most of the time it will not be possible to pinpoint exactly where and when a case became infected. Though a person may report attendance at a particular location, that does not mean that transmission happened at that location.

The covidLINK interview questionnaire is updated as necessary to capture relevant information related to case exposure and potential onward transmission. These revisions should be taken into consideration when evaluating trends in case responses over time.

COVID-19 is a disease caused by a respiratory virus first identified in Wuhan, Hubei Province, China in December 2019. COVID-19 is a new virus that hasn't caused illness in humans before. Worldwide, COVID-19 has resulted in thousands of infections, causing illness and in some cases death. Cases have spread to countries throughout the world, with more cases reported daily. The Maryland Department of Health reports daily on COVID-19 cases by county.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is based on the GitHub repository maintained by OpenZH. The data has been enriched with geographical data for the cantons in order to produce visualisations.

| Field Name | Description | Format | Note |
|---|---|---|---|
| update | Date and time of notification | YYYY-MM-DD-HH-MM | |
| name | Name of the reporting canton | Text | |
| abbreviation_canton_and_fl | Abbreviation of the reporting canton | Text | |
| ncumul_tested | Reported number of tests performed as of date | Number | Irrespective of canton of residence |
| ncumul_conf | Reported number of confirmed cases as of date | Number | Only cases that reside in the current canton |
| current_hosp (formerly ncumul_hosp) * | Reported number of hospitalised patients on date | Number | Irrespective of canton of residence |
| current_icu (formerly ncumul_icu) * | Reported number of hospitalised patients in ICUs on date | Number | Irrespective of canton of residence |
| current_vent (formerly ncumul_vent) * | Reported number of patients requiring ventilation on date | Number | Irrespective of canton of residence |
| ncumul_released | Reported number of patients released from hospitals or reported recovered as of date | Number | Irrespective of canton of residence |
| ncumul_deceased | Reported number of deceased as of date | Number | Only cases that reside in the current canton |
| new_hosp * | Number of new hospitalisations since last date | Number | Irrespective of canton of residence |
| source | Source of the information | URL link | |
| geo_point_2d | Geographical centroid of the canton | geo_point_2d | |
| current_isolated | Reported number of isolated persons on date | Number | Infected persons who are not hospitalised |
| current_quarantined | Reported number of quarantined persons on date | Number | Persons who were in 'close contact' with an infected person while that person was infectious, and are not hospitalised themselves |
| current_quarantined_riskareatravel | Reported number of quarantined persons on date | Number | People arriving in Switzerland from certain countries and areas, required to go into quarantine (introduced in May 2021) |

\* These variables were affected by the format change on April 9th, 2020, which consists of:
- a new variable "new_hosp"
- the variables "ncumul_hosp", "ncumul_icu", "ncumul_vent" being renamed to "current_hosp", "current_icu", "current_vent" to fit their nature. To ensure compatibility with existing dashboards or reuses, these fields have been duplicated to avoid errors when their old names are used, but we strongly recommend replacing the old names with the new ones as soon as possible.
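Since the old and new field names coexist after the April 9th, 2020 format change, a consumer may want to normalise records to the new names only. The following is a minimal sketch under the assumption that each record is available as a plain dictionary keyed by field name; the helper `normalize_record` and the mapping `LEGACY_TO_CURRENT` are hypothetical names introduced for illustration and are not part of the dataset or of the OpenZH repository.

```python
# Hypothetical mapping from legacy field names to their current names,
# following the renaming described in the note on the format change.
LEGACY_TO_CURRENT = {
    "ncumul_hosp": "current_hosp",
    "ncumul_icu": "current_icu",
    "ncumul_vent": "current_vent",
}


def normalize_record(record):
    """Return a copy of `record` that uses only the current field names.

    If both the legacy and the current name are present, the value stored
    under the current name is kept (an assumption made for this sketch).
    """
    # Copy every field except the legacy duplicates.
    normalized = {
        key: value
        for key, value in record.items()
        if key not in LEGACY_TO_CURRENT
    }
    # Fall back to the legacy value only when the current name is missing.
    for old_name, new_name in LEGACY_TO_CURRENT.items():
        if new_name not in normalized and old_name in record:
            normalized[new_name] = record[old_name]
    return normalized
```

Rows read with `csv.DictReader` could be passed through this helper one at a time before further processing.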
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19: Holidays of countries’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vbmokin/covid19-holidays-of-countries on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This research is devoted to the analysis of the impact of holidays on the statistics of confirmed coronavirus cases. Prophet uses the holidays library, which provides the holidays of countries and their regions. As of 30 June 2020, only 62 countries (some with regions) are available in the holidays library:
['AR', 'AT', 'AU', 'BD', 'BE', 'BG', 'BR', 'BY', 'CA', 'CH', 'CL', 'CN', 'CO', 'CZ', 'DE', 'DK', 'DO', 'EE', 'EG', 'ES', 'FI', 'FR', 'GB', 'GR', 'HN', 'HR', 'HU', 'ID', 'IE', 'IL', 'IN', 'IS', 'IT', 'JP', 'KE', 'KR', 'LT', 'LU', 'MX', 'MY', 'NG', 'NI', 'NL', 'NO', 'NZ', 'PE', 'PH', 'PK', 'PL', 'PT', 'PY', 'RS', 'RU', 'SE', 'SG', 'SI', 'SK', 'TH', 'TR', 'UA', 'US', 'ZA'] or ['Argentina', 'Australia', 'Austria', 'Bangladesh', 'Belarus', 'Belgium', 'Brazil', 'Bulgaria', 'Canada', 'Chile', 'China', 'Colombia', 'Croatia', 'Czechia', 'Denmark', 'Dominican Republic', 'Egypt', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Ireland', 'Israel', 'Italy', 'Japan', 'Kenya', 'Korea, Republic of', 'Lithuania', 'Luxembourg', 'Malaysia', 'Mexico', 'Netherlands', 'New Zealand', 'Nicaragua', 'Nigeria', 'Norway', 'Pakistan', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Russian Federation', 'Serbia', 'Singapore', 'Slovakia', 'Slovenia', 'South Africa', 'Spain', 'Sweden', 'Switzerland', 'Thailand', 'Turkey', 'Ukraine', 'United Kingdom', 'United States']
I will note at once that the list of available countries in the description of the holidays library contains a lot of mistakes, which I have reported to the authors.
When I asked whether this list would be expanded, the Prophet team made it clear that they were waiting for help from the community to expand the holidays library.
As of Jan 2021 (version 8.4.1), 67 countries (some with regions) are available in the holidays library: a number of entries have been refined, and the countries ['BI', 'LV', 'MA', 'RO', 'VN'] (two-letter country codes, or alpha_2, per ISO 3166) have been added.
Unfortunately, the format of the holidays library is not well suited to coronavirus problems, as it has a number of disadvantages. First, country names are given as a single word, which makes it difficult to match many of them to their common names (ISO 3166). Ideally, the dataset should contain the common name and the two-letter English abbreviation according to ISO 3166 (see pycountry). Second, the dates are not adapted to the potential impact of holidays on coronavirus statistics. It is known that, after the moment of infection, symptoms actively manifest with a delay of 4-10 days, so a person is likely to enter the disease statistics only after 4-7 days. Therefore, it is advisable to use the following impact windows:
```
Lower_window = [4, 7]
Upper_window = [7, 10]
```
However, Prophet requires
`Lower_window <= 0`
My [request](https://github.com/facebook/prophet/issues/1588#issue-661098613) to allow positive numbers in this parameter [was refused](https://github.com/facebook/prophet/issues/1588#issuecomment-661984730) by the Prophet team, who [advised](https://github.com/facebook/prophet/issues/1588#issuecomment-661984730) simply moving the dates themselves.
Therefore, it is advisable to shift the holiday dates by 7 days. If a researcher thinks that 7 days is too many and 4 days is enough, they can simply set the lower bound of the window to -3. By default, it makes sense to specify the parameters:
`Lower_window = -3`
`Upper_window = 3`
If necessary, these settings are easy to change.
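To make the arithmetic concrete: shifting a holiday by 7 days and using a window of -3 to 3 around the shifted date covers days 4 to 10 after the real holiday. The sketch below illustrates this with the standard library only. The function names are hypothetical, the dictionary keys follow the column names Prophet expects in its holidays table (`holiday`, `ds`, `lower_window`, `upper_window`, `prior_scale`), and the `prior_scale` default of 10.0 is an assumption for illustration, not a value taken from this dataset.

```python
from datetime import date, timedelta

SHIFT_DAYS = 7      # holiday dates are moved 7 days forward
LOWER_WINDOW = -3   # must be <= 0 for Prophet
UPPER_WINDOW = 3


def shifted_holiday_row(name, holiday_date, prior_scale=10.0):
    """Build one holiday row whose date is shifted forward by SHIFT_DAYS."""
    return {
        "holiday": name,
        "ds": holiday_date + timedelta(days=SHIFT_DAYS),
        "lower_window": LOWER_WINDOW,
        "upper_window": UPPER_WINDOW,
        "prior_scale": prior_scale,  # assumed default, adjust per holiday
    }


def effective_delay_range():
    """Return (first, last) day after the real holiday covered by the window."""
    return SHIFT_DAYS + LOWER_WINDOW, SHIFT_DAYS + UPPER_WINDOW


row = shifted_holiday_row("Christmas Day", date(2020, 12, 25))
print(row["ds"], effective_delay_range())
```

A list of such rows could be turned into a `pandas.DataFrame` and passed to Prophet through its `holidays` argument.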
### Content
This dataset:
1. Contains ISO codes, ISO names (common and official) (ISO 3166) of **70** countries (3 European countries **['Albania' - 'AL', 'Georgia' - 'GE', 'Moldova' - 'MD']** have been added).
2. Contains dates imported from the holidays library for 2020-01-20 to 2021-12-31 (all countries from the holidays library as of Jan 2021), and the same dates moved 7 days forward.
3. Contains holidays of countries that are not in the holidays library, but which appear in the data of the World Health Organization and for which considerable coronavirus statistics have already been collected.
4. Parameters for Prophet model:
`lower_window, upper_window, prior_scale`
If you find errors, please write to the [Discussion](https://www.kaggle.com/vbmokin/covid19-holidays-of-countries/discussion).
It is planned to periodically update (and, if necessary, correct) this dataset.
### Acknowledgements
Thanks to the authors of the information resources
* [https://github.com/dr-prodigy/python-holidays](https://github.com/dr-prodigy/python-holidays)
* [https://en.wikipedia.org/wiki/List_of_holidays_by_country](https://en.wikipedia.org/wiki/List_of_holidays_by_country)
about the dates and names of holidays in different countries, which I used.
Thanks for the image to [iXimus](https://pixabay.com/ru/users/iXimus-2352783/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=5062659) from [Pixabay](https://pixabay.com/ru/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=5062659)
### Inspiration
The main task for which this dataset was created is to study the impact of holidays on the accuracy of predicting coronavirus diseases, identifying new patterns, and forming optimal solutions to counteract or minimize its spread.
Tasks that need to be solved to improve this dataset in order to increase the accuracy of modeling the impact of holidays on the number of coronavirus patients:
1) Expanding the list of countries
2) Clarification of holiday dates
3) Clarification of the parameters
`lower_window, upper_window, prior_scale`
which must be unique for each country and each holiday.
Also, it is advisable to carry out similar work for each region of countries, but this will not be done in this dataset.
--- Original source retains full ownership of the source dataset ---
2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
Downloadable data:
https://github.com/CSSEGISandData/COVID-19
Additional Information about the Visual Dashboard:
https://systems.jhu.edu/research/public-health/ncov
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To create the dataset, the top 10 countries by COVID-19 incidence in the world as of October 22, 2020 (on the eve of the second wave of the pandemic) were considered, and those represented in the Global 500 ranking for 2020 were selected: USA, India, Brazil, Russia, Spain, France and Mexico. For each of these countries, no more than 10 of the largest transnational corporations included in the Global 500 rating for 2020 and 2019 were selected separately. Arithmetic averages and the change (increase) were calculated for indicators such as the profitability of enterprises, their ranking position (competitiveness), asset value and number of employees. The arithmetic mean values of these indicators across all countries in the sample were then found, characterizing the situation in international entrepreneurship as a whole in the context of the COVID-19 crisis in 2020 on the eve of the second wave of the pandemic.

The data is collected in a general Microsoft Excel table. The dataset is a unique database that combines COVID-19 statistics and entrepreneurship statistics. It is flexible and can be supplemented with data from other countries and newer statistics on the COVID-19 pandemic. Because the values in the dataset are formulas rather than ready-made numbers, adding or changing values in the original table at the beginning of the dataset automatically recalculates most of the subsequent tables and updates the graphs. This allows the dataset to be used not just as an array of data, but as an analytical tool for automating scientific research on the impact of the COVID-19 pandemic and crisis on international entrepreneurship. The dataset includes not only tabular data, but also charts that visualize the data.

The dataset contains not only actual, but also forecast data on morbidity and mortality from COVID-19 for the period of the second wave of the pandemic in 2020. The forecasts are presented as a normal distribution of predicted values together with the probability of their occurrence in practice. This allows for a broad scenario analysis of the impact of the COVID-19 pandemic and crisis on international entrepreneurship: various predicted morbidity and mortality rates can be substituted into the risk assessment tables to obtain automatically calculated consequences (changes) for the characteristics of international entrepreneurship. It is also possible to substitute the actual values observed during and after the second wave of the pandemic in order to check the reliability of the earlier forecasts and conduct a plan-fact analysis. The dataset contains not only the numerical values of the initial and predicted values of the studied indicators, but also their qualitative interpretation, reflecting the presence and level of risks that the pandemic and the COVID-19 crisis pose for international entrepreneurship.
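As an illustration of how a normally distributed forecast can feed a scenario analysis, the sketch below computes the probability that a predicted value falls inside a chosen scenario range. It is not taken from the dataset's Excel formulas; the function name `scenario_probability` is hypothetical, and the mean, standard deviation and range in the usage lines are made-up numbers used purely for demonstration.

```python
from statistics import NormalDist


def scenario_probability(mean, std_dev, low, high):
    """Probability that a normally distributed forecast lies in [low, high]."""
    forecast = NormalDist(mu=mean, sigma=std_dev)
    return forecast.cdf(high) - forecast.cdf(low)


# Hypothetical forecast: mean 1000 daily cases, standard deviation 100.
# The range below spans one standard deviation on either side of the mean.
probability = scenario_probability(1000.0, 100.0, 900.0, 1100.0)
print(round(probability, 4))
```

Substituting a different range corresponds to evaluating a different scenario against the same forecast distribution.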
Summary

The number of cases interviewed who had a completed answer to the question asking if they visited or worked at any of a list of high risk locations in the 14 days before they became ill (or had a positive test) during their covidLINK interviews.

Description

The MD COVID-19 - Contact Tracing Cases High Risk Locations layer reflects the number of cases interviewed who had a completed answer to the question asking if they visited or worked at any of a list of high risk locations in the 14 days before they became ill (or had a positive test) during their covidLINK interviews. Respondents may indicate that they visited or worked at more than one category of high risk location. For a variety of reasons, some individuals choose not to answer particular questions during the course of their interview.

Events and locations where there is prolonged exposure to other people (including weddings, parties, stores, restaurants, etc.) are considered “high risk” for COVID-19 transmission. The more interaction at a gathering or location, the more likely a person may be to transmit or become infected with the virus. More information about considerations for events and gatherings, including how to assess risk levels and promote healthy behaviors that reduce spread, is available from the Centers for Disease Control and Prevention.

Answers to interview questions do not provide evidence of cause and effect. Due to the nature of COVID-19 and the wide range of scenarios in which a person can become infected, most of the time it will not be possible to pinpoint exactly where and when a case became infected. Though a person may report attendance at a particular location, that does not mean that transmission happened at that location.

The covidLINK interview questionnaire is updated as necessary to capture relevant information related to case exposure and potential onward transmission. These revisions should be taken into consideration when evaluating trends in case responses over time.

COVID-19 is a disease caused by a respiratory virus first identified in Wuhan, Hubei Province, China in December 2019. COVID-19 is a new virus that hasn't caused illness in humans before. Worldwide, COVID-19 has resulted in thousands of infections, causing illness and in some cases death. Cases have spread to countries throughout the world, with more cases reported daily. The Maryland Department of Health reports daily on COVID-19 cases by county.