License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Twitter is a gold mine of data. Unlike other social platforms, almost every user’s tweets are completely public and pullable.
This is a huge plus if you’re trying to get a large amount of data to run analytics on. Twitter data is also pretty specific.
Twitter’s API allows you to do complex queries, like pulling every tweet about a certain topic within the last twenty minutes, or pulling a certain user’s non-retweeted tweets.
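As a rough illustration, such a query could be issued against the v2 recent-search endpoint as in the sketch below (the bearer token and search string are placeholders, and the endpoint requires an approved developer account):

```python
# Sketch: fetch tweets about a topic from the last twenty minutes,
# excluding retweets, via the Twitter API v2 recent-search endpoint.
import datetime
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder, not a real credential

start = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(minutes=20)
params = {
    "query": "covid vaccine -is:retweet",          # illustrative topic query
    "start_time": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
    "max_results": 100,
}
resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params=params,
)
resp.raise_for_status()
for tweet in resp.json().get("data", []):
    print(tweet["id"], tweet["text"])
```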
This dataset contains users' tweets regarding COVID-19 vaccines. It contains 9127 tweets with 36 feature columns. The dataset can be used for sentiment analysis to understand public attitudes towards COVID-19 vaccines. The tweets were fetched using twint.
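The exact twint configuration used for this dataset is not given; as a rough sketch, tweets like these could be pulled as follows (the search string, limit, and output file are illustrative):

```python
# Sketch: scrape vaccine-related tweets with twint and store them as CSV.
import twint

c = twint.Config()
c.Search = "covid vaccine"             # illustrative topic query
c.Limit = 200                          # illustrative cap on tweet count
c.Lang = "en"
c.Store_csv = True
c.Output = "covid_vaccine_tweets.csv"  # hypothetical output file

twint.run.Search(c)
```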
Thanks to twint for their advanced Twitter scraping tool.
COVID-19 is taking a heavy toll on our lives. Different countries have started their vaccination campaigns, but misinformation about vaccines is sometimes observed on social media, and people hold differing attitudes towards vaccination. I hope that analysis of this dataset will help stop the spread of misinformation.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Coronavirus infection is currently the most important health topic. It has tested, and continues to test, healthcare systems around the world to the fullest extent. Although big progress has been made in handling this pandemic, a tremendous number of questions remain to be answered. I hereby present the local Bulgarian COVID-19 dataset with some context. Bulgaria stands out compared to other countries and deserves analysis, so this dataset could serve as a useful comparator.
Context for the Bulgarian population: Population: 6,948,445; Median age: 44.7 years; Aged >65: 20.801%; Aged >70: 13.272%
Summary of the results:
- The first pandemic wave was weak, probably because of the early state of emergency (declared 5 days after the first confirmed case). Whether this was a good decision, or whether it came too early and merely postponed the inevitable, is debatable.
- The healthcare system collapsed (probably due to delayed measures) in the second and third waves, which placed Bulgaria at the top of the mortality and morbidity rankings worldwide and in the EU.
- The low percentage of vaccinated people resulted in a prolonged epidemic and delayed the lifting of preventive measures.
Some important moments to consider when interpreting the data:
08.03.2020 - Bulgaria confirmed its first two cases; the government issued a nationwide ban on closed-door public events (first lockdown).
13.03.2020 - After 16 reported cases in one day, Bulgaria declared a state of emergency for one month, until 13.04.2020. Schools, shopping centers, cinemas, restaurants, and other places of business were closed; all sports events were suspended; only supermarkets, food markets, pharmacies, banks, and gas stations remained open.
03.04.2020 - The National Assembly approved the government's proposal to extend the state of emergency by one month, until 13.05.2020.
14.05.2020 - The national state of emergency was lifted and replaced by a declared emergency epidemic situation. Schools and daycares remained closed, as did shopping centers and indoor restaurants.
18.05.2020 - Shopping malls and fitness centers opened.
01.06.2020 - Restaurants and gaming halls opened.
10.07.2020 - Discos and bars were closed; sports events were held without an audience.
29.10.2020 - High school and college students transitioned to online learning.
27.11.2020 - All education moved online; restaurants, nightclubs, bars, and discos were closed (second lockdown, 27.11 - 21.12).
05.12.2020 - The 14-day mortality rate was the highest in the world.
16.01.2021 - Some students went back to school.
01.03.2021 - Restaurants and casinos opened.
22.03.2021 - Restaurants, shopping malls, fitness centers, and schools were closed (third lockdown for 10 days, 22.03 - 31.03).
19.04.2021 - Children's daycare facilities, fitness centers, and nightclubs opened.
This dataset consists of 447 rows and 29 columns, covering the period 08.03.2020 - 28.05.2021. There are some missing values at the beginning, before proper statistical reporting was established.
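As a hedged sketch of working with the data, the snippet below loads the file and computes a 14-day incidence per 100,000 people using the population figure above; the file name and column names ("date", "new_cases") are assumptions, not the dataset's documented schema:

```python
# Sketch: load the daily series and compute 14-day incidence per 100,000.
import pandas as pd

POPULATION = 6_948_445  # Bulgarian population, from the dataset context

df = pd.read_csv("covid19_bulgaria.csv")                 # hypothetical file name
df["date"] = pd.to_datetime(df["date"], dayfirst=True)   # dates like 08.03.2020
df = df.sort_values("date")

# Rolling 14-day sum of new cases, scaled to cases per 100,000 population.
df["incidence_14d_per_100k"] = (
    df["new_cases"].rolling(window=14).sum() / POPULATION * 100_000
)
print(df[["date", "incidence_14d_per_100k"]].tail())
```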
Anyone who wishes to collaborate is invited to join a publication proposal. Depending on the value of the findings and the relevance of the topic, publication is expected: in a local journal (guaranteed); in a SCOPUS-indexed journal (highly probable); in an impact-factor journal (if the results are really insightful).
Possible topics include, but are not limited to: descriptive analysis of the pandemic outbreak in the country; prediction of the pandemic or of the vaccination rate; discussion of the numbers compared to other countries and the world; discussion of the government's decisions; estimation of cut-off values for stepping restrictions up or down.
If you find an error, have a question, or wish to make a suggestion, I encourage you to reach out to me.
Prepared by Lan Thuong Nguyen, a PhD candidate in the International Doctoral Program in Asia-Pacific Studies (IDAS) at National Chengchi University (NCCU), at the Center for Asia-Pacific Resilience and Innovation (CAPRi).
Lan Thuong Nguyen is a co-author of this project alongside an American researcher, Dr. Yen Pottinger, with clearly defined responsibilities. Nguyen's role is sourcing and analyzing documents related to public health policies during the COVID-19 pandemic, vaccination promotion programs, communication strategies against COVID-19, and research articles and reports on vaccine acceptance rates among the Vietnamese population. Additionally, she examines public sentiment regarding the government's COVID-19 strategies and other relevant information. Accordingly, she searched, curated, and compiled the datasets and stored them in depositar. She is also responsible for overseeing the storage, management, and, if necessary, customization of these data. The management process does not require additional resources or incur storage or data preparation costs. The datasets will be shared via the repository, with access requests managed by Lan Thuong Nguyen. No personal data is included in the datasets.
The project titled "Misinformation, Disinformation, and Vaccine Hesitancy in Vietnam" forms part of a broader series of studies analyzing vaccine hesitancy across various countries in the Asia-Pacific region. This research examines both the historical context and the impact of the COVID-19 pandemic, with a particular focus on the influence of misinformation and disinformation on governmental and civil society efforts to promote vaccination. The project belongs to the Center for Asia-Pacific Resilience and Innovation (CAPRi); it has been completed and posted on the CAPRi website.
Specifically, the project aims to analyze the factors contributing to vaccine hesitancy in Vietnam, with a particular focus on the influence of misinformation and disinformation. It examines the historical context, the role of digital and social media, and the effectiveness of governmental and public health responses in addressing these challenges during the COVID-19 pandemic. The project contains metadata on the Vietnamese vaccination program and focuses on the country's public health policy, communication strategies, and vaccination experiences.
The dataset below is part of this project. It introduces the COVID-19 prevention policies, provides an overview of the current status, and compiles academic research on vaccine acceptance, the prevalence of misinformation, and how governments are addressing these issues.
Files must be downloaded to use the entire dataset (depositar only provides limited data previews). The dataset comprises one ZIP archive, one XLSX spreadsheet, and one PDF file. The ZIP archive contains academic research and documents, in Vietnamese and English, on experiences promoting COVID-19 vaccination; these were collected for reference in this project, and each article, research paper, or report is accompanied by a link. The XLSX spreadsheet is a collection of the country's public health policies, compiled by the author to understand how the Vietnamese government prevented, combated, and governed the anti-COVID-19 campaign; it is used for reference purposes. The PDF file is a literature review written by the author, with detailed citations and references, produced at the project manager's request to give an overview of Vietnam's public health policy.
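As a minimal sketch of getting started with the files (both file names below are hypothetical stand-ins for the actual names in the deposit):

```python
# Sketch: unpack the document archive and open the policy spreadsheet.
import zipfile

import pandas as pd

with zipfile.ZipFile("vietnam_vaccination_documents.zip") as zf:  # hypothetical name
    zf.extractall("documents/")

policies = pd.read_excel("public_health_policies.xlsx")  # hypothetical name
print(policies.head())
```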
In its present state, the dataset is presented primarily in Vietnamese and English.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Introduction
There are several works applying Natural Language Processing to newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types; the purpose was to contribute to the scarce data available for training machine learning algorithms. Doumit et al. in [3] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, not much NLP research has been invested in studying COVID-19. Most applications involve classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus; similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.
2 Dataset Introduction
2.1 Data Collection
We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named that collection “Covid-News-BD-NNK”. These newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most-read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This approach was preferable to automation to ensure the news was highly relevant to the subject: the newspapers' online sites had dynamic content with advertisements in no particular order, so automated scrapers had a high chance of collecting inaccurate news reports. One challenge while collecting the data was the requirement of a subscription; each newspaper required $1 per subscription. Some criteria provided as guidelines to the human data collectors were as follows:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are prioritized.
Avoid taking duplicate reports.
Maintain a consistent time frame across the above-mentioned newspapers.
To collect these data, we used a Google Form for both the USA and BD collections. Two human editors went through each entry to check for spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows (a sketch of these steps is given after the list):
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
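The exact tooling used for these steps is not specified in the paper, so the following is an illustrative sketch using NLTK:

```python
# Sketch: the four pre-processing steps listed above, implemented with NLTK.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)            # remove hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # keep English letters only
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]   # lemmatize
    return " ".join(tokens)

print(preprocess("COVID-19 cases are rising: see https://example.com/report"))
```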
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since altering sentence structures could cause a loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check the output against the above-mentioned criteria.
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100

Table 2: Covid-News-BD-NNK data statistics
No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500
2.3 Dataset Repository
We used GitHub as our primary data repository, under the account name NKK^1. There, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script, which we provide for these essential operations. We welcome all outside collaboration to enrich the dataset.
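As a minimal sketch of that regeneration step (the file names are illustrative, since the repository's exact script is not reproduced here):

```python
# Sketch: regenerate the JSON release from the maintained CSV with pandas.
import pandas as pd

df = pd.read_csv("Covid-News-USA-NNK.csv")  # hypothetical file name
df.to_json("Covid-News-USA-NNK.json", orient="records", indent=2)
```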
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods, such as one-hot encoding and word embeddings, that transform text into a machine-readable form which can be fed to machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for NLP itself. The two most prominent are BERT [14], which uses a bidirectional transformer encoder architecture and achieves near state-of-the-art results on classification and masked-word prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. These are used as pre-trained models, since training them carries a huge computational cost. Information extraction is the general concept of retrieving information from a dataset. Information extraction from an image could mean retrieving vital feature spaces or targeted portions of the image; information extraction from speech could mean retrieving names, places, etc. [16]; information extraction from text could mean identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction: it clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea of what a set of texts is about. One commonly used topic model is Latent Dirichlet Allocation, or LDA [17].
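As a small illustration of LDA in practice, the following sketch fits a two-topic model with scikit-learn on a toy corpus (the documents and topic count are illustrative, not drawn from the NKK dataset):

```python
# Sketch: LDA topic modeling on a tiny toy corpus with scikit-learn.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "virus outbreak spread china wuhan cases",
    "economy market jobs stocks crisis bank",
    "vaccine research trial treatment health",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per topic, highest-weighted first.
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:4]]
    print(f"Topic {i}: {top}")
```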
Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that builds a graph over the words, calculates a weight for each word, and picks the words with the highest weights.
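A TextRank-style sketch is shown below: it builds a word co-occurrence graph and ranks words with PageRank via networkx. The sample sentence and window size are illustrative, and full TextRank adds further refinements such as part-of-speech filtering:

```python
# Sketch: TextRank-style keyword ranking over a word co-occurrence graph.
import itertools

import networkx as nx

text = "coronavirus cases rise as government response to coronavirus crisis slows"
words = [w for w in text.split() if len(w) > 3]

# Connect words that co-occur within a sliding window of 3.
G = nx.Graph()
for window in zip(words, words[1:], words[2:]):
    for a, b in itertools.combinations(window, 2):
        if a != b:
            G.add_edge(a, b)

scores = nx.pagerank(G)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word}: {score:.3f}")
```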
Word clouds are a great visualization technique for understanding the overall 'talk of the topic': the clustered words give a quick understanding of the content.
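As a minimal sketch, a word cloud can be produced with the wordcloud library used later in this paper (the input text here is illustrative):

```python
# Sketch: generate and display a word cloud from a block of text.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "coronavirus outbreak china economy masks health response government"
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```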
4 Our Experiments and Result Analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 to 3 present the word clouds of the Covid-News-USA-NNK dataset by month, from February to May. From Figures 1, 2, and 3, we can note the following:
In February, both newspapers talked about China and the source of the outbreak.
StarTribune emphasized Minnesota as the state of greatest concern, and its coverage appears even more concerned in April.
Both newspapers talked about the virus impacting the economy, e.g., banks, elections, administrations, and markets.
Washington Post discussed global issues more than StarTribune.
In February, StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread within the United States is more highlighted throughout March to May, reflecting the critical impact of the virus.
We used a script to extract all numbers related to certain keywords, such as 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lockdown', and 'Diagnosed', from the news reports, and created a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction, we can observe that April was the peak month for COVID-19 cases, after a gradual rise from February. Both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April, an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the outbreak.

We used VADER sentiment analysis to extract the sentiment of the headlines and the body texts. On average, the sentiments ranged from -0.5 to -0.9 on the VADER scale, which runs from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and the body contradicted each other, i.e., the sentiment of the headline was negative while the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive news, from which we can learn more about the indicators related to COVID-19 and its serious impact. Moreover, sentiment analysis can also tell us how a state or country is reacting to the pandemic.

We used the PageRank algorithm to extract keywords from the headlines as well as the body content; PageRank efficiently highlights important, relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
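As an illustration of the VADER scoring described above, the sketch below computes compound scores for two invented headlines (the headlines are illustrative, not drawn from the dataset):

```python
# Sketch: score headline sentiment with VADER; compound scores lie in [-1, 1].
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
headlines = [
    "Death toll surges as hospitals run out of beds",
    "Recovered patients donate plasma to help fight the virus",
]
for h in headlines:
    score = analyzer.polarity_scores(h)["compound"]
    print(f"{score:+.3f}  {h}")
```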
On March 10, 2023, the Johns Hopkins Coronavirus Resource Center ceased collecting and reporting global COVID-19 data. For updated cases, deaths, and vaccine data, please visit the following sources: Global: World Health Organization (WHO); U.S.: U.S. Centers for Disease Control and Prevention (CDC). For more information, visit the Johns Hopkins Coronavirus Resource Center.

This feature layer contains the most up-to-date COVID-19 cases for the US and Canada. Data sources: WHO, CDC, ECDC, NHC, DXY, 1point3acres, Worldometers.info, BNO, state and national government health departments, and local media reports. The layer is created and maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University and is supported by the Esri Living Atlas team and JHU Data Services. It is open to the public and free to share. Contact Johns Hopkins.

IMPORTANT NOTICE:
1. Fields for Active Cases and Recovered Cases are set to 0 in all locations. Johns Hopkins has not found a reliable source for this information at the county level but will continue to look and carry the fields.
2. Fields for Incident Rate and People Tested are placeholders for when this becomes available at the county level.
3. In some instances, cases have not been assigned a location at the county scale. Those are still assigned a state but are listed as unassigned and given a Lat/Long of 0,0.

Data field descriptions by alias name:
Province/State: (Text) Country province or state name (Level 2 key)
Country/Region: (Text) Country or region name (Level 1 key)
Last Update: (Datetime) Last data update date/time in UTC
Latitude: (Float) Geographic latitude in decimal degrees (WGS1984)
Longitude: (Float) Geographic longitude in decimal degrees (WGS1984)
Confirmed: (Long) Best collected count of confirmed cases reported by geography
Recovered: (Long) Not currently in use; JHU is looking for a source
Deaths: (Long) Best collected count of case deaths reported by geography
Active: (Long) Confirmed - Recovered - Deaths (computed); not currently in use due to lack of recovered data
County: (Text) US county name (Level 3 key)
FIPS: (Text) US state/county codes
Combined Key: (Text) Comma-separated concatenation of key field values (L3, L2, L1)
Incident Rate: (Long) Not currently in use; placeholder for additional data
People Tested: (Long) Not currently in use; placeholder for additional data
People Hospitalized: (Long) Not currently in use; placeholder for additional data
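As a hedged sketch, a layer like this can be queried through the standard ArcGIS REST query endpoint. The service URL below is a placeholder (the layer's exact URL is not given here), and the field names are assumed to follow the aliases above with underscores:

```python
# Sketch: query an ArcGIS feature layer's REST endpoint for US records.
import requests

LAYER_URL = "https://services.arcgis.com/.../FeatureServer/0/query"  # placeholder

params = {
    "where": "Country_Region = 'US'",  # assumed field name, per aliases above
    "outFields": "Province_State,Confirmed,Deaths,Last_Update",
    "f": "json",
}
resp = requests.get(LAYER_URL, params=params)
resp.raise_for_status()
for feature in resp.json().get("features", []):
    print(feature["attributes"])
```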
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Childhood immunization and summary statistics by GHSI scores (2019).