100+ datasets found

Most searched communication keywords during COVID-19 outbreak in the UK 2020...
statista.com
Updated Dec 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2022). Most searched communication keywords during COVID-19 outbreak in the UK 2020 [Dataset]. https://www.statista.com/statistics/1126907/communication-search-terms-during-covid-19-in-the-uk/
Explore at:
Dataset updated
Dec 2, 2022
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Jan 2020 - Apr 2020
Area covered
United Kingdom
Description
According to data from Pi Datametrics, the most searched term relating to insurance on Google from January to April 2020 in the United Kingdom was "broadband speed test". "iphone xr" was the second most searched term, followed by "iphone".
Most searched keywords on google during COVID-19 outbreak in the UK 2020
statista.com
Updated Aug 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2023). Most searched keywords on google during COVID-19 outbreak in the UK 2020 [Dataset]. https://www.statista.com/statistics/1125526/most-searched-terms-during-covid-19-in-the-uk/
Explore at:
Dataset updated
Aug 29, 2023
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Jan 2020 - Apr 2020
Area covered
United Kingdom
Description
According to data from Pi Datametrics, the most searched term on Google from January to April 2020 in the United Kingdom (UK) was "airpods".
Z
INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET
data.niaid.nih.gov
zenodo.org
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Kishor Datta Gupta
Nishat Anjum
Nafiz Sadman
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bangladesh, United States
Description
Introduction

There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Standford NLP and SVM by Rameshbhaiet. Al.compared several algorithms on a small and large dataset. Rubinet. al., in their paper [ 2 ], created a mechanism to differentiate fake news from real ones by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumitet. al.in [ 3 ] have implemented LDA, a topic modeling approach to study bias present in online news media.

However, there are not many NLP research invested in studying COVID-19. Most applications include classification of chest X-rays and CT-scans to detect presence of pneumonia in lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus[ 5 ][ 6 ][ 7 ] and replicating its structure to fight and find a vaccine. This research is crucial in battling the pandemic. The few NLP based research publications are sentiment classification of online tweets by Samuel et el [ 8 ] to understand fear persisting in people due to the virus. Similar work has been done using the LSTM network to classify sentiments from online discussion forums by Jelodaret. al.[ 9 ]. NKK dataset is the first study on a comparatively larger dataset of a newspaper report on COVID-19, which contributed to the virus’s awareness to the best of our knowledge.

2 Data-set Introduction

2.1 Data Collection

We accumulated 1000 online newspaper report from United States of America (USA) on COVID-19. The newspaper includes The Washington Post (USA) and StarTribune (USA). We have named it as “Covid-News-USA-NNK”. We also accumulated 50 online newspaper report from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspaper includes The Daily Star (BD) and Prothom Alo (BD). All these newspapers are from the top provider and top read in the respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was suitable compared to automation to ensure the news were highly relevant to the subject. The newspaper online sites had dynamic content with advertisements in no particular order. Therefore there were high chances of online scrappers to collect inaccurate news reports. One of the challenges while collecting the data is the requirement of subscription. Each newspaper required $1 per subscriptions. Some criteria in collecting the news reports provided as guideline to the human data-collectors were as follows:

The headline must have one or more words directly or indirectly related to COVID-19.

The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

The genre of the news can be anything as long as it is relevant to the topic. Political, social, economical genres are to be more prioritized.

Avoid taking duplicate reports.

Maintain a time frame for the above mentioned newspapers.

To collect these data we used a google form for USA and BD. We have two human editor to go through each entry to check any spam or troll entry.

2.2 Data Pre-processing and Statistics

Some pre-processing steps performed on the newspaper report dataset are as follows:

Remove hyperlinks.

Remove non-English alphanumeric characters.

Remove stop words.

Lemmatize text.

While more pre-processing could have been applied, we tried to keep the data as much unchanged as possible since changing sentence structures could result us in valuable information loss. While this was done with help of a script, we also assigned same human collectors to cross check for any presence of the above mentioned criteria.

The primary data statistics of the two dataset are shown in Table 1 and 2.

Table 1: Covid-News-USA-NNK data statistics

No of words per headline

7 to 20

No of words per body content

150 to 2100

Table 2: Covid-News-BD-NNK data statistics No of words per headline

10 to 20

No of words per body content

100 to 1500

2.3 Dataset Repository

We used GitHub as our primary data repository in account name NKK^1. Here, we created two repositories USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON format. We are regularly updating the CSV files and regenerating JSON using a py script. We provided a python script file for essential operation. We welcome all outside collaboration to enrich the dataset.

3 Literature Review

Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

Some well-known applications of NLP includes fraud detection on online media sites[ 10 ], using authorship attribution in fallback authentication systems[ 11 ], intelligent conversational agents or chatbots[ 12 ] and machine translations used by Google Translate[ 13 ]. While these are all downstream tasks, several exciting developments have been made in the algorithm solely for Natural Language Processing tasks. The two most trending ones are BERT[ 14 ], which uses bidirectional encoder-decoder architecture to create the transformer model, that can do near-perfect classification tasks and next-word predictions for next generations, and GPT-3 models released by OpenAI[ 15 ] that can generate texts almost human-like. However, these are all pre-trained models since they carry huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc[ 16 ]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of text. One commonly used topic modeling is Latent Dirichlet Allocation or LDA[17].

Keyword extraction is a process of information extraction and sub-task of NLP to extract essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and pick the words with more weight to it.

Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

4 Our experiments and Result analysis

We used the wordcloud library^4 to create the word clouds. Figure 1 and 3 presents the word cloud of Covid-News-USA- NNK dataset by month from February to May. From the figures 1,2,3, we can point few information:

In February, both the news paper have talked about China and source of the outbreak.

StarTribune emphasized on Minnesota as the most concerned state. In April, it seemed to have been concerned more.

Both the newspaper talked about the virus impacting the economy, i.e, bank, elections, administrations, markets.

Washington Post discussed global issues more than StarTribune.

StarTribune in February mentioned the first precautionary measurement: wearing masks, and the uncontrollable spread of the virus throughout the nation.

While both the newspaper mentioned the outbreak in China in February, the weight of the spread in the United States are more highlighted through out March till May, displaying the critical impact caused by the virus.

We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’ , ’Infections’, ’Quarantined’, Lock-down’, ’Diagnosed’ etc from the news reports and created a number of cases for both the newspaper. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for the covid cases as it gradually rose from February. Both the newspaper clearly shows us that the rise in covid cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack. We used Vader Sentiment Analysis to extract sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. Vader Sentiment scale ranges from -1(highly negative to 1(highly positive). There were some cases

where the sentiment scores of the headline and body contradicted each other,i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can assist us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide us information about how a state or country is reacting to the pandemic. We used PageRank algorithm to extract keywords from headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both the datasets are: ’China’, Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’ , ’Stock market’ , ’Jobs’ , ’Election’, ’Missteps’, ’Health’, ’Response’. Keywords extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
Dataset covidgilance signals
zenodo.org
bin, csv +3
Updated Sep 25, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaudinat Arnaud; Gaudinat Arnaud (2020). Dataset covidgilance signals [Dataset]. http://doi.org/10.5281/zenodo.4048460
Explore at:
csv, tsv, bin, text/x-python, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4048460
Dataset updated
Sep 25, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gaudinat Arnaud; Gaudinat Arnaud
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Research datasets about top signals for covid 19 (coronavirus) for study into Google Trends (GT) and with SEO metrics

Website

The study is currently published on https://covidgilance.org website (in french)

Datasets description

covid signals -> |selection| -> 4 dataset -> |serp.py| -> 4 serp datasets -> |aggregate_serp.pl| -> 4 aggregated dataset of serp -> |prepare datasets| -> 4 ranked top seo dataset

Original lists of signals (mainly covid symptoms) - dataset

Description: contain the original relevant list of signals for covid19 (here list of queries where you can see, in GT, a relevant signal during the covid 19 period of time)
Name: covid_signal_list.tsv

List of content:

- id: unique id for the topic
- topic-fr: name of the topic in French
- topic-en: name of the topic in English
- topic-id: GT topic id
- keyword fr: one or several keywords in French for GT
- keyword en: one or several keywords in English for GT
- fr-topic-url-12M: link to 12-months French query topic in GT in France
- en-topic-url-12M: link to 12-months English query topic in GT in US
- fr-url-12M: link to 12-months French queries in GT in France
- en-url-12M: link to 12-months English queries topic in GT in US
- fr-topic-url-5M: link to 5-months French query topic in GT in France
- en-topic-url-5M: link to 5-months English query topic in GT in US
- fr-url-5M: link to 5-months French queries in GT in France
- en-url-5M: link to 5-months English queries topic in GT in US

Tool to get SERP of covid signals - tool

Description: query google with a list of covid signals and obtain a list of serps in csv (tsv in fact) file format
Name: serper.py

python serper.py

SERP files - datasets

Description Serp results for 4 datesets of queries Names: simple version of covid signals from google.ch in French: serp_signals_20_ch_fr.csv
simple version of covid signals from google.com in English: serp_signals_20_en.csv
amplified version of covid signals from google.ch in French: serp_signals_covid_20_ch_fr.csv
amplified version of covid signals from google.com in English: serp_signals_covid_20_en.csv

amplified version means that for each query we create two queries one with the keywords "covid" and one with "coronavirus"

Tool to aggregate SERP results - tool

Description: load csv serp data and aggregate the data to create a new csv file where each line is a website and each column is a query. Name: aggregate_serp.pl

`perl aggregate_serp.pl> aggregated_signals_20_en.csv

datasets of top website from the SERP results - dataset

Description a aggregated version of the SERP where each line is a website and each column a query
Names:
aggregated_signals_20_ch_fr.csv
aggregated_signals_20_en.csv
aggregated_signals_covid_20_ch_fr.csv
aggregated_signals_covid_20_en.csv

List of content:

- domain: domain name of the website
- signal 1: Position of the query 1 (signal 1) in the SERP where 30 indicates arbitrary that this website is not present in the SERP
- signal ...: Position of the query (signal) in the SERP where 30 indicates arbitrary that this website is not present in the SERP
- signal n: Position of the query n (signal n) in the SERP where 30 indicates arbitrary that this website is not present in the SERP
- total: average position (total of all position /divided by the number of queries)
- missing: Total number of missing results in the SERP for this website

datasets ranked top seo - dataset

Description a ranked (by weighted average position) version of the aggregated version of the SERP where each line is a website and each column a query. TOP 20 have more information about the type and HONcode validity (from the date of collect: September 2020)

Names:
ranked_signals_20_ch_fr.csv
ranked_signals_20_en.csv
ranked_signals_covid_20_ch_fr.csv
ranked_signals_covid_20_en.csv

List of content:

- domain: domain name of the website
- signal 1: Position of the query 1 (signal 1) in the SERP where 30 indicates arbitrary that this website is not present in the SERP
- signal ...: Position of the query (signal) in the SERP where 30 indicates arbitrary that this website is not present in the SERP
- signal n: Position of the query n (signal n) in the SERP where 30 indicates arbitrary that this website is not present in the SERP
- avg position: average position (total of all position /divided by the number of queries)
- nb missing: Total number of missing results in the SERP for this website
- % presence: % of presence
- weighted avg postion: combination of avg position and % of presence for final ranking
- honcode: status of the Honcode certificate for this website (none/valid/expired)
- type: type of the website (health, gov, edu or media)
B
COVID-19 Twitter Dataset
borealisdata.ca
Updated Nov 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anatoliy Gruzd; Philip Mai (2020). COVID-19 Twitter Dataset [Dataset]. http://doi.org/10.5683/SP2/PXF2CU
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP2/PXF2CU
Dataset updated
Nov 10, 2020
Dataset provided by
Borealis
Authors
Anatoliy Gruzd; Philip Mai
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The current dataset contains 237M Tweet IDs for Twitter posts that mentioned "COVID" as a keyword or as part of a hashtag (e.g., COVID-19, COVID19) between March and July of 2020. Sampling Method: hourly requests sent to Twitter Search API using Social Feed Manager, an open source software that harvests social media data and related content from Twitter and other platforms. NOTE: 1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset. 2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or Python library Twarc (https://github.com/DocNow/twarc). 3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included in the dataset either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because tweets used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV). 4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as: https://github.com/thepanacealab/covid19_twitter https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset https://github.com/echen102/COVID-19-TweetIDs
Most searched E-learning keywords during COVID-19 outbreak in the UK 2020
statista.com
Updated Dec 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2022). Most searched E-learning keywords during COVID-19 outbreak in the UK 2020 [Dataset]. https://www.statista.com/statistics/1126652/e-learning-search-terms-during-covid-19-in-the-uk/
Explore at:
Dataset updated
Dec 2, 2022
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Jan 2020 - Apr 2020
Area covered
United Kingdom
Description
According to data from Pi Datametrics, the most searched term relating to E-learning on Google from January to April 2020 in the United Kingdom was "online courses". "Elearning"" was the second most searched term, followed by "e-learning for health".
H
Data from: A Large-Scale Dataset of Twitter Chatter About Online Learning...
dataverse.harvard.edu
data.niaid.nih.gov
+1more
Updated Aug 9, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nirmalya Thakur (2022). A Large-Scale Dataset of Twitter Chatter About Online Learning During The Current COVID-19 Omicron Wave [Dataset]. http://doi.org/10.7910/DVN/GBHOD9
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/GBHOD9
Dataset updated
Aug 9, 2022
Dataset provided by
Harvard Dataverse
Authors
Nirmalya Thakur
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset: N. Thakur, “A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave,” Journal of Data, vol. 7, no. 8, p. 109, Aug. 2022, doi: 10.3390/data7080109 Abstract The COVID-19 Omicron variant, reported to be the most immune evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations, centered around information seeking and sharing, related to online learning. Mining such conversations, such as Tweets, to develop a dataset can serve as a data resource for interdisciplinary research related to the analysis of interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset files contain the raw version that comprises 52,868 Tweet IDs (that correspond to the same number of Tweets) and the cleaned and preprocessed version that contains 46,208 unique Tweet IDs. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter and the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management. Data Description The dataset comprises 7 .txt files. The raw version of this dataset comprises 6 .txt files (TweetIDs_Corona Virus.txt, TweetIDs_Corona.txt, TweetIDs_Coronavirus.txt, TweetIDs_Covid.txt, TweetIDs_Omicron.txt, and TweetIDs_SARS CoV2.txt) that contain Tweet IDs grouped together based on certain synonyms or terms that were used to refer to online learning and the Omicron variant of COVID-19 in the respective tweets. The cleaned and preprocessed version of this dataset is provided in the .txt file - TweetIDs_Duplicates_Removed.txt. The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. For hydrating this dataset the Hydrator application (link to download the application: https://github.com/DocNow/hydrator/releases and link to a step-by-step tutorial: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweetsr) may be used. The list of all the synonyms or terms that were used for the dataset development is as follows: COVID-19: Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus online learning: online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures A description of the dataset files is provided below: TweetIDs_Corona Virus.txt – Contains 321 Tweet IDs correspond to tweets that comprise the keywords – "corona virus" and one or more keywords/terms that refer to online learning TweetIDs_Corona.txt – Contains 1819 Tweet IDs correspond to tweets that comprise the keyword – "corona" or "coronaoutbreak" and one or more keywords/terms that refer to online learning TweetIDs_Coronavirus.txt – Contains 1429 Tweet IDs correspond to tweets that comprise the keywords – "coronavirus" or "coronaviruspandemic" and one or more keywords/terms that refer to online learning TweetIDs_Covid.txt – Contains 41088 Tweet IDs correspond to tweets that comprise the keywords – "COVID" or "COVID19" or "COVID-19" and one or more keywords/terms that refer to online learning TweetIDs_Omicron.txt – Contains 8198 Tweet IDs correspond to tweets that comprise the keywords – "omicron" or "omicron variant" and one or more keywords/terms that refer to online learning TweetIDs_SARS CoV2.txt – Contains 13 Tweet IDs correspond to tweets that comprise the keyword – "SARS-CoV-2" and one or more keywords/terms that refer to online learning TweetIDs_Duplicates_Removed.txt - A collection of 46208 unique Tweet IDs from all the 6 .txt files mentioned above after...
E
Digital Narratives of Covid-19: a Twitter Dataset
live.european-language-grid.eu
ri.conicet.gov.ar
+3more
txt
Updated Mar 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Digital Narratives of Covid-19: a Twitter Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7603
Explore at:
txtAvailable download formats
Dataset updated
Mar 28, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We are releasing a Twitter dataset connected to our project Digital Narratives of Covid-19 (DHCOVID) that -among other goals- aims to explore during one year (May 2020-2021) the narratives behind data about the coronavirus pandemic.In this first version, we deliver a Twitter dataset organized as follows:
Each folder corresponds to daily data (one folder for each day): YEAR-MONTH-DAYIn every folder there are 9 different plain text files named with ""dhcovid"", followed by date (YEAR-MONTH-DAY), language (""en"" for English, and ""es"" for Spanish), and region abbreviation (""fl"", ""ar"", ""mx"", ""co"", ""pe"", ""ec"", ""es""):dhcovid_YEAR-MONTH-DAY_es_fl.txt: Dataset containing tweets geolocalized in South Florida. The geo-localization is tracked by tweet coordinates, by place, or by user information.dhcovid_YEAR-MONTH-DAY_en_fl.txt: We are gathering only tweets in English that refer to the area of Miami and South Florida. The reason behind this choice is that there are multiple projects harvesting English data, and, our project is particularly interested in this area because of our home institution (University of Miami) and because we aim to study public conversations from a bilingual (EN/ES) point of view.dhcovid_YEAR-MONTH-DAY_es_ar.txt: Dataset containing tweets from Argentina.dhcovid_YEAR-MONTH-DAY_es_mx.txt: Dataset containing tweets from Mexico.dhcovid_YEAR-MONTH-DAY_es_co.txt: Dataset containing tweets from Colombia.dhcovid_YEAR-MONTH-DAY_es_pe.txt: Dataset containing tweets from Perú.dhcovid_YEAR-MONTH-DAY_es_ec.txt: Dataset containing tweets from Ecuador.dhcovid_YEAR-MONTH-DAY_es_es.txt: Dataset containing tweets from Spain.dhcovid_YEAR-MONTH-DAY_es.txt: This dataset contains all tweets in Spanish, regardless of its geolocation.

For English, we collect all tweets with the following keywords and hashtags: covid, coronavirus, pandemic, quarantine, stayathome, outbreak, lockdown, socialdistancing. For Spanish, we search for: covid, coronavirus, pandemia, quarentena, confinamiento, quedateencasa, desescalada, distanciamiento social.The corpus of tweets consists of a list of Tweet Ids; to obtain the original tweets, you can use ""Twitter hydratator"" which takes the id and download for you all metadata in a csv file.We started collecting this Twitter dataset on April 24th, 2020 and we are adding daily data to our GitHub repository. There is a detected problem with file 2020-04-24/dhcovid_2020-04-24_es.txt, which we couldn't gather the data due to technical reasons.For more information about our project visit https://covid.dh.miami.edu/ For more updated datasets and detailed criteria, check our GitHub Repository: https://github.com/dh-miami/narratives_covid19/
g
Coronavirus (Covid-19) Data in the United States
github.com
openicpsr.org
+3more
csv
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
Explore at:
csvAvailable download formats
Dataset provided by
New York Times
License
https://github.com/nytimes/covid-19-data/blob/master/LICENSEhttps://github.com/nytimes/covid-19-data/blob/master/LICENSE
Description
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
Food and drink keyword growth during COVID-19 in the UK 2020
statista.com
Updated Dec 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2022). Food and drink keyword growth during COVID-19 in the UK 2020 [Dataset]. https://www.statista.com/statistics/1126623/food-and-drink-search-term-growth-during-covid-19-uk/
Explore at:
Dataset updated
Dec 2, 2022
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
United Kingdom
Description
According to data from Pi Datametrics, the food and drink related search term that saw the biggest year-on-year growth on Google in the UK from January to April 2020 compared to the same period in 2019 was "grocery delivery near me". "wine delivery" saw the second highest increase, followed by "dried yeast".
The top 30 keywords with respect to average TF-IDF score for cessation...
plos.figshare.com
xls
Updated Jun 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The top 30 keywords with respect to average TF-IDF score for cessation trials, completion trials, and the keywords with the largest TF-IDF score difference between cessation vs. completion trials, respectively. [Dataset]. https://plos.figshare.com/articles/dataset/The_top_30_keywords_with_respect_to_average_TF-IDF_score_for_cessation_trials_completion_trials_and_the_keywords_with_the_largest_TF-IDF_score_difference_between_cessation_i_vs_i_completion_trials_respectively_/14961274
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0253789.t006
Dataset updated
Jun 9, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Magdalyn E. Elkin; Xingquan Zhu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The top 30 keywords with respect to average TF-IDF score for cessation trials, completion trials, and the keywords with the largest TF-IDF score difference between cessation vs. completion trials, respectively.
Coronavirus Twitter Data: A collection of COVID-19 tweets with automated...
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Apr 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xiaolei Huang; Xiaolei Huang; Amelia Jamison; Amelia Jamison; David Broniatowski; David Broniatowski; Sandra Quinn; Sandra Quinn; Mark Dredze; Mark Dredze (2023). Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations [Dataset]. http://doi.org/10.5281/zenodo.3782831
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3782831
Dataset updated
Apr 3, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Xiaolei Huang; Xiaolei Huang; Amelia Jamison; Amelia Jamison; David Broniatowski; David Broniatowski; Sandra Quinn; Sandra Quinn; Mark Dredze; Mark Dredze
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains tweets related to COVID-19. The dataset contains Twitter ids, from which you can download the original data directly from Twitter. Additionally, we include the date, keywords related to COVID-19 and the inferred geolocation. Check detailed information at http://twitterdata.covid19dataresources.org/index.
TRACES Bulgarian Twitter Dataset on Covid-19 Annotated with Linguistic...
zenodo.org
data.niaid.nih.gov
Updated Dec 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Irina Temnikova; Irina Temnikova; Silvia Gargova; Veneta Kireva; Tsvetelina Stefanova; Silvia Gargova; Veneta Kireva; Tsvetelina Stefanova (2024). TRACES Bulgarian Twitter Dataset on Covid-19 Annotated with Linguistic Markers of Lies [Dataset]. http://doi.org/10.5281/zenodo.7614247
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7614247
Dataset updated
Dec 3, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Irina Temnikova; Irina Temnikova; Silvia Gargova; Veneta Kireva; Tsvetelina Stefanova; Silvia Gargova; Veneta Kireva; Tsvetelina Stefanova
Description
This dataset has been created within Project TRACES (more information: https://traces.gate-ai.eu/). The dataset contains 61411 tweet IDs of tweets, written in Bulgarian, with annotations. The dataset can be used for general use or for building lies and disinformation detection applications.

Note: this dataset is not fact-checked, the social media messages have been retrieved via keywords. For fact-checked datasets, see our other datasets.

The tweets (written between 1 Jan 2020 and 28 June 2022) have been collected via Twitter API under academic access in June 2022 with the following keywords:

(Covid OR коронавирус OR Covid19 OR Covid-19 OR Covid_19) - without replies and without retweets

(Корона OR корона OR Corona OR пандемия OR пандемията OR Spikevax OR SARS-CoV-2 OR бустерна доза) - with replies, but without retweets

Explanations of which fields can be used as markers of lies (or of intentional disinformation) are provided in our forthcoming paper (please cite it when using this dataset):

Irina Temnikova, Silvia Gargova, Ruslana Margova, Veneta Kireva, Ivo Dzhumerov, Tsvetelina Stefanova and Hristiana Nikolaeva (2023) New Bulgarian Resources for Detecting Disinformation. 10th Language and Technology
Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'23). Poznań. Poland.
c
TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19...
datacatalogue.cessda.eu
search.gesis.org
+4more
Updated Oct 19, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dimitrov, Dimitar; Baran, Erdal; Fafalios, Pavlos; Yu, Ran; Zhu, Xiaofei; Zloch, Matthäus; Dietze, Stefan (2024). TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic (Part 4, January 2021 - August 2022) [Dataset]. http://doi.org/10.7802/2470
Explore at:
Unique identifier
https://doi.org/10.7802/2470
Dataset updated
Oct 19, 2024
Dataset provided by
GESIS - Leibniz-Institut für Sozialwissenschaften & Heinrich-Heine-University Düsseldorf, Germany & L3S Research Center, Hannover, Germany
Institute of Computer Science, FORTH-ICS, Heraklion, Greece
GESIS - Leibniz-Institut für Sozialwissenschaften
Chongqing University of Technology, Chongqing, China
Authors
Dimitrov, Dimitar; Baran, Erdal; Fafalios, Pavlos; Yu, Ran; Zhu, Xiaofei; Zloch, Matthäus; Dietze, Stefan
Measurement technique
Web Scraping
Description
TweetsCOV19 is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. Metadata information about the tweets as well as extracted entities, sentiments, hashtags, user mentions, and resolved URLs are exposed in RDF using established RDF/S vocabularies (for the sake of privacy, we anonymize user IDs and we do not provide the text of the tweets). More information are available through TweetsCOV19's home page: https://data.gesis.org/tweetscov19/.

We also provide a tab-separated values (tsv) version of the dataset. Each line contains features of a tweet instance. Features are separated by tab character ("\t"). The following list indicate the feature indices:

Tweet Id: Long.

Username: String. Encrypted for privacy issues.

Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).

Followers: Integer.

Friends: Integer.

Retweets: Integer.

Favorites: Integer.

Entities: String. For each entity, we aggregated the original text, the annotated entity and the produced score from FEL library. Each entity is separated from another entity by char ";". Also, each entity is separated by char ":" in order to store "original_text:annotated_entity:score;". If FEL did not find any entities, we have stored "null;".

Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We splitted these two numbers by whitespace char " ". Positive sentiment was stored first and then negative sentiment (i.e. "2 -1").

Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with whitespace char " ". If no mentions appear, we have stored "null;".

Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with whitespace char " ". If no hashtags appear, we have stored "null;".

URLs: String: If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we have stored "null;"

To extract the dataset from TweetsKB, we compiled a seed list of 268 COVID-19-related keywords.

You can find the previous part 3 at https://doi.org/10.5281/zenodo.4593523 .
d
Replication Data for: Two years of Covid-19 pandemic : A higher prevalence...
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Errasfa, Mourad (2023). Replication Data for: Two years of Covid-19 pandemic : A higher prevalence of the disease was associated with higher geographic latitudes, lower temperatures, and unfavorable epidemiologic and demographic conditions. [Dataset]. http://doi.org/10.7910/DVN/JYYZEI
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/JYYZEI
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Errasfa, Mourad
Description
ABSTRACT Background : The Covid-19 pandemic associated with the SARS-CoV-2 has caused very high death tolls in many countries, while it has had less prevalence in other countries of Africa and Asia. Climate and geographic conditions, as well as other epidemiologic and demographic conditions, were a matter of debate on whether or not they could have an effect on the prevalence of Covid-19. Objective : In the present work, we sought a possible relevance of the geographic location of a given country on its Covid-19 prevalence. On the other hand, we sought a possible relation between the history of epidemiologic and demographic conditions of the populations and the prevalence of Covid-19 across four continents (America, Europe, Africa, and Asia). We also searched for a possible impact of pre-pandemic alcohol consumption in each country on the two year death tolls across the four continents. Methods : We have sought the death toll caused by Covid-19 in 39 countries and obtained the registered deaths from specialized web pages. For every country in the study, we have analysed the correlation of the Covid-19 death numbers with its geographic latitude, and its associated climate conditions, such as the mean annual temperature, the average annual sunshine hours, and the average annual UV index. We also analyzed the correlation of the Covid-19 death numbers with epidemiologic conditions such as cancer score and Alzheimer score, and with demographic parameters such as birth rate, mortality rate, fertility rate, and the percentage of people aged 65 and above. In regard to consumption habits, we searched for a possible relation between alcohol intake levels per capita and the Covid-19 death numbers in each country. Correlation factors and determination factors, as well as analyses by simple linear regression and polynomial regression, were calculated or obtained by Microsoft Exell software (2016). Results : In the present study, higher numbers of deaths related to Covid-19 pandemic were registered in many countries in Europe and America compared to other countries in Africa and Asia. The analysis by polynomial regression generated an inverted bell-shaped curve and a significant correlation between the Covid-19 death numbers and the geographic latitude of each country in our study. Higher death numbers were registered in the higher geographic latitudes of both hemispheres, while lower scores of deaths were registered in countries located around the equator line. In a bell shaped curve, the latitude levels were negatively correlated to the average annual levels (last 10 years) of temperatures, sunshine hours, and UV index of each country, with the highest scores of each climate parameter being registered around the equator line, while lower levels of temperature, sunshine hours, and UV index were registered in higher latitude countries. In addition, the linear regression analysis showed that the Covid-19 death numbers registered in the 39 countries of our study were negatively correlated with the three climate factors of our study, with the temperature as the main negatively correlated factor with Covid-19 deaths. On the other hand, cancer and Alzheimer's disease scores, as well as advanced age and alcohol intake, were positively correlated to Covid-19 deaths, and inverted bell-shaped curves were obtained when expressing the above parameters against a country’s latitude. Instead, the (birth rate/mortality rate) ratio and fertility rate were negatively correlated to Covid-19 deaths, and their values gave bell-shaped curves when expressed against a country’s latitude. Conclusion : The results of the present study prove that the climate parameters and history of epidemiologic and demographic conditions as well as nutrition habits are very correlated with Covid-19 prevalence. The results of the present study prove that low levels of temperature, sunshine hours, and UV index, as well as negative epidemiologic and demographic conditions and high scores of alcohol intake may worsen Covid-19 prevalence in many countries of the northern hemisphere, and this phenomenon could explain their high Covid-19 death tolls. Keywords : Covid-19, Coronavirus, SARS-CoV-2, climate, temperature, sunshine hours, UV index, cancer, Alzheimer disease, alcohol.
A summary and comparison between different methods (including proposed...
plos.figshare.com
xls
Updated Jun 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Magdalyn E. Elkin; Xingquan Zhu (2023). A summary and comparison between different methods (including proposed research) for clinical trial study. [Dataset]. http://doi.org/10.1371/journal.pone.0253789.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0253789.t001
Dataset updated
Jun 10, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Magdalyn E. Elkin; Xingquan Zhu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A summary and comparison between different methods (including proposed research) for clinical trial study.
r
COVID-19 Health Related Data Classification
researchdata.edu.au
Updated 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ashad Kabir; Anik Das; Md Rakibul Hassan Chowdory; Mahathir Mohammad Bishal; Data Science and Engineering Research Unit (2021). COVID-19 Health Related Data Classification [Dataset]. https://researchdata.edu.au/covid-19-health-data-classification/3475650
Explore at:
Dataset updated
2021
Dataset provided by
Charles Sturt University
Cell Press
Authors
Ashad Kabir; Anik Das; Md Rakibul Hassan Chowdory; Mahathir Mohammad Bishal; Data Science and Engineering Research Unit
Description
We have used a publicly available dataset, COVID-19 Tweets Dataset, consisting of an extensive collection of 1,091,515,074 tweet IDs, and continuously expanding. The dataset was compiled by tracking over 90 distinct keywords and hashtags commonly associated with discussions about the COVID-19 pandemic. From this massive dataset, we focused on a specific time frame, encompassing data from August 05, 2020, to August 26, 2020, to meet our research objectives. As this dataset contains only tweet IDs, we have used the Twitter developer API to retrieve the corresponding tweets from Twitter. This retrieval process involved searching for tweet IDs and extracting the associated tweet texts, and it was implemented using the Twython library. In total, we successfully collected 21,890 tweets during this data extraction phase.

Following guidelines set by the CDC and WHO, we categorized tweets into five distinct classes for classification: health risks, prevention, symptoms, transmission, and treatment. Specifically, individuals aged over sixty, or those with pre-existing health conditions such as heart disease, lung problems, weakened immune systems, or diabetes, are at higher risk of severe COVID-19 complications. Therefore, tweets categorized as ‘health risks’ pertain to the elevated risks associated with COVID-19 due to age or specific health conditions. ‘Prevention’ related tweets encompass discussions on preventive and precautionary measures regarding the COVID-19 pandemic. Tweets discussing common COVID-19 symptoms, including cough, congestion, breathing issues, fever, body aches, and more, are classified as ‘symptoms’ related tweets. Conversations pertaining to the spread of COVID-19 between individuals, between animals and humans, and contact with virus-contaminated objects or surfaces are categorized as ‘transmission’ related tweets. Lastly, tweets indicating vaccine development and drugs used for COVID-19 treatment fall under the ‘treatment’ related category.

We determined specific keywords for each of the five classes (health risks, prevention, symptoms, transmission, and treatment) based on the definitions provided by the CDC and WHO on their official websites. These definitions, along with their associated keywords, are detailed in Table 1. For instance, the CDC and WHO indicate that individuals over the age of sixty with conditions like heart disease, lung problems, weak immune systems, or diabetes face a higher risk of severe COVID-19 complications. In accordance with this definition, we selected relevant keywords such as “lung disease”, “heart disease”, “diabetes”, “weak immunity”, and others to identify tweets related to health risks within the larger tweet dataset. This approach was consistently applied to define keywords for the remaining four classes. Subsequently, we filtered the initial dataset of 21,890 tweets to extract tweets relevant to our predefined classes, resulting in a total of 6,667 tweets based on the selected keywords.

To ensure the accuracy of our dataset, two separate annotators individually assigned the 6,667 tweets to the five classes. A third annotator, a natural language expert, meticulously cross-checked the dataset and provided necessary corrections. Subsequently, the two annotators resolved any discrepancies through mutual agreement, resulting in the final annotated dataset. Our dataset comprises a total of 6,667 data points categorized into five classes: 978, 2046, 1402, 802, and 1439 tweets annotated as ‘health risk’, ‘prevention’, ‘symptoms’, ‘transmission’, and ‘treatment’, respectively
i
COVID-19 tweets dataset for Spanish language
ieee-dataport.org
Updated Jun 30, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Avishek Garain (2020). COVID-19 tweets dataset for Spanish language [Dataset]. http://doi.org/10.21227/gcys-3z77
Explore at:
Unique identifier
https://doi.org/10.21227/gcys-3z77
Dataset updated
Jun 30, 2020
Dataset provided by
IEEE Dataport
Authors
Avishek Garain
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is very vast and contains Spanish tweets related to COVID-19. There are 18958 unique tweet-ids in the whole dataset that ranges from December 2019 till May 2020 . The keywords that have been used to crawl the tweets are 'corona', , 'covid ' , 'sarscov2 ', 'covid19', 'coronavirus '. For getting the other 33 fields of data drop a mail at "avishekgarain@gmail.com". Code snippet is given in Documentation file. Sharing Twitter data other than Tweet ids publicly violates Twitter regulation policies.
H
Twitter Conversations About The COVID-19 Omicron Variant: A Large Scale...
dataverse.harvard.edu
zenodo.org
Updated Jul 25, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nirmalya Thakur (2022). Twitter Conversations About The COVID-19 Omicron Variant: A Large Scale Dataset Of More Than 500,000 Tweets [Dataset]. http://doi.org/10.7910/DVN/SELYUR
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/SELYUR
Dataset updated
Jul 25, 2022
Dataset provided by
Harvard Dataverse
Authors
Nirmalya Thakur
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset: N. Thakur and C.Y. Han, “An Exploratory Study of Tweets about the SARS-CoV-2 Omicron Variant: Insights from Sentiment Analysis, Language Interpretation, Source Tracking, Type Classification, and Embedded URL Detection,” Journal of COVID, 2022, Volume 5, Issue 3, pp. 1026-1049 Abstract This dataset is one of the salient contributions of the above-mentioned paper. It presents a total of 522,886 Tweet IDs of the same number of Tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management. Data Description The Tweet IDs are presented in 7 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files. The data collection followed a keyword-based approach and tweets comprising the "omicron" keyword were filtered, collected, and added to this dataset. Filename: TweetIDs_November.txt (No. of Tweet IDs: 16471, Date Range of the Tweet IDs: November 24, 2021 to November 30, 2021) Filename: TweetIDs_December.txt (No. of Tweet IDs: 99288, Date Range of the Tweet IDs: December 1, 2021 to December 31, 2021) Filename: TweetIDs_January.txt (No. of Tweet IDs: 92860, Date Range of the Tweet IDs: January 1, 2022 to January 31, 2022) Filename: TweetIDs_February.txt (No. of Tweet IDs: 89080, Date Range of the Tweet IDs: February 1, 2022 to February 28, 2022) Filename: TweetIDs_March.txt (No. of Tweet IDs: 97844, Date Range of the Tweet IDs: March 1, 2022 to March 31, 2022) Filename: TweetIDs_April.txt (No. of Tweet IDs: 91587, Date Range of the Tweet IDs: April 1, 2022 to April 20, 2022) Filename: TweetIDs_May.txt (No. of Tweet IDs: 35756, Date Range of the Tweet IDs: May 1, 2022 to May 12, 2022) Here, the last date for May is May 12 as it was the most recent date at the time of data collection. The dataset would be updated soon to incorporate more recent tweets. The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. The Hydrator application (link to download the application: https://github.com/DocNow/hydrator/releases and link to a step-by-step tutorial: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweets) or any similar application may be used for hydrating this dataset.
i
Data from: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual...
ieee-dataport.org
Updated Jun 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Umair Qazi (2020). GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information [Dataset]. http://doi.org/10.21227/et8d-w881
Explore at:
Unique identifier
https://doi.org/10.21227/et8d-w881
Dataset updated
Jun 24, 2020
Dataset provided by
IEEE Dataport
Authors
Umair Qazi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract:We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset has been collected over a period of 90 days from February 1 to May 1, 2020 and consists of more than 524 million multilingual tweets. As the geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content to derive their geolocation information using the Nominatim (Open Street Maps) data at different geolocation granularity levels. In terms of geographical coverage, the dataset spans over 218 countries and 47K cities in the world. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.The dataset was collected using more than 800 multilingual keywords and hashtags. The complete list of keywords can be downloaded from here: https://crisisnlp.qcri.org/covid19 For more details, please refer to this paper: https://arxiv.org/abs/2005.11177Explore interesting trends in GeoCoV19 dataset using our new service: https://covid19-trends.qcri.org/

Facebook

Twitter

Click to copy link

Link copied

Cite

Statista (2022). Most searched communication keywords during COVID-19 outbreak in the UK 2020 [Dataset]. https://www.statista.com/statistics/1126907/communication-search-terms-during-covid-19-in-the-uk/

Most searched communication keywords during COVID-19 outbreak in the UK 2020

Explore at:

Dataset updated

Dec 2, 2022

Dataset authored and provided by

Statistahttp://statista.com/

Time period covered

Jan 2020 - Apr 2020

Area covered

United Kingdom

Description

According to data from Pi Datametrics, the most searched term relating to insurance on Google from January to April 2020 in the United Kingdom was "broadband speed test". "iphone xr" was the second most searched term, followed by "iphone".

Clear search

Close search

Google apps

Main menu

Most searched communication keywords during COVID-19 outbreak in the UK 2020...

Most searched keywords on google during COVID-19 outbreak in the UK 2020

INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

Dataset covidgilance signals

COVID-19 Twitter Dataset

Most searched E-learning keywords during COVID-19 outbreak in the UK 2020

Data from: A Large-Scale Dataset of Twitter Chatter About Online Learning...

Digital Narratives of Covid-19: a Twitter Dataset

Coronavirus (Covid-19) Data in the United States

Food and drink keyword growth during COVID-19 in the UK 2020

The top 30 keywords with respect to average TF-IDF score for cessation...

Coronavirus Twitter Data: A collection of COVID-19 tweets with automated...

TRACES Bulgarian Twitter Dataset on Covid-19 Annotated with Linguistic...

TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19...

Followers: Integer.

Friends: Integer.

Retweets: Integer.

Favorites: Integer.

Replication Data for: Two years of Covid-19 pandemic : A higher prevalence...

A summary and comparison between different methods (including proposed...

COVID-19 Health Related Data Classification

COVID-19 tweets dataset for Spanish language

Twitter Conversations About The COVID-19 Omicron Variant: A Large Scale...

Data from: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual...

Most searched communication keywords during COVID-19 outbreak in the UK 2020