Version 48 of the dataset. Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added data provided by our new collaborators covering January 27th to March 27th to provide extra longitudinal coverage.

Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions, and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.

The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (948,493,362 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (238,771,950 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. General per-day statistics are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files.
For more statistics and some visualizations, visit: http://www.panacealab.org/covid19/ More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, and in our pre-print about the dataset (https://arxiv.org/abs/2004.03688). As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions, which permit re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used. This dataset will be updated at least bi-weekly with additional tweets; check the GitHub repo for these updates. Release note: we have standardized the name of the resource to match our pre-print manuscript so that we do not have to update it every week.
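Since only identifiers are distributed, the IDs must be hydrated before use; dedicated tools such as twarc or Hydrator handle this end to end. Purely as an illustration, here is a minimal Python sketch of the batching step, assuming the classic lookup endpoint's limit of 100 IDs per request (the HTTP request itself is omitted because it requires API credentials):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batch_ids(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Group tweet IDs into batches of `size` (the classic
    statuses/lookup endpoint accepted up to 100 IDs per request)."""
    it = iter(ids)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Each batch would then be passed to a hydration tool or the Twitter API.
batches = list(batch_ids((str(i) for i in range(250)), size=100))
```

In practice one would stream the ID files line by line into `batch_ids` rather than materializing them in memory.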
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added data provided by our new collaborators covering January 27th to March 27th to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (152,920,832 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (30,990,645 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. General per-day statistics are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
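The frequent-terms, bigram, and trigram files can be reproduced from any hydrated subset of the tweets. A minimal sketch, assuming naive lowercase whitespace tokenization (the release does not specify the authors' actual tokenization):

```python
from collections import Counter
from typing import Iterable, List, Tuple

def top_ngrams(texts: Iterable[str], n: int = 2,
               k: int = 1000) -> List[Tuple[tuple, int]]:
    """Count the k most frequent word n-grams across tweet texts,
    using naive lowercase whitespace tokenization."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

tweets = ["stay home stay safe", "stay home please"]
print(top_ngrams(tweets, n=2, k=3))
```

Swapping `n=1` or `n=3` yields the term and trigram counts analogous to frequent_terms.csv and frequent_trigrams.csv.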
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions, which permit re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
This project aims to present a large dataset for researchers to discover public conversation on Twitter surrounding the COVID-19 pandemic. As strong concerns and emotions are expressed in the publicly available tweets, we annotated seventeen latent semantic attributes for each public tweet using natural language processing techniques and machine-learning based algorithms. The latent semantic attributes include: 1) ten attributes indicating the tweet’s relevance to ten detected topics, 2) five quantitative attributes indicating the degree of intensity in the valence (i.e., unpleasantness/pleasantness) and emotional intensities across four primary emotions of fear, anger, sadness and joy, and 3) two qualitative attributes indicating the sentiment category and the most dominant emotion category, respectively.
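As a rough illustration of the schema just described, the seventeen attributes per tweet could be represented as follows; the field names here are an assumption for illustration, not the release's actual column names:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class TweetAttributes:
    """Hypothetical per-tweet record: ten topic-relevance scores,
    five intensity scores, and two categorical labels (17 total)."""
    topic_relevance: Dict[str, float]  # 10 detected topics -> relevance
    valence: float                     # unpleasantness/pleasantness
    fear: float
    anger: float
    sadness: float
    joy: float
    sentiment_category: str            # qualitative sentiment label
    dominant_emotion: str              # most dominant emotion label

    def count_attributes(self) -> int:
        # 10 topic scores + 5 intensities + 2 categorical labels
        return len(self.topic_relevance) + 5 + 2
```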
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The COVID-19 pandemic, which began over a year ago, continues to spread around the globe, and research on COVID-19 continues to grow as well. The online discourse on social media regarding COVID-19 has been growing along with the timeline of the pandemic.
Open data on Twitter has been released, offering the research community the opportunity for new findings in response to this new threat. In this dataset, we open a corpus of Twitter data from March 2020 until today, updated every day, based on the two most important hashtags regarding COVID-19. This dataset offers the research community the opportunity to explore the social dimensions of this pandemic, including topic analysis, hate speech detection, and sentiment analysis, regarding the opinion of users on the pandemic, comments on the public discourse, or the vaccine releases. The dataset has been collected by retrieving all tweets that contain the hashtags #coronavirus and #COVID19, comprising approximately 208M tweets for the hashtag #coronavirus and 392M tweets for the hashtag #COVID19, for a total of 600M tweets.
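A minimal sketch of the per-hashtag counting described above, assuming simple text matching on hydrated tweet texts (the actual collection tracked the hashtags through the Twitter API):

```python
import re
from collections import Counter
from typing import Iterable

# The two hashtags this corpus tracks, matched case-insensitively.
TRACKED = {"#coronavirus", "#covid19"}

def count_tracked_hashtags(tweets: Iterable[str]) -> Counter:
    """Count how many tweets carry each tracked hashtag;
    a tweet with both hashtags is counted under both."""
    counts = Counter()
    for text in tweets:
        tags = {t.lower() for t in re.findall(r"#\w+", text)}
        counts.update(tags & TRACKED)
    return counts
```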
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
hashtag] relations from 190 countries and territories
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected and processed a dataset and make it available for the research community to study the COVID-19 pandemic from multiple angles.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
I am sharing a COVID-19 Twitter dataset with the research community containing a large number of tweets. I hope this dataset will enable the study of online conversation dynamics in the context of a global outbreak of unprecedented proportions and implications. I collected this dataset using TrackMyHashtag, an affordable platform. I hope researchers find it helpful. If you need more datasets, let me know.
In response to the Coronavirus disease (COVID-19) outbreak and the Transportation Research Board’s (TRB) urgent need for work related to transportation and pandemics, this paper contributes with a sense of urgency and provides a starting point for research on the topic. The main goal of this paper is to support transportation researchers and the TRB community during this COVID-19 pandemic by reviewing the performance of software models used for extracting large-scale data from Twitter streams related to COVID-19. The study extends the previous research efforts in social media data mining by providing a review of contemporary tools, including their computing maturity and their potential usefulness. The paper also includes an open repository for the processed data frames to facilitate the quick development of new transportation research studies. The output of this work is recommended to be used by the TRB community when deciding to further investigate topics related to COVID-19 and social media data mining tools.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we were filtering data collected for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 29th, yielding over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (70,569,368 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (13,535,912 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. General per-day statistics are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter for re-distributing Twitter data. They need to be hydrated to be used.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset presents a large-scale collection of millions of Twitter posts related to the coronavirus pandemic in the Spanish language. The collection was built by monitoring public posts written in Spanish containing a diverse set of hashtags related to COVID-19, as well as tweets shared by official Argentinian government offices, such as ministries and secretaries at different levels. Data was collected between March and October 2020 using the Twitter API, and will be periodically updated.
In addition to tweet IDs, the dataset includes information about mentions, retweets, media, URLs, hashtags, replies, users, and content-based user relations, allowing observation of the dynamics of the shared information. Data is presented in different tables that can be analysed separately or combined.
The dataset aims to serve as a source for studying the effects of the coronavirus on people through social media, including the impact of public policies, the perception of risk and related disease consequences, the adoption of guidelines, the emergence, dynamics, and propagation of disinformation and rumours, the formation of communities and other social phenomena, and the evolution of health-related indicators (such as fear, stress, sleep disorders, or changes in children's behaviour), among other possibilities. In this sense, the dataset can be useful for multi-disciplinary researchers in data science, social network analysis, social computing, medical informatics, the social sciences, and other fields.
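Since the tables can be combined, a minimal sketch of one such combination follows; the table layouts here (a tweets table and a hashtags table keyed by tweet ID) are hypothetical, as the release's actual file names and columns are not specified in this summary:

```python
from typing import Dict, List

def join_on_tweet_id(tweets: List[Dict], hashtags: List[Dict]) -> List[Dict]:
    """Attach the list of hashtags to each tweet record by tweet_id,
    a simple left join between two of the distributed tables."""
    by_id: Dict[int, List[str]] = {}
    for h in hashtags:
        by_id.setdefault(h["tweet_id"], []).append(h["hashtag"])
    return [dict(t, hashtags=by_id.get(t["tweet_id"], [])) for t in tweets]
```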
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets of the figures in the paper "Tracking the Twitter attention around the research effort on the COVID-19 pandemic".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
From 24 January to 31 July 2021, we collected data that anyone can view on Twitter by using the free Twitter API. Using the keywords “vaccine”, “vaccination”, “vaccinated”, “vaxxer”, “vaxxers”, “#CovidVaccine”, “covid denier”, “pfizer”, “moderna”, “astra” and “zeneca”, “sinopharm”, and “sputnik”, we collected 33K tweets published by popular Twitter accounts. For each tweet, the following variables were recorded: the author (user ID), the author's categorization (healthcare professional, news media source, or other account with thousands of followers), the date of publication (to the precision of seconds), the vaccine mentioned, the language, and the general sentiment of the tweet text on a scale from 1 to 5. For multilingual sentiment analysis, we used an open-source BERT model from Huggingface (https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment). When multiple vaccines are mentioned in a tweet, it is recorded in our data as multiple tweets, one for each vaccine. To uphold the privacy policy for publishing Twitter data, the tweet texts, as well as the original user identifiers of the tweets' authors, are not disclosed. Instead, we encoded the user information with random integers. To access the complete content of these tweets, researchers may utilize the Twitter search API by referencing the provided tweet identifiers.
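A hypothetical sketch of the per-vaccine expansion described above, where a tweet mentioning several vaccines becomes one record per vaccine; the vaccine keyword list here is illustrative, not the authors' exact matching rules:

```python
from typing import Dict, List

# Illustrative vaccine names only; the study's own matching rules
# (e.g. for "astra" and "zeneca") are not reproduced here.
VACCINES = ["pfizer", "moderna", "astrazeneca", "sinopharm", "sputnik"]

def expand_by_vaccine(tweet_id: int, text: str,
                      sentiment: int) -> List[Dict]:
    """Emit one record per vaccine mentioned in the tweet text,
    carrying the 1-5 sentiment score along with each record."""
    text_l = text.lower()
    mentioned = [v for v in VACCINES if v in text_l] or ["unspecified"]
    return [{"tweet_id": tweet_id, "vaccine": v, "sentiment": sentiment}
            for v in mentioned]

rows = expand_by_vaccine(1, "Got my Pfizer dose, Moderna next?", 4)
```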
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we were filtering data collected for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 22nd, yielding over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (40,823,816 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (7,479,940 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. General per-day statistics are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter for re-distributing Twitter data. They need to be hydrated to be used.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
no. 8
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2020
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of tweets by and about COVID-aware publics from the 'X' (Twitter) social media platform, collected by the author. The dataset consists of 344 textual tweets regarding COVID-related material practices gathered during the research period Jan 2023 - Sep 2024, though it also includes tweets created before this period. The textual data has been rewritten to fully anonymise the people who made the tweets, and identifiable contexts have been removed. In addition, all date/time metadata and hashtags, as well as any attached images, have been removed. Square brackets have been used for editorial edits to obfuscate entities or add context to tweets. The dataset consists of a structured comma-separated text file that can be read in any spreadsheet software to maximise accessibility. The research dataset was created with Open University HREC approval: HREC/4557/Nold
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2022
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the tweet ids of 354,903,485 tweets related to Coronavirus or COVID-19. They were collected between March 3, 2020 and December 3, 2020 from the Twitter API using Social Feed Manager. Please note that this is VERSION 9 of this data set; see the Versions tab below for all versions. Version 1 contains tweets from March 3, 2020 through March 19, 2020. Version 2 contains tweets from March 3, 2020 through March 31, 2020. Version 4 contains tweets from March 3, 2020 through April 16, 2020. Version 5 contains tweets from March 3 through May 1, 2020. Version 6 contains tweets from March 3 through May 27, 2020. Version 7 contains tweets from March 3 through June 9, 2020. Version 8 contains tweets from March 3 through July 27, 2020.

These tweets were collected using the POST statuses/filter method of the Twitter Stream API, using the track parameter with the following keywords: #Coronavirus, #Coronaoutbreak, #COVID19. Because of the size of the collection, the list of identifiers is split into 36 files of up to 10 million lines each, with a tweet identifier on each line. There is a covid19filter-README.txt file containing additional documentation on how the tweets were collected. Data from the first and last days of the collection do not represent complete days.

The GET statuses/lookup method supports retrieving the complete tweet for a tweet id (known as hydrating). Tools such as Twarc or Hydrator can be used to hydrate tweets. Per Twitter’s Developer Policy, tweet ids may be publicly shared for academic purposes; tweets may not. This dataset contains only tweet ids, not the actual tweets.

We intend to continue updating this dataset periodically, as the collection is ongoing. Please check the Versions tab below for new versions. Questions about this dataset can be sent to sfm@gwu.edu. George Washington University researchers should contact us for access to the tweets.
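The split into files of up to 10 million identifiers, one per line, can be reproduced with a simple chunked writer. A stdlib-only sketch, with the output file naming being an assumption rather than the release's actual naming scheme:

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, List

def write_id_chunks(ids: Iterable[str], out_dir: str,
                    lines_per_file: int = 10_000_000) -> List[Path]:
    """Split an iterator of tweet IDs into numbered files of up to
    `lines_per_file` lines each, one ID per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    it = iter(ids)
    written: List[Path] = []
    part = 1
    while True:
        chunk = list(islice(it, lines_per_file))
        if not chunk:
            break
        path = out / f"tweet_ids_{part:03d}.txt"  # hypothetical naming
        path.write_text("\n".join(chunk) + "\n")
        written.append(path)
        part += 1
    return written
```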
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Social behavior has a fundamental impact on the dynamics of infectious diseases (such as COVID-19), challenging public health mitigation strategies and possibly the political consensus. The widespread use of traditional and social media on the Internet provides us with an invaluable source of information on societal dynamics during pandemics. With this dataset, we aim to understand the mechanisms of COVID-19 epidemic-related social behavior in Poland, deploying methods of computational social science and digital epidemiology. We collected and analyzed COVID-19 perception on the Polish-language Internet during 15.01-31.07 (06.08) and labeled data quantitatively (Twitter, YouTube, articles) and qualitatively (Facebook, articles, and article comments) using an infodemiological approach.
- manually labelled the 1,000 most popular tweets (twits_annotated.xlsx) with the categories is_fake (categorical and numeric), topic, and sentiment;
- extracted 57,306 representative articles in Polish (articles_till_06_08.zip) using the EventRegistry.org tool, matching the topic "Coronavirus" in the article body;
- extracted 1,015,199 tweets (tweets_till_31_07_users.zip and tweets_till_31_07_text.zip) from #Koronawirus in Polish using the Twitter API;
- collected 1,574 videos (youtube_comments_till_31_07.zip and youtube_movie.csv) with the keyword Koronawirus on YouTube, and 247,575 comments on them, using the Google API.
We supplemented the media observations with an analysis of 244 empirical social studies on COVID-19 in Poland conducted up to 25.05 (empirical_social_studies.csv).
Reports, analyses, and coding books can be found in Polish at: http://www.infodemia-koronawirusa.pl
Main report (in Polish) https://depot.ceon.pl/handle/123456789/19215
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset gives a cursory glimpse at the overall sentiment trend of the public discourse regarding the COVID-19 pandemic on Twitter. The live scatter plot of this dataset is available as The Overall Trend block at https://live.rlamsal.com.np. The trend graph reveals multiple peaks and drops that need further analysis. The n-grams during those peaks and drops can prove beneficial for better understanding the discourse.