License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo
Version 67 of the dataset.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (1,116,738,914 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (283,307,680 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, along with our pre-print about the dataset: https://arxiv.org/abs/2004.03688
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
**Link to access the dataset** - https://doi.org/10.5281/zenodo.3723939
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 30th, which yielded over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to February 27th to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (101,400,452 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (20,244,746 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
Version 162 of the dataset. NOTES: Data for 3/15 - 3/18 was not extracted due to unexpected and unannounced downtime of our university infrastructure. We will try to backfill those days by the next release. FUTURE CHANGES: Due to the imminent paywalling of Twitter's API access, this might be the last full update of this dataset. If the API access is not blocked, we will stop updating this dataset with release 165, a bit more than 3 years after our initial release. It's been a joy seeing all the work that uses this resource, and we are glad that so many have found it useful.
The dataset files full_dataset.tsv.gz and full_dataset_clean.tsv.gz have been split into 1 GB parts using the Linux utility split, so make sure to join the parts before unzipping. We had to make this change because we had huge issues uploading files larger than 2 GB (hence the delay in the dataset releases). The peer-reviewed publication for this dataset has now been published in Epidemiologia, an MDPI journal, and can be accessed here: https://doi.org/10.3390/epidemiologia2030024. Please cite this when using the dataset.
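For convenience, a minimal Python sketch of how the split parts could be re-joined and then streamed; the part-file naming pattern below is an assumption, so check the actual file names in the release.

import glob
import gzip
import shutil

# Re-join the split parts in order before decompressing (assumed naming pattern).
parts = sorted(glob.glob("full_dataset.tsv.gz.part*"))
with open("full_dataset.tsv.gz", "wb") as joined:
    for part in parts:
        with open(part, "rb") as chunk:
            shutil.copyfileobj(chunk, joined)

# Stream the re-joined archive without decompressing it fully on disk.
with gzip.open("full_dataset.tsv.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.rstrip("\n").split("\t"))
        if i >= 4:  # peek at the first few rows only
            break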
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (1,395,222,801 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (361,748,721 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, along with our pre-print about the dataset: https://arxiv.org/abs/2004.03688
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 22nd, which yielded over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (40,823,816 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (7,479,940 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated to be used.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
I streamed live tweets from Twitter after the WHO declared Covid-19 a pandemic. Since the Covid-19 epidemic has affected the entire world, I collected worldwide Covid-19 related English tweets at a rate of almost 10k per day in three phases: April-June 2020, August-October 2020, and April-June 2021. I prepared the first phase dataset of about 235k tweets collected from 19th April to 20th June 2020. After one month I started collecting tweets from Twitter again, as at that time the pandemic was spreading with fatal intensity. I collected almost 320k tweets in the period August 20 to October 20, 2020, for the second phase dataset. Finally, after six months I collected almost 489k tweets in the period 26th April to 27th June 2021 for the third phase dataset.
The datasets I developed contain important information about most of the tweets and their attributes. The main attributes of these datasets are:
- Tweet ID
- Creation Date & Time
- Source Link
- Original Tweet
- Favorite Count
- Retweet Count
- Original Author
- Hashtags
- User Mentions
- Place
Finally, I collected 235,240, 320,316, and 489,269 tweets for the first, second, and third phase datasets, containing hash-tagged keywords such as #covid-19, #coronavirus, #covid, #covaccine, #lockdown, #homequarantine, #quarantinecenter, #socialdistancing, #stayhome, #staysafe, etc. Here I present an overview of the collected dataset.
I pre-processed the collected data by developing a user-defined pre-processing function based on NLTK (Natural Language Toolkit, a Python library for NLP). At the initial stage, it converts all the tweets into lowercase. Then it removes all extra white spaces, numbers, special characters, non-ASCII characters, URLs, punctuation, and stopwords from the tweets. Then it converts all ‘covid’ words into ‘covid19’, as all numbers have already been removed from the tweets. Using stemming, the pre-processing function reduces inflected words to their word stems.
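A minimal sketch of this kind of NLTK-based pre-processing (not the author's exact function; the regular expressions and stopword handling are illustrative assumptions):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(tweet):
    text = tweet.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)          # remove numbers, punctuation, non-ASCII characters
    text = text.replace("covid", "covid19")        # normalise 'covid' (digits were already removed)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(stemmer.stem(t) for t in tokens)

print(preprocess("Stay home & stay safe!! https://example.org #COVID-19"))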
I calculated the sentiment polarity of each cleaned and pre-processed tweet using the NLTK-based Sentiment Analyzer and obtained the sentiment scores for the positive, negative, and neutral categories to calculate the compound sentiment score for each tweet. I classified the tweets on the basis of the compound sentiment scores into three different classes, i.e., Positive, Negative, and Neutral. Then I assigned the sentiment polarity ratings for each tweet based on the following algorithm:
Algorithm: Sentiment Classification of Tweets (compound, sentiment)
1. for each tweet in the dataset:
2.     if tweet[compound] < 0:
3.         tweet[sentiment] = 0.0  # assigned 0.0 for Negative Tweets
4.     elif tweet[compound] > 0:
5.         tweet[sentiment] = 1.0  # assigned 1.0 for Positive Tweets
6.     else:
7.         tweet[sentiment] = 0.5  # assigned 0.5 for Neutral Tweets
8. end
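A runnable sketch of the same classification using NLTK's VADER sentiment analyzer; using VADER specifically for the compound score is an assumption about the "NLTK-based Sentiment Analyzer" mentioned above, while the thresholds follow the algorithm as given:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def classify(text):
    compound = sia.polarity_scores(text)["compound"]
    if compound < 0:
        return 0.0  # Negative
    elif compound > 0:
        return 1.0  # Positive
    return 0.5      # Neutral

print(classify("we will beat covid19 together, stay safe"))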
I wouldn't be here without the help of my project guide Dr. Anup Kumar Kolya, Assistant Professor, Dept. of Computer Science and Engineering, RCCIIT, whose kind and valuable suggestions and excellent guidance gave me the best opportunity in preparing these datasets.
These datasets are part of the publications entitled:
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes CSV files that contain IDs and sentiment scores of the tweets related to the COVID-19 pandemic. The tweets have been collected by an on-going project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used while referencing the pandemic. This dataset has been wholly re-designed on March 20, 2020, to comply with the content redistribution policy set by Twitter.The paper associated with this dataset is available here: Design and analysis of a large-scale COVID-19 tweets dataset-------------------------------------Related datasets:(a) Tweets Originating from India During COVID-19 Lockdowns(b) Coronavirus (COVID-19) Tweets Sentiment Trend (Global)-------------------------------------Below is the quick overview of this dataset.— Dataset name: COV19Tweets Dataset— Number of tweets : 857,809,018 tweets— Coverage : Global— Language : English (EN)— Dataset usage terms : By using this dataset, you agree to (i) use the content of this dataset and the data generated from the content of this dataset for non-commercial research only, (ii) remain in compliance with Twitter's Developer Policy and (iii) cite the following paper:Lamsal, R. Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence (2020). https://doi.org/10.1007/s10489-020-02029-z— Geo-tagged Version: Coronavirus (COVID-19) Geo-tagged Tweets Dataset (GeoCOV19Tweets Dataset)— Dataset updates : Everyday— Active keywords and hashtags (archive: keywords.tsv) : "corona", "#corona", "coronavirus", "#coronavirus", "covid", "#covid", "covid19", "#covid19", "covid-19", "#covid-19", "sarscov2", "#sarscov2", "sars cov2", "sars cov 2", "covid_19", "#covid_19", "#ncov", "ncov", "#ncov2019", "ncov2019", "2019-ncov", "#2019-ncov", "pandemic", "#pandemic" "#2019ncov", "2019ncov", "quarantine", "#quarantine", "flatten the curve", "flattening the curve", "#flatteningthecurve", "#flattenthecurve", "hand sanitizer", "#handsanitizer", "#lockdown", "lockdown", "social distancing", "#socialdistancing", "work from home", "#workfromhome", "working from home", "#workingfromhome", "ppe", "n95", "#ppe", "#n95", "#covidiots", "covidiots", "herd immunity", "#herdimmunity", "pneumonia", "#pneumonia", "chinese virus", "#chinesevirus", "wuhan virus", "#wuhanvirus", "kung flu", "#kungflu", "wearamask", "#wearamask", "wear a mask", "vaccine", "vaccines", "#vaccine", "#vaccines", "corona vaccine", "corona vaccines", "#coronavaccine", "#coronavaccines", "face shield", "#faceshield", "face shields", "#faceshields", "health worker", "#healthworker", "health workers", "#healthworkers", "#stayhomestaysafe", "#coronaupdate", "#frontlineheroes", "#coronawarriors", "#homeschool", "#homeschooling", "#hometasking", "#masks4all", "#wfh", "wash ur hands", "wash your hands", "#washurhands", "#washyourhands", "#stayathome", "#stayhome", "#selfisolating", "self isolating"Dataset Files (the local time mentioned below is GMT+5:45)corona_tweets_01.csv + corona_tweets_02.csv + corona_tweets_03.csv: 2,475,980 tweets (March 20, 2020 01:37 AM - March 21, 2020 09:25 AM)corona_tweets_04.csv: 1,233,340 tweets (March 21, 2020 09:27 AM - March 22, 2020 07:46 AM)corona_tweets_05.csv: 1,782,157 tweets (March 22, 2020 07:50 AM - March 23, 2020 09:08 AM)corona_tweets_06.csv: 1,771,295 tweets (March 23, 2020 09:11 AM - March 24, 2020 11:35 AM)corona_tweets_07.csv: 1,479,651 tweets (March 24, 2020 11:42 AM - March 25, 2020 11:43 
AM)corona_tweets_08.csv: 1,272,592 tweets (March 25, 2020 11:47 AM - March 26, 2020 12:46 PM)corona_tweets_09.csv: 1,091,429 tweets (March 26, 2020 12:51 PM - March 27, 2020 11:53 AM)corona_tweets_10.csv: 1,172,013 tweets (March 27, 2020 11:56 AM - March 28, 2020 01:59 PM)corona_tweets_11.csv: 1,141,210 tweets (March 28, 2020 02:03 PM - March 29, 2020 04:01 PM)corona_tweets_12.csv: 793,417 tweets (March 30, 2020 02:01 PM - March 31, 2020 10:16 AM)corona_tweets_13.csv: 1,029,294 tweets (March 31, 2020 10:20 AM - April 01, 2020 10:59 AM)corona_tweets_14.csv: 920,076 tweets (April 01, 2020 11:02 AM - April 02, 2020 12:19 PM)corona_tweets_15.csv: 826,271 tweets (April 02, 2020 12:21 PM - April 03, 2020 02:38 PM)corona_tweets_16.csv: 612,512 tweets (April 03, 2020 02:40 PM - April 04, 2020 11:54 AM)corona_tweets_17.csv: 685,560 tweets (April 04, 2020 11:56 AM - April 05, 2020 12:54 PM)corona_tweets_18.csv: 717,301 tweets (April 05, 2020 12:56 PM - April 06, 2020 10:57 AM)corona_tweets_19.csv: 722,921 tweets (April 06, 2020 10:58 AM - April 07, 2020 12:28 PM)corona_tweets_20.csv: 554,012 tweets (April 07, 2020 12:29 PM - April 08, 2020 12:34 PM)corona_tweets_21.csv: 589,679 tweets (April 08, 2020 12:37 PM - April 09, 2020 12:18 PM)corona_tweets_22.csv: 517,718 tweets (April 09, 2020 12:20 PM - April 10, 2020 09:20 AM)corona_tweets_23.csv: 601,199 tweets (April 10, 2020 09:22 AM - April 11, 2020 10:22 AM)corona_tweets_24.csv: 497,655 tweets (April 11, 2020 10:24 AM - April 12, 2020 10:53 AM)corona_tweets_25.csv: 477,182 tweets (April 12, 2020 10:57 AM - April 13, 2020 11:43 AM)corona_tweets_26.csv: 288,277 tweets (April 13, 2020 11:46 AM - April 14, 2020 12:49 AM)corona_tweets_27.csv: 515,739 tweets (April 14, 2020 11:09 AM - April 15, 2020 12:38 PM)corona_tweets_28.csv: 427,088 tweets (April 15, 2020 12:40 PM - April 16, 2020 10:03 AM)corona_tweets_29.csv: 433,368 tweets (April 16, 2020 10:04 AM - April 17, 2020 10:38 AM)corona_tweets_30.csv: 392,847 tweets (April 17, 2020 10:40 AM - April 18, 2020 10:17 AM)> With the addition of some more coronavirus specific keywords, the number of tweets captured day has increased significantly, therefore, the CSV files hereafter will be zipped. 
Lets save some bandwidth.corona_tweets_31.csv: 2,671,818 tweets (April 18, 2020 10:19 AM - April 19, 2020 09:34 AM)corona_tweets_32.csv: 2,393,006 tweets (April 19, 2020 09:43 AM - April 20, 2020 10:45 AM)corona_tweets_33.csv: 2,227,579 tweets (April 20, 2020 10:56 AM - April 21, 2020 10:47 AM)corona_tweets_34.csv: 2,211,689 tweets (April 21, 2020 10:54 AM - April 22, 2020 10:33 AM)corona_tweets_35.csv: 2,265,189 tweets (April 22, 2020 10:45 AM - April 23, 2020 10:49 AM)corona_tweets_36.csv: 2,201,138 tweets (April 23, 2020 11:08 AM - April 24, 2020 10:39 AM)corona_tweets_37.csv: 2,338,713 tweets (April 24, 2020 10:51 AM - April 25, 2020 11:50 AM)corona_tweets_38.csv: 1,981,835 tweets (April 25, 2020 12:20 PM - April 26, 2020 09:13 AM)corona_tweets_39.csv: 2,348,827 tweets (April 26, 2020 09:16 AM - April 27, 2020 10:21 AM)corona_tweets_40.csv: 2,212,216 tweets (April 27, 2020 10:33 AM - April 28, 2020 10:09 AM)corona_tweets_41.csv: 2,118,853 tweets (April 28, 2020 10:20 AM - April 29, 2020 08:48 AM)corona_tweets_42.csv: 2,390,703 tweets (April 29, 2020 09:09 AM - April 30, 2020 10:33 AM)corona_tweets_43.csv: 2,184,439 tweets (April 30, 2020 10:53 AM - May 01, 2020 10:18 AM)corona_tweets_44.csv: 2,223,013 tweets (May 01, 2020 10:23 AM - May 02, 2020 09:54 AM)corona_tweets_45.csv: 2,216,553 tweets (May 02, 2020 10:18 AM - May 03, 2020 09:57 AM)corona_tweets_46.csv: 2,266,373 tweets (May 03, 2020 10:09 AM - May 04, 2020 10:17 AM)corona_tweets_47.csv: 2,227,489 tweets (May 04, 2020 10:32 AM - May 05, 2020 10:17 AM)corona_tweets_48.csv: 2,218,774 tweets (May 05, 2020 10:38 AM - May 06, 2020 10:26 AM)corona_tweets_49.csv: 2,164,251 tweets (May 06, 2020 10:35 AM - May 07, 2020 09:33 AM)corona_tweets_50.csv: 2,203,686 tweets (May 07, 2020 09:55 AM - May 08, 2020 09:35 AM)corona_tweets_51.csv: 2,250,019 tweets (May 08, 2020 09:39 AM - May 09, 2020 09:49 AM)corona_tweets_52.csv: 2,273,705 tweets (May 09, 2020 09:55 AM - May 10, 2020 10:11 AM)corona_tweets_53.csv: 2,208,264 tweets (May 10, 2020 10:23 AM - May 11, 2020 09:57 AM)corona_tweets_54.csv: 2,216,845 tweets (May 11, 2020 10:08 AM - May 12, 2020 09:52 AM)corona_tweets_55.csv: 2,264,472 tweets (May 12, 2020 09:59 AM - May 13, 2020 10:14 AM)corona_tweets_56.csv: 2,339,709 tweets (May 13, 2020 10:24 AM - May 14, 2020 11:21 AM)corona_tweets_57.csv: 2,096,878 tweets (May 14, 2020 11:38 AM - May 15, 2020 09:58 AM)corona_tweets_58.csv: 2,214,205 tweets (May 15, 2020 10:13 AM - May 16, 2020 09:43 AM)> The server and the databases have been optimized; therefore, there is a significant rise in the number of tweets captured per day.corona_tweets_59.csv: 3,389,090 tweets (May 16, 2020 09:58 AM - May 17, 2020 10:34 AM)corona_tweets_60.csv: 3,530,933 tweets (May 17, 2020 10:36 AM - May 18, 2020 10:07 AM)corona_tweets_61.csv: 3,899,631 tweets (May 18, 2020 10:08 AM - May 19, 2020 10:07 AM)corona_tweets_62.csv: 3,767,009 tweets (May 19, 2020 10:08 AM - May 20, 2020 10:06 AM)corona_tweets_63.csv: 3,790,455 tweets (May 20, 2020 10:06 AM - May 21, 2020 10:15 AM)corona_tweets_64.csv: 3,582,020 tweets (May 21, 2020 10:16 AM - May 22, 2020 10:13 AM)corona_tweets_65.csv: 3,461,470 tweets (May 22, 2020 10:14 AM - May 23, 2020 10:08 AM)corona_tweets_66.csv: 3,477,564 tweets (May 23, 2020 10:08 AM - May 24, 2020 10:02 AM)corona_tweets_67.csv: 3,656,446 tweets (May 24, 2020 10:02 AM - May 25, 2020 10:10 AM)corona_tweets_68.csv: 3,474,952 tweets (May 25, 2020 10:11 AM - May 26, 2020 10:22 AM)corona_tweets_69.csv: 3,422,960 tweets (May 26, 2020 10:22 AM - May 27, 2020 
10:16 AM)corona_tweets_70.csv: 3,480,999 tweets (May 27, 2020 10:17 AM - May 28, 2020 10:35 AM)corona_tweets_71.csv: 3,446,008 tweets (May 28, 2020 10:36 AM - May 29, 2020 10:07 AM)corona_tweets_72.csv: 3,492,841 tweets (May 29, 2020 10:07 AM - May 30, 2020 10:14 AM)corona_tweets_73.csv: 3,098,817 tweets (May 30, 2020 10:15 AM - May 31, 2020 10:13 AM)corona_tweets_74.csv: 3,234,848 tweets (May 31, 2020 10:13 AM - June 01, 2020 10:14 AM)corona_tweets_75.csv: 3,206,132 tweets (June 01, 2020 10:15 AM - June 02, 2020 10:07 AM)corona_tweets_76.csv: 3,206,417 tweets (June 02, 2020 10:08 AM - June 03, 2020 10:26 AM)corona_tweets_77.csv: 3,256,225 tweets (June 03, 2020
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Tweets by Elon Musk are very popular. He is currently one of the most followed users on Twitter, with >100M followers. He is also constantly tweeting, so the content generated is interesting.
This dataset is collected daily using snscrape. The source of the dataset is public tweets by Elon Musk.
The following columns are included:
You can use this dataset (daily updated) to test your skills with NLP tools and techniques. If you find this helpful, drop a like :)
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full dataset, and a cleaned version with no retweets. There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms, the top 1000 bigrams, and the top 1000 trigrams. Some general statistics per day are included for both datasets. We will continue to update the dataset every two days here and weekly in Zenodo. For more information on processing and visualizations please visit: www.panacealab.org/covid19
Data usage terms: https://www.gesis.org/en/institute/data-usage-terms
At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.
TwitterAuthor: Víctor Yeste. Universitat Politècnica de Valencia.The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of the content published in online media and the possible prediction of the selected success variables.In this case, due to the need to integrate data from two separate areas, such as web publishing and the analysis of their shares and related topics on Twitter, has opted for programming as you access both the Google Analytics v4 reporting API and Twitter Standard API, always respecting the limits of these.The website analyzed is hellofriki.com. It is an online media whose primary intention is to solve the need for information on some topics that provide daily a vast number of news in the form of news, as well as the possibility of analysis, reports, interviews, and many other information formats. All these contents are under the scope of the sections of cinema, series, video games, literature, and comics.This dataset has contributed to the elaboration of the PhD Thesis:Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009Data have been obtained from each last-minute news article published online according to the indicators described in the doctoral thesis. All related data are stored in a database, divided into the following tables:tesis_followers: User ID list of media account followers.tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.status_id: Tweet IDcreated_at: date of publicationtext: content of the tweetpath: URL extracted after processing the shortened URL in textpost_shared: Article ID in WordPress that is being sharedretweet_count: number of retweetsfavorite_count: number of favoritestesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web. Other typologies, automatic Facebook shares, custom tweets without link to an article, etc. With the same fields as tesis_hometimeline.tesis_posts: data of articles published by the web and processed for some analysis.stats_id: Analysis IDpost_id: Article ID in WordPresspost_date: article publication date in WordPresspost_title: title of the articlepath: URL of the article in the middle webtags: Tags ID or WordPress tags related to the articleuniquepageviews: unique page viewsentrancerate: input ratioavgtimeonpage: average visit timeexitrate: output ratiopageviewspersession: page views per sessionadsense_adunitsviewed: number of ads viewed by usersadsense_viewableimpressionpercent: ad display ratioadsense_ctr: ad click ratioadsense_ecpm: estimated ad revenue per 1000 page viewstesis_stats: data from a particular analysis, performed at each published breaking news item. 
Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.id: ID of the analysisphase: phase of the thesis in which analysis has been carried out (right now all are 1)time: "0" if at the time of publication, "1" if 14 days laterstart_date: date and time of measurement on the day of publicationend_date: date and time when the measurement is made 14 days latermain_post_id: ID of the published article to be analysedmain_post_theme: Main section of the published article to analyzesuperheroes_theme: "1" if about superheroes, "0" if nottrailer_theme: "1" if trailer, "0" if notname: empty field, possibility to add a custom name manuallynotes: empty field, possibility to add personalized notes manually, as if some tag has been removed manually for being considered too generic, despite the fact that the editor put itnum_articles: number of articles analysednum_articles_with_traffic: number of articles analysed with traffic (which will be taken into account for traffic analysis)num_articles_with_tw_data: number of articles with data from when they were shared on the media’s Twitter accountnum_terms: number of terms analyzeduniquepageviews_total: total page viewsuniquepageviews_mean: average page viewsentrancerate_mean: average input ratioavgtimeonpage_mean: average duration of visitsexitrate_mean: average output ratiopageviewspersession_mean: average page views per sessiontotal: total of ads viewedadsense_adunitsviewed_mean: average of ads viewedadsense_viewableimpressionpercent_mean: average ad display ratioadsense_ctr_mean: average ad click ratioadsense_ecpm_mean: estimated ad revenue per 1000 page viewsTotal: total incomeretweet_count_mean: average incomefavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesterms_ini_num_tweets: total tweets on the terms on the day of publicationterms_ini_retweet_count_total: total retweets on the terms on the day of publicationterms_ini_retweet_count_mean: average retweets on the terms on the day of publicationterms_ini_favorite_count_total: total of favorites on the terms on the day of publicationterms_ini_favorite_count_mean: average of favorites on the terms on the day of publicationterms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publicationterms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms on the day of publicationterms_end_num_tweets: total tweets on terms 14 days after publicationterms_ini_retweet_count_total: total retweets on terms 14 days after publicationterms_ini_retweet_count_mean: average retweets on terms 14 days after publicationterms_ini_favorite_count_total: total bookmarks on terms 14 days after publicationterms_ini_favorite_count_mean: average of favorites on terms 14 days after publicationterms_ini_followers_talking_rate: ratio of media Twitter account followers who have recently posted a tweet talking about the terms 14 days after publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the 
terms 14 days after publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publicationterms_ini_user_age_mean: the average age in days of users who have spoken of the terms 14 days after publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms 14 days after publication.tesis_terms: data of the terms (tags) related to the processed articles.stats_id: Analysis IDtime: "0" if at the time of publication, "1" if 14 days laterterm_id: Term ID (tag) in WordPressname: Name of the termslug: URL of the termnum_tweets: number of tweetsretweet_count_total: total retweetsretweet_count_mean: average retweetsfavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesfollowers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the termuser_num_followers_mean: average followers of users who were talking about the termuser_num_tweets_mean: average number of tweets published by users who were talking about the termuser_age_mean: average age in days of users who were talking about the termurl_inclusion_rate: URL inclusion ratio
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the tweet ids of approximately 280 million tweets related to the 2016 United States presidential election. They were collected between July 13, 2016 and November 10, 2016 from the Twitter API using Social Feed Manager. These tweet ids are broken up into 12 collections. Each collection was collected either from the GET statuses/user_timeline method of the Twitter REST API or the POST statuses/filter method of the Twitter Stream API. The collections are: Candidates and key election hashtags (Twitter filter): election-filter[1-6].txt Democratic candidates (Twitter user timeline): democratic-candidate-timelines.txt Democratic Convention (Twitter filter): democratic-convention-filter.txt Democratic Party (Twitter user timeline): democratic-party-timelines.txt Election Day (Twitter filter): election-day.txt First presidential debate (Twitter filter): first-debate.txt GOP Convention (Twitter filter): republican-convention-filter.txt Republican candidates (Twitter user timeline): republican-candidate-timelines.txt Republican Party (Twitter user timeline): republican-party-timelines.txt Second presidential debate (Twitter filter): second-debate.txt Third presidential debate (Twitter filter): third-debate.txt Vice Presidential debate (Twitter filter): vp-debate.txt There is also a README.txt file for each collection containing additional documentation on how it was collected. The GET statuses/lookup method supports retrieving the complete tweet for a tweet id (known as hydrating). Tools such as Twarc or Hydrator can be used to hydrate tweets. When hydrating be aware that: Twitter limits hydration to 900 requests of 100 tweet ids per 15 minute window per set of user credentials. This works out to 8,640,000 tweets per day, so hydrating this entire dataset will take 32 days. The Twitter API will not return tweets that have been deleted or belong to accounts that have been suspended, deleted, or made private. You should expect a large number of these tweets to be unavailable. There may be duplicate tweets across collections. Also, according to the Twitter documentation, duplicate tweets are possible for tweets collected from the Twitter filter stream. For tweets collected from the Twitter filter stream, this is not a complete set of tweets that match the filter. Gaps may exist because: Twitter limits the number of tweets returned by the filter at any point in time. Social Feed Manager stops and starts the Twitter filter stream every 30 minutes. In Social Feed Manager, collecting is turned off while a user is making changes to the collection criteria. There were some operational issues, e.g., network interruptions, during the collection period. Since some of the terms used to collect from the Twitter filter stream were broad (e.g., “election”), it may contain tweets from elections other than the U.S. presidential election, including state elections, local elections, or elections in other countries. Per Twitter’s Developer Policy, tweet ids may be publicly shared; tweets may not. Questions about this dataset can be sent to sfm@gwu.edu. George Washington University researchers should contact us for access to the tweets. This work is supported by grant #NARDI-14-50017-14 from the National Historical Publications and Records Commission.
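For reference, a minimal hydration sketch using the twarc (v1) Python library mentioned above; the credentials are placeholders, and writing line-delimited JSON is just one convenient way to store the hydrated tweets.

import json
from twarc import Twarc

# Placeholders: supply your own Twitter API credentials.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

with open("election-day.txt") as ids, open("election-day.jsonl", "w") as out:
    # hydrate() batches the IDs (100 per request) and respects Twitter's rate limits.
    for tweet in t.hydrate(ids):
        out.write(json.dumps(tweet) + "\n")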
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a database of tweets from identified Australian accounts, now managed by the QUT Digital Observatory. The database includes 3.7 million Australian Twitter accounts, with 1.8 billion tweets captured to date. Since the beginning of 2019, there have been about 800,000 new tweets per day from 100,000 daily active users. The 100 days of COVID-19 in the Australian Twittersphere dataset consists of 2.8 million tweet IDs corresponding to tweets from the Australian Twittersphere that mention the COVID-19 pandemic, either through coronavirus-specific hashtags or keywords. The tweets were created on or after 20 January 2020 and up until 3 May 2020 (the 15 weeks that form the first ‘100 days’ of COVID-19 in Australia). This dataset provides a glimpse of the experiences and attitudes of Australians presently living through this global pandemic. We are all in this together, and as such this dataset has been released as rapidly as possible to enable use by the broader research community.
The SQL used to extract the tweets from the Australian Twittersphere database is as follows:
SELECT tweet_id
FROM oz_twitter.tweet
WHERE created_at >= '2020-01-19 14:00:00' -- corresponds to >= '2020-01-20 00:00:00' in Brisbane time
AND created_at < '2020-05-03 14:00:00' -- corresponds to < '2020-05-04 00:00:00' in Brisbane time
AND multiMatchAny(lower(text), ['covid', 'corona', 'flattenthecurve', 'socialdistancing', 'stayhome', 'lockdown', 'wuhan', 'pandemic'])
AND notEmpty(hashtags) = 1;
Access to the Australian Twittersphere database is managed by the QUT Digital Observatory.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the manuscript "Scaling laws and dynamics of hashtags on Twitter".
The Twitter data was obtained from a sample of 10% of all public tweets, provided by the Twitter streaming application programming interface. We extracted the hashtags from each tweet and counted how many times they were used in different time intervals. Time intervals of three different lengths were used: days, hours, and minutes. The tweets were published between November 1st 2015 and November 30th 2016, but not all time intervals between these dates are available.
The four files in this dataset each correspond to one folder (packed using tar). Each folder contains compressed .csv files (compressed using gzip). The contents of the .csv files in each folder are:
hashtags_frequency_day.tar Counts of hashtags in each day. The name of each file in the folder indicates the date (GMT). The entries in each file are the hashtag and the count in the interval.
hashtags_frequency_hour.tar Counts of hashtags in each hour. The name of each file in the folder indicates the date (GMT). The entries in each file are the hashtag and the count in the interval.
hashtags_frequency_minutes.tar Counts of hashtags in each minute. The name of each file in the folder indicates the date (GMT, only a fraction of all days is available). The entries in each file are the hashtag and the count in the interval.
number_of_tweets.tar Counts of the number of tweets in each minute. The name of each file in the folder indicates the day. The entries in each file are the minute in the day (GMT) and count of tweets in our dataset.
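A minimal sketch of reading one of the daily count files in Python; the member names inside the tar and the column order (hashtag, count) are assumptions based on the description above.

import csv
import gzip
import io
import tarfile

counts = {}
with tarfile.open("hashtags_frequency_day.tar") as tar:
    member = next(m for m in tar.getmembers() if m.isfile())  # one gzipped CSV per day (assumed)
    with gzip.open(io.BytesIO(tar.extractfile(member).read()), "rt", encoding="utf-8") as f:
        for row in csv.reader(f):
            hashtag, count = row[0], int(row[1])  # assumed column order
            counts[hashtag] = count

print(member.name, len(counts), "hashtags")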
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science of the University of Oviedo, for the sole purpose of non-commercial research, and it includes only tweet IDs.
The dataset contains tweet IDs for all the published tweets (in any language) between March 21, 2006 and July 31, 2009, thus comprising the first whole three years of Twitter from its creation, that is, about 1.5 billion tweets (see the file Twitter-historical-20060321-20090731.zip).
It covers several defining issues in Twitter, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US Presidential Elections, Obama's first inauguration speech, and the 2009 Iran Election protests (one of the so-called Twitter Revolutions).
Finally, it does contain tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French), so it should be possible (at least in theory) to analyze international events from different cultural perspectives.
The dataset was completed in November 2016 and, therefore, the tweet IDs it contains were publicly available at that moment. This means that there could be tweets public during that period that do not appear in the dataset and also that a substantial part of tweets in the dataset has been deleted (or locked) since 2016.
To make it easier to understand the decay of tweet IDs in the dataset, a number of representative samples (99% confidence level and 0.5 confidence interval) are provided.
In general terms, 85.5% ±0.5 of the historical tweets are available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the amount of tweets varies greatly throughout the period of three years covered in the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:
March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).
June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).
September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).
December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).
March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).
June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).
September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).
December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).
March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).
June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).
September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).
December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).
March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).
June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.
At the moment of cleaning all the data to publish this dataset there seemed to be a gap between April 1, 2008 and July 7, 2008 (actually, the data was not missing but in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all those that were created between those two dates. All those tweets actually existed, but a number of them were obviously private and not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.
In other words, what you see in that period (April to July, 2008) is not actually a huge number of tweets having been deleted but the combination of deleted and non-public tweets (whose IDs should not be in the dataset for performance purposes when rehydrating the dataset).
Additionally, given that not everybody will need the whole period of time, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
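A small sketch of how date-tweet-id.tsv could be used to restrict hydration to a date window; the two-column layout (ISO date, earliest tweet ID) is an assumption.

import csv

earliest = {}
with open("date-tweet-id.tsv", newline="") as f:
    for day, tweet_id in csv.reader(f, delimiter="\t"):  # assumed columns: date, earliest ID
        earliest[day] = int(tweet_id)

# Example: IDs posted on 2008-11-04 (US election day) lie between these two boundaries.
lower, upper = earliest["2008-11-04"], earliest["2008-11-05"]
print(lower, upper)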
For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).
If you use this dataset in any way please cite that preprint (in addition to the dataset itself).
If you need to contact me you can find me as @PFCdgayo on Twitter.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides an extensive analysis of Twitter retweet activities, focusing on various attributes that can influence and describe the nature of retweets. It consists of multiple rows of data, each representing a unique Twitter retweet instance with detailed information on its characteristics.
Weekday: The day of the week when the retweet occurred.
Hour: The hour of the day when the retweet was made, in 24-hour format.
Day: The day of the month when the retweet was posted.
Lang: The language code of the tweet that was retweeted.
Reach: The estimated number of users who have seen the retweet.
RetweetCount: The number of times the retweeted tweet has been retweeted further.
Likes: The number of likes received by the retweeted tweet.
Klout: The Klout score of the user who posted the original tweet, which is a measure of their influence on social media.
Sentiment: The sentiment score of the retweeted tweet, indicating the overall emotional tone.
LocationID: A numerical identifier representing the geographical location of the user who posted the retweet.
This dataset can be utilized for various analyses, including:
- Identifying peak times for retweets
- Analyzing the influence of tweet attributes on retweet rates
- Sentiment analysis of popular retweets
- Geographical distribution of retweet activity
- Correlating Klout scores with retweet reach and engagement
Researchers, marketers, and social media analysts can use this dataset to gain insights into Twitter retweet behavior, optimize social media strategies, and understand the factors contributing to the virality of tweets.
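As an illustration of the first analysis listed above, a small pandas sketch for finding peak retweet hours; the file name is a placeholder since the description does not state it.

import pandas as pd

df = pd.read_csv("twitter_retweets.csv")  # placeholder file name

# Average further-retweet count per hour of day, to spot peak activity windows.
peak = df.groupby("Hour")["RetweetCount"].mean().sort_values(ascending=False)
print(peak.head())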
English Version:
Two datasets are published as part of my Bachelor's final thesis on hate speech, titled Hate Speech on Twitter: Analysis of LGBTIQ-phobia Before and After Elon Musk:
Main.csv: This dataset contains the IDs of 695,000 tweets in Spanish related to the LGBTIQ+ community, collected using specific keywords. The tweets correspond to each June 28th from 2015 to 2025.
Benchmark.csv: This dataset includes the IDs of 657,000 random tweets in Spanish, obtained through a selection of keywords. The tweets represent a 6-minute sample from every hour on each June 28th from 2015 to 2025.
Due to the X Developer Agreement policy, we can only share the tweet IDs.
For further details, you can find the code for processing and analysis in the project's GitHub repository.
Acknowledgements
We would like to acknowledge the use of tools and support provided by twitterapi.io for data extraction, as well as the Perspective API, which played a crucial role in analyzing tweet toxicity. These resources were indispensable for the successful completion of this project.
Versión en Español:
Se publican dos conjuntos de datos como parte de mi trabajo de fin de grado (TFG) sobre el discurso de odio, titulado Discurso de odio en Twitter: Análisis de la LGTBIQ-fobia antes y después de Elon Musk:
Main.csv: Este conjunto de datos contiene los IDs de 695,000 tuits en español relacionados con la comunidad LGTBIQ+, recopilados mediante el uso de palabras clave. Los tuits corresponden a cada 28 de junio de cada año, desde 2015 hasta 2025.
Benchmark.csv: Este conjunto de datos incluye los IDs de 657,000 tuits aleatorios en español, obtenidos a partir de una selección de palabras clave. Los tuits representan una muestra de 6 minutos de cada hora, correspondiente a cada 28 de junio, desde 2015 hasta 2025.
Debido a la X Developer Agreement policy, únicamente podemos compartir los IDs de los tuits.
Para más detalles, puede consultar el código de procesamiento y análisis de los datos en el repositorio de GitHub del proyecto.
Agradecimientos
Queremos agradecer el apoyo y las herramientas proporcionadas por twitterapi.io para la extracción de datos, así como la Perspective API, que jugó un papel crucial en el análisis de la toxicidad de los tuits. Estos recursos fueron indispensables para la realización exitosa de este proyecto.
Data usage terms: https://www.gesis.org/en/institute/data-usage-terms
This dataset consists of IDs of geotagged Twitter posts from within the United States. They are provided as files per day and state as well as per day and county. In addition, files containing the aggregated number of hashtags from these tweets are provided per day and state and per day and county. The data is organized as one ZIP file per month, containing several ZIP files per day, which hold the txt files with the ID/hashtag information.
Also part of the dataset are two shapefiles for the US counties and states and Python scripts for the data collection and sorting geotags into counties.
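A minimal sketch of walking the nested monthly/daily ZIP structure described above; the archive and member names are assumptions.

import io
import zipfile

with zipfile.ZipFile("2020-04.zip") as month:  # assumed monthly archive name
    for daily_name in month.namelist():
        if not daily_name.endswith(".zip"):
            continue
        with zipfile.ZipFile(io.BytesIO(month.read(daily_name))) as day:
            for txt_name in day.namelist():
                if txt_name.endswith(".txt"):
                    ids = day.read(txt_name).decode("utf-8").splitlines()
                    print(daily_name, txt_name, len(ids), "lines")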
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are releasing a Twitter dataset connected to our project Digital Narratives of Covid-19 (DHCOVID), which, among other goals, aims to explore over one year (May 2020 to May 2021) the narratives behind data about the coronavirus pandemic.
In this first version, we deliver a Twitter dataset organized as follows:
Each folder corresponds to daily data (one folder for each day): YEAR-MONTH-DAY
In every folder there are 9 different plain text files named with "dhcovid", followed by date (YEAR-MONTH-DAY), language ("en" for English, and "es" for Spanish), and region abbreviation ("fl", "ar", "mx", "co", "pe", "ec", "es"):
dhcovid_YEAR-MONTH-DAY_es_fl.txt: Dataset containing tweets geolocalized in South Florida. The geo-localization is tracked by tweet coordinates, by place, or by user information.
dhcovid_YEAR-MONTH-DAY_en_fl.txt: We are gathering only tweets in English that refer to the area of Miami and South Florida. The reason behind this choice is that there are multiple projects harvesting English data, and, our project is particularly interested in this area because of our home institution (University of Miami) and because we aim to study public conversations from a bilingual (EN/ES) point of view.
dhcovid_YEAR-MONTH-DAY_es_ar.txt: Dataset containing tweets from Argentina.
dhcovid_YEAR-MONTH-DAY_es_mx.txt: Dataset containing tweets from Mexico.
dhcovid_YEAR-MONTH-DAY_es_co.txt: Dataset containing tweets from Colombia.
dhcovid_YEAR-MONTH-DAY_es_pe.txt: Dataset containing tweets from Perú.
dhcovid_YEAR-MONTH-DAY_es_ec.txt: Dataset containing tweets from Ecuador.
dhcovid_YEAR-MONTH-DAY_es_es.txt: Dataset containing tweets from Spain.
dhcovid_YEAR-MONTH-DAY_es.txt: This dataset contains all tweets in Spanish, regardless of its geolocation.
For English, we collect all tweets with the following keywords and hashtags: covid, coronavirus, pandemic, quarantine, stayathome, outbreak, lockdown, socialdistancing. For Spanish, we search for: covid, coronavirus, pandemia, quarentena, confinamiento, quedateencasa, desescalada, distanciamiento social.
The corpus of tweets consists of a list of tweet IDs; to obtain the original tweets, you can use a tweet hydration tool (e.g., DocNow's Hydrator), which takes the IDs and downloads all the metadata for you into a CSV file.
We started collecting this Twitter dataset on April 24th, 2020 and we are adding daily data to our GitHub repository. There is a known problem with the file 2020-04-24/dhcovid_2020-04-24_es.txt: we could not gather that data due to technical reasons.
For more information about our project visit https://covid.dh.miami.edu/
For more updated datasets and detailed criteria, check our GitHub Repository: https://github.com/dh-miami/narratives_covid19/
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario".
Abstract:
Museums are embracing social technologies in the attempt to broaden their audience and to engage people. Although social communication seems an easy task, media managers know how hard it is to reach millions of people with a simple message. Indeed, millions of posts are competing every day to get visibility in terms of likes and shares and very little research focused on museums communication to identify best practices. In this paper, we focus on Twitter and we propose a novel method that exploits interpretable machine learning techniques to: (a) predict whether a tweet will likely be appreciated by Twitter users or not; (b) present simple suggestions that will help enhancing the message and increasing the probability of its success. Using a real-world dataset of around 40,000 tweets written by 23 world famous museums, we show that our proposed method allows identifying tweet features that are more likely to influence the tweet success.
Code to run a selection of experiments is available at https://github.com/rmartoglia/predict-twitter-ch
Dataset structure
This record contains the dataset used in the experiments of the above research paper. Only the extracted features for the museum tweet threads (not the full message text) are provided; these are all that is needed for the analyses.
We selected 23 well-known art museums from around the world and grouped them into five groups: G1 (museums with at least three million followers); G2 (museums with more than one million followers); G3 (museums with more than 400,000 followers); G4 (museums with more than 200,000 followers); G5 (Italian museums). From these museums, we analyzed ca. 40,000 tweets, with a number varying from ca. 5k to ca. 11k for each museum group, depending on the number of museums in each group.
Content features: these are the features that can be drawn from the content of the tweet itself. We further divide such features into the following two categories:
– Countable: these features have a value ranging into different intervals. We take into consideration: the number of hashtags (i.e., words preceded by #) in the tweet, the number of URLs (i.e., links to external resources), the number of images (e.g., photos and graphical emoticons), the number of mentions (i.e., twitter accounts preceded by @), the length of the tweet;
– On-Off : these features have binary values in {0, 1}. We observe whether the tweet has exclamation marks, question marks, person names, place names, organization names, other names. Moreover, we also take into consideration the tweet topic density: assuming that the involved topics correspond to the hashtags mentioned in the text, we define a tweet as dense of topics if the number of hashtags it contains is greater than a given threshold, set to 5. Finally, we observe the tweet sentiment that might be present (positive or negative) or not (neutral).
Context features: these features are not drawn from the content of the tweet itself and might give a larger picture of the context in which the tweet was sent. Namely, we take into consideration the part of the day in which the tweet was sent (morning, afternoon, evening and night, respectively from 5:00am to 11:59am, from 12:00pm to 5:59pm, from 6:00pm to 10:59pm and from 11pm to 4:59am), and a boolean feature indicating whether the tweet is a retweet or not.
User features: these features are proper of the user that sent the tweet, and are the same for all the tweets of this user. Namely we consider the name of the museum and the number of followers of the user.
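A minimal sketch of how the countable content features and the part-of-day context feature described above could be derived from a raw tweet; the regular expressions and field names are illustrative assumptions, not the paper's exact implementation.

import re
from datetime import datetime

def extract_features(text, created_at, n_images=0):
    hour = created_at.hour
    if 5 <= hour < 12:
        part_of_day = "morning"
    elif 12 <= hour < 18:
        part_of_day = "afternoon"
    elif 18 <= hour < 23:
        part_of_day = "evening"
    else:
        part_of_day = "night"
    hashtags = re.findall(r"#\w+", text)
    return {
        "n_hashtags": len(hashtags),
        "n_urls": len(re.findall(r"https?://\S+", text)),
        "n_mentions": len(re.findall(r"@\w+", text)),
        "n_images": n_images,                   # taken from tweet metadata in practice
        "length": len(text),
        "has_exclamation": int("!" in text),
        "has_question": int("?" in text),
        "topic_dense": int(len(hashtags) > 5),  # threshold of 5 hashtags, as described above
        "part_of_day": part_of_day,
    }

print(extract_features("Visit our new exhibition! #art #museum", datetime(2020, 3, 1, 14, 30)))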
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
The Digital Humanities 2016 conference is taking/took place in Kraków, Poland, between Sunday 11 July and Saturday 16 July 2016. #DH2016 is/was the conference's official hashtag.
What This Output Is
This is a CSV file containing a total of 4175 Tweets publicly published with the hashtag #DH2016 on Wednesday 13 July 2016 GMT. The archive starts with a Tweet published on Wednesday July 13 2016 at 00:31:16 +0000 and ends with a Tweet published on Wednesday July 13 2016 at 23:58:33 +0000. Previous days have been shared on a different output. A breakdown of Tweets per day so far:
Sunday 10 July 2016: 179 Tweets
Monday 11 July 2016: 981 Tweets
Tuesday 12 July 2016: 2318 Tweets
Wednesday 13 July 2016: 4175 Tweets
Methodology and Limitations
The Tweets contained in this file were collected by Ernesto Priego using Martin Hawksey's TAGS 6.0. Only users with at least 1 follower were included in the archive. Retweets have been included (Retweets count as Tweets). The collection spreadsheet was customised to reflect the time zone and geographical location of the conference. The profile_image_url and entities_str metadata were removed before public sharing in this archive. Please bear in mind that the conference hashtag has been spammed, so some Tweets collected may be from spam accounts. Some automated refining has been performed to remove Tweets not related to the conference, but the data is likely to require further refining and deduplication. Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (Gonzalez-Bailon, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet tagged with #dh2016 during the indicated period, and the dataset is shared for archival, comparative and indicative educational research purposes only. Only content from public accounts is included and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account. Each Tweet and its contents were published openly on the Web with the queried hashtag and are the responsibility of the original authors. Original Tweets are likely to be copyright of their individual authors but please check individually. No private personal information is shared in this dataset. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into scholarly activity on Twitter.
Other Considerations
Tweets published publicly by scholars during academic conferences are often tagged (labeled) with a hashtag dedicated to the conference in question. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). A hashtag is metadata users choose freely to use so their content is associated, directly linked to and categorised with the chosen hashtag.
Though every reason for Tweeters' use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences. As Twitter users, conference Twitter hashtag contributors have agreed to Twitter's Privacy and data sharing policies. Professional associations like the Modern Language Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter's search API has well-known temporal limitations for retrospective historical search and collection.Beyond individual tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. To date, collecting in real time is the only relatively accurate method to archive tweets at a small scale. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time.The CC-BY license has been applied to the output in the repository as a curated dataset. Authorial/curatorial/collection work has been performed on the file in order to make it available as part of the scholarly record. The data contained in the deposited file is otherwise freely available elsewhere through different methods and anyone not wishing to attribute the data to the creator of this output is needless to say free to do their own collection and clean their own data.
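As a rough illustration of the further refining and deduplication mentioned under Methodology and Limitations, the following is a minimal Python sketch. The filename and the id_str column are assumptions based on a typical TAGS 6.0 export, not a description of this exact file, and may need adjusting.

```python
# A minimal sketch of the further deduplication mentioned above, assuming the
# file is named dh2016_2016-07-13.csv and follows a typical TAGS 6.0 layout
# with an "id_str" column holding the tweet identifier (both are assumptions).
import pandas as pd

df = pd.read_csv("dh2016_2016-07-13.csv", dtype={"id_str": str})

# TAGS can occasionally capture the same tweet more than once, so drop
# exact duplicates on the tweet identifier, keeping the first occurrence.
deduplicated = df.drop_duplicates(subset="id_str", keep="first")

print(f"{len(df)} rows read, {len(deduplicated)} unique tweets kept")
deduplicated.to_csv("dh2016_2016-07-13_deduplicated.csv", index=False)
```

Further refining (for example, removing Tweets from known spam accounts) would still need manual inspection, since spam Tweets also carry the conference hashtag.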
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Dataset of Elon Musk tweets, updated and recorded automatically every day. Coverage starts from September 2, 2021 (the start date is due to a limit of the Twitter API).
The dataset consists of two files (a minimal loading sketch follows):
- tweets.csv: tweets only (tweets that are not replies to any other tweet)
- tweets_and_replies.csv: both tweets and replies
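To make the two files easier to work with, here is a minimal Python sketch for loading them and separating out the replies. The "id" column name, and the assumption that it uniquely identifies a tweet in both files, are not confirmed by the dataset description and may need adjusting.

```python
# A minimal sketch for loading both CSVs and separating the replies.
# The "id" column name is an assumption; adjust it to whatever identifier
# column the actual files contain.
import pandas as pd

tweets = pd.read_csv("tweets.csv", dtype=str)
tweets_and_replies = pd.read_csv("tweets_and_replies.csv", dtype=str)

# Rows present only in the combined file should be the replies, assuming
# "id" uniquely identifies a tweet across both files.
replies = tweets_and_replies[~tweets_and_replies["id"].isin(tweets["id"])]

print(f"{len(tweets)} standalone tweets, {len(replies)} replies")
```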