License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
This dataset consists of tweet identifiers for tweets harvested between November 28, 2016 (following the election of Donald Trump) and the end of the first 100 days of his administration. Data collection ended May 1, 2017.
Tweets were harvested using multiple methods described below. The total dataset consists of 218,273,152 tweets. Because of the different methods used to harvest tweets, there may be some duplication.
Methods
Data were harvested from the Twitter API using the following endpoints:
search
timeline
filter
Three tweet sets were harvested using the search endpoint, which returns tweets that include a specific search term, user mention, hashtag, etc. The table below provides the search term, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented.
Search term | Dates collected | Count tweets | Count unique users
@realDonaldTrump user mention | 2016-11-28 - 2017-05-01 | 4,597,326 | 1,501,806
"Trump" in tweet text | 2017-01-18 - 2017-05-01 | 11,055,772 | 2,648,849
#MAGA hashtag | 2017-01-23 - 2017-05-01 | 1,169,897 | 236,033
Two tweet sets were harvested using the timeline endpoint, which returns tweets published by specific users. The table below provides the user whose timeline was harvested, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented. Note that in these cases, tweets were necessarily limited to the one unique user whose tweets were harvested.
User | Dates collected | Count tweets | Count unique users
realDonaldTrump | 2016-12-21 - 2017-05-01 | 902 | 1
trumpRegrets | 2017-01-15 - 2017-05-01 | 1,751 | 1
The largest tweet set was harvested using the filter endpoint, which allows for streaming data access in near real time. Requests made to this API can be filtered to include tweets that meet specific criteria. The table below provides the filters used, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented.
Filtering via the API uses a default "OR," so the tweets included in this set satisfied any of the filter terms.
The script used to harvest streaming data from the filter API was built using the Python tweepy library; a minimal sketch of this approach appears after the table below.
Filter terms | Dates collected | Count tweets | Count unique users
tweets by realDonaldTrump; tweet mentions @realDonaldTrump; 'maga' in text; 'trump' in text; 'potus' in text | 2017-01-26 - 2017-05-01 | 201,447,504 | 12,489,255
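As a rough illustration, a streaming harvester along these lines can be written with tweepy's Stream class. This is a hedged sketch, not the project's original script: the credentials, output directory, and tweepy version (4.x) are assumptions.

```python
import json
import tweepy

# Placeholder credentials -- substitute your own Twitter app keys.
API_KEY, API_SECRET = "API_KEY", "API_SECRET"
ACCESS_TOKEN, ACCESS_SECRET = "ACCESS_TOKEN", "ACCESS_SECRET"

class JsonFileStream(tweepy.Stream):
    """Write each incoming tweet to its own JSON file (one file per tweet)."""
    def on_status(self, status):
        with open(f"tweets/{status.id_str}.json", "w") as f:
            json.dump(status._json, f)

stream = JsonFileStream(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
# follow/track arguments are OR-ed together, matching the default described above.
stream.filter(
    follow=["25073877"],  # realDonaldTrump's numeric user ID
    track=["maga", "trump", "potus", "@realDonaldTrump"],
)
```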
Harvested tweets, including all corresponding metadata, were stored in individual JSON files (one file per tweet).
Data Processing: Conversion to CSV format
Per the terms of Twitter's developer API, tweet datasets may be shared for academic research use. Sharing tweet data is limited to sharing the identifiers of tweets, which must be re-harvested to account for deletions and/or modifications of individual tweets. It is not permitted to share the originally harvested tweets in JSON format.
Tweet identifiers have been extracted from the JSON data and saved as plain text CSV files. The CSV files all have a single column:
id_str (string): A tweet identifier
The data include one tweet identifier per row.
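For example, extracting the identifiers from a directory of per-tweet JSON files into a one-column CSV can be done with a few lines of Python. This is a sketch assuming the one-file-per-tweet layout described above; the directory and output names are placeholders.

```python
import csv
import json
from pathlib import Path

# Assumes one JSON file per tweet, as described above.
with open("tweet_ids.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id_str"])  # single-column header (assumed)
    for path in Path("tweets").glob("*.json"):
        with open(path) as f:
            writer.writerow([json.load(f)["id_str"]])
```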
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we were filtering other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 22nd and yielded over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (40,823,816 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (7,479,940 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1,000 frequent terms in frequent_terms.csv, the top 1,000 bigrams in frequent_bigrams.csv, and the top 1,000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter.
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions on redistributing Twitter data. They need to be hydrated before use.
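One common way to hydrate such identifier lists is the twarc library. The sketch below assumes a plain-text file of tweet IDs, one per line, and placeholder Twitter API credentials.

```python
import json
from twarc import Twarc

# Placeholder credentials -- substitute your own Twitter app keys.
t = Twarc("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

# Hydrate IDs back into full tweet objects; deleted tweets are skipped.
with open("ids.txt") as ids, open("hydrated.jsonl", "w") as out:
    for tweet in t.hydrate(ids):
        out.write(json.dumps(tweet) + "\n")
```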
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset represents version 2 of this dataset; the previous version was published on June 30, 2020. It contains the tweet ids of 39,373,774 tweets, which are part of the Schlesinger Library #metoo Digital Media Collection. This second version represents the full set of tweets collected throughout the project, ranging from October 15, 2017 to December 31, 2022 (the previous version extended to March 31, 2020). Tweets between October 15, 2017 and December 10, 2018 were licensed from Twitter's Historical PowerTrack and received through GNIP. Tweets after December 10, 2018 were collected weekly from the Twitter API through Social Feed Manager using the POST statuses/filter method of the Twitter Stream API.

The following list of 76 terms includes the hashtags used to collect data for this dataset: #metoo, #timesup, #metoostem, #sciencetoo, #metoophd, #shittymediamen, #churchtoo, #ustoo, #metooMVMT, #ARmetoo, #TimesUpAR, #metooSociology, #metooSexScience, #timesupAcademia, #metooMedicine, #MyCampusToo, #howiwillchange, #iwill, #believewomen, #GoTeal, #BelieveChristine, #IStandWithDrFord, #IStandWithChristineBlaseyFord, #believesurvivors, #whyididntreport, #himtoo, #istandwithbrett, #confirmkavanaguhnow, #metooMcdonalds, #metoomovement, #muteRKelly, #WeBelieveDrFord, #WeBelieveSurvivors, #HandsOffPantsOn, #MeAt14, #HeToo, #MeTooLiars, #metoolynchings, #metoohucksters, #metoohustle, #ItWasMe, #Ihave, #TimesUpTech, #GoogleWalkout, #mosquemetoo, #faithandmetoo, #SilenceIsNotSpiritual, #HealMeToo, #TimesUpHarvard, #NoCarveOut, #TimesUpx2, #MeetingsToo, #metoonatsec, #healmetoo, #GamAni, #ShulToo, #harvardhearsyou, #metooarcheology, #TimesUpPayUp, #metooarcheology, #metooHBCU, #TimesUpHC, #aidtoo, #garmentmetoo, #mutemetoo, #mutetimesup, #metoopolisci, #copstoo, #TimesUpBiden, #MeTooNoMatterWho, #IBelieveTara, #BelieveAllWomen, #metoomilitary, #harvard38, #comaroff, and #harvardletter. The final four hashtags in this list were first crawled on February 10, 2022.

Because of the size of the files, the list of identifiers is split into 41 files containing up to 1,000,000 ids each. Per Twitter's Developer Policy, tweet ids may be publicly shared for academic purposes; tweets may not. Therefore, this dataset only contains tweet ids. In order to retrieve tweets still available (not deleted by users), tools like Hydrator can be used. Subsets of only the #metoo seed are also available as quarterly datasets.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
I streamed live tweets from Twitter after the WHO declared Covid-19 a pandemic. Since the Covid-19 epidemic has affected the entire world, I collected worldwide Covid-19-related English tweets at a rate of almost 10k per day, in three phases: April-June 2020, August-October 2020, and April-June 2021. I prepared the first-phase dataset of about 235k tweets collected from 19th April to 20th June 2020. After one month, I started collecting tweets again, as at that time the pandemic was spreading with fatal intensity; I collected almost 320k tweets in the period August 20 to October 20, 2020, for the second-phase dataset. Finally, after six months, I collected almost 489k tweets in the period 26th April to 27th June 2021 for the third-phase dataset.
The datasets I developed contain important information about most of the tweets and their attributes. The main attributes of these datasets are:
- Tweet ID
- Creation Date & Time
- Source Link
- Original Tweet
- Favorite Count
- Retweet Count
- Original Author
- Hashtags
- User Mentions
- Place
Finally, I collected 235,240, 320,316, and 489,269 tweets for the first, second, and third phase datasets, containing hash-tagged keywords such as #covid-19, #coronavirus, #covid, #covaccine, #lockdown, #homequarantine, #quarantinecenter, #socialdistancing, #stayhome, #staysafe, etc. An overview of the collected dataset is presented here.
I pre-processed the collected data with a user-defined pre-processing function based on NLTK (Natural Language Toolkit, a Python library for NLP). At the initial stage, it converts all tweets to lowercase. Then it removes all extra white space, numbers, special characters, non-ASCII characters, URLs, punctuation, and stopwords from the tweets. It then converts all 'covid' words to 'covid19', since all numbers were already removed from the tweets. Finally, using stemming, the function reduces inflected words to their word stems.
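A minimal sketch of such a function is shown below. This is a hedged reconstruction: the exact regular expressions and order of steps in the original are assumptions.

```python
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(tweet: str) -> str:
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = text.encode("ascii", "ignore").decode()       # drop non-ASCII characters
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\bcovid\b", "covid19", text)         # 'covid' -> 'covid19'
    tokens = [STEMMER.stem(w) for w in text.split() if w not in STOPWORDS]
    return " ".join(tokens)                              # also collapses extra whitespace
```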
I calculated the sentiment polarity of each cleaned and pre-processed tweet using the NLTK-based sentiment analyzer, obtaining sentiment scores for the positive, negative, and neutral categories and a compound sentiment score for each tweet. I classified the tweets on the basis of the compound sentiment score into three classes, i.e., Positive, Negative, and Neutral, and then assigned a sentiment polarity rating to each tweet based on the following algorithm:
Algorithm: Sentiment Classification of Tweets (compound, sentiment)
1. for each tweet in the dataset:
2.     if tweet[compound] < 0:
3.         tweet[sentiment] = 0.0   # assigned 0.0 for Negative Tweets
4.     elif tweet[compound] > 0:
5.         tweet[sentiment] = 1.0   # assigned 1.0 for Positive Tweets
6.     else:
7.         tweet[sentiment] = 0.5   # assigned 0.5 for Neutral Tweets
8. end
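In runnable form, using NLTK's VADER sentiment analyzer (a sketch; the original code is not published here, so the analyzer choice is an assumption consistent with the NLTK-based description above):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def sentiment_rating(text: str) -> float:
    """Map a tweet's compound VADER score to the 0.0/0.5/1.0 rating scheme above."""
    compound = sia.polarity_scores(text)["compound"]
    if compound < 0:
        return 0.0   # Negative
    elif compound > 0:
        return 1.0   # Positive
    return 0.5       # Neutral
```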
I would not be here without the help of my project guide, Dr. Anup Kumar Kolya, Assistant Professor, Dept. of Computer Science and Engineering, RCCIIT, whose kind and valuable suggestions and excellent guidance gave me the best opportunity to prepare these datasets.
These datasets are part of the publications entitled:
Data usage terms: https://www.gesis.org/en/institute/data-usage-terms
At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.
License: Database Contents License (DbCL) 1.0, http://opendatacommons.org/licenses/dbcl/1.0/
Banda, Juan M.; Tekumalla, Ramya; Wang, Guanyu; Yu, Jingyuan; Liu, Tuo; Ding, Yuning; Artemova, Katya; Tutubalina, Elena; Chowell, Gerardo
Version 67 of the dataset.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions, and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (1,116,738,914 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (283,307,680 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1,000 frequent terms in frequent_terms.csv, the top 1,000 bigrams in frequent_bigrams.csv, and the top 1,000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations, visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, and our pre-print about the dataset is available at https://arxiv.org/abs/2004.03688.
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions, which permit redistribution of Twitter data ONLY for research purposes. They need to be hydrated before use.
Link to access the dataset: https://doi.org/10.5281/zenodo.3723939
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we were filtering other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 30th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to February 27th to provide extra longitudinal coverage.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (101,400,452 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (20,244,746 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1,000 frequent terms in frequent_terms.csv, the top 1,000 bigrams in frequent_bigrams.csv, and the top 1,000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is an Excel workbook containing two sheets.

The first sheet contains 503 rows corresponding to 503 Tweet id strings from user ID (from_user_id_str) 25073877, together with the following corresponding metadata: created_at time, user_lang, in_reply_to_user_id_str, from_user_id_str, in_reply_to_status_id_str, source, user_followers_count, user_friends_count. Tweet texts, URLs and other metadata such as profile_image_url, status_url and entities_str have not been included. An attempt to remove duplicated entries was made, but duplicates might remain, so further data refining might be required prior to analysis.

The second sheet contains 400 rows corresponding to the most frequent terms in the dataset's Tweets' texts. The text analysis was performed with the Terms tool from Voyant Tools by Stéfan Sinclair & Geoffrey Rockwell (2017). An edited English stop-words list was applied to remove Twitter-data-specific terms such as t.co, https, user names, etc. The analysed Tweets contained emojis and other special characters; due to character encoding, these appear in the terms list as character combinations.

Motivations to Share this Data
Archived Tweets can provide interesting insights for the study of the contemporary history of media, politics, diplomacy, etc. The queried account is a public account widely agreed to be of exceptional national and international public interest. Though they provide public access to tweeted content in real time, the Twitter Web and mobile clients are not suited for proper Tweet corpus analysis. For anyone researching social media, access to the data is essential in order to perform, review and reproduce studies. Archiving Tweets of public interest due to their historic significance is a means to both preserve and enable reproducible study of this form of rapid online communication, which otherwise can very likely become unretrievable as time passes. Due to Twitter's current business model and API limits, to date collecting in real time is the only relatively reliable method of archiving Tweets at a small scale. So far, Twitter data analysis and visualisation has largely been done without researchers providing access to the source data that would allow reproducibility. It is appreciated that an Excel workbook is far from ideal as a file format, but given the small scale the intention is to make this data human-readable and available to researchers in a variety of non-technical fields.

Methodology and Limitations
The Tweets contained in this file were collected by Ernesto Priego using a Python script. The data collection search query was from:realdonaldtrump. A trigger was scheduled to collect automatically every hour, which means that any Tweets deleted immediately after publication have not been collected. The original data harvest was refined to delete duplications, to comply with Twitter's Terms and Conditions, and to sort the data in chronological order. Duplication of data due to the automated collection is possible, so further data refining might be required. The file may not contain data from Tweets deleted by the queried user account immediately after original publication. Both research and experience show that the Twitter Search API is not 100% reliable (Gonzalez-Bailon, Sandra, et al., 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet posted by the queried account during the indicated period.
This file dataset is shared for archival, comparative and indicative educational research purposes only. The content included is from a public Twitter account and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web clients and mobile apps, without the need for a Twitter account. The original Tweets, their contents and associated metadata were published openly on the Web from the queried public account and are the responsibility of the original authors. Original Tweets are likely to be copyright of their individual authors, but please check individually. The license on this output applies to the data collection; third-party content should be attributed to the original authors and copyright owners. Please note that usernames, user profile pictures and the full text of the collected Tweets have not been included in this file. No private personal information is shared in this dataset. As indicated above, this dataset does not contain the text of the Tweets. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy, and the sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into political activity on Twitter.

Other Considerations
All Twitter users agree to Twitter's privacy and data-sharing policies. Social media research remains in its infancy, and though work has been done to develop best practices, there is as yet no agreement on a series of grey areas relating to research methodologies, including ad hoc social-media-specific research ethics guidelines for reproducible research. It is understood that public figures Tweet publicly with the conscious intention of having their Tweets publicly accessed and discussed. It is assumed that a public figure Tweeting publicly is of public interest and that such a figure, as a Twitter user, has given implicit consent, by agreeing explicitly to Twitter's Terms and Conditions, for their Tweets to be publicly accessed and discussed, including critical analysis, without the need for prior written permission. There is therefore no difference between collecting data and performing data analysis from a public printed or online publication and doing so with a dataset containing Twitter data from a public account held by a public user in a public role. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. Reproducibility is considered here a key value for robust and trustworthy research. Scholarly professional associations such as the Modern Language Association recognise Tweets, datasets and other online and digital resources as citeable scholarly outputs. The data contained in the deposited file is otherwise available elsewhere through different methods.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is a CSV file containing the Tweet IDs of 3,805 Tweets from user ID 25073877, posted publicly between Thursday February 25 2016 16:35:12 +0000 and Monday April 03 2017 12:51:01 +0000. This file includes neither the Tweets' texts nor their URLs. Columns in the file are: id_str, from_user_id_str, created_at time, source, user_followers_count, user_friends_count.

Motivations to Share this Data
Archived Tweets can provide interesting insights for the study of the contemporary history of media, politics, diplomacy, etc. The queried account is a public account widely agreed to be of exceptional national and international public interest. Though they provide public access to tweeted content in real time, the Twitter Web and mobile clients are not suited for proper Tweet corpus analysis. For anyone researching social media, access to the data is essential in order to perform, review and reproduce studies. Archiving Tweets of public interest due to their historic significance is a means to both preserve and enable reproducible study of this form of rapid online communication, which otherwise can very likely become unretrievable as time passes. Due to Twitter's current business model and API limits, to date collecting in real time is the only relatively reliable method of archiving Tweets at a small scale.

Methodology and Limitations
The Tweets contained in this file were collected by Ernesto Priego using a Python script. The data collection search query was from:realdonaldtrump. A trigger was scheduled to collect automatically every hour. The original data harvest was refined to delete duplications, to comply with Twitter's Terms and Conditions, and to sort the data in chronological order. Duplication of data due to the automated collection is possible, so further data refining might be required. The file may not contain data from Tweets deleted by the queried user account immediately after original publication. Both research and experience show that the Twitter Search API is not 100% reliable (Gonzalez-Bailon, Sandra, et al., 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet posted by the queried account during the indicated period.

This file dataset is shared for archival, comparative and indicative educational research purposes only. The content included is from a public Twitter account and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web clients and mobile apps, without the need for a Twitter account. The original Tweets, their contents and associated metadata were published openly on the Web from the queried public account and are the responsibility of the original authors. Original Tweets are likely to be copyright of their individual authors, but please check individually. No private personal information is shared in this dataset. As indicated above, this dataset does not contain the text of the Tweets. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy, and the sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into political activity on Twitter.

Other Considerations
All Twitter users agree to Twitter's privacy and data-sharing policies.
Social media research remains in its infancy, and though work has been done to develop best practices, there is as yet no agreement on a series of grey areas relating to research methodologies, including ad hoc social-media-specific research ethics guidelines for reproducible research. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. Reproducibility is considered here a key value for robust and trustworthy research. Scholarly professional associations such as the Modern Language Association recognise Tweets, datasets and other online and digital resources as citeable scholarly outputs. The data contained in the deposited file is otherwise available elsewhere through different methods.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
This study delves into analyzing social media data sourced from Twitter within the context of Singapore, forming a crucial component of a broader social listening initiative. We provide a decade's worth of social data from Singapore, offering invaluable insights for the research community. This work presents two analytical approaches utilizing this dataset: sentiment analysis and bursty topic detection. Sentiment analysis for direct search is based on a zero-shot pretrained model, while bursty topic analysis is based on the biterm topic model. The detailed experiments demonstrate the efficacy of the approach for analyzing social trends using Twitter data.

We collected all Twitter data posted in Singapore from 2008 to 2023. A geocode of (1.346353, 103.807526, 25km) was used with the Twitter API to cover the whole of Singapore, as sketched below. The total number of tweets in this dataset is 96,686,894. There are 3 data files:
1. place.json includes 10k detailed place records in Singapore.
2. subzones.json includes 332 subzone records in Singapore.
3. tweets.json includes 96M+ tweets posted in Singapore.
MongoDB was used as the database to store and manage the data.
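A hedged sketch of geocode-restricted collection with tweepy follows. The credentials and the query string are placeholders (the standard search endpoint requires some query term); only the geocode is taken from the description above.

```python
import tweepy

# Placeholder credentials.
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search tweets within 25 km of the centre of Singapore.
for tweet in tweepy.Cursor(
    api.search_tweets,
    q="singapore",  # placeholder query; replace with your own terms
    geocode="1.346353,103.807526,25km",
).items(100):
    print(tweet.id, tweet.created_at, tweet.text)
```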
Version 162 of the dataset. NOTES: Data for 3/15 - 3/18 were not extracted due to unexpected and unannounced downtime of our university infrastructure. We will try to backfill those days by the next release. FUTURE CHANGES: Due to the imminent paywalling of Twitter's API access, this might be the last full update of this dataset. If API access is not blocked, we will stop updating this dataset with release 165, a bit more than 3 years after our initial release. It's been a joy seeing all the work that uses this resource, and we are glad that so many found it useful.
The dataset files full_dataset.tsv.gz and full_dataset_clean.tsv.gz have been split into 1 GB parts using the Linux utility split, so make sure to join the parts before unzipping (see the sketch below). We had to make this change because we had huge issues uploading files larger than 2 GB (hence the delay in the dataset releases). The peer-reviewed publication for this dataset has now been published in Epidemiologia, an MDPI journal, and can be accessed here: https://doi.org/10.3390/epidemiologia2030024. Please cite this when using the dataset.
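A minimal sketch of rejoining the parts in Python; the part-file naming pattern is an assumption, so adjust the glob to the actual file names.

```python
import glob
import shutil

# Concatenate the split parts back into a single gzip archive.
parts = sorted(glob.glob("full_dataset.tsv.gz.part*"))  # assumed part naming
with open("full_dataset.tsv.gz", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)  # stream each 1 GB part without loading it fully
```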
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions, and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (1,395,222,801 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (361,748,721 unique tweets). There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1,000 frequent terms in frequent_terms.csv, the top 1,000 bigrams in frequent_bigrams.csv, and the top 1,000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations, visit: http://www.panacealab.org/covid19/
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, and our pre-print about the dataset is available at https://arxiv.org/abs/2004.03688.
As always, the tweets distributed here are only tweet identifiers (with date and time added), per Twitter's terms and conditions, which permit redistribution of Twitter data ONLY for research purposes. They need to be hydrated before use.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
We collected data on Twitter users using Tweepy to access the Twitter API. We crawled the list of each user account's followers. Twitter allowed a request of a maximum of 200 tweets per time window, and because of Twitter API rate limits we could only make a request every 15 minutes. Next, we obtained the most recent tweets of each user in the study. We extracted the most common hashtags used in the sample tweets and crawled the most recent 50 tweets that contained each hashtag, as well as tweets that mentioned a particular user, for example '@username'. Initially, we chose 101 user accounts and documented the attributes of each user's account (number of followers, a list of followers, and the recent tweets of each follower).
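A hedged sketch of this kind of crawl with tweepy (v4 method names; credentials and the user list are placeholders). Setting wait_on_rate_limit makes tweepy sleep through the 15-minute rate-limit windows automatically.

```python
import tweepy

auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

for screen_name in ["example_user"]:  # placeholder for the 101 studied accounts
    follower_ids = api.get_follower_ids(screen_name=screen_name)
    # Up to 200 most recent tweets per request, as noted above.
    timeline = api.user_timeline(screen_name=screen_name, count=200)
    print(screen_name, len(follower_ids), len(timeline))
```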
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we were filtering other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full dataset, and a cleaned version with no retweets. There are several practical reasons to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1,000 frequent terms, the top 1,000 bigrams, and the top 1,000 trigrams. Some general statistics per day are included for both datasets. We will continue to update the dataset every two days here and weekly on Zenodo. For more information on processing and visualizations, please visit: www.panacealab.org/covid19
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This repository contains tweet IDs and 5 types of contextual information for each: 1) hashtags, 2) their categories, 3) entities obtained by NERD, 4) time references normalized by HeidelTime, and 5) Web categories for attached URLs. The tweets carry history-related hashtags and were collected for the purpose of analyzing how history-related content is disseminated in online social networks. Our IJDL paper shows the analysis results. The preliminary version of the analysis report is available here.
We used the official search API provided by Twitter to collect tweets. Note that three kinds of tweets are typically found on Twitter: tweets, retweets, and quote tweets. A tweet is an original text issued as a post by a Twitter user. A retweet is a copy of an original tweet made to propagate the tweet's content to more users (i.e., one's followers). Finally, a quote tweet copies the content of another tweet and also allows new content to be added; a quote tweet is sometimes called a retweet with a comment. In this work, we simply treat all quote tweets as original tweets, since they include additional information/text. There were, however, only 1,877 (0.2%) tweets recognized as quote tweets in our dataset.
To collect tweets that refer to the past or are related to the collective memory of past events/entities, we performed hashtag-based crawling together with a bootstrapping procedure.
At the beginning, we gathered several historical hashtags selected by experts (e.g. #HistoryTeacher, #history, #WmnHist).
In addition, we prepared several hashtags that are commonly used when referring to the past: #onthisday, #thisdayinhistory, #throwbackthursday, #otd. We then collected tweets containing these hashtags using the official Twitter search API.
The collected tweets were issued from 8 March 2016 to 2 July 2018.
Bootstrapping allowed us to discover other hashtags frequently used together with the seed hashtags. Tweets tagged with such hashtags were then included in the seed set after manual inspection of all discovered hashtags for their relation to history, filtering out unrelated ones; a minimal sketch of the co-occurrence counting step follows below.
In total, we gathered 147 history-related hashtags, which allowed us to collect 2,370,252 tweet IDs pointing to 882,977 tweets and 1,487,275 retweets.
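A hedged sketch of the hashtag bootstrapping step. The data structures and the frequency threshold are assumptions; collected tweets are taken to be dicts with an 'entities' field as returned by the Twitter API.

```python
from collections import Counter

SEEDS = {"#onthisday", "#thisdayinhistory", "#throwbackthursday", "#otd"}

def candidate_hashtags(tweets, min_count=50):
    """Count hashtags co-occurring with seed hashtags; frequent ones become
    candidates for manual inspection before joining the seed set."""
    counts = Counter()
    for tweet in tweets:
        tags = {"#" + h["text"].lower() for h in tweet["entities"]["hashtags"]}
        if tags & SEEDS:
            counts.update(tags - SEEDS)
    return [tag for tag, n in counts.most_common() if n >= min_count]
```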
Related papers:
The dataset used in this study is a collection of English tweets downloaded by querying the Twitter API.
Primarily, we used tweepy, a very useful Python library for collecting data from Twitter via the Twitter API. As the Twitter API supports language filtering and allows specifying the language of the retrieved posts, we set the optional language parameter in the Twitter search URL to 'en' to extract English tweets.
In total, 3,085 tweets were collected and labeled into four emotion classes: 1 for sad, 2 for happy, 3 for angry, and 4 for love.
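A hedged sketch of such language-filtered collection with tweepy; the query string and credentials are placeholders.

```python
import tweepy

auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# lang="en" restricts results to English tweets, as described above.
tweets = [
    status.text
    for status in tweepy.Cursor(api.search_tweets, q="placeholder query", lang="en").items(200)
]
```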
This dataset is a collection of scraped public Twitter updates used in coordination with an academic project to study the geolocation data related to twittering. From the explanatory PDF in the dataset collection: We provide both the training set and the test set (collected from September 2009 to January 2010) used in the paper You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users, CIKM 2010. The training set contains 115,886 Twitter users and 3,844,612 updates from those users; all user locations are self-labeled within the United States at city-level granularity. The test set contains 5,136 Twitter users and 5,156,047 tweets from those users; all user locations were uploaded from their smartphones in the form "UT: Latitude,Longitude". Please cite the following paper when using the dataset: Z. Cheng, J. Caverlee, and K. Lee. You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, Oct 2010.
Data usage terms: https://www.gesis.org/en/institute/data-usage-terms
This dataset consists of IDs of geotagged Twitter posts from within the United States. They are provided as files per day and state, as well as per day and county. In addition, files containing the aggregated number of hashtags from these tweets are provided per day and state and per day and county. The data are organized as one ZIP file per month, containing several zip files per day, which hold the txt files with the ID/hashtag information.
Also part of the dataset are two shapefiles for the US counties and states, and Python scripts for the data collection and for sorting geotags into counties; a sketch of the county-assignment idea follows below.
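A hedged illustration of sorting geotags into counties. This is not one of the released scripts: it uses geopandas and assumes the county shapefile has a NAME attribute.

```python
import geopandas as gpd
from shapely.geometry import Point

counties = gpd.read_file("counties.shp")  # county shapefile shipped with the dataset

def county_of(lat: float, lon: float):
    """Return the county containing the given tweet coordinate, if any."""
    hits = counties[counties.contains(Point(lon, lat))]  # point-in-polygon test
    return hits.iloc[0]["NAME"] if not hits.empty else None  # NAME is an assumed column
```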
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
SSH CENTRE (Social Sciences and Humanities for Climate, Energy aNd Transport Research Excellence) is a Horizon Europe project engaging directly with stakeholders across research, policy, and business (including citizens) to strengthen social innovation, SSH-STEM collaboration, transdisciplinary policy advice, inclusive engagement, and SSH communities across Europe, accelerating the EU's transition to carbon neutrality. SSH CENTRE is based on a range of activities related to Open Science, inclusivity and diversity (especially with regard to Southern and Eastern Europe and different career stages), including: development of novel SSH-STEM collaborations to facilitate the delivery of the EU Green Deal; SSH knowledge brokerage to support regions in transition; and the effective design of strategies for citizen engagement in EU R&I activities. Outputs include action-led agendas and building stakeholder synergies through regular Policy Insight events. This is captured in a high-profile virtual SSH CENTRE generating and sharing best practice for SSH policy advice, overcoming fragmentation to accelerate the EU's journey to a sustainable future.

The documents uploaded here are part of WP2, whereby novel, interdisciplinary teams were provided funding to undertake activities to develop a policy recommendation related to EU Green Deal policy. Each of these policy recommendations, and the activities that inform them, will be written up as a chapter in an edited book collection. Three books will make up this edited collection: one on climate, one on energy and one on mobility. As part of writing a chapter for the SSH CENTRE book on 'Mobility', we set out to analyse the sentiment of users on Twitter regarding shared and active mobility modes in Brussels. This involved collecting tweets between 2017 and 2022. A tweet was collected if it contained a previously defined mobility keyword (for example: metro) and either the name of a (local) politician, a neighbourhood or municipality, or a (shared) mobility provider. The file attached to this Zenodo webpage is a CSV file containing the collected tweets.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
This dataset is an extract of a wider database aimed at collecting Twitter users' friends (the other accounts one follows). The overall goal is to study users' interests through whom they follow and their connection to the hashtags they have used.
It is a list of Twitter user records. In the JSON format, one Twitter user is stored per object in this list of more than 40,000 objects. Each object holds:
avatar: URL of the profile picture
followerCount: the number of followers of this user
friendsCount: the number of accounts this user follows
friendName: the @name (without the '@') of the user (beware: this name can be changed by the user)
id: user ID; this number cannot change (you can retrieve the screen name from it with this service: https://tweeterid.com/)
friends: the list of IDs the user follows (i.e., the IDs of users followed by this user)
lang: the language declared by the user (in this dataset there is only "en" (English))
lastSeen: the timestamp of this user's most recent tweet
tags: the hashtags (with or without '#') used by the user; these are the trending topics the user tweeted about
tweetID: ID of the last tweet posted by this user
The CSV format is also available and uses the same naming convention.
These users were selected because they tweeted on Twitter trending topics; I selected users that have at least 100 followers and follow at least 100 other accounts (in order to filter out spam and non-informative/empty accounts), as in the sketch below.
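As a minimal sketch, the same selection criterion can be re-applied when loading the data. The file name is an assumption; the field names are those documented above.

```python
import json

with open("twitter_profiles.json") as f:  # assumed file name
    users = json.load(f)

# Re-apply the selection criterion described above.
active = [
    u for u in users
    if u["followerCount"] >= 100 and len(u["friends"]) >= 100
]
print(len(active), "profiles with at least 100 followers and 100 friends")
```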
This dataset was built by Hubert Wassner (me) using the public Twitter API. More data can be obtained on request (hubert.wassner AT gmail.com); at this time I have collected over 5 million profiles in different languages. Some more information can be found here (in French only): http://wassner.blogspot.fr/2016/06/recuperer-des-profils-twitter-par.html
No public research has been done (until now) on this dataset. I made a private application, described here: http://wassner.blogspot.fr/2016/09/twitter-profiling.html (in French), which uses the full dataset (millions of full profiles).
One can analyse a lot of things with this dataset:
Feel free to ask any question (or help request) via Twitter : @hwassner
Enjoy! ;)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
General Description
This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.
Data Collection Method
Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.
Dataset Content
ID: A unique identifier for each tweet.
text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.
polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).
favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.
retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.
user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.
user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.
user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.
user_followers_count: The current number of followers the account has. It is a non-negative integer.
user_friends_count: The number of users that the account is following. It is a non-negative integer.
user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.
user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.
user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.
user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.
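For instance, a quick look at the label distribution and engagement with pandas; only the file name is an assumption, the column names are those documented above.

```python
import pandas as pd

df = pd.read_csv("ai_tweets_es.csv")  # assumed file name

# Distribution of sentiment labels and mean engagement per class.
print(df["polarity"].value_counts())
print(df.groupby("polarity")[["favorite_count", "retweet_count"]].mean())
```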
Cite as
Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.
Potential Use Cases
This dataset is aimed at academic researchers and practitioners with interests in:
Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.
Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.
Exploring correlations between user engagement metrics and sentiment in discussions about AI.
Data Format and File Type
The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.
License
The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Twitter data was collected using Twitter's Application Programming Interface (API) and Tweepy, a Python library for accessing the Twitter API. Keywords related to COVID-19, such as Coronavirus, ncov, Wuhan, China, Covid-19, Epidemic, Pandemic, SocialDistancing, etc., were used to collect the tweets. Only tweets that were in English and that had a geo-tag were collected. During exploratory data analysis, we noticed that a number of tweets consisted only of isolated words rather than proper sentences, and analyzing the emotion of such tweets might not give a proper overview of the emotions. Thus, only tweets with at least 6 words were used, which significantly reduced the number of tweets collected. Finally, we had over 1 million tweets spanning February, March, April, May, and June. The tweets were then further processed to remove all HTML text, '@' mentions, URL links, and #hashtags.
The data was analyzed using a machine learning model, and the tweets were categorized into various emotions. The dataset provides the count of tweets per country per emotion for the 5 months.
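A hedged sketch of the cleaning and minimum-length filter described above; the exact regular expressions are assumptions.

```python
import re

def clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML remnants
    text = re.sub(r"@\w+", " ", text)           # strip @ mentions
    text = re.sub(r"https?://\S+", " ", text)   # strip URL links
    text = re.sub(r"#\w+", " ", text)           # strip #hashtags
    return " ".join(text.split())

def keep(tweet: str) -> bool:
    # Keep only tweets with at least 6 words after cleaning, as described above.
    return len(clean(tweet).split()) >= 6
```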