Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science in the University of Oviedo, for the sole purpose of non-commercial research and it just includes tweet ids.
The dataset contains tweet IDs for all the published tweets (in any language) bettween March 21, 2006 and July 31, 2009 thus comprising the first whole three years of Twitter from its creation, that is, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).
It covers several defining issues in Twitter, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US Presidential Elections, the first Obama’s inauguration speech or the 2009 Iran Election protests (one of the so-called Twitter Revolutions).
Finally, it does contain tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French) so it should be possible–at least in theory–to analyze international events from different cultural perspectives.
The dataset was completed in November 2016 and, therefore, the tweet IDs it contains were publicly available at that moment. This means that there could be tweets public during that period that do not appear in the dataset and also that a substantial part of tweets in the dataset has been deleted (or locked) since 2016.
To make easier to understand the decay of tweet IDs in the dataset a number of representative samples (99% confidence level and 0.5 confidence interval) are provided.
In general terms, 85.5% ±0.5 of the historical tweets are available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the amount of tweets vary greatly throughout the period of three years covered in the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:
March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).
June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).
September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).
December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).
March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).
June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).
September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).
December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).
March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).
June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).
September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).
December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).
March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).
June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.
At the moment of cleaning all the data to publish this dataset there seemed to be a gap between April 1, 2008 to July 7, 2008 (actually, the data was not missing but in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m) I simply produced all those that were created between those two dates. All those tweets actually existed but a number of them were obviously private and not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.
In other words, what you see in that period (April to July, 2008) is not actually a huge number of tweets having been deleted but the combination of deleted and non-public tweets (whose IDs should not be in the dataset for performance purposes when rehydrating the dataset).
Additionally, given that not everybody will need the whole period of time the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).
If you use this dataset in any way please cite that preprint (in addition to the dataset itself).
If you need to contact me you can find me as @PFCdgayo in Twitter.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains tweet IDs and their 5 types of contextual information including 1) hashtags, 2) their categories, 3) entities obtained by NERD, 4) time-references normalized by Heideltime, and 5) Web categories for URLs attached with history-related hashtag that are related to history and that were collected for the purpose of analyzing how history-related content is disseminated in online social networks. Our IJDL paper shows the analysis results. The preliminary version of the analysis report is available here.
We used the Twitter official search API provided by Twitter to collect tweets. Note that three kinds of tweets are typically found in Twitter: tweets, retweets and quote tweets. Tweet is an original text issued as a post by a Twitter user. A retweet is a copy of an original tweet for the purpose of propagating the tweet content to more users (i.e., one's followers). Finally, a quote tweet copies the content of another tweet and allows also to add new content. A quote tweet is sometimes called a retweet with a comment. In this work, we simply treat all quote tweets as original tweets since they include additional information/text. There were however only 1,877 (0.2%) tweets recognized as quote tweets in our dataset.
To collect tweets that refer to the past or are related to collective memory of past events/entities, we performed hashtag based crawling together with bootstrapping procedure.
At the beginning, we gathered several historical hashtags selected by experts (e.g. #HistoryTeacher, #history, #WmnHist).
In addition, we prepared several hashtags that are commonly used when referring to the past: #onthisday, #thisdayinhistory, #throwbackthursday, #otd. We then collected tweets that contain these hashtags by using Twitter official search API.
The collected tweets were issued from 8 March 2016 to 2 July 2018.
Bootstrapping allowed us to search for other hashtags frequently used with the seed hashtags. The tweets tagged by such hashtags were then included into the seed set after the manual inspection of all the discovered hashtags as of their relation to the history, and filtering ones that are unrelated.
In total, we gathered 147 history-related hashtags which allowed us to collect 2,370,252 tweet IDs pointing to 882,977 tweets and 1,487,275 re-tweets.
Related papers:
Facebook
TwitterA list of 10,538 Twitter IDs for tweets harvested between 4 January at 11am and 9 January at 11am using Social Feed Manager. As this used the search API, the 4 January at 11am crawl went back about 5-9 days. Tweet IDs included, as is a log of the decisions made to curate this dataset.
Facebook
TwitterPaper DOI: 10.51685/jqd.2025.011 Paper abstract: This paper introduces the Twitter History and Image Sharing (THIS) datasets. These four related datasets enable the study of Twitter \emph{without the release of tweets or user information}. Both are derived from a corpus of 14.596 billion geolocated tweets streamed from September 1, 2013 through March 14, 2023. Two Twitter History datasets provide data on the number of tweets, tweets by language, and user data by country from September 1, 2013 through March 14, 2023. A third Twitter History dataset provides data on the number of new user registrations by country from March 21, 2006, the start of Twitter, through March 14, 2023. Image Sharing is based on the 1.676 billion images shared during this period and the 956.049 million still available for download in early 2024. It provides data on the number of images shared and still available from September 1, 2013 through March 14, 2023. The THIS datasets enable the study of Twitter itself and its differential use across countries, including in response to specific events, and the paper demonstrates applications to correlates of image sharing and removal, behavior around national executive elections, event detection, and digital repression. While this paper is not the first to study Twitter, it is, as far as we are aware, the first to provide datasets enabling other researchers to do the same.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The US has historically been the target country for Twitter since its launch in 2006. This is the full breakdown of Twitter users by country.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Social media opinion has become a medium to quickly access large, valuable, and rich details of information on any subject matter within a short period. Twitter being a social microblog site, generate over 330 million tweets monthly across different countries. Analysing trending topics on Twitter presents opportunities to extract meaningful insight into different opinions on various issues.
Aim
This study aims to gain insights into the trending yahoo-yahoo topic on Twitter using content analysis of selected historical tweets.
Methodology
The widgets and workflow engine in the Orange Data mining toolbox were employed for all the text mining tasks. 5500 tweets were collected from Twitter using the “yahoo yahoo” hashtag. The corpus was pre-processed using a pre-trained tweet tokenizer, Valence Aware Dictionary for Sentiment Reasoning (VADER) was used for the sentiment and opinion mining, Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI) was used for topic modelling. In contrast, Multidimensional scaling (MDS) was used to visualize the modelled topics.
Results
Results showed that "yahoo" appeared in the corpus 9555 times, 175 unique tweets were returned after duplicate removal. Contrary to expectation, Spain had the highest number of participants tweeting on the 'yahoo yahoo' topic within the period. The result of Vader sentiment analysis returned 35.85%, 24.53%, 15.09%, and 24.53%, negative, neutral, no-zone, and positive sentiment tweets, respectively. The word yahoo was highly representative of the LDA topics 1, 3, 4, 6, and LSI topic 1.
Conclusion
It can be concluded that emojis are even more representative of the sentiments in tweets faster than the textual contents. Also, despite popular belief, a significant number of youths regard cybercrime as a detriment to society.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
On Aug 5 2019 there was an historical decision taken in India's History - Article 370 was abolished.While time will tell us if this was the right decision or not, there was a flooding on tweets on this topic when it was announced in the parliament and was sign of an new era in indian politics.
We have scraped data from period of 5th Aug to 10th Aug when the abolishment of Article 370 was announced.There are about 190985 tweets. The dataset does not include any retweets. The tweets contain information of when it was tweeted, what was the tweet about and who tweeted, number of replies, number of people who liked the tweet and also number of replies to the tweet,
The tweets were scrapped using the below github repo, which allows to scrap tweets without using the Twitter REST API
https://github.com/jonbakerfish/TweetScraper
Based on this data, here are some useful ways of deriving insights and analysis:
The data will be updated soon with more tweets.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Twitter is ranked as the 12h most popular social media site in the world. The platform currently has 611 million active monthly users.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of metadata related to 24,508 news events, collected from Twitter spanning from August 2013 to June 2015. The events encompasses a total of 193,445,734 tweets produced by 26,127,624 different users.The files contain different aspects of the data.- components.tsv consists of the description of the events (components) of our dataset, consisting of 4 columns separated by tabs. The columns correspond to the component ID, the date of an event, the amount of tweets and a set of keywords describing the event, separated by commas (having a minimum of 2).- componentlocation.tsv consists of the description of the locations where the events happened (“protagonist locations”). The columns correspond to an ID, the component ID, the names of the locations, the frequency (how many times that location was mentioned in the component), the country code, and six more non-relevant columns. Note that one component can be in several rows, one per location being mentioned for that component.- country_protagonized-events.csv consists of the amount of events that one specific country is a protagonist of. It contains two columns, separated by comma, being the first the country code and the second the amount of events (components) that country is a protagonist of.- country_tweets.csv consists of the amount of tweets that one specific country has issued along all the events. It contains two columns, separated by comma, being the first the country code and the second the amount of tweets that country has issued.- participation_data.txt contains a matrix indicating the amount of tweets per country, per event. It contains one row per component ID, and one column per country (plus one column for the component ID); the cell value is the amount of tweets that country has issued for that event.- similarities_no_reciproco_percentile.csv corresponds to the similarity between co-protagonist countries. The columns are in the following order: Country 1, the amount of events Country 1 is a protagonist of, Country 2, the amount of events Country 2 is a protagonist of, the Jaccard Similarity between the two countries (where the country is represented by the set of the component IDs that country is a protagonist of), and the percentile of that similarity value (ranging from 0 to 1).- users_events_distinct.txt corresponds to the amount of unique users participating in an event. The columns are separated by tabs. The first columns is the component ID, the second is the amount of different users for that event, and the third is the amount of of different news sources for that event.- countries.txt is the mapping between country code and country name, separated by space.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The US has the largest number of Twitter users with over a 100 million users. They account for about 16.7% of all Twitter users worldwide.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
https://raw.githubusercontent.com/Masterx-AI/Project_Twitter_Sentiment_Analysis_/main/twitt.jpg" alt="">
Twitter is an online Social Media Platform where people share their their though as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem and we shall help it by creating a strong NLP based-classifier model to distinguish the negative tweets & block such tweets. Can you build a strong classifier model to predict the same?
Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.
Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)
The dataset is download from Kaggle Competetions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv
Facebook
Twitterhttps://brightdata.com/licensehttps://brightdata.com/license
Our Twitter Sentiment Analysis Dataset provides a comprehensive collection of tweets, enabling businesses, researchers, and analysts to assess public sentiment, track trends, and monitor brand perception in real time. This dataset includes detailed metadata for each tweet, allowing for in-depth analysis of user engagement, sentiment trends, and social media impact.
Key Features:
Tweet Content & Metadata: Includes tweet text, hashtags, mentions, media attachments, and engagement metrics such as likes, retweets, and replies.
Sentiment Classification: Analyze sentiment polarity (positive, negative, neutral) to gauge public opinion on brands, events, and trending topics.
Author & User Insights: Access user details such as username, profile information, follower count, and account verification status.
Hashtag & Topic Tracking: Identify trending hashtags and keywords to monitor conversations and sentiment shifts over time.
Engagement Metrics: Measure tweet performance based on likes, shares, and comments to evaluate audience interaction.
Historical & Real-Time Data: Choose from historical datasets for trend analysis or real-time data for up-to-date sentiment tracking.
Use Cases:
Brand Monitoring & Reputation Management: Track public sentiment around brands, products, and services to manage reputation and customer perception.
Market Research & Consumer Insights: Analyze consumer opinions on industry trends, competitor performance, and emerging market opportunities.
Political & Social Sentiment Analysis: Evaluate public opinion on political events, social movements, and global issues.
AI & Machine Learning Applications: Train sentiment analysis models for natural language processing (NLP) and predictive analytics.
Advertising & Campaign Performance: Measure the effectiveness of marketing campaigns by analyzing audience engagement and sentiment.
Our dataset is available in multiple formats (JSON, CSV, Excel) and can be delivered via API, cloud storage (AWS, Google Cloud, Azure), or direct download.
Gain valuable insights into social media sentiment and enhance your decision-making with high-quality, structured Twitter data.
Facebook
TwitterAs of December 2022, X/Twitter's audience accounted for over *** million monthly active users worldwide. This figure was projected to ******** to approximately *** million by 2024, a ******* of around **** percent compared to 2022.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541
The most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the heftiest temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, while accompanied by environmental disaster events information. These dimensions were produced by testing and evaluating a plethora of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, Textblob, Flair, and LDA.
The following columns are in the dataset:
➡ created_at: The timestamp of the tweet. ➡ id: The unique id of the tweet. ➡ lng: The longitude the tweet was written. ➡ lat: The latitude the tweet was written. ➡ topic: Categorization of the tweet in one of ten topics namely, seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined. ➡ sentiment: A score on a continuous scale. This scale ranges from -1 to 1 with values closer to 1 being translated to positive sentiment, values closer to -1 representing a negative sentiment while values close to 0 depicting no sentiment or being neutral. ➡ stance: That is if the tweet supports the belief of man-made climate change (believer), if the tweet does not believe in man-made climate change (denier), and if the tweet neither supports nor refuses the belief of man-made climate change (neutral). ➡ gender: Whether the user that made the tweet is male, female, or undefined. ➡ temperature_avg: The temperature deviation in Celsius and relative to the January 1951-December 1980 average at the time and place the tweet was written. ➡ aggressiveness: That is if the tweet contains aggressive language or not.
Since Twitter forbids making public the text of the tweets, in order to retrieve it you need to do a process called hydrating. Tools such as Twarc or Hydrator can be used to hydrate tweets.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a CSV file containing Tweet IDs of 3,805 Tweets from user ID 25073877 posted publicly between Thursday February 25 2016 16:35:12 +0000 to Monday April 03 2017 12:51:01 +0000.This file does not include Tweets' texts nor URLs. Columns in the file areid_strfrom_user_id_str created_at time source user_followers_count user_friends_count Motivations to Share this DataArchived Tweets can provide interesting insights for the study of contemporary history of media, politics, diplomacy, etc. The queried account is a public account widely agreed to be of exceptional national and international public interest. Though they provide public access to tweeted content in real time, Twitter Web and mobile clients are not suited for appropriate Tweet corpus analysis. For anyone researching social media, access to the data is absolutely essential in order to perform, review and reproduce studies. Archiving Tweets of public interest due to their historic significance is a means to both preserve and enable reproducible study of this form of rapid online communication that otherwise can very likely become unretrievable as time passes. Due to Twitter's current business model and API limits, to date collecting in real time is the only relatively reliable method to archive Tweets at a small scale. Methodology and LimitationsThe Tweets contained in this file were collected by Ernesto Priego using a Python script. The data collection search query was from:realdonaldtrump. A trigger was scheduled to collect atuomatically every hour. The original data harvesting was refined to delete duplications, to subscribe to Twitter's Terms and Conditions and so that the data was sorted in chronological order.Duplication of data due to the automated collection is possible so further data refining might be required. The file may not contain data from Tweets deleted by the queried user account immediately after original publication. Both research and experience show that the Twitter search API is not 100% reliable. (Gonzalez-Bailon, Sandra, et al. 2012).Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet posted by the queried account during the indicated period. This file dataset is shared for archival, comparative and indicative educational research purposes only. The content included is from a public Twitter account and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.The original Tweets, their contents and associated metadata were published openly on the Web from the queried public account and are responsibility of the original authors. Original Tweets are likely to be copyright their individual authors but please check individually.No private personal information is shared in this dataset. As indicated above this dataset does not contain the text of the Tweets. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road.This dataset is shared to archive, document and encourage open educational research into political activity on Twitter.Other ConsiderationsAll Twitter users agree to Twitter's Privacy and data sharing policies. Social media research remains in its infancy and though work has been done to develop best practices there is yet no agreement on a series of grey areas relating to reseach methodologies including ad hoc social media specific research ethics guidelines for reproducible research. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. Reproducibility is considered here a key value for robust and trustworthy research. Different scholarly professional associations like the Modern Language Association recognise Tweets, datasets and other online and digital resources as citeable scholarly outputs.The data contained in the deposited file is otherwise available elsewhere through different methods.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset represents version 2 of this dataset. The previous version was published on June 30, 2020.This dataset contains the tweet ids of 39,373,774 tweets, which are part of the Schlesinger Library #metoo Digital Media Collection. This second version of the dataset represents the full set of tweets collected throughout the project, tweets range from October 15, 2017 to December 31, 2022. The previous version of this dataset extended to March 31, 2020. Tweets between October 15, 2017 and December 10, 2018 were licensed from Twitter's Historical PowerTrack and received through GNIP. Tweets after December 10, 2018 were collected weekly from the Twitter API through Social Feed Manager using the POST statuses/filter method of the Twitter Stream API.The following list of 76 terms includes the hashtags used to collect data for this dataset : #metoo, #timesup, #metoostem, #sciencetoo, #metoophd, #shittymediamen, #churchtoo, #ustoo, #metooMVMT, #ARmetoo, #TimesUpAR, #metooSociology, #metooSexScience, #timesupAcademia, #metooMedicine, #MyCampusToo, #howiwillchange, #iwill, #believewomen, #GoTeal, #BelieveChristine, #IStandWithDrFord, #IStandWithChristineBlaseyFord, #believesurvivors, #whyididntreport, #himtoo, #istandwithbrett, #confirmkavanaguhnow, #metooMcdonalds, #metoomovement, #muteRKelly, #WeBelieveDrFord, #WeBelieveSurvivors, #HandsOffPantsOn, #MeAt14, #HeToo, #MeTooLiars, #metoolynchings, #metoohucksters, #metoohustle, #ItWasMe, #Ihave, #TimesUpTech, #GoogleWalkout, #mosquemetoo, #faithandmetoo, #SilenceIsNotSpiritual, #HealMeToo, #TimesUpHarvard, #NoCarveOut, #TimesUpx2, #MeetingsToo, #metoonatsec, #healmetoo, #GamAni, #ShulToo, #harvardhearsyou, #metooarcheology, #TimesUpPayUp, #metooarcheology, #metooHBCU, #TimesUpHC, #aidtoo, #garmentmetoo, #mutemetoo, #mutetimesup, #metoopolisci, #copstoo, #TimesUpBiden, #MeTooNoMatterWho, #IBelieveTara, #BelieveAllWomen, #metoomilitary, #harvard38, #comaroff, and #harvardletter.The final four hashtags in this list were first crawled on February 10, 2022.Because of the size of the files, the list of identifiers are split in 41 files containing up to 1,000,000 ids each.Per Twitter's Developer Policy, tweet ids may be publicly shared for academic purposes; tweets may not. Therefore, this dataset only contains tweet ids. In order to retrieve tweets still available (not deleted by users) tools like Hydrator are available.Subsets of only the #metoo seed are also available by quarterly datasets.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With over 611 million monthly active users, building a huge Twitter following is not an easy task. These are the top 25 accounts with the most followers on Twitter right now.
Facebook
TwitterBetween July and December 2024, over 335 million accounts on X (formerly Twitter) were suspended for reasons of spam or platform manipulation. User-informed labels were added to 66 million posts after being reported for spam.
Facebook
TwitterThere are total of 20 CSV files including tweets related to COVID-19 from 20 March 2020 to 08 April 2020.
For each file, the following columns are included. Columns: coordinates, created_at, hashtags, media, urls, favorite count, id, in_reply_to_screen_name, in_reply_to_status_id, in_reply_to_user_id.
We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.
Your data will be in front of the world's largest data science community. What questions do you want to see answered?
The relevant sections of Twitter's Terms of Service [1] and Developer Agreement [2]. ** According to Twitter's Developer Policy §6 [3]: "If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs" and "any Content provided to third parties via non-automated file download remains subject to this Policy". [1] https://twitter.com/tos?lang=en [2] https://dev.twitter.com/overview/terms/agreement [3] https://dev.twitter.com/overview/terms/policy#6.Update_Be_a_Good_Partner_to_Twitter
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This datasets is an extract of a wider database aimed at collecting Twitter user's friends (other accound one follows). The global goal is to study user's interest thru who they follow and connection to the hashtag they've used.
It's a list of Twitter user's informations. In the JSON format one twitter user is stored in one object of this more that 40.000 objects list. Each object holds :
avatar : URL to the profile picture
followerCount : the number of followers of this user
friendsCount : the number of people following this user.
friendName : stores the @name (without the '@') of the user (beware this name can be changed by the user)
id : user ID, this number can not change (you can retrieve screen name with this service : https://tweeterid.com/)
friends : the list of IDs the user follows (data stored is IDs of users followed by this user)
lang : the language declared by the user (in this dataset there is only "en" (english))
lastSeen : the time stamp of the date when this user have post his last tweet.
tags : the hashtags (whith or without #) used by the user. It's the "trending topic" the user tweeted about.
tweetID : Id of the last tweet posted by this user.
You also have the CSV format which uses the same naming convention.
These users are selected because they tweeted on Twitter trending topics, I've selected users that have at least 100 followers and following at least 100 other account (in order to filter out spam and non-informative/empty accounts).
This data set is build by Hubert Wassner (me) using the Twitter public API. More data can be obtained on request (hubert.wassner AT gmail.com), at this time I've collected over 5 milions in different languages. Some more information can be found here (in french only) : http://wassner.blogspot.fr/2016/06/recuperer-des-profils-twitter-par.html
No public research have been done (until now) on this dataset. I made a private application which is described here : http://wassner.blogspot.fr/2016/09/twitter-profiling.html (in French) which uses the full dataset (Millions of full profiles).
On can analyse a lot of stuff with this datasets :
Feel free to ask any question (or help request) via Twitter : @hwassner
Enjoy! ;)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science in the University of Oviedo, for the sole purpose of non-commercial research and it just includes tweet ids.
The dataset contains tweet IDs for all the published tweets (in any language) bettween March 21, 2006 and July 31, 2009 thus comprising the first whole three years of Twitter from its creation, that is, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).
It covers several defining issues in Twitter, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US Presidential Elections, the first Obama’s inauguration speech or the 2009 Iran Election protests (one of the so-called Twitter Revolutions).
Finally, it does contain tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French) so it should be possible–at least in theory–to analyze international events from different cultural perspectives.
The dataset was completed in November 2016 and, therefore, the tweet IDs it contains were publicly available at that moment. This means that there could be tweets public during that period that do not appear in the dataset and also that a substantial part of tweets in the dataset has been deleted (or locked) since 2016.
To make easier to understand the decay of tweet IDs in the dataset a number of representative samples (99% confidence level and 0.5 confidence interval) are provided.
In general terms, 85.5% ±0.5 of the historical tweets are available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the amount of tweets vary greatly throughout the period of three years covered in the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:
March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).
June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).
September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).
December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).
March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).
June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).
September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).
December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).
March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).
June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).
September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).
December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).
March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).
June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.
At the moment of cleaning all the data to publish this dataset there seemed to be a gap between April 1, 2008 to July 7, 2008 (actually, the data was not missing but in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m) I simply produced all those that were created between those two dates. All those tweets actually existed but a number of them were obviously private and not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.
In other words, what you see in that period (April to July, 2008) is not actually a huge number of tweets having been deleted but the combination of deleted and non-public tweets (whose IDs should not be in the dataset for performance purposes when rehydrating the dataset).
Additionally, given that not everybody will need the whole period of time the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).
If you use this dataset in any way please cite that preprint (in addition to the dataset itself).
If you need to contact me you can find me as @PFCdgayo in Twitter.