This statistic described the distribution of Twitter usage in the Middle East and North Africa in 2016, by language. During 2016, the most used language on Twitter in the MENA region was Arabic, with ** percent.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection includes data for 30 different Twitter datasets associated with real-world events. The datasets were collected between 2012 and 2016, always using the streaming API with a set of keywords. These datasets are released in accordance with Twitter's TOS, which allows sharing of tweet IDs, and are intended for non-commercial research. Note: Twitter's developer policy doesn't allow sharing more than 1,500,000 tweet IDs (https://dev.twitter.com/overview/terms/policy#updated-policy), unless the author is affiliated with an academic institution (which is my case) and tweet IDs are solely used for non-commercial purposes (https://twittercommunity.com/t/policy-update-clarification-research-use-cases/87566). Hence, by downloading these datasets you agree that you will not use them for commercial purposes. Please cite the following paper if you make use of these datasets for your research: https://onlinelibrary.wiley.com/doi/full/10.1002/asi.24026. See the README file for more details.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science in the University of Oviedo, for the sole purpose of non-commercial research and it just includes tweet ids.
The dataset contains tweet IDs for all tweets (in any language) published between March 21, 2006 and July 31, 2009, thus comprising the first three full years of Twitter since its creation, that is, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).
It covers several defining developments on Twitter, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US Presidential Elections, Obama's first inauguration speech or the 2009 Iran Election protests (one of the so-called Twitter Revolutions).
Finally, it contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.
The dataset was completed in November 2016 and, therefore, the tweet IDs it contains were publicly available at that moment. This means that there could be tweets that were public during the covered period but do not appear in the dataset, and also that a substantial part of the tweets in the dataset has been deleted (or locked) since 2016.
To make it easier to understand the decay of tweet IDs in the dataset, a number of representative samples (99% confidence level and 0.5 confidence interval) are provided.
In general terms, 85.5% ±0.5 of the historical tweets are available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly throughout the three-year period covered in the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:
March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).
June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).
September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).
December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).
March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).
June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).
September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).
December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).
March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).
June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).
September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).
December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).
March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).
June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
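As a rough illustration of how sample sizes for a 99% confidence level and a ±0.5 percentage-point margin could be derived, the sketch below applies Cochran's formula with a finite population correction. It assumes the most conservative proportion (p = 0.5) and a z-score of 2.576; the function name `sample_size` is hypothetical, and this is not necessarily the procedure used to build the published samples.

```python
import math

def sample_size(population: int, margin: float = 0.005,
                z: float = 2.576, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction (assumed method)."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)   # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)          # finite population correction
    return min(population, math.ceil(n))

# Tweet counts taken from two of the intervals listed above.
for label, total in [("March 21, 2006 to June 18, 2006", 5_512),
                     ("June 2, 2009 to July 31, 2009", 589_116_341)]:
    print(label, sample_size(total))
```

For small intervals the required sample approaches the whole population, while for the largest intervals it levels off at roughly 66 thousand tweets.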
The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.
When cleaning all the data to publish this dataset, there seemed to be a gap between April 1, 2008 and July 7, 2008 (actually, the data was not missing but in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all those that were created between those two dates. All those tweets actually existed, but a number of them were obviously private and not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.
In other words, what you see in that period (April to July, 2008) is not actually a huge number of tweets having been deleted but the combination of deleted and non-public tweets (whose IDs should not be in the dataset for performance purposes when rehydrating the dataset).
Additionally, given that not everybody will need the whole period of time, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
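A minimal sketch of how such a date index could be used to keep only the tweet IDs from a chosen period. It assumes that date-tweet-id.tsv has two tab-separated columns (a date and the earliest tweet ID for that date) and that tweet IDs grow with time in this era; the column layout, the function names and the sample rows are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

def load_date_index(tsv_file: Iterable[str]) -> Dict[str, int]:
    """Map each date string to the earliest tweet ID for that date."""
    index = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        # Rows without a numeric second column (e.g. a header) are skipped.
        if len(row) >= 2 and row[1].strip().isdigit():
            index[row[0].strip()] = int(row[1])
    return index

def ids_in_range(tweet_ids: Iterable[int], index: Dict[str, int],
                 start_date: str, end_date: str) -> Iterator[int]:
    """Yield IDs from start_date (inclusive) up to end_date (exclusive)."""
    low, high = index[start_date], index[end_date]
    return (tid for tid in tweet_ids if low <= tid < high)

# Made-up rows for illustration only; real values come from date-tweet-id.tsv.
sample_tsv = io.StringIO("2008-04-01\t1000\n2008-04-02\t1500\n2008-04-03\t2100\n")
index = load_date_index(sample_tsv)
print(list(ids_in_range([900, 1000, 1499, 1500, 2200], index,
                        "2008-04-01", "2008-04-02")))
```

With the real file, the same lookup would give the ID bounds needed to rehydrate only a few days instead of the full three years.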
For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).
If you use this dataset in any way, please cite that preprint (in addition to the dataset itself).
If you need to contact me, you can find me as @PFCdgayo on Twitter.
This statistic described the distribution of Twitter usage in Saudi Arabia in 2016, by device. During 2016, the most used device for Twitter in Saudi Arabia was mobile, with almost ** percent.
Inspired by open educational resources, open pedagogy, and open source software, the openness movement in education has different meanings for different people. In this study, we use Twitter data to examine the discourses surrounding openness as well as the people who participate in discourse around openness. By targeting hashtags related to open education, we gathered the most extensive dataset of historical open education tweets to date (n = 178,304 tweets and 23,061 users) and conducted a mixed methods analysis of openness from 2009 to 2016. Findings show that the diversity of participants has varied somewhat over time and that the discourse has predominantly revolved around open resources, although there are signs that an increase in interest around pedagogy, teaching, and learning is emerging.
This dataset contains statistics on the usage patterns of the official City of Seattle Twitter accounts, as well as their outreach impact. Coverage: June 2016 to January 2017.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TweetIds and Collections
Erdem Beğenilmiş and Suzan Uskudarli. 2018. Organized Behavior Classification of Tweet Sets using Supervised Learning Methods. In WIMS ’18: 8th International Conference on Web Intelligence, Mining and Semantics, June 25–27, 2018, Novi Sad, Serbia. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3227609.3227665
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for more than 1.9 billion tweets, spanning more than 7 years (2013 - 2020). Metadata information about the tweets as well as extracted entities, sentiments, hashtags, user mentions and URLs are exposed in RDF using established RDF/S vocabularies*. Example queries and more information are available through TweetsKB's home page: https://data.gesis.org/tweetskb/.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains tweet IDs and five types of contextual information for them: 1) hashtags, 2) their categories, 3) entities obtained by NERD, 4) time references normalized by Heideltime, and 5) Web categories for URLs attached to tweets with history-related hashtags. The tweets were collected for the purpose of analyzing how history-related content is disseminated in online social networks. Our IJDL paper shows the analysis results. The preliminary version of the analysis report is available here.
We used Twitter's official search API to collect tweets. Note that three kinds of tweets are typically found on Twitter: tweets, retweets and quote tweets. A tweet is an original text issued as a post by a Twitter user. A retweet is a copy of an original tweet for the purpose of propagating the tweet content to more users (i.e., one's followers). Finally, a quote tweet copies the content of another tweet and also allows new content to be added. A quote tweet is sometimes called a retweet with a comment. In this work, we simply treat all quote tweets as original tweets since they include additional information/text. There were, however, only 1,877 (0.2%) tweets recognized as quote tweets in our dataset.
To collect tweets that refer to the past or are related to the collective memory of past events/entities, we performed hashtag-based crawling together with a bootstrapping procedure.
At the beginning, we gathered several historical hashtags selected by experts (e.g. #HistoryTeacher, #history, #WmnHist).
In addition, we prepared several hashtags that are commonly used when referring to the past: #onthisday, #thisdayinhistory, #throwbackthursday, #otd. We then collected tweets that contain these hashtags using Twitter's official search API.
The collected tweets were issued from 8 March 2016 to 2 July 2018.
Bootstrapping allowed us to search for other hashtags frequently used with the seed hashtags. The tweets tagged with such hashtags were then included in the seed set, after manual inspection of all the discovered hashtags for their relation to history and removal of the unrelated ones.
In total, we gathered 147 history-related hashtags, which allowed us to collect 2,370,252 tweet IDs pointing to 882,977 tweets and 1,487,275 retweets.
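The co-occurrence step of such a bootstrapping procedure might be sketched as follows. This is only an illustration under assumptions: each tweet is reduced to its set of hashtags, candidates are ranked by raw co-occurrence count, and the function name `candidate_hashtags`, the `min_count` cutoff and the toy tweets are hypothetical. The manual inspection described above would still be applied to the returned candidates.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

def candidate_hashtags(tweets: Iterable[Set[str]], seeds: Set[str],
                       min_count: int = 2) -> List[Tuple[str, int]]:
    """Count non-seed hashtags that co-occur with at least one seed hashtag.

    Seeds are expected in lowercase; tweet hashtags are lowercased here.
    """
    counts = Counter()
    for tags in tweets:
        tags = {t.lower() for t in tags}
        if tags & seeds:                 # tweet carries a seed hashtag
            counts.update(tags - seeds)  # count its other hashtags
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]

seeds = {"#history", "#onthisday", "#otd"}
# Toy tweets, each reduced to its set of hashtags (illustrative, not from the dataset).
tweets = [
    {"#history", "#ww2"},
    {"#OTD", "#ww2", "#archives"},
    {"#onthisday", "#archives"},
    {"#cats", "#ww2"},
]
print(candidate_hashtags(tweets, seeds))
```

Hashtags that survive inspection would be added to the seed set and the search repeated until no new relevant hashtags appear.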
Related papers:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset features the training models, emotion classifications and emotion patterns before and after events, related to the paper: F. Kunneman, M. van Mulken and A. Van den Bosch, Anticipointment detection in event tweets (under review).
Abstract of the study: We developed a system to detect positive expectation, disappointment, and satisfaction in tweets that refer to events automatically discovered in the Twitter stream. The emotional content shared on Twitter when referring to public events can provide insights into the presumed and experienced quality of the event. We expected to find a connection between positive expectation and disappointment, a succession that is sometimes referred to as anticipointment. The application of computational approaches makes it possible to detect the presence and strength of this hypothetical relation for a large number of events. We extracted events from a longitudinal data set of Dutch Twitter posts, and modeled classifiers to recognize emotion in the tweets related to those events by means of hashtag-labeled training data. After classifying all tweets before and after the events in our data set, we summarized the collective emotions by calculating the percentage of tweets classified with an emotion as well as ranking tweets based on the classifier confidence score for an emotion and selecting the 90th percentile. Only a weak correlation of around 0.2 was found between positive expectation and disappointment, while a higher correlation of 0.6 was found between positive expectation and satisfaction. The most anticipointing events were events with a clear loss, such as a canceled event or when the favored sports team had lost. We conclude that senders of Twitter posts might be more inclined to share satisfaction than disappointment after a much anticipated event.
Subject period: January 1st 2011 until October 31st 2015
Date: start=2015-11-01; end=2016-02-28 (data collection)
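The two per-event summaries mentioned in the abstract (the percentage of tweets classified with an emotion, and the 90th percentile of classifier confidence) could be computed along the following lines. This is a sketch under assumptions, not the authors' implementation: the 0.5 decision threshold, the nearest-rank percentile definition, the function name `summarize` and the example scores are all hypothetical.

```python
from typing import List, Tuple

def summarize(scores: List[float], threshold: float = 0.5) -> Tuple[float, float]:
    """Return (percentage of tweets at or above threshold, 90th-percentile score)."""
    if not scores:
        raise ValueError("no tweets to summarize")
    # Share of tweets the classifier would label with the emotion.
    pct = 100.0 * sum(s >= threshold for s in scores) / len(scores)
    ranked = sorted(scores)
    # Nearest-rank percentile: rank = ceil(0.9 * n), computed with integers.
    rank = (9 * len(ranked) + 9) // 10
    return pct, ranked[rank - 1]

# Hypothetical classifier confidence scores for the tweets of one event.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
print(summarize(scores))
```

Computing both values per event, before and after the event date, would yield the pairs of numbers that a correlation analysis such as the one in the abstract compares.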
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
File contains a list of Twitter account IDs in ASCII format. These accounts were those which we sampled and then analysed in the paper. The data we used are available from Twitter with the REST API.
This statistic displays the results of a youth survey conducted among 15-34 year olds across ** states in India in 2016 about the frequency of Twitter usage. A majority of respondents, about ** percent, never used the news and social networking service, while about eight percent used it daily during the survey period.
CHECK THE OPEN ACCESS VERSION OF THIS DATASET: https://zenodo.org/record/1095592
Tweet IDs for tweets containing hashtags related to Canada at the 2016 Rio Olympics and Paralympics, held Aug. 5-21 and Sept. 7-18 respectively. These were captured as part of a larger web archiving project focused on Canada's involvement in the Rio Games. Tweet IDs can be hydrated using Ed Summers' twarc (https://github.com/edsu/twarc). Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter. / Tweets were collected July 29 - Sept. 23. Several hashtags were tracked, with new ones added as they were identified: Added July 29: #teamcanada,#equipecanada / Added Aug. 3: #CAN, #CanadaRED, #CanWNT / Added Aug. 8: #GoCanadaGo / Added Aug. 10: #LetsGoCanada, #Canadaolympics, #Flytheflag / Added Aug. 12: #Pennyoleksiak / Added Sept. 6: #Paratough, #Parafort /
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Input data and output of research conducted in the study described in the paper: F. Kunneman and A. Van den Bosch (2016), Open-domain extraction of future events from Twitter, Natural Language Engineering, doi: 10.1017/S1351324916000036. The paper describes a system that extracts future-referring time expressions and entities from Twitter messages, and subsequently detects events as a pair of a date and an entity that are often mentioned in the same tweet. This dataset features the IDs of a large set of Dutch tweets posted in August 2014, which was used as input to the system, as well as the time expression and/or entity that was extracted from each tweet, if any. Furthermore, the detected events are included, each represented as a date, one or more describing terms, the tweet IDs that refer to it and the assessment of the event by human annotators.
CHECK THE OPEN ACCESS VERSION OF THIS DATASET: https://zenodo.org/record/579601
This statistic described the distribution of Twitter usage in Qatar in 2016, by device. During 2016, the most used device for Twitter in Qatar was mobile, with almost ** percent.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Twitter has been widely used to share opinions and sentiments on various topics. Studies have found correlations between the sentiments on Twitter and social trends in the real world. Here, we collected the tweet IDs related to climate change, infectious diseases, and vaccines through Twitter Application Programming Interface (API), which can be useful for further research on different topics. The data ranges from October 30th, 2016, to April 24th, 2021, and is broken down by week.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dengue is a mosquito-borne viral disease which infects millions of people every year, especially in developing countries. Some of the main challenges in fighting the disease are reporting risk indicators and rapidly detecting outbreaks. Traditional surveillance systems rely on passive reporting from health-care facilities, often ignoring human mobility and locating each individual by their home address. Yet, geolocated data are becoming commonplace in social media, which is widely used as a means to discuss a large variety of health topics, including the users' health status. In this dataset paper, we make available two large collections of dengue-related labeled Twitter data. One is a set of tweets available through the Streaming API using the keywords dengue and aedes from 2010 to 2016. The other is the set of all geolocated tweets in Brazil during 2015 (also available through the Streaming API). We detail the process of collecting the tweets and of labeling each tweet containing keywords related to dengue with one of 5 categories: personal experience, information, opinion, campaign, and joke. This dataset can be useful for the development of models for spatial disease surveillance, but also for scenarios such as understanding health-related content in a language other than English, and studying human mobility.
Natural Hazards is a natural disaster dataset with sentiment labels, which contains nearly 50,00 Twitter data about different natural disasters in the United States (e.g., a tornado in 2011, a hurricane named Sandy in 2012, a series of floods in 2013, a hurricane named Matthew in 2016, a blizzard in 2016, a hurricane named Harvey in 2017, a hurricane named Michael in 2018, a series of wildfires in 2018, and a hurricane named Dorian in 2019).