The number of Twitter users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 85.08 million users in 2028, a new peak. Notably, the number of Twitter users has increased continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
The global number of Twitter users was forecast to increase continuously between 2024 and 2028 by a total of 74.3 million users (+17.32 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 503.42 million users in 2028, a new peak. Notably, the number of Twitter users has increased continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in regions like South America and the Americas.
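The two forecasts above each quote an absolute increase, a percentage increase, and a 2028 peak, so the implied 2024 base can be recovered with simple arithmetic. The following is a minimal sketch of that check; the function name is hypothetical, and small deviations from the quoted percentages are expected because the published figures are rounded.

```python
def implied_base_and_growth(peak_millions, increase_millions):
    """Return (implied base-year users, percentage growth) for a forecast.

    Both arguments are in millions of users. The base is simply the
    forecast peak minus the forecast increase.
    """
    base = peak_millions - increase_millions
    growth_pct = 100.0 * increase_millions / base
    return base, growth_pct


# Figures quoted in the two forecasts above (millions of users).
us_base, us_growth = implied_base_and_growth(85.08, 4.3)
world_base, world_growth = implied_base_and_growth(503.42, 74.3)

print(f"United States: base {us_base:.2f}M, growth {us_growth:.2f}%")
print(f"Worldwide:     base {world_base:.2f}M, growth {world_growth:.2f}%")
```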
Worldwide Social Media User in 2021 (Quarterly)
Facebook: https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
Twitter: https://investor.twitterinc.com/home/default.aspx
Instagram: https://investor.fb.com/home/default.aspx
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two datasets of tweets. Every tweet in the first dataset includes at least one name of a large city in the U.S. or elsewhere. The second dataset does not include city names outside the U.S., but contains the names of small, mid-sized, and large cities in the U.S.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact.
Figure: Example Analysis - Inbound Volume for the Top 20 Brands (https://i.imgur.com/nTv3Iuu.png)
Natural language remains the densest encoding of human experience we have, and innovation in NLP has accelerated to power understanding of that data, but the datasets driving this innovation don't match the real language in use today. The Customer Support on Twitter dataset offers a large corpus of modern English (mostly) conversations between consumers and customer support agents on Twitter, and has three important advantages over other conversational text datasets:
The size and breadth of this dataset inspires many interesting questions:
The dataset is a CSV, where each row is a tweet. The columns are described below. Every conversation included has at least one request from a consumer and at least one response from a company. Which user IDs are company user IDs can be calculated using the inbound field.
tweet_id: A unique, anonymized ID for the Tweet. Referenced by response_tweet_id and in_response_to_tweet_id.
author_id: A unique, anonymized user ID. @s in the dataset have been replaced with their associated anonymized user ID.
inbound: Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.
created_at: Date and time when the tweet was sent.
text: Tweet content. Sensitive information like phone numbers and email addresses is replaced with mask values like _email_.
response_tweet_id: IDs of tweets that are responses to this tweet, comma-separated.
in_response_to_tweet_id: ID of the tweet this tweet is in response to, if any.
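As an illustration of how the columns above fit together, here is a minimal sketch that pairs each inbound request with the company replies listed in its response_tweet_id field. The sample rows, the date format, and the assumption that inbound is stored as the strings "True" and "False" are all hypothetical; the real file may differ. Treating the authors of non-inbound tweets as company accounts is likewise an assumption about how the inbound field can be used.

```python
import csv
import io

# Hypothetical sample rows that follow the column layout described above.
SAMPLE_CSV = """tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
1,115712,True,Tue Oct 31 22:10:47 +0000 2017,@sprintcare my phone will not connect,2,
2,sprintcare,False,Tue Oct 31 22:11:45 +0000 2017,@115712 Please send us a DM so we can help.,,1
3,115713,True,Tue Oct 31 22:12:00 +0000 2017,@115714 thanks anyway,,
"""


def pair_requests_with_replies(rows):
    """Return (request_text, reply_text) pairs for inbound tweets."""
    by_id = {row["tweet_id"]: row for row in rows}
    pairs = []
    for row in rows:
        if row["inbound"] != "True":
            continue  # only consumer requests start a pair
        # response_tweet_id is a comma-separated list and may be empty.
        for reply_id in filter(None, row["response_tweet_id"].split(",")):
            reply = by_id.get(reply_id)
            if reply is not None and reply["inbound"] == "False":
                pairs.append((row["text"], reply["text"]))
    return pairs


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
pairs = pair_requests_with_replies(rows)

# Assumption: authors of non-inbound tweets are the company accounts.
company_ids = {row["author_id"] for row in rows if row["inbound"] == "False"}

print(pairs)
print(company_ids)
```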
Know of other brands the dataset should include? Found something that needs to be fixed? Start a discussion, or email me directly at $FIRSTNAME@$LASTNAME.com!
A huge thank you to my friends who helped bootstrap the list of companies that do customer support on Twitter! There are many rocks that would have been left un-turned were it not for your suggestions!
http://rightsstatements.org/vocab/InC/1.0/
This dataset comprises a set of Twitter accounts in Singapore that are used for social bot profiling research conducted by the Living Analytics Research Centre (LARC) at Singapore Management University (SMU). Here a bot is defined as a Twitter account that generates content and/or interacts with other users automatically (at least according to human judgment). In this research, Twitter bots have been categorized into three major types:
Broadcast bot. This bot aims to disseminate information to a general audience by providing, e.g., benign links to news, blogs or sites. Such a bot is often managed by an organization or a group of people (e.g., bloggers).
Consumption bot. The main purpose of this bot is to aggregate content from various sources and/or provide update services (e.g., horoscope reading, weather update) for personal consumption or use.
Spam bot. This type of bot posts malicious content (e.g., to trick people by hijacking certain accounts or redirecting them to malicious sites), or aggressively promotes harmless but invalid/irrelevant content.
This categorization is general enough to cater for new, emerging types of bot (e.g., chatbots can be viewed as a special type of broadcast bot).
The dataset was collected from 1 January to 30 April 2014 via the Twitter REST and streaming APIs. Starting from popular seed users (i.e., users having many followers), their follow, retweet, and user mention links were crawled. The data collection proceeded by adding those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. Using this procedure, a total of 159,724 accounts were collected.
To identify bots, the first step was to select active accounts that tweeted at least 15 times within the month of April 2014. These accounts were then manually checked and labelled, and 589 bots were found among them. As many more human users are expected in the Twitter population, the remaining accounts were randomly sampled and manually checked. With this, 1,024 human accounts were identified. In total, this results in 1,613 labelled accounts.
Related Publication: R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E.-P. Lim. (2016). On profiling bots in social media. Proceedings of the International Conference on Social Informatics (SocInfo’16), 92-109. Bellevue, WA. https://doi.org/10.1007/978-3-319-47880-7_6
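The activity filter used before manual labelling (accounts that tweeted at least 15 times within April 2014) can be sketched as follows. This is only an illustration under stated assumptions: the input format (pairs of account name and timestamp), the function name, and the sample data are hypothetical and not taken from the dataset.

```python
from collections import Counter
from datetime import datetime


def active_accounts(tweets, min_tweets=15, year=2014, month=4):
    """Return accounts with at least `min_tweets` tweets in the given month.

    `tweets` is an iterable of (account, timestamp) pairs; this input
    format is an assumption made for the sketch.
    """
    counts = Counter(
        account
        for account, timestamp in tweets
        if timestamp.year == year and timestamp.month == month
    )
    return {account for account, n in counts.items() if n >= min_tweets}


# Hypothetical activity log: acct_a has 15 tweets in April 2014,
# acct_b has 14, and acct_c has 20 tweets but all in March 2014.
tweets = [("acct_a", datetime(2014, 4, day)) for day in range(1, 16)]
tweets += [("acct_b", datetime(2014, 4, day)) for day in range(1, 15)]
tweets += [("acct_c", datetime(2014, 3, 1))] * 20

active = active_accounts(tweets)
print(active)

# The labelled totals reported above are consistent: bots plus humans.
labelled_total = 589 + 1024
print(labelled_total)
```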
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Description
This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.
Data Collection Method
Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.
Dataset Content
ID: A unique identifier for each tweet.
text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.
polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).
favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.
retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.
user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.
user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.
user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.
user_followers_count: The current number of followers the account has. It is a non-negative integer.
user_friends_count: The number of users that the account is following. It is a non-negative integer.
user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.
user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.
user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.
user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.
Cite as
Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.
Potential Use Cases
This dataset is aimed at academic researchers and practitioners with interests in:
Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.
Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.
Exploring correlations between user engagement metrics and sentiment in discussions about AI.
Data Format and File Type
The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.
License
The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
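One of the use cases listed above is exploring how engagement relates to sentiment. The sketch below groups tweets by polarity and averages a simple engagement score. The sample rows are invented, only a subset of the documented columns is included, and defining engagement as likes plus retweets is an assumption made for illustration rather than the method of the cited paper.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical rows using a subset of the columns described above.
SAMPLE = """ID,text,polarity,favorite_count,retweet_count
1,La IA es fascinante,Positive,10,2
2,No confío en la IA,Negative,4,1
3,Nuevo modelo de IA publicado hoy,Neutral,0,0
4,Me encanta esta herramienta de IA,Positive,20,6
"""


def mean_engagement_by_polarity(rows):
    """Average (favorite_count + retweet_count) per polarity label."""
    grouped = defaultdict(list)
    for row in rows:
        # Assumption: engagement is the sum of likes and retweets.
        engagement = int(row["favorite_count"]) + int(row["retweet_count"])
        grouped[row["polarity"]].append(engagement)
    return {polarity: mean(values) for polarity, values in grouped.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = mean_engagement_by_polarity(rows)
print(result)
```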
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over a period of 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. As geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content and to derive their geolocation information using Nominatim (OpenStreetMap) data at different geolocation granularity levels. In terms of geographical coverage, the dataset spans 218 countries and 47K cities around the world. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset represents the geographical distribution of Twitter users and tweets related to the coronavirus (COVID-19) pandemic across the world. It covers the distribution of geo-tagged COVID-19 tweets, of COVID-19 Twitter users, and of the most mentioned COVID-19 locations worldwide.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TweetNERD - End to End Entity Linking Benchmark for Tweets
This is the dataset described in the paper TweetNERD - End to End Entity Linking Benchmark for Tweets (accepted to the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track).
Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets across 2010-2021, for benchmarking NERD systems on Tweets. This is the largest and most temporally diverse open sourced dataset benchmark for NERD on Tweets and can be used to facilitate research in this area.
The TweetNERD dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The license only applies to the data files present in this dataset. See Data usage policy below.
Check out more details at https://github.com/twitter-research/TweetNERD
Usage
We provide the dataset split across the following tab-separated files:
part_*.public.tsv: Remaining data split into parts in no particular order.
Each file is tab-separated and has the following format:
tweet_id | phrase | start | end | entityId | score |
---|---|---|---|---|---|
22 | twttr | 20 | 25 | Q918 | 3 |
21 | twttr | 20 | 25 | Q918 | 3 |
1457198399032287235 | Diwali | 30 | 38 | Q10244 | 3 |
1232456079247736833 | NO_PHRASE | -1 | -1 | NO_ENTITY | -1 |
For tweets which don't have any entity, the column values for phrase, start, end, entityId, score are set to NO_PHRASE, -1, -1, NO_ENTITY, -1 respectively.
Description of file columns is as follows:
Column | Type | Missing Value | Description |
---|---|---|---|
tweet_id | string | | ID of the Tweet |
phrase | string | NO_PHRASE | entity phrase |
start | int | -1 | start offset of the phrase in text using UTF-16BE encoding |
end | int | -1 | end offset of the phrase in the text using UTF-16BE encoding |
entityId | string | NO_ENTITY | Entity ID. If not missing can be NOT FOUND, AMBIGUOUS, or Wikidata ID of format Q{numbers}, e.g. Q918 |
score | int | -1 | Number of annotators who agreed on the phrase, start, end, entityId information |
In order to use the dataset you need to use the tweet_id column and get the Tweet text using the Twitter API (see the Data usage policy section below).
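Because start and end are expressed in UTF-16BE units, slicing a Python string directly with those offsets can give the wrong span when the text contains characters outside the Basic Multilingual Plane (for example, emoji). The sketch below parses annotation rows and slices text by UTF-16 code units. It rests on several assumptions not stated in the description: that the files carry a header row, that offsets count 16-bit code units, and that end is exclusive. The example text is invented, since tweet text has to be fetched separately.

```python
import csv
import io

# Two rows copied from the format example above, with an assumed header row.
SAMPLE_TSV = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "1457198399032287235\tDiwali\t30\t38\tQ10244\t3\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)


def read_annotations(handle):
    """Parse rows, skipping the placeholder rows for tweets without entities."""
    annotations = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["phrase"] == "NO_PHRASE":
            continue
        annotations.append({
            "tweet_id": row["tweet_id"],  # kept as a string, as documented
            "phrase": row["phrase"],
            "start": int(row["start"]),
            "end": int(row["end"]),
            "entityId": row["entityId"],
            "score": int(row["score"]),
        })
    return annotations


def utf16_slice(text, start, end):
    """Slice `text` by UTF-16 code-unit offsets (end assumed exclusive)."""
    encoded = text.encode("utf-16-be")  # two bytes per code unit, no BOM
    return encoded[2 * start:2 * end].decode("utf-16-be")


annotations = read_annotations(io.StringIO(SAMPLE_TSV))

# Invented text: the leading emoji occupies two UTF-16 code units,
# so code-unit offsets and Python string indices diverge after it.
text = "\U0001F389 Happy Diwali everyone"
print(utf16_slice(text, 9, 15))
print(text[9:15])
```

With the invented text, the UTF-16 slice and the plain string slice return different spans, which is why the offsets should not be applied to Python string indices without conversion.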
Data stats
Split | Number of Rows | Number unique tweets |
---|---|---|
OOD | 34102 | 25000 |
Academic | 51685 | 30119 |
part_0 | 11830 | 10000 |
part_1 | 35681 | 25799 |
part_2 | 34256 | 25000 |
part_3 | 36478 | 25000 |
part_4 | 37518 | 24999 |
part_5 | 36626 | 25000 |
part_6 | 34001 | 24984 |
part_7 | 34125 | 24981 |
part_8 | 32556 | 25000 |
part_9 | 32657 | 25000 |
part_10 | 32442 | 25000 |
part_11 | 32033 | 24972 |
Data usage policy
Use of this dataset is subject to you obtaining lawful access to the Twitter API, which requires you to agree to the Developer Terms Policies and Agreements.
Please cite the following if you use TweetNERD in your paper:
@dataset{TweetNERD_Zenodo_2022_6617192,
  author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
  title = {{TweetNERD - End to End Entity Linking Benchmark for Tweets}},
  month = jun,
  year = 2022,
  note = {{Data usage policy Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).}},
  publisher = {Zenodo},
  version = {0.0.0},
  doi = {10.5281/zenodo.6617192},
  url = {https://doi.org/10.5281/zenodo.6617192}
}
@inproceedings{TweetNERDNeurips2022,
  author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
  pages = {},
  title = {TweetNERD - End to End Entity Linking Benchmark for Tweets},
  volume = {2},
  year = {2022},
  eprint = {arXiv:2210.08129},
  doi = {10.48550/arXiv.2210.08129}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The present dataset contains the Twitter communication of eight international organizations (IOs) in different policy areas that are known to be central in communicating about climate change. The IOs are comparable in their communication, all being parts of the United Nations (UN). The IOs under consideration are FAO, UNDP, UNDRR, UNEP, UNHCR, UNICEF, UNOCHA, and WHO.
The tweets were downloaded and parsed via the Twitter Academic Research API. In total, the dataset contains 222,191 tweet IDs of the tweets posted by the above 8 UN organizations from their official accounts. This number represents the total number of tweets posted by these selected UN organizations from the beginning of their tweeting history until the end of 2019. The dataset is compliant with Twitter's privacy policy, developer agreement, and guidelines for content redistribution, and with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
The dataset consists of two parts:
Unlabeled tweet IDs
The 8 txt files contain the tweet IDs of the tweets posted by the respective UN organizations. The files are summarised in Table 1 below.
File | Organization | Account | Start date | End date | Tweet IDs |
---|---|---|---|---|---|
tweet_ids_FAO_2009_2019.txt | FAO | @FAO | Jan. 2009 | Dec. 2019 | 28,630 |
tweet_ids_UNDP_2009_2019.txt | UNDP | @UNDP | Jul. 2009 | Dec. 2019 | 47,960 |
tweet_ids_UNDRR_2009_2019.txt | UNDRR | @UNDRR | Oct. 2010 | Dec. 2019 | 9,735 |
tweet_ids_UNEP_2009_2019.txt | UNEP | @UNEP | May 2009 | Dec. 2019 | 21,615 |
tweet_ids_Refugees_2008_2019.txt | UNHCR | @Refugees | Jun. 2008 | Dec. 2019 | 42,882 |
tweet_ids_UNICEF_2009_2019.txt | UNICEF | @UNICEF | Jul. 2009 | Nov. 2019 | 34,288 |
tweet_ids_UNOCHA_2011_2019.txt | UNOCHA | @UNOCHA | Jul. 2011 | Jul. 2019 | 12,521 |
tweet_ids_WHO_2008_2019.txt | WHO | @WHO | May 2008 | Dec. 2019 | 24,560 |
Total | | | | | 222,191 |
The dataset contains only tweet IDs to ensure compliance with the terms and conditions mentioned in Twitter's privacy policy, developer agreement, and guidelines for content redistribution. The tweet IDs need to be hydrated before use. For hydrating the present dataset, the Hydrator application may be used; a step-by-step tutorial on how to use Hydrator is also available.
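Before hydration, the ID files typically need to be read and split into batches for whichever lookup tool is used. The following sketch assumes each txt file holds one tweet ID per line, which the description does not confirm; the function names and the batch size are hypothetical, and no API call is made.

```python
from itertools import islice


def read_tweet_ids(lines):
    """Yield non-empty tweet IDs as strings (assumes one ID per line)."""
    for line in lines:
        tweet_id = line.strip()
        if tweet_id:
            yield tweet_id


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# In practice `lines` would be an open file such as
# open("tweet_ids_WHO_2008_2019.txt"); a small list stands in for it here.
lines = ["1\n", "2\n", "\n", "3\n"]
batches = list(batched(read_tweet_ids(lines), size=2))
print(batches)
```

Keeping the IDs as strings avoids any loss of precision if they are later passed through tools that treat large integers as floating-point numbers.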
Labeled dataset related to climate change
This is a subset of the entire dataset described above. Namely, 5,750 tweets are randomly selected from the entire dataset and labeled manually as either "climate change-related" or "not climate change-related". The dataset is available in the file dataset_UN_climate_change_labeled.csv and is summarised in Table 2 below.
Organization | Tweets |
---|---|
FAO | 753 |
UNDP | 1,199 |
UNDRR | 256 |
UNEP | 540 |
UNHCR | 1,114 |
UNICEF | 910 |
UNOCHA | 366 |
WHO | 612 |
Total | 5,750 |
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains tweet IDs and five types of contextual information for them: 1) hashtags, 2) their categories, 3) entities obtained by NERD, 4) time references normalized by HeidelTime, and 5) Web categories for URLs attached to tweets with history-related hashtags. The tweets are related to history and were collected for the purpose of analyzing how history-related content is disseminated in online social networks. Our IJDL paper shows the analysis results. The preliminary version of the analysis report is available here.
We used the official Twitter search API to collect tweets. Note that three kinds of tweets are typically found on Twitter: tweets, retweets and quote tweets. A tweet is an original text issued as a post by a Twitter user. A retweet is a copy of an original tweet for the purpose of propagating the tweet content to more users (i.e., one's followers). Finally, a quote tweet copies the content of another tweet and also allows new content to be added. A quote tweet is sometimes called a retweet with a comment. In this work, we simply treat all quote tweets as original tweets since they include additional information/text. However, only 1,877 (0.2%) of the tweets in our dataset were recognized as quote tweets.
To collect tweets that refer to the past or are related to the collective memory of past events/entities, we performed hashtag-based crawling together with a bootstrapping procedure.
At the beginning, we gathered several historical hashtags selected by experts (e.g. #HistoryTeacher, #history, #WmnHist).
In addition, we prepared several hashtags that are commonly used when referring to the past: #onthisday, #thisdayinhistory, #throwbackthursday, #otd. We then collected tweets that contain these hashtags by using the official Twitter search API.
The collected tweets were issued from 8 March 2016 to 2 July 2018.
Bootstrapping allowed us to search for other hashtags frequently used together with the seed hashtags. All discovered hashtags were manually inspected for their relation to history, unrelated ones were filtered out, and the tweets tagged with the remaining hashtags were added to the seed set.
In total, we gathered 147 history-related hashtags, which allowed us to collect 2,370,252 tweet IDs pointing to 882,977 tweets and 1,487,275 retweets.
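One round of the bootstrapping step described above can be sketched as counting which non-seed hashtags co-occur with seed hashtags, then handing the frequent ones to a human for inspection. This is a simplified illustration, not the authors' implementation: the function name, the min_count threshold, and the sample tweets are hypothetical, and the manual inspection itself is not modelled.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def candidate_hashtags(tweets, seeds, min_count=2):
    """Rank non-seed hashtags that co-occur with at least one seed hashtag.

    Returns (hashtag, count) pairs, most frequent first. The threshold
    `min_count` is an arbitrary choice for this sketch.
    """
    seeds = {seed.lower() for seed in seeds}
    counts = Counter()
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags & seeds:                # tweet carries at least one seed
            counts.update(tags - seeds)  # count only the new hashtags
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Invented example tweets.
tweets = [
    "#OnThisDay in 1969 Apollo 11 landed #history #space",
    "#otd the treaty was signed #WW1 #history",
    "Lunch time #food",
    "#history class today #WW1",
]
candidates = candidate_hashtags(tweets, seeds=["history", "onthisday", "otd"])
print(candidates)
```

Candidates accepted after inspection would be added to the seed set and the procedure repeated on newly collected tweets.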
Related papers:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541
This is the most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the most extensive temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, accompanied by information on environmental disaster events. These dimensions were produced by testing and evaluating a range of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, TextBlob, Flair, and LDA.
The following columns are in the dataset:
➡ created_at: The timestamp of the tweet.
➡ id: The unique ID of the tweet.
➡ lng: The longitude of the location where the tweet was written.
➡ lat: The latitude of the location where the tweet was written.
➡ topic: Categorization of the tweet into one of ten topics, namely seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined.
➡ sentiment: A score on a continuous scale from -1 to 1. Values closer to 1 indicate positive sentiment, values closer to -1 indicate negative sentiment, and values close to 0 indicate no or neutral sentiment.
➡ stance: Whether the tweet supports the belief in man-made climate change (believer), does not believe in man-made climate change (denier), or neither supports nor rejects the belief in man-made climate change (neutral).
➡ gender: Whether the user who posted the tweet is male, female, or undefined.
➡ temperature_avg: The temperature deviation in Celsius, relative to the January 1951-December 1980 average, at the time and place the tweet was written.
➡ aggressiveness: Whether the tweet contains aggressive language or not.
Since Twitter forbids making public the text of the tweets, in order to retrieve it you need to do a process called hydrating. Tools such as Twarc or Hydrator can be used to hydrate tweets.
Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.
The Portuguese footballer is the most-followed person on the photo-sharing platform with 628 million followers. Instagram's own account was ranked first with roughly 672 million followers.
How popular is Instagram?
Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.
Who uses Instagram?
Instagram audiences are predominantly young: recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media platforms for teens and one of the social networks with the biggest reach among teens in the United States.
Celebrity influencers on Instagram
Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2020
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We acquired the data from the George Washington University Libraries Dataverse, namely the Climate Change Tweets Ids data set. This dataset was collected from the Twitter API using Social Feed Manager and totalled 39,622,026 tweets related to climate change. The tweets were collected between September 21, 2017 and May 17, 2019. However, there is a gap in data collection between January 7, 2019 and April 17, 2019. Tweets with the following hashtags and keywords were scraped: climatechange, #climatechangeisreal, #actonclimate, #globalwarming, #climatechangehoax, #climatedeniers, #climatechangeisfalse, #globalwarminghoax, #climatechangenotreal, climate change, global warming, climate hoax.

Due to Twitter's Developer Policy, only the tweet IDs were shared in the database, not the full tweets. Therefore, we had to hydrate the tweet IDs using the Hydrator application. Hydration was carried out in June 2020 and yielded 22,564,380 tweets (some tweets or user accounts are deleted or suspended by Twitter in its standard maintenance procedures). Challenges encountered during data hydration included dealing with deleted tweets or suspended user accounts, which is a common occurrence in Twitter's standard maintenance procedures. We addressed this by using the Hydrator application, which allowed us to recover as much data as possible within the constraints of Twitter's Developer Policy.

In order to comprehensively diagnose Polish social networks and to enable automated classification of Twitter users in terms of their attitude towards vaccinations, we collected a balanced, importance-wise database of Twitter users for manual annotation. The most important keywords used by groups that spread anti-vaccination propaganda were identified. Using our programming pipeline, databases of Polish social media on the topic of the pandemic and attitudes towards vaccinations were obtained.
The raw data contained over 5 million tweets from almost 3,600 users with the following hashtags related to the COVID-19 pandemic in Poland and the war in Ukraine: stopsegregacjisanitarnej, nieszczepimysie, szczepimysie, szczepienie, szczepienia, koronawirus, koronawiruswpolsce, koronawiruspolska, rozliczymysanitarystow, stopss, covid, covid19, sanitaryzm, epidemia, pandemia, plandemia, zelensky, zelenski, wojna, muremzabraunem, konfederacja, wojnanaukrainie, putin, ukraina, ukraine, rosja, russia, wolyn, bandera, upa. Twelve annotators rated the scraped Twitter users based on their posts on a nine-point Likert scale. The samples evaluated by the annotators partially overlapped so that their consistency and reliability could be examined. Statistical tests performed on the data before and after binning (in three- and two-category versions) confirmed significant annotator agreement. Fleiss' kappa, Randolph's kappa, Krippendorff's alpha, and intraclass correlation coefficients indicate non-random agreement among the competent judges (annotators).

Our initial data acquisition based on the abovementioned hashtags yielded 5,308,997 posts. To focus specifically on discussions related to COVID-19 and the war in Ukraine, we implemented a filtering process using Polish word stems relevant to these topics. This step reduced our dataset to 4,840,446 posts. The filtering was performed using regular expressions based on lemmatized versions of key terms. For war-related content, we used stems such as 'wojna' (war), 'inwazj' (invasion), 'ukrai' (Ukraine), and 'putin'. For COVID-related content, we used stems like 'mask' (mask), 'szczepi' (vaccine), and 'koronawirus' (coronavirus). This approach allowed us to capture various grammatical forms of these words.

Following this initial filtering, we removed three users who had no posts related to either COVID-19 or the war in Ukraine. This step left us with 3,597 users and 4,839,995 posts.
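The stem-based filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: only the stems quoted in the text are used (the full lists are not given), the function names are hypothetical, and matching a stem anywhere inside a word is an assumption made here so that prefixed forms are also caught.

```python
import re

# Only the stems quoted in the text; the authors' complete lists are not published here.
WAR_STEMS = ["wojna", "inwazj", "ukrai", "putin"]
COVID_STEMS = ["mask", "szczepi", "koronawirus"]


def build_stem_pattern(stems):
    """Compile one case-insensitive pattern that matches any stem as a substring."""
    alternation = "|".join(re.escape(stem) for stem in stems)
    return re.compile(alternation, re.IGNORECASE)


WAR_RE = build_stem_pattern(WAR_STEMS)
COVID_RE = build_stem_pattern(COVID_STEMS)


def is_on_topic(text):
    """True if the post mentions at least one war-related or COVID-related stem."""
    return bool(WAR_RE.search(text) or COVID_RE.search(text))


def filter_posts(posts):
    """Keep only on-topic posts, preserving their original order."""
    return [post for post in posts if is_on_topic(post)]


# Invented example posts, for illustration only.
example_posts = [
    "Szczepienia ruszają jutro",
    "Wojna w Ukrainie trwa",
    "Dzień dobry wszystkim",
]
kept = filter_posts(example_posts)
```

Substring matching is permissive: a short stem such as 'mask' will also match unrelated words that happen to contain it, so a real pipeline would likely need a review step or tighter patterns.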
Finally, to ensure consistency in our analysis, we selected only posts in the Polish language. This final step resulted in our dataset of 3,577,040 posts from 3,597 users. Before the content analysis, the tweet text was lemmatized, and special characters, links, and low-importance words from a stop list (e.g. conjunctions) were removed.

Data preprocessing was carried out in Python using specific libraries and our original code. The hydrated tweets were further cleaned by removing duplicates and all tweets that had no English language label. Some characters and technical expressions were then replaced with natural language terms (e.g., changing “&” into “and”). We also created several versions of the database for various purposes: in some we replaced emoji with their descriptions (using the demoji library and our original code), and in others we removed emoji, hyperlinks, and special characters. The resulting dataset comprises 24,083,452 tweets (7,741,602 tweets without retweets), which makes it the biggest database of social media data referring to climate change analyzed to date.

We created the directed social network graph in Python, using the RAPIDS cuGraph library for most of the network statistics calculations and also graph-tool. The final graph visualization was created with Gephi after preparing and filtering the data in Python.
The final graph had 4,398,368 nodes and 18,595,472 edges, after removing duplicates and self-loops.

Each user rated by more than one annotator was assigned a final label of "believer," "denier," or "neutral/unknown" by averaging the results from those annotators.

In the Ukraine dataset, the 'anti' group refers to tweets that use various tactics of information warfare aimed at discrediting Ukraine's sovereignty and legitimacy, whereas the 'pro' group consists of tweets that support Ukraine's sovereignty and legitimacy. In the Vaccine dataset, 'anti' denotes a group of users who publish tweets against vaccination, while 'pro' users advocate for vaccination programs. In the Climate Change dataset, 'denier' users dismiss climate change as a conspiracy theory, while 'believer' users perceive it as a serious threat to the future of humanity.

For the ClimateChange dataset, creationdate indicates when the connection between two users was established. The user1 and user2 fields are anonymized unique IDs representing the source and target users, respectively. The user1status field denotes whether user1 is a believer (1), neutral (2), or denier (3). The creationday field is a numeric value tied to the creation date. The onset and terminus fields mark the first and last days of any recorded interaction between user1 and user2, respectively, and duration captures the total time they have interacted. Finally, the w field indicates the number of interactions (such as replies, retweets, or direct messages) exchanged between them in a Twitter context.

In the Ukraine war and Vaccine datasets, “createdate” indicates the date of the interaction. The “likecount,” “retweetcount,” “replycount,” and “quotecount” columns capture engagement metrics on Twitter: how many times a tweet is liked, retweeted, replied to, or quoted.
The “user1” and “user2” fields store unique user IDs, whereas “user1proukraine,” “user1provaccine,” “user2proukraine,” and “user2provaccine” denote each user’s stance (e.g., pro, anti, or unknown) regarding Ukraine and vaccines. The “creationday” is a numeric value corresponding to the creation date, while “onset” and “terminus” mark the first and last recorded interactions between user1 and user2, respectively. Finally, “duration” shows the total time span across which these interactions took place.
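To make the edge fields concrete, the sketch below shows one plausible way that onset, terminus, duration, and w could be derived from individual interaction records. It is an assumption-laden illustration, not the authors' code: the function name is hypothetical, creationday is treated as an integer day index, and duration is taken to be terminus minus onset, which the description does not state explicitly.

```python
from collections import defaultdict


def aggregate_edges(interactions):
    """Collapse (user1, user2, creationday) records into one summary per directed pair.

    Assumes creationday is an integer day index; duration is defined here as
    terminus - onset, which is a guess at the dataset's definition.
    """
    days_by_pair = defaultdict(list)
    for user1, user2, creationday in interactions:
        days_by_pair[(user1, user2)].append(creationday)

    edges = {}
    for pair, days in days_by_pair.items():
        onset, terminus = min(days), max(days)
        edges[pair] = {
            "onset": onset,                  # first recorded interaction day
            "terminus": terminus,            # last recorded interaction day
            "duration": terminus - onset,    # span between first and last day
            "w": len(days),                  # number of interactions
        }
    return edges


# Invented records: user 1 interacts with user 2 twice, user 2 with user 1 once.
example_edges = aggregate_edges([(1, 2, 10), (1, 2, 14), (2, 1, 12)])
```

Because the graph is directed, (1, 2) and (2, 1) are kept as separate edges.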
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Ever wondered what people are saying about certain countries? Whether it's in a positive/negative light? What are the most commonly used phrases/words to describe the country? In this dataset I present tweets where a certain country gets mentioned in the hashtags (e.g. #HongKong, #NewZealand). It contains around 150 countries in the world. I've added an additional field called polarity which has the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!
Each row represents a tweet. Creation dates of tweets range from 12/07/2020 to 25/07/2020. Will update on a monthly cadence.
- The country can be derived from the file_name field (this field is very Tableau friendly when it comes to plotting maps).
- The date on which the tweet was created is in the created_at field.
- The search query used to query the Twitter Search Engine is in the search_query field.
- The full text of the tweet is in the text field.
- The sentiment is in the polarity field (I've used the Vader Model from NLTK to compute this).

There may be slight duplications in tweet IDs before 22/07/2020. I have since fixed this bug.
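If the duplicated tweet IDs matter for your analysis, they can be dropped before use. The sketch below keeps the first occurrence of each ID. The function name and the `id` field name are hypothetical, since the description does not say which column holds the tweet ID.

```python
def drop_duplicate_tweets(rows, id_field="id"):
    """Return rows with repeated tweet IDs removed, keeping the first occurrence.

    `id_field` is a placeholder; replace it with the dataset's actual ID column.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        tweet_id = row[id_field]
        if tweet_id in seen:
            continue  # skip a tweet we have already kept
        seen.add(tweet_id)
        unique_rows.append(row)
    return unique_rows


# Invented rows for illustration only.
example_rows = [
    {"id": "101", "text": "first"},
    {"id": "102", "text": "second"},
    {"id": "101", "text": "first again"},
]
deduplicated = drop_duplicate_tweets(example_rows)
```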
Thanks to the tweepy package for making the data extraction via Twitter API so easy.
Feel free to check out my blog if you want to learn how I built the datalake via AWS or for other data shenanigans.
Here's an App I built using a live version of this data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a CSV file containing Tweet IDs of 3,805 Tweets from user ID 25073877 posted publicly between Thursday February 25 2016 16:35:12 +0000 and Monday April 03 2017 12:51:01 +0000. This file does not include the Tweets' texts or URLs. The columns in the file are:
- id_str
- from_user_id_str
- created_at
- time
- source
- user_followers_count
- user_friends_count

Motivations to Share this Data
Archived Tweets can provide interesting insights for the study of contemporary history of media, politics, diplomacy, etc. The queried account is a public account widely agreed to be of exceptional national and international public interest. Though they provide public access to tweeted content in real time, Twitter Web and mobile clients are not suited for appropriate Tweet corpus analysis. For anyone researching social media, access to the data is absolutely essential in order to perform, review and reproduce studies. Archiving Tweets of public interest due to their historic significance is a means to both preserve and enable reproducible study of this form of rapid online communication that otherwise can very likely become unretrievable as time passes. Due to Twitter's current business model and API limits, to date collecting in real time is the only relatively reliable method to archive Tweets at a small scale.

Methodology and Limitations
The Tweets contained in this file were collected by Ernesto Priego using a Python script. The data collection search query was from:realdonaldtrump. A trigger was scheduled to collect automatically every hour. The original data harvesting was refined to delete duplications, to adhere to Twitter's Terms and Conditions, and to sort the data in chronological order. Duplication of data due to the automated collection is possible, so further data refining might be required. The file may not contain data from Tweets deleted by the queried user account immediately after original publication.
Both research and experience show that the Twitter search API is not 100% reliable (Gonzalez-Bailon, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet posted by the queried account during the indicated period. This dataset is shared for archival, comparative and indicative educational research purposes only. The content included is from a public Twitter account and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.

The original Tweets, their contents and associated metadata were published openly on the Web from the queried public account and are the responsibility of the original authors. Original Tweets are likely to be copyrighted by their individual authors, but please check individually. No private personal information is shared in this dataset. As indicated above, this dataset does not contain the text of the Tweets. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into political activity on Twitter.

Other Considerations
All Twitter users agree to Twitter's Privacy and data sharing policies. Social media research remains in its infancy, and though work has been done to develop best practices, there is as yet no agreement on a series of grey areas relating to research methodologies, including ad hoc social media specific research ethics guidelines for reproducible research.
Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. Reproducibility is considered here a key value for robust and trustworthy research. Different scholarly professional associations like the Modern Language Association recognise Tweets, datasets and other online and digital resources as citeable scholarly outputs.The data contained in the deposited file is otherwise available elsewhere through different methods.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the first public large-scale multilingual Twitter dataset related to the FIFA World Cup 2022, comprising over 28 million posts in 69 unique spoken languages, including Arabic, English, Spanish, French, and many others. The dataset aims to facilitate future research in sentiment analysis, cross-linguistic studies, event-based analytics, meme and hate speech detection, fake news detection, and social manipulation detection.
The file 🚨Qatar22WC.csv🚨 contains tweet-level and user-level metadata for our collected tweets.
🚀Codebook for FIFA World Cup 2022 Twitter Dataset🚀
| Column Name | Description |
|---|---|
| day, month, year | Date the tweet was posted |
| hou, min, sec | Hour, minute, and second of the tweet timestamp |
| age_of_the_user_account | User account age in days |
| tweet_count | Total number of tweets posted by the user |
| location | User-defined location field |
| follower_count | Number of followers the user has |
| following_count | Number of accounts the user is following |
| follower_to_Following | Follower-following ratio |
| favouite_count | Number of tweets the user has liked |
| verified | Boolean indicating if the user is verified (1 = Verified, 0 = Not Verified) |
| Avg_tweet_count | Average number of tweets per day posted by the user |
| list_count | Number of lists the user is a member of |
| Tweet_Id | Tweet ID |
| is_reply_tweet | ID of the tweet being replied to (if applicable) |
| is_quote | Boolean indicating if the tweet is a quote |
| retid | Retweet ID if it's a retweet; NaN otherwise |
| lang | Language of the tweet |
| hashtags | The keyword or hashtag used to collect the tweet |
| is_image | Boolean indicating if the tweet is associated with an image |
| is_video | Boolean indicating if the tweet is associated with a video |
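Because the codebook splits each timestamp across six columns, a common first step is to recombine them. The sketch below does this with the standard library and also recomputes a follower-following ratio. It is illustrative only: the sample rows are invented, the helper names are hypothetical, the columns are assumed to hold integers, and returning None for users who follow nobody is a choice made here (how the dataset's own follower_to_Following column handles that case is not stated).

```python
import csv
import io
from datetime import datetime

# Invented two-row sample using column names from the codebook.
SAMPLE = """day,month,year,hou,min,sec,follower_count,following_count,verified
20,11,2022,16,0,5,1200,300,1
18,12,2022,15,30,0,50,0,0
"""


def row_timestamp(row):
    """Combine the six split date/time columns into a single datetime."""
    return datetime(
        int(row["year"]), int(row["month"]), int(row["day"]),
        int(row["hou"]), int(row["min"]), int(row["sec"]),
    )


def follower_ratio(row):
    """Followers divided by followings; None when the user follows nobody."""
    following = int(row["following_count"])
    if following == 0:
        return None
    return int(row["follower_count"]) / following


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
timestamps = [row_timestamp(r) for r in rows]
ratios = [follower_ratio(r) for r in rows]
```

For the real file, the same functions could be applied to rows read from Qatar22WC.csv, provided the columns parse as integers.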
Examples of use case queries are described in the file 🚨fifa_wc_qatar22_examples_of_use_case_queries.ipynb🚨 and accessible via: https://github.com/khairied/Qata_FIFA_World_Cup_22
🚀 Please Cite This as: Daouadi, K. E., Boualleg, Y., Guehairia, O. & Taleb-Ahmed, A. (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup, Journal of Computational Social Science.