77 datasets found

Twitter users in the United States 2019-2028
statista.com
ai-chatbox.pro
Updated Jul 31, 2025
Cite
Statista Research Department (2025). Twitter users in the United States 2019-2028 [Dataset]. https://www.statista.com/topics/3196/social-media-usage-in-the-united-states/
Explore at:
Dataset updated
Jul 31, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Statista Research Department
Area covered
United States
Description
The number of Twitter users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 85.08 million users in 2028, a new peak. Notably, the number of Twitter users has increased continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
Twitter users worldwide 2019-2028
statista.com
ai-chatbox.pro
Updated Dec 10, 2024
Cite
Statista Research Department (2024). Twitter users worldwide 2019-2028 [Dataset]. https://www.statista.com/topics/2297/twitter-marketing/
Explore at:
Dataset updated
Dec 10, 2024
Dataset provided by
Statista (http://statista.com/)
Authors
Statista Research Department
Description
The global number of Twitter users was forecast to increase continuously between 2024 and 2028 by a total of 74.3 million users (+17.32 percent). After the ninth consecutive year of growth, the Twitter user base is estimated to reach 503.42 million users in 2028, a new peak. Notably, the number of Twitter users has increased continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in regions like South America and the Americas.
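As a rough consistency check on the two forecasts above, the 2024 baseline implied by each description can be recovered by subtracting the forecast increase from the 2028 figure. The sketch below is illustrative only: the helper name `implied_growth` is hypothetical, and the inputs are the rounded figures quoted in the descriptions, so small rounding differences are to be expected.

```python
def implied_growth(peak_2028_millions, increase_millions):
    """Return (implied 2024 baseline, percent growth) from the rounded figures."""
    baseline = peak_2028_millions - increase_millions
    percent = increase_millions / baseline * 100
    return baseline, percent


# Figures quoted in the United States and worldwide descriptions.
us_baseline, us_percent = implied_growth(85.08, 4.3)
world_baseline, world_percent = implied_growth(503.42, 74.3)

print(f"US: baseline ~{us_baseline:.2f}M, growth ~{us_percent:.2f}%")
print(f"World: baseline ~{world_baseline:.2f}M, growth ~{world_percent:.2f}%")
```

Both percentages land within rounding distance of the quoted +5.32 and +17.32 percent.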
Social Media Users 2021
kaggle.com
Updated Feb 26, 2021
Cite
Margaretha Martinez (2021). Social Media Users 2021 [Dataset]. https://www.kaggle.com/datasets/margarethamartinez/socialmedia2021
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 26, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Margaretha Martinez
Description
Context

Worldwide social media users in 2021 (quarterly)

Acknowledgements

Facebook: https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
Twitter: https://investor.twitterinc.com/home/default.aspx
Instagram: https://investor.fb.com/home/default.aspx
Two datasets of tweets.
plos.figshare.com
xls
Updated Jun 2, 2023
Cite
Su Yeon Han; Ming-Hsiang Tsou; Keith C. Clarke (2023). Two datasets of tweets. [Dataset]. http://doi.org/10.1371/journal.pone.0132464.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0132464.t001
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Su Yeon Han; Ming-Hsiang Tsou; Keith C. Clarke
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Every tweet in the first dataset includes at least one name of a large city in the U.S. or elsewhere. The second dataset does not include city names outside the U.S., but contains the names of small, mid-sized, and large cities in the U.S.
Customer Support on Twitter
kaggle.com
zip
Updated Nov 27, 2017
Cite
Thought Vector (2017). Customer Support on Twitter [Dataset]. https://www.kaggle.com/thoughtvector/customer-support-on-twitter
Explore at:
Available download formats: zip (149959515 bytes)
Dataset updated
Nov 27, 2017
Dataset authored and provided by
Thought Vector
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact.

[Figure: Example Analysis - Inbound Volume for the Top 20 Brands (https://i.imgur.com/nTv3Iuu.png)]

Context

Natural language remains the densest encoding of human experience we have, and innovation in NLP has accelerated to power understanding of that data, but the datasets driving this innovation don't match the real language in use today. The Customer Support on Twitter dataset offers a large corpus of modern English (mostly) conversations between consumers and customer support agents on Twitter, and has three important advantages over other conversational text datasets:

Focused - Consumers contact customer support to have a specific problem solved, and the manifold of problems to be discussed is relatively small, especially compared to unconstrained conversational datasets like the reddit Corpus.

Natural - Consumers in this dataset come from a much broader segment than those in the Ubuntu Dialogue Corpus and have much more natural and recent use of typed text than the Cornell Movie Dialogs Corpus.

Succinct - Twitter's brevity causes more natural responses from support agents (rather than scripted), and to-the-point descriptions of problems and solutions. Also, it's convenient in allowing for a relatively low message size limit for recurrent nets.

Inspiration

The size and breadth of this dataset inspires many interesting questions:

Can we predict company responses? Given the bounded set of subjects handled by each company, the answer seems like yes!

Do requests get stale? How quickly do the best companies respond, compared to the worst?

Can we learn high quality dense embeddings or representations of similarity for topical clustering?

How does tone affect the customer support conversation? Does saying sorry help?

Can we help companies identify new problems, or ones most affecting their customers?

Content

The dataset is a CSV, where each row is a tweet. The different columns are described below. Every conversation included has at least one request from a consumer and at least one response from a company. Which user IDs are company user IDs can be calculated using the inbound field.

tweet_id

A unique, anonymized ID for the Tweet. Referenced by response_tweet_id and in_response_to_tweet_id.

author_id

A unique, anonymized user ID. @s in the dataset have been replaced with their associated anonymized user ID.

inbound

Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.

created_at

Date and time when the tweet was sent.

text

Tweet content. Sensitive information like phone numbers and email addresses is replaced with mask values like _email_.

response_tweet_id

IDs of tweets that are responses to this tweet, comma-separated.

in_response_to_tweet_id

ID of the tweet this tweet is in response to, if any.
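
The columns above are enough to recover company accounts and request/response pairs. The following is a minimal sketch under stated assumptions, not the dataset author's method: it assumes `inbound` is serialized as the strings `True`/`False`, that an empty `in_response_to_tweet_id` means "no parent", and it uses made-up sample rows (including the placeholder author `ExampleCo`). All helper names are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; real rows come from the dataset's CSV file.
SAMPLE_CSV = """tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
1,115712,True,Tue Oct 31 22:10:47 +0000 2017,My phone will not charge,2,
2,ExampleCo,False,Tue Oct 31 22:11:45 +0000 2017,Sorry to hear that. Please DM us.,3,1
3,115712,True,Tue Oct 31 22:12:30 +0000 2017,Thanks,,2
"""


def is_inbound(row):
    # Assumption: the inbound flag is stored as the text "True" or "False".
    return row["inbound"].strip().lower() == "true"


def load_rows(csv_text):
    return list(csv.DictReader(io.StringIO(csv_text)))


def company_author_ids(rows):
    # Authors of non-inbound tweets are treated as company accounts.
    return {row["author_id"] for row in rows if not is_inbound(row)}


def request_response_pairs(rows):
    # Pair each company reply with the consumer tweet it responds to.
    by_id = {row["tweet_id"]: row for row in rows}
    pairs = []
    for row in rows:
        parent = by_id.get(row["in_response_to_tweet_id"])
        if parent is not None and is_inbound(parent) and not is_inbound(row):
            pairs.append((parent["text"], row["text"]))
    return pairs


rows = load_rows(SAMPLE_CSV)
print(company_author_ids(rows))
print(request_response_pairs(rows))
```

Pairs in this shape (consumer text, company text) are one possible starting point for training conversational models; longer threads would need to follow `response_tweet_id` chains as well.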

Contributing

Know of other brands the dataset should include? Found something that needs to be fixed? Start a discussion, or email me directly at $FIRSTNAME@$LASTNAME.com!

Acknowledgements

A huge thank you to my friends who helped bootstrap the list of companies that do customer support on Twitter! There are many rocks that would have been left un-turned were it not for your suggestions!

Relevant Resources

NLTK - casual_tokenize for social media text tokenizing, vader sentiment analysis for social media text

SciKit Learn - BoW Count Vectorizer, Multinomial Naive Bayes Classifier

Topic Modeling via Phrase detection with gensim

facebook research - fastText text classifier
Twitter bot profiling
researchdata.smu.edu.sg
smu.edu.sg
pdf
Updated May 31, 2023
Cite
Living Analytics Research Centre (2023). Twitter bot profiling [Dataset]. http://doi.org/10.25440/smu.12062706.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.25440/smu.12062706.v1
Dataset updated
May 31, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
Living Analytics Research Centre
License
http://rightsstatements.org/vocab/InC/1.0/
Description
This dataset comprises a set of Twitter accounts in Singapore that are used for social bot profiling research conducted by the Living Analytics Research Centre (LARC) at Singapore Management University (SMU). Here a bot is defined as a Twitter account that generates contents and/or interacts with other users automatically (at least according to human judgment). In this research, Twitter bots have been categorized into three major types:

Broadcast bot. This bot aims at disseminating information to a general audience by providing, e.g., benign links to news, blogs or sites. Such a bot is often managed by an organization or a group of people (e.g., bloggers).

Consumption bot. The main purpose of this bot is to aggregate content from various sources and/or provide update services (e.g., horoscope reading, weather update) for personal consumption or use.

Spam bot. This type of bot posts malicious content (e.g., to trick people by hijacking certain accounts or redirecting them to malicious sites), or aggressively promotes harmless but invalid/irrelevant content.

This categorization is general enough to cater for new, emerging types of bot (e.g., chatbots can be viewed as a special type of broadcast bots).

The dataset was collected from 1 January to 30 April 2014 via the Twitter REST and streaming APIs. Starting from popular seed users (i.e., users having many followers), their follow, retweet, and user mention links were crawled. The data collection proceeds by adding those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. Using this procedure, a total of 159,724 accounts have been collected.

To identify bots, the first step is to check active accounts who tweeted at least 15 times within the month of April 2014. These accounts were then manually checked and labelled, of which 589 bots were found. As many more human users are expected in the Twitter population, the remaining accounts were randomly sampled and manually checked. With this, 1,024 human accounts were identified. In total, this results in 1,613 labelled accounts.

Related Publication: R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E.-P. Lim. (2016). On profiling bots in social media. Proceedings of the International Conference on Social Informatics (SocInfo’16), 92-109. Bellevue, WA. https://doi.org/10.1007/978-3-319-47880-7_6
Data from: IA Tweets Analysis Dataset (Spanish)
data.niaid.nih.gov
Updated Aug 3, 2024
Cite
Muñoz, Andrés (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10821484
Explore at:
Dataset updated
Aug 3, 2024
Dataset provided by
Muñoz, Andrés
Guerrero-Contreras, Gabriel
Balderas-Díaz, Sara
Serrano-Fernández, Alejandro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General Description

This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

Data Collection Method

Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

Dataset Content

ID: A unique identifier for each tweet.

text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

user_followers_count: The current number of followers the account has. It is a non-negative integer.

user_friends_count: The number of users that the account is following. It is a non-negative integer.

user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.
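
The field descriptions above imply simple per-column constraints that can be checked when loading the CSV. The sketch below is a hypothetical validator, assuming the CSV headers match the field names listed and that booleans are serialized as the strings `True`/`False`; `validate_row` and the sample rows are illustrative and not part of the dataset. Note that `len()` counts Unicode code points, which is only an approximation of how Twitter counts characters.

```python
BOOLEAN_FIELDS = (
    "user_verified", "user_default_profile", "user_has_extended_profile",
    "user_protected", "user_is_translator",
)
COUNT_FIELDS = (
    "favorite_count", "retweet_count", "user_followers_count",
    "user_friends_count", "user_favourites_count", "user_statuses_count",
)
MAX_TEXT_LENGTH = 280


def validate_row(row):
    """Return the names of fields that violate the documented constraints."""
    problems = []
    if len(row.get("text", "")) > MAX_TEXT_LENGTH:
        problems.append("text")
    for name in BOOLEAN_FIELDS:
        if row.get(name) not in ("True", "False"):
            problems.append(name)
    for name in COUNT_FIELDS:
        value = row.get(name, "")
        # Non-negative integers only: plain ASCII digits, no sign.
        if not (value.isascii() and value.isdigit()):
            problems.append(name)
    return problems


# Made-up rows for illustration.
good_row = {
    "ID": "1", "text": "La IA avanza", "polarity": "Positive",
    **{name: "False" for name in BOOLEAN_FIELDS},
    **{name: "0" for name in COUNT_FIELDS},
}
bad_row = dict(good_row, retweet_count="-1", user_verified="yes")

print(validate_row(good_row))
print(validate_row(bad_row))
```

The `polarity` column is deliberately left unchecked because the description lists its values only as examples.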

Cite as

Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

Potential Use Cases

This dataset is aimed at academic researchers and practitioners with interests in:

Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

Exploring correlations between user engagement metrics and sentiment in discussions about AI.

Data Format and File Type

The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

License

The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
Data from: IA Tweets Analysis Dataset (Spanish)
explore.openaire.eu
produccioncientifica.uca.es
Updated Mar 15, 2024
Cite
Gabriel Guerrero-Contreras; Sara Balderas-Díaz; Alejandro Serrano-Fernández; Andrés Muñoz (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. http://doi.org/10.5281/zenodo.10821485
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10821485
Dataset updated
Mar 15, 2024
Authors
Gabriel Guerrero-Contreras; Sara Balderas-Díaz; Alejandro Serrano-Fernández; Andrés Muñoz
Description
Cite as

Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

General Description

This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

Data Collection Method

Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

Dataset Content

ID: A unique identifier for each tweet.

text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

user_followers_count: The current number of followers the account has. It is a non-negative integer.

user_friends_count: The number of users that the account is following. It is a non-negative integer.

user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

Potential Use Cases

This dataset is aimed at academic researchers and practitioners with interests in:

Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

Exploring correlations between user engagement metrics and sentiment in discussions about AI.

Data Format and File Type

The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

License

The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
Data from: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual...
data.niaid.nih.gov
zenodo.org
Updated Jun 16, 2020
Cite
Muhammad Imran (2020). GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3878598
Explore at:
Dataset updated
Jun 16, 2020
Dataset provided by
Ferda Ofli
Muhammad Imran
Umair Qazi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset has been collected over a period of 90 days from February 1 to May 1, 2020 and consists of more than 524 million multilingual tweets. As the geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content to derive their geolocation information using the Nominatim (OpenStreetMap) data at different geolocation granularity levels. In terms of geographical coverage, the dataset spans 218 countries and 47K cities in the world. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.
COVID-19 Twitter Data Geographic Distribution
catalog.midasnetwork.us
xls
Updated Jul 7, 2023
Cite
MIDAS Coordination Center (2023). COVID-19 Twitter Data Geographic Distribution [Dataset]. https://catalog.midasnetwork.us/collection/33
Explore at:
Available download formats: xls
Dataset updated
Jul 7, 2023
Dataset authored and provided by
MIDAS Coordination Center
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Time period covered
Feb 1, 2020 - Aug 31, 2020
Variables measured
media, disease, COVID-19, pathogen, Homo sapiens, social media, host organism, infectious disease, Severe acute respiratory syndrome coronavirus 2
Dataset funded by
National Institute of General Medical Sciences
Description
This dataset represents the geographical distribution of Twitter users and tweets related to the Coronavirus (COVID-19) pandemic across the world. It covers the geographical distribution of geo-tagged COVID-19 tweets, COVID-19 Twitter users, and the most mentioned COVID-19 locations worldwide.

Data from: TweetNERD - End to End Entity Linking Benchmark for Tweets

zenodo.org

bin, tsv

Updated Feb 3, 2023

Cite

Shubhanshu Mishra; Aman Saini; Raheleh Makki; Sneha Mehta; Aria Haghighi; Ali Mollahosseini (2023). TweetNERD - End to End Entity Linking Benchmark for Tweets [Dataset]. http://doi.org/10.5281/zenodo.6617192

Explore at:

Available download formats: tsv, bin

Unique identifier

https://doi.org/10.5281/zenodo.6617192

Dataset updated

Feb 3, 2023

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Shubhanshu Mishra; Aman Saini; Raheleh Makki; Sneha Mehta; Aria Haghighi; Ali Mollahosseini

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

TweetNERD - End to End Entity Linking Benchmark for Tweets

This is the dataset described in the paper TweetNERD - End to End Entity Linking Benchmark for Tweets (accepted to the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track).

Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets across 2010-2021, for benchmarking NERD systems on Tweets. This is the largest and most temporally diverse open-sourced dataset benchmark for NERD on Tweets and can be used to facilitate research in this area.

The TweetNERD dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The license only applies to the data files present in this dataset. See Data usage policy below.

Check out more details at https://github.com/twitter-research/TweetNERD

Usage

We provide the dataset split across the following tab-separated files:

OOD.public.tsv: OOD split of the data in the paper.
Academic.public.tsv: Academic split of the data described in the paper.
part_*.public.tsv: Remaining data split into parts in no particular order.

Each file is tab-separated and has the following format:

tweet_id	phrase	start	end	entityId	score
22	twttr	20	25	Q918	3
21	twttr	20	25	Q918	3
1457198399032287235	Diwali	30	38	Q10244	3
1232456079247736833	NO_PHRASE	-1	-1	NO_ENTITY	-1

For tweets which don't have any entity, their column values for phrase, start, end, entityId, score are set to NO_PHRASE, -1, -1, NO_ENTITY, -1 respectively.

Description of file columns is as follows:

Column	Type	Missing Value	Description
tweet_id	string		ID of the Tweet
phrase	string	NO_PHRASE	entity phrase
start	int	-1	start offset of the phrase in text using `UTF-16BE` encoding
end	int	-1	end offset of the phrase in the text using `UTF-16BE` encoding
entityId	string	NO_ENTITY	Entity ID. If not missing can be NOT FOUND, AMBIGUOUS, or Wikidata ID of format Q{numbers}, e.g. Q918
score	int	-1	Number of annotators who agreed on the phrase, start, end, entityId information

In order to use the dataset you need to utilize the tweet_id column and get the Tweet text using the Twitter API (See Data usage policy section below).
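
The file format above can be read with a few lines of Python. This is a minimal sketch, not the authors' loader: it assumes the `start` and `end` offsets count UTF-16 code units (two bytes each in `UTF-16BE`), and the helper names `parse_rows` and `utf16_slice` are hypothetical. The sample rows are taken from the example above; the text passed to `utf16_slice` is made up, since tweet text must be fetched separately.

```python
import csv
import io

SAMPLE_TSV = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "22\ttwttr\t20\t25\tQ918\t3\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)


def parse_rows(tsv_text):
    """Parse annotation rows, mapping the documented missing values to None."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for raw in reader:
        has_entity = raw["phrase"] != "NO_PHRASE"
        rows.append({
            "tweet_id": raw["tweet_id"],  # kept as a string, as documented
            "phrase": raw["phrase"] if has_entity else None,
            "start": int(raw["start"]) if has_entity else None,
            "end": int(raw["end"]) if has_entity else None,
            "entityId": raw["entityId"] if has_entity else None,
            "score": int(raw["score"]) if has_entity else None,
        })
    return rows


def utf16_slice(text, start, end):
    """Slice text by UTF-16 code unit offsets (assumed offset convention)."""
    data = text.encode("utf-16-be")
    return data[2 * start:2 * end].decode("utf-16-be")


rows = parse_rows(SAMPLE_TSV)
print(rows[0]["entityId"], rows[1]["phrase"])
print(utf16_slice("\U0001F600 hi", 3, 5))
```

Slicing on encoded bytes rather than on the Python string matters for tweets containing emoji, which occupy two UTF-16 code units but a single Python character.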

Data stats

Split	Number of Rows	Number unique tweets
OOD	34102	25000
Academic	51685	30119
part_0	11830	10000
part_1	35681	25799
part_2	34256	25000
part_3	36478	25000
part_4	37518	24999
part_5	36626	25000
part_6	34001	24984
part_7	34125	24981
part_8	32556	25000
part_9	32657	25000
part_10	32442	25000
part_11	32033	24972
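
As a quick sanity check on the table above, the per-split counts can be summed. The sketch below simply adds the published numbers; treating the sum of unique tweets as a dataset-wide total assumes that no tweet appears in more than one split, which the source does not state explicitly.

```python
# (number of rows, number of unique tweets) per split, copied from the table.
STATS = {
    "OOD": (34102, 25000),
    "Academic": (51685, 30119),
    "part_0": (11830, 10000),
    "part_1": (35681, 25799),
    "part_2": (34256, 25000),
    "part_3": (36478, 25000),
    "part_4": (37518, 24999),
    "part_5": (36626, 25000),
    "part_6": (34001, 24984),
    "part_7": (34125, 24981),
    "part_8": (32556, 25000),
    "part_9": (32657, 25000),
    "part_10": (32442, 25000),
    "part_11": (32033, 24972),
}

total_rows = sum(rows for rows, _ in STATS.values())
total_unique = sum(unique for _, unique in STATS.values())
print(total_rows, total_unique)
```

The unique-tweet total comes to 340,854, which is consistent with the "340K+ Tweets" figure quoted in the description.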

Data usage policy

Use of this dataset is subject to you obtaining lawful access to the Twitter API, which requires you to agree to the Developer Terms Policies and Agreements.

Please cite the following if you use TweetNERD in your paper:

@dataset{TweetNERD_Zenodo_2022_6617192,
 author    = {Mishra, Shubhanshu and
         Saini, Aman and
         Makki, Raheleh and
         Mehta, Sneha and
         Haghighi, Aria and
         Mollahosseini, Ali},
 title    = {{TweetNERD - End to End Entity Linking Benchmark 
          for Tweets}},
 month    = jun,
 year     = 2022,
 note     = {{Data usage policy Use of this dataset is subject 
          to you obtaining lawful access to the [Twitter
          API](https://developer.twitter.com/en/docs
          /twitter-api), which requires you to agree to the
          [Developer Terms Policies and
          Agreements](https://developer.twitter.com/en
          /developer-terms/).}},
 publisher  = {Zenodo},
 version   = {0.0.0},
 doi     = {10.5281/zenodo.6617192},
 url     = {https://doi.org/10.5281/zenodo.6617192}
}
@inproceedings{TweetNERDNeurips2022,
 author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 pages = {},
 title = {TweetNERD - End to End Entity Linking Benchmark for Tweets},
 volume = {2},
 year = {2022},
 eprint = {arXiv:2210.08129},
 doi = {10.48550/arXiv.2210.08129}
}

A Dataset of UN Agencies' Public Communication about Climate Change on...

zenodo.org
explore.openaire.eu

csv, txt

Updated Feb 13, 2023

Cite

Karina Shyrokykh; Max Girnyk; Lisa Dellmuth (2023). A Dataset of UN Agencies' Public Communication about Climate Change on Twitter [Dataset]. https://zenodo.org/records/7633599

Explore at:

Available download formats: txt, csv

Unique identifier

https://doi.org/10.5281/zenodo.7633599

Dataset updated

Feb 13, 2023

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Karina Shyrokykh; Max Girnyk; Lisa Dellmuth

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

United Nations

Description

The present dataset contains the Twitter communication of eight international organizations (IOs) in different policy areas that are known to be central in communicating about climate change. The IOs are comparable in their communication, all being parts of the United Nations (UN). The IOs under consideration are:

Food and Agriculture Organization (FAO),
Office for the Coordination of Humanitarian Affairs (UNOCHA),
UN Development Programme (UNDP),
UN Office for Disaster Risk Reduction (UNDRR),
UN Environmental Program (UNEP),
UN International Children’s Emergency Fund (UNICEF),
UN High Commissioner for Refugees (UNHCR),
World Health Organization (WHO).

The tweets were downloaded and parsed via the Twitter Academic Research API (link). In total, the dataset contains 222,191 tweet IDs of the tweets posted by the above 8 UN organizations from their official accounts. This number represents the total number of tweets posted by these selected UN organizations from the beginning of their tweeting history until the end of 2019. The dataset is compliant with Twitter's privacy policy, developer agreement, and guidelines for content redistribution, and with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

The dataset consists of two parts:

Unlabeled tweet IDs of the considered IOs (8 txt-files),
Labeled dataset of tweet IDs with labels indicating whether tweets are about climate change or not (1 csv-file).

Unlabeled tweet IDs

The corresponding 8 txt-files contain tweet IDs of the corresponding tweets posted by the UN organizations. The files are summarised in Table 1 below.

**Table 1**. Summary of the collected dataset files.

| File | Organization | Account | Start date | End date | Tweet IDs |
|---|---|---|---|---|---|
| tweet_ids_FAO_2009_2019.txt | FAO | @FAO | Jan. 2009 | Dec. 2019 | 28,630 |
| tweet_ids_UNDP_2009_2019.txt | UNDP | @UNDP | Jul. 2009 | Dec. 2019 | 47,960 |
| tweet_ids_UNDRR_2009_2019.txt | UNDRR | @UNDRR | Oct. 2010 | Dec. 2019 | 9,735 |
| tweet_ids_UNEP_2009_2019.txt | UNEP | @UNEP | May 2009 | Dec. 2019 | 21,615 |
| tweet_ids_Refugees_2008_2019.txt | UNHCR | @Refugees | Jun. 2008 | Dec. 2019 | 42,882 |
| ttweet_ids_UNICEF_2009_2019.txt | UNICEF | @UNICEF | Jul. 2009 | Nov. 2019 | 34,288 |
| tweet_ids_UNOCHA_2011_2019.txt | UNOCHA | @UNOCHA | Jul. 2011 | Jul. 2019 | 12,521 |
| tweet_ids_WHO_2008_2019.txt | WHO | @WHO | May 2008 | Dec. 2019 | 24,560 |
| | | | | Total | 222,191 |

The dataset contains only tweet IDs to ensure compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The tweet IDs need to be hydrated to be used. For hydrating the present dataset, the Hydrator application (link) may be used; see a step-by-step tutorial on how to use Hydrator (link).
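Before hydrating, it can help to load the ID files and split them into request-sized batches. The sketch below is illustrative only: it assumes each txt file holds one numeric tweet ID per line, and it assumes a batch size of 100, which is a common per-request limit for tweet lookup endpoints but should be checked against whichever hydration tool or API is actually used. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Iterator

# Assumed per-request limit; verify against the hydration tool or API in use.
BATCH_SIZE = 100


def read_tweet_ids(path: Path) -> list[str]:
    """Read one tweet ID per line, skipping blank and non-numeric lines."""
    ids = []
    for line in path.read_text(encoding="utf-8").splitlines():
        token = line.strip()
        if token.isdigit():
            ids.append(token)  # keep IDs as strings to avoid precision loss downstream
    return ids


def batched(ids: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Each batch can then be handed to a hydration client; IDs of deleted or protected tweets will simply not come back.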

Labeled dataset related to climate change

This is a subset of the entire dataset described above. Namely, 5,750 tweets are randomly selected from the entire dataset and labeled manually as either "climate change-related" or "not climate change-related". The dataset is available in the file dataset_UN_climate_change_labeled.csv and is summarised in Table 2 below.

**Table 2**. Summary of the labeled dataset.

| Organization | Tweets |
|---|---|
| FAO | 753 |
| UNDP | 1,199 |
| UNDRR | 256 |
| UNEP | 540 |
| UNHCR | 1,114 |
| UNICEF | 910 |
| UNOCHA | 366 |
| WHO | 612 |
| Total | 5,750 |

Data from: Annotated Dataset of History-related Tweets
zenodo.org
csv
Updated Sep 19, 2021
Yasunobu Sumikawa; Adam Jatowt (2021). Annotated Dataset of History-related Tweets [Dataset]. http://doi.org/10.5281/zenodo.4657223
Explore at:
csv (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.4657223
Dataset updated
Sep 19, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Yasunobu Sumikawa; Adam Jatowt
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains tweet IDs and five types of contextual information for each: 1) hashtags, 2) their categories, 3) entities obtained by NERD, 4) time references normalized by HeidelTime, and 5) Web categories for URLs attached to tweets carrying history-related hashtags. The tweets were collected to analyze how history-related content is disseminated in online social networks. Our IJDL paper shows the analysis results. The preliminary version of the analysis report is available here.

We used the official Twitter search API to collect tweets. Note that three kinds of tweets are typically found on Twitter: tweets, retweets and quote tweets. A tweet is an original text posted by a Twitter user. A retweet is a copy of an original tweet made to propagate its content to more users (i.e., one's followers). Finally, a quote tweet copies the content of another tweet and also allows new content to be added. A quote tweet is sometimes called a retweet with a comment. In this work, we simply treat all quote tweets as original tweets since they include additional information/text. However, only 1,877 (0.2%) tweets in our dataset were recognized as quote tweets.

To collect tweets that refer to the past or are related to the collective memory of past events/entities, we performed hashtag-based crawling together with a bootstrapping procedure.
At the beginning, we gathered several historical hashtags selected by experts (e.g. #HistoryTeacher, #history, #WmnHist).
In addition, we prepared several hashtags that are commonly used when referring to the past: #onthisday, #thisdayinhistory, #throwbackthursday, #otd. We then collected tweets that contain these hashtags using the official Twitter search API.

The collected tweets were issued from 8 March 2016 to 2 July 2018.
Bootstrapping allowed us to search for other hashtags frequently used with the seed hashtags. We manually inspected all discovered hashtags for their relation to history, filtered out the unrelated ones, and added tweets tagged with the remaining hashtags to the seed set.
In total, we gathered 147 history-related hashtags, which allowed us to collect 2,370,252 tweet IDs pointing to 882,977 tweets and 1,487,275 retweets.
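The bootstrapping step can be pictured as counting which hashtags co-occur with the seed hashtags and passing the frequent ones on for manual inspection. The following is a minimal sketch under that reading, not the authors' code: the function name, the `min_count` cutoff, and the sample tweets are all hypothetical.

```python
from collections import Counter

# A few of the seed hashtags named in the description.
SEED_HASHTAGS = {"#history", "#onthisday", "#otd"}


def candidate_hashtags(tweets, seeds, min_count=2):
    """Count hashtags that co-occur with any seed hashtag.

    `tweets` is an iterable of hashtag lists, one list per tweet.
    Returns (hashtag, count) pairs with count >= min_count; these are
    only candidates and still need manual inspection.
    """
    seeds = {s.lower() for s in seeds}
    counts = Counter()
    for tags in tweets:
        tags = {t.lower() for t in tags}
        if tags & seeds:                 # tweet carries at least one seed hashtag
            counts.update(tags - seeds)  # count only the non-seed hashtags
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Made-up sample data for illustration.
sample = [
    ["#history", "#ww2"],
    ["#OTD", "#ww2", "#archives"],
    ["#cats", "#ww2"],            # no seed hashtag, ignored
    ["#onthisday", "#archives"],
    ["#history"],
]
print(candidate_hashtags(sample, SEED_HASHTAGS))
```

Accepted candidates would be added to the seed set and the crawl repeated, which is what makes the procedure a bootstrap.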

Related papers:

Yasunobu Sumikawa, Adam Jatowt, and Marten During, "Digital History meets Microblogging: Analyzing Collective Memories in Twitter", In Proceedings of the 18th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL'18, IEEE/ACM, pp. 213-222, 2018. [paper]

Yasunobu Sumikawa and Adam Jatowt, "Analyzing History-related Posts in Twitter", International Journal on Digital Libraries, Springer, 2020. https://doi.org/10.1007/s00799-020-00296-2 [paper][dataset]

Yasunobu Sumikawa and Adam Jatowt, "Annotated Dataset of History-related Tweets", Data in Brief, Vol. 38, pp. 107344, Elsevier, 2021. [paper]
The Climate Change Twitter Dataset
kaggle.com
data.mendeley.com
Updated Apr 22, 2025
+ more versions
Orvile (2025). The Climate Change Twitter Dataset [Dataset]. https://www.kaggle.com/datasets/orvile/the-climate-change-twitter-dataset/discussion
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 22, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Orvile
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541

The most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the heftiest temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, while accompanied by environmental disaster events information. These dimensions were produced by testing and evaluating a plethora of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, Textblob, Flair, and LDA.

The following columns are in the dataset:

➡ created_at: The timestamp of the tweet.
➡ id: The unique id of the tweet.
➡ lng: The longitude the tweet was written.
➡ lat: The latitude the tweet was written.
➡ topic: Categorization of the tweet in one of ten topics namely, seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined.
➡ sentiment: A score on a continuous scale. This scale ranges from -1 to 1 with values closer to 1 being translated to positive sentiment, values closer to -1 representing a negative sentiment while values close to 0 depicting no sentiment or being neutral.
➡ stance: That is if the tweet supports the belief of man-made climate change (believer), if the tweet does not believe in man-made climate change (denier), and if the tweet neither supports nor refuses the belief of man-made climate change (neutral).
➡ gender: Whether the user that made the tweet is male, female, or undefined.
➡ temperature_avg: The temperature deviation in Celsius and relative to the January 1951-December 1980 average at the time and place the tweet was written.
➡ aggressiveness: That is if the tweet contains aggressive language or not.

Since Twitter forbids making public the text of the tweets, in order to retrieve it you need to do a process called hydrating. Tools such as Twarc or Hydrator can be used to hydrate tweets.
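Because the per-tweet columns are usable without hydration, a simple aggregate such as mean sentiment per stance can be computed directly. The sketch below assumes the CSV exposes `sentiment` and `stance` columns under exactly the names listed above; the sample rows are made up for illustration and the function name is hypothetical.

```python
import csv
import io

# Made-up rows; only the column names come from the dataset description.
SAMPLE = """id,sentiment,stance
1,0.62,believer
2,-0.40,denier
3,0.03,neutral
4,-0.75,believer
"""


def mean_sentiment_by_stance(rows):
    """Average the sentiment score (-1 to 1) for each stance label."""
    totals = {}
    for row in rows:
        total, n = totals.get(row["stance"], (0.0, 0))
        totals[row["stance"]] = (total + float(row["sentiment"]), n + 1)
    return {stance: total / n for stance, (total, n) in totals.items()}


result = mean_sentiment_by_stance(csv.DictReader(io.StringIO(SAMPLE)))
for stance, mean in sorted(result.items()):
    print(f"{stance}: {mean:+.3f}")
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle; with over 15 million rows, streaming through `csv.DictReader` as done here avoids loading everything into memory.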

Instagram accounts with the most followers worldwide 2024

statista.com
es.statista.com

+ more versions

Stacy Jo Dixon, Instagram accounts with the most followers worldwide 2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/

Explore at:

Dataset provided by

Statista (http://statista.com/)

Authors

Stacy Jo Dixon

Description

Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.

The Portuguese footballer is the most-followed person on the photo sharing app platform with 628 million followers. Instagram's own account was ranked first with roughly 672 million followers.

How popular is Instagram?

Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.

Who uses Instagram?

Instagram audiences are predominantly young – recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media for teens and one of the social networks with the biggest reach among teens in the United States.

Celebrity influencers on Instagram

Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.

Coronavirus (COVID-19) Tweets Dataset
ieee-dataport.org
search.datacite.org
+1more
Updated May 7, 2025
+ more versions
Rabindra Lamsal (2025). Coronavirus (COVID-19) Tweets Dataset [Dataset]. https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset
Explore at:
Dataset updated
May 7, 2025
Authors
Rabindra Lamsal
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
2020
Deniers and believers in climate change discourse on twitter, and anti/pro...
repod.icm.edu.pl
txt
Updated Jun 23, 2025
Mahmoudi, Amin; Jemielniak, Dariusz; Ciechanowski, Leon (2025). Deniers and believers in climate change discourse on twitter, and anti/pro positions in ukraine war and vaccine discourse [Dataset]. http://doi.org/10.18150/FVIMEK
Explore at:
txt (235514081), txt (83611532) (available download formats)
Unique identifier
https://doi.org/10.18150/FVIMEK
Dataset updated
Jun 23, 2025
Dataset provided by
RepOD
Authors
Mahmoudi, Amin; Jemielniak, Dariusz; Ciechanowski, Leon
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Ukraine
Dataset funded by
National Science Centre (Poland)
Description
We acquired the data from the George Washington University Libraries Dataverse, the Climate Change Tweets Ids [Data set]. This dataset was collected from the Twitter API using Social Feed Manager and totalled 39,622,026 tweets related to climate change. The tweets were collected between September 21, 2017 and May 17, 2019. However, there is a gap in data collection between January 7, 2019 and April 17, 2019. The tweets with the following hashtags and keywords were scraped: climatechange, #climatechangeisreal, #actonclimate, #globalwarming, #climatechangehoax, #climatedeniers, #climatechangeisfalse, #globalwarminghoax, #climatechangenotreal, climate change, global warming, climate hoax.

Due to Twitter's Developer Policy, only the tweet IDs were shared in the database, not the full tweets. Therefore, we had to hydrate the tweet IDs using the Hydrator application. We carried out the hydration in June 2020, which allowed us to obtain 22,564,380 tweets (some tweets or user accounts are deleted or suspended by Twitter in its standard maintenance procedures). Challenges encountered during data hydration included dealing with deleted tweets or suspended user accounts, which is a common occurrence in Twitter's standard maintenance procedures. We addressed this by using the Hydrator application, which allowed us to recover as much data as possible within the constraints of Twitter's Developer Policy.

In order to comprehensively diagnose Polish social networks and to enable automated classification of Twitter users in terms of their attitude towards vaccinations, we collected a balanced, importance-wise database of Twitter users for manual annotation. The most important keywords used by groups that spread anti-vaccination propaganda were identified. Using our programming pipeline, databases of Polish social media on the topic of the pandemic and attitudes towards vaccinations were obtained.
The raw data contained over 5 million tweets from almost 3600 users with the following hashtags related to the COVID-19 pandemic in Poland and the war in Ukraine: stopsegregacjisanitarnej, nieszczepimysie, szczepimysie, szczepienie, szczepienia, koronawirus, koronawiruswpolsce, koronawiruspolska, rozliczymysanitarystow, stopss, covid, covid19, sanitaryzm, epidemia, pandemia, plandemia, zelensky, zelenski, wojna, muremzabraunem, konfederacja, wojnanaukrainie, putin, ukraina, ukraine, rosja, russia, wolyn, bandera, upa. Twelve annotators rated the scraped Twitter users based on their posts on a nine-point Likert scale. Samples evaluated by annotators partially overlapped in order to examine their consistency and reliability. Statistical tests performed on data before and after binning (in three- and two-category versions) confirmed significant annotator agreement. Fleiss' kappa, Randolph's kappa, Krippendorff's alpha, and intraclass correlation coefficients indicate non-random agreement among the competent judges (annotators).

Our initial data acquisition based on the abovementioned hashtags yielded 5,308,997 posts. To focus specifically on discussions related to COVID-19 and the war in Ukraine, we implemented a filtering process using Polish word stems relevant to these topics. This step reduced our dataset to 4,840,446 posts. The filtering was performed using regular expressions based on lemmatized versions of key terms. For war-related content, we used stems such as 'wojna' (war), 'inwazj' (invasion), 'ukrai' (Ukraine), and 'putin'. For COVID-related content, we used stems like 'mask' (mask), 'szczepi' (vaccine), and 'koronawirus' (coronavirus). This approach allowed us to capture various grammatical forms of these words.

Following this initial filtering, we removed three users who had no posts related to either COVID-19 or the war in Ukraine. This step left us with 3,597 users and 4,839,995 posts.
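The stem-based filter described above can be sketched with ordinary regular expressions. This is an illustration rather than the authors' pipeline: it uses only the example stems quoted in the description (the full stem lists are not given), matches them as case-insensitive substrings of raw text rather than of lemmatized text, and the function names are hypothetical.

```python
import re

# Example stems quoted in the description; the complete lists are not published here.
WAR_STEMS = ["wojna", "inwazj", "ukrai", "putin"]
COVID_STEMS = ["mask", "szczepi", "koronawirus"]


def build_pattern(stems):
    """Compile one case-insensitive alternation over all stems."""
    return re.compile("|".join(re.escape(s) for s in stems), re.IGNORECASE)


WAR_RE = build_pattern(WAR_STEMS)
COVID_RE = build_pattern(COVID_STEMS)


def topics(text):
    """Return the set of topics ('war', 'covid') whose stems occur in the text."""
    found = set()
    if WAR_RE.search(text):
        found.add("war")
    if COVID_RE.search(text):
        found.add("covid")
    return found


def keep_post(text):
    """A post survives the filter if it matches at least one topic."""
    return bool(topics(text))
```

Matching on a stem such as 'szczepi' is what lets a single pattern cover inflected forms like "szczepienia" or "szczepionka".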
Finally, to ensure consistency in our analysis, we selected only posts in the Polish language. This final step resulted in our dataset of 3,577,040 posts from 3,597 users. Before the tweet content analysis was performed, the text was lemmatized, and special characters, links, and low-importance words based on a stop list (e.g. conjunctions) were removed.

Data preprocessing was carried out in the Python programming language using specific libraries and our original code. The hydrated tweets were further cleaned by removing duplicates and all tweets that had no English language label. Some characters and technical expressions were then replaced with natural language terms (e.g., changing “&” into “and”). We also created several versions of the database for various purposes: in some of them we replaced emoji pictures with their descriptions (using the demoji library and our original code); in others we removed the emojis, hyperlinks, and special characters. The resulting dataset comprises 24,083,452 tweets (7,741,602 tweets without retweets), which makes it the biggest database of social media data referring to climate change analyzed to date.

We created the directed social network graph using the RAPIDS cuGraph library in Python for most of the network statistics calculations, and also using graph-tool. The final graph visualization was created with Gephi after preparing and filtering the data in Python.
The final graph had 4,398,368 nodes and 18,595,472 edges, after removing duplicates and self-loops.

The final label of "believer," "denier," or "neutral/unknown" was assigned to users present across annotators through the averaging of results from multiple annotators.

In the Ukraine dataset, the term 'anti-group' refers to various tactics of information warfare aimed at discrediting Ukraine's sovereignty and legitimacy, whereas the 'pro-group' consists of tweets that support Ukraine's sovereignty and legitimacy. In the Vaccine dataset, 'anti' denotes a group of users who publish tweets against vaccination, while 'pro' users advocate for vaccination programs. In the Climate Change dataset, 'denier' users dismiss it as a conspiracy theory, while 'believer' users perceive climate change as a serious threat to the future of humanity.

For ClimateChange dataset, the creationdate indicates when the connection between two users was established. The user1 and user2 fields are anonymized unique IDs representing the source and target users, respectively. Specifically, user1 is the unique ID of the source, while user2 is the unique ID of the target. The user1status denotes whether user1 is a believer (1), neutral (2), or denier (3). The creationday is a numeric value tied to the creation date. The onset and terminus fields mark the first and last days of any recorded interaction between user1 and user2, respectively, and duration captures the total time they have interacted. Finally, the w field indicates the number of interactions (such as replies, retweets, or direct messages) exchanged between them in a Twitter context.

In the Ukraine war and Vaccine dataset, the “createdate” indicates the date of that interaction. The “likecount,” “retweetcount,” “replycount,” and “quotecount” columns capture various engagement metrics on Twitter: how many times a tweet is liked, retweeted, replied to, or quoted.
The “user1” and “user2” fields store unique user IDs, whereas “user1proukraine,” “user1provaccine,” “user2proukraine,” and “user2provaccine” denote each user’s stance (e.g., pro, anti, or unknown) regarding Ukraine and vaccines. The “creationday” is a numeric value corresponding to the creation date, while “onset” and “terminus” mark the first and last recorded interactions between user1 and user2, respectively. Finally, “duration” shows the total time span across which these interactions took place.
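To make the edge fields concrete, the sketch below collapses raw directed interactions into weighted edges and drops self-loops, mirroring the "removing duplicates and self-loops" step. It is a plain-Python stand-in, not the RAPIDS cuGraph or graph-tool code the authors used, and treating the per-pair interaction count as the `w` field is an assumption.

```python
from collections import Counter


def build_edges(interactions):
    """Aggregate directed (user1, user2) interactions into weighted edges.

    Repeated pairs are merged into one edge whose weight counts the
    interactions; pairs where a user interacts with themselves are dropped.
    """
    weights = Counter()
    for user1, user2 in interactions:
        if user1 == user2:
            continue  # self-loop
        weights[(user1, user2)] += 1
    return weights


def node_set(edges):
    """All users that appear as source or target of at least one edge."""
    return {user for pair in edges for user in pair}


# Made-up anonymized IDs for illustration.
raw = [("u1", "u2"), ("u1", "u2"), ("u2", "u1"), ("u3", "u3")]
edges = build_edges(raw)
print(len(node_set(edges)), "nodes,", len(edges), "edges")
```

Because the graph is directed, ("u1", "u2") and ("u2", "u1") remain separate edges.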
World - Twitter Sentiment By Country
kaggle.com
Updated Nov 10, 2020
William Jiang (2020). World - Twitter Sentiment By Country [Dataset]. https://www.kaggle.com/wjia26/twittersentimentbycountry/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 10, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
William Jiang
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Area covered
World
Description

Introduction

Ever wondered what people are saying about certain countries? Whether it's in a positive/negative light? What are the most commonly used phrases/words to describe the country? In this dataset I present tweets where a certain country gets mentioned in the hashtags (e.g. #HongKong, #NewZealand). It contains around 150 countries in the world. I've added an additional field called polarity which has the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!

Content

Each row represents a tweet. Creation Dates of Tweets Range from 12/07/2020 to 25/07/2020. Will update on a Monthly cadence.

- The Country can be derived from the file_name field. (this field is very Tableau friendly when it comes to plotting maps)
- The Date at which the tweet was created can be got from created_at field.
- The Search Query used to query the Twitter Search Engine can be got from search_query field.
- The Tweet Full Text can be got from the text field.
- The Sentiment can be got from polarity field. (I've used the Vader Model from NLTK to compute this.)
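If the polarity field holds VADER's compound score (an assumption; the description does not say which VADER output was stored), a common way to turn it into a label is the ±0.05 cutoff suggested in VADER's own documentation. A minimal sketch with a hypothetical helper name:

```python
def polarity_label(polarity: float, threshold: float = 0.05) -> str:
    """Map a score in [-1, 1] to a coarse sentiment label.

    The 0.05 threshold follows the convention usually quoted for VADER
    compound scores; it is not taken from this dataset.
    """
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"
```

Grouping such labels by the country derived from file_name would give a rough per-country sentiment breakdown.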

Notes

There may be slight duplication of tweet IDs before 22/07/2020. I have since fixed this bug.

Acknowledgements

Thanks to the tweepy package for making the data extraction via Twitter API so easy.

Shameless Plug

Feel free to check out my blog if you want to learn how I built the datalake via AWS or for other data shenanigans.

Here's an App I built using a live version of this data.
3805 Tweet IDs from User 25073877 [Thu Feb 25 16:35:12 +0000 2016 to Mon Apr...
city.figshare.com
txt
Updated Apr 3, 2017
Ernesto Priego (2017). 3805 Tweet IDs from User 25073877 [Thu Feb 25 16:35:12 +0000 2016 to Mon Apr 03 12:51:01 +0000 2017] [Dataset]. http://doi.org/10.6084/m9.figshare.4811284.v1
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.4811284.v1
Dataset updated
Apr 3, 2017
Dataset provided by
City, University of London
Authors
Ernesto Priego
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a CSV file containing Tweet IDs of 3,805 Tweets from user ID 25073877 posted publicly between Thursday February 25 2016 16:35:12 +0000 and Monday April 03 2017 12:51:01 +0000.

This file does not include the Tweets' texts or URLs. Columns in the file are: id_str, from_user_id_str, created_at, time, source, user_followers_count, user_friends_count.

Motivations to Share this Data

Archived Tweets can provide interesting insights for the study of contemporary history of media, politics, diplomacy, etc. The queried account is a public account widely agreed to be of exceptional national and international public interest. Though they provide public access to tweeted content in real time, Twitter Web and mobile clients are not suited for appropriate Tweet corpus analysis. For anyone researching social media, access to the data is absolutely essential in order to perform, review and reproduce studies. Archiving Tweets of public interest due to their historic significance is a means to both preserve and enable reproducible study of this form of rapid online communication that otherwise can very likely become unretrievable as time passes. Due to Twitter's current business model and API limits, to date collecting in real time is the only relatively reliable method to archive Tweets at a small scale.

Methodology and Limitations

The Tweets contained in this file were collected by Ernesto Priego using a Python script. The data collection search query was from:realdonaldtrump. A trigger was scheduled to collect automatically every hour. The original data harvesting was refined to delete duplications, to subscribe to Twitter's Terms and Conditions and so that the data was sorted in chronological order.

Duplication of data due to the automated collection is possible, so further data refining might be required. The file may not contain data from Tweets deleted by the queried user account immediately after original publication. Both research and experience show that the Twitter search API is not 100% reliable (Gonzalez-Bailon, Sandra, et al. 2012).

Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet posted by the queried account during the indicated period. This file dataset is shared for archival, comparative and indicative educational research purposes only. The content included is from a public Twitter account and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.

The original Tweets, their contents and associated metadata were published openly on the Web from the queried public account and are the responsibility of the original authors. Original Tweets are likely to be copyright their individual authors but please check individually.

No private personal information is shared in this dataset. As indicated above, this dataset does not contain the text of the Tweets. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road.

This dataset is shared to archive, document and encourage open educational research into political activity on Twitter.

Other Considerations

All Twitter users agree to Twitter's Privacy and data sharing policies. Social media research remains in its infancy and though work has been done to develop best practices there is yet no agreement on a series of grey areas relating to research methodologies, including ad hoc social media specific research ethics guidelines for reproducible research.

Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. Reproducibility is considered here a key value for robust and trustworthy research. Different scholarly professional associations like the Modern Language Association recognise Tweets, datasets and other online and digital resources as citeable scholarly outputs.

The data contained in the deposited file is otherwise available elsewhere through different methods.
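Since the description warns that duplicates may remain and that rows are meant to be in chronological order, a short clean-up pass can be sketched as follows. It assumes `created_at` uses the same format as the timestamps in the title (for example "Thu Feb 25 16:35:12 +0000 2016") and that `id_str` uniquely identifies a Tweet; the function names are hypothetical, and parsing day and month names with `strptime` relies on an English (C) locale.

```python
from datetime import datetime

# Format of timestamps such as "Thu Feb 25 16:35:12 +0000 2016".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value: str) -> datetime:
    """Parse a Twitter-style timestamp into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


def dedupe_and_sort(rows):
    """Keep the first row seen per id_str, then sort chronologically."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row["id_str"], row)
    return sorted(first_seen.values(), key=lambda r: parse_created_at(r["created_at"]))
```

The rows can come from `csv.DictReader` over the deposited file, provided its header matches the column names listed in the description.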
Data from: Tracking the Global Pulse: The first public Twitter dataset from...
data.mendeley.com
Updated May 27, 2025
kheir eddine daouadi (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup [Dataset]. http://doi.org/10.17632/gw3mcnbkwr.2
Explore at:
Unique identifier
https://doi.org/10.17632/gw3mcnbkwr.2
Dataset updated
May 27, 2025
Authors
kheir eddine daouadi
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
World
Description
The first public large-scale multilingual Twitter dataset related to the FIFA World Cup 2022, comprising over 28 million posts in 69 unique spoken languages, including Arabic, English, Spanish, French, and many others. This dataset aims to facilitate research in future sentiment analysis, cross-linguistic studies, event-based analytics, meme and hate speech detection, fake news detection, and social manipulation detection.

The file 🚨Qatar22WC.csv🚨 contains tweet-level and user-level metadata for our collected tweets.

🚀Codebook for FIFA World Cup 2022 Twitter Dataset🚀

| Column Name | Description |
|---|---|
| day, month, year | The date where the tweet posted |
| hou, min, sec | Hour, minute, and second of tweet timestamp |
| age_of_the_user_account | User Account age in days |
| tweet_count | Total number of tweets posted by the user |
| location | User-defined location field |
| follower_count | Number of followers the user has |
| following_count | Number of accounts the user is following |
| follower_to_Following | Follower-following ratio |
| favouite_count | Number of likes the user did |
| verified | Boolean indicating if the user is verified (1 = Verified, 0 = Not Verified) |
| Avg_tweet_count | Average tweets per day for the user activity |
| list_count | Number of lists the user is a member |
| Tweet_Id | Tweet ID |
| is_reply_tweet | ID of the tweet being replied to (if applicable) |
| is_quote | boolean representing if the tweet is a quote |
| retid | Retweet ID if it's a retweet; NaN otherwise |
| lang | Language of the tweet |
| hashtags | The keyword or hashtag used to collect the tweet |
| is_image | Boolean indicating if the tweet associated with image |
| is_video | Boolean indicating if the tweet associated with video |
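Because the codebook splits the timestamp across six columns, analyses by time usually start by recombining them. The sketch below assumes the columns are named exactly day, month, year, hou, min and sec and hold numeric values (possibly read as strings or floats); the time zone is not stated in the codebook, so the result is a naive datetime. The function name is hypothetical.

```python
from datetime import datetime


def tweet_timestamp(row) -> datetime:
    """Rebuild a naive datetime from the split date and time columns."""
    def as_int(key):
        # Tolerates values such as "20", 20 or "20.0".
        return int(float(row[key]))

    return datetime(
        as_int("year"), as_int("month"), as_int("day"),
        as_int("hou"), as_int("min"), as_int("sec"),
    )
```

With rows from `csv.DictReader`, `tweet_timestamp(row)` can serve as a sort key or be bucketed by hour to follow activity around individual matches.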

Examples of use case queries are described in the file 🚨fifa_wc_qatar22_examples_of_use_case_queries.ipynb🚨 and accessible via: https://github.com/khairied/Qata_FIFA_World_Cup_22

🚀 Please Cite This as: Daouadi, K. E., Boualleg, Y., Guehairia, O. & Taleb-Ahmed, A. (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup, Journal of Computational Social Science.
