100+ datasets found

SenTopX: A Benchmark Twitter Dataset for User Sentiment on Various Topics
zenodo.org
csv, zip
Updated May 27, 2024
Cite
Hina Qayyum (2024). SenTopX: A Benchmark Twitter Dataset for User Sentiment on Various Topics [Dataset]. http://doi.org/10.5281/zenodo.11243662
Explore at:
zip, csv (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.11243662
Dataset updated
May 27, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Hina Qayyum
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
May 25, 2024
Description
This is a longitudinal Twitter dataset of 143K users during the period 2017-2021. The following is the detail of all the files:

SenTopX_userIDs.txt: contains user IDs of 143K Twitter users.

userIDs_tweetIDs.zip: contains Tweet IDs of users, the name of the file is the user ID and the file contains the list of all the tweet IDs.

users_16_perspective_toxicity_scores.csv: contains user IDs and 16 Perspective API scores per user; the score vector is shared as the mean, median, and Gini index calculated over all tweets of a user.
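
As a rough illustration of the three per-user aggregates named above, the following sketch computes a mean, median, and Gini index for one score over one user's tweets. The function names are hypothetical, and the Gini formulation shown (the standard rank-weighted form for non-negative values) is an assumption; the dataset description does not state which variant was used.

```python
from statistics import mean, median


def gini(values):
    """Gini index of non-negative values (0 = perfectly even, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum; equivalent to the mean-absolute-difference definition.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def summarise_scores(scores):
    """Aggregate one Perspective-style score over all tweets of a single user."""
    return {"mean": mean(scores), "median": median(scores), "gini": gini(scores)}


# Hypothetical toxicity scores for four tweets by one user.
print(summarise_scores([0.05, 0.10, 0.10, 0.75]))
```

A user whose scores are all equal gets a Gini index of 0, while a user with one high-scoring tweet among many low-scoring ones gets a value closer to 1.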

LDAvis_top30_words_for_extracted_topics.csv: contains the 30 most relevant words for each topic extracted by tweet-level topic modeling using the BERTweet topic model.

topic_modelling_statistics_per_user.csv contains important and relevant statistics related to topic modeling results:

1. user: This column represents the identifier for the user. Each row in the CSV corresponds to a specific user, and this column helps to track and differentiate between the users.

2. avg_topic_probability: This column contains the average probability of the topics for each user calculated across all of the tweets in order to compare users in a meaningful way. It represents the average likelihood that a particular user discusses various topics over the observed period.

3. maximum_topic_avg: This column holds the value of the highest average probability among all topics for each user. It indicates the topic that the user most frequently discusses, on average.

4. index_max_avg_topic_probability_200: This column specifies the index or identifier of the topic with the highest average probability out of 200 possible topics. It shows which topic (out of 200) the user discusses the most.

5. global_avg: This column includes the global average probability of topics across all users. It provides a baseline or overall average topic probability that can be used for comparative purposes.

6. max_global_avg: This column contains the maximum global average probability across all topics for all users. It identifies the most discussed topic across the entire user base.

7. index_max_global_avg: This column shows the index or identifier of the topic with the highest global average probability. It indicates which topic (out of 200) is the most popular across all users.

8. entropy_200_topic: This column represents the entropy of the topics for each user, calculated over 200 topics. Entropy measures the diversity or unpredictability in the user's discussion of topics, with higher entropy indicating more varied topic discussion.

In summary, these columns are used to analyze the topic engagement and preferences of users on a platform, highlighting the most frequently discussed topics, the variability in topic discussions, and how individual user behavior compares to overall trends.
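
The per-user columns above can be sketched as follows. This is one plausible reading only: it assumes each tweet has a probability vector over topics, that per-user averages are taken topic by topic, and that entropy is the Shannon entropy (natural log) of the averaged vector. The function name and the three-topic toy data are hypothetical; the real files use 200 topics.

```python
import math


def topic_stats(tweet_topic_probs):
    """Summarise one user's topic distributions (one probability vector per tweet)."""
    n_tweets = len(tweet_topic_probs)
    n_topics = len(tweet_topic_probs[0])
    # Average each topic's probability across all of the user's tweets.
    avg = [sum(row[k] for row in tweet_topic_probs) / n_tweets for k in range(n_topics)]
    max_avg = max(avg)
    # Shannon entropy of the averaged distribution; higher means more varied topics.
    entropy = -sum(p * math.log(p) for p in avg if p > 0)
    return {
        "avg_topic_probability": avg,
        "maximum_topic_avg": max_avg,
        "index_max_avg_topic": avg.index(max_avg),
        "entropy": entropy,
    }


# Two tweets by one user, three topics.
stats = topic_stats([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
print(stats["index_max_avg_topic"], round(stats["entropy"], 3))
```

A user who spreads probability evenly over all topics reaches the maximum entropy, log(number of topics); a user focused on a single topic approaches 0.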
#FilmYourHospital Twitter Dataset: a COVID-19 conspiracy theory on Twitter
borealisdata.ca
Updated May 20, 2021
Cite
Anatoliy Gruzd; Philip Mai (2021). #FilmYourHospital Twitter Dataset: a COVID-19 conspiracy theory on Twitter [Dataset]. http://doi.org/10.5683/SP2/BSGQGS
Unique identifier
https://doi.org/10.5683/SP2/BSGQGS
Dataset updated
May 20, 2021
Dataset provided by
Borealis
Authors
Anatoliy Gruzd; Philip Mai
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The dataset contains 99,039 Tweet IDs of Twitter posts with #FilmYourHospital. It was collected using Netlytic.org between March 28 and April 9, 2020, by querying Twitter Search API (ver.1) every 15 minutes. NOTES: 1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset. 2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or Python library Twarc (https://github.com/DocNow/twarc). For more info about this dataset, read the following paper: https://doi.org/10.1177/2053951720938405
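
Before handing an ID list to a rehydration tool, it can help to clean and batch it. The sketch below assumes the IDs are stored one per line as plain text, and it uses a batch size of 100 on the assumption that this matches the per-request limit of the tweet lookup endpoints; both function names are hypothetical and neither Hydrator nor Twarc is called here.

```python
from itertools import islice


def read_tweet_ids(lines):
    """Yield unique, purely numeric tweet IDs, preserving first-seen order."""
    seen = set()
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet_id


def batch_ids(ids, size=100):
    """Group IDs into lists of at most `size` items."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Toy input with a duplicate and a malformed line.
raw = ["1244000000000000001\n", "1244000000000000002\n", "1244000000000000001\n", "n/a\n"]
for batch in batch_ids(read_tweet_ids(raw), size=2):
    print(batch)
```

Each batch can then be passed to whichever rehydration client is in use.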
A Twitter Dataset of 100+ million tweets related to COVID-19
zenodo.org
application/gzip, csv +1
Updated Apr 17, 2023
Cite
Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell (2023). A Twitter Dataset of 100+ million tweets related to COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.3735274
Explore at:
application/gzip, tsv, csv (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.3735274
Dataset updated
Apr 17, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell
Description
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts because they were filtered from other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 30th and yielded over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to February 27th, to provide extra longitudinal coverage.

The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (101,400,452 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (20,244,746 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
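
For readers who want to compute comparable term, bigram, and trigram rankings on their own hydrated text, a minimal sketch follows. It assumes lowercased whitespace tokenization, which is an illustrative choice only; the preprocessing behind the released frequency files is not described here, and the function name is hypothetical.

```python
from collections import Counter


def top_ngrams(texts, n=1, k=1000):
    """Return the k most frequent n-grams as (ngram_string, count) pairs."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Zip n staggered views of the token list to form consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [(" ".join(gram), count) for gram, count in counts.most_common(k)]


# Toy corpus standing in for hydrated tweet text.
corpus = ["Stay home stay safe", "stay home"]
print(top_ngrams(corpus, n=1, k=2))
print(top_ngrams(corpus, n=2, k=1))
```

Setting n to 1, 2, or 3 with k=1000 mirrors the three frequency files described above.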

More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter

As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated to be used.
X/Twitter: distribution of global audiences 2024, by gender
statista.com
flwrdeptvarieties.store
Updated May 22, 2024
Cite
X/Twitter: distribution of global audiences 2024, by gender [Dataset]. https://www.statista.com/statistics/828092/distribution-of-users-on-twitter-worldwide-gender/
Dataset updated
May 22, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 2024
Area covered
Worldwide
Description
As of January 2024, micro-blogging platform X (formerly Twitter) was more popular with men than women, with male audiences accounting for 60.9 percent of global users. Additionally, users between the ages of 25 and 34 were particularly active on X/Twitter, making up more than 38 percent of users worldwide.

How many people use X/Twitter?

Although X/Twitter holds its status as a mainstream social media site, it falls short in comparison to other well-known platforms in terms of user numbers. As of early 2022, X/Twitter had around 436 million monthly active users, whilst Meta’s Facebook reached almost three billion MAU. Overall, the United States is home to over 105 million X/Twitter users, making up Twitter’s largest audience base, followed by Japan, India, and the United Kingdom, respectively.

How is Twitter used?

X/Twitter is utilized by its audience for many different purposes. In May 2021, over 80 percent of high-volume X/Twitter users (defined as users who tweet around 20 times per month) in the United States reported using the platform for entertainment, whilst 78 percent said they used it as a way to stay informed. High-volume X/Twitter users were far more likely to use the service as a means of expressing their opinion. Furthermore, in 2022, over half of social media users in the U.S. used Twitter as a news resource.
Posts on X/Twitter mentioning "nuclear" throughout 2022, by sentiment
statista.com
Updated Jun 18, 2024
Cite
Posts on X/Twitter mentioning "nuclear" throughout 2022, by sentiment [Dataset]. https://www.statista.com/statistics/1472764/posts-x-twitter-that-mentioned-nuclear-sentiment/
Dataset updated
Jun 18, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Feb 24, 2022 - Oct 31, 2022
Area covered
Worldwide
Description
According to a report conducted in 2022, posts on X (formerly Twitter) containing the term "nuclear" were mainly negative in sentiment between February and October 2022. The share of posts mentioning "nuclear" with negative connotations increased to 65 percent in March 2022, up from 55 percent in February, following Russia's invasion of Ukraine. Negative-sentiment posts using "nuclear" also increased between August and October 2022, linked to the situation at the Zaporizhia nuclear power plant.
COVID-19 Twitter Dataset
borealisdata.ca
Updated Nov 10, 2020
Cite
Anatoliy Gruzd; Philip Mai (2020). COVID-19 Twitter Dataset [Dataset]. http://doi.org/10.5683/SP2/PXF2CU
Unique identifier
https://doi.org/10.5683/SP2/PXF2CU
Dataset updated
Nov 10, 2020
Dataset provided by
Borealis
Authors
Anatoliy Gruzd; Philip Mai
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The current dataset contains 237M Tweet IDs for Twitter posts that mentioned "COVID" as a keyword or as part of a hashtag (e.g., COVID-19, COVID19) between March and July of 2020. Sampling Method: hourly requests sent to Twitter Search API using Social Feed Manager, an open source software that harvests social media data and related content from Twitter and other platforms. NOTE: 1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset. 2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or Python library Twarc (https://github.com/DocNow/twarc). 3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included in the dataset either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because tweets used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV). 4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as: https://github.com/thepanacealab/covid19_twitter https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset https://github.com/echen102/COVID-19-TweetIDs
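
The comparing/merging suggestion in note 4 can be sketched as a set union with simple overlap counts. This assumes each dataset has already been parsed into a collection of tweet-ID strings (each of the linked repositories has its own file layout, which is not handled here), and the function name is hypothetical.

```python
def merge_id_sets(primary, *others):
    """Return the union of several tweet-ID collections plus overlap counts.

    The overlap list holds, for each extra collection, how many of its IDs
    already appear in the primary collection.
    """
    primary = set(primary)
    merged = set(primary)
    overlaps = []
    for other in others:
        other = set(other)
        overlaps.append(len(primary & other))
        merged |= other
    return merged, overlaps


# Toy ID lists standing in for three separately collected datasets.
merged, overlaps = merge_id_sets(["1", "2", "3"], ["3", "4"], ["5"])
print(len(merged), overlaps)
```

A low overlap count suggests the datasets sampled different parts of the conversation, which is the case where merging adds the most coverage.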
twitter-dataset-tesla
huggingface.co
Updated Jul 11, 2022
Cite
fastai X Hugging Face Group 2022 (2022). twitter-dataset-tesla [Dataset]. https://huggingface.co/datasets/hugginglearners/twitter-dataset-tesla
Dataset updated
Jul 11, 2022
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
fastai X Hugging Face Group 2022
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Dataset Card for Twitter Dataset: Tesla

Dataset Summary

This dataset contains all the Tweets regarding #Tesla or #tesla up to 12/07/2022 (dd-mm-yyyy). It can be used for sentiment analysis research, for other NLP tasks, or just for fun. It contains 10,000 recent Tweets with the user ID, the hashtags used in the Tweets, and other important features.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/twitter-dataset-tesla.
Data from: Twitter Dataset on the Russo-Ukrainian War
zenodo.org
Updated Oct 20, 2023
Cite
Alexander Shevtsov; Despoina Antonakaki; Ioannis Lamprou; Sotiris Ioannidis; Polyvios Pratikakis (2023). Twitter Dataset on the Russo-Ukrainian War [Dataset]. http://doi.org/10.5281/zenodo.8431047
Unique identifier
https://doi.org/10.5281/zenodo.8431047
Dataset updated
Oct 20, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alexander Shevtsov; Despoina Antonakaki; Ioannis Lamprou; Sotiris Ioannidis; Polyvios Pratikakis
Time period covered
Feb 23, 2022
Area covered
Ukraine
Description
On 24 February 2022, Russia invaded Ukraine, in what is now also known as the Russo-Ukrainian War. We obtained our dataset through the Twitter API from 23 February 2022 until 23 June 2023. The collected dataset has 127,275,386 tweets, shared in the form of anonymized text, where the tweet/user IDs and user mentions are anonymized and do not provide any personal information. The provided dataset contains user discussion in more than 70 languages, of which the 20 most popular are: 'eng', 'fr', 'de', 'mix', 'it', 'es', 'ja', 'ru', 'pl', 'uk', 'tr', 'th', 'hi', 'qme', 'qht', 'nl', 'fi', 'ar', 'zh' and 'pt'. For the purpose of information integrity, tweets are separated and stored in different files ordered by creation date. The provided dataset is shared for further research purposes. Additionally, we provide the list of tweet IDs in the GitHub repository, which can be retrieved via the Twitter API. Furthermore, we carried out some initial analysis, including volume/activity, hashtag popularity, sentiment, and military intelligence, and published the results in the web portal.
#RoeOverturned: Twitter Dataset on the Abortion Rights Controversy
dataverse.harvard.edu
search.dataone.org
Updated Feb 6, 2023
Cite
Ashwin Rao; Rong-Ching Chang; Qiankun Zhong; Magdalena Wojcieszak; Kristina Lerman (2023). #RoeOverturned: Twitter Dataset on the Abortion Rights Controversy [Dataset]. http://doi.org/10.7910/DVN/STU0J5
Unique identifier
https://doi.org/10.7910/DVN/STU0J5
Dataset updated
Feb 6, 2023
Dataset provided by
Harvard Dataverse
Authors
Ashwin Rao; Rong-Ching Chang; Qiankun Zhong; Magdalena Wojcieszak; Kristina Lerman
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
On June 24, 2022, the United States Supreme Court overturned landmark rulings made in its 1973 verdict in Roe v. Wade. The justices, by way of a majority vote in Dobbs v. Jackson Women's Health Organization, decided that abortion wasn't a constitutional right and returned the issue of abortion to the elected representatives. This decision triggered multiple protests and debates across the US, especially in the context of the midterm elections in November 2022. Given that many citizens use social media platforms to express their views and mobilize for collective action, and given that online debate has tangible effects on public opinion, political participation, news media coverage, and political decision-making, it is crucial to understand online discussions surrounding this topic. Toward this end, we present the first large-scale Twitter dataset collected on the abortion rights debate in the United States. We present a set of 74M tweets systematically collected over the course of one year from January 1, 2022 to January 6, 2023.
X/Twitter: average replies on posts 2023-2024
statista.com
Updated Aug 8, 2024
Cite
Statista (2024). X/Twitter: average replies on posts 2023-2024 [Dataset]. https://www.statista.com/statistics/1483830/x-twitter-average-replies-posts/
Dataset updated
Aug 8, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Sep 2023 - Mar 2024
Area covered
Worldwide
Description
In 2024, X (formerly Twitter) posts had an average of 3.4 replies, up from an average of 1.64 replies in 2023. Elon Musk's X account is the profile with the most followers on the platform.
TRACES Bulgarian Twitter Dataset on Covid-19 Annotated with Linguistic...
data.niaid.nih.gov
zenodo.org
Updated Apr 16, 2023
Cite
Silvia Gargova (2023). TRACES Bulgarian Twitter Dataset on Covid-19 Annotated with Linguistic Markers of Lies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7614246
Dataset updated
Apr 16, 2023
Dataset provided by
Veneta Kireva
Tsvetelina Stefanova
Silvia Gargova
Irina Temnikova
Description
This dataset has been created within Project TRACES (more information: https://traces.gate-ai.eu/). The dataset contains 61411 tweet IDs of tweets written in Bulgarian, with annotations. The dataset can be used for general purposes or for building lie and disinformation detection applications.

Note: this dataset is not fact-checked; the social media messages have been retrieved via keywords. For fact-checked datasets, see our other datasets.

The tweets (written between 1 Jan 2020 and 28 June 2022) have been collected via Twitter API under academic access in June 2022 with the following keywords:

(Covid OR коронавирус OR Covid19 OR Covid-19 OR Covid_19) - without replies and without retweets

(Корона OR корона OR Corona OR пандемия OR пандемията OR Spikevax OR SARS-CoV-2 OR бустерна доза) - with replies, but without retweets

Explanations of which fields can be used as markers of lies (or of intentional disinformation) are provided in our forthcoming paper (please cite it when using this dataset):

Irina Temnikova, Silvia Gargova, Ruslana Margova, Veneta Kireva, Ivo Dzhumerov, Tsvetelina Stefanova and Hristiana Nikolaeva (2023) New Bulgarian Resources for Detecting Disinformation. 10th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'23). Poznań. Poland.

Twitter Conversations about the COVID-19 Omicron Variant: A Large Scale...

zenodo.org
dataverse.harvard.edu

txt

Updated Jul 25, 2022

Cite

Nirmalya Thakur (2022). Twitter Conversations about the COVID-19 Omicron Variant: A Large Scale Dataset of more than 500,000 Tweets [Dataset]. http://doi.org/10.5281/zenodo.6804323

Explore at:

txt (available download format)

Unique identifier

https://doi.org/10.5281/zenodo.6804323

Dataset updated

Jul 25, 2022

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Nirmalya Thakur

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Please cite the following paper when using this dataset:

N. Thakur and C.Y. Han, “An Exploratory Study of Tweets about the SARS-CoV-2 Omicron Variant: Insights from Sentiment Analysis, Language Interpretation, Source Tracking, Type Classification, and Embedded URL Detection,” Preprints, 2022, DOI: 10.20944/preprints202205.0238.v2

Abstract

This open-access dataset is one of the salient contributions of the above-mentioned paper. It presents a total of 537,702 Tweet IDs of the same number of Tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

Data Description

The Tweet IDs are presented in 7 different .txt files based on the timelines of the associated tweets. The following table provides the details of these dataset files. The data collection followed a keyword-based approach and tweets comprising the "omicron" keyword were filtered, collected, and added to this dataset.

Filename	No. of Tweet IDs	Date Range of the Tweet IDs
TweetIDs_November.txt	17271	November 24, 2021 to November 30, 2021
TweetIDs_December.txt	101393	December 1, 2021 to December 31, 2021
TweetIDs_January.txt	95055	January 1, 2022 to January 31, 2022
TweetIDs_February.txt	91571	February 1, 2022 to February 28, 2022
TweetIDs_March.txt	100787	March 1, 2022 to March 31, 2022
TweetIDs_April.txt	94409	April 1, 2022 to April 20, 2022
TweetIDs_May.txt	37216	May 1, 2022 to May 12, 2022

In the above table, the last date for May is May 12 as it was the most recent date at the time of data collection and dataset upload. The dataset would be updated soon to incorporate more recent tweets.
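
As a quick consistency check, the per-file counts in the table can be summed and compared with the total of 537,702 Tweet IDs stated in the abstract. The dictionary below simply transcribes the table; the variable names are illustrative.

```python
# Per-file Tweet ID counts transcribed from the table above.
file_counts = {
    "TweetIDs_November.txt": 17271,
    "TweetIDs_December.txt": 101393,
    "TweetIDs_January.txt": 95055,
    "TweetIDs_February.txt": 91571,
    "TweetIDs_March.txt": 100787,
    "TweetIDs_April.txt": 94409,
    "TweetIDs_May.txt": 37216,
}

total = sum(file_counts.values())
print(total)  # 537702, matching the total stated in the abstract
```

The same check is useful after downloading the files: counting the lines in each .txt file should reproduce these numbers.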

The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. For hydrating this dataset the Hydrator application (link to download and a step-by-step tutorial on how to use Hydrator) may be used.

Why Do People Use Twitter?
searchlogistics.com
Cite
Why Do People Use Twitter? [Dataset]. https://www.searchlogistics.com/learn/statistics/twitter-user-statistics/
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
One of the biggest advantages of Twitter is the speed at which information can be passed around. People use Twitter primarily to get news and for entertainment. This is the breakdown of why people use Twitter today.
Data from: IA Tweets Analysis Dataset (Spanish)
data.niaid.nih.gov
produccioncientifica.uca.es
+1 more
Updated Aug 3, 2024
Cite
IA Tweets Analysis Dataset (Spanish) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10821484
Dataset updated
Aug 3, 2024
Dataset provided by
Guerrero-Contreras, Gabriel
Serrano-Fernández, Alejandro
Balderas-Díaz, Sara
Muñoz, Andrés
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General Description

This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

Data Collection Method

Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

Dataset Content

ID: A unique identifier for each tweet.

text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

user_followers_count: The current number of followers the account has. It is a non-negative integer.

user_friends_count: The number of users that the account is following. It is a non-negative integer.

user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.
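
The field list above can be captured as a typed record with a few sanity checks. This is a sketch only: the class and method names are hypothetical, storing ID as a string is an assumption, and polarity is left unconstrained because the description gives its values only as examples.

```python
from dataclasses import dataclass

COUNT_FIELDS = (
    "favorite_count", "retweet_count", "user_followers_count",
    "user_friends_count", "user_favourites_count", "user_statuses_count",
)
BOOL_FIELDS = (
    "user_verified", "user_default_profile", "user_has_extended_profile",
    "user_protected", "user_is_translator",
)


@dataclass
class TweetRecord:
    ID: str  # assumption: kept as text to avoid precision loss on large IDs
    text: str
    polarity: str
    favorite_count: int
    retweet_count: int
    user_verified: bool
    user_default_profile: bool
    user_has_extended_profile: bool
    user_followers_count: int
    user_friends_count: int
    user_favourites_count: int
    user_statuses_count: int
    user_protected: bool
    user_is_translator: bool

    def problems(self):
        """Return a list of violations of the constraints stated in the field list."""
        found = []
        if len(self.text) > 280:
            found.append("text longer than 280 characters")
        for name in COUNT_FIELDS:
            value = getattr(self, name)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                found.append(f"{name} is not a non-negative integer")
        for name in BOOL_FIELDS:
            if not isinstance(getattr(self, name), bool):
                found.append(f"{name} is not a boolean")
        return found
```

When loading the CSV, each row can be converted to a TweetRecord and rows with a non-empty problems() list set aside for inspection.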

Cite as

Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

Potential Use Cases

This dataset is aimed at academic researchers and practitioners with interests in:

Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

Exploring correlations between user engagement metrics and sentiment in discussions about AI.

Data Format and File Type

The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

License

The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
Data from: MonkeyPox2022Tweets: A Large-Scale Twitter Dataset on the 2022...
dataverse.harvard.edu
Updated Nov 19, 2022
Cite
Nirmalya Thakur (2022). MonkeyPox2022Tweets: A Large-Scale Twitter Dataset on the 2022 Monkeypox Outbreak, Findings from Analysis of Tweets, and Open Research Questions [Dataset]. http://doi.org/10.7910/DVN/CR7T5E
Unique identifier
https://doi.org/10.7910/DVN/CR7T5E
Dataset updated
Nov 19, 2022
Dataset provided by
Harvard Dataverse
Authors
Nirmalya Thakur
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
May 7, 2022 - Nov 11, 2022
Description
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: A large-scale Twitter dataset on the 2022 Monkeypox outbreak, findings from analysis of Tweets, and open research questions,” Infect. Dis. Rep., vol. 14, no. 6, pp. 855–883, 2022, DOI: https://doi.org/10.3390/idr14060087. Abstract The mining of Tweets to develop datasets on recent issues, global challenges, pandemics, virus outbreaks, emerging technologies, and trending matters has been of significant interest to the scientific community in the recent past, as such datasets serve as a rich data resource for the investigation of different research questions. Furthermore, the virus outbreaks of the past, such as COVID-19, Ebola, Zika virus, and flu, just to name a few, were associated with various works related to the analysis of the multimodal components of Tweets to infer the different characteristics of conversations on Twitter related to these respective outbreaks. The ongoing outbreak of the monkeypox virus, declared a Global Public Health Emergency (GPHE) by the World Health Organization (WHO), has resulted in a surge of conversations about this outbreak on Twitter, which is resulting in the generation of tremendous amounts of Big Data. There has been no prior work in this field thus far that has focused on mining such conversations to develop a Twitter dataset. Therefore, this work presents an open-access dataset of 571,831 Tweets about monkeypox that have been posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset complies with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management. 
Data Description

The dataset consists of a total of 571,831 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 11th November 2022 (the most recent date at the time of uploading the most recent version of the dataset). The Tweet IDs are presented in 12 different .txt files based on the timelines of the associated tweets. The following represents the details of these dataset files.

Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the associated Tweet IDs: May 7, 2022, to May 21, 2022)
Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the associated Tweet IDs: May 21, 2022, to May 27, 2022)
Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the associated Tweet IDs: May 27, 2022, to June 5, 2022)
Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the associated Tweet IDs: June 5, 2022, to June 11, 2022)
Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 46718, Date Range of the associated Tweet IDs: June 12, 2022, to June 30, 2022)
Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the associated Tweet IDs: July 1, 2022, to July 23, 2022)
Filename: TweetIDs_Part7.txt (No. of Tweet IDs: 105890, Date Range of the associated Tweet IDs: July 24, 2022, to July 31, 2022)
Filename: TweetIDs_Part8.txt (No. of Tweet IDs: 93959, Date Range of the associated Tweet IDs: August 1, 2022, to August 9, 2022)
Filename: TweetIDs_Part9.txt (No. of Tweet IDs: 50832, Date Range of the associated Tweet IDs: August 10, 2022, to August 24, 2022)
Filename: TweetIDs_Part10.txt (No. of Tweet IDs: 39042, Date Range of the associated Tweet IDs: August 25, 2022, to September 19, 2022)
Filename: TweetIDs_Part11.txt (No. of Tweet IDs: 12341, Date Range of the associated Tweet IDs: September 20, 2022, to October 9, 2022)
Filename: TweetIDs_Part12.txt (No. of Tweet IDs: 15404, Date Range of the associated Tweet IDs: October 10, 2022, to November 11, 2022)

Please note: The dataset contains only Tweet IDs, in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. To hydrate this dataset, the Hydrator application may be used (download: https://github.com/DocNow/hydrator/releases; step-by-step tutorial: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweets).
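Before hydrating, it can help to load the ID files and group the IDs into request-sized batches. The following is a minimal Python sketch, not part of the dataset: it assumes each TweetIDs_Part*.txt file holds one numeric Tweet ID per line (the description does not state the file layout), and the batch size of 100 is an assumption chosen to match the usual per-request limit of tweet lookup endpoints. The function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def read_tweet_ids(paths: Iterable[Path]) -> List[str]:
    """Collect Tweet IDs from text files, assuming one numeric ID per line.

    Blank lines and non-numeric lines are skipped, and duplicates are
    dropped while preserving first-seen order.
    """
    seen = set()
    ids: List[str] = []
    for path in paths:
        for line in path.read_text(encoding="utf-8").splitlines():
            tweet_id = line.strip()
            if tweet_id.isdigit() and tweet_id not in seen:
                seen.add(tweet_id)
                ids.append(tweet_id)
    return ids


def batches(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs (assumed batch size)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Hypothetical location: the .txt files sit in the current directory.
    files = sorted(Path(".").glob("TweetIDs_Part*.txt"))
    all_ids = read_tweet_ids(files)
    print(f"{len(all_ids)} unique Tweet IDs in {len(files)} files")
    print(f"{sum(1 for _ in batches(all_ids))} batches of up to 100 IDs")
```

If the files follow the assumed layout, the printed total can be compared against the 571,831 IDs stated in the description as a quick integrity check.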
X/Twitter: number of monthly active users 2010-2019
statista.com
Updated Sep 13, 2023
Cite
Statista (2023). X/Twitter: number of monthly active users 2010-2019 [Dataset]. https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
Dataset updated
Sep 13, 2023
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Worldwide
Description
How many people use X/Twitter?

As of the first quarter of 2019, X/Twitter averaged 330 million monthly active users (MAU), a decline from its all-time high of 336 million MAU in the first quarter of 2018. As of the first quarter of 2019, the company switched its user reporting metric to monetizable daily active users (mDAU).

X/Twitter

X/Twitter is a social networking and microblogging service, enabling registered users to read and post short messages called tweets. X/Twitter messages are limited to 280 characters and users are also able to upload photos or short videos. Tweets are posted to a publicly available profile or can be sent as direct messages to other users.

Part of the social platform’s appeal is the ability of users to follow any other user with a public profile, enabling users to interact with celebrities who regularly post on the social media site. Currently, the most-followed person on Twitter is singer Katy Perry with more than 107 million followers. Twitter has also become an important communications channel for governments and heads of state – U.S. President Donald Trump was the most-followed world leader on Twitter, followed by Pope Francis and Indian Prime Minister Narendra Modi.

Despite the widespread usage among the rich and famous, the decline in active users has not impressed investors, as the platform relies largely on delivering advertising to users to generate revenue. Twitter’s company revenue in 2018 amounted to three billion U.S. dollars, up from 2.44 billion in the preceding fiscal year. Twitter reported a positive annual result for the first time in 2018, when the company generated 1.2 billion U.S. dollars in net income.
Twitter Users Broken down By Country
searchlogistics.com
Cite
Twitter Users Broken down By Country [Dataset]. https://www.searchlogistics.com/learn/statistics/twitter-user-statistics/
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The US has historically been the target country for Twitter since its launch in 2006. This is the full breakdown of Twitter users by country.
Data from: On the Role of Images for Analyzing Claims in Social Media
data.niaid.nih.gov
zenodo.org
Updated Apr 23, 2021
Cite
Ewerth, Ralph (2021). On the Role of Images for Analyzing Claims in Social Media [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4592248
Dataset updated
Apr 23, 2021
Dataset provided by
Müller-Budack, Eric
Hakimov, Sherzod
Cheema, Gullal S.
Ewerth, Ralph
Description
This is a multimodal dataset used in the paper "On the Role of Images for Analyzing Claims in Social Media", accepted at CLEOPATRA-2021 (2nd International Workshop on Cross-lingual Event-centric Open Analytics), co-located with The Web Conference 2021.

The four datasets are curated for two different tasks that broadly come under fake news detection. Originally, the datasets were released as part of challenges or papers for text-based NLP tasks and are further extended here with corresponding images.

clef_en and clef_ar are English and Arabic Twitter datasets for claim check-worthiness detection released in CLEF CheckThat! 2020 by Barrón-Cedeno et al. [1].

lesa is an English Twitter dataset for claim detection released by Gupta et al. [2].

mediaeval is an English Twitter dataset for conspiracy detection released in the MediaEval 2020 Workshop by Pogorelov et al. [3].

Dataset details such as the data curation and annotation process can be found in the cited papers.

The datasets released here with corresponding images are smaller than the original text-based tweet datasets. The data statistics are as follows:

1. clef_en: 281
2. clef_ar: 2571
3. lesa: 1395
4. mediaeval: 1724

Each folder has two sub-folders and a JSON file, data.json, that consists of crawled tweets. The two sub-folders are:

1. images: This contains crawled images with the same name as the tweet-id in data.json.
2. splits: This contains 5-fold splits used for training and evaluation in our paper. Each file in this folder is a csv with two columns
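The folder layout described above can be explored with a short Python sketch. This is an illustration under stated assumptions, not the authors' loading code: the internal structure of data.json is not documented here, so the sketch only parses it without interpreting it, and it treats each image's file name (without extension) as the tweet ID. The folder path clef_en and both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_tweets(dataset_dir: Path) -> Any:
    """Parse data.json as-is; its internal structure is not assumed here."""
    with open(dataset_dir / "data.json", encoding="utf-8") as handle:
        return json.load(handle)


def index_images(dataset_dir: Path) -> Dict[str, Path]:
    """Map tweet ID to image path, taking the file stem as the tweet ID."""
    images_dir = dataset_dir / "images"
    return {
        path.stem: path
        for path in sorted(images_dir.iterdir())
        if path.is_file()
    }


if __name__ == "__main__":
    root = Path("clef_en")  # hypothetical path to one extracted dataset folder
    if root.exists():
        images = index_images(root)
        tweets = load_tweets(root)
        print(f"{len(images)} images indexed, data.json type: {type(tweets).__name__}")
```

Indexing images by file stem makes it straightforward to look up the image for a given tweet ID once the structure of data.json has been inspected.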

Code for the paper: https://github.com/cleopatra-itn/image_text_claim_detection

If you find the dataset and the paper useful, please cite our paper and the corresponding dataset papers [1, 2, 3]:

Cheema, Gullal S., et al. "On the Role of Images for Analyzing Claims in Social Media." 2nd International Workshop on Cross-lingual Event-centric Open Analytics (CLEOPATRA), co-located with The Web Conference 2021.

[1] Barrón-Cedeno, Alberto, et al. "Overview of CheckThat! 2020: Automatic identification and verification of claims in social media." International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham, 2020.
[2] Gupta, Shreya, et al. "LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content." arXiv preprint arXiv:2101.11891 (2021).
[3] Pogorelov, Konstantin, et al. "FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020." MediaEval 2020 Workshop. 2020.
Twitter Friends
kaggle.com
Updated Sep 2, 2016
Cite
Hubert Wassner (2016). Twitter Friends [Dataset]. https://www.kaggle.com/hwassner/TwitterFriends/discussion
Dataset updated
Sep 2, 2016
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Hubert Wassner
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Description
Twitter Friends and hashtags

Context

This dataset is an extract of a wider database aimed at collecting Twitter users' friends (the other accounts a user follows). The overall goal is to study users' interests through who they follow and the connection to the hashtags they have used.

Content

It is a list of Twitter users' information. In the JSON format, each Twitter user is stored as one object in a list of more than 40,000 objects. Each object holds:

avatar : URL to the profile picture

followerCount : the number of followers of this user

friendsCount : the number of accounts this user follows.

friendName : stores the @name (without the '@') of the user (beware, this name can be changed by the user)

id : user ID; this number cannot change (you can retrieve the screen name with this service: https://tweeterid.com/)

friends : the list of IDs the user follows (the data stored are the IDs of users followed by this user)

lang : the language declared by the user (in this dataset there is only "en" (English))

lastSeen : the timestamp of the date when this user posted their last tweet.

tags : the hashtags (with or without #) used by the user. These are the "trending topics" the user tweeted about.

tweetID : ID of the last tweet posted by this user.

You also have the CSV format which uses the same naming convention.
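As an illustration of how the fields listed above might be used, here is a minimal Python sketch. It is not the author's code: the inline sample records are invented for demonstration, the value types (numeric counts, list-valued friends and tags) are assumptions, and the helper names are hypothetical. It normalizes hashtags, since the description notes they may appear with or without '#', and re-applies the 100-follower / 100-following selection rule mentioned in the description.

```python
import json
from collections import Counter
from typing import Any, Dict, List

# Invented sample records using the documented field names; types are assumed.
SAMPLE = """
[
  {"id": "1", "friendName": "alice", "followerCount": 250, "friendsCount": 120,
   "friends": ["2", "3"], "lang": "en", "tags": ["#Python", "data"]},
  {"id": "2", "friendName": "bob", "followerCount": 80, "friendsCount": 300,
   "friends": ["1"], "lang": "en", "tags": ["python"]}
]
"""


def normalize_tag(tag: str) -> str:
    """Lowercase a hashtag and drop a leading '#', if present."""
    return tag.lstrip("#").lower()


def hashtag_counts(users: List[Dict[str, Any]]) -> Counter:
    """Count how many times each normalized hashtag appears across users."""
    counts: Counter = Counter()
    for user in users:
        counts.update(normalize_tag(tag) for tag in user.get("tags", []))
    return counts


def passes_filter(user: Dict[str, Any], minimum: int = 100) -> bool:
    """Check the selection rule: at least `minimum` followers and followings."""
    return user["followerCount"] >= minimum and user["friendsCount"] >= minimum


if __name__ == "__main__":
    users = json.loads(SAMPLE)
    print(hashtag_counts(users).most_common())
    print([u["friendName"] for u in users if passes_filter(u)])
```

The same functions should work on the real JSON file after loading it with json.load, provided the value types match the assumptions above.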

These users were selected because they tweeted on Twitter trending topics. I selected users that have at least 100 followers and follow at least 100 other accounts (in order to filter out spam and non-informative/empty accounts).

Acknowledgements

This dataset was built by Hubert Wassner (me) using the Twitter public API. More data can be obtained on request (hubert.wassner AT gmail.com); at this time I have collected over 5 million profiles in different languages. Some more information can be found here (in French only): http://wassner.blogspot.fr/2016/06/recuperer-des-profils-twitter-par.html

Past Research

No public research has been done (until now) on this dataset. I made a private application, described here (in French): http://wassner.blogspot.fr/2016/09/twitter-profiling.html, which uses the full dataset (millions of full profiles).

Inspiration

One can analyse a lot of things with this dataset:

stats about followers & followings

manifold learning or unsupervised learning from friend list

hashtag prediction from friend list

Contact

Feel free to ask any question (or help request) via Twitter : @hwassner

Enjoy! ;)
X/Twitter average impressions on posts 2023-2024
statista.com
Updated Aug 8, 2024
Cite
Statista (2024). X/Twitter average impressions on posts 2023-2024 [Dataset]. https://www.statista.com/statistics/1483819/x-twitter-average-impressions-posts/
Dataset updated
Aug 8, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Sep 2023 - Mar 2024
Area covered
Worldwide
Description
In 2024, X (formerly Twitter) posts generated an average of 2,121 impressions, up from 1,206 impressions in 2023. In 2022, Elon Musk's purchase of Twitter sent shockwaves through the tech world, and much has changed on the platform since.

SenTopX: A Benchmark Twitter Dataset for User Sentiment on Various Topics

Description

This is a longitudinal Twitter dataset of 143K users during the period 2017-2021. The following is the detail of all the files:

SenTopX_userIDs.txt: contains user IDs of 143K Twitter users.
userIDs_tweetIDs.zip: contains Tweet IDs of users; the name of each file is the user ID, and the file contains the list of all of that user's tweet IDs.
users_16_perspective_toxicity_scores.csv: contains user IDs and 16 median Perspective API scores; the vector is shared as the mean, median, and Gini index of scores calculated over all tweets of a user.
LDAvis_top30_words_for_extracted_topics.csv: contains the top 30 most relevant words for each topic extracted by tweet-level topic modeling using the BERTweet topic model.
topic_modelling_statistics_per_user.csv: contains important and relevant statistics related to topic modeling results:
1. user: This column represents the identifier for the user. Each row in the CSV corresponds to a specific user, and this column helps to track and differentiate between the users.

2. avg_topic_probability: This column contains the average probability of the topics for each user calculated across all of the tweets in order to compare users in a meaningful way. It represents the average likelihood that a particular user discusses various topics over the observed period.

3. maximum_topic_avg: This column holds the value of the highest average probability among all topics for each user. It indicates the topic that the user most frequently discusses, on average.

4. index_max_avg_topic_probability_200: This column specifies the index or identifier of the topic with the highest average probability out of 200 possible topics. It shows which topic (out of 200) the user discusses the most.

5. global_avg: This column includes the global average probability of topics across all users. It provides a baseline or overall average topic probability that can be used for comparative purposes.

6. max_global_avg: This column contains the maximum global average probability across all topics for all users. It identifies the most discussed topic across the entire user base.

7. index_max_global_avg: This column shows the index or identifier of the topic with the highest global average probability. It indicates which topic (out of 200) is the most popular across all users.

8. entropy_200_topic: This column represents the entropy of the topics for each user, calculated over 200 topics. Entropy measures the diversity or unpredictability in the user's discussion of topics, with higher entropy indicating more varied topic discussion.

In summary, these columns are used to analyze the topic engagement and preferences of users on a platform, highlighting the most frequently discussed topics, the variability in topic discussions, and how individual user behavior compares to overall trends.
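
To make the per-user statistics above concrete, the following Python sketch shows one way such values could be computed from per-tweet topic probabilities. It is an illustration only, not the dataset authors' pipeline: the description does not say how the averages or the entropy were calculated, so the use of Shannon entropy, the base-2 logarithm, and the function names are all assumptions. A 4-topic toy example stands in for the 200 topics.

```python
import math
from typing import List, Sequence


def average_topic_probabilities(tweet_topic_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average each topic's probability over all of one user's tweets."""
    n_tweets = len(tweet_topic_probs)
    n_topics = len(tweet_topic_probs[0])
    return [
        sum(row[k] for row in tweet_topic_probs) / n_tweets
        for k in range(n_topics)
    ]


def shannon_entropy(probs: Sequence[float], base: float = 2.0) -> float:
    """Shannon entropy of a topic distribution (assumed definition).

    The input is renormalized to sum to 1; zero-probability topics
    contribute nothing to the sum.
    """
    total = sum(probs)
    return -sum(
        (p / total) * math.log(p / total, base)
        for p in probs
        if p > 0
    )


if __name__ == "__main__":
    # Toy data: three tweets from one user, four topics instead of 200.
    tweets = [
        [0.7, 0.1, 0.1, 0.1],
        [0.5, 0.3, 0.1, 0.1],
        [0.6, 0.2, 0.1, 0.1],
    ]
    avg = average_topic_probabilities(tweets)
    top_value = max(avg)            # analogous to maximum_topic_avg
    top_index = avg.index(top_value)  # analogous to the index of the top topic
    print(avg, top_value, top_index, shannon_entropy(avg))
```

Under this definition, a user who spreads tweets evenly over all topics reaches the maximum entropy, while a user who only ever discusses one topic has entropy zero, which matches the interpretation given for entropy_200_topic.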
