100+ datasets found
  1. g

    Just Another Day on Twitter: A Complete 24 Hours of Twitter Data

    • search.gesis.org
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Pfeffer, Jürgen, Just Another Day on Twitter: A Complete 24 Hours of Twitter Data [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-2516
    Explore at:
    Dataset provided by
    GESIS search
    GESIS, Köln
    Authors
    Pfeffer, Jürgen
    License

    https://www.gesis.org/en/institute/data-usage-termshttps://www.gesis.org/en/institute/data-usage-terms

    Description

    At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.

  2. X/Twitter: distribution of global audiences 2025, by age and gender

    • statista.com
    Updated Oct 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stacy Jo Dixon (2024). X/Twitter: distribution of global audiences 2025, by age and gender [Dataset]. https://www.statista.com/topics/737/twitter/
    Explore at:
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    Statistahttp://statista.com/
    Authors
    Stacy Jo Dixon
    Description

    As of February 2025, 24.5 percent of X (formerly Twitter) users were men aged between 25 and 34 years. Overall, almost 19 percent of users were men aged between 18 and 24 years. X has a high share of male users when compared to other popular social media platforms.

  3. s

    Twitter Users Broken Down By Age

    • searchlogistics.com
    Updated Apr 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Twitter Users Broken Down By Age [Dataset]. https://www.searchlogistics.com/learn/statistics/twitter-user-statistics/
    Explore at:
    Dataset updated
    Apr 1, 2025
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the breakdown of Twitter users by age group.

  4. Twitter Dataset February 2024

    • kaggle.com
    Updated Mar 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ayush Kumar Singh (2024). Twitter Dataset February 2024 [Dataset]. https://www.kaggle.com/datasets/fastcurious/twitter-dataset-february-2024
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Ayush Kumar Singh
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Tweets scraped will all possible datapoints provided by twitter in each tweet. For data extraction or scraping contact me on telegram - @akaseobhw

    All datapoints present for each tweet.

    Each entry in the dataset represents a tweet along with various attributes such as the tweet's ID, URL, text content, retweet count, reply count, like count, quote count, view count, creation date, language, and more. Additionally, there are details about the tweet's author, including their username, profile URL, follower count, following count, profile picture, cover picture, description, location, creation date, and more.

    Here's a brief description of the key fields present in each tweet entry:

    • type: Indicates the type of data, in this case, it's a tweet.
    • id: Unique identifier for the tweet.
    • url: URL of the tweet.
    • twitterUrl: Twitter URL of the tweet.
    • text: Text content of the tweet.
    • retweetCount: Number of retweets.
    • replyCount: Number of replies.
    • likeCount: Number of likes (favorites).
    • quoteCount: Number of times the tweet has been quoted.
    • viewCount: Number of views.
    • createdAt: Date and time when the tweet was created.
    • lang: Language of the tweet.
    • quoteId: ID of the quoted tweet, if this tweet is a quote.
    • bookmarkCount: Number of times the tweet has been bookmarked.
    • isReply: Indicates whether the tweet is a reply to another tweet.
    • author: Information about the author of the tweet.
      • userName: Username of the author.
      • url: URL of the author's profile.
      • followers: Number of followers of the author.
      • following: Number of accounts the author is following.
      • profilePicture: URL of the author's profile picture.
      • coverPicture: URL of the author's cover picture.
      • description: Description or bio of the author.
      • location: Location of the author.
      • createdAt: Date and time when the author's account was created.
    • entities: Entities present in the tweet, such as hashtags, symbols, URLs, and user mentions.
    • isRetweet: Indicates whether the tweet is a retweet.
    • isQuote: Indicates whether the tweet is a quote.
    • quote: Information about the quoted tweet, if this tweet is a quote.
    • media: Information about any media (such as images or videos) attached to the tweet.

    This dataset can be analyzed to gain insights into trends, sentiments, and user behavior on Twitter. You can use Python libraries like pandas to load this dataset and perform various analyses and visualizations.

  5. s

    Twitter Users Broken down By Country

    • searchlogistics.com
    Updated Apr 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Twitter Users Broken down By Country [Dataset]. https://www.searchlogistics.com/learn/statistics/twitter-user-statistics/
    Explore at:
    Dataset updated
    Apr 1, 2025
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The US has historically been the target country for Twitter since its launch in 2006. This is the full breakdown of Twitter users by country.

  6. S

    Twitter Users Statistics 2025: Monthly Active Users, Regional Data & More

    • sqmagazine.co.uk
    Updated Oct 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    SQ Magazine (2025). Twitter Users Statistics 2025: Monthly Active Users, Regional Data & More [Dataset]. https://sqmagazine.co.uk/twitter-users-statistics/
    Explore at:
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    SQ Magazine
    License

    https://sqmagazine.co.uk/privacy-policy/https://sqmagazine.co.uk/privacy-policy/

    Time period covered
    Jan 1, 2024 - Dec 31, 2025
    Area covered
    Global
    Description

    In early 2025, something fascinating happened at a small community center in suburban Ohio. A town hall meeting about local road closures suddenly went viral, not because of the topic, but because a 74-year-old attendee live-tweeted the entire event using her iPad. Within hours, her posts racked up thousands of...

  7. Data from: Google Analytics & Twitter dataset from a movies, TV series and...

    • figshare.com
    • portalcientificovalencia.univeuropea.com
    txt
    Updated Feb 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Víctor Yeste (2024). Google Analytics & Twitter dataset from a movies, TV series and videogames website [Dataset]. http://doi.org/10.6084/m9.figshare.16553061.v4
    Explore at:
    txtAvailable download formats
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Víctor Yeste
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Víctor Yeste. Universitat Politècnica de Valencia.The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of the content published in online media and the possible prediction of the selected success variables.In this case, due to the need to integrate data from two separate areas, such as web publishing and the analysis of their shares and related topics on Twitter, has opted for programming as you access both the Google Analytics v4 reporting API and Twitter Standard API, always respecting the limits of these.The website analyzed is hellofriki.com. It is an online media whose primary intention is to solve the need for information on some topics that provide daily a vast number of news in the form of news, as well as the possibility of analysis, reports, interviews, and many other information formats. All these contents are under the scope of the sections of cinema, series, video games, literature, and comics.This dataset has contributed to the elaboration of the PhD Thesis:Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009Data have been obtained from each last-minute news article published online according to the indicators described in the doctoral thesis. All related data are stored in a database, divided into the following tables:tesis_followers: User ID list of media account followers.tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.status_id: Tweet IDcreated_at: date of publicationtext: content of the tweetpath: URL extracted after processing the shortened URL in textpost_shared: Article ID in WordPress that is being sharedretweet_count: number of retweetsfavorite_count: number of favoritestesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web. Other typologies, automatic Facebook shares, custom tweets without link to an article, etc. With the same fields as tesis_hometimeline.tesis_posts: data of articles published by the web and processed for some analysis.stats_id: Analysis IDpost_id: Article ID in WordPresspost_date: article publication date in WordPresspost_title: title of the articlepath: URL of the article in the middle webtags: Tags ID or WordPress tags related to the articleuniquepageviews: unique page viewsentrancerate: input ratioavgtimeonpage: average visit timeexitrate: output ratiopageviewspersession: page views per sessionadsense_adunitsviewed: number of ads viewed by usersadsense_viewableimpressionpercent: ad display ratioadsense_ctr: ad click ratioadsense_ecpm: estimated ad revenue per 1000 page viewstesis_stats: data from a particular analysis, performed at each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.id: ID of the analysisphase: phase of the thesis in which analysis has been carried out (right now all are 1)time: "0" if at the time of publication, "1" if 14 days laterstart_date: date and time of measurement on the day of publicationend_date: date and time when the measurement is made 14 days latermain_post_id: ID of the published article to be analysedmain_post_theme: Main section of the published article to analyzesuperheroes_theme: "1" if about superheroes, "0" if nottrailer_theme: "1" if trailer, "0" if notname: empty field, possibility to add a custom name manuallynotes: empty field, possibility to add personalized notes manually, as if some tag has been removed manually for being considered too generic, despite the fact that the editor put itnum_articles: number of articles analysednum_articles_with_traffic: number of articles analysed with traffic (which will be taken into account for traffic analysis)num_articles_with_tw_data: number of articles with data from when they were shared on the media’s Twitter accountnum_terms: number of terms analyzeduniquepageviews_total: total page viewsuniquepageviews_mean: average page viewsentrancerate_mean: average input ratioavgtimeonpage_mean: average duration of visitsexitrate_mean: average output ratiopageviewspersession_mean: average page views per sessiontotal: total of ads viewedadsense_adunitsviewed_mean: average of ads viewedadsense_viewableimpressionpercent_mean: average ad display ratioadsense_ctr_mean: average ad click ratioadsense_ecpm_mean: estimated ad revenue per 1000 page viewsTotal: total incomeretweet_count_mean: average incomefavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesterms_ini_num_tweets: total tweets on the terms on the day of publicationterms_ini_retweet_count_total: total retweets on the terms on the day of publicationterms_ini_retweet_count_mean: average retweets on the terms on the day of publicationterms_ini_favorite_count_total: total of favorites on the terms on the day of publicationterms_ini_favorite_count_mean: average of favorites on the terms on the day of publicationterms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publicationterms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms on the day of publicationterms_end_num_tweets: total tweets on terms 14 days after publicationterms_ini_retweet_count_total: total retweets on terms 14 days after publicationterms_ini_retweet_count_mean: average retweets on terms 14 days after publicationterms_ini_favorite_count_total: total bookmarks on terms 14 days after publicationterms_ini_favorite_count_mean: average of favorites on terms 14 days after publicationterms_ini_followers_talking_rate: ratio of media Twitter account followers who have recently posted a tweet talking about the terms 14 days after publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publicationterms_ini_user_age_mean: the average age in days of users who have spoken of the terms 14 days after publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms 14 days after publication.tesis_terms: data of the terms (tags) related to the processed articles.stats_id: Analysis IDtime: "0" if at the time of publication, "1" if 14 days laterterm_id: Term ID (tag) in WordPressname: Name of the termslug: URL of the termnum_tweets: number of tweetsretweet_count_total: total retweetsretweet_count_mean: average retweetsfavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesfollowers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the termuser_num_followers_mean: average followers of users who were talking about the termuser_num_tweets_mean: average number of tweets published by users who were talking about the termuser_age_mean: average age in days of users who were talking about the termurl_inclusion_rate: URL inclusion ratio

  8. X/Twitter: U.S. users on the likelihood of using the platform in the future...

    • statista.com
    Updated Oct 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stacy Jo Dixon (2024). X/Twitter: U.S. users on the likelihood of using the platform in the future 2023 [Dataset]. https://www.statista.com/topics/737/twitter/
    Explore at:
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    Statistahttp://statista.com/
    Authors
    Stacy Jo Dixon
    Description

    According to a survey conducted in March 2023, 40 percent of X (formerly Twitter) users reported it was extremely or very likely that they would still be using the platform in a year's time. Overall, 35 percent said it was somewhat likely, and one-quarter of respondents stated it was not very likely, or not likely at all.

  9. Twitter Dataset

    • brightdata.com
    .json, .csv, .xlsx
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data (2024). Twitter Dataset [Dataset]. https://brightdata.com/products/datasets/twitter
    Explore at:
    .json, .csv, .xlsxAvailable download formats
    Dataset authored and provided by
    Bright Datahttps://brightdata.com/
    License

    https://brightdata.com/licensehttps://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our Twitter dataset for diverse applications to enrich business strategies and market insights. Analyzing this dataset provides a comprehensive understanding of social media trends, empowering organizations to refine their communication and marketing strategies. Access the entire dataset or customize a subset to fit your needs. Popular use cases include market research to identify trending topics and hashtags, AI training by reviewing factors such as tweet content, retweets, and user interactions for predictive analytics, and trend forecasting by examining correlations between specific themes and user engagement to uncover emerging social media preferences.

  10. Z

    Data from: IA Tweets Analysis Dataset (Spanish)

    • data.niaid.nih.gov
    Updated Aug 3, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Serrano-Fernández, Alejandro (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10821484
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset provided by
    University of Cadiz
    Authors
    Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description

    This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

    Data Collection Method

    Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

    Dataset Content

    ID: A unique identifier for each tweet.

    text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

    polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

    favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

    retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

    user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

    user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

    user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

    user_followers_count: The current number of followers the account has. It is a non-negative integer.

    user_friends_count: The number of users that the account is following. It is a non-negative integer.

    user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

    user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

    user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

    user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

    Cite as

    Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

    Potential Use Cases

    This dataset is aimed at academic researchers and practitioners with interests in:

    Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

    Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

    Exploring correlations between user engagement metrics and sentiment in discussions about AI.

    Data Format and File Type

    The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

    License

    The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.

  11. f

    Twitter bot profiling

    • figshare.com
    • researchdata.smu.edu.sg
    • +1more
    pdf
    Updated May 31, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Living Analytics Research Centre (2023). Twitter bot profiling [Dataset]. http://doi.org/10.25440/smu.12062706.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    http://rightsstatements.org/vocab/InC/1.0/http://rightsstatements.org/vocab/InC/1.0/

    Description

    This dataset comprises a set of Twitter accounts in Singapore that are used for social bot profiling research conducted by the Living Analytics Research Centre (LARC) at Singapore Management University (SMU). Here a bot is defined as a Twitter account that generates contents and/or interacts with other users automatically (at least according to human judgment). In this research, Twitter bots have been categorized into three major types:

    Broadcast bot. This bot aims at disseminating information to general audience by providing, e.g., benign links to news, blogs or sites. Such bot is often managed by an organization or a group of people (e.g., bloggers). Consumption bot. The main purpose of this bot is to aggregate contents from various sources and/or provide update services (e.g., horoscope reading, weather update) for personal consumption or use. Spam bot. This type of bots posts malicious contents (e.g., to trick people by hijacking certain account or redirecting them to malicious sites), or promotes harmless but invalid/irrelevant contents aggressively.

    This categorization is general enough to cater for new, emerging types of bot (e.g., chatbots can be viewed as a special type of broadcast bots). The dataset was collected from 1 January to 30 April 2014 via the Twitter REST and streaming APIs. Starting from popular seed users (i.e., users having many followers), their follow, retweet, and user mention links were crawled. The data collection proceeds by adding those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. Using this procedure, a total of 159,724 accounts have been collected. To identify bots, the first step is to check active accounts who tweeted at least 15 times within the month of April 2014. These accounts were then manually checked and labelled, of which 589 bots were found. As many more human users are expected in the Twitter population, the remaining accounts were randomly sampled and manually checked. With this, 1,024 human accounts were identified. In total, this results in 1,613 labelled accounts. Related Publication: R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E.-P. Lim. (2016). On profiling bots in social media. Proceedings of the International Conference on Social Informatics (SocInfo’16), 92-109. Bellevue, WA. https://doi.org/10.1007/978-3-319-47880-7_6

  12. s

    Twitter cascade dataset

    • researchdata.smu.edu.sg
    • smu.edu.sg
    • +1more
    pdf
    Updated May 31, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Living Analytics Research Centre (2023). Twitter cascade dataset [Dataset]. http://doi.org/10.25440/smu.12062709.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    http://rightsstatements.org/vocab/InC/1.0/http://rightsstatements.org/vocab/InC/1.0/

    Description

    This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. This dataset was collected via the Twitter REST and streaming APIs in the following way. Starting from popular seed users (i.e., users having many followers), we crawled their follow, retweet, and user mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. With this, we have a total of 184,794 Twitter user accounts. Then tweets are crawled from these users from 1 April to 31 August 2012. In all, we got 32,479,134 tweets. To identify cascades, we extracted all the URL links and hashtags from the above tweets. And these URL links and hashtags are considered as the identities of cascades. In other words, all the tweets which contain the same URL link (or the same hashtag) represent a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example, i.e. cascade C = {< u1, t1 >, < u2, t2 >, < u1, t3 >, < u3, t4 >, < u4, t5 >}. For evaluation, the dataset was split into two parts: four months data for training and the last one month data for testing. Table 1summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade. The first term in each line is a hashtag or URL, the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.

  13. s

    Twitter Users Broken Down By Gender

    • searchlogistics.com
    Updated Apr 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Twitter Users Broken Down By Gender [Dataset]. https://www.searchlogistics.com/learn/statistics/twitter-user-statistics/
    Explore at:
    Dataset updated
    Apr 1, 2025
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The platform is male-dominated with 68.1% of all Twitter users being male. Just 31.9% of Twitter users are female.

  14. u

    Data from: IA Tweets Analysis Dataset (Spanish)

    • produccioncientifica.uca.es
    Updated 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés; Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://produccioncientifica.uca.es/documentos/67321e53aea56d4af04854c2
    Explore at:
    Dataset updated
    2024
    Authors
    Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés; Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés
    Description

    Cite as

    Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

    General Description

    This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

    Data Collection Method

    Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

    Dataset Content

    ID: A unique identifier for each tweet.

    text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

    polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

    favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

    retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

    user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

    user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

    user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

    user_followers_count: The current number of followers the account has. It is a non-negative integer.

    user_friends_count: The number of users that the account is following. It is a non-negative integer.

    user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

    user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

    user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

    user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

    Potential Use Cases

    This dataset is aimed at academic researchers and practitioners with interests in:

    Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

    Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

    Exploring correlations between user engagement metrics and sentiment in discussions about AI.

    Data Format and File Type

    The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

    License

    The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.

  15. f

    Predicting age groups of Twitter users based on language and metadata...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Antonio A. Morgan-Lopez; Annice E. Kim; Robert F. Chew; Paul Ruddle (2023). Predicting age groups of Twitter users based on language and metadata features [Dataset]. http://doi.org/10.1371/journal.pone.0183537
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Antonio A. Morgan-Lopez; Annice E. Kim; Robert F. Chew; Paul Ruddle
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Health organizations are increasingly using social media, such as Twitter, to disseminate health messages to target audiences. Determining the extent to which the target audience (e.g., age groups) was reached is critical to evaluating the impact of social media education campaigns. The main objective of this study was to examine the separate and joint predictive validity of linguistic and metadata features in predicting the age of Twitter users. We created a labeled dataset of Twitter users across different age groups (youth, young adults, adults) by collecting publicly available birthday announcement tweets using the Twitter Search application programming interface. We manually reviewed results and, for each age-labeled handle, collected the 200 most recent publicly available tweets and user handles’ metadata. The labeled data were split into training and test datasets. We created separate models to examine the predictive validity of language features only, metadata features only, language and metadata features, and words/phrases from another age-validated dataset. We estimated accuracy, precision, recall, and F1 metrics for each model. An L1-regularized logistic regression model was conducted for each age group, and predicted probabilities between the training and test sets were compared for each age group. Cohen’s d effect sizes were calculated to examine the relative importance of significant features. Models containing both Tweet language features and metadata features performed the best (74% precision, 74% recall, 74% F1) while the model containing only Twitter metadata features were least accurate (58% precision, 60% recall, and 57% F1 score). Top predictive features included use of terms such as “school” for youth and “college” for young adults. Overall, it was more challenging to predict older adults accurately. These results suggest that examining linguistic and Twitter metadata features to predict youth and young adult Twitter users may be helpful for informing public health surveillance and evaluation research.

  16. X/Twitter: number of worldwide users 2019-2024

    • statista.com
    • tokrwards.com
    Updated Jun 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). X/Twitter: number of worldwide users 2019-2024 [Dataset]. https://www.statista.com/statistics/303681/twitter-users-worldwide/
    Explore at:
    Dataset updated
    Jun 26, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Dec 2022
    Area covered
    Worldwide
    Description

    As of December 2022, X/Twitter's audience accounted for over *** million monthly active users worldwide. This figure was projected to ******** to approximately *** million by 2024, a ******* of around **** percent compared to 2022.

  17. Customer Support on Twitter

    • kaggle.com
    zip
    Updated Nov 27, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Thought Vector (2017). Customer Support on Twitter [Dataset]. https://www.kaggle.com/thoughtvector/customer-support-on-twitter
    Explore at:
    zip(149959515 bytes)Available download formats
    Dataset updated
    Nov 27, 2017
    Dataset authored and provided by
    Thought Vector
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact.

    https://i.imgur.com/nTv3Iuu.png" alt="Example Analysis - Inbound Volume for the Top 20 Brands">

    Context

    Natural language remains the densest encoding of human experience we have, and innovation in NLP has accelerated to power understanding of that data, but the datasets driving this innovation don't match the real language in use today. The Customer Support on Twitter dataset offers a large corpus of modern English (mostly) conversations between consumers and customer support agents on Twitter, and has three important advantages over other conversational text datasets:

    • Focused - Consumers contact customer support to have a specific problem solved, and the manifold of problems to be discussed is relatively small, especially compared to unconstrained conversational datasets like the reddit Corpus.
    • Natural - Consumers in this dataset come from a much broader segment than those in the Ubuntu Dialogue Corpus and have much more natural and recent use of typed text than the Cornell Movie Dialogs Corpus.
    • Succinct - Twitter's brevity causes more natural responses from support agents (rather than scripted), and to-the-point descriptions of problems and solutions. Also, its convenient in allowing for a relatively low message limit size for recurrent nets.

    Inspiration

    The size and breadth of this dataset inspires many interesting questions:

    • Can we predict company responses? Given the bounded set of subjects handled by each company, the answer seems like yes!
    • Do requests get stale? How quickly do the best companies respond, compared to the worst?
    • Can we learn high quality dense embeddings or representations of similarity for topical clustering?
    • How does tone affect the customer support conversation? Does saying sorry help?
    • Can we help companies identify new problems, or ones most affecting their customers?

    Content

    The dataset is a CSV, where each row is a tweet. The different columns are described below. Every conversation included has at least one request from a consumer and at least one response from a company. Which user IDs are company user IDs can be calculated using the inbound field.

    tweet_id

    A unique, anonymized ID for the Tweet. Referenced by response_tweet_id and in_response_to_tweet_id.

    author_id

    A unique, anonymized user ID. @s in the dataset have been replaced with their associated anonymized user ID.

    inbound

    Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.

    created_at

    Date and time when the tweet was sent.

    text

    Tweet content. Sensitive information like phone numbers and email addresses are replaced with mask values like _email_.

    response_tweet_id

    IDs of tweets that are responses to this tweet, comma-separated.

    in_response_to_tweet_id

    ID of the tweet this tweet is in response to, if any.

    Contributing

    Know of other brands the dataset should include? Found something that needs to be fixed? Start a discussion, or email me directly at $FIRSTNAME@$LASTNAME.com!

    Acknowledgements

    A huge thank you to my friends who helped bootstrap the list of companies that do customer support on Twitter! There are many rocks that would have been left un-turned were it not for your suggestions!

    Relevant Resources

  18. Z

    COVID-19 Tweets : A dataset contaning more than 600k tweets on the novel...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 23, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Habiba Drias (2021). COVID-19 Tweets : A dataset contaning more than 600k tweets on the novel CoronaVirus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4024176
    Explore at:
    Dataset updated
    Jan 23, 2021
    Dataset provided by
    Habiba Drias
    Yassine Drias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 653 996 tweets related to the Coronavirus topic and highlighted by hashtags such as: #COVID-19, #COVID19, #COVID, #Coronavirus, #NCoV and #Corona. The tweets' crawling period started on the 27th of February and ended on the 25th of March 2020, which is spread over four weeks.

    The tweets were generated by 390 458 users from 133 different countries and were written in 61 languages. English being the most used language with almost 400k tweets, followed by Spanish with around 80k tweets.

    The data is stored in as a CSV file, where each line represents a tweet. The CSV file provides information on the following fields:

    Author: the user who posted the tweet

    Recipient: contains the name of the user in case of a reply, otherwise it would have the same value as the previous field

    Tweet: the full content of the tweet

    Hashtags: the list of hashtags present in the tweet

    Language: the language of the tweet

    Relationship: gives information on the type of the tweet, whether it is a retweet, a reply, a tweet with a mention, etc.

    Location: the country of the author of the tweet, which is unfortunately not always available

    Date: the publication date of the tweet

    Source: the device or platform used to send the tweet

    The dataset can as well be used to construct a social graph since it includes the relations "Replies to", "Retweet", "MentionsInRetweet" and "Mentions".

  19. Cashtag Piggybacking dataset - Twitter dataset enriched with financial data

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stefano Cresci; Fabrizio Lillo; Daniele Regoli; Serena Tardelli; Serena Tardelli; Maurizio Tesconi; Stefano Cresci; Fabrizio Lillo; Daniele Regoli; Maurizio Tesconi (2020). Cashtag Piggybacking dataset - Twitter dataset enriched with financial data [Dataset]. http://doi.org/10.5281/zenodo.2686862
    Explore at:
    zip, application/x-troff-meAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Stefano Cresci; Fabrizio Lillo; Daniele Regoli; Serena Tardelli; Serena Tardelli; Maurizio Tesconi; Stefano Cresci; Fabrizio Lillo; Daniele Regoli; Maurizio Tesconi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is composed of

    • Twitter dataset of ~9M tweets mentioning stocks (cashtags) traded on the most important US markets, shared between May and September 2017 (users data enriched with bot classification label)
    • Financial information about ~30k companies found in those tweets, retrieved from Google Finance

    Refer to the paper below for more details.

    Cresci, S., Lillo, F., Regoli, D., Tardelli, S., & Tesconi, M. (2019). Cashtag Piggybacking: Uncovering Spam and Bot Activity in Stock Microblogs on Twitter. ACM Transactions on the Web (TWEB), 13(2), 11.

  20. f

    100 Days of Tweet IDs and Most Frequent Terms in Tweets from_user_id_str...

    • city.figshare.com
    xls
    Updated Apr 29, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ernesto Priego (2017). 100 Days of Tweet IDs and Most Frequent Terms in Tweets from_user_id_str 25073877 [Dataset]. http://doi.org/10.6084/m9.figshare.4955231.v1
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Apr 29, 2017
    Dataset provided by
    City, University of London
    Authors
    Ernesto Priego
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an Excel workbook containing two sheets. The first sheet contains 503 rows corresponding to 503 Tweet id strings from_user_id_str 25073877 and the following corresponding metadata:created_at time user_lang in_reply_to_user_id_str f from_user_id_str in_reply_to_status_id_str source user_followers_count user_friends_countTweet texts, URLs and other metadata such as profile_image_url, status_url and entities_str have not been included.An attempt to remove duplicated entries was made but duplicates might have remained so further data refining might be required prior to analyses.The second sheet contains 400 rows corresponding to the most frequent terms in the dataset's Tweets' texts. The text analysis was performed with the Terms Tool from Voyant Tools by Stéfan Sinclair & Geoffrey Rockwell (2017). An edited English stop words list was applied to remove Twitter data specific terms such as t.co, https, user names, etc. The analysed Tweets contained emojis and other special characters; due to character encoding these will be reflected in the terms list as character combinations. Motivations to Share this DataArchived Tweets can provide interesting insights for the study of contemporary history of media, politics, diplomacy, etc. The queried account is a public account widely agreed to be of exceptional national and international public interest. Though they provide public access to tweeted content in real time, Twitter Web and mobile clients are not suited for appropriate Tweet corpus analysis. For anyone researching social media, access to the data is absolutely essential in order to perform, review and reproduce studies. Archiving Tweets of public interest due to their historic significance is a means to both preserve and enable reproducible study of this form of rapid online communication that otherwise can very likely become unretrievable as time passes. Due to Twitter's current business model and API limits, to date collecting in real time is the only relatively reliable method to archive Tweets at a small scale.So far Twitter data analysis and visualisation has been done without researchers providing access to the source data that would allow reproducibility. It is appreciated that an Excel workbook is far from ideal as a file format, but due to the small scale the intention is to make this data human readable and available to researchers in a variety of non-technical fields. Methodology and LimitationsThe Tweets contained in this file were collected by Ernesto Priego using a Python script. The data collection search query was from:realdonaldtrump. A trigger was scheduled to collect atuomatically every hour, this means that any Tweets immediately deleted after publication have not been collected. The original data harvesting was refined to delete duplications, to subscribe to Twitter's Terms and Conditions and so that the data was sorted in chronological order.Duplication of data due to the automated collection is possible so further data refining might be required. The file may not contain data from Tweets deleted by the queried user account immediately after original publication. Both research and experience show that the Twitter search API is not 100% reliable. (Gonzalez-Bailon, Sandra, et al. 2012).Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet posted by the queried account during the indicated period. This file dataset is shared for archival, comparative and indicative educational research purposes only. The content included is from a public Twitter account and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.The original Tweets, their contents and associated metadata were published openly on the Web from the queried public account and are responsibility of the original authors. Original Tweets are likely to be copyright their individual authors but please check individually. The license on this output applies to the data collection; third-party content should be attributed to the original authors and copyright owners. Please note that usernames, user profile pictures and full text of the Tweets collected have not been included in this file. No private personal information is shared in this dataset. As indicated above this dataset does not contain the text of the Tweets. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road.This dataset is shared to archive, document and encourage open educational research into political activity on Twitter.Other ConsiderationsAll Twitter users agree to Twitter's Privacy and data sharing policies. Social media research remains in its infancy and though work has been done to develop best practices there is yet no agreement on a series of grey areas relating to reseach methodologies including ad hoc social media specific research ethics guidelines for reproducible research. It is understood that public figures Tweet publicly with the conscious intention to have their Tweets publicly accessed and discussed. It is assumed that a public figure Tweeting publicly is of public interest and that such figure, as a Twitter user, has given implicit consent, by agreeing explicitly to Twitter's Terms and Conditions, for their Tweets to be publicly accessed and discussed, including critical analysis, without the need for prior written permission. There is therefore no difference between collecting data and performing data analysis from a public printed or online publication and collecting data and performing data analysis of a dataset containing Twitter data from a public account from a public user in a public role. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. Reproducibility is considered here a key value for robust and trustworthy research. Different scholarly professional associations like the Modern Language Association recognise Tweets, datasets and other online and digital resources as citeable scholarly outputs.The data contained in the deposited file is otherwise available elsewhere through different methods.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Pfeffer, Jürgen, Just Another Day on Twitter: A Complete 24 Hours of Twitter Data [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-2516

Just Another Day on Twitter: A Complete 24 Hours of Twitter Data

Related Article
Explore at:
Dataset provided by
GESIS search
GESIS, Köln
Authors
Pfeffer, Jürgen
License

https://www.gesis.org/en/institute/data-usage-termshttps://www.gesis.org/en/institute/data-usage-terms

Description

At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.

Search
Clear search
Close search
Google apps
Main menu