100+ datasets found
  1. SenTopX: A Benchmark Twitter Dataset for User Sentiment on Various Topics

    • zenodo.org
    csv, zip
    Updated May 27, 2024
    Cite
    Hina Qayyum (2024). SenTopX: A Benchmark Twitter Dataset for User Sentiment on Various Topics [Dataset]. http://doi.org/10.5281/zenodo.11243662
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    May 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hina Qayyum
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 25, 2024
    Description

    This is a longitudinal Twitter dataset of 143K users during the period 2017-2021. The following is the detail of all the files:

    • SenTopX_userIDs.txt: contains user IDs of 143K Twitter users.
    • userIDs_tweetIDs.zip: contains the tweet IDs of each user; each file is named after a user ID and lists all of that user's tweet IDs.
    • users_16_perspective_toxicity_scores.csv: contains user IDs and 16 Perspective API scores; the score vector is shared as the mean, median, and Gini index of the scores calculated over all tweets of a user.
    • LDAvis_top30_words_for_extracted_topics.csv: contains the top 30 most relevant words for each topic extracted by tweet-level topic modeling using the BERTweet topic model.
    • topic_modelling_statistics_per_user.csv: contains per-user statistics derived from the topic modeling results:

        1. user: This column holds the identifier of the user. Each row in the CSV corresponds to a specific user.

        2. avg_topic_probability: This column contains the average probability of the topics for each user calculated across all of the tweets in order to compare users in a meaningful way. It represents the average likelihood that a particular user discusses various topics over the observed period.

        3. maximum_topic_avg: This column holds the highest average probability among all topics for each user, that is, the average probability of the topic that the user discusses most.

        4. index_max_avg_topic_probability_200: This column specifies the index or identifier of the topic with the highest average probability out of 200 possible topics. It shows which topic (out of 200) the user discusses the most.

        5. global_avg: This column includes the global average probability of topics across all users. It provides a baseline or overall average topic probability that can be used for comparative purposes.

        6. max_global_avg: This column contains the maximum global average probability across all topics for all users, that is, the average probability of the most discussed topic across the entire user base.

        7. index_max_global_avg: This column shows the index or identifier of the topic with the highest global average probability. It indicates which topic (out of 200) is the most popular across all users.

        8. entropy_200_topic: This column represents the entropy of the topics for each user, calculated over 200 topics. Entropy measures the diversity or unpredictability in the user's discussion of topics, with higher entropy indicating more varied topic discussion.

        In summary, these columns are used to analyze the topic engagement and preferences of users on a platform, highlighting the most frequently discussed topics, the variability in topic discussions, and how individual user behavior compares to overall trends.
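    The per-user columns described above can be illustrated with a minimal sketch. It assumes that each tweet comes with a probability vector over the topics, and that the entropy is Shannon entropy (natural logarithm) of the user's averaged topic distribution; the description does not state the log base or the exact distribution used, so both are assumptions. The function name `user_topic_stats` and the returned keys are hypothetical and only loosely mirror the column names.

```python
import math

def user_topic_stats(tweet_topic_probs):
    """Summarize one user's topic usage from per-tweet topic probabilities.

    tweet_topic_probs: list of rows, one per tweet; each row holds one
    probability per topic (200 topics in the dataset, 3 in the demo below).
    """
    n_tweets = len(tweet_topic_probs)
    n_topics = len(tweet_topic_probs[0])
    # Average probability of each topic across all of the user's tweets.
    avg = [sum(row[k] for row in tweet_topic_probs) / n_tweets
           for k in range(n_topics)]
    max_avg = max(avg)
    # Assumption: Shannon entropy (natural log) of the averaged distribution.
    entropy = -sum(p * math.log(p) for p in avg if p > 0)
    return {
        "avg_topic_probability": avg,
        "maximum_topic_avg": max_avg,
        "index_max_avg_topic_probability": avg.index(max_avg),
        "entropy": entropy,
    }

# Toy user with two tweets over three topics (made-up numbers).
stats = user_topic_stats([[0.8, 0.2, 0.0], [0.4, 0.6, 0.0]])
print(stats)
```

    A user who spreads probability evenly over many topics gets a higher entropy than a user concentrated on one topic, which matches the interpretation given for entropy_200_topic.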

  2. #FilmYourHospital Twitter Dataset: a COVID-19 conspiracy theory on Twitter

    • borealisdata.ca
    Updated May 20, 2021
    Cite
    Anatoliy Gruzd; Philip Mai (2021). #FilmYourHospital Twitter Dataset: a COVID-19 conspiracy theory on Twitter [Dataset]. http://doi.org/10.5683/SP2/BSGQGS
    Dataset updated
    May 20, 2021
    Dataset provided by
    Borealis
    Authors
    Anatoliy Gruzd; Philip Mai
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The dataset contains 99,039 Tweet IDs of Twitter posts with #FilmYourHospital. It was collected using Netlytic.org between March 28 and April 9, 2020, by querying the Twitter Search API (ver. 1) every 15 minutes. NOTES: 1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset. 2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or the Python library Twarc (https://github.com/DocNow/twarc). For more info about this dataset, read the following paper: https://doi.org/10.1177/2053951720938405
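    Before handing a Tweet ID list to a rehydration tool, it can help to clean and batch it. The sketch below is only an illustration of that preparation step, not part of Hydrator or Twarc: it assumes one Tweet ID per line, and the batch size of 100 is an assumption chosen to match the per-request limit of the Twitter API v1.1 lookup endpoint. The helper names `read_tweet_ids` and `chunked` are hypothetical.

```python
def read_tweet_ids(lines):
    """Return unique, numeric tweet IDs in first-seen order."""
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        # Skip blank lines, headers, and repeated IDs.
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids

def chunked(ids, size=100):
    """Yield consecutive chunks of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Toy input standing in for the lines of a tweet-ID file (made-up IDs).
sample = ["1244000000000000001\n", "1244000000000000002\n", "\n",
          "1244000000000000001\n"]
ids = read_tweet_ids(sample)
batches = list(chunked(ids, size=1))
print(ids, len(batches))
```

    IDs are kept as strings so that no precision is lost if they are later serialized to formats that treat large integers as floating-point numbers.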

  3. A Twitter Dataset of 100+ million tweets related to COVID-19

    • zenodo.org
    application/gzip, csv +1
    Updated Apr 17, 2023
    + more versions
    Cite
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell (2023). A Twitter Dataset of 100+ million tweets related to COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.3735274
    Explore at:
    Available download formats: application/gzip, tsv, csv
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Gerardo Chowell
    Description

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering, from March 11th to March 30th, yielded over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to February 27th, to provide extra longitudinal coverage.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (101,400,452 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (20,244,746 unique tweets). There are several practical reasons for us to leave the retweets in; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.

    More details can be found (and will be updated faster) at: https://github.com/thepanacealab/covid19_twitter

    As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions on redistributing Twitter data. They need to be hydrated to be used.
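    Because the full file holds over 100 million rows, it is usually read as a stream rather than loaded at once. The following is a minimal sketch of counting tweets per day from a TSV of identifiers with date and time; the column names `tweet_id`, `date`, and `time` are assumptions for illustration and should be checked against the actual header of full_dataset.tsv. The function name `tweets_per_day` is hypothetical.

```python
import csv
import io
from collections import Counter

def tweets_per_day(tsv_file, date_column="date"):
    """Count rows per day while streaming the file row by row."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[row[date_column]] += 1
    return counts

# Toy stand-in for the TSV file (made-up rows, assumed header names).
sample = io.StringIO(
    "tweet_id\tdate\ttime\n"
    "1\t2020-03-11\t00:00:01\n"
    "2\t2020-03-11\t00:00:02\n"
    "3\t2020-03-12\t08:15:00\n"
)
counts = tweets_per_day(sample)
print(counts.most_common())
```

    For the real file, the `io.StringIO` object would be replaced by an open file handle, for example one returned by `gzip.open(path, "rt")` if the download is gzip-compressed.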

  4. X/Twitter: distribution of global audiences 2024, by gender

    • statista.com
    • flwrdeptvarieties.store
    Updated May 22, 2024
    Cite
    X/Twitter: distribution of global audiences 2024, by gender [Dataset]. https://www.statista.com/statistics/828092/distribution-of-users-on-twitter-worldwide-gender/
    Dataset updated
    May 22, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2024
    Area covered
    Worldwide
    Description

    As of January 2024, micro-blogging platform X (formerly Twitter) was more popular with men than women, with male audiences accounting for 60.9 percent of global users. Additionally, users between the ages of 25 and 34 were particularly active on X/Twitter, making up more than 38 percent of users worldwide.

    How many people use?

    Although X/Twitter holds its status as a mainstream social media site, it falls short in comparison to other well-known platforms in terms of user numbers. As of early 2022, X/Twitter had around 436 million monthly active users, whilst Meta’s Facebook reached almost three billion MAU. Overall, the United States is home to over 105 million X/Twitter users, making up Twitter’s largest audience base, followed by Japan, India, and the United Kingdom, respectively.

    How is Twitter used?

    X/Twitter is utilized by its audience for many different purposes. In May 2021, over 80 percent of high-volume X/Twitter users (defined as users who tweet around 20 times per month) in the United States reported using the platform for entertainment, whilst 78 percent said they used it as a way to stay informed. High-volume X/Twitter users were far more likely to use the service as a means of expressing their opinion. Furthermore, in 2022, over half of social media users in the U.S. used Twitter as a news resource.

  5. Posts on X/Twitter mentioning "nuclear" throughout 2022, by sentiment

    • statista.com
    Updated Jun 18, 2024
    Cite
    Posts on X/Twitter mentioning "nuclear" throughout 2022, by sentiment [Dataset]. https://www.statista.com/statistics/1472764/posts-x-twitter-that-mentioned-nuclear-sentiment/
    Dataset updated
    Jun 18, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 24, 2022 - Oct 31, 2022
    Area covered
    Worldwide
    Description

    According to a report conducted in 2022, posts on X (formerly Twitter) containing the term "nuclear" were mainly of a negative sentiment between February and October 2022. Posts on the social media platform mentioning "nuclear," which evoked negative connotations, increased to 65 percent in March 2022, up from 55 percent in February, following Russia's invasion of Ukraine. Posts using "nuclear" that were of a negative sentiment also saw increases between August and October 2022, linked to the situation at the Zaporizhia nuclear power plant.

  6. COVID-19 Twitter Dataset

    • borealisdata.ca
    Updated Nov 10, 2020
    Cite
    Anatoliy Gruzd; Philip Mai (2020). COVID-19 Twitter Dataset [Dataset]. http://doi.org/10.5683/SP2/PXF2CU
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Borealis
    Authors
    Anatoliy Gruzd; Philip Mai
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The current dataset contains 237M Tweet IDs for Twitter posts that mentioned "COVID" as a keyword or as part of a hashtag (e.g., COVID-19, COVID19) between March and July of 2020.

    Sampling Method: hourly requests sent to Twitter Search API using Social Feed Manager, an open source software that harvests social media data and related content from Twitter and other platforms.

    NOTE:
    1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset.
    2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or Python library Twarc (https://github.com/DocNow/twarc).
    3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included in the dataset either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because tweets used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV).
    4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as:
       https://github.com/thepanacealab/covid19_twitter
       https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset
       https://github.com/echen102/COVID-19-TweetIDs

  7. twitter-dataset-tesla

    • huggingface.co
    Updated Jul 11, 2022
    Cite
    fastai X Hugging Face Group 2022 (2022). twitter-dataset-tesla [Dataset]. https://huggingface.co/datasets/hugginglearners/twitter-dataset-tesla
    Dataset updated
    Jul 11, 2022
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    fastai X Hugging Face Group 2022
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Twitter Dataset: Tesla

    Dataset Summary

    This dataset contains all the Tweets regarding #Tesla or #tesla till 12/07/2022 (dd-mm-yyyy). It can be used for sentiment analysis research, for other NLP tasks, or just for fun. It contains 10,000 recent Tweets with the user ID, the hashtags used in the Tweets, and other important features.

    Supported Tasks and Leaderboards

    [More Information Needed]

    Languages

    [More… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/twitter-dataset-tesla.

  8. Data from: Twitter Dataset on the Russo-Ukrainian War

    • zenodo.org
    Updated Oct 20, 2023
    Cite
    Alexander Shevtsov; Despoina Antonakaki; Ioannis Lamprou; Sotiris Ioannidis; Polyvios Pratikakis (2023). Twitter Dataset on the Russo-Ukrainian War [Dataset]. http://doi.org/10.5281/zenodo.8431047
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander Shevtsov; Despoina Antonakaki; Ioannis Lamprou; Sotiris Ioannidis; Polyvios Pratikakis
    Time period covered
    Feb 23, 2022
    Area covered
    Ukraine
    Description

    On 24 February 2022, Russia invaded Ukraine, in what is now known as the Russo-Ukrainian War. We obtained our dataset through the Twitter API from 23 February 2022 until 23 June 2023. The collected dataset has 127,275,386 tweets, shared in the form of anonymized text, where the tweet/user IDs and user mentions are anonymized and do not provide any personal information. The provided dataset contains user discussion in more than 70 languages, of which the 20 most popular are: 'eng', 'fr', 'de', 'mix', 'it', 'es', 'ja', 'ru', 'pl', 'uk', 'tr', 'th', 'hi', 'qme', 'qht', 'nl', 'fi', 'ar', 'zh' and 'pt'. To preserve information integrity, tweets are separated and stored in different files ordered by creation date. The provided dataset is shared for further research purposes. Additionally, we provide the list of tweet IDs in the GitHub repository, which can be retrieved via the Twitter API. Furthermore, we also performed some initial analyses, including volume/activity, hashtag popularity, sentiment, and military intelligence, and published the results in the web portal.

  9. #RoeOverturned: Twitter Dataset on the Abortion Rights Controversy

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Feb 6, 2023
    Cite
    Ashwin Rao; Rong-Ching Chang; Qiankun Zhong; Magdalena Wojcieszak; Kristina Lerman (2023). #RoeOverturned: Twitter Dataset on the Abortion Rights Controversy [Dataset]. http://doi.org/10.7910/DVN/STU0J5
    Dataset updated
    Feb 6, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Ashwin Rao; Rong-Ching Chang; Qiankun Zhong; Magdalena Wojcieszak; Kristina Lerman
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    On June 24, 2022, the United States Supreme Court overturned landmark rulings made in its 1973 verdict in Roe v. Wade. The justices, by way of a majority vote in Dobbs v. Jackson Women's Health Organization, decided that abortion wasn't a constitutional right and returned the issue of abortion to the elected representatives. This decision triggered multiple protests and debates across the US, especially in the context of the midterm elections in November 2022. Given that many citizens use social media platforms to express their views and mobilize for collective action, and given that online debate has tangible effects on public opinion, political participation, news media coverage, and political decision-making, it is crucial to understand online discussions surrounding this topic. Toward this end, we present the first large-scale Twitter dataset collected on the abortion rights debate in the United States. We present a set of 74M tweets systematically collected over the course of one year from January 1, 2022 to January 6, 2023.

  10. X/Twitter: average replies on posts 2023-2024

    • statista.com
    Updated Aug 8, 2024
    Cite
    Statista (2024). X/Twitter: average replies on posts 2023-2024 [Dataset]. https://www.statista.com/statistics/1483830/x-twitter-average-replies-posts/
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Sep 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    In 2024, X (formerly Twitter) posts had an average of 3.4 replies, up from an average of 1.64 replies in 2023. Elon Musk's X account is the profile with the most followers on the platform.

  11. TRACES Bulgarian Twitter Dataset on Covid-19 Annotated with Linguistic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 16, 2023
    Cite
    Silvia Gargova (2023). TRACES Bulgarian Twitter Dataset on Covid-19 Annotated with Linguistic Markers of Lies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7614246
    Dataset updated
    Apr 16, 2023
    Dataset provided by
    Veneta Kireva
    Tsvetelina Stefanova
    Silvia Gargova
    Irina Temnikova
    Description

    This dataset has been created within Project TRACES (more information: https://traces.gate-ai.eu/). The dataset contains 61411 tweet IDs of tweets, written in Bulgarian, with annotations. The dataset can be used for general use or for building lies and disinformation detection applications.

    Note: this dataset is not fact-checked; the social media messages have been retrieved via keywords. For fact-checked datasets, see our other datasets.

    The tweets (written between 1 Jan 2020 and 28 June 2022) have been collected via Twitter API under academic access in June 2022 with the following keywords:

    (Covid OR коронавирус OR Covid19 OR Covid-19 OR Covid_19) - without replies and without retweets

    (Корона OR корона OR Corona OR пандемия OR пандемията OR Spikevax OR SARS-CoV-2 OR бустерна доза) - with replies, but without retweets

    Explanations of which fields can be used as markers of lies (or of intentional disinformation) are provided in our forthcoming paper (please cite it when using this dataset):

    Irina Temnikova, Silvia Gargova, Ruslana Margova, Veneta Kireva, Ivo Dzhumerov, Tsvetelina Stefanova and Hristiana Nikolaeva (2023) New Bulgarian Resources for Detecting Disinformation. 10th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'23). Poznań. Poland.

  12. Twitter Conversations about the COVID-19 Omicron Variant: A Large Scale...

    • zenodo.org
    • dataverse.harvard.edu
    txt
    Updated Jul 25, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). Twitter Conversations about the COVID-19 Omicron Variant: A Large Scale Dataset of more than 500,000 Tweets [Dataset]. http://doi.org/10.5281/zenodo.6804323
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 25, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite the following paper when using this dataset:

    N. Thakur and C.Y. Han, “An Exploratory Study of Tweets about the SARS-CoV-2 Omicron Variant: Insights from Sentiment Analysis, Language Interpretation, Source Tracking, Type Classification, and Embedded URL Detection,” Preprints, 2022, DOI: 10.20944/preprints202205.0238.v2

    Abstract

    This open-access dataset is one of the salient contributions of the above-mentioned paper. It presents a total of 537,702 Tweet IDs of the same number of Tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

    Data Description

    The Tweet IDs are presented in 7 different .txt files based on the timelines of the associated tweets. The following table provides the details of these dataset files. The data collection followed a keyword-based approach and tweets comprising the "omicron" keyword were filtered, collected, and added to this dataset.

    Filename                 No. of Tweet IDs    Date Range of the Tweet IDs
    TweetIDs_November.txt    17271               November 24, 2021 to November 30, 2021
    TweetIDs_December.txt    101393              December 1, 2021 to December 31, 2021
    TweetIDs_January.txt     95055               January 1, 2022 to January 31, 2022
    TweetIDs_February.txt    91571               February 1, 2022 to February 28, 2022
    TweetIDs_March.txt       100787              March 1, 2022 to March 31, 2022
    TweetIDs_April.txt       94409               April 1, 2022 to April 20, 2022
    TweetIDs_May.txt         37216               May 1, 2022 to May 12, 2022

    In the above table, the last date for May is May 12 as it was the most recent date at the time of data collection and dataset upload. The dataset would be updated soon to incorporate more recent tweets.

    The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. For hydrating this dataset the Hydrator application (link to download and a step-by-step tutorial on how to use Hydrator) may be used.
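    As a quick consistency check, the per-file counts listed above can be summed and compared with the stated total of 537,702 Tweet IDs. The snippet below is a small sketch that copies the counts from the file listing; the variable names are illustrative only.

```python
# Tweet-ID counts per file, copied from the file listing above.
file_counts = {
    "TweetIDs_November.txt": 17271,
    "TweetIDs_December.txt": 101393,
    "TweetIDs_January.txt": 95055,
    "TweetIDs_February.txt": 91571,
    "TweetIDs_March.txt": 100787,
    "TweetIDs_April.txt": 94409,
    "TweetIDs_May.txt": 37216,
}

stated_total = 537_702  # total given in the dataset abstract
total = sum(file_counts.values())
print(total, total == stated_total)
```

    The same comparison can be applied to downloaded files by counting their non-empty lines, which helps detect a truncated download before hydration.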

  13. Why Do People Use Twitter?

    • searchlogistics.com
    Cite
    Why Do People Use Twitter? [Dataset]. https://www.searchlogistics.com/learn/statistics/twitter-user-statistics/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One of the biggest advantages of Twitter is the speed at which information can be passed around. People use Twitter primarily to get news and for entertainment. This is the breakdown of why people use Twitter today.

  14. Data from: IA Tweets Analysis Dataset (Spanish)

    • data.niaid.nih.gov
    • produccioncientifica.uca.es
    • +1more
    Updated Aug 3, 2024
    Cite
    IA Tweets Analysis Dataset (Spanish) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10821484
    Dataset updated
    Aug 3, 2024
    Dataset provided by
    Guerrero-Contreras, Gabriel
    Serrano-Fernández, Alejandro
    Balderas-Díaz, Sara
    Muñoz, Andrés
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description

    This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

    Data Collection Method

    Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

    Dataset Content

    ID: A unique identifier for each tweet.

    text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

    polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

    favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

    retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

    user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

    user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

    user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

    user_followers_count: The current number of followers the account has. It is a non-negative integer.

    user_friends_count: The number of users that the account is following. It is a non-negative integer.

    user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

    user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

    user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

    user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.
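    The field list above maps naturally onto typed records. The sketch below shows one way to parse such a CSV and compare engagement across sentiment classes. It assumes the header names match the field names listed above and that booleans are serialized as the text "True"/"False"; neither is confirmed by the description. The sample rows are made up for illustration, and the helper names `parse_row` and `mean_likes_by_polarity` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Field groups taken from the field list above (types as described there).
INT_FIELDS = ["favorite_count", "retweet_count", "user_followers_count",
              "user_friends_count", "user_favourites_count",
              "user_statuses_count"]
BOOL_FIELDS = ["user_verified", "user_default_profile",
               "user_has_extended_profile", "user_protected",
               "user_is_translator"]

def parse_row(row):
    """Convert the string values of one CSV row to integers and booleans."""
    parsed = dict(row)
    for name in INT_FIELDS:
        if name in parsed:
            parsed[name] = int(parsed[name])
    for name in BOOL_FIELDS:
        if name in parsed:
            # Assumption: booleans are stored as the text "True" / "False".
            parsed[name] = parsed[name].strip().lower() == "true"
    return parsed

def mean_likes_by_polarity(rows):
    """Average favorite_count for each polarity label."""
    totals = defaultdict(lambda: [0, 0])  # polarity -> [sum, count]
    for row in rows:
        entry = totals[row["polarity"]]
        entry[0] += row["favorite_count"]
        entry[1] += 1
    return {label: s / n for label, (s, n) in totals.items()}

# Made-up sample with a subset of the columns.
sample = io.StringIO(
    "ID,text,polarity,favorite_count,retweet_count,user_verified\n"
    "1,Me encanta la IA,Positive,10,2,True\n"
    "2,La IA me preocupa,Negative,4,1,False\n"
    "3,La IA ayuda mucho,Positive,20,5,False\n"
)
rows = [parse_row(r) for r in csv.DictReader(sample)]
means = mean_likes_by_polarity(rows)
print(means)
```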

    Cite as

    Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

    Potential Use Cases

    This dataset is aimed at academic researchers and practitioners with interests in:

    Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

    Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

    Exploring correlations between user engagement metrics and sentiment in discussions about AI.

    Data Format and File Type

    The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

    License

    The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.

  15. Data from: MonkeyPox2022Tweets: A Large-Scale Twitter Dataset on the 2022...

    • dataverse.harvard.edu
    Updated Nov 19, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). MonkeyPox2022Tweets: A Large-Scale Twitter Dataset on the 2022 Monkeypox Outbreak, Findings from Analysis of Tweets, and Open Research Questions [Dataset]. http://doi.org/10.7910/DVN/CR7T5E
    Dataset updated
    Nov 19, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Nirmalya Thakur
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    May 7, 2022 - Nov 11, 2022
    Description

    Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: A large-scale Twitter dataset on the 2022 Monkeypox outbreak, findings from analysis of Tweets, and open research questions,” Infect. Dis. Rep., vol. 14, no. 6, pp. 855–883, 2022, DOI: https://doi.org/10.3390/idr14060087.

    Abstract

    The mining of Tweets to develop datasets on recent issues, global challenges, pandemics, virus outbreaks, emerging technologies, and trending matters has been of significant interest to the scientific community in the recent past, as such datasets serve as a rich data resource for the investigation of different research questions. Furthermore, the virus outbreaks of the past, such as COVID-19, Ebola, Zika virus, and flu, just to name a few, were associated with various works related to the analysis of the multimodal components of Tweets to infer the different characteristics of conversations on Twitter related to these respective outbreaks. The ongoing outbreak of the monkeypox virus, declared a Global Public Health Emergency (GPHE) by the World Health Organization (WHO), has resulted in a surge of conversations about this outbreak on Twitter, which is resulting in the generation of tremendous amounts of Big Data. There has been no prior work in this field thus far that has focused on mining such conversations to develop a Twitter dataset. Therefore, this work presents an open-access dataset of 571,831 Tweets about monkeypox that have been posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset complies with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

    Data Description

    The dataset consists of a total of 571,831 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 11th November 2022 (the most recent date at the time of uploading the most recent version of the dataset). The Tweet IDs are presented in 12 different .txt files based on the timelines of the associated tweets. The following represents the details of these dataset files.

    • TweetIDs_Part1.txt: 13926 Tweet IDs; date range May 7, 2022, to May 21, 2022
    • TweetIDs_Part2.txt: 17705 Tweet IDs; date range May 21, 2022, to May 27, 2022
    • TweetIDs_Part3.txt: 17585 Tweet IDs; date range May 27, 2022, to June 5, 2022
    • TweetIDs_Part4.txt: 19718 Tweet IDs; date range June 5, 2022, to June 11, 2022
    • TweetIDs_Part5.txt: 46718 Tweet IDs; date range June 12, 2022, to June 30, 2022
    • TweetIDs_Part6.txt: 138711 Tweet IDs; date range July 1, 2022, to July 23, 2022
    • TweetIDs_Part7.txt: 105890 Tweet IDs; date range July 24, 2022, to July 31, 2022
    • TweetIDs_Part8.txt: 93959 Tweet IDs; date range August 1, 2022, to August 9, 2022
    • TweetIDs_Part9.txt: 50832 Tweet IDs; date range August 10, 2022, to August 24, 2022
    • TweetIDs_Part10.txt: 39042 Tweet IDs; date range August 25, 2022, to September 19, 2022
    • TweetIDs_Part11.txt: 12341 Tweet IDs; date range September 20, 2022, to October 9, 2022
    • TweetIDs_Part12.txt: 15404 Tweet IDs; date range October 10, 2022, to November 11, 2022

    Please note: The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. For hydrating this dataset, the Hydrator application (link to download the application: https://github.com/DocNow/hydrator/releases and link to a step-by-step tutorial: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweets) may be used.
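
    As a sanity check before hydration, the per-file counts listed in the description can be summed and compared with the stated total of 571,831. The sketch below also shows one way to read IDs and group them into batches. It assumes each .txt file holds one numeric Tweet ID per line, which the description does not state, and the batch size of 100 is an illustrative choice, not a requirement of the Hydrator application.

```python
# Tweet ID counts per file, copied from the dataset description.
PART_COUNTS = {
    "TweetIDs_Part1.txt": 13926,
    "TweetIDs_Part2.txt": 17705,
    "TweetIDs_Part3.txt": 17585,
    "TweetIDs_Part4.txt": 19718,
    "TweetIDs_Part5.txt": 46718,
    "TweetIDs_Part6.txt": 138711,
    "TweetIDs_Part7.txt": 105890,
    "TweetIDs_Part8.txt": 93959,
    "TweetIDs_Part9.txt": 50832,
    "TweetIDs_Part10.txt": 39042,
    "TweetIDs_Part11.txt": 12341,
    "TweetIDs_Part12.txt": 15404,
}


def parse_tweet_ids(text):
    """Return the Tweet IDs found in `text`.

    Assumption: one numeric ID per line; blank or non-numeric lines are skipped.
    """
    return [line.strip() for line in text.splitlines() if line.strip().isdigit()]


def batches(ids, size=100):
    """Yield consecutive chunks of at most `size` IDs (size is illustrative)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# The per-file counts should add up to the total stated in the description.
total = sum(PART_COUNTS.values())
print(total)
```

    In practice, `parse_tweet_ids` would be applied to the contents of each downloaded file, and the observed count compared with the matching entry in `PART_COUNTS`.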

  16. X/Twitter: number of monthly active users 2010-2019

    • statista.com
    Updated Sep 13, 2023
    Cite
    Statista (2023). X/Twitter: number of monthly active users 2010-2019 [Dataset]. https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
    Explore at:
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    How many people use X/Twitter?

    As of the first quarter of 2019, X/Twitter averaged 330 million monthly active users (MAU), a decline from its all-time high of 336 million MAU in the first quarter of 2018. As of the first quarter of 2019, the company switched its user reporting metric to monetizable daily active users (mDAU).

    X/Twitter

    X/Twitter is a social networking and microblogging service, enabling registered users to read and post short messages called tweets. X/Twitter messages are limited to 280 characters and users are also able to upload photos or short videos. Tweets are posted to a publicly available profile or can be sent as direct messages to other users.

    Part of the social platform’s appeal is the ability of users to follow any other user with a public profile, enabling users to interact with celebrities who regularly post on the social media site. Currently, the most-followed person on Twitter is singer Katy Perry with more than 107 million followers. Twitter has also become an important communications channel for governments and heads of state – U.S. President Donald Trump was the most-followed world leader on Twitter, followed by Pope Francis and Indian Prime Minister Narendra Modi.

    Despite the widespread usage among the rich and famous, the decline in active users has not impressed investors, as the platform is largely reliant on delivering advertising to users in order to generate revenues. Twitter’s company revenue in 2018 amounted to three billion U.S. dollars, up from 2.44 billion in the preceding fiscal year. Twitter was only recently able to report a positive annual result for the first time, when the company generated 1.2 billion U.S. dollars in net income in 2018.

  17. Twitter Users Broken down By Country

    • searchlogistics.com
    Cite
    Twitter Users Broken down By Country [Dataset]. https://www.searchlogistics.com/learn/statistics/twitter-user-statistics/
    Explore at:
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The US has historically been the target country for Twitter since its launch in 2006. This is the full breakdown of Twitter users by country.

  18. Data from: On the Role of Images for Analyzing Claims in Social Media

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 23, 2021
    Cite
    Ewerth, Ralph (2021). On the Role of Images for Analyzing Claims in Social Media [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4592248
    Explore at:
    Dataset updated
    Apr 23, 2021
    Dataset provided by
    Müller-Budack, Eric
    Hakimov, Sherzod
    Cheema, Gullal S.
    Ewerth, Ralph
    Description

    This is a multimodal dataset used in the paper "On the Role of Images for Analyzing Claims in Social Media", accepted at CLEOPATRA-2021 (2nd International Workshop on Cross-lingual Event-centric Open Analytics), co-located with The Web Conference 2021.

    The four datasets are curated for two different tasks that broadly come under fake news detection. Originally, the datasets were released as part of challenges or papers for text-based NLP tasks and are further extended here with corresponding images.

    1. clef_en and clef_ar are English and Arabic Twitter datasets for claim check-worthiness detection released in CLEF CheckThat! 2020 by Barrón-Cedeno et al. [1].
    2. lesa is an English Twitter dataset for claim detection released by Gupta et al. [2].
    3. mediaeval is an English Twitter dataset for conspiracy detection released in the MediaEval 2020 Workshop by Pogorelov et al. [3].

    Dataset details such as the data curation and annotation process can be found in the cited papers.

    The datasets released here with corresponding images are smaller than the original text-based tweet datasets. The data statistics are as follows:

    1. clef_en: 281
    2. clef_ar: 2571
    3. lesa: 1395
    4. mediaeval: 1724

    Each folder has two sub-folders and a JSON file, data.json, that consists of crawled tweets. The two sub-folders are:

    1. images: This contains crawled images with the same name as the tweet-id in data.json.
    2. splits: This contains 5-fold splits used for training and evaluation in our paper. Each file in this folder is a csv with two columns
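
    The naming convention described above (each image file named after its tweet ID) can be used to pair tweets with images. The sketch below is illustrative only: the function names are hypothetical, the layout `<dataset folder>/images/<tweet-id>.<extension>` is an assumption, and the structure of data.json is not described here, so the tweet IDs are passed in as a plain list.

```python
from pathlib import Path


def index_images(dataset_dir):
    """Map tweet ID (file name without extension) to the image path.

    Assumption: images live in `<dataset_dir>/images/` and are named
    `<tweet-id>.<extension>`.
    """
    images_dir = Path(dataset_dir) / "images"
    return {p.stem: p for p in sorted(images_dir.iterdir()) if p.is_file()}


def missing_images(tweet_ids, image_index):
    """Return the tweet IDs that have no matching image file."""
    return [t for t in tweet_ids if str(t) not in image_index]
```

    For example, `missing_images(ids, index_images("clef_en"))` would list the tweets in a hypothetical local copy of clef_en that have no crawled image.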

    Code for the paper: https://github.com/cleopatra-itn/image_text_claim_detection

    If you find the dataset and the paper useful, please cite our paper and the corresponding dataset papers [1,2,3]:

    Cheema, Gullal S., et al. "On the Role of Images for Analyzing Claims in Social Media." 2nd International Workshop on Cross-lingual Event-centric Open Analytics (CLEOPATRA), co-located with The Web Conf 2021.

    [1] Barrón-Cedeno, Alberto, et al. "Overview of CheckThat! 2020: Automatic identification and verification of claims in social media." International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham, 2020.
    [2] Gupta, Shreya, et al. "LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content." arXiv preprint arXiv:2101.11891 (2021).
    [3] Pogorelov, Konstantin, et al. "FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020." MediaEval 2020 Workshop. 2020.

  19. Twitter Friends

    • kaggle.com
    Updated Sep 2, 2016
    Cite
    Hubert Wassner (2016). Twitter Friends [Dataset]. https://www.kaggle.com/hwassner/TwitterFriends/discussion
    Dataset updated
    Sep 2, 2016
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hubert Wassner
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Twitter Friends and hashtags

    Context

    This dataset is an extract of a wider database aimed at collecting Twitter users' friends (the other accounts one follows). The overall goal is to study users' interests through who they follow and the connection to the hashtags they have used.

    Content

    It is a list of Twitter user records. In the JSON format, each Twitter user is stored as one object in a list of more than 40,000 objects. Each object holds:

    • avatar : URL to the profile picture

    • followerCount : the number of followers of this user

    • friendsCount : the number of accounts this user follows.

    • friendName : stores the @name (without the '@') of the user (beware this name can be changed by the user)

    • id : user ID, this number can not change (you can retrieve screen name with this service : https://tweeterid.com/)

    • friends : the list of IDs the user follows (data stored is IDs of users followed by this user)

    • lang : the language declared by the user (in this dataset there is only "en" (english))

    • lastSeen : the timestamp of the date when this user posted their last tweet.

    • tags : the hashtags (with or without #) used by the user. These are the "trending topics" the user tweeted about.

    • tweetID : Id of the last tweet posted by this user.

    You also have the CSV format which uses the same naming convention.

    These users were selected because they tweeted on Twitter trending topics. I selected users that have at least 100 followers and follow at least 100 other accounts (in order to filter out spam and non-informative/empty accounts).
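
    The selection rule and the tags field can be illustrated with a small sketch. The field names (id, followerCount, friends, tags) come from the list above; the sample records and their values are made up for illustration, the helper names are hypothetical, and followerCount is assumed to be an integer. The number of followed accounts is taken from the length of the friends list.

```python
from collections import Counter

# Hypothetical records using the documented field names; all values are invented.
users = [
    {"id": 1, "followerCount": 250, "friends": list(range(1000, 1150)), "tags": ["#Python", "python", "data"]},
    {"id": 2, "followerCount": 40, "friends": list(range(2000, 2300)), "tags": ["#data"]},
    {"id": 3, "followerCount": 900, "friends": list(range(3000, 3020)), "tags": ["news"]},
    {"id": 4, "followerCount": 120, "friends": list(range(4000, 4100)), "tags": ["#News", "#data"]},
]


def normalize_tag(tag):
    """Tags may appear with or without '#', so strip it and lowercase."""
    return tag.lstrip("#").lower()


def passes_filter(user, minimum=100):
    """At least `minimum` followers and at least `minimum` followed accounts."""
    return user["followerCount"] >= minimum and len(user["friends"]) >= minimum


def tag_counts(records):
    """Count how many users used each normalized hashtag (once per user)."""
    counts = Counter()
    for user in records:
        counts.update({normalize_tag(t) for t in user["tags"]})
    return counts


kept = [u for u in users if passes_filter(u)]
print([u["id"] for u in kept], tag_counts(kept).most_common())
```

    Counting each tag once per user keeps a single very active account from dominating the totals; counting raw occurrences would be an equally reasonable choice.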

    Acknowledgements

    This dataset was built by Hubert Wassner (me) using the Twitter public API. More data can be obtained on request (hubert.wassner AT gmail.com); at this time I've collected over 5 million profiles in different languages. Some more information can be found here (in French only): http://wassner.blogspot.fr/2016/06/recuperer-des-profils-twitter-par.html

    Past Research

    No public research has been done (until now) on this dataset. I made a private application, described here (in French): http://wassner.blogspot.fr/2016/09/twitter-profiling.html, which uses the full dataset (millions of full profiles).

    Inspiration

    One can analyse a lot of things with this dataset:

    • stats about followers & followings
    • manifold learning or unsupervised learning from friend list
    • hashtag prediction from friend list

    Contact

    Feel free to ask any question (or help request) via Twitter : @hwassner

    Enjoy! ;)

  20. X/Twitter average impressions on posts 2023-2024

    • statista.com
    Updated Aug 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). X/Twitter average impressions on posts 2023-2024 [Dataset]. https://www.statista.com/statistics/1483819/x-twitter-average-impressions-posts/
    Explore at:
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Sep 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    In 2024, X (formerly Twitter) posts generated an average of 2,121 impressions, up from 1,206 impressions in 2023. In 2022, Elon Musk's purchase of Twitter sent shockwaves through the tech world, and much has changed on the platform since.

Cite
Hina Qayyum; Hina Qayyum (2024). SenTopX: A Benchmark Twitter Dataset for User Sentiment on Various Topics [Dataset]. http://doi.org/10.5281/zenodo.11243662

SenTopX: A Benchmark Twitter Dataset for User Sentiment on Various Topics

Available download formats: zip, csv
Dataset updated
May 27, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Hina Qayyum; Hina Qayyum
License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Time period covered
May 25, 2024
Description

This is a longitudinal Twitter dataset of 143K users during the period 2017-2021. The following is the detail of all the files:

  • SenTopX_userIDs.txt: contains user IDs of 143K Twitter users.
  • userIDs_tweetIDs.zip: contains Tweet IDs of users; the name of each file is the user ID, and the file contains the list of all that user's tweet IDs.
  • users_16_perspective_toxicity_scores.csv: contains user IDs and 16 median Perspective API scores; the vector is shared as the mean, median, and Gini index of scores calculated over all tweets of a user.
  • LDAvis_top30_words_for_extracted_topics.csv: contains the top 30 most relevant words for each topic extracted by tweet-level topic modeling using the BERTweet topic model.
  • topic_modelling_statistics_per_user.csv contains important and relevant statistics related to topic modeling results:
      1. user: This column represents the identifier for the user. Each row in the CSV corresponds to a specific user, and this column helps to track and differentiate between the users.

      2. avg_topic_probability: This column contains the average probability of the topics for each user calculated across all of the tweets in order to compare users in a meaningful way. It represents the average likelihood that a particular user discusses various topics over the observed period.

      3. maximum_topic_avg: This column holds the value of the highest average probability among all topics for each user. It indicates the topic that the user most frequently discusses, on average.

      4. index_max_avg_topic_probability_200: This column specifies the index or identifier of the topic with the highest average probability out of 200 possible topics. It shows which topic (out of 200) the user discusses the most.

      5. global_avg: This column includes the global average probability of topics across all users. It provides a baseline or overall average topic probability that can be used for comparative purposes.

      6. max_global_avg: This column contains the maximum global average probability across all topics for all users. It identifies the most discussed topic across the entire user base.

      7. index_max_global_avg: This column shows the index or identifier of the topic with the highest global average probability. It indicates which topic (out of 200) is the most popular across all users.

      8. entropy_200_topic: This column represents the entropy of the topics for each user, calculated over 200 topics. Entropy measures the diversity or unpredictability in the user's discussion of topics, with higher entropy indicating more varied topic discussion.

      In summary, these columns are used to analyze the topic engagement and preferences of users on a platform, highlighting the most frequently discussed topics, the variability in topic discussions, and how individual user behavior compares to overall trends.
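
The per-user columns described above can be illustrated with a short sketch. It is one plausible reading, not the authors' code: it assumes each tweet has a probability vector over topics, averages those vectors per user, and computes Shannon entropy with the natural logarithm over the averaged vector (the description does not state the logarithm base or the exact entropy input). The function name and the dictionary keys are hypothetical, and the example uses 3 topics instead of 200 for readability.

```python
import math


def user_topic_stats(tweet_topic_probs):
    """Summarize one user's topic distribution.

    `tweet_topic_probs` is a list of per-tweet probability vectors,
    each with one entry per topic (an assumed input layout).
    """
    n_tweets = len(tweet_topic_probs)
    n_topics = len(tweet_topic_probs[0])
    # Average probability of each topic across all of the user's tweets.
    avg = [sum(row[k] for row in tweet_topic_probs) / n_tweets for k in range(n_topics)]
    max_avg = max(avg)
    # Shannon entropy (natural log) of the averaged vector; zero terms are skipped.
    entropy = -sum(p * math.log(p) for p in avg if p > 0)
    return {
        "avg": avg,
        "maximum_topic_avg": max_avg,
        "index_max_avg_topic_probability": avg.index(max_avg),
        "entropy": entropy,
    }


# Two tweets over three hypothetical topics.
stats = user_topic_stats([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
print(stats["index_max_avg_topic_probability"], round(stats["entropy"], 3))
```

A user whose tweets all fall on a single topic gets an entropy of zero under this reading, while a user spread evenly over many topics gets a higher value, which matches the interpretation of entropy_200_topic given above.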
