88 datasets found
  1. Twitter Tweets Sentiment Dataset

    • kaggle.com
    zip
    Updated Apr 8, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Explore at:
    zip(1289519 bytes)Available download formats
    Dataset updated
    Apr 8, 2022
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    https://raw.githubusercontent.com/Masterx-AI/Project_Twitter_Sentiment_Analysis_/main/twitt.jpg" alt="">

    Description:

    Twitter is an online Social Media Platform where people share their their though as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem and we shall help it by creating a strong NLP based-classifier model to distinguish the negative tweets & block such tweets. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet

    Acknowledgement:

    The dataset is download from Kaggle Competetions:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict the twitter sentiments.
    • Compare the evaluation metrics of vaious classification algorithms.
  2. Elon Musk Tweets 2010 to 2025 (April)

    • kaggle.com
    zip
    Updated Apr 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dada Lyndell (2025). Elon Musk Tweets 2010 to 2025 (April) [Dataset]. https://www.kaggle.com/datasets/dadalyndell/elon-musk-tweets-2010-to-2025-march
    Explore at:
    zip(13814200 bytes)Available download formats
    Dataset updated
    Apr 13, 2025
    Authors
    Dada Lyndell
    Description
    • all_musk_posts.csv - Elon Musk's tweets from his official account (@elonmusk) from the very beginning till April 13, 2025.
    • musk_quote_tweets.csv - the original tweets that Elon Musk quote-tweeted to his official account (@elonmusk) from the very beginning till April 13, 2025.

    I scraped Elon Musk's tweets and combined it with other datasets published on Kaggle in different years: - All Elon Musk's Tweets - tweets from Bill Gates, Elon Musk and Ed Lee - Elon Musk Tweets, 2010 to 2017 - Elon Musk Tweets (2021-2023)

    The business magnate Elon Musk initiated an acquisition of the American social media company Twitter, Inc. on April 14, 2022, and concluded it on October 27, 2022. Musk had begun buying shares of the company in January 2022, becoming its largest shareholder by April with a 9.1 percent ownership stake. (Wikipedia)

    By early 2024, Musk had become a vocal and financial supporter of Donald Trump. (Washington Post)

    The data was collected and combined for the publication Poster boy: Six instances of Kremlin disinformation amplified through Elon Musk’s social network (The Insider, 2025-03-12). Below are two visualisations based of this data.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4728018%2F698e901a8dec9a84d7d5d5799427da42%2Ffile-efae4a0f8b8c46becfa2a845a8b6ac17.jpg?generation=1742660891320296&alt=media" alt="">

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4728018%2Ff98f3e2b42d9201cee3a37d1d3b5fa24%2Ffile-e0d66dd19064cc2c1ff91eeb34ee8157.jpg?generation=1742660914839024&alt=media" alt="">

    Content of the dataset all_musk_posts.csv:

    id - ID of the tweet by elonmusk

    url - link to a tweet (on x.com)

    twitterUrl - link to a tweet (on twitter.com)

    fullText - text of the tweet

    retweetCount - number of retweets

    replyCount - number of replies

    likeCount - number of likes

    quoteCount - number of quotes

    viewCount - number of views

    createdAt - timestamp, UTC

    bookmarkCount - number of bookmarks

    isReply - boolean, True if the post is a reply

    inReplyToId - ID of the original tweet if that's a reply

    conversationId - conversation ID

    inReplyToUserId - ID of the user that received a reply

    inReplyToUsername - current username of the user that received a reply

    isPinned - boolean, True if the post was pinned

    isRetweet - boolean, True if the post is a retweet

    isQuote - boolean, True if the post is a quote

    isConversationControlled - conversation marked as "controlled", only selected users can reply

    possiblySensitive - conversation marked as "sensitive"

    Content of the dataset musk_quote_tweets.csv:

    orig_tweet_id - ID of the original tweet by that @elonmusk quote-tweeted

    orig_tweet_created_at - timestamp of the original tweet, UTC

    orig_tweet_text - text of the original tweet, UTC

    orig_tweet_url - link to the original tweet (on x.com)

    orig_tweet_twitter_url - link to the original tweet (on twitter.com)

    orig_tweet_username - current (March 2025) username of the account that posted the original tweet

    orig_tweet_retweet_count - number of retweets for the original tweet

    orig_tweet_reply_count - number of replies for the original tweet

    orig_tweet_like_count - number of likes for the original tweet

    orig_tweet_quote_count - number of quotes for the original tweet

    orig_tweet_view_count - number of views for the original tweet

    orig_tweet_bookmark_count - number of bookmarks for the original tweet

    musk_tweet_id - ID of the quote-tweet by elonmusk

    musk_quote_tweet - text of the quote-tweet by elonmusk

    musk_quote_retweet_count - number of retweets for the quote-tweet by elonmusk

    musk_quote_reply_count - number of replies for the quote-tweet by elonmusk

    musk_quote_like_count- number of likes for the quote-tweet by elonmusk

    musk_quote_quote_count- number of quotes for the quote-tweet by elonmusk

    musk_quote_view_count - number of views for the quote-tweet by elonmusk

    musk_quote_bookmark_count - number of bookmarks for the quote-tweet by elonmusk

    musk_quote_created_at - timestamp of the quote-tweet by elonmusk, UTC

    Acknowledgements

    I do not own this data however I scraped this data for educational purposes ONLY. Please do not violate any...

  3. Sentiment Analysis on Financial Tweets

    • kaggle.com
    zip
    Updated Sep 5, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vivek Rathi (2019). Sentiment Analysis on Financial Tweets [Dataset]. https://www.kaggle.com/datasets/vivekrathi055/sentiment-analysis-on-financial-tweets
    Explore at:
    zip(2538259 bytes)Available download formats
    Dataset updated
    Sep 5, 2019
    Authors
    Vivek Rathi
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The following information can also be found at https://www.kaggle.com/davidwallach/financial-tweets. Out of curosity, I just cleaned the .csv files to perform a sentiment analysis. So both the .csv files in this dataset are created by me.

    Anything you read in the description is written by David Wallach and using all this information, I happen to perform my first ever sentiment analysis.

    "I have been interested in using public sentiment and journalism to gather sentiment profiles on publicly traded companies. I first developed a Python package (https://github.com/dwallach1/Stocker) that scrapes the web for articles written about companies, and then noticed the abundance of overlap with Twitter. I then developed a NodeJS project that I have been running on my RaspberryPi to monitor Twitter for all tweets coming from those mentioned in the content section. If one of them tweeted about a company in the stocks_cleaned.csv file, then it would write the tweet to the database. Currently, the file is only from earlier today, but after about a month or two, I plan to update the tweets.csv file (hopefully closer to 50,000 entries.

    I am not quite sure how this dataset will be relevant, but I hope to use these tweets and try to generate some sense of public sentiment score."

    Content

    This dataset has all the publicly traded companies (tickers and company names) that were used as input to fill the tweets.csv. The influencers whose tweets were monitored were: ['MarketWatch', 'business', 'YahooFinance', 'TechCrunch', 'WSJ', 'Forbes', 'FT', 'TheEconomist', 'nytimes', 'Reuters', 'GerberKawasaki', 'jimcramer', 'TheStreet', 'TheStalwart', 'TruthGundlach', 'Carl_C_Icahn', 'ReformedBroker', 'benbernanke', 'bespokeinvest', 'BespokeCrypto', 'stlouisfed', 'federalreserve', 'GoldmanSachs', 'ianbremmer', 'MorganStanley', 'AswathDamodaran', 'mcuban', 'muddywatersre', 'StockTwits', 'SeanaNSmith'

    Acknowledgements

    The data used here is gathered from a project I developed : https://github.com/dwallach1/StockerBot

    Inspiration

    I hope to develop a financial sentiment text classifier that would be able to track Twitter's (and the entire public's) feelings about any publicly traded company (and cryptocurrency)

  4. h

    Swahili-tweet-sentiment

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Davis David, Swahili-tweet-sentiment [Dataset]. https://huggingface.co/datasets/Davis/Swahili-tweet-sentiment
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Davis David
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A new Swahili tweet dataset for sentiment analysis.

      Issues ⚠️
    

    Incase you have any difficulties or issues while trying to run the script you can raise it on the issues section.

      Pull Requests 🔧
    

    If you have something to add or new idea to implement, you are welcome to create a pull requests on improvement.

      Give it a Like 👍
    

    If you find this dataset useful, give it a like so as many people can get to know it.

      Credits
    

    All the credits to Davis David… See the full description on the dataset page: https://huggingface.co/datasets/Davis/Swahili-tweet-sentiment.

  5. In-Depth Twitter Retweet Analysis Dataset

    • kaggle.com
    zip
    Updated Jul 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mulenga Kawimbe (2024). In-Depth Twitter Retweet Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/mulengakawimbe89/in-depth-twitter-retweet-analysis-dataset
    Explore at:
    zip(51790 bytes)Available download formats
    Dataset updated
    Jul 30, 2024
    Authors
    Mulenga Kawimbe
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    In-Depth Twitter Retweet Analysis Dataset

    Dataset Overview

    This dataset provides an extensive analysis of Twitter retweet activities, focusing on various attributes that can influence and describe the nature of retweets. It consists of multiple rows of data, each representing a unique Twitter retweet instance with detailed information on its characteristics.

    Dataset Columns

    1. Weekday: The day of the week when the retweet occurred.

      • Example values: "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"
    2. Hour: The hour of the day when the retweet was made, in 24-hour format.

      • Example values: 0, 1, 2, ..., 23
    3. Day: The day of the month when the retweet was posted.

      • Example values: 1, 2, 3, ..., 31
    4. Lang: The language code of the tweet that was retweeted.

      • Example values: "en" (English), "es" (Spanish), "fr" (French)
    5. Reach: The estimated number of users who have seen the retweet.

    6. RetweetCount: The number of times the retweeted tweet has been retweeted further.

    7. Likes: The number of likes received by the retweeted tweet.

    8. Klout: The Klout score of the user who posted the original tweet, which is a measure of their influence on social media.

    9. Sentiment: The sentiment score of the retweeted tweet, indicating the overall emotional tone.

      • Example values: -1.0 (very negative), 0.0 (neutral), 1.0 (very positive)
    10. LocationID: A numerical identifier representing the geographical location of the user who posted the retweet.

    Usage

    This dataset can be utilized for various analyses, including: - Identifying peak times for retweets - Analyzing the influence of tweet attributes on retweet rates - Sentiment analysis of popular retweets - Geographical distribution of retweet activity - Correlating Klout scores with retweet reach and engagement

    Applications

    Researchers, marketers, and social media analysts can use this dataset to gain insights into Twitter retweet behavior, optimize social media strategies, and understand the factors contributing to the virality of tweets.

  6. d

    Data from: Twitter Big Data as A Resource For Exoskeleton Research: A...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Thakur, Nirmalya (2023). Twitter Big Data as A Resource For Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets and 100 Research Questions [Dataset]. http://doi.org/10.7910/DVN/VPPTRF
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Thakur, Nirmalya
    Description

    Please cite the following paper when using this dataset: N. Thakur, “Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets and 100 Research Questions,” Preprints, 2022, DOI: 10.20944/preprints202206.0383.v1 Abstract The exoskeleton technology has been rapidly advancing in the recent past due to its multitude of applications and use cases in assisted living, military, healthcare, firefighting, and industries. With the projected increase in the diverse uses of exoskeletons in the next few years in these application domains and beyond, it is crucial to study, interpret, and analyze user perspectives, public opinion, reviews, and feedback related to exoskeletons, for which a dataset is necessary. The Internet of Everything era of today's living, characterized by people spending more time on the Internet than ever before, holds the potential for developing such a dataset by mining relevant web behavior data from social media communications, which have increased exponentially in the last few years. Twitter, one such social media platform, is highly popular amongst all age groups, who communicate on diverse topics including but not limited to news, current events, politics, emerging technologies, family, relationships, and career opportunities, via tweets, while sharing their views, opinions, perspectives, and feedback towards the same. Therefore, this work presents a dataset of about 140,000 Tweets related to exoskeletons. that were mined for a period of 5-years from May 21, 2017, to May 21, 2022. The tweets contain diverse forms of communications and conversations which communicate user interests, user perspectives, public opinion, reviews, feedback, suggestions, etc., related to exoskeletons. Instructions: This dataset contains about 140,000 Tweets related to exoskeletons. that were mined for a period of 5-years from May 21, 2017, to May 21, 2022. The tweets contain diverse forms of communications and conversations which communicate user interests, user perspectives, public opinion, reviews, feedback, suggestions, etc., related to exoskeletons. The dataset contains only tweet identifiers (Tweet IDs) due to the terms and conditions of Twitter to re-distribute Twitter data only for research purposes. They need to be hydrated to be used. The process of retrieving a tweet's complete information (such as the text of the tweet, username, user ID, date and time, etc.) using its ID is known as the hydration of a tweet ID. The Hydrator application (link to download the application: https://github.com/DocNow/hydrator/releases and link to a step-by-step tutorial: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweets) or any similar application may be used for hydrating this dataset. Data Description This dataset consists of 7 .txt files. The following shows the number of Tweet IDs and the date range (of the associated tweets) in each of these files. 
Filename: Exoskeleton_TweetIDs_Set1.txt (Number of Tweet IDs – 22945, Date Range of Tweets - July 20, 2021 – May 21, 2022) Filename: Exoskeleton_TweetIDs_Set2.txt (Number of Tweet IDs – 19416, Date Range of Tweets - Dec 1, 2020 – July 19, 2021) Filename: Exoskeleton_TweetIDs_Set3.txt (Number of Tweet IDs – 16673, Date Range of Tweets - April 29, 2020 - Nov 30, 2020) Filename: Exoskeleton_TweetIDs_Set4.txt (Number of Tweet IDs – 16208, Date Range of Tweets - Oct 5, 2019 - Apr 28, 2020) Filename: Exoskeleton_TweetIDs_Set5.txt (Number of Tweet IDs – 17983, Date Range of Tweets - Feb 13, 2019 - Oct 4, 2019) Filename: Exoskeleton_TweetIDs_Set6.txt (Number of Tweet IDs – 34009, Date Range of Tweets - Nov 9, 2017 - Feb 12, 2019) Filename: Exoskeleton_TweetIDs_Set7.txt (Number of Tweet IDs – 11351, Date Range of Tweets - May 21, 2017 - Nov 8, 2017) Here, the last date for May is May 21 as it was the most recent date at the time of data collection. The dataset would be updated soon to incorporate more recent tweets.

  7. A Twitter Dataset on Tweets about People who Got Lost due to Dementia

    • figshare.com
    application/gzip
    Updated Jan 16, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kelvin KF Tsoi; Nicholas B Chan; Felix CH Chan; Lingling Zhang; Annisa CH Lee; Helen ML Meng (2018). A Twitter Dataset on Tweets about People who Got Lost due to Dementia [Dataset]. http://doi.org/10.6084/m9.figshare.5788125.v1
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Jan 16, 2018
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Kelvin KF Tsoi; Nicholas B Chan; Felix CH Chan; Lingling Zhang; Annisa CH Lee; Helen ML Meng
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used and analyzed in the paper "How can we Better Use Twitter to find a Person who Got Lost due to Dementia?".A total of five tables are included. 1. raw_tweets.rds: All tweets that mentioned (i) "Dementia" or "Alzheimer"; and (ii) "Lost" or "Missing", which were crawled from Twitter from April to May 2017. 2. raw_userinfo.rds: The corresponding Twitter user info of Tweets.3. filtered_tweets.csv: Tweets that were included in the study. Details (age, gender, place, etc.) of the corresponding lost person mentioned in each tweet were appended in this table. 4. filtered_userinfo.csv: The corresponding Twitter user info of Tweets that were included in the study. Occupation (police / media / others) of each user were appended in this table. 5. cleansed_lostcases.csv: A cleansed table that shows several features of the lost cases.

  8. Z

    Data from: TWIGMA: A dataset of AI-Generated Images with Metadata From...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yiqun Chen; James Zou (2024). TWIGMA: A dataset of AI-Generated Images with Metadata From Twitter [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8031784
    Explore at:
    Dataset updated
    May 28, 2024
    Dataset provided by
    Stanford University
    Authors
    Yiqun Chen; James Zou
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update May 2024: Fixed a data type issue with "id" column that prevented twitter ids from rendering correctly.

    Recent progress in generative artificial intelligence (gen-AI) has enabled the generation of photo-realistic and artistically-inspiring photos at a single click, catering to millions of users online. To explore how people use gen-AI models such as DALLE and StableDiffusion, it is critical to understand the themes, contents, and variations present in the AI-generated photos. In this work, we introduce TWIGMA (TWItter Generative-ai images with MetadatA), a comprehensive dataset encompassing 800,000 gen-AI images collected from Jan 2021 to March 2023 on Twitter, with associated metadata (e.g., tweet text, creation date, number of likes).

    Through a comparative analysis of TWIGMA with natural images and human artwork, we find that gen-AI images possess distinctive characteristics and exhibit, on average, lower variability when compared to their non-gen-AI counterparts. Additionally, we find that the similarity between a gen-AI image and human images (i) is correlated with the number of likes; and (ii) can be used to identify human images that served as inspiration for the gen-AI creations. Finally, we observe a longitudinal shift in the themes of AI-generated images on Twitter, with users increasingly sharing artistically sophisticated content such as intricate human portraits, whereas their interest in simple subjects such as natural scenes and animals has decreased. Our analyses and findings underscore the significance of TWIGMA as a unique data resource for studying AI-generated images.

    Note that in accordance with the privacy and control policy of Twitter, NO raw content from Twitter is included in this dataset and users could and need to retrieve the original Twitter content used for analysis using the Twitter id. In addition, users who want to access Twitter data should consult and follow rules and regulations closely at the official Twitter developer policy at https://developer.twitter.com/en/developer-terms/policy.

  9. s

    Twitter bot profiling

    • researchdata.smu.edu.sg
    • smu.edu.sg
    • +1more
    pdf
    Updated May 31, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Living Analytics Research Centre (2023). Twitter bot profiling [Dataset]. http://doi.org/10.25440/smu.12062706.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    http://rightsstatements.org/vocab/InC/1.0/http://rightsstatements.org/vocab/InC/1.0/

    Description

    This dataset comprises a set of Twitter accounts in Singapore that are used for social bot profiling research conducted by the Living Analytics Research Centre (LARC) at Singapore Management University (SMU). Here a bot is defined as a Twitter account that generates contents and/or interacts with other users automatically (at least according to human judgment). In this research, Twitter bots have been categorized into three major types:

    Broadcast bot. This bot aims at disseminating information to general audience by providing, e.g., benign links to news, blogs or sites. Such bot is often managed by an organization or a group of people (e.g., bloggers). Consumption bot. The main purpose of this bot is to aggregate contents from various sources and/or provide update services (e.g., horoscope reading, weather update) for personal consumption or use. Spam bot. This type of bots posts malicious contents (e.g., to trick people by hijacking certain account or redirecting them to malicious sites), or promotes harmless but invalid/irrelevant contents aggressively.

    This categorization is general enough to cater for new, emerging types of bot (e.g., chatbots can be viewed as a special type of broadcast bots). The dataset was collected from 1 January to 30 April 2014 via the Twitter REST and streaming APIs. Starting from popular seed users (i.e., users having many followers), their follow, retweet, and user mention links were crawled. The data collection proceeds by adding those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. Using this procedure, a total of 159,724 accounts have been collected. To identify bots, the first step is to check active accounts who tweeted at least 15 times within the month of April 2014. These accounts were then manually checked and labelled, of which 589 bots were found. As many more human users are expected in the Twitter population, the remaining accounts were randomly sampled and manually checked. With this, 1,024 human accounts were identified. In total, this results in 1,613 labelled accounts. Related Publication: R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E.-P. Lim. (2016). On profiling bots in social media. Proceedings of the International Conference on Social Informatics (SocInfo’16), 92-109. Bellevue, WA. https://doi.org/10.1007/978-3-319-47880-7_6

  10. Z

    Dataset for the Article "A Predictive Method to Improve the Effectiveness of...

    • data.niaid.nih.gov
    Updated May 24, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marco Furini; Federica Mandreoli; Riccardo Martoglia; Manuela Montangero (2021). Dataset for the Article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4782983
    Explore at:
    Dataset updated
    May 24, 2021
    Dataset provided by
    University of Modena and Reggio Emilia, Italy
    Authors
    Marco Furini; Federica Mandreoli; Riccardo Martoglia; Manuela Montangero
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario".

    Abstract:

    Museums are embracing social technologies in the attempt to broaden their audience and to engage people. Although social communication seems an easy task, media managers know how hard it is to reach millions of people with a simple message. Indeed, millions of posts are competing every day to get visibility in terms of likes and shares and very little research focused on museums communication to identify best practices. In this paper, we focus on Twitter and we propose a novel method that exploits interpretable machine learning techniques to: (a) predict whether a tweet will likely be appreciated by Twitter users or not; (b) present simple suggestions that will help enhancing the message and increasing the probability of its success. Using a real-world dataset of around 40,000 tweets written by 23 world famous museums, we show that our proposed method allows identifying tweet features that are more likely to influence the tweet success.

    Code to run a selection of experiments is available at https://github.com/rmartoglia/predict-twitter-ch

    Dataset structure

    The dataset contains the dataset used in the experiments of the above research paper. Only the extracted features for the museum tweet threads (and not the message full text) are provided and needed for the analyses.

    We selected 23 well known world spread art museums and grouped them into five groups: G1 (museums with at least three million of followers); G2 (museums with more than one million of followers); G3 (museums with more than 400,000 followers); G4 (museums with more that 200,000 followers); G5 (Italian museums). From these museums, we analyzed ca. 40,000 tweets, with a number varying from 5k ca. to 11k ca. for each museum group, depending on the number of museums in each group.

    Content features: these are the features that can be drawn form the content of the tweet itself. We further divide such features in the following two categories:

    – Countable: these features have a value ranging into different intervals. We take into consideration: the number of hashtags (i.e., words preceded by #) in the tweet, the number of URLs (i.e., links to external resources), the number of images (e.g., photos and graphical emoticons), the number of mentions (i.e., twitter accounts preceded by @), the length of the tweet;

    – On-Off : these features have binary values in {0, 1}. We observe whether the tweet has exclamation marks, question marks, person names, place names, organization names, other names. Moreover, we also take into consideration the tweet topic density: assuming that the involved topics correspond to the hashtags mentioned in the text, we define a tweet as dense of topics if the number of hashtags it contains is greater than a given threshold, set to 5. Finally, we observe the tweet sentiment that might be present (positive or negative) or not (neutral).

    Context features: these features are not drawn form the content of the tweet itself and might give a larger picture of the context in which the tweet was sent. Namely, we take into consideration the part of the day in which the tweet was sent (morning, afternoon, evening and night respectively from 5:00am to 11:59am, from 12:00pm to 5:59pm, from 6:00pm to 10:59pm and from 11pm to 4:59am), and a boolean feature indicating whether the tweet is a retweet or not.

    User features: these features are proper of the user that sent the tweet, and are the same for all the tweets of this user. Namely we consider the name of the museum and the number of followers of the user.

  11. Elon Musk Tweets Dataset (17K)

    • kaggle.com
    zip
    Updated Sep 12, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yasir Raza (2022). Elon Musk Tweets Dataset (17K) [Dataset]. https://www.kaggle.com/datasets/yasirabdaali/elon-musk-tweets-dataset-17k
    Explore at:
    zip(869801 bytes)Available download formats
    Dataset updated
    Sep 12, 2022
    Authors
    Yasir Raza
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction:

    Elon Musk is an American business magnate. He was one of the founders of PayPal in the past, and the founder and/or co-founder Elon Musk is an American business magnate. He was one of the founders of PayPal in the past, and the founder and CEO of SpaceX, Tesla, SolarCity, OpenAI, Neuralink, and The Boring Company in the present. He is known as much for his extreme forward-thinking ideas and huge media presence as he is for his extremely business savvy.

    Musk is famously active on Twitter. This dataset contains all tweets made by @elonmusk, his official Twitter handle.

    Inspiration:

    Can you figure out Elon Musk's opinions on various things by studying his Twitter statements? How did Elon Musk's post rate increase, decrease, or stayed about the same over time?

    Features of the Data:

    This dataset has the following features; - Date Created - Number of Likes - Source of Tweet - Tweets

  12. Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...

    • zenodo.org
    txt
    Updated Nov 17, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nirmalya Thakur; Nirmalya Thakur (2022). MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak [Dataset]. http://doi.org/10.5281/zenodo.6760926
    Explore at:
    txtAvailable download formats
    Dataset updated
    Nov 17, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Nirmalya Thakur; Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite the following paper when using this dataset:

    N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2

    Abstract

    The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization is considering whether the outbreak should be assessed as a “potential public health emergency of international concern” or PHEIC, as was done for the COVID-19 and Ebola outbreaks in the past. During this time, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.

    Data Description

    The dataset consists of a total of 102,452 tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 26th June 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 5 different .txt files based on the timelines of the associated tweets. The following table provides the details of these dataset files.

    Filename

    No. of Tweet IDs

    Date Range of the Tweet IDs

    TweetIDs_Part1.txt

    13926

    May 7, 2022 to May 21, 2022

    TweetIDs_Part2.txt

    17705

    May 21, 2022 to May 27, 2022

    TweetIDs_Part3.txt

    17585

    May 27, 2022 to June 5, 2022

    TweetIDs_Part4.txt

    19718

    June 5, 2022 to June 11, 2022

    TweetIDs_Part5.txt

    33518

    June 12, 2022 to June 26, 2022

    The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. For hydrating this dataset the Hydrator application (link to download and a step-by-step tutorial on how to use Hydrator) may be used.

  13. S

    Social media profile growth, engagement rate, and reach

    • data.sugarlandtx.gov
    xlsx
    Updated Jan 3, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Communications and Community Engagement (2024). Social media profile growth, engagement rate, and reach [Dataset]. https://data.sugarlandtx.gov/dataset/social-media-profile-growth-engagement-rate-and-reach
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jan 3, 2024
    Dataset authored and provided by
    Communications and Community Engagement
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Profile growth - the growth on our social platforms to see where and when we're gaining followers. Engagement rate - a ratio of how many people interacted with ours posts based on when users are usually online. Reach - the number of feeds our posts appeared in (doesn't mean people interacted with the post).

  14. d

    Data from: Supersharers of fake news on Twitter

    • dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sahar Baribi-Bartov; Briony Swire-Thompson; Nir Grinberg (2025). Supersharers of fake news on Twitter [Dataset]. http://doi.org/10.5061/dryad.44j0zpcmq
    Explore at:
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Sahar Baribi-Bartov; Briony Swire-Thompson; Nir Grinberg
    Time period covered
    Jan 1, 2024
    Description

    Governments may have the capacity to flood social media with fake news, but little is known about the use of flooding by ordinary voters. In this work, we identify 2107 registered US voters that account for 80% of the fake news shared on Twitter during the 2020 US presidential election by an entire panel of 664,391 voters. We find that supersharers are important members of the network, reaching a sizable 5.2% of registered voters on the platform. Supersharers have a significant overrepresentation of women, older adults, and registered Republicans. Supersharers' massive volume does not seem automated but is rather generated through manual and persistent retweeting. These findings highlight a vulnerability of social media for democracy, where a small group of people distort the political reality for many., This dataset contains aggregated information necessary to replicate the results reported in our work on Supersharers of Fake News on Twitter while respecting and preserving the privacy expectations of individuals included in the analysis. No individual-level data is provided as part of this dataset. The data collection process that enabled the creation of this dataset leveraged a large-scale panel of registered U.S. voters matched to Twitter accounts. We examined the activity of 664,391 panel members who were active on Twitter during the months of the 2020 U.S. presidential election (August to November 2020, inclusive), and identified a subset of 2,107 supersharers, which are the most prolific sharers of fake news in the panel that together account for 80% of fake news content shared on the platform. We rely on a source-level definition of fake news, that uses the manually-labeled list of fake news sites by Grinberg et al. 2019 and an updated list based on NewsGuard ratings (commercial..., , # Supersharers of Fake News on Twitter

    This repository contains data and code for replication of the results presented in the paper.

    The folders are mostly organized by research questions as detailed below. Each folder contains the code and publicly available data necessary for the replication of results. Importantly, no individual-level data is provided as part of this repository. De-identified individual-level data can be attained for IRB-approved uses under the terms and conditions specified in the paper. Once access is granted, the restricted-access data is expected to be located under ./restricted_data.

    The folders in this repository are the following:

    Preprocessing

    Code under the preprocessing folder contains the following:

    1. source classifier - the code used to train a classifier based on NewsGuard domain flags to match the fake news labels source definition use in Grinberg et el. 2019 labels.
    2. political classifier - the code used to identify political tweets, i...
  15. Anna Kendrick Tweets

    • kaggle.com
    zip
    Updated Dec 21, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2022). Anna Kendrick Tweets [Dataset]. https://www.kaggle.com/datasets/thedevastator/anna-kendrick-twitter-engagement-metrics-2015-20
    Explore at:
    zip(214386 bytes)Available download formats
    Dataset updated
    Dec 21, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Anna Kendrick Tweets

    Analyzing Tweet Success

    By Twitter [source]

    About this dataset

    This dataset provides an invaluable set of data covering a period of five years on one of the most popular celebrities of modern times - actress, singer and songwriter Anna Kendrick (annakendrick47). It gives researchers a comprehensive look into her online presence and content, allowing them to analyse engagement metrics like like count, media involvement, outlinks, quote count and more. With this unique dataset containing information such as ID, conversation ID associated with each tweet and other crucial media related metrics for each post by Anna Kendrick from 2015-2019. Researchers can explore valuable insights that will optimize her content in the future in ways never before thought possible. This comprehensive dataset opens up doors to discover just what drives success for Anna Kendrick on social media!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains tweets made by Anna Kendrick (annakendrick47) from 2015-2019, including detailed engagement metrics. It is an incredibly valuable resource for researchers looking to gain insights into how her tweets are engaging users and being perceived online. This guide outlines how to effectively make use of this data set in your research.

    The dataset contains several columns containing data and metrics associated with each tweet:

    • ID: A unique identifier for each tweet. Can be used to retrieve additional information linking it back to other conversations surrounding it.
    • Conversation ID: Another unique identifier that can be used to link multiple tweets together belonging to the same conversation thread.
    • Date published: The date at which the tweet was posted, in iso8601 format (yyyy-MM-ddTHH:mmZ).
    • Like Count: An integer representing how many people liked this particular tweet from 2015-2019 period.
    • Media Content Types : The type of media included within a tweeted content, like photos, GIF etc..
    • Outlinks : Links included within a tweeted content, like websites URLs etc mentioned in it .

      Quote Count : Integer showing how many times’ other users quoted its original message or mentioned alongside with another context as an example .

      Retweet Count : It counts all retweets annakendrik got since posted till now .

      Reply Count : Number of replies any given tweet receives during certain time period chosen by researcher

    Research Ideas

    • To create more effective Tweet content strategies: By analyzing the engagement metrics of Anna Kendrick’s tweets, researchers could determine which topics garnered the most engagement and use this information to optimize future content experiences.
    • To compare different platforms: Researchers could compare the metrics from her tweets on different platforms to see whether certain topics resonated better with one platform over another (i.e., Twitter vs Instagram).
    • To explore patterns within engagement metrics: By exploring patterns within specific Twitter conversations or individual tweet engagements, researchers can gain a deeper understanding of how people respond to her messages and media content, allowing them to optimize their communication style accordingly

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Twitter.

  16. Z

    Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Oct 26, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jikeli, Gunther; Karali, Sameer; Soemer, Katharina (2023). Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A Dataset for Machine Learning and Text Analytics [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8147307
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    Indiana University Bloomington
    Authors
    Jikeli, Gunther; Karali, Sameer; Soemer, Katharina
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset on bias against Asians, Blacks, Jews, Latines, and Muslims

    The ISCA project compiled this dataset using an annotation portal, which was used to label tweets as either biased or non-biased, among other labels. Note that the annotation was done on live data, including images and context, such as threads. The original data comes from annotationportal.com. They include representative samples of live tweets from the years 2020 and 2021 with the keywords "Asians, Blacks, Jews, Latinos, and Muslims". A random sample of 600 tweets per year was drawn for each of the keywords. This includes retweets. Due to a sampling error, the sample for the year 2021 for the keyword "Jews" has only 453 tweets from 2021 and 147 from the first eight months of 2022 and it includes some tweets from the query with the keyword "Israel." The tweets were divided into six samples of 100 tweets, which were then annotated by three to seven students in the class "Researching White Supremacism and Antisemitism on Social Media" taught by Gunther Jikeli, Elisha S. Breton, and Seth Moller at Indiana University in the fall of 2022, see this report. Annotators used a scale from 1 to 5 (confident not biased, probably not biased, don't know, probably biased, confident biased). The definitions of bias against each minority group used for annotation are also included in the report. If a tweet called out or denounced bias against the minority in question, it was labeled as "calling out bias." The labels of whether a tweet is biased or calls out bias are based on a 75% majority vote. We considered "probably biased" and "confident biased" as biased and "confident not biased," "probably not biased," and "don't know" as not biased.
    The types of stereotypes vary widely across the different categories of prejudice. While about a third of all biased tweets were classified as "hate" against the minority, the stereotypes in the tweets often matched common stereotypes about the minority. Asians were blamed for the Covid pandemic. Blacks were seen as inferior and associated with crime. Jews were seen as powerful and held collectively responsible for the actions of the State of Israel. Some tweets denied the Holocaust. Hispanics/Latines were portrayed as being in the country illegally and as "invaders," in addition to stereotypical accusations of being lazy, stupid, or having too many children. Muslims, on the other hand, were often collectively blamed for terrorism and violence, though often in conversations about Muslims in India.

    Content:

    This dataset contains 5880 tweets that cover a wide range of topics common in conversations about Asians, Blacks, Jews, Latines, and Muslims. 357 tweets (6.1 %) are labeled as biased and 5523 (93.9 %) are labeled as not biased. 1365 tweets (23.2 %) are labeled as calling out or denouncing bias. 1180 out of 5880 tweets (20.1 %) contain the keyword "Asians," 590 were posted in 2020 and 590 in 2021. 39 tweets (3.3 %) are biased against Asian people. 370 tweets (31,4 %) call out bias against Asians. 1160 out of 5880 tweets (19.7%) contain the keyword "Blacks," 578 were posted in 2020 and 582 in 2021. 101 tweets (8.7 %) are biased against Black people. 334 tweets (28.8 %) call out bias against Blacks. 1189 out of 5880 tweets (20.2 %) contain the keyword "Jews," 592 were posted in 2020, 451 in 2021, and ––as mentioned above––146 tweets from 2022. 83 tweets (7 %) are biased against Jewish people. 220 tweets (18.5 %) call out bias against Jews. 1169 out of 5880 tweets (19.9 %) contain the keyword "Latinos," 584 were posted in 2020 and 585 in 2021. 29 tweets (2.5 %) are biased against Latines. 181 tweets (15.5 %) call out bias against Latines. 1182 out of 5880 tweets (20.1 %) contain the keyword "Muslims," 593 were posted in 2020 and 589 in 2021. 105 tweets (8.9 %) are biased against Muslims. 260 tweets (22 %) call out bias against Muslims.

    File Description:

    The dataset is provided in a csv file format, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:
    'TweetID': Represents the tweet ID.
    'Username': Represents the username who published the tweet (if it is a retweet, it will be the user who retweetet the original tweet.
    'Text': Represents the full text of the tweet (not pre-processed). 'CreateDate': Represents the date the tweet was created.
    'Biased': Represents the labeled by our annotators if the tweet is biased (1) or not (0). 'Calling_Out': Represents the label by our annotators if the tweet is calling out bias against minority groups (1) or not (0). 'Keyword': Represents the keyword that was used in the query. The keyword can be in the text, including mentioned names, or the username.

    Licences

    Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)

    Acknowledgements

    We are grateful for the technical collaboration with Indiana University's Observatory on Social Media (OSoMe). We thank all class participants for the annotations and contributions, including Kate Baba, Eleni Ballis, Garrett Banuelos, Savannah Benjamin, Luke Bianco, Zoe Bogan, Elisha S. Breton, Aidan Calderaro, Anaye Caldron, Olivia Cozzi, Daj Crisler, Jenna Eidson, Ella Fanning, Victoria Ford, Jess Gruettner, Ronan Hancock, Isabel Hawes, Brennan Hensler, Kyra Horton, Maxwell Idczak, Sanjana Iyer, Jacob Joffe, Katie Johnson, Allison Jones, Kassidy Keltner, Sophia Knoll, Jillian Kolesky, Emily Lowrey, Rachael Morara, Benjamin Nadolne, Rachel Neglia, Seungmin Oh, Kirsten Pecsenye, Sophia Perkovich, Joey Philpott, Katelin Ray, Kaleb Samuels, Chloe Sherman, Rachel Weber, Molly Winkeljohn, Ally Wolfgang, Rowan Wolke, Michael Wong, Jane Woods, Kaleb Woodworth, and Aurora Young. This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

  17. Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A...

    • zenodo.org
    csv
    Updated Mar 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gunther Jikeli; Gunther Jikeli; Sameer Karali; Sameer Karali; Katharina Soemer; Katharina Soemer (2024). Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A Dataset for Machine Learning and Text Analytics [Dataset]. http://doi.org/10.5281/zenodo.10812805
    Explore at:
    csvAvailable download formats
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Gunther Jikeli; Gunther Jikeli; Sameer Karali; Sameer Karali; Katharina Soemer; Katharina Soemer
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset on bias against Asians, Blacks, Jews, Latines, and Muslims

    Description

    The dataset is a product of a research project at Indiana University on biased messages on Twitter against ethnic and religious minorities. We scraped all live messages with the keywords "Asians, Blacks, Jews, Latinos, and Muslims" from the Twitter archive in 2020, 2021, and 2022.

    Random samples of 600 tweets were created for each keyword and year, including retweets. The samples were annotated in subsamples of 100 tweets by undergraduate students in Professor Gunther Jikeli's class 'Researching White Supremacism and Antisemitism on Social Media' in the fall of 2022 and 2023. A total of 120 students participated in 2022. They annotated datasets from 2020 and 2021. 134 students participated in 2023. They annotated datasets from the years 2021 and 2022. The annotation was done using the Annotation Portal (Jikeli, Soemer and Karali, 2024). The updated version of our portal, AnnotHate, is now publicly available. Each subsample was annotated by an average of 5.65 students per sample in 2022 and 8.32 students per sample in 2023, with a range of three to ten and three to thirteen students, respectively. Annotation included questions about bias and calling out bias.

    Annotators used a scale from 1 to 5 on the bias scale (confident not biased, probably not biased, don't know, probably biased, confident biased), using definitions of bias against each ethnic or religious group that can be found in the research reports from 2022 and 2023. If the annotators interpreted a message as biased according to the definition, they were instructed to choose the specific stereotype from the definition that was most applicable. Tweets that denounced bias against a minority were labeled as "calling out bias".

    The label was determined by a 75% majority vote. We classified “probably biased” and “confident biased” as biased, and “confident not biased,” “probably not biased,” and “don't know” as not biased.

    The stereotypes about the different minorities varied. About a third of all biased tweets were classified as general 'hate' towards the minority. The nature of specific stereotypes varied by group. Asians were blamed for the Covid-19 pandemic, alongside positive but harmful stereotypes about their perceived excessive privilege. Black people were associated with criminal activity and were subjected to views that portrayed them as inferior. Jews were depicted as wielding undue power and were collectively held accountable for the actions of the Israeli government. In addition, some tweets denied the Holocaust. Hispanic people/Latines faced accusations of being undocumented immigrants and "invaders," along with persistent stereotypes of them as lazy, unintelligent, or having too many children. Muslims were often collectively blamed for acts of terrorism and violence, particularly in discussions about Muslims in India.

    The annotation results from both cohorts (Class of 2022 and Class of 2023) will not be merged. They can be identified by the "cohort" column. While both cohorts (Class of 2022 and Class of 2023) annotated the same data from 2021,* their annotation results differ. The class of 2022 identified more tweets as biased for the keywords "Asians, Latinos, and Muslims" than the class of 2023, but nearly all of the tweets identified by the class of 2023 were also identified as biased by the class of 2022. The percentage of biased tweets with the keyword 'Blacks' remained nearly the same.

    *Due to a sampling error for the keyword "Jews" in 2021, the data are not identical between the two cohorts. The 2022 cohort annotated two samples for the keyword Jews, one from 2020 and the other from 2021, while the 2023 cohort annotated samples from 2021 and 2022. The 2021 sample for the keyword "Jews" that the 2022 cohort annotated was not representative: it has only 453 tweets from 2021 and 147 from the first eight months of 2022, and it includes some tweets from the query with the keyword "Israel". The 2021 sample for the keyword "Jews" that the 2023 cohort annotated was drawn proportionally for each trimester of 2021.

    Content

    Cohort 2022

    This dataset contains 5880 tweets that cover a wide range of topics common in conversations about Asians, Blacks, Jews, Latines, and Muslims. 357 tweets (6.1 %) are labeled as biased and 5523 (93.9 %) are labeled as not biased. 1365 tweets (23.2 %) are labeled as calling out or denouncing bias.

    1180 out of 5880 tweets (20.1 %) contain the keyword "Asians," 590 were posted in 2020 and 590 in 2021. 39 tweets (3.3 %) are biased against Asian people. 370 tweets (31.4 %) call out bias against Asians.

    1160 out of 5880 tweets (19.7%) contain the keyword "Blacks," 578 were posted in 2020 and 582 in 2021. 101 tweets (8.7 %) are biased against Black people. 334 tweets (28.8 %) call out bias against Blacks.

    1189 out of 5880 tweets (20.2 %) contain the keyword "Jews," 592 were posted in 2020, 451 in 2021, and, as mentioned above, 146 in 2022. 83 tweets (7 %) are biased against Jewish people. 220 tweets (18.5 %) call out bias against Jews.

    1169 out of 5880 tweets (19.9 %) contain the keyword "Latinos," 584 were posted in 2020 and 585 in 2021. 29 tweets (2.5 %) are biased against Latines. 181 tweets (15.5 %) call out bias against Latines.

    1182 out of 5880 tweets (20.1 %) contain the keyword "Muslims," 593 were posted in 2020 and 589 in 2021. 105 tweets (8.9 %) are biased against Muslims. 260 tweets (22 %) call out bias against Muslims.

    Cohort 2023

    The dataset contains 5363 tweets with the keywords "Asians, Blacks, Jews, Latinos and Muslims" from 2021 and 2022. 261 tweets (4.9 %) are labeled as biased and 5102 tweets (95.1 %) as not biased. 975 tweets (18.1 %) are labeled as calling out or denouncing bias.

    1068 out of 5363 tweets (19.9 %) contain the keyword "Asians," 559 were posted in 2021 and 509 in 2022. 42 tweets (3.9 %) are biased against Asian people. 280 tweets (26.2 %) call out bias against Asians.

    1130 out of 5363 tweets (21.1 %) contain the keyword "Blacks," 586 were posted in 2021 and 544 in 2022. 76 tweets (6.7 %) are biased against Black people. 146 tweets (12.9 %) call out bias against Blacks.

    971 out of 5363 tweets (18.1 %) contain the keyword "Jews," 460 were posted in 2021 and 511 in 2022. 49 tweets (5 %) are biased against Jewish people. 201 tweets (20.7 %) call out bias against Jews.

    1072 out of 5363 tweets (19.9 %) contain the keyword "Latinos," 583 were posted in 2021 and 489 in 2022. 32 tweets (2.9 %) are biased against Latines. 108 tweets (10.1 %) call out bias against Latines.

    1122 out of 5363 tweets (20.9 %) contain the keyword "Muslims," 576 were posted in 2021 and 546 in 2022. 62 tweets (5.5 %) are biased against Muslims. 240 tweets (21.3 %) call out bias against Muslims.

    File Description

    The dataset is provided in CSV format, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns (a short loading sketch follows the column list):

    'TweetID': Represents the tweet ID.

    'Username': The username of the account that published the tweet (if it is a retweet, this is the user who retweeted the original tweet).

    'Text': Represents the full text of the tweet (not pre-processed).

    'CreateDate': Represents the date the tweet was created.

    'Biased': The label assigned by our annotators indicating whether the tweet is biased (1) or not (0).

    'Calling_Out': The label assigned by our annotators indicating whether the tweet calls out bias against minority groups (1) or not (0).

    'Keyword': Represents the keyword that was used in the query. The keyword can appear in the tweet text (including mentioned names) or in the username.

    'Cohort': Represents the cohort that annotated the data (Class of 2022 or Class of 2023).
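
    For orientation, a minimal loading sketch in Python using pandas. The file name "isca_bias_tweets.csv" is a placeholder, and the column names are taken from the list above; this is an illustration, not part of the dataset's documentation.

    ```python
    import pandas as pd

    # Placeholder file name; substitute the actual CSV file from this dataset.
    df = pd.read_csv("isca_bias_tweets.csv")

    # Share (in %) of biased and calling-out tweets per cohort and keyword.
    summary = (
        df.groupby(["Cohort", "Keyword"])[["Biased", "Calling_Out"]]
          .mean()
          .mul(100)
          .round(1)
    )
    print(summary)
    ```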

    Acknowledgements

    We are grateful for the technical collaboration with Indiana University's Observatory on Social Media (OSoMe). We thank all class participants for the annotations and contributions, including Kate Baba, Eleni Ballis, Garrett Banuelos, Savannah Benjamin, Luke Bianco, Zoe Bogan, Elisha S. Breton, Aidan Calderaro, Anaye Caldron, Olivia Cozzi, Daj Crisler, Jenna Eidson, Ella Fanning, Victoria Ford, Jess Gruettner, Ronan Hancock, Isabel Hawes, Brennan Hensler, Kyra Horton, Maxwell Idczak, Sanjana Iyer, Jacob Joffe, Katie Johnson, Allison Jones, Kassidy Keltner, Sophia Knoll, Jillian Kolesky, Emily Lowrey, Rachael Morara, Benjamin Nadolne, Rachel Neglia, Seungmin Oh, Kirsten Pecsenye, Sophia Perkovich, Joey Philpott, Katelin

  18. Academic information on Twitter: A user survey

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 1, 2023
    Cite
    Ehsan Mohammadi; Mike Thelwall; Mary Kwasny; Kristi L. Holmes (2023). Academic information on Twitter: A user survey [Dataset]. http://doi.org/10.1371/journal.pone.0197265
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ehsan Mohammadi; Mike Thelwall; Mary Kwasny; Kristi L. Holmes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although counts of tweets citing academic papers are used as an informal indicator of interest, little is known about who tweets academic papers and who uses Twitter to find scholarly information. Without knowing this, it is difficult to draw useful conclusions from a publication being frequently tweeted. This study surveyed 1,912 users that have tweeted journal articles to ask about their scholarly-related Twitter uses. Almost half of the respondents (45%) did not work in academia, despite the sample probably being biased towards academics. Twitter was used most by people with a social science or humanities background. People tend to leverage social ties on Twitter to find information rather than searching for relevant tweets. Twitter is used in academia to acquire and share real-time information and to develop connections with others. Motivations for using Twitter vary by discipline, occupation, and employment sector, but not much by gender. These factors also influence the sharing of different types of academic information. This study provides evidence that Twitter plays a significant role in the discovery of scholarly information and cross-disciplinary knowledge spreading. Most importantly, the large numbers of non-academic users support the claims of those using tweet counts as evidence for the non-academic impacts of scholarly research.

  19. iPhone 14 Tweets [July / Sept 2022 +144k English]

    • kaggle.com
    zip
    Updated Sep 8, 2022
    Cite
    Tleonel (2022). iPhone 14 Tweets [July / Sept 2022 +144k English] [Dataset]. https://www.kaggle.com/datasets/tleonel/iphone14-tweets
    Explore at:
    zip(16821184 bytes)Available download formats
    Dataset updated
    Sep 8, 2022
    Authors
    Tleonel
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    iPhone 14 📱 🐦 Tweets [11 July - Sept 9 2022 - 144k English] 📱 🐦

    Updated on Sept 9th. Includes tweets sent after launch.

    https://store.storeimages.cdn-apple.com/4668/as-images.apple.com/is/iphone-14-pro-finish-unselect-gallery-1-202209_GEO_EMEA?wid=5120&hei=2880&fmt=p-jpg&qlt=80&.v=1660754213188" alt="Photo by Apple">

    Trying to do something useful and add a dataset here on Kaggle: while there are 90+ datasets for Elon, there is none yet for tweets about the upcoming iPhone 14. I'm interested in seeing what Apple is up to this year, so I thought it could be interesting to dive into what people have been saying this month before the release, which Apple announced today will take place on September 7th.

    The dataset has 144k tweets created between July 11th and Sept 9th. Tweets are in English. As the new iPhone was just announced, I plan on updating the dataset to include newer examples and maybe a few older ones to increase the number of samples in the dataset, at least until the first week of launch.

    Columns Description

    • [x] date_time - Date and Time tweet was sent
    • [x] username - Username that sent the tweet
    • [x] user_location - Location entered in the account location info on Twitter
    • [x] user_description - Text added to "about" in account
    • [x] verified - If the user has the "verified by Twitter" blue tick
    • [x] followers_count - Number of Followers
    • [x] following_count - Number of accounts followed by the person who sent the tweet
    • [x] tweet_like_count - How many people liked the tweet
    • [x] tweet_retweet_count - How many people retweeted the tweet
    • [x] tweet_reply_count - How many people replied to that tweet
    • [x] source - Where the tweet was sent from. The link indicates whether it was sent from an iPhone, Android, or another client
    • [x] tweet_text - Text sent in the tweet

    Data and Utilization

    Data was scraped from Twitter and uploaded as is; no further data cleaning was performed, but the data from the tweets is in very good shape. I'd recommend separating date and time and converting the source from a link to the device name or website, depending on what you are interested in using the data for (see the sketch below).
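
    A possible preprocessing sketch in Python for the two suggestions above. The file name is a placeholder, the column names follow the list above, and the exact format of the "source" field (an HTML link) is an assumption.

    ```python
    import pandas as pd

    # Placeholder file name; substitute the actual CSV file from this dataset.
    df = pd.read_csv("iphone14_tweets.csv")

    # Split date_time into separate date and time columns.
    df["date_time"] = pd.to_datetime(df["date_time"])
    df["date"] = df["date_time"].dt.date
    df["time"] = df["date_time"].dt.time

    # If "source" is an HTML anchor such as '<a href="...">Twitter for iPhone</a>',
    # keep only the visible client name; otherwise fall back to the raw value.
    df["source_name"] = (
        df["source"].str.extract(r">([^<]+)<", expand=False).fillna(df["source"])
    )

    print(df[["date", "time", "source_name"]].head())
    ```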

    Usage suggestions - the data can be used for sentiment analysis, geographical distribution of tweets, trend analysis, spam vs. ham classification, and more.

  20. Unleashing Social Sentiments: A Twitter Analysis

    • kaggle.com
    zip
    Updated Feb 27, 2023
    Cite
    Joy Shil (2023). Unleashing Social Sentiments: A Twitter Analysis [Dataset]. https://www.kaggle.com/datasets/joyshil0599/unleashing-social-sentiments-a-twitter-analysis
    Explore at:
    zip(404155 bytes)Available download formats
    Dataset updated
    Feb 27, 2023
    Authors
    Joy Shil
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    "Unleashing Social Sentiments: A Twitter Analysis" appears to be a study or analysis that uses a Twitter dataset to explore the sentiment and opinions of Twitter users towards a particular topic or set of topics. Without more information about the study, it is difficult to provide a detailed analysis. However, based on the title and the use of a Twitter dataset, it is likely that the study involves the use of sentiment analysis techniques to analyze the opinions and sentiment expressed in the dataset. https://camo.githubusercontent.com/7bf6f8c804cf1ec62e2cbbc7c85ea7dfd65b4848df48be4218e24012c6eb3430/68747470733a2f2f692e6d6f72696f682e636f6d2f323032302f30322f30342f6265656633366664373037642e6a7067">

    The use of Twitter data for sentiment analysis has become increasingly popular in recent years due to the massive volume of data available and the ease with which opinions and sentiment can be expressed on the platform. By analyzing Twitter data, researchers can gain insights into public opinion and sentiment on a wide range of topics, from politics to consumer products to social issues.

    To conduct a Twitter analysis, researchers typically collect a dataset of tweets related to a particular topic or set of topics. This dataset may include features such as the Twitter username, the tweet content, the time and date of the tweet, and any associated metadata such as hashtags or mentions. The dataset can then be processed using NLP or sentiment analysis techniques to classify the sentiment expressed in each tweet as positive, negative, or neutral.
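
    As a concrete illustration, here is a minimal Python sketch using the pre-trained VADER model from NLTK to classify tweet text as positive, negative, or neutral. This is one possible approach for this kind of analysis, not necessarily the model used to produce the sentiment scores in this dataset.

    ```python
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    def label_sentiment(text):
        """Map VADER's compound score to a three-way sentiment label."""
        score = sia.polarity_scores(text)["compound"]
        # +/- 0.05 is the commonly used cutoff for the compound score.
        if score >= 0.05:
            return "positive"
        if score <= -0.05:
            return "negative"
        return "neutral"

    print(label_sentiment("What a brilliant goal by #Messi!"))    # likely "positive"
    print(label_sentiment("#DeleteFacebook, I'm done with it."))  # likely "negative"
    ```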

    The dataset contains tweets from the Twitter API that were scraped for seven hashtags:

    #Messi: This hashtag refers to the Argentine soccer superstar Lionel Messi, and is commonly used by fans and followers to discuss his performances, accomplishments, and news related to his career.

    #FIFAWorldCup: This hashtag is used during the FIFA World Cup, a quadrennial international soccer tournament. Tweets with this hashtag may discuss news, scores, or analysis related to the tournament.

    #DeleteFacebook: This hashtag is used by people who advocate for deleting or boycotting Facebook, often in response to controversies related to data privacy, political advertising, or other issues related to the social media giant.

    #MeToo: This hashtag is used in the context of the Me Too movement, a social movement against sexual harassment and assault, particularly in the workplace. Tweets with this hashtag may share personal stories, express support for the movement, or discuss related news and events.

    #BlackLivesMatter: This hashtag is used in the context of the Black Lives Matter movement, a movement against police brutality and systemic racism towards Black people. Tweets with this hashtag may express support for the movement, share news and updates, or discuss related issues.

    #NeverAgain: This hashtag is used in the context of the Never Again movement, which advocates for gun control and other measures to prevent school shootings and other acts of gun violence.

    #BarCamp: This hashtag refers to BarCamp, an international network of unconferences - participant-driven conferences that are open and free to attend. Tweets with this hashtag may discuss upcoming BarCamp events, share insights or learnings from past events, or express support for the BarCamp community.

    The sentiment score was generated using a pre-trained sentiment analysis model, and represents the overall sentiment of the tweet (positive, negative, or neutral).

    The data can be used to gain insights into how people are discussing and reacting to these topics on Twitter, and how the sentiment towards these hashtags may have evolved over time. Researchers and analysts can use this dataset for sentiment analysis, natural language processing, and machine learning applications.

    Some potential analyses that can be performed on the data include sentiment trend analysis over time, geographical distribution of sentiments, and topic modeling to identify themes and topics that emerge from the tweets.
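
    A sketch of the first of these analyses, sentiment trend over time, is shown below. The file name and the column names ("created_at", "hashtag", "sentiment") are placeholders for illustration and may differ from the actual dataset.

    ```python
    import pandas as pd

    # Placeholder file and column names; adjust to the actual dataset schema.
    df = pd.read_csv("social_sentiments.csv", parse_dates=["created_at"])

    # Share of positive tweets per hashtag, aggregated by month.
    trend = (
        df.assign(month=df["created_at"].dt.to_period("M"),
                  is_positive=df["sentiment"].eq("positive"))
          .groupby(["hashtag", "month"])["is_positive"]
          .mean()
          .unstack("month")
    )
    print(trend.round(2))
    ```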

    Overall, the dataset provides a rich resource for researchers and analysts interested in studying social and political issues on social media.
