12 datasets found
  1. Twitter Threads

    • kaggle.com
    zip
    Updated May 27, 2018
    Cite
    Daniel Grijalva (2018). Twitter Threads [Dataset]. https://www.kaggle.com/danielgrijalvas/twitter-threads
    Explore at:
    zip (709,787 bytes)
    Dataset updated
    May 27, 2018
    Authors
    Daniel Grijalva
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    When Twitter introduced its thread functionality, a debate emerged: "If you're gonna write a f*ck ton of tweets at once, why not write a blog post instead of cluttering my feed?"... "It's easier and user-friendlier to share ideas in a single app"...

    I'm not getting into that debate. Both blog posts and Twitter threads have their own advantages.

    But I noticed a phenomenon while reading threads on Twitter: the engagement—*retweets, likes and replies*—drops with each subsequent tweet!

    Now, this has some logical explanations. Like, people don't want to retweet or like every tweet in a thread, because that'd be annoying. But this trend kept appearing in every single thread I read.

    It was bugging me, so I had to gather some data.

    Content

    The dataset is divided into five parts:
    - five_ten.csv: data of threads 5-10 tweets long
    - ten_fifteen.csv: data of threads 10-15 tweets long
    - fifteen_twenty.csv: data of threads 15-20 tweets long
    - twenty_twentyfive.csv: data of threads 20-25 tweets long
    - twentyfive_thirty.csv: data of threads 25-30 tweets long

    They all contain the same data:
    - id: Tweet ID (maybe I should remove it to anonymize the data?)
    - thread_number: Thread identifier, used for grouping each thread and its tweets
    - timestamp: Creation date of each tweet
    - text: The content of each tweet
    - retweets: Retweet count for each tweet
    - likes: Like count for each tweet
    - replies: Reply count for each tweet

    Each "bin" contains around 100 threads... so in total there are ~500 threads.

    Acknowledgements

    The threads were manually gathered using Thread Reader (both the web page and the bot).

    Disclaimer

    The content of the threads/tweets had no influence on whether a thread was chosen. The only selection criterion was the length of the thread (5-30 tweets tops). The tweets collected date from October 2017 to May 2018.

    Inspiration

    Something I noticed while gathering the data was that political threads have steadier engagement than, say, art threads. So context might influence thread engagement, and it'd be interesting to do some NLP to figure that out.

    Also, it'd be cool to find a "formula" for better engagement in Twitter threads: how long should a thread be? Or maybe there's a probability of engagement based on the success of the initial tweet?

    Finally, this whole issue reminds me of the headline problem: most people don't go beyond the headline. Maybe Twitter threads suffer from that too.

  2. Daily Social Media Active Users

    • kaggle.com
    zip
    Updated May 5, 2025
    Cite
    Shaik Barood Mohammed Umar Adnaan Faiz (2025). Daily Social Media Active Users [Dataset]. https://www.kaggle.com/datasets/umeradnaan/daily-social-media-active-users
    Explore at:
    zip (126,814 bytes)
    Dataset updated
    May 5, 2025
    Authors
    Shaik Barood Mohammed Umar Adnaan Faiz
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description:

    The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.

    Dataset Breakdown:

    • Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.

    • Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.

    • Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's for social networking, messaging, multimedia sharing, professional networking, or more.

    • Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.

    • Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.

    • Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.

    • Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
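Records with the fields above can be simulated in a few lines. This is a hypothetical sketch of how such synthetic rows might be generated; the field names follow the breakdown above, but the value ranges and platform/owner pairs are illustrative assumptions, not the dataset's actual generator:

```python
import random
from datetime import date, timedelta

random.seed(0)  # reproducible sample

# Assumed platform/owner pairs (subset of the 14 platforms listed above)
PLATFORMS = {"Facebook": "Meta", "YouTube": "Google", "TikTok": "ByteDance"}

def synth_row():
    """One synthetic record with the fields described above (ranges assumed)."""
    platform = random.choice(list(PLATFORMS))
    return {
        "Platform": platform,
        "Owner": PLATFORMS[platform],
        "Primary Usage": random.choice(["Social Networking", "Messaging",
                                        "Multimedia Sharing"]),
        "Country": random.choice(["US", "IN", "BR", "DE"]),
        "Daily Time Spent (min)": random.randint(5, 300),
        "Verified Account": random.random() < 0.05,  # ~5% verified
        "Date Joined": date(2010, 1, 1) + timedelta(days=random.randint(0, 5000)),
    }

rows = [synth_row() for _ in range(1000)]
```

A generator like this is also a convenient way to stress-test dashboards or models before pointing them at the real 10,000-row file.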

    Context and Use Cases:

    • This synthetic dataset is designed to offer a privacy-friendly alternative for analytics, research, and machine learning purposes. Given the complexities and privacy concerns around using real user data, especially in the context of social media, this dataset offers a clean and secure way to develop, test, and fine-tune applications, models, and algorithms without the risks of handling sensitive or personal information.

    Researchers, data scientists, and developers can use this dataset to:

    • Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.

    • Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.

    • Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.

    • Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.

    • Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.

    • Sources of Inspiration: This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without violating any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.

    The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.

    Future Considerations:

    As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.

    By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...

  3. Twitter-Conversations-Sentiment-Dataset

    • huggingface.co
    Updated Sep 22, 2025
    Cite
    DataHive AI (2025). Twitter-Conversations-Sentiment-Dataset [Dataset]. https://huggingface.co/datasets/datahiveai/Twitter-Conversations-Sentiment-Dataset
    Explore at:
    Dataset updated
    Sep 22, 2025
    Dataset authored and provided by
    DataHive AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Twitter Sentiment Dataset

    Sample English-only tweet sentiment dataset. Each row represents a single tweet with anonymized text and conversation structure. This is a sample dataset. To access the full version or request any custom dataset tailored to your needs, contact DataHive at contact@datahive.ai.

      Files Included
    

    dataset.csv – tweets data

      What’s included
    

    - Anonymized tweet text
    - Conversation linkage via root_id and parent_id
    - 3-class sentiment label (positive… See the full description on the dataset page: https://huggingface.co/datasets/datahiveai/Twitter-Conversations-Sentiment-Dataset.
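Given the root_id/parent_id linkage, conversations can be reassembled into reply trees. A minimal sketch, assuming root_id points at a conversation's first tweet and parent_id is None for roots (column names follow the description above; the sample rows are invented):

```python
from collections import defaultdict

def build_threads(tweets):
    """Group tweets into conversations via root_id; map replies via parent_id."""
    conversations = defaultdict(list)  # root_id -> all tweet ids in the thread
    children = defaultdict(list)       # tweet id -> direct reply ids
    for t in tweets:
        conversations[t["root_id"]].append(t["id"])
        if t["parent_id"] is not None:
            children[t["parent_id"]].append(t["id"])
    return conversations, children

# Invented sample mimicking the described schema
sample = [
    {"id": "a", "root_id": "a", "parent_id": None, "sentiment": "neutral"},
    {"id": "b", "root_id": "a", "parent_id": "a", "sentiment": "positive"},
    {"id": "c", "root_id": "a", "parent_id": "b", "sentiment": "negative"},
]
convs, children = build_threads(sample)
```

With the trees in hand, sentiment can then be aggregated per conversation rather than per tweet.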

  4. Dataset for the Article "A Predictive Method to Improve the Effectiveness of...

    • data.niaid.nih.gov
    Updated May 24, 2021
    Cite
    Marco Furini; Federica Mandreoli; Riccardo Martoglia; Manuela Montangero (2021). Dataset for the Article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4782983
    Explore at:
    Dataset updated
    May 24, 2021
    Dataset provided by
    University of Modena and Reggio Emilia, Italy
    Authors
    Marco Furini; Federica Mandreoli; Riccardo Martoglia; Manuela Montangero
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario".

    Abstract:

    Museums are embracing social technologies in the attempt to broaden their audience and to engage people. Although social communication seems an easy task, media managers know how hard it is to reach millions of people with a simple message. Indeed, millions of posts are competing every day to get visibility in terms of likes and shares and very little research focused on museums communication to identify best practices. In this paper, we focus on Twitter and we propose a novel method that exploits interpretable machine learning techniques to: (a) predict whether a tweet will likely be appreciated by Twitter users or not; (b) present simple suggestions that will help enhancing the message and increasing the probability of its success. Using a real-world dataset of around 40,000 tweets written by 23 world famous museums, we show that our proposed method allows identifying tweet features that are more likely to influence the tweet success.

    Code to run a selection of experiments is available at https://github.com/rmartoglia/predict-twitter-ch

    Dataset structure

    The dataset contains the dataset used in the experiments of the above research paper. Only the extracted features for the museum tweet threads (and not the message full text) are provided and needed for the analyses.

    We selected 23 well-known art museums from around the world and grouped them into five groups: G1 (museums with at least three million followers); G2 (museums with more than one million followers); G3 (museums with more than 400,000 followers); G4 (museums with more than 200,000 followers); G5 (Italian museums). From these museums, we analyzed ca. 40,000 tweets, with the number varying from ca. 5k to ca. 11k per museum group, depending on the number of museums in each group.

    Content features: these are the features that can be drawn from the content of the tweet itself. We further divide such features into the following two categories:

    – Countable: these features have a value ranging into different intervals. We take into consideration: the number of hashtags (i.e., words preceded by #) in the tweet, the number of URLs (i.e., links to external resources), the number of images (e.g., photos and graphical emoticons), the number of mentions (i.e., twitter accounts preceded by @), the length of the tweet;

    – On-Off: these features have binary values in {0, 1}. We observe whether the tweet has exclamation marks, question marks, person names, place names, organization names, other names. Moreover, we also take into consideration the tweet topic density: assuming that the involved topics correspond to the hashtags mentioned in the text, we define a tweet as dense of topics if the number of hashtags it contains is greater than a given threshold, set to 5. Finally, we observe the tweet sentiment that might be present (positive or negative) or not (neutral).

    Context features: these features are not drawn from the content of the tweet itself and might give a larger picture of the context in which the tweet was sent. Namely, we take into consideration the part of the day in which the tweet was sent (morning, afternoon, evening and night, respectively from 5:00am to 11:59am, from 12:00pm to 5:59pm, from 6:00pm to 10:59pm, and from 11pm to 4:59am), and a boolean feature indicating whether the tweet is a retweet or not.

    User features: these features are proper to the user that sent the tweet, and are the same for all tweets of this user. Namely, we consider the name of the museum and the user's number of followers.
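The countable, on-off, and context features above can be sketched as a small extraction function. This is an illustrative approximation, not the authors' code; the named-entity and sentiment features require external tools and are omitted:

```python
import re
from datetime import datetime

def extract_features(text, sent_at, is_retweet):
    """Countable, on-off, and context features per the descriptions above."""
    hashtags = re.findall(r"#\w+", text)
    return {
        # Countable features
        "n_hashtags": len(hashtags),
        "n_urls": len(re.findall(r"https?://\S+", text)),
        "n_mentions": len(re.findall(r"@\w+", text)),
        "length": len(text),
        # On-off features
        "has_exclamation": int("!" in text),
        "has_question": int("?" in text),
        "topic_dense": int(len(hashtags) > 5),  # threshold of 5 from the paper
        # Context features: morning 5:00-11:59, afternoon 12:00-17:59,
        # evening 18:00-22:59, night 23:00-4:59
        "part_of_day": ("morning" if 5 <= sent_at.hour < 12 else
                        "afternoon" if 12 <= sent_at.hour < 18 else
                        "evening" if 18 <= sent_at.hour < 23 else "night"),
        "is_retweet": int(is_retweet),
    }

# Invented example tweet
f = extract_features("Visit our new #exhibit! https://museum.example @friend",
                     datetime(2020, 3, 1, 14, 30), False)
```

Note that the image count cannot be recovered from plain text alone; it would come from the tweet's media metadata.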

  5. Tanamongeau Tweets

    • kaggle.com
    zip
    Updated Dec 23, 2022
    Cite
    The Devastator (2022). Tanamongeau Tweets [Dataset]. https://www.kaggle.com/datasets/thedevastator/exploring-tanamongeau-twitter-engagement-pattern/code
    Explore at:
    zip (3,945,756 bytes)
    Dataset updated
    Dec 23, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Tanamongeau Tweets

    Detailed Insight into Tweet Composition and Engagement

    By Twitter [source]

    About this dataset

    This dataset provides an in-depth analysis of the Twitter activity and engagement patterns of Tanamongeau, one of the most popular influencers and content creators out there. The data for each tweet contains a wealth of information about the type of content being posted, as well as its associated engagement levels (like count, retweet count, quote count, reply count, etc.). Researchers can thus explore how different types of posts fare with regard to user engagement and how subject matter affects conversation trends. This dataset also allows a detailed analysis of the effects that various media elements can have on a user's followers. Included in this data are columns such as created_at (date created), media (images/videos), outlinks (URLs to external pages), quotedTweet (quoted text), and retweetedTweet (texts that were already present in tweets), plus ids and conversationIds, which provide researchers with invaluable insights into how Tanamongeau's followers interact with them through social media platforms.


    How to use the dataset

    To use this dataset effectively, it’s important to familiarize yourself with each column first. The columns contained in the index are:
    • Content: the text contents of the tweet
    • created_at: date/time when the tweet was sent
    • date: date posted
    • likeCount: number of likes for a particular post
    • Media: all media attached to a post, such as images or GIFs
    • Outlinks: URL links associated with the post
    • QuoteCount: number of quotes received on a single post/tweet from other users or accounts
    • Quoted Tweet ID: the ID of the original quoted tweet
    • ReplyCount: number of replies given in response to the original tweet
    • Retweet Count: number of retweet responses to the original post
    • RetweetedTweet: details related to posts retweeted by other users on the same user's profile

    Once you’ve familiarized yourself with each column, you can begin exploring different angles for analysis. For example: what kind of content is garnering the most engagement? What type of media is performing best? How many retweets does Tanamongeau receive on average? What types of conversations are driving increased engagement? Using this information from the dataset, along with your own observations gathered during research, will ultimately help you identify trends and patterns that provide valuable insight into Tanamongeau's engagement habits.
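For instance, the "what type of media is performing best?" question reduces to comparing average engagement with and without attached media. A minimal sketch using the column names from the index above (the sample rows are invented):

```python
def avg_engagement(tweets, key="likeCount"):
    """Average engagement for posts with vs. without attached media."""
    with_media = [t[key] for t in tweets if t["Media"]]
    without = [t[key] for t in tweets if not t["Media"]]
    return (sum(with_media) / len(with_media) if with_media else 0.0,
            sum(without) / len(without) if without else 0.0)

# Invented sample mimicking the described columns
sample = [
    {"likeCount": 5000, "Media": ["photo.jpg"]},
    {"likeCount": 7000, "Media": ["clip.mp4"]},
    {"likeCount": 1200, "Media": []},
]
media_avg, text_avg = avg_engagement(sample)
```

Swapping `key` for Retweet Count or ReplyCount answers the same question for other engagement metrics.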

    Research Ideas

    • Creating an Insights Dashboard: Analyzing the dataset can be used to create an insights dashboard that will help keep track of Tanamongeau’s Twitter performance over time and identify key trends in their engagement metrics such as likes, retweets, replies etc.
    • Developing Social Media Strategies: The data collected in this dataset can also be used to help inform and develop effective social media strategies based on their past tweeting behavior, content that did well, and levels of engagement with different types of posts.
    • Identifying Influencers/Partnerships: By examining Tanamongeau’s tweeted conversations or replying tweets, researchers can identify potential influencers or partnerships by identifying any shared connections with other Twitter users mentioned in the conversation threads

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.


  6. Antisemitism on Twitter: A Dataset for Machine Learning and Text Analytics

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Dec 13, 2024
    Cite
    Jikeli, Gunther; Karali, Sameer; Miehling, Daniel; Soemer, Katharina (2024). Antisemitism on Twitter: A Dataset for Machine Learning and Text Analytics [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7872834
    Explore at:
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Technical University Berlin
    Indiana University Bloomington
    Authors
    Jikeli, Gunther; Karali, Sameer; Miehling, Daniel; Soemer, Katharina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset from the Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University:

    The Social Media & Hate research lab at the Institute for the Study of Contemporary Antisemitism compiled this dataset using an annotation portal (Jikeli, Soemer, and Karali 2024), which was used to label tweets as either antisemitic or non-antisemitic, among other labels. Note that annotation was done on live data, including images and context, such as threads. All data was annotated by two experts, and all discrepancies were discussed (Jikeli et al. 2023).

    Content:

    This dataset contains 11311 tweets covering a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and April 2023. The dataset consists of random samples of relevant keywords during this time period. 1,953 tweets (17%) are antisemitic according to the IHRA definition of antisemitism.

    The distribution of tweets by year is as follows: 1499 (13%) from 2019, 3712 (33%) from 2020, 2591 (23%) from 2021, 2644 (23%) from 2022, and 865 (8%) from 2023. 6365 (56%) contain the keyword "Jews," 4134 (37%) include "Israel," 529 (5%) feature the derogatory term "ZioNazi*," and 283 (3%) use the slur "K---s." Some tweets may contain multiple keywords.

    725 out of the 6365 tweets with the keyword "Jews" (11%) and 664 out of the 4134 tweets with the keyword "Israel" (16%) were classified as antisemitic. 97 out of the 283 tweets using the antisemitic slur "K---s" (34%) are antisemitic. Interestingly, many tweets featuring the slur "K---s" actually call out its use. In contrast, the majority of tweets using the derogatory term "ZioNazi*" are antisemitic, with 467 out of 529 (88%) being classified as such.

    File Description:

    The dataset is provided in a csv file format, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:

    ‘ID’: Represents the tweet ID.

    ‘Username’: Represents the username that posted the tweet.

    ‘Text’: Represents the full text of the tweet (not pre-processed).

    ‘CreateDate’: Represents the date on which the tweet was created.

    ‘Biased’: Represents the label given by our annotators as to whether the tweet is antisemitic or not.

    ‘Keyword’: Represents the keyword that was used in the query. The keyword can be in the text, including hashtags, mentioned users, or the username itself.
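The per-keyword shares reported above can be recomputed directly from these columns. A minimal sketch, assuming the 'Biased' column is coded as 1/0 (the sample rows are invented):

```python
from collections import Counter

def biased_share_by_keyword(rows):
    """Fraction of tweets labeled antisemitic per query keyword.

    Rows carry the 'Keyword' and 'Biased' columns described above.
    """
    totals, biased = Counter(), Counter()
    for r in rows:
        totals[r["Keyword"]] += 1
        biased[r["Keyword"]] += int(r["Biased"])
    return {k: biased[k] / totals[k] for k in totals}

# Invented sample mimicking the described schema
sample = [
    {"Keyword": "Jews", "Biased": 0},
    {"Keyword": "Jews", "Biased": 1},
    {"Keyword": "Israel", "Biased": 0},
    {"Keyword": "Israel", "Biased": 0},
]
shares = biased_share_by_keyword(sample)
```

Run over the full file, this should reproduce figures such as the 11% share for "Jews" and 16% for "Israel" cited above.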

    Licences

    Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)

    Acknowledgements

    We are grateful for the support of Indiana University’s Observatory on Social Media (OSoMe) (Davis et al. 2016) and the contributions and annotations of all team members in our Social Media & Hate Research Lab at Indiana University’s Institute for the Study of Contemporary Antisemitism, especially Grace Bland, Elisha S. Breton, Kathryn Cooper, Robin Forstenhäusler, Sophie von Máriássy, Mabel Poindexter, Jenna Solomon, Clara Schilling, and Victor Tschiskale.

    This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

  7. 🧵@ Threads App Google Reviews

    • kaggle.com
    zip
    Updated Nov 20, 2023
    Cite
    BwandoWando (2023). 🧵@ Threads App Google Reviews [Dataset]. https://www.kaggle.com/datasets/bwandowando/threads-app-google-reviews
    Explore at:
    zip (5,711,575 bytes)
    Dataset updated
    Nov 20, 2023
    Authors
    BwandoWando
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context


    As per its Wikipedia page:

    Threads is an online social media and social networking service operated by Meta Platforms. The app offers users the ability to post and share text, images, and videos, as well as interact with other users' posts through replies, reposts, and likes. Closely linked to Meta platform Instagram and additionally requiring users to both have an Instagram account and use Threads under the same Instagram handle, the functionality of Threads is similar to X (formerly known as Twitter). The application is available on iOS and Android devices; the web version offers limited functionality and requires a mobile app install first. It is the fastest-growing consumer software application in history, gaining over 100 million users in its first five days, surpassing the record previously set by ChatGPT. Its early success was not sustained and the user base of the app plummeted more than 80% to 8 million daily active users by the end of July.

    These reviews were extracted from its Google Store page.

    Usage

    This dataset should paint a good picture of the public's perception of the app over time. Using this dataset, we can do the following:

    1. Extract sentiments and trends
    2. Identify which versions of the app had the most positive feedback, and which the worst.
    3. Use topic modeling to identify the pain points of the application.

    (AND MANY MORE!)
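Item 2 above can be sketched as a group-by over versions. Note that the column names `score` and `appVersion` are assumptions typical of Google Play review scrapes, not confirmed by this dataset card:

```python
from collections import defaultdict

def avg_score_by_version(reviews):
    """Average star rating per app version (hypothetical column names)."""
    buckets = defaultdict(list)
    for r in reviews:
        buckets[r["appVersion"]].append(r["score"])
    return {v: sum(s) / len(s) for v, s in buckets.items()}

# Invented sample rows
sample = [
    {"appVersion": "1.0", "score": 5},
    {"appVersion": "1.0", "score": 3},
    {"appVersion": "2.0", "score": 1},
]
ratings = avg_score_by_version(sample)
```

Sorting the resulting dict by value surfaces the best- and worst-received versions at a glance.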

    Images

    Images generated using Bing Image Generator

  8. mo-customer-support-tweets-945k

    • huggingface.co
    Updated Apr 19, 2024
    Cite
    Mohammad Othman (2024). mo-customer-support-tweets-945k [Dataset]. https://huggingface.co/datasets/MohammadOthman/mo-customer-support-tweets-945k
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 19, 2024
    Authors
    Mohammad Othman
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Customer Support on Twitter Dataset 945k

      Dataset Description
    
    
    
    
    
      Context
    

    This dataset provides a large corpus of real-world English conversations between consumers and customer support agents on Twitter, designed to drive innovation in Natural Language Processing (NLP) by providing data that better matches the actual language used in contemporary customer support interactions.

      Content
    

    Initially, the data included complex threads of conversations… See the full description on the dataset page: https://huggingface.co/datasets/MohammadOthman/mo-customer-support-tweets-945k.

  9. Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Oct 26, 2023
    Cite
    Jikeli, Gunther; Karali, Sameer; Soemer, Katharina (2023). Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A Dataset for Machine Learning and Text Analytics [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8147307
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    Indiana University Bloomington
    Authors
    Jikeli, Gunther; Karali, Sameer; Soemer, Katharina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset on bias against Asians, Blacks, Jews, Latines, and Muslims

    The ISCA project compiled this dataset using an annotation portal, which was used to label tweets as either biased or non-biased, among other labels. Note that the annotation was done on live data, including images and context, such as threads. The original data comes from annotationportal.com. They include representative samples of live tweets from the years 2020 and 2021 with the keywords "Asians, Blacks, Jews, Latinos, and Muslims". A random sample of 600 tweets per year was drawn for each of the keywords. This includes retweets. Due to a sampling error, the sample for the year 2021 for the keyword "Jews" has only 453 tweets from 2021 and 147 from the first eight months of 2022 and it includes some tweets from the query with the keyword "Israel." The tweets were divided into six samples of 100 tweets, which were then annotated by three to seven students in the class "Researching White Supremacism and Antisemitism on Social Media" taught by Gunther Jikeli, Elisha S. Breton, and Seth Moller at Indiana University in the fall of 2022, see this report. Annotators used a scale from 1 to 5 (confident not biased, probably not biased, don't know, probably biased, confident biased). The definitions of bias against each minority group used for annotation are also included in the report. If a tweet called out or denounced bias against the minority in question, it was labeled as "calling out bias." The labels of whether a tweet is biased or calls out bias are based on a 75% majority vote. We considered "probably biased" and "confident biased" as biased and "confident not biased," "probably not biased," and "don't know" as not biased.

    The types of stereotypes vary widely across the different categories of prejudice. While about a third of all biased tweets were classified as "hate" against the minority, the stereotypes in the tweets often matched common stereotypes about the minority. Asians were blamed for the Covid pandemic. Blacks were seen as inferior and associated with crime. Jews were seen as powerful and held collectively responsible for the actions of the State of Israel. Some tweets denied the Holocaust. Hispanics/Latines were portrayed as being in the country illegally and as "invaders," in addition to stereotypical accusations of being lazy, stupid, or having too many children. Muslims, on the other hand, were often collectively blamed for terrorism and violence, though often in conversations about Muslims in India.
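The 75% majority-vote rule described above can be sketched as follows, assuming votes on the 1-5 scale with 4 and 5 ("probably biased"/"confident biased") counted as biased and everything else as not biased:

```python
def majority_label(votes, threshold=0.75):
    """Aggregate per-annotator 1-5 votes into a binary biased label.

    Returns 1 (biased) when at least `threshold` of the votes are
    4 or 5, else 0 (not biased), per the rule described above.
    """
    biased_votes = sum(1 for v in votes if v >= 4)
    return int(biased_votes / len(votes) >= threshold)

# Invented examples: 3 of 4 annotators say biased -> labeled biased
print(majority_label([5, 4, 4, 1]))  # 1
print(majority_label([5, 1, 2, 3]))  # 0
```

The same aggregation applies to the "calling out bias" label with its own vote counts.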

    Content:

    This dataset contains 5880 tweets that cover a wide range of topics common in conversations about Asians, Blacks, Jews, Latines, and Muslims. 357 tweets (6.1%) are labeled as biased and 5523 (93.9%) as not biased. 1365 tweets (23.2%) are labeled as calling out or denouncing bias.

    "Asians": 1180 of 5880 tweets (20.1%); 590 posted in 2020 and 590 in 2021. 39 tweets (3.3%) are biased against Asian people; 370 (31.4%) call out bias against Asians.

    "Blacks": 1160 of 5880 tweets (19.7%); 578 posted in 2020 and 582 in 2021. 101 tweets (8.7%) are biased against Black people; 334 (28.8%) call out bias against Blacks.

    "Jews": 1189 of 5880 tweets (20.2%); 592 posted in 2020, 451 in 2021, and, as mentioned above, 146 in 2022. 83 tweets (7%) are biased against Jewish people; 220 (18.5%) call out bias against Jews.

    "Latinos": 1169 of 5880 tweets (19.9%); 584 posted in 2020 and 585 in 2021. 29 tweets (2.5%) are biased against Latines; 181 (15.5%) call out bias against Latines.

    "Muslims": 1182 of 5880 tweets (20.1%); 593 posted in 2020 and 589 in 2021. 105 tweets (8.9%) are biased against Muslims; 260 (22%) call out bias against Muslims.

    File Description:

    The dataset is provided as a CSV file, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:
    - 'TweetID': The tweet ID.
    - 'Username': The username that published the tweet (for a retweet, the user who retweeted the original tweet).
    - 'Text': The full text of the tweet (not pre-processed).
    - 'CreateDate': The date the tweet was created.
    - 'Biased': Label assigned by our annotators indicating whether the tweet is biased (1) or not (0).
    - 'Calling_Out': Label assigned by our annotators indicating whether the tweet calls out bias against minority groups (1) or not (0).
    - 'Keyword': The keyword that was used in the query. The keyword can appear in the text (including mentioned names) or in the username.
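    As a minimal sketch (assuming pandas, with a tiny inline stand-in for the released CSV; the real file should be loaded with pd.read_csv instead), the per-keyword bias and calling-out rates reported above could be recomputed like this:

```python
import pandas as pd

# Tiny hypothetical sample standing in for the released CSV;
# in practice, load the real file with pd.read_csv(...).
df = pd.DataFrame({
    "TweetID": [1, 2, 3, 4],
    "Keyword": ["Asians", "Asians", "Jews", "Jews"],
    "Biased": [0, 1, 0, 0],
    "Calling_Out": [1, 0, 0, 1],
})

# Share of biased and calling-out tweets per query keyword,
# mirroring the percentages reported in the description.
summary = df.groupby("Keyword").agg(
    tweets=("TweetID", "count"),
    biased_pct=("Biased", lambda s: 100 * s.mean()),
    calling_out_pct=("Calling_Out", lambda s: 100 * s.mean()),
)
```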

    Licences

    Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)

    Acknowledgements

    We are grateful for the technical collaboration with Indiana University's Observatory on Social Media (OSoMe). We thank all class participants for the annotations and contributions, including Kate Baba, Eleni Ballis, Garrett Banuelos, Savannah Benjamin, Luke Bianco, Zoe Bogan, Elisha S. Breton, Aidan Calderaro, Anaye Caldron, Olivia Cozzi, Daj Crisler, Jenna Eidson, Ella Fanning, Victoria Ford, Jess Gruettner, Ronan Hancock, Isabel Hawes, Brennan Hensler, Kyra Horton, Maxwell Idczak, Sanjana Iyer, Jacob Joffe, Katie Johnson, Allison Jones, Kassidy Keltner, Sophia Knoll, Jillian Kolesky, Emily Lowrey, Rachael Morara, Benjamin Nadolne, Rachel Neglia, Seungmin Oh, Kirsten Pecsenye, Sophia Perkovich, Joey Philpott, Katelin Ray, Kaleb Samuels, Chloe Sherman, Rachel Weber, Molly Winkeljohn, Ally Wolfgang, Rowan Wolke, Michael Wong, Jane Woods, Kaleb Woodworth, and Aurora Young. This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

  10. Oprah Winfrey's Tweets

    • kaggle.com
    zip
    Updated Dec 20, 2022
    Cite
    The Devastator (2022). Oprah Winfrey's Tweets [Dataset]. https://www.kaggle.com/thedevastator/oprah-winfrey-s-twitter-engagement-metrics
    Explore at:
    zip(1034241 bytes)Available download formats
    Dataset updated
    Dec 20, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Oprah Winfrey's Tweets

    Insights into Celebrity Social Media Influence

    By Twitter [source]

    About this dataset

    This dataset captures key engagement metrics from Oprah Winfrey's Twitter account, including likes, quote counts, replies, and retweets. These insights provide an in-depth look at how people engage with Oprah's tweets on a day-to-day basis. The media types, outlink URLs, and conversation identifiers included here support a better understanding of the broader conversation happening around her tweets. For researchers and academics, it may offer insight into how celebrities use social media to shape their audiences and influence the world. It might even help users learn more effective ways to make the most of every tweet. With rich data on likes, quote counts, replies, and retweets, this dataset offers valuable information for analyzing how successful each post is.


    Research Ideas

    • Conducting an analysis on the types of engagements that Oprah’s tweets receive, such as likes, quotes, replies and retweets. This could give us valuable insights into which topics or conversation starters get the most engagement and help to inform future posts.
    • Analyzing data on outlinks used in her tweets to gain insights into what her followers are interested in learning more about and the other conversational threads surrounding those topics.
    • Utilizing this dataset to track which conversations generate the most high-level engagements (likes/retweets from celebrities, influencers, etc.) so that marketing teams can target those accounts with relevant campaigns or tailored content for better results.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Twitter.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.


  11. Reddit r/Ukraine Dataset

    • kaggle.com
    zip
    Updated Jul 4, 2024
    Cite
    BwandoWando (2024). Reddit r/Ukraine Dataset [Dataset]. https://www.kaggle.com/bwandowando/ukrainesubredditthreadsandcomments
    Explore at:
    zip(525566951 bytes)Available download formats
    Dataset updated
    Jul 4, 2024
    Authors
    BwandoWando
    License

    https://www.reddit.com/wiki/api

    Area covered
    Ukraine
    Description

    Context

    The context and history of the ongoing conflict can be found at https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine.

    Content

    Contains threads, comments, and uploaded pics in the r/Ukraine subreddit. I will update this weekly.

    I also have a Twitter dataset of the Ukraine Conflict which can be found here https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows

    Update

    [Oct 03 2023] Updated with Sept 2023 threads, comments, deleted all pics as some have NSFW content.

    [Aug 01 2023] Updated with Jul 2023 threads, comments, and pics.

    [Jul 15 2023] Updated with Sep 2022 to Jun 2023 threads and comments (Oct 2022 is missing, though). I tried uploading images, but the dataset was flagged for violating some rules; because I am uploading thousands of scraped images from the threads, some may be NSFW. I will not upload images anymore.

    [Sep 06 2022] Updated with 52,888 threads and 1,763,089 comments.

    [Jun 15 2022] I uploaded new datasets, amounting to 33K unique threads, 1.1M unique comments, and 6k images.

    Acknowledgements

    Thank you to Anaconda, Jupyter, Python, Microsoft Azure, and Tweepy for the libraries, services, and programming tools.

    Cover image from this article

    Inspiration

    This dataset can be used in quite a number of ways.

    1. What is the public sentiment on Reddit about the ongoing conflict in Ukraine?
    2. Updates on the conflict, uploaded pics, screens, etc. (and many more)

    Personal Note

    No to war, please. I hope the conflict ends soon and further destruction and bloodshed are stopped.

  12. Clinically-relevant COVID-19 tweets authored by health-care professionals from January to June 2020

    • data.niaid.nih.gov
    Updated Mar 18, 2021
    Cite
    Julia Wu; Venkatesh Sivaraman; Dheekshita Kumar; Juan M. Banda; David Sontag (2021). Clinically-relevant COVID-19 tweets authored by health-care professionals from January to June 2020 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4610528
    Explore at:
    Dataset updated
    Mar 18, 2021
    Dataset provided by
    Carnegie Mellon University
    Massachusetts Institute of Technology
    Georgia State University
    Authors
    Julia Wu; Venkatesh Sivaraman; Dheekshita Kumar; Juan M. Banda; David Sontag
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The rapid evolution of the COVID-19 pandemic has underscored the need to quickly disseminate the latest clinical knowledge during a public-health emergency. One surprisingly effective platform for healthcare professionals (HCPs) to share knowledge and experiences from the front lines has been social media (for example, the "#medtwitter" community on Twitter). However, identifying clinically-relevant content in social media without manual labeling is a challenge because of the sheer volume of irrelevant data. This dataset attempts to automatically extract tweets authored by HCPs and then filter for clinically relevant content.

    The dataset is derived from a large set of English tweets related to COVID-19 (retweets and bots removed) from January to June 2020 (version 14). We utilize a regex-based filter on user names, screen names, and bios to identify likely HCPs, narrowing down from around 52 million tweets to around 1 million. We augment the dataset by including any additional tweets in threads for which at least one tweet is present in the dataset. This results in tweets_level_0.csv. Note that this set contains almost all self-declared HCPs but also includes some false positives; therefore, we develop an iterative relevance-filtering pipeline that uses topic modeling and MetaMap concept annotation to identify and enrich clinically-relevant content. Subsequent files represent the outputs of each iteration of filtering. Please see our preprint for more details about our filtering method.
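    The regex filter itself is described in the preprint rather than shipped with the data; a minimal illustrative sketch of the idea (the pattern below is a hypothetical stand-in, not the authors' actual expression) could look like:

```python
import re

# Illustrative pattern only; the paper's actual regular expression
# is not reproduced here.
HCP_PATTERN = re.compile(
    r"\b(physician|doctor|nurse|surgeon|medical student|epidemiologist)\b",
    re.IGNORECASE,
)

def looks_like_hcp(name: str, screen_name: str, bio: str) -> bool:
    # Flag a user as a likely health-care professional if any
    # profile field matches the pattern.
    return any(HCP_PATTERN.search(field) for field in (name, screen_name, bio))
```

    As the description notes, a filter like this catches most self-declared HCPs but also false positives (a bio such as "Doctor Who fan" would match), which is why the iterative relevance-filtering step follows.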

    Each CSV file includes the following fields: "id" (the tweet ID, accessible using the Twitter API), "thread_id" (a generated value that is shared by multiple tweets in the same thread), and "date" (the date that the tweet was posted). Due to Twitter policies, we cannot provide the contents of the tweets, and ask that you "hydrate" the tweets using a Twitter API tool such as twarc. Note that some tweets may have been deleted since the collection of our dataset and will no longer be available.
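    Since only tweet IDs and dates are released, analysis starts by extracting the IDs and hydrating them; a minimal sketch (using a tiny inline stand-in for the real CSV) might be:

```python
import csv
import io

# Tiny inline stand-in for tweets_level_0.csv; in practice, open the
# downloaded file with open("tweets_level_0.csv", newline="") instead.
sample = io.StringIO("id,thread_id,date\n123,t1,2020-03-01\n456,t1,2020-03-02\n")
ids = [row["id"] for row in csv.DictReader(sample)]

# One ID per line, ready to hydrate with a tool such as twarc's CLI
# (requires Twitter API credentials), e.g.: twarc2 hydrate ids.txt hydrated.jsonl
id_lines = "\n".join(ids)
```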


Cite
Daniel Grijalva (2018). Twitter Threads [Dataset]. https://www.kaggle.com/danielgrijalvas/twitter-threads

Twitter Threads

...or blog posts? Analyzing engagement in Twitter threads

Explore at:
zip(709787 bytes)Available download formats
Dataset updated
May 27, 2018
Authors
Daniel Grijalva
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

When Twitter introduced its thread functionality, a debate emerged: "If you're gonna write a f*ck ton of tweets at once, why not write a blog post instead of cluttering my feed?"... "It's easier and user-friendlier to share ideas in a single app"...

I'm not getting into that debate. Both blog posts and Twitter threads have their own advantages.

But I noticed a phenomenon while reading threads on Twitter: the engagement—*retweets, likes and replies*—drops with each subsequent tweet!

Now, this has some logical explanations. Like, people don't want to retweet or like every tweet in a thread, because that'd be annoying. But this trend kept appearing in every single thread I read.

It was bugging me, so I had to gather some data.

Content

The dataset is divided into five parts:
- five_ten.csv: data of threads 5-10 tweets long
- ten_fifteen.csv: data of threads 10-15 tweets long
- fifteen_twenty.csv: data of threads 15-20 tweets long
- twenty_twentyfive.csv: data of threads 20-25 tweets long
- twentyfive_thirty.csv: data of threads 25-30 tweets long

They all contain the same data:
- id: Tweet ID (maybe I should remove it to anonymize the data?)
- thread_number: Thread identifier, used for grouping each thread and its tweets
- timestamp: Creation date of each tweet
- text: The content of each tweet
- retweets: Retweet count for each tweet
- likes: Like count for each tweet
- replies: Reply count for each tweet

Each "bin" contains around 100 threads... so in total there are ~500 threads.
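A minimal sketch (assuming pandas, with a tiny inline stand-in for one of the CSVs; load the real file with pd.read_csv instead) of how the per-tweet engagement drop could be measured:

```python
import pandas as pd

# Tiny hypothetical stand-in for one of the bins, e.g. five_ten.csv.
df = pd.DataFrame({
    "thread_number": [1, 1, 1, 2, 2, 2],
    "timestamp": ["t1", "t2", "t3", "t1", "t2", "t3"],
    "retweets": [100, 40, 10, 80, 30, 5],
    "likes": [300, 120, 50, 200, 90, 20],
    "replies": [30, 10, 5, 25, 8, 2],
})

# Position of each tweet within its thread (0 = first tweet),
# assuming rows are already ordered by timestamp within a thread.
df["position"] = df.groupby("thread_number").cumcount()

# Average engagement by position across threads; a decreasing
# curve here is the drop described above.
decay = df.groupby("position")[["retweets", "likes", "replies"]].mean()
```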

Acknowledgements

The threads were manually gathered using Thread Reader (both the web page and the bot).

Disclaimer

The content of the threads/tweets had no influence on whether a thread was chosen; the only criterion was thread length (5-30 tweets). The tweets collected date from October 2017 to May 2018.

Inspiration

Something I noticed while gathering the data was that political threads have steadier engagement than, say, art threads. So context might influence thread engagement, and it'd be interesting to do some NLP to figure that out.

Also, it'd be cool to find a "formula" for better engagement in Twitter threads: how long should a thread be? Or maybe a probability of engagement based on the success of the initial tweet?

Finally, this whole issue reminds me of the headline problem: most people don't go beyond the headline. Maybe Twitter threads suffer from that too.
