License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
When Twitter introduced its thread functionality, a debate emerged: "If you're gonna write a f*ck ton of tweets at once, why not write a blog post instead of cluttering my feed?"... "It's easier and user-friendlier to share ideas in a single app"...
I'm not getting into that debate. Both blog posts and Twitter threads have their own advantages.
But I noticed a phenomenon while reading threads on Twitter: the engagement—*retweets, likes and replies*—drops with each subsequent tweet!
Now, this has some logical explanations. Like, people don't want to retweet or like every tweet in a thread, because that'd be annoying. But this trend kept appearing in every single thread I read.
It was bugging me, so I had to gather some data.
The dataset is divided into five parts:
- five_ten.csv: data of threads 5-10 tweets long
- ten_fifteen.csv: data of threads 10-15 tweets long
- fifteen_twenty.csv: data of threads 15-20 tweets long
- twenty_twentyfive.csv: data of threads 20-25 tweets long
- twentyfive_thirty.csv: data of threads 25-30 tweets long
They all contain the same data:
- id: Tweet ID (maybe I should remove it to anonymize the data?)
- thread_number: Thread identifier, used for grouping each thread and its tweets
- timestamp: Creation date of each tweet
- text: The content of each tweet
- retweets: Retweet count for each tweet
- likes: Like count for each tweet
- replies: Reply count for each tweet
Each "bin" contains around 100 threads... so in total there are ~500 threads.
The threads were manually gathered using Thread Reader (both the web page and the bot).
The content of the threads/tweets had no influence on whether a thread was chosen; the only selection criterion was thread length (5-30 tweets tops). The tweets collected date from October 2017 to May 2018.
One thing I noticed while gathering the data is that political threads have steadier engagement than, say, art threads. So context might influence thread engagement, and it'd be interesting to do some NLP to figure that out.
Also, it'd be cool to find a "formula" for better engagement in Twitter threads: how long should a thread be? Could we estimate the probability of engagement based on the success of the initial tweet?
Finally, this whole issue reminds me of the headline problem: most people don't go beyond the headline. Maybe Twitter threads suffer from that too.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Description:
The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.
Dataset Breakdown:
Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.
Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.
Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's for social networking, messaging, multimedia sharing, professional networking, or more.
Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.
Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.
Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.
Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
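As a minimal sketch (the filename is hypothetical, and the column names follow the breakdown above but may differ in the actual export), per-platform engagement could be summarized like this:

```python
import pandas as pd

# Hypothetical filename; adjust to the actual export.
df = pd.read_csv("daily_social_media_users.csv")

# Average daily time spent per platform, highest first.
time_by_platform = (df.groupby("Platform")["Daily Time Spent (min)"]
                      .mean()
                      .sort_values(ascending=False))
print(time_by_platform)

# Share of verified accounts per platform.
# (Assumes "Verified Account" is boolean/0-1; map "Yes"/"No" first if needed.)
print(df.groupby("Platform")["Verified Account"].mean())
```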
Context and Use Cases:
Researchers, data scientists, and developers can use this dataset to:
Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.
Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.
Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.
Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.
Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.
Sources of Inspiration:
This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without raising any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.
The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.
Future Considerations:
As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.
By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Twitter Sentiment Dataset
Sample English-only tweet sentiment dataset. Each row represents a single tweet with anonymized text and conversation structure. This is a sample dataset. To access the full version or request any custom dataset tailored to your needs, contact DataHive at contact@datahive.ai.
Files Included
dataset.csv – tweets data
What’s included
- Anonymized tweet text
- Conversation linkage via root_id and parent_id
- 3-class sentiment label (positive… See the full description on the dataset page: https://huggingface.co/datasets/datahiveai/Twitter-Conversations-Sentiment-Dataset.
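A minimal sketch for loading the sample and grouping replies back into conversations (assuming the Hugging Face datasets library, a "train" split, and a sentiment column name, which is an assumption; only root_id and parent_id are confirmed above):

```python
from collections import defaultdict

from datasets import load_dataset

# Load the sample dataset from the Hugging Face Hub.
ds = load_dataset("datahiveai/Twitter-Conversations-Sentiment-Dataset",
                  split="train")

# Group tweets by conversation root to reconstruct thread structure.
conversations = defaultdict(list)
for row in ds:
    conversations[row["root_id"]].append(row)

# Example: sentiment labels along one conversation (column name assumed).
some_root = next(iter(conversations))
print([t["sentiment"] for t in conversations[some_root]])
```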
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario".
Abstract:
Museums are embracing social technologies in the attempt to broaden their audience and to engage people. Although social communication seems an easy task, media managers know how hard it is to reach millions of people with a simple message. Indeed, millions of posts are competing every day to get visibility in terms of likes and shares and very little research focused on museums communication to identify best practices. In this paper, we focus on Twitter and we propose a novel method that exploits interpretable machine learning techniques to: (a) predict whether a tweet will likely be appreciated by Twitter users or not; (b) present simple suggestions that will help enhancing the message and increasing the probability of its success. Using a real-world dataset of around 40,000 tweets written by 23 world famous museums, we show that our proposed method allows identifying tweet features that are more likely to influence the tweet success.
Code to run a selection of experiments is available at https://github.com/rmartoglia/predict-twitter-ch
Dataset structure
This record contains the dataset used in the experiments of the above research paper. Only the extracted features for the museum tweet threads are provided (not the full message text); these are all that is needed for the analyses.
We selected 23 well-known art museums from around the world and grouped them into five groups: G1 (museums with at least three million followers); G2 (more than one million followers); G3 (more than 400,000 followers); G4 (more than 200,000 followers); G5 (Italian museums). From these museums, we analyzed ca. 40,000 tweets, ranging from ca. 5k to ca. 11k tweets per museum group, depending on the number of museums in each group.
Content features: these are the features that can be drawn from the content of the tweet itself. We further divide these features into the following two categories:
– Countable: these features take values in different intervals. We take into consideration: the number of hashtags (i.e., words preceded by #) in the tweet, the number of URLs (i.e., links to external resources), the number of images (e.g., photos and graphical emoticons), the number of mentions (i.e., Twitter accounts preceded by @), and the length of the tweet;
– On-Off : these features have binary values in {0, 1}. We observe whether the tweet has exclamation marks, question marks, person names, place names, organization names, other names. Moreover, we also take into consideration the tweet topic density: assuming that the involved topics correspond to the hashtags mentioned in the text, we define a tweet as dense of topics if the number of hashtags it contains is greater than a given threshold, set to 5. Finally, we observe the tweet sentiment that might be present (positive or negative) or not (neutral).
Context features: these features are not drawn from the content of the tweet itself and can give a larger picture of the context in which the tweet was sent. Namely, we take into consideration the part of the day in which the tweet was sent (morning, afternoon, evening, and night, respectively from 5:00am to 11:59am, from 12:00pm to 5:59pm, from 6:00pm to 10:59pm, and from 11pm to 4:59am), and a boolean feature indicating whether the tweet is a retweet or not.
User features: these features are specific to the user that sent the tweet and are the same for all tweets by that user. Namely, we consider the name of the museum and the number of followers of the user.
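To make the content features concrete, here is a minimal sketch (a hypothetical helper, not the authors' code) of how the countable and text-based on-off features could be computed; image counts and named-entity flags would additionally require media metadata and an NER tool, so they are omitted:

```python
import re

def content_features(text: str, topic_threshold: int = 5) -> dict:
    """Compute countable and on-off content features for one tweet."""
    hashtags = re.findall(r"#\w+", text)
    mentions = re.findall(r"@\w+", text)
    urls = re.findall(r"https?://\S+", text)
    return {
        # Countable features
        "n_hashtags": len(hashtags),
        "n_mentions": len(mentions),
        "n_urls": len(urls),
        "length": len(text),
        # On-off features (binary values in {0, 1})
        "has_exclamation": int("!" in text),
        "has_question": int("?" in text),
        # "Dense of topics": more hashtags than the threshold (5 in the paper)
        "topic_dense": int(len(hashtags) > topic_threshold),
    }

print(content_features("Visit the #museum today! https://example.org @louvre"))
```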
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Twitter [source]
This dataset provides an in-depth analysis of the Twitter activity and engagement patterns of Tanamongeau, one of the most popular influencers and content creators out there. The data for each tweet contains a wealth of information about the type of content posted, as well as its associated engagement levels (like count, retweet count, quote count, reply count, etc.). Researchers can thus explore how different types of posts fare in terms of user engagement and how subject matter affects conversation trends. This dataset also allows a detailed analysis of the effects that various media elements can have on a user's followers. Included within this data are columns such as created_at (date created), media (images/videos), outlinks (URLs to external pages), quotedTweet (quoted text), retweetedTweet (text already present in tweets), and the id and conversationId fields, which provide researchers with invaluable insights into how Tanamongeau's followers interact with them on social media platforms.
To use this dataset effectively, it’s important to familiarize yourself with each column first. The columns contained in the index are: content (the text contents of the tweet), created_at (date/time when the tweet was sent), date (date posted), likeCount (number of likes for a particular post), media (media attached to a post, such as images or GIFs), outlinks (URL links associated with the post), quoteCount (number of quotes received on a single post/tweet from other users or accounts), quotedTweet (the ID of the original quoted tweet), replyCount (number of replies given in response to the original tweet), retweetCount (number of retweet responses to the original post), and retweetedTweet (details of posts retweeted by other users on the same user's profile).
Once you’ve familiarized yourself with each column, you can begin exploring different angles for analysis. For example: what kind of content garners the most engagement? What type of media performs best? How many retweets does Tanamongeau receive on average? What types of conversations drive increased engagement? Combining this information with your own observations gathered during research will ultimately surface trends and patterns that provide valuable insight into Tanamongeau's engagement habits.
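As a starting point, here is a minimal sketch (the filename is hypothetical, and the column names follow the description above but may differ in the actual file) comparing engagement for posts with and without attached media:

```python
import pandas as pd

# Hypothetical filename; adjust to the actual export.
df = pd.read_csv("tanamongeau_tweets.csv")

# Flag tweets that carry any attached media (column is empty/NaN otherwise).
df["has_media"] = df["media"].notna()

# Compare average engagement for tweets with vs. without media.
print(df.groupby("has_media")[["likeCount", "replyCount", "retweetCount"]].mean())
```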
- Creating an Insights Dashboard: Analyzing the dataset can be used to create an insights dashboard that will help keep track of Tanamongeau’s Twitter performance over time and identify key trends in their engagement metrics such as likes, retweets, replies etc.
- Developing Social Media Strategies: The data collected in this dataset can also be used to help inform and develop effective social media strategies based on their past tweeting behavior, content that did well, and levels of engagement with different types of posts.
- Identifying Influencers/Partnerships: By examining Tanamongeau’s tweeted conversations or replying tweets, researchers can identify potential influencers or partnerships through any shared connections with other Twitter users mentioned in the conversation threads.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
If you use this dataset in your research, please credit the original authors and Twitter as the data source.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset from the Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University:
The Social Media & Hate research lab at the Institute for the Study of Contemporary Antisemitism compiled this dataset using an annotation portal (Jikeli, Soemer, and Karali 2024), which was used to label tweets as either antisemitic or non-antisemitic, among other labels. Note that annotation was done on live data, including images and context, such as threads. All data was annotated by two experts, and all discrepancies were discussed (Jikeli et al. 2023).
Content:
This dataset contains 11,311 tweets covering a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and April 2023. It consists of random samples of tweets matching relevant keywords during this time period. 1,953 tweets (17%) are antisemitic according to the IHRA definition of antisemitism.
The distribution of tweets by year is as follows: 1,499 (13%) from 2019; 3,712 (33%) from 2020; 2,591 (23%) from 2021; 2,644 (23%) from 2022; and 865 (8%) from 2023. 6,365 tweets (56%) contain the keyword "Jews," 4,134 (37%) include "Israel," 529 (5%) feature the derogatory term "ZioNazi*," and 283 (3%) use the slur "K---s." Some tweets may contain multiple keywords.
725 of the 6,365 tweets with the keyword "Jews" (11%) and 664 of the 4,134 tweets with the keyword "Israel" (16%) were classified as antisemitic. 97 of the 283 tweets using the antisemitic slur "K---s" (34%) are antisemitic. Interestingly, many tweets featuring the slur "K---s" actually call out its use. In contrast, the majority of tweets using the derogatory term "ZioNazi*" are antisemitic, with 467 of 529 (88%) classified as such.
File Description:
The dataset is provided in a csv file format, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:
‘ID’: Represents the tweet ID.
‘Username’: Represents the username that posted the tweet.
‘Text’: Represents the full text of the tweet (not pre-processed).
‘CreateDate’: Represents the date on which the tweet was created.
‘Biased’: Represents the label given by our annotators as to whether the tweet is antisemitic or not.
‘Keyword’: Represents the keyword that was used in the query. The keyword can be in the text, including hashtags, mentioned users, or the username itself.
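As a quick sanity check of the reported percentages, here is a minimal sketch (the filename is hypothetical, and it assumes ‘Biased’ is coded 0/1; if it is categorical text, map it to 0/1 first):

```python
import pandas as pd

# Hypothetical filename; adjust to the published CSV.
df = pd.read_csv("isca_antisemitism_tweets.csv")

# Share of tweets labeled antisemitic, overall and per query keyword.
print(f"overall share: {df['Biased'].mean():.1%}")
print(df.groupby("Keyword")["Biased"].mean().sort_values(ascending=False))
```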
Licences
Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)
Acknowledgements
We are grateful for the support of Indiana University’s Observatory on Social Media (OSoMe) (Davis et al. 2016) and the contributions and annotations of all team members in our Social Media & Hate Research Lab at Indiana University’s Institute for the Study of Contemporary Antisemitism, especially Grace Bland, Elisha S. Breton, Kathryn Cooper, Robin Forstenhäusler, Sophie von Máriássy, Mabel Poindexter, Jenna Solomon, Clara Schilling, and Victor Tschiskale.
This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
As per the Wikipedia page:
Threads is an online social media and social networking service operated by Meta Platforms. The app offers users the ability to post and share text, images, and videos, as well as interact with other users' posts through replies, reposts, and likes. Closely linked to Meta platform Instagram and additionally requiring users to both have an Instagram account and use Threads under the same Instagram handle, the functionality of Threads is similar to X (formerly known as Twitter). The application is available on iOS and Android devices; the web version offers limited functionality and requires a mobile app install first. It is the fastest-growing consumer software application in history, gaining over 100 million users in its first five days, surpassing the record previously set by ChatGPT. Its early success was not sustained and the user base of the app plummeted more than 80% to 8 million daily active users by the end of July.
These reviews were extracted from its Google Store page.
This dataset should paint a good picture of the public's perception of the app over the years, and it lends itself to many kinds of analysis.
Images generated using Bing Image Generator
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Customer Support on Twitter Dataset 945k
Dataset Description
Context
This dataset provides a large corpus of real-world English conversations between consumers and customer support agents on Twitter, designed to drive innovation in Natural Language Processing (NLP) by providing data that better matches the actual language used in contemporary customer support interactions.
Content
Initially, the data included complex threads of conversations… See the full description on the dataset page: https://huggingface.co/datasets/MohammadOthman/mo-customer-support-tweets-945k.
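A minimal loading sketch (assuming the Hugging Face datasets library; the "train" split name is an assumption):

```python
from datasets import load_dataset

# Load the customer-support corpus from the Hugging Face Hub.
ds = load_dataset("MohammadOthman/mo-customer-support-tweets-945k",
                  split="train")

print(ds)     # column names and row count
print(ds[0])  # inspect one conversation record
```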
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ISCA project compiled this dataset using an annotation portal, which was used to label tweets as either biased or non-biased, among other labels. Note that the annotation was done on live data, including images and context, such as threads. The original data comes from annotationportal.com. They include representative samples of live tweets from the years 2020 and 2021 with the keywords "Asians, Blacks, Jews, Latinos, and Muslims".
A random sample of 600 tweets per year was drawn for each of the keywords. This includes retweets. Due to a sampling error, the sample for the year 2021 for the keyword "Jews" has only 453 tweets from 2021 and 147 from the first eight months of 2022 and it includes some tweets from the query with the keyword "Israel." The tweets were divided into six samples of 100 tweets, which were then annotated by three to seven students in the class "Researching White Supremacism and Antisemitism on Social Media" taught by Gunther Jikeli, Elisha S. Breton, and Seth Moller at Indiana University in the fall of 2022, see this report. Annotators used a scale from 1 to 5 (confident not biased, probably not biased, don't know, probably biased, confident biased). The definitions of bias against each minority group used for annotation are also included in the report.
If a tweet called out or denounced bias against the minority in question, it was labeled as "calling out bias."
The labels of whether a tweet is biased or calls out bias are based on a 75% majority vote. We considered "probably biased" and "confident biased" as biased and "confident not biased," "probably not biased," and "don't know" as not biased.
The types of stereotypes vary widely across the different categories of prejudice. While about a third of all biased tweets were classified as "hate" against the minority, the stereotypes in the tweets often matched common stereotypes about the minority. Asians were blamed for the Covid pandemic. Blacks were seen as inferior and associated with crime. Jews were seen as powerful and held collectively responsible for the actions of the State of Israel. Some tweets denied the Holocaust. Hispanics/Latines were portrayed as being in the country illegally and as "invaders," in addition to stereotypical accusations of being lazy, stupid, or having too many children. Muslims, on the other hand, were often collectively blamed for terrorism and violence, though often in conversations about Muslims in India.
This dataset contains 5,880 tweets that cover a wide range of topics common in conversations about Asians, Blacks, Jews, Latines, and Muslims. 357 tweets (6.1%) are labeled as biased and 5,523 (93.9%) as not biased; 1,365 tweets (23.2%) are labeled as calling out or denouncing bias. By keyword:
- "Asians": 1,180 of 5,880 tweets (20.1%); 590 posted in 2020 and 590 in 2021. 39 tweets (3.3%) are biased against Asian people; 370 (31.4%) call out bias against Asians.
- "Blacks": 1,160 tweets (19.7%); 578 posted in 2020 and 582 in 2021. 101 tweets (8.7%) are biased against Black people; 334 (28.8%) call out bias against Blacks.
- "Jews": 1,189 tweets (20.2%); 592 posted in 2020, 451 in 2021, and, as mentioned above, 146 in 2022. 83 tweets (7%) are biased against Jewish people; 220 (18.5%) call out bias against Jews.
- "Latinos": 1,169 tweets (19.9%); 584 posted in 2020 and 585 in 2021. 29 tweets (2.5%) are biased against Latines; 181 (15.5%) call out bias against Latines.
- "Muslims": 1,182 tweets (20.1%); 593 posted in 2020 and 589 in 2021. 105 tweets (8.9%) are biased against Muslims; 260 (22%) call out bias against Muslims.
The dataset is provided in a csv file format, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:
'TweetID': Represents the tweet ID.
'Username': Represents the username that published the tweet (if it is a retweet, it is the user who retweeted the original tweet).
'Text': Represents the full text of the tweet (not pre-processed).
'CreateDate': Represents the date the tweet was created.
'Biased': Represents the label given by our annotators as to whether the tweet is biased (1) or not (0).
'Calling_Out': Represents the label given by our annotators as to whether the tweet calls out bias against minority groups (1) or not (0).
'Keyword': Represents the keyword that was used in the query. The keyword can be in the text, including mentioned names, or the username.
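A minimal sketch (the filename is hypothetical; the 0/1 codings follow the column descriptions above) reproducing the per-keyword bias and calling-out rates:

```python
import pandas as pd

# Hypothetical filename; adjust to the published CSV.
df = pd.read_csv("isca_bias_tweets.csv")

# Per-keyword shares of biased tweets and of tweets calling out bias,
# which should match the percentages reported above.
summary = df.groupby("Keyword")[["Biased", "Calling_Out"]].mean()
print((summary * 100).round(1))
```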
Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)
We are grateful for the technical collaboration with Indiana University's Observatory on Social Media (OSoMe). We thank all class participants for the annotations and contributions, including Kate Baba, Eleni Ballis, Garrett Banuelos, Savannah Benjamin, Luke Bianco, Zoe Bogan, Elisha S. Breton, Aidan Calderaro, Anaye Caldron, Olivia Cozzi, Daj Crisler, Jenna Eidson, Ella Fanning, Victoria Ford, Jess Gruettner, Ronan Hancock, Isabel Hawes, Brennan Hensler, Kyra Horton, Maxwell Idczak, Sanjana Iyer, Jacob Joffe, Katie Johnson, Allison Jones, Kassidy Keltner, Sophia Knoll, Jillian Kolesky, Emily Lowrey, Rachael Morara, Benjamin Nadolne, Rachel Neglia, Seungmin Oh, Kirsten Pecsenye, Sophia Perkovich, Joey Philpott, Katelin Ray, Kaleb Samuels, Chloe Sherman, Rachel Weber, Molly Winkeljohn, Ally Wolfgang, Rowan Wolke, Michael Wong, Jane Woods, Kaleb Woodworth, and Aurora Young. This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Twitter [source]
This dataset captures key engagement metrics from the Twitter account of Oprah Winfrey, including likes, quote counts, replies, and retweets. These insights provide an in-depth look at how people engage with Oprah’s tweets on a day-to-day basis. The media types, outlink URLs, and conversation identifiers included here give a better picture of the broader conversation happening around her tweets. For researchers and academics, it may offer insight into how celebrities use social media to shape their audience and impact the world; it might even help users learn more effective ways to make the most of every tweet. With rich data on likes, quote counts, replies, and retweets, this dataset captures user interactions and offers valuable information for analyzing how successful each post is.
- Conducting an analysis on the types of engagements that Oprah’s tweets receive, such as likes, quotes, replies and retweets. This could give us valuable insights into which topics or conversation starters get the most engagement and help to inform future posts.
- Analyzing data on outlinks used in her tweets to gain insights into what her followers are interested in learning more about and the other conversational threads surrounding those topics.
- Utilizing this dataset to track which conversations generate the most high-level engagements (likes/retweets from celebrities, influencers, etc.) so that marketing teams can target those accounts with relevant campaigns or tailored content for better results.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
If you use this dataset in your research, please credit the original authors and Twitter as the data source.
Terms: Reddit API, https://www.reddit.com/wiki/api
The context and history of the ongoing conflict can be found at https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine.
Contains threads, comments, and uploaded pics in the r/Ukraine subreddit. I will update this weekly.
I also have a Twitter dataset of the Ukraine Conflict which can be found here https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows
[Oct 03 2023] Updated with Sept 2023 threads, comments, deleted all pics as some have NSFW content.
[Aug 01 2023] Updated with Jul 2023 threads, comments, and pics.
[Jul 15 2023] Updated with Sep 2022 to Jun 2023 threads and comments (Oct 2022 is missing, though). I tried uploading images, but the dataset was flagged for violating some rules; because I was uploading thousands of images scraped from the threads, some may have been NSFW. I will not upload images anymore.
[Sep 06 2022] Updated with 52,888 threads and 1,763,089 comments.
[Jun 15 2022] I uploaded new datasets, amounting to 33K unique threads, 1.1M unique comments, and 6k images.
Thank you to Anaconda, Jupyter, Python, Microsoft Azure, and Tweepy for the libraries, services, and programming tools.
Cover image from this article
This dataset can be used in quite a number of ways.
No to war, please. I hope the conflict ends soon and further destruction and bloodshed are stopped
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rapid evolution of the COVID-19 pandemic has underscored the need to quickly disseminate the latest clinical knowledge during a public-health emergency. One surprisingly effective platform for healthcare professionals (HCPs) to share knowledge and experiences from the front lines has been social media (for example, the "#medtwitter" community on Twitter). However, identifying clinically relevant content in social media without manual labeling is a challenge because of the sheer volume of irrelevant data. This dataset automatically extracts tweets authored by HCPs and then filters them for clinically relevant content.
The dataset is derived from a large set of English tweets related to COVID-19 (retweets and bots removed) from January to June 2020 (version 14). We utilize a regex-based filter on user names, screen names, and bios to identify likely HCPs, narrowing down from around 52 million tweets to around 1 million. We augment the dataset by including any additional tweets in threads for which at least one tweet is already present. This results in tweets_level_0.csv. Note that this set captures almost all self-declared HCPs but also includes some false positives; we therefore develop an iterative relevance-filtering pipeline that uses topic modeling and MetaMap concept annotation to identify and enrich clinically relevant content. Subsequent files represent the outputs of each iteration of filtering. Please see our preprint for more details about our filtering method.
Each CSV file includes the following fields: "id" (the tweet ID, accessible using the Twitter API), "thread_id" (a generated value that is shared by multiple tweets in the same thread), and "date" (the date that the tweet was posted). Due to Twitter policies, we cannot provide the contents of the tweets, and ask that you "hydrate" the tweets using a Twitter API tool such as twarc. Note that some tweets may have been deleted since the collection of our dataset and will no longer be available.
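A minimal hydration sketch (assuming the twarc2 command-line client is installed and configured with Twitter API credentials; filenames are illustrative):

```python
import pandas as pd

# Collect the tweet IDs from one of the CSVs into a plain-text file,
# one ID per line, as expected by twarc2's hydrate command.
# Reading IDs as strings avoids precision loss on 64-bit tweet IDs.
ids = pd.read_csv("tweets_level_0.csv", dtype={"id": str})["id"]
ids.to_csv("ids.txt", index=False, header=False)

# Then, from the shell (twarc2 must be configured with API keys):
#   twarc2 hydrate ids.txt hydrated.jsonl
# Tweets deleted since collection will simply be absent from the output.
```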