Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*** Fake News on Twitter ***
These 5 datasets are the results of an empirical study on the spreading process of newly fake news on Twitter. Particularly, we have focused on those fake news which have given rise to a truth spreading simultaneously against them. The story of each fake news is as follow:
1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.
2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."
3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.
4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.
5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.
The data collection has been done in two stages that each provided a new dataset: 1- attaining Dataset of Diffusion (DD) that includes information of fake news/truth tweets and retweets 2- Query of neighbors for spreaders of tweets that provides us with Dataset of Graph (DG).
DD
DD for each fake news story is an excel file, named FNx_DD where x is the number of fake news, and has the following structure:
The structure of excel files for each dataset is as follow:
DG
DG for each fake news contains two files:
Because in the graph file, the label of each node is the number of its entrance in the graph. For example if node with user ID 12345637 be the first node which has been entered into the graph file then its label in the graph is 0 and its real ID (12345637) would be at the row number 1 (because the row number 0 belongs to column labels) in the jsonl file and so on other node IDs would be at the next rows of the file (each row corresponds to 1 user id). Therefore, if we want to know for example what the user id of node 200 (labeled 200 in the graph) is, then in jsonl file we should look at row number 202.
The user IDs of spreaders in DG (those who have had a post in DD) would be available in DD to get extra information about them and their tweet/retweet. The other user IDs in DG are the neighbors of these spreaders and might not exist in DD.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
kNN-Embed: Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval
This repo contains the TwitterFaveGraph dataset from our paper kNN-Embed: Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval. [PDF] [HuggingFace Datasets] This work is licensed under a Creative Commons Attribution 4.0 International License.
TwitterFollowGraph
TwitterFollowGraph is a bipartite directed graph of users (consumer) nodes to author (producer) nodes… See the full description on the dataset page: https://huggingface.co/datasets/Twitter/TwitterFollowGraph.
Facebook
TwitterAs of December 2022, X/Twitter's audience accounted for over *** million monthly active users worldwide. This figure was projected to ******** to approximately *** million by 2024, a ******* of around **** percent compared to 2022.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 653 996 tweets related to the Coronavirus topic and highlighted by hashtags such as: #COVID-19, #COVID19, #COVID, #Coronavirus, #NCoV and #Corona. The tweets' crawling period started on the 27th of February and ended on the 25th of March 2020, which is spread over four weeks.
The tweets were generated by 390 458 users from 133 different countries and were written in 61 languages. English being the most used language with almost 400k tweets, followed by Spanish with around 80k tweets.
The data is stored in as a CSV file, where each line represents a tweet. The CSV file provides information on the following fields:
Author: the user who posted the tweet
Recipient: contains the name of the user in case of a reply, otherwise it would have the same value as the previous field
Tweet: the full content of the tweet
Hashtags: the list of hashtags present in the tweet
Language: the language of the tweet
Relationship: gives information on the type of the tweet, whether it is a retweet, a reply, a tweet with a mention, etc.
Location: the country of the author of the tweet, which is unfortunately not always available
Date: the publication date of the tweet
Source: the device or platform used to send the tweet
The dataset can as well be used to construct a social graph since it includes the relations "Replies to", "Retweet", "MentionsInRetweet" and "Mentions".
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MiCRO: Multi-interest Candidate Retrieval Online
This repo contains the TwitterFaveGraph dataset from our paper MiCRO: Multi-interest Candidate Retrieval Online. [PDF] [HuggingFace Datasets] This work is licensed under a Creative Commons Attribution 4.0 International License.
TwitterFaveGraph
TwitterFaveGraph is a bipartite directed graph of user nodes to Tweet nodes where an edge represents a "fave" engagement. Each edge is binned into predetermined time chunks which… See the full description on the dataset page: https://huggingface.co/datasets/Twitter/TwitterFaveGraph.
Facebook
Twitterhttps://www.isi.edu/~lerman/downloads/twitter/twitter2010.htmlhttps://www.isi.edu/~lerman/downloads/twitter/twitter2010.html
Twitter_2010 data set contains tweets containing URLs that have been posted on Twitter during October 2010. In addition to tweets, we also the followee links of tweeting users, allowing us to reconstruct the follower graph of active (tweeting) users. The dataset contains 66,059 URLs, 2,859,764 tweets, 736,930 users and 36,743,448 links.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top retweeted users (highest indegree).
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Learning Stance Embeddings from Signed Social Graphs
This repo contains the datasets from our paper Learning Stance Embeddings from Signed Social Graphs. [PDF] [HuggingFace Datasets] This work is licensed under a Creative Commons Attribution 4.0 International License.
Overview
A key challenge in social network analysis is understanding the position, or stance, of people in the graph on a large set of topics. In such social graphs, modeling (dis)agreement patterns… See the full description on the dataset page: https://huggingface.co/datasets/Twitter/SignedGraphs.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One of the biggest advantages of Twitter is the speed at which information can be passed around. People use Twitter primarily to get news and for entertainment. This is the breakdown of why people use Twitter today.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Twitter is ranked as the 12h most popular social media site in the world. The platform currently has 611 million active monthly users.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the key Twitter user statistics that you need to know.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Twitter follower-followee graph with 269,640 nodes and 6,818,501 edges from [Kwak], and we obtain the ground truth labels from [SybilSCAR]. Among them 178377 are benign and 91263 are Sybil. We divide 9000 Sybil and 17000 benign users (about 10%) from them as the training set and test on the overall social graph.
H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social network or a news media?” in WWW, 2010 B. Wang, L. Zhang, and N. Z. Gong, “SybilSCAR: Sybil detection in online social networks via local rule based propagation,” in IEEE INFOCOM, 2017.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top retweeted users (highest out degree).
Facebook
TwitterDuring the first half of 2024 there were 34,497 pieces of content removed from X due to personal privacy violations, which include the publishing or sharing of other people's private information. These types of violations are also known as doxxing. Overall, 30,450 of these pieces of content were reported manually by users of the platform.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The complete list of all entities with the corresponding keywords that were used for each one.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The US has the largest number of Twitter users with over a 100 million users. They account for about 16.7% of all Twitter users worldwide.
Facebook
TwitterAttribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The Observatory Knowledge Graph (OKG) is a knowledge graph with tweets on inequality in terms of the OBIO ontology (https://w3id.org/okg/obio-ontology/), which integrates social media metadata with various types of linguistic knowledge. The OKG can be used as the backbone of a social media observatory, to facilitate a deeper understanding of social media discourse on inequality.
We retrieved tweets and retweets published from the end (30th) of May 2020 to the beginning (1st) of May 2023.
In this version of the OKG, we use a sample of 85,247 tweets, published from May 30th to August 27th, 2020. To be compliant with Twitter's policies, we remove usernames and id's, as well as the tweet texts and sentences. We also replace user IRIs with skolem IRIs through skolemization.
Access to the OKG as well as the SPARQL endpoint can be requested by sending a mail to the contact person (l.stork@uva.nl) with the following information:
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is structured as a graph, where nodes represent users and edges capture their interactions, including tweets, retweets, replies, and mentions. Each node provides detailed user attributes, such as unique ID, follower and following counts, and verification status, offering insights into each user's identity, role, and influence in the mental health discourse. The edges illustrate user interactions, highlighting engagement patterns and types of content that drive responses, such as tweet impressions. This interconnected structure enables sentiment analysis and public reaction studies, allowing researchers to explore engagement trends and identify the mental health topics that resonate most with users.
The dataset consists of three files: 1. Edges Data: Contains graph data essential for social network analysis, including fields for UserID (Source), UserID (Destination), Post/Tweet ID, and Date of Relationship. This file enables analysis of user connections without including tweet content, maintaining compliance with Twitter/X’s data-sharing policies. 2. Nodes Data: Offers user-specific details relevant to network analysis, including UserID, Account Creation Date, Follower and Following counts, Verified Status, and Date Joined Twitter. This file allows researchers to examine user behavior (e.g., identifying influential users or spam-like accounts) without direct reference to tweet content. 3. Twitter/X Content Data: This file contains only the raw tweet text as a single-column dataset, without associated user identifiers or metadata. By isolating the text, we ensure alignment with anonymization standards observed in similar published datasets, safeguarding user privacy in compliance with Twitter/X's data guidelines. This content is crucial for addressing the research focus on mental health discourse in social media. (References to prior Data in Brief publications involving Twitter/X data informed the dataset's structure.)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Node statistics for TwiBot-20, Bitcoin OTC and Bitcoin Alpha datasets.
Facebook
TwitterAs of February 2025, 24.5 percent of X (formerly Twitter) users were men aged between 25 and 34 years. Overall, almost 19 percent of users were men aged between 18 and 24 years. X has a high share of male users when compared to other popular social media platforms.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*** Fake News on Twitter ***
These 5 datasets are the results of an empirical study on the spreading process of newly fake news on Twitter. Particularly, we have focused on those fake news which have given rise to a truth spreading simultaneously against them. The story of each fake news is as follow:
1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.
2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."
3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.
4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.
5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.
The data collection has been done in two stages that each provided a new dataset: 1- attaining Dataset of Diffusion (DD) that includes information of fake news/truth tweets and retweets 2- Query of neighbors for spreaders of tweets that provides us with Dataset of Graph (DG).
DD
DD for each fake news story is an excel file, named FNx_DD where x is the number of fake news, and has the following structure:
The structure of excel files for each dataset is as follow:
DG
DG for each fake news contains two files:
Because in the graph file, the label of each node is the number of its entrance in the graph. For example if node with user ID 12345637 be the first node which has been entered into the graph file then its label in the graph is 0 and its real ID (12345637) would be at the row number 1 (because the row number 0 belongs to column labels) in the jsonl file and so on other node IDs would be at the next rows of the file (each row corresponds to 1 user id). Therefore, if we want to know for example what the user id of node 200 (labeled 200 in the graph) is, then in jsonl file we should look at row number 202.
The user IDs of spreaders in DG (those who have had a post in DD) would be available in DD to get extra information about them and their tweet/retweet. The other user IDs in DG are the neighbors of these spreaders and might not exist in DD.