Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news. The five breaking news events provided with the dataset are as follows:
* Charlie Hebdo: 458 rumours (22.0%) and 1,621 non-rumours (78.0%).
* Ferguson: 284 rumours (24.8%) and 859 non-rumours (75.2%).
* Germanwings Crash: 238 rumours (50.7%) and 231 non-rumours (49.3%).
* Ottawa Shooting: 470 rumours (52.8%) and 420 non-rumours (47.2%).
* Sydney Siege: 522 rumours (42.8%) and 699 non-rumours (57.2%).
The data is structured as follows. Each event has a directory with two subfolders, rumours and non-rumours. These two folders contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the thread in question, and the 'reactions' directory has the set of tweets responding to that source tweet. This dataset was used for rumour detection in the paper 'Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media'. For more details, please refer to the paper.
License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.
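For programmatic access, a minimal loading sketch in Python is given below. It follows the directory layout described above; the event root path in the example and the assumption of one JSON file per tweet are illustrative, not part of the dataset documentation.

```python
import json
from pathlib import Path

def load_event(event_dir):
    """Load source tweets and their reactions for one event.

    Follows the layout described above:
    <event>/<rumours|non-rumours>/<tweet_id>/{source-tweet,reactions}/.
    One JSON file per tweet is assumed.
    """
    threads = []
    for label in ("rumours", "non-rumours"):
        label_dir = Path(event_dir) / label
        for thread_dir in sorted(label_dir.iterdir()):
            if not thread_dir.is_dir():
                continue  # skip stray files
            source_files = list((thread_dir / "source-tweet").glob("*.json"))
            if not source_files:
                continue  # skip threads without a source tweet
            source = json.loads(source_files[0].read_text())
            reactions_dir = thread_dir / "reactions"
            reactions = ([json.loads(f.read_text())
                          for f in reactions_dir.glob("*.json")]
                         if reactions_dir.exists() else [])
            threads.append({"label": label, "source": source,
                            "reactions": reactions})
    return threads

# Example (path is hypothetical): threads = load_event("pheme/charliehebdo")
```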
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains a collection of Twitter rumours and non-rumours posted during six real-world events: 1) 2013 Boston marathon bombings, 2) 2014 Ottawa shooting, 3) 2014 Sydney siege, 4) 2015 Charlie Hebdo Attack, 5) 2014 Ferguson unrest, and 6) 2015 Germanwings plane crash.
The data set augments the PHEME dataset of rumours and non-rumours, and it is built from two source data sets: the PHEME data [2] (downloaded via https://figshare.com/articles/PHEME_dataset_for_Rumour_Detection_and_Veracity_Classification/6392078) and the CrisisLexT26 data [3] (downloaded via https://github.com/sajao/CrisisLex/tree/master/data/CrisisLexT26/2013_Boston_bombings).
PHEME-Aug v2.0 (aug-rnr-data_filtered.tar.bz2 and aug-rnr-data_full.tar.bz2) contains augmented data for all six events.
aug-rnr-data_full.tar.bz2 contains source tweets and replies without temporal filtering. Please refer to [1] for details about temporal filtering. The statistics are as follows:
* 2013 Boston marathon bombings: 392 rumours and 784 non-rumours
* 2014 Ottawa shooting: 1,047 rumours and 2,072 non-rumours
* 2014 Sydney siege: 1,764 rumours and 3,530 non-rumours
* 2015 Charlie Hebdo Attack: 1,225 rumours and 2,450 non-rumours
* 2014 Ferguson unrest: 737 rumours and 1,476 non-rumours
* 2015 Germanwings plane crash: 502 rumours and 604 non-rumours
aug-rnr-data_filtered.tar.bz2 contains source tweets, replies, and retweets after temporal filtering and deduplication. Please refer to [1] for details. The statistics are as follows:
* 2013 Boston marathon bombings: 323 rumours and 645 non-rumours
* 2014 Ottawa shooting: 713 rumours and 1,420 non-rumours
* 2014 Sydney siege: 1,134 rumours and 2,262 non-rumours
* 2015 Charlie Hebdo Attack: 812 rumours and 1,673 non-rumours
* 2014 Ferguson unrest: 471 rumours and 949 non-rumours
* 2015 Germanwings plane crash: 375 rumours and 402 non-rumours
The data structure follows the format of the PHEME data [2]. Each event has a directory with two subfolders, rumours and non-rumours. These two folders have folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the thread in question, and the 'reactions' directory has the set of tweets responding to that source tweet. In addition, each folder contains 'aug_complete.csv' and 'reference.csv'.
The 'aug_complete.csv' file contains the metadata (tweet ID, tweet text, timestamp, and rumour label) of augmented tweets before deduplication and before filtering out tweets without context (i.e., replies).
The 'reference.csv' file contains the manually annotated reference tweets [2, 3].
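As an illustration, the sketch below gathers the 'aug_complete.csv' metadata for one event. It assumes the CSV sits inside each of the rumours/non-rumours folders, as described above; the exact CSV header names are not documented here.

```python
from pathlib import Path

import pandas as pd

def load_augmented_metadata(event_dir):
    """Concatenate the 'aug_complete.csv' metadata of one event.

    Assumes one CSV per rumours/non-rumours folder, holding the columns
    named in the description (tweet ID, tweet text, timestamp, rumour
    label); exact header names are an assumption.
    """
    frames = []
    for label in ("rumours", "non-rumours"):
        csv_path = Path(event_dir) / label / "aug_complete.csv"
        if csv_path.exists():
            df = pd.read_csv(csv_path)
            df["folder_label"] = label  # keep track of the originating class
            frames.append(df)
    return pd.concat(frames, ignore_index=True)
```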
If you use our augmented data (PHEME-Aug v2.0), please also cite:
[1] Han, S., Gao, J., & Ciravegna, F. (2019). Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August 2019.
[2] Kochkina, E., Liakata, M., & Zubiaga, A. (2018). All-in-one: Multi-task Learning for Rumour Verification. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018).
[3] Olteanu, A., Vieweg, S., & Castillo, C. (2015). What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW 2015), pp. 994-1009. ACM.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news.
The data is structured as follows. Each event has a directory with two subfolders, rumours and non-rumours. These two folders have folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the thread in question, and the 'reactions' directory has the set of tweets responding to that source tweet. Each folder also contains 'annotation.json', which holds information about the veracity of the rumour, and 'structure.json', which holds information about the structure of the conversation.
This dataset is an extension of the PHEME dataset of rumours and non-rumours (https://figshare.com/articles/PHEME_dataset_of_rumours_and_non-rumours/4010619). It contains rumours related to nine events, and each rumour is annotated with its veracity value: either True, False, or Unverified.
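A minimal sketch for reading one thread's veracity annotation and conversation structure is shown below; since the key names inside 'annotation.json' and 'structure.json' are not documented here, the sketch simply returns the raw dictionaries.

```python
import json
from pathlib import Path

def load_thread(thread_dir):
    """Read the veracity annotation and conversation structure of one thread.

    'annotation.json' carries the veracity information (True, False, or
    Unverified) and 'structure.json' the reply structure, as described
    above; their internal key names are not documented here, so the raw
    dictionaries are returned.
    """
    thread = Path(thread_dir)
    annotation = json.loads((thread / "annotation.json").read_text())
    structure = json.loads((thread / "structure.json").read_text())
    return annotation, structure
```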
This dataset was used in the paper 'All-in-one: Multi-task Learning for Rumour Verification'. For more details, please refer to the paper.
Code using this dataset can be found on GitHub (https://github.com/kochkinaelena/Multitask4Veracity).
License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A dataset for multi-modal rumour detection, crawled from the social platform Weibo.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection, analysed according to eleven main characteristics (i.e., news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time).
This study determines the major differences between rumors and non-rumors and explores rumor classification performance over varying time windows, from the first three days to nearly two months. A comprehensive set of user, structural, linguistic, and temporal features was examined, and their relative strengths were compared on near-complete Twitter data. Our contribution lies in providing deep insight into the cumulative spreading patterns of rumors over time, as well as in tracking the precise changes in the predictive power of rumor features. Statistical analysis finds that structural and temporal features distinguish rumors from non-rumors over a long-term window, yet they are not available during the initial propagation phase. In contrast, user and linguistic features are readily available and act as good indicators during the initial propagation phase. Based on these findings, we suggest a new rumor classification algorithm that achieves competitive accuracy over both short and long time windows. These findings provide new insights for explaining rumor-mechanism theories and for identifying features for early rumor detection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This folder contains datasets and experimental results used in a research project on rumor generation, detection, and debunking. The core data was generated by two large language models, DeepSeek-R1 and qwq-32b, with additional detection results from DeepSeek-V3. The folder includes both direct model outputs and results derived from further analyses based on these outputs. The data is organized into several subfolders, each focusing on specific aspects of the research. Details of the analysis procedures are described in the accompanying manuscript.
Folder Structure
1. deepseek-r1-debunking
This folder contains the results generated by the DeepSeek-R1 model for debunking rumors. The files include:
- R_readability_results.json: Contains readability analysis results for the generated debunking texts.
- sentiment_analysis_R.json: Contains sentiment analysis results for the generated debunking texts.
- R_debunking_texts.json: Contains the debunking texts generated by the model.
- R_debunking_texts_with_similarity.json: Contains the debunking texts along with their similarity scores to the official debunking texts.
2. deepseek-r1-detection
This folder contains the results of DeepSeek-R1's detection of rumors in the FakeNewsNet and Twitter1516 datasets. The files include:
- DR1_detection_twitter1516.json: Detection results for the Twitter1516 dataset.
- DR1_detection_fakenews.json: Detection results for the FakeNewsNet dataset.
3. deepseek-r1-generation
This folder includes the rumors generated on specific themes using the DeepSeek-R1 model. The themes and corresponding files include:
- entertainment.json: Rumors generated on entertainment-related topics.
- financial.json: Rumors generated on financial-related topics.
- health.json: Rumors generated on health-related topics.
- disaster-related.json: Rumors generated on disaster-related topics.
4. deepseek-v3-detection
This folder contains the rumor detection results for the FakeNewsNet and Twitter1516 datasets, generated by the updated DeepSeek-V3 model. The files include:
- v3_results_fakenews.json: Detection results for the FakeNewsNet dataset.
- v3_results_twitter1516.json: Detection results for the Twitter1516 dataset.
5. qwq-32b-debunking
This folder contains the results of the qwq-32b model for debunking rumors. The files include:
- Q_debunking_texts_with_similarity.json: Contains the debunking texts with similarity scores to the original content.
- Q_sentiment_analysis.json: Contains sentiment analysis results for the generated debunking texts.
- Q_debunking_readability_results.json: Contains readability analysis results for the generated debunking texts.
- Q_debunking_texts.json: Contains the debunking texts generated by the model.
6. qwq-32b-detection
This folder includes the detection results for the FakeNewsNet and Twitter1516 datasets, generated by the qwq-32b model. The files include:
- Q_rumor_detection_results_fakenews.json: Detection results for the FakeNewsNet dataset.
- Q_rumor_detection_results_twitter1516.json: Detection results for the Twitter1516 dataset.
7. qwq-32b-generation
This folder contains the rumors generated on specific themes using the qwq-32b model.
The themes and corresponding files include:
- entertainment.json: Rumors generated on entertainment-related topics.
- financial.json: Rumors generated on financial-related topics.
- health.json: Rumors generated on health-related topics.
- disaster.json: Rumors generated on disaster-related topics.
Data Description
The following datasets were used in this research:
- FakeNewsNet: A widely used dataset consisting of fake news stories, which is employed for training and evaluating rumor detection models. This dataset includes news articles labeled as "fake" or "real," and is used in the detection phase of this study.
- Twitter1516: A dataset containing rumors and non-rumors from Twitter. It is used to evaluate both rumor detection and generation models. The dataset contains tweets labeled as either rumors or non-rumors, providing a benchmark for evaluating the performance of detection models.
Both datasets are publicly available and were used to train, test, and evaluate the models in this study. Please refer to the original dataset publications for detailed information on their structure and labeling.
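For orientation, a small sketch that inventories the detection-result files is shown below. The root folder name is hypothetical, and since the internal JSON layout is not specified above, the sketch only counts records generically.

```python
import json
from pathlib import Path

# Assumed root folder name; the JSON layout inside each file is not
# documented above, so only generic record counting is attempted.
root = Path("llm-rumor-data")
for folder in ("deepseek-r1-detection", "deepseek-v3-detection",
               "qwq-32b-detection"):
    for path in sorted((root / folder).glob("*.json")):
        records = json.loads(path.read_text(encoding="utf-8"))
        count = len(records) if isinstance(records, (list, dict)) else 1
        print(f"{folder}/{path.name}: {count} records")
```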
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.
Usage note:
- Misinformation detection, classification, tracking, prediction.
- Misinformation sentiment analysis.
- Rumor veracity classification, comment stance classification.
- Rumor tracking, social network analysis.
Data pre-processing and data analysis codes are available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see full info in our GitHub link.
Cite us: Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.
@article{cheng2021covid,
  title={A COVID-19 Rumor Dataset},
  author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul},
  journal={Frontiers in Psychology},
  volume={12},
  pages={1566},
  year={2021},
  publisher={Frontiers}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Twitter15, Twitter16, and PHEME. Each statement in the Twitter15 and Twitter16 datasets is labelled as a non-rumour (NR), false rumour (F), true rumour (T), or unverified rumour (U). The PHEME dataset is labelled with false rumour (F), true rumour (T), and unverified rumour (U). The datasets contain rich information, including the source post published by the author, the comments or reposts by other users, and their posting times. In terms of total events, Twitter15 contains 1,490 events, Twitter16 contains 818 events, and PHEME reaches 2,402 events. By rumour class, there are 370 false-rumour events in Twitter15, 205 in Twitter16, and 638 in PHEME; for true-rumour events, Twitter15 and Twitter16 each have 205, while PHEME reaches 1,067; for unverified-rumour events, Twitter15 and Twitter16 have 374 and 203 respectively, while PHEME has 697. The number of non-rumour events in PHEME is not explicitly given. In addition, the total number of retweets and comments across all events reaches 331,612 for Twitter15, 204,820 for Twitter16, and 105,354 for PHEME.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Rumor Detection of Sina Weibo Based on SDSMOTE and Feature Selection".
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The evaluation metrics, the literature used for comparison experiments, the network parameters, the equipment configuration, and the base networks are described in detail.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Chinese microblog rumor detection based on deep sequence context".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set contains information about the Amazon rainforest wildfires that took place in 2019. Twitter data was collected between August 21, 2019 and September 27, 2019 based on the following hashtags: #PrayforAmazonas, #AmazonRainforest, and #AmazonFire.
The goal of this data set is to detect whether a tweet is identified as a rumor or not (given by the 'label' column). A tweet that is identified as a rumor is labeled as 1, and 0 otherwise. The tweets were labeled by two independent annotators using the following guidelines. Whether a tweet is a rumor or not depends on three important aspects: (1) A rumor is a piece of information that is unverified or not confirmed by official instances. In other words, it does not matter whether the information turns out to be true or false in the future. (2) More specifically, a tweet is a rumor if the information is unverified at the time of posting. (3) For a tweet to be a rumor, it should contain an assertion, meaning the author of the tweet commits to the truth of the message.
In sum, the annotators indicated that a tweet is a rumor if it consisted of an assertion giving information that is unverifiable at the time of posting. Practically, to check whether the information in a tweet was verified or confirmed by official instances at the moment of tweeting, the annotators used BBC News and Reuters. After all the tweets were labeled, the annotators re-iterated over the tweets they disagreed on to produce the final tweet label.
Besides the label indicating whether a tweet is a rumor or not (i.e., ‘label’), the data set contains the tweet itself (i.e., ‘full_text’), and additional metadata (e.g., ‘created_at’, ‘favorite_count’). In total, the data set contains 1,392 observations of which 184 (13%) are identified as rumors.
This data set can be used by researchers to make rumor detection models (i.e., statistical, machine learning and deep learning models) using both unstructured (i.e., textual) and structured data.
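A minimal starting point for such models is sketched below; the CSV file name is hypothetical, while the column names follow the description above.

```python
import pandas as pd

# The CSV file name is an assumption; the columns 'label', 'full_text',
# 'created_at', and 'favorite_count' follow the description above.
df = pd.read_csv("amazon_wildfire_tweets.csv")

# Class balance: the description reports 184 rumors (label == 1) out of
# 1,392 tweets, i.e. roughly 13%.
print(df["label"].value_counts())

# Split unstructured text from structured metadata for modelling experiments.
X_text = df["full_text"]
X_meta = df[["created_at", "favorite_count"]]
y = df["label"]
```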
This resource describes two fake news datasets covering seven different news domains. One of the datasets was collected by combining manual and crowdsourced annotation approaches (FakeNewsAMT), while the second was collected directly from the web (Celebrity).
The FakeNewsDatabase dataset contains news in six different domains: technology, education, business, sports, politics, and entertainment. The legitimate news included in the dataset were collected from a variety of mainstream news websites, predominantly in the US, such as ABCNews, CNN, USAToday, NewYorkTimes, FoxNews, Bloomberg, and CNET, among others. The fake news included in this dataset consist of fake versions of the legitimate news in the dataset, written using Mechanical Turk. More details on the data collection are provided in section 3 of the paper.
The Celebrity dataset contains news about celebrities (actors, singers, socialites, and politicians). The legitimate news in the dataset were obtained from entertainment, fashion, and style news sections in mainstream news websites and from entertainment magazine websites. The fake news were obtained from gossip websites such as Entertainment Weekly, People Magazine, RadarOnline, and other tabloid and entertainment-oriented publications. The news articles were collected in pairs, with one article being legitimate and the other fake (rumors and false reports). The articles were manually verified using gossip-checking sites such as "GossipCop.com", and also cross-referenced with information from other entertainment news sources on the web.
The data directory contains two fake news datasets:
Celebrity: The fake and legitimate news are provided in two separate folders. The fake and legitimate labels are also provided as part of the filename.
FakeNewsAMT: The fake and legitimate news are provided in two separate folders. Each folder contains 40 news items from six different domains: technology, education, business, sports, politics, and entertainment. The file names indicate the news domain: business (biz), education (edu), entertainment (entmt), politics (polit), sports (sports), and technology (tech). The fake and legitimate labels are also provided as part of the filename.
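The sketch below indexes the FakeNewsAMT files by label and domain using the filename conventions above; the folder names ('fake', 'legit') and the '.txt' extension are assumptions.

```python
from pathlib import Path

# Domain abbreviations taken from the description above.
DOMAINS = {"biz": "business", "edu": "education", "entmt": "entertainment",
           "polit": "politics", "sports": "sports", "tech": "technology"}

def index_fakenewsamt(root):
    """Index FakeNewsAMT files by label folder and filename domain prefix.

    The folder names ('fake', 'legit') and the '.txt' extension are
    assumptions; the domain abbreviations come from the description above.
    """
    rows = []
    for label_dir in Path(root).iterdir():
        if not label_dir.is_dir():
            continue
        for f in sorted(label_dir.glob("*.txt")):
            abbrev = next((a for a in DOMAINS if f.stem.startswith(a)), None)
            rows.append({"file": str(f), "label": label_dir.name,
                         "domain": DOMAINS.get(abbrev, "unknown")})
    return rows
```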
@article{Perez-Rosas18Automatic,
  author = {Ver\'{o}nica P\'{e}rez-Rosas and Bennett Kleinberg and Alexandra Lefevre and Rada Mihalcea},
  title = {Automatic Detection of Fake News},
  journal = {International Conference on Computational Linguistics (COLING)},
  year = {2018}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study proposes an innovative approach for multimodal fake news detection that utilizes a stick-breaking smoothed Dirichlet distribution. This approach enables the model to capture intricate, subtle interactions between modalities more effectively, thereby improving detection performance and enhancing the system's adaptability to various forms of fake news content.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overall performance of the rumor detection task on Twitter15 and Twitter16.
The dataset consists of social media posts from WEIBO and TWITTER used for early rumour detection, capturing different events, posts, and interactions of users participating in the propagation of rumors.
Task
Fake news has become one of the main threats to our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for its fast propagation. A great amount of fake news and rumors is propagated in online social networks, usually with the aim of deceiving users and shaping specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information, either intentionally or unintentionally. To this end, in this task we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.
After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection; age and gender, also together with personality; gender and language variety; and gender from a multimodality perspective), this year we aim at investigating whether it is possible to discriminate authors that have shared some fake news in the past from those that, to the best of our knowledge, have never done so.
As in previous years, we propose the task from a multilingual perspective:
NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem for just one language.
Data
Input
The uncompressed dataset consists of one folder per language (en, es). Each folder contains:
The format of the XML files is:
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.
b2d5748083d6fdffec6c2d68d4d4442d:::0
2bed15d46872169dc7deaf8d2b43a56:::0
8234ac5cca1aed3f9029277b2cb851b:::1
5ccd228e21485568016b4ee82deb0d28:::0
60d068f9cafb656431e62a6542de2dc0:::1
...
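A small parser for this format might look as follows; the ':::' separator and integer labels are taken from the sample above.

```python
def read_truth(path):
    """Parse a truth.txt file with one '<author-id>:::<label>' per line."""
    truth = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            author_id, label = line.split(":::")
            truth[author_id] = int(label)
    return truth

# Example (path is hypothetical): truth_en = read_truth("en/truth.txt")
```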
Output
Your software must take as input the absolute path to an unpacked dataset, and it must output, for each document of the dataset, a corresponding XML file that looks like this:
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
IMPORTANT! Languages should not be mixed. Create a folder for each language and place inside it only the prediction files for that language.
Evaluation
The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
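In other words, the final score is the mean of the per-language accuracies. A minimal sketch of that computation is given below; the dictionary-based in-memory format is an assumption for illustration.

```python
def final_score(truth, predictions):
    """Mean of per-language accuracies, as in the ranking described above.

    Both arguments map language ('en', 'es') to {author_id: label}; this
    in-memory format is an assumption, not part of the task definition.
    """
    accuracies = []
    for lang, gold in truth.items():
        pred = predictions[lang]
        correct = sum(pred.get(author) == label
                      for author, label in gold.items())
        accuracies.append(correct / len(gold))
    return sum(accuracies) / len(accuracies)
```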
Submission
Once you have finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:
mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY
Within OUTPUT-DIRECTORY, we require two subfolders, en and es, one folder per language. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one xml file per author. The xml file looks like this:
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
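Putting the output requirements together, the sketch below creates the per-language subfolder and writes one XML file per author. Since the example XML file is not reproduced above, the element shape used here is an assumption.

```python
from pathlib import Path

def write_predictions(output_dir, lang, predictions):
    """Create OUTPUT-DIRECTORY/<lang>/ and write one XML file per author.

    `predictions` maps author id -> 0/1. The XML element written here is
    an assumed shape, since the example file is not reproduced above.
    """
    lang_dir = Path(output_dir) / lang  # e.g. OUTPUT-DIRECTORY/en
    lang_dir.mkdir(parents=True, exist_ok=True)
    for author_id, label in predictions.items():
        xml = f'<author id="{author_id}" lang="{lang}" type="{label}" />\n'
        (lang_dir / f"{author_id}.xml").write_text(xml, encoding="utf-8")
```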
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Construction on Framework of Rumor Detection and Warning System Based on Web Mining Technology".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*** Newly Emerged Rumors in Twitter ***
These 12 datasets are the results of an empirical study on the spreading process of newly emerged rumors on Twitter. Newly emerged rumors are those rumors whose rise and fall happen in a short period of time, in contrast to long-standing rumors. In particular, we have focused on those newly emerged rumors that simultaneously gave rise to an anti-rumor spreading against them. The story of each rumor is as follows:
1- Dataset_R1 : The National Football League team in Washington D.C. changed its name to Redhawks.
2- Dataset_R2 : A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.
3- Dataset_R3 : Facebook CEO Mark Zuckerberg bought a "super-yacht" for $150 million.
4- Dataset_R4 : Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."
5- Dataset_R5 : Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.
6- Dataset_R6 : Harley-Davidson's chief executive officer Matthew Levatich called President Trump "a moron."
7- Dataset_R7 : The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.
8- Dataset_R8 : Michael Jordan resigned from the board at Nike and took his Air Jordan line of apparel with him.
9- Dataset_R9 : In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.
10- Dataset_R10 : During confirmation hearings for Supreme Court nominee Brett Kavanaugh, congressional Democrats demanded that the nominee undergo DNA testing to prove he is not Adolf Hitler.
11- Dataset_R11 : Singer Michael Bublé's upcoming album will be his last, as he is retiring from making music.
12- Dataset_R12 : A screenshot from MyLife.com confirms that mail bomb suspect Cesar Sayoc was registered as a Democrat.
The structure of the excel files for each dataset is as follows:
- Each row belongs to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about that tweet/retweet. From left to right, the columns present the following information:
- User ID (user who has posted the current tweet/retweet)
- The description sentence in the profile of the user who has published the tweet/retweet
- The number of tweets/retweets published by the user at the time of posting the current tweet/retweet
- Date and time of creation of the account by which the current tweet/retweet has been posted
- Language of the tweet/retweet
- Number of followers
- Number of followings (friends)
- Date and time of posting the current tweet/retweet
- Number of likes (favorites) the current tweet had acquired before it was crawled
- Number of times the current tweet had been retweeted before it was crawled
- Whether another tweet is embedded in the current tweet/retweet (for example, when the current tweet is a quote, reply, or retweet)
- The source (device/OS) from which the current tweet/retweet was posted
- Tweet/Retweet ID
- Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)
- Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)
- Reply ID (if the post is a reply then this feature gives the ID of the tweet that is replied by the current post)
- Frequency of tweet occurrence, i.e., the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet appears in the dataset as a retweet posted by others)
- State of the tweet, which can be one of the following forms (decided by agreement between the annotators):
r : The tweet/retweet is a rumor post
a : The tweet/retweet is an anti-rumor post
q : The tweet/retweet raises a question about the rumor, neither confirming nor denying it
n : The tweet/retweet is not related to the rumor (even though it contains keywords related to the rumor, it does not refer to it)
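A minimal sketch for loading one of these excel files with pandas follows; the file name, the absence of a header row, and the short column names are all assumptions based on the column order listed above.

```python
import pandas as pd

# Short column names corresponding, in order, to the 18 columns described
# above; the names themselves are assumptions for illustration.
COLUMNS = ["user_id", "profile_description", "statuses_count",
           "account_created_at", "language", "followers_count",
           "friends_count", "posted_at", "favorite_count", "retweet_count",
           "embedded_tweet", "source", "tweet_id", "retweet_id", "quote_id",
           "reply_id", "frequency", "state"]

# File name and lack of a header row are assumptions.
df = pd.read_excel("Dataset_R1.xlsx", header=None, names=COLUMNS)

# Map the annotated state codes to readable labels and summarize.
STATE_LABELS = {"r": "rumor", "a": "anti-rumor",
                "q": "question", "n": "not related"}
print(df["state"].map(STATE_LABELS).value_counts())
```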