Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news. The five breaking news events provided with the dataset are as follows:
* Charlie Hebdo: 458 rumours (22.0%) and 1,621 non-rumours (78.0%).
* Ferguson: 284 rumours (24.8%) and 859 non-rumours (75.2%).
* Germanwings Crash: 238 rumours (50.7%) and 231 non-rumours (49.3%).
* Ottawa Shooting: 470 rumours (52.8%) and 420 non-rumours (47.2%).
* Sydney Siege: 522 rumours (42.8%) and 699 non-rumours (57.2%).
The data is structured as follows. Each event has a directory with two subfolders, rumours and non-rumours. These two folders contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the thread in question, and the 'reactions' directory has the set of tweets responding to that source tweet. This dataset was used for rumour detection in the paper 'Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media'. For more details, please refer to the paper.
License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.
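For programmatic access, a minimal loading sketch in Python is given below. It follows the directory layout described above; the event root path in the example and the assumption of one JSON file per tweet are illustrative, not part of the dataset documentation.

```python
import json
from pathlib import Path

def load_event(event_dir):
    """Load source tweets and their reactions for one event.

    Follows the layout described above:
    <event>/<rumours|non-rumours>/<tweet_id>/{source-tweet,reactions}/.
    One JSON file per tweet is assumed.
    """
    threads = []
    for label in ("rumours", "non-rumours"):
        label_dir = Path(event_dir) / label
        for thread_dir in sorted(label_dir.iterdir()):
            if not thread_dir.is_dir():
                continue  # skip stray files
            source_files = list((thread_dir / "source-tweet").glob("*.json"))
            if not source_files:
                continue  # skip threads without a source tweet
            source = json.loads(source_files[0].read_text())
            reactions_dir = thread_dir / "reactions"
            reactions = ([json.loads(f.read_text())
                          for f in reactions_dir.glob("*.json")]
                         if reactions_dir.exists() else [])
            threads.append({"label": label, "source": source,
                            "reactions": reactions})
    return threads

# Example (path is hypothetical): threads = load_event("pheme/charliehebdo")
```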
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains a collection of Twitter rumours and non-rumours posted during six real-world events: 1) 2013 Boston marathon bombings, 2) 2014 Ottawa shooting, 3) 2014 Sydney siege, 4) 2015 Charlie Hebdo Attack, 5) 2014 Ferguson unrest, and 6) 2015 Germanwings plane crash.
The data set augments the PHEME dataset of rumours and non-rumours, and it is built from two source data sets: the PHEME data [2] (downloaded via https://figshare.com/articles/PHEME_dataset_for_Rumour_Detection_and_Veracity_Classification/6392078) and the CrisisLexT26 data [3] (downloaded via https://github.com/sajao/CrisisLex/tree/master/data/CrisisLexT26/2013_Boston_bombings).
PHEME-Aug v2.0 (aug-rnr-data_filtered.tar.bz2 and aug-rnr-data_full.tar.bz2) contains augmented data for all six events.
aug-rnr-data_full.tar.bz2 contains source tweets and replies without temporal filtering. Please refer to [1] for details about temporal filtering. The statistics are as follows:
* 2013 Boston marathon bombings: 392 rumours and 784 non-rumours
* 2014 Ottawa shooting: 1,047 rumours and 2,072 non-rumours
* 2014 Sydney siege: 1,764 rumours and 3,530 non-rumours
* 2015 Charlie Hebdo Attack: 1,225 rumours and 2,450 non-rumours
* 2014 Ferguson unrest: 737 rumours and 1,476 non-rumours
* 2015 Germanwings plane crash: 502 rumours and 604 non-rumours
aug-rnr-data_filtered.tar.bz2 contains source tweets, replies, and retweets after temporal filtering and deduplication. Please refer to [1] for details. The statistics are as follows:
* 2013 Boston marathon bombings: 323 rumours and 645 non-rumours
* 2014 Ottawa shooting: 713 rumours and 1,420 non-rumours
* 2014 Sydney siege: 1,134 rumours and 2,262 non-rumours
* 2015 Charlie Hebdo Attack: 812 rumours and 1,673 non-rumours
* 2014 Ferguson unrest: 471 rumours and 949 non-rumours
* 2015 Germanwings plane crash: 375 rumours and 402 non-rumours
The data structure follows the format of the PHEME data [2]. Each event has a directory with two subfolders, rumours and non-rumours. These two folders have folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the thread in question, and the 'reactions' directory has the set of tweets responding to that source tweet. In addition, each folder contains 'aug_complete.csv' and 'reference.csv'.
The 'aug_complete.csv' file contains the metadata (tweet ID, tweet text, timestamp, and rumour label) of augmented tweets before deduplication and before filtering out tweets without context (i.e., replies).
The 'reference.csv' file contains the manually annotated reference tweets [2, 3].
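As an illustration, the sketch below gathers the 'aug_complete.csv' metadata for one event. It assumes the CSV sits inside each of the rumours/non-rumours folders, as described above; the exact CSV header names are not documented here.

```python
from pathlib import Path

import pandas as pd

def load_augmented_metadata(event_dir):
    """Concatenate the 'aug_complete.csv' metadata of one event.

    Assumes one CSV per rumours/non-rumours folder, holding the columns
    named in the description (tweet ID, tweet text, timestamp, rumour
    label); exact header names are an assumption.
    """
    frames = []
    for label in ("rumours", "non-rumours"):
        csv_path = Path(event_dir) / label / "aug_complete.csv"
        if csv_path.exists():
            df = pd.read_csv(csv_path)
            df["folder_label"] = label  # keep track of the originating class
            frames.append(df)
    return pd.concat(frames, ignore_index=True)
```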
If you use our augmented data (PHEME-Aug v2.0), please also cite:
[1] Han, S., Gao, J., & Ciravegna, F. (2019). Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August 2019.
[2] Kochkina, E., Liakata, M., & Zubiaga, A. (2018). All-in-one: Multi-task Learning for Rumour Verification. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018).
[3] Olteanu, A., Vieweg, S., & Castillo, C. (2015). What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW 2015), pp. 994-1009. ACM.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news.
The data is structured as follows. Each event has a directory with two subfolders, rumours and non-rumours. These two folders have folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the thread in question, and the 'reactions' directory has the set of tweets responding to that source tweet. Each folder also contains 'annotation.json', which holds information about the veracity of the rumour, and 'structure.json', which holds information about the structure of the conversation.
This dataset is an extension of the PHEME dataset of rumours and non-rumours (https://figshare.com/articles/PHEME_dataset_of_rumours_and_non-rumours/4010619). It contains rumours related to nine events, and each rumour is annotated with its veracity value: either True, False, or Unverified.
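A minimal sketch for reading one thread's veracity annotation and conversation structure is shown below; since the key names inside 'annotation.json' and 'structure.json' are not documented here, the sketch simply returns the raw dictionaries.

```python
import json
from pathlib import Path

def load_thread(thread_dir):
    """Read the veracity annotation and conversation structure of one thread.

    'annotation.json' carries the veracity information (True, False, or
    Unverified) and 'structure.json' the reply structure, as described
    above; their internal key names are not documented here, so the raw
    dictionaries are returned.
    """
    thread = Path(thread_dir)
    annotation = json.loads((thread / "annotation.json").read_text())
    structure = json.loads((thread / "structure.json").read_text())
    return annotation, structure
```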
This dataset was used in the paper 'All-in-one: Multi-task Learning for Rumour Verification'. For more details, please refer to the paper.
Code using this dataset can be found on GitHub (https://github.com/kochkinaelena/Multitask4Veracity).
License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A dataset for multi-modal rumour detection, crawled from the social platform Weibo.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection, analysed according to eleven main characteristics (i.e., news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time).
This study determines the major differences between rumors and non-rumors and explores rumor classification performance over varying time windows, from the first three days to nearly two months. A comprehensive set of user, structural, linguistic, and temporal features was examined, and their relative strengths were compared on near-complete Twitter data. Our contribution lies in providing deep insight into the cumulative spreading patterns of rumors over time, as well as in tracking the precise changes in the predictive power of rumor features. Statistical analysis finds that structural and temporal features distinguish rumors from non-rumors over a long-term window, yet they are not available during the initial propagation phase. In contrast, user and linguistic features are readily available and act as good indicators during the initial propagation phase. Based on these findings, we suggest a new rumor classification algorithm that achieves competitive accuracy over both short and long time windows. These findings provide new insights for explaining rumor-mechanism theories and for identifying features for early rumor detection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This folder contains datasets and experimental results used in a research project on rumor generation, detection, and debunking. The core data was generated by two large language models, DeepSeek-R1 and qwq-32b, with additional detection results from DeepSeek-V3. The folder includes both direct model outputs and results derived from further analyses based on these outputs. The data is organized into several subfolders, each focusing on specific aspects of the research. Details of the analysis procedures are described in the accompanying manuscript.
Folder Structure
1. deepseek-r1-debunking
This folder contains the results generated by the DeepSeek-R1 model for debunking rumors. The files include:
- R_readability_results.json: Contains readability analysis results for the generated debunking texts.
- sentiment_analysis_R.json: Contains sentiment analysis results for the generated debunking texts.
- R_debunking_texts.json: Contains the debunking texts generated by the model.
- R_debunking_texts_with_similarity.json: Contains the debunking texts along with their similarity scores to the official debunking texts.
2. deepseek-r1-detection
This folder contains the results of DeepSeek-R1's detection of rumors in the FakeNewsNet and Twitter1516 datasets. The files include:
- DR1_detection_twitter1516.json: Detection results for the Twitter1516 dataset.
- DR1_detection_fakenews.json: Detection results for the FakeNewsNet dataset.
3. deepseek-r1-generation
This folder includes the rumors generated on specific themes using the DeepSeek-R1 model. The themes and corresponding files include:
- entertainment.json: Rumors generated on entertainment-related topics.
- financial.json: Rumors generated on financial-related topics.
- health.json: Rumors generated on health-related topics.
- disaster-related.json: Rumors generated on disaster-related topics.
4. deepseek-v3-detection
This folder contains the rumor detection results for the FakeNewsNet and Twitter1516 datasets, generated by the updated DeepSeek-V3 model. The files include:
- v3_results_fakenews.json: Detection results for the FakeNewsNet dataset.
- v3_results_twitter1516.json: Detection results for the Twitter1516 dataset.
5. qwq-32b-debunking
This folder contains the results of the qwq-32b model for debunking rumors. The files include:
- Q_debunking_texts_with_similarity.json: Contains the debunking texts with similarity scores to the original content.
- Q_sentiment_analysis.json: Contains sentiment analysis results for the generated debunking texts.
- Q_debunking_readability_results.json: Contains readability analysis results for the generated debunking texts.
- Q_debunking_texts.json: Contains the debunking texts generated by the model.
6. qwq-32b-detection
This folder includes the detection results for the FakeNewsNet and Twitter1516 datasets, generated by the qwq-32b model. The files include:
- Q_rumor_detection_results_fakenews.json: Detection results for the FakeNewsNet dataset.
- Q_rumor_detection_results_twitter1516.json: Detection results for the Twitter1516 dataset.
7. qwq-32b-generation
This folder contains the rumors generated on specific themes using the qwq-32b model.
The themes and corresponding files include:
- entertainment.json: Rumors generated on entertainment-related topics.
- financial.json: Rumors generated on financial-related topics.
- health.json: Rumors generated on health-related topics.
- disaster.json: Rumors generated on disaster-related topics.
Data Description
The following datasets were used in this research:
- FakeNewsNet: A widely used dataset consisting of fake news stories, which is employed for training and evaluating rumor detection models. This dataset includes news articles labeled as "fake" or "real," and is used in the detection phase of this study.
- Twitter1516: A dataset containing rumors and non-rumors from Twitter. It is used to evaluate both rumor detection and generation models. The dataset contains tweets labeled as either rumors or non-rumors, providing a benchmark for evaluating the performance of detection models.
Both datasets are publicly available and were used to train, test, and evaluate the models in this study. Please refer to the original dataset publications for detailed information on their structure and labeling.
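For orientation, a small sketch that inventories the detection-result files is shown below. The root folder name is hypothetical, and since the internal JSON layout is not specified above, the sketch only counts records generically.

```python
import json
from pathlib import Path

# Assumed root folder name; the JSON layout inside each file is not
# documented above, so only generic record counting is attempted.
root = Path("llm-rumor-data")
for folder in ("deepseek-r1-detection", "deepseek-v3-detection",
               "qwq-32b-detection"):
    for path in sorted((root / folder).glob("*.json")):
        records = json.loads(path.read_text(encoding="utf-8"))
        count = len(records) if isinstance(records, (list, dict)) else 1
        print(f"{folder}/{path.name}: {count} records")
```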
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.
Usage note:
- Misinformation detection, classification, tracking, prediction.
- Misinformation sentiment analysis.
- Rumor veracity classification, comment stance classification.
- Rumor tracking, social network analysis.
Data pre-processing and data analysis codes are available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see full info in our GitHub link.
Cite us: Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.
@article{cheng2021covid,
  title={A COVID-19 Rumor Dataset},
  author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul},
  journal={Frontiers in Psychology},
  volume={12},
  pages={1566},
  year={2021},
  publisher={Frontiers}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Twitter15, Twitter16, and PHEME. Each statement in the Twitter15 and Twitter16 datasets is labelled as a non-rumour (NR), false rumour (F), true rumour (T), or unverified rumour (U). The PHEME dataset is labelled with false rumour (F), true rumour (T), and unverified rumour (U). The datasets contain rich information, including the source post published by the author, the comments or reposts by other users, and their posting times. In terms of total events, Twitter15 contains 1,490 events, Twitter16 contains 818 events, and PHEME reaches 2,402 events. By rumour class, there are 370 false-rumour events in Twitter15, 205 in Twitter16, and 638 in PHEME; for true-rumour events, Twitter15 and Twitter16 each have 205, while PHEME reaches 1,067; for unverified-rumour events, Twitter15 and Twitter16 have 374 and 203 respectively, while PHEME has 697. The number of non-rumour events in PHEME is not explicitly given. In addition, the total number of retweets and comments across all events reaches 331,612 for Twitter15, 204,820 for Twitter16, and 105,354 for PHEME.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Rumor Detection of Sina Weibo Based on SDSMOTE and Feature Selection".
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The evaluation metrics, the literature used for comparison experiments, the network parameters, the equipment configuration, and the base networks are described in detail.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Chinese microblog rumor detection based on deep sequence context".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set contains information about the Amazon rainforest wildfires that took place in 2019. Twitter data was collected between August 21, 2019 and September 27, 2019 based on the following hashtags: #PrayforAmazonas, #AmazonRainforest, and #AmazonFire.
The goal of this data set is to detect whether a tweet is identified as a rumor or not (given by the 'label' column). A tweet that is identified as a rumor is labeled as 1, and 0 otherwise. The tweets were labeled by two independent annotators using the following guidelines. Whether a tweet is a rumor or not depends on three important aspects: (1) A rumor is a piece of information that is unverified or not confirmed by official instances. In other words, it does not matter whether the information turns out to be true or false in the future. (2) More specifically, a tweet is a rumor if the information is unverified at the time of posting. (3) For a tweet to be a rumor, it should contain an assertion, meaning the author of the tweet commits to the truth of the message.
In sum, the annotators indicated that a tweet is a rumor if it consisted of an assertion giving information that is unverifiable at the time of posting. Practically, to check whether the information in a tweet was verified or confirmed by official instances at the moment of tweeting, the annotators used BBC News and Reuters. After all the tweets were labeled, the annotators re-iterated over the tweets they disagreed on to produce the final tweet label.
Besides the label indicating whether a tweet is a rumor or not (i.e., ‘label’), the data set contains the tweet itself (i.e., ‘full_text’), and additional metadata (e.g., ‘created_at’, ‘favorite_count’). In total, the data set contains 1,392 observations of which 184 (13%) are identified as rumors.
This data set can be used by researchers to make rumor detection models (i.e., statistical, machine learning and deep learning models) using both unstructured (i.e., textual) and structured data.
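A minimal starting point for such models is sketched below; the CSV file name is hypothetical, while the column names follow the description above.

```python
import pandas as pd

# The CSV file name is an assumption; the columns 'label', 'full_text',
# 'created_at', and 'favorite_count' follow the description above.
df = pd.read_csv("amazon_wildfire_tweets.csv")

# Class balance: the description reports 184 rumors (label == 1) out of
# 1,392 tweets, i.e. roughly 13%.
print(df["label"].value_counts())

# Split unstructured text from structured metadata for modelling experiments.
X_text = df["full_text"]
X_meta = df[["created_at", "favorite_count"]]
y = df["label"]
```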
This resource describes two fake news datasets covering seven different news domains. One of the datasets was collected by combining manual and crowdsourced annotation approaches (FakeNewsAMT), while the second was collected directly from the web (Celebrity).
The FakeNewsDatabase dataset contains news in six different domains: technology, education, business, sports, politics, and entertainment. The legitimate news included in the dataset were collected from a variety of mainstream news websites, predominantly in the US, such as ABCNews, CNN, USAToday, NewYorkTimes, FoxNews, Bloomberg, and CNET, among others. The fake news included in this dataset consist of fake versions of the legitimate news in the dataset, written using Mechanical Turk. More details on the data collection are provided in section 3 of the paper.
The Celebrity dataset contains news about celebrities (actors, singers, socialites, and politicians). The legitimate news in the dataset were obtained from entertainment, fashion, and style news sections in mainstream news websites and from entertainment magazine websites. The fake news were obtained from gossip websites such as Entertainment Weekly, People Magazine, RadarOnline, and other tabloid and entertainment-oriented publications. The news articles were collected in pairs, with one article being legitimate and the other fake (rumors and false reports). The articles were manually verified using gossip-checking sites such as "GossipCop.com", and also cross-referenced with information from other entertainment news sources on the web.
The data directory contains two fake news datasets:
Celebrity: The fake and legitimate news are provided in two separate folders. The fake and legitimate labels are also provided as part of the filename.
FakeNewsAMT: The fake and legitimate news are provided in two separate folders. Each folder contains 40 news items from six different domains: technology, education, business, sports, politics, and entertainment. The file names indicate the news domain: business (biz), education (edu), entertainment (entmt), politics (polit), sports (sports), and technology (tech). The fake and legitimate labels are also provided as part of the filename.
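The sketch below indexes the FakeNewsAMT files by label and domain using the filename conventions above; the folder names ('fake', 'legit') and the '.txt' extension are assumptions.

```python
from pathlib import Path

# Domain abbreviations taken from the description above.
DOMAINS = {"biz": "business", "edu": "education", "entmt": "entertainment",
           "polit": "politics", "sports": "sports", "tech": "technology"}

def index_fakenewsamt(root):
    """Index FakeNewsAMT files by label folder and filename domain prefix.

    The folder names ('fake', 'legit') and the '.txt' extension are
    assumptions; the domain abbreviations come from the description above.
    """
    rows = []
    for label_dir in Path(root).iterdir():
        if not label_dir.is_dir():
            continue
        for f in sorted(label_dir.glob("*.txt")):
            abbrev = next((a for a in DOMAINS if f.stem.startswith(a)), None)
            rows.append({"file": str(f), "label": label_dir.name,
                         "domain": DOMAINS.get(abbrev, "unknown")})
    return rows
```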
@article{Perez-Rosas18Automatic,
  author = {Ver\'{o}nica P\'{e}rez-Rosas and Bennett Kleinberg and Alexandra Lefevre and Rada Mihalcea},
  title = {Automatic Detection of Fake News},
  journal = {International Conference on Computational Linguistics (COLING)},
  year = {2018}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study proposes an innovative approach for multimodal fake news detection that utilizes a stick-breaking smoothed Dirichlet distribution. This approach enables the model to capture intricate, subtle interactions between modalities more effectively, thereby improving detection performance and enhancing the system's adaptability to various forms of fake news content.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overall performance of the rumor detection task on Twitter15 and Twitter16.
The dataset consists of social media posts from WEIBO and TWITTER used for early rumour detection, capturing different events, posts, and interactions of users participating in the propagation of rumors.
Task
Fake news has become one of the main threats to our society. Although fake news is not a new phenomenon, the exponential growth of social media has offered an easy platform for its fast propagation. A great amount of fake news and rumors is propagated in online social networks, usually with the aim of deceiving users and shaping specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information, either intentionally or unintentionally. To this end, in this task we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.
After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection; age and gender, also together with personality; gender and language variety; and gender from a multimodality perspective), this year we aim at investigating whether it is possible to discriminate authors that have shared some fake news in the past from those that, to the best of our knowledge, have never done so.
As in previous years, we propose the task from a multilingual perspective:
NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem for just one language.
Data
Input
The uncompressed dataset consists of one folder per language (en, es). Each folder contains:
The format of the XML files is:
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.
b2d5748083d6fdffec6c2d68d4d4442d:::0
2bed15d46872169dc7deaf8d2b43a56:::0
8234ac5cca1aed3f9029277b2cb851b:::1
5ccd228e21485568016b4ee82deb0d28:::0
60d068f9cafb656431e62a6542de2dc0:::1
...
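A small parser for this format might look as follows; the ':::' separator and integer labels are taken from the sample above.

```python
def read_truth(path):
    """Parse a truth.txt file with one '<author-id>:::<label>' per line."""
    truth = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            author_id, label = line.split(":::")
            truth[author_id] = int(label)
    return truth

# Example (path is hypothetical): truth_en = read_truth("en/truth.txt")
```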
Output
Your software must take as input the absolute path to an unpacked dataset, and it must output, for each document of the dataset, a corresponding XML file that looks like this:
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
IMPORTANT! Languages should not be mixed. Create a folder for each language and place inside it only the prediction files for that language.
Evaluation
The performance of your system will be ranked by accuracy. For each language, we will calculate individual accuracies in discriminating between the two classes. Finally, we will average the accuracy values per language to obtain the final ranking.
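In other words, the final score is the mean of the per-language accuracies. A minimal sketch of that computation is given below; the dictionary-based in-memory format is an assumption for illustration.

```python
def final_score(truth, predictions):
    """Mean of per-language accuracies, as in the ranking described above.

    Both arguments map language ('en', 'es') to {author_id: label}; this
    in-memory format is an assumption, not part of the task definition.
    """
    accuracies = []
    for lang, gold in truth.items():
        pred = predictions[lang]
        correct = sum(pred.get(author) == label
                      for author, label in gold.items())
        accuracies.append(correct / len(gold))
    return sum(accuracies) / len(accuracies)
```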
Submission
Once you have finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:
mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY
Within OUTPUT-DIRECTORY, we require two subfolders, en and es, one folder per language. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one xml file per author. The xml file looks like this:
The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.
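Putting the output requirements together, the sketch below creates the per-language subfolder and writes one XML file per author. Since the example XML file is not reproduced above, the element shape used here is an assumption.

```python
from pathlib import Path

def write_predictions(output_dir, lang, predictions):
    """Create OUTPUT-DIRECTORY/<lang>/ and write one XML file per author.

    `predictions` maps author id -> 0/1. The XML element written here is
    an assumed shape, since the example file is not reproduced above.
    """
    lang_dir = Path(output_dir) / lang  # e.g. OUTPUT-DIRECTORY/en
    lang_dir.mkdir(parents=True, exist_ok=True)
    for author_id, label in predictions.items():
        xml = f'<author id="{author_id}" lang="{lang}" type="{label}" />\n'
        (lang_dir / f"{author_id}.xml").write_text(xml, encoding="utf-8")
```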
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Construction on Framework of Rumor Detection and Warning System Based on Web Mining Technology".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*** Newly Emerged Rumors in Twitter ***
These 12 datasets are the results of an empirical study on the spreading process of newly emerged rumors on Twitter. Newly emerged rumors are those rumors whose rise and fall happen in a short period of time, in contrast to long-standing rumors. In particular, we have focused on those newly emerged rumors that simultaneously gave rise to an anti-rumor spreading against them. The story of each rumor is as follows:
1- Dataset_R1 : The National Football League team in Washington D.C. changed its name to Redhawks.
2- Dataset_R2 : A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.
3- Dataset_R3 : Facebook CEO Mark Zuckerberg bought a "super-yacht" for $150 million.
4- Dataset_R4 : Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."
5- Dataset_R5 : Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.
6- Dataset_R6 : Harley-Davidson's chief executive officer Matthew Levatich called President Trump "a moron."
7- Dataset_R7 : The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.
8- Dataset_R8 : Michael Jordan resigned from the board at Nike and took his Air Jordan line of apparel with him.
9- Dataset_R9 : In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.
10- Dataset_R10 : During confirmation hearings for Supreme Court nominee Brett Kavanaugh, congressional Democrats demanded that the nominee undergo DNA testing to prove he is not Adolf Hitler.
11- Dataset_R11 : Singer Michael Bublé's upcoming album will be his last, as he is retiring from making music.
12- Dataset_R12 : A screenshot from MyLife.com confirms that mail bomb suspect Cesar Sayoc was registered as a Democrat.
The structure of the excel files for each dataset is as follows:
- Each row belongs to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about that tweet/retweet. From left to right, the columns present the following information:
- User ID (user who has posted the current tweet/retweet)
- The description sentence in the profile of the user who has published the tweet/retweet
- The number of tweets/retweets published by the user at the time of posting the current tweet/retweet
- Date and time of creation of the account by which the current tweet/retweet has been posted
- Language of the tweet/retweet
- Number of followers
- Number of followings (friends)
- Date and time of posting the current tweet/retweet
- Number of likes (favorites) the current tweet had acquired before it was crawled
- Number of times the current tweet had been retweeted before it was crawled
- Whether another tweet is embedded in the current tweet/retweet (for example, when the current tweet is a quote, reply, or retweet)
- The source (device/OS) from which the current tweet/retweet was posted
- Tweet/Retweet ID
- Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)
- Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)
- Reply ID (if the post is a reply then this feature gives the ID of the tweet that is replied by the current post)
- Frequency of tweet occurrence, i.e., the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet appears in the dataset as a retweet posted by others)
- State of the tweet, which can be one of the following forms (decided by agreement between the annotators):
r : The tweet/retweet is a rumor post
a : The tweet/retweet is an anti-rumor post
q : The tweet/retweet raises a question about the rumor, neither confirming nor denying it
n : The tweet/retweet is not related to the rumor (even though it contains keywords related to the rumor, it does not refer to it)
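A minimal sketch for loading one of these excel files with pandas follows; the file name, the absence of a header row, and the short column names are all assumptions based on the column order listed above.

```python
import pandas as pd

# Short column names corresponding, in order, to the 18 columns described
# above; the names themselves are assumptions for illustration.
COLUMNS = ["user_id", "profile_description", "statuses_count",
           "account_created_at", "language", "followers_count",
           "friends_count", "posted_at", "favorite_count", "retweet_count",
           "embedded_tweet", "source", "tweet_id", "retweet_id", "quote_id",
           "reply_id", "frequency", "state"]

# File name and lack of a header row are assumptions.
df = pd.read_excel("Dataset_R1.xlsx", header=None, names=COLUMNS)

# Map the annotated state codes to readable labels and summarize.
STATE_LABELS = {"r": "rumor", "a": "anti-rumor",
                "q": "question", "n": "not related"}
print(df["state"].map(STATE_LABELS).value_counts())
```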