44 datasets found
  1. Data from: PHEME dataset of rumours and non-rumours

    • figshare.com
    bz2
    Updated May 30, 2023
    Cite
    Arkaitz Zubiaga; Geraldine Wong Sak Hoi; Maria Liakata; Rob Procter (2023). PHEME dataset of rumours and non-rumours [Dataset]. http://doi.org/10.6084/m9.figshare.4010619.v1
    Available download formats: bz2
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Arkaitz Zubiaga; Geraldine Wong Sak Hoi; Maria Liakata; Rob Procter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news. The five breaking news events provided with the dataset are as follows:

    * Charlie Hebdo: 458 rumours (22.0%) and 1,621 non-rumours (78.0%).
    * Ferguson: 284 rumours (24.8%) and 859 non-rumours (75.2%).
    * Germanwings Crash: 238 rumours (50.7%) and 231 non-rumours (49.3%).
    * Ottawa Shooting: 470 rumours (52.8%) and 420 non-rumours (47.2%).
    * Sydney Siege: 522 rumours (42.8%) and 699 non-rumours (57.2%).

    The data is structured as follows. Each event has a directory with two subfolders, rumours and non-rumours. These two folders contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the tweet in question, and the 'reactions' directory holds the set of tweets responding to that source tweet.

    This dataset was used for rumour detection in the paper 'Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media'. For more details, please refer to the paper.

    License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.
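
    Based on the layout described above, a thread loader can be sketched in a few lines of Python. This is a minimal sketch, assuming one JSON file per tweet under the 'source-tweet' and 'reactions' folders; verify against the extracted archive before relying on it.

    ```python
    import json
    from pathlib import Path

    def load_event(event_dir):
        """Collect threads (source tweet plus replies) for one PHEME event.

        Assumes the layout described above:
        <event>/<rumours|non-rumours>/<tweet_id>/source-tweet/*.json
        and .../<tweet_id>/reactions/*.json (one JSON file per tweet).
        """
        threads = []
        for label in ("rumours", "non-rumours"):
            for thread in sorted((Path(event_dir) / label).glob("*")):
                source_files = list((thread / "source-tweet").glob("*.json"))
                if not source_files:
                    continue  # skip stray files or incomplete threads
                source = json.loads(source_files[0].read_text())
                reactions = [json.loads(p.read_text())
                             for p in sorted((thread / "reactions").glob("*.json"))]
                threads.append({"label": label,
                                "source": source,
                                "reactions": reactions})
        return threads
    ```

    Each returned item carries its rumour/non-rumour label, so per-event statistics like the bullet list above can be recomputed directly from the unpacked data.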

  2. Augmented dataset of rumours and non-rumours for rumour detection

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1more
    json
    Updated Oct 22, 2023
    Cite
    (2023). Augmented dataset of rumours and non-rumours for rumour detection [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7551
    Available download formats: json
    Dataset updated
    Oct 22, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains a collection of Twitter rumours and non-rumours during six real-world events: 1) 2013 Boston marathon bombings, 2) 2014 Ottawa shooting, 3) 2014 Sydney siege, 4) 2015 Charlie Hebdo attack, 5) 2014 Ferguson unrest, and 6) 2015 Germanwings plane crash.

    The data set is an augmented data set of the PHEME dataset of rumours and non-rumours based on two data sets: the PHEME data [2] (downloaded via https://figshare.com/articles/PHEME_dataset_for_Rumour_Detection_and_Veracity_Classification/6392078), and the CrisisLexT26 data [3] (downloaded via https://github.com/sajao/CrisisLex/tree/master/data/CrisisLexT26/2013_Boston_bombings).

    PHEME-Aug v2.0 (aug-rnr-data_filtered.tar.bz2 and aug-rnr-data_full.tar.bz2) contains augmented data for all six events.

    aug-rnr-data_full.tar.bz2 contains source tweets and replies without temporal filtering. Please refer to [1] for details about temporal filtering. The statistics are as follows:

    * 2013 Boston marathon bombings: 392 rumours and 784 non-rumours

    * 2014 Ottawa shooting: 1,047 rumours and 2,072 non-rumours

    * 2014 Sydney siege: 1,764 rumours and 3,530 non-rumours

    * 2015 Charlie Hebdo Attack: 1,225 rumours and 2,450 non-rumours

    * 2014 Ferguson unrest: 737 rumours and 1,476 non-rumours

    * 2015 Germanwings plane crash: 502 rumours and 604 non-rumours

    aug-rnr-data_filtered.tar.bz2 contains source tweets, replies, and retweets after temporal filtering and deduplication. Please refer to [1] for details. The statistics are as follows:

    * 2013 Boston marathon bombings: 323 rumours and 645 non-rumours

    * 2014 Ottawa shooting: 713 rumours and 1,420 non-rumours

    * 2014 Sydney siege: 1,134 rumours and 2,262 non-rumours

    * 2015 Charlie Hebdo Attack: 812 rumours and 1,673 non-rumours

    * 2014 Ferguson unrest: 471 rumours and 949 non-rumours

    * 2015 Germanwings plane crash: 375 rumours and 402 non-rumours

    The data structure follows the format of the PHEME data [2]. Each event has a directory with two subfolders, rumours and non-rumours. These two folders contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the tweet in question, and the 'reactions' directory holds the set of tweets responding to that source tweet. Each folder also contains 'aug_complete.csv' and 'reference.csv'.

    The 'aug_complete.csv' file contains the metadata (tweet ID, tweet text, timestamp, and rumour label) of the augmented tweets before deduplication and before filtering out tweets without context (i.e., replies).

    The 'reference.csv' file contains the manually annotated reference tweets [2, 3].
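
    The CSV metadata can be inspected with the standard library alone. A minimal sketch; the column name 'label' used below is hypothetical (the description only says the file holds tweet ID, tweet text, timestamp, and rumour label), so check the actual header row of 'aug_complete.csv' first.

    ```python
    import csv
    from collections import Counter

    def label_counts(csv_path, label_field="label"):
        """Count tweets per rumour label in an aug_complete.csv-style file.

        'label' is an assumed column name -- adjust label_field to match
        the real header of the file being read.
        """
        with open(csv_path, newline="", encoding="utf-8") as f:
            return Counter(row[label_field] for row in csv.DictReader(f))
    ```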

    If you use our augmented data (PHEME-Aug v2.0), please also cite:

    [1] Han S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August, 2019


    [2] Kochkina, E., Liakata, M., & Zubiaga, A. (2018). All-in-one: Multi-task Learning for Rumour Verification. COLING.

    [3] Olteanu, A., Vieweg, S., & Castillo, C. (2015, February). What to expect when the unexpected happens: Social media communications across crises. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing (pp. 994-1009). ACM

  3. PHEME dataset for Rumour Detection and Veracity Classification

    • figshare.com
    application/gzip
    Updated May 30, 2023
    Cite
    Elena Kochkina; Maria Liakata; Arkaitz Zubiaga (2023). PHEME dataset for Rumour Detection and Veracity Classification [Dataset]. http://doi.org/10.6084/m9.figshare.6392078.v1
    Available download formats: application/gzip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Elena Kochkina; Maria Liakata; Arkaitz Zubiaga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news.

    The data is structured as follows. Each event has a directory with two subfolders, rumours and non-rumours. These two folders contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the tweet in question, and the 'reactions' directory holds the set of tweets responding to that source tweet. Each folder also contains 'annotation.json', which holds information about the veracity of the rumour, and 'structure.json', which describes the structure of the conversation.

    This dataset is an extension of the PHEME dataset of rumours and non-rumours (https://figshare.com/articles/PHEME_dataset_of_rumours_and_non-rumours/4010619). It contains rumours related to nine events, and each rumour is annotated with its veracity value: either True, False, or Unverified.
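
    A tally of veracity labels can then be computed by walking the rumour threads. This is a sketch under one loud assumption: the JSON field name 'veracity' is hypothetical, so inspect an actual annotation.json for the real key before using it.

    ```python
    import json
    from collections import Counter
    from pathlib import Path

    def veracity_counts(event_dir, key="veracity"):
        """Count veracity labels across one event's rumour threads.

        Assumes the directory layout described above; the annotation
        field name 'veracity' is a guess and must be checked against
        a real annotation.json file.
        """
        counts = Counter()
        for ann in Path(event_dir).glob("rumours/*/annotation.json"):
            counts[json.loads(ann.read_text()).get(key, "Unverified")] += 1
        return counts
    ```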

    This dataset was used in the paper 'All-in-one: Multi-task Learning for Rumour Verification'. For more details, please refer to the paper.

    Code using this dataset can be found on github (https://github.com/kochkinaelena/Multitask4Veracity).

    License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.

  4. RumourDetectionDataset

    • scidb.cn
    Updated Jul 26, 2024
    Cite
    jiang chao (2024). RumourDetectionDataset [Dataset]. http://doi.org/10.57760/sciencedb.j00133.00310
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2024
    Dataset provided by
    Science Data Bank
    Authors
    jiang chao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A dataset for multi-modal rumour detection, crawled from the social platform Weibo.

  5. Repository of fake news detection datasets

    • data.4tu.nl
    zip
    Updated Mar 18, 2021
    Cite
    Arianna D'Ulizia; Maria Chiara Caschera; Fernando ferri; Patrizia Grifoni (2021). Repository of fake news detection datasets [Dataset]. http://doi.org/10.4121/14151755.v1
    Available download formats: zip
    Dataset updated
    Mar 18, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Arianna D'Ulizia; Maria Chiara Caschera; Fernando ferri; Patrizia Grifoni
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    2000 - 2019
    Description

    The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection, analysed according to eleven main characteristics (news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time).

  6. Data from: Rumor detection over varying time windows

    • search.dataone.org
    • plos.figshare.com
    Updated Nov 21, 2023
    Cite
    Kwon, Sejeong; Cha, Meeyoung; Jung, Kyomin (2023). Rumor detection over varying time windows. [Dataset]. http://doi.org/10.7910/DVN/BFGAVZ
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Kwon, Sejeong; Cha, Meeyoung; Jung, Kyomin
    Description

    This study determines the major differences between rumors and non-rumors and explores rumor classification performance over varying time windows, from the first three days to nearly two months. A comprehensive set of user, structural, linguistic, and temporal features was examined and their relative strength compared using near-complete data from Twitter. Our contribution lies in providing deep insight into the cumulative spreading patterns of rumors over time, as well as in tracking the precise changes in predictive power across rumor features. Statistical analysis finds that structural and temporal features distinguish rumors from non-rumors over a long-term window, yet they are not available during the initial propagation phase. In contrast, user and linguistic features are readily available and act as good indicators during the initial propagation phase. Based on these findings, we suggest a new rumor classification algorithm that achieves competitive accuracy over both short and long time windows. These findings provide new insights for explaining rumor mechanism theories and for identifying features for early rumor detection.

  7. Data from: main data

    • figshare.com
    Updated Aug 18, 2025
    Cite
    Momo (2025). main data [Dataset]. http://doi.org/10.6084/m9.figshare.28881542.v1
    Dataset updated
    Aug 18, 2025
    Dataset provided by
    figshare
    Authors
    Momo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This folder contains datasets and experimental results used in a research project on rumor generation, detection, and debunking. The core data was generated by two large language models, DeepSeek-R1 and qwq-32b, with additional detection results from DeepSeek-V3. The folder includes both direct model outputs and results derived from further analyses based on these outputs. The data is organized into several subfolders, each focusing on a specific aspect of the research. Details of the analysis procedures are described in the accompanying manuscript.

    Folder Structure

    1. deepseek-r1-debunking: Results generated by the DeepSeek-R1 model for debunking rumors. Files: R_readability_results.json (readability analysis of the generated debunking texts); sentiment_analysis_R.json (sentiment analysis of the generated debunking texts); R_debunking_texts.json (the debunking texts generated by the model); R_debunking_texts_with_similarity.json (the debunking texts along with their similarity scores to the official debunking texts).

    2. deepseek-r1-detection: Results of DeepSeek-R1's rumor detection on the FakeNewsNet and Twitter1516 datasets. Files: DR1_detection_twitter1516.json (Twitter1516); DR1_detection_fakenews.json (FakeNewsNet).

    3. deepseek-r1-generation: Rumors generated on specific themes with the DeepSeek-R1 model. Files: entertainment.json, financial.json, health.json, and disaster-related.json.

    4. deepseek-v3-detection: Rumor detection results on the FakeNewsNet and Twitter1516 datasets, generated by the updated DeepSeek-V3 model. Files: v3_results_fakenews.json (FakeNewsNet); v3_results_twitter1516.json (Twitter1516).

    5. qwq-32b-debunking: Results of the qwq-32b model for debunking rumors. Files: Q_debunking_texts_with_similarity.json (debunking texts with similarity scores to the original content); Q_sentiment_analysis.json (sentiment analysis); Q_debunking_readability_results.json (readability analysis); Q_debunking_texts.json (the debunking texts generated by the model).

    6. qwq-32b-detection: Detection results for the FakeNewsNet and Twitter1516 datasets, generated by the qwq-32b model. Files: Q_rumor_detection_results_fakenews.json (FakeNewsNet); Q_rumor_detection_results_twitter1516.json (Twitter1516).

    7. qwq-32b-generation: Rumors generated on specific themes with the qwq-32b model. Files: entertainment.json, financial.json, health.json, and disaster.json.

    Data Description

    The following datasets were used in this research:

    * FakeNewsNet: A widely used dataset of fake news stories, employed for training and evaluating rumor detection models. It includes news articles labeled as "fake" or "real" and is used in the detection phase of this study.

    * Twitter1516: A dataset of rumors and non-rumors from Twitter, used to evaluate both rumor detection and generation models. It contains tweets labeled as either rumors or non-rumors, providing a benchmark for evaluating the performance of detection models.

    Both datasets are publicly available and were used to train, test, and evaluate the models in this study. Please refer to the original dataset publications for detailed information on their structure and labeling.

  8. COVID-19 rumor dataset

    • figshare.com
    html
    Updated Jun 10, 2023
    Cite
    cheng (2023). COVID-19 rumor dataset [Dataset]. http://doi.org/10.6084/m9.figshare.14456385.v2
    Available download formats: html
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    figshare
    Authors
    cheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.

    Usage notes:
    * Misinformation detection, classification, tracking, and prediction.
    * Misinformation sentiment analysis.
    * Rumor veracity classification and comment stance classification.
    * Rumor tracking and social network analysis.

    Data pre-processing and data analysis code is available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see the full information in the GitHub link.

    Cite us: Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.

    @article{cheng2021covid,
      title={A COVID-19 Rumor Dataset},
      author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul},
      journal={Frontiers in Psychology},
      volume={12},
      pages={1566},
      year={2021},
      publisher={Frontiers}
    }

  9. Social media rumor detection datasets, Twitter15, Twitter16, PHEME

    • scidb.cn
    Updated Dec 12, 2024
    Cite
    yinweiming (2024). Social media rumor detection datasets, Twitter15, Twitter16, PHEME [Dataset]. http://doi.org/10.57760/sciencedb.j00133.00417
    Available download formats: Croissant
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Science Data Bank
    Authors
    yinweiming
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Twitter15, Twitter16, and PHEME. Each statement in the Twitter15 and Twitter16 datasets is labeled as non-rumor (NR), false rumor (F), true rumor (T), or unverified rumor (U). The PHEME dataset is labeled as false rumor (F), true rumor (T), or unverified rumor (U). The datasets contain rich information, including the source posts published by authors, the posts commented on or reposted by other users, and their posting times. In terms of total events, Twitter15 contains 1,490 events, Twitter16 contains 818, and PHEME reaches 2,402. By rumor category, there are 370 false-rumor events in Twitter15, 205 in Twitter16, and 638 in PHEME; in terms of true-rumor events, Twitter15 and Twitter16 each have 205, while PHEME reaches 1,067; in terms of unverified-rumor events, Twitter15 and Twitter16 have 374 and 203 respectively, while PHEME has 697. The number of non-rumor events in PHEME is not explicitly given. In addition, the total number of reposts and comments across all events reaches 331,612 on Twitter15, 204,820 on Twitter16, and 105,354 on PHEME.

  10. Citation Trends for "Rumor Detection of Sina Weibo Based on SDSMOTE and Feature Selection"

    • shibatadb.com
    Updated Apr 15, 2019
    Cite
    Yubetsu (2019). Citation Trends for "Rumor Detection of Sina Weibo Based on SDSMOTE and Feature Selection" [Dataset]. https://www.shibatadb.com/article/D3nNTeDG
    Dataset updated
    Apr 15, 2019
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2020 - 2022
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Rumor Detection of Sina Weibo Based on SDSMOTE and Feature Selection".

  11. Experimental parameters

    • scidb.cn
    Updated Jun 22, 2022
    Cite
    wangxiaopei (2022). Experimental parameters [Dataset]. http://doi.org/10.57760/sciencedb.j00133.00018
    Available download formats: Croissant
    Dataset updated
    Jun 22, 2022
    Dataset provided by
    Science Data Bank
    Authors
    wangxiaopei
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The evaluation metrics, comparison literature, network parameters, equipment configuration, and base network are described in detail.

  12. Citation Trends for "Chinese microblog rumor detection based on deep sequence context"

    • shibatadb.com
    Updated Apr 27, 2018
    Cite
    Yubetsu (2018). Citation Trends for "Chinese microblog rumor detection based on deep sequence context" [Dataset]. https://www.shibatadb.com/article/3nsf7fwd
    Dataset updated
    Apr 27, 2018
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2019 - 2024
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Chinese microblog rumor detection based on deep sequence context".

  13. Data from: Amazon Rainforest Wildfires Rumor Detection

    • data.mendeley.com
    Updated Dec 6, 2022
    Cite
    Bram Janssens (2022). Amazon Rainforest Wildfires Rumor Detection [Dataset]. http://doi.org/10.17632/m7k4gsffry.1
    Dataset updated
    Dec 6, 2022
    Authors
    Bram Janssens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Amazon Rainforest
    Description

    The data set contains information about the Amazon rainforest wildfires that took place in 2019. Twitter data has been collected between August 21, 2019 and September 27, 2019 based on the following hashtags: #PrayforAmazonas, #AmazonRainforest, and #AmazonFire.

    The goal of this data set is to detect whether a tweet is identified as a rumor or not (given by the 'label' column). A tweet that is identified as a rumor is labeled as 1, and 0 otherwise. The tweets were labeled by two independent annotators using the following guidelines. Whether a tweet is a rumor or not depends on three important aspects: (1) A rumor is a piece of information that is unverified or not confirmed by official instances. In other words, it does not matter whether the information turns out to be true or false in the future. (2) More specifically, a tweet is a rumor if the information is unverified at the time of posting. (3) For a tweet to be a rumor, it should contain an assertion, meaning the author of the tweet commits to the truth of the message.

    In sum, the annotators indicated that a tweet is a rumor if it consisted of an assertion giving information that is unverifiable at the time of posting. Practically, to check whether the information in a tweet was verified or confirmed by official instances at the moment of tweeting, the annotators used BBC News and Reuters. After all the tweets were labeled, the annotators re-iterated over the tweets they disagreed on to produce the final tweet label.

    Besides the label indicating whether a tweet is a rumor or not (i.e., ‘label’), the data set contains the tweet itself (i.e., ‘full_text’), and additional metadata (e.g., ‘created_at’, ‘favorite_count’). In total, the data set contains 1,392 observations of which 184 (13%) are identified as rumors.

    This data set can be used by researchers to make rumor detection models (i.e., statistical, machine learning and deep learning models) using both unstructured (i.e., textual) and structured data.
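
    Given the 'label' column described above (1 = rumor, 0 = otherwise), the rumor share can be computed with the standard library. This sketch assumes the data is exported as a CSV file; the actual file format on Mendeley Data may differ, so adapt the loader accordingly.

    ```python
    import csv

    def rumor_rate(csv_path, label_field="label"):
        """Compute the share of tweets labeled as rumors (label == 1).

        Assumes a CSV export with the 'label' column described above;
        adjust label_field if the real header differs.
        """
        with open(csv_path, newline="", encoding="utf-8") as f:
            labels = [int(row[label_field]) for row in csv.DictReader(f)]
        return sum(labels) / len(labels) if labels else 0.0
    ```

    On the full data set this should come out near the 13% rumor share reported in the description.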

  14. Fake-News-Dataset

    • kaggle.com
    Updated Apr 19, 2019
    Cite
    sumanthvrao (2019). Fake-News-Dataset [Dataset]. https://www.kaggle.com/sumanthvrao/fakenewsdataset/code
    Available download formats: Croissant
    Dataset updated
    Apr 19, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sumanthvrao
    Description

    Introduction

    This describes two fake news datasets covering seven different news domains. One of the datasets is collected by combining manual and crowdsourced annotation approaches (FakeNewsAMT), while the second is collected directly from the web (Celebrity).

    Data collection

    The FakeNewsAMT dataset contains news in six different domains: technology, education, business, sports, politics, and entertainment. The legitimate news included in the dataset was collected from a variety of mainstream news websites, predominantly in the US, such as ABCNews, CNN, USAToday, NewYorkTimes, FoxNews, Bloomberg, and CNET, among others. The fake news included in this dataset consists of fake versions of the legitimate news in the dataset, written using Mechanical Turk. More details on the data collection are provided in section 3 of the paper.

    The Celebrity dataset contains news about celebrities (actors, singers, socialites, and politicians). The legitimate news in the dataset was obtained from entertainment, fashion, and style sections of mainstream news websites and from entertainment magazine websites. The fake news was obtained from gossip websites such as Entertainment Weekly, People Magazine, RadarOnline, and other tabloid and entertainment-oriented publications. The news articles were collected in pairs, with one article being legitimate and the other fake (rumors and false reports). The articles were manually verified using gossip-checking sites such as "GossipCop.com", and also cross-referenced with information from other entertainment news sources on the web.

    The data directory contains two fake news datasets:

    • Celebrity The fake and legitimate news are provided in two separate folders. The fake and legitimate labels are also provided as part of the filename.

    • FakeNewsAMT The fake and legitimate news are provided in two separate folders. Each folder contains 40 news from six different domains: technology, education, business, sports, politics, and entertainment. The file names indicate the news domain: business (biz), education (edu), entertainment (entmt), politics (polit), sports (sports) and technology (tech). The fake and legitimate labels are also provided as part of the filename.

    Dataset citation :

    @article{Perez-Rosas18Automatic,
      author = {Ver\'{o}nica P\'{e}rez-Rosas and Bennett Kleinberg and Alexandra Lefevre and Rada Mihalcea},
      title = {Automatic Detection of Fake News},
      journal = {International Conference on Computational Linguistics (COLING)},
      year = {2018}
    }

  15. Datasets: fake news multimodal datasets (Twitter and Weibo). Credit: Data 1:...

    • figshare.com
    zip
    Updated Mar 1, 2025
    Cite
    Akinlolu Ojo (2025). Datasets: fake news multimodal datasets (Twitter and Weibo). Credit: Data 1: (Twitter dataset): The data that support the findings of this study are derived from “Detection and visualization of misleading content on Twitter” at https://github.com/MKLab-ITI/image-verification-corpus, DOI: "10.1007/s13735-017-0143-x." Data 2: (Weibo dataset): The data that support the findings of this study are derived from “EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection” at https://github.com/yaqingwang/EANN-KDD18?tab=readme-ov-file, DOI: “10.1145/3219819.3219903.” [Dataset]. http://doi.org/10.6084/m9.figshare.28516655.v2
    Available download formats: zip
    Dataset updated
    Mar 1, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Akinlolu Ojo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study proposes an innovative approach for multimodal fake news detection that utilizes a stick-breaking smoothed Dirichlet distribution. This approach enables the model to capture intricate, subtle interactions between modalities more effectively, thereby improving detection performance and enhancing the system's adaptability to various forms of fake news content.

  16. Overall performance of rumor detection task on Twitter15 and Twitter16

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Jiho Choi; Taewook Ko; Younhyuk Choi; Hyungho Byun; Chong-kwon Kim (2023). Overall performance of rumor detection task on Twitter15 and Twitter16. [Dataset]. http://doi.org/10.1371/journal.pone.0256039.t002
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jiho Choi; Taewook Ko; Younhyuk Choi; Hyungho Byun; Chong-kwon Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overall performance of rumor detection task on Twitter15 and Twitter16.

  17. WEIBO and TWITTER datasets for rumour detection

    • resodate.org
    Updated Nov 25, 2024
    Cite
    Wei Gao; Jing Li; Arkaitz Zubiaga; Binyang Li (2024). WEIBO and TWITTER datasets for rumour detection [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvd2VpYm8tYW5kLXR3aXR0ZXItZGF0YXNldHMtZm9yLXJ1bW91ci1kZXRlY3Rpb24=
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Wei Gao; Jing Li; Arkaitz Zubiaga; Binyang Li
    Description

    The dataset consists of social media posts from WEIBO and TWITTER used for early rumour detection, capturing different events, posts, and interactions of users participating in the propagation of rumours.

  18. Profiling Fake News Spreaders on Twitter

    • zenodo.org
    Updated Sep 20, 2020
    Cite
    FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU; FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU (2020). Profiling Fake News Spreaders on Twitter [Dataset]. http://doi.org/10.5281/zenodo.3692319
    Explore at:
    Dataset updated
    Sep 20, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU; FRANCISCO RANGEL; PAOLO ROSSO; BILAL GHANEM; ANASTASIA GIACHANOU
    Description

    Task

    Fake news has become one of the main threats to our society. Although fake news is not a new phenomenon, the exponential growth of social media has provided an easy platform for its fast propagation. A great amount of fake news and rumors is propagated in online social networks, usually with the aim of deceiving users and shaping specific opinions. Users play a critical role in the creation and propagation of fake news online by consuming and sharing articles with inaccurate information, either intentionally or unintentionally. To this end, in this task we aim at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.

    After having addressed several aspects of author profiling in social media from 2013 to 2019 (bot detection; age and gender, also together with personality; gender and language variety; and gender from a multimodal perspective), this year we aim at investigating whether it is possible to discriminate authors who have shared some fake news in the past from those who, to the best of our knowledge, have never done so.

    As in previous years, we propose the task from a multilingual perspective:

    • English
    • Spanish

    NOTE: Although we recommend participating in both languages (English and Spanish), it is possible to address the problem for just one language.

    Data

    Input

    The uncompressed dataset consists of one folder per language (en, es). Each folder contains:

    • An XML file per author (Twitter user) with 100 tweets. The name of the XML file corresponds to the unique author id.
    • A truth.txt file with the list of authors and the ground truth.

    The format of the XML files is:

      
       

    The format of the truth.txt file is as follows. The first column corresponds to the author id. The second column contains the truth label.

      b2d5748083d6fdffec6c2d68d4d4442d:::0
      2bed15d46872169dc7deaf8d2b43a56:::0
      8234ac5cca1aed3f9029277b2cb851b:::1
      5ccd228e21485568016b4ee82deb0d28:::0
      60d068f9cafb656431e62a6542de2dc0:::1
      ...
      

    Output

    Your software must take as input the absolute path to an unpacked dataset, and must output, for each document of the dataset, a corresponding XML file that looks like this:

      

    The naming of the output files is up to you. However, we recommend using the author id as filename and "xml" as extension.

    IMPORTANT! Languages should not be mixed. Create a folder for each language and place inside it only the files with the predictions for that language.

    Evaluation

    The performance of your system will be ranked by accuracy. For each language, we will calculate the accuracy in discriminating between the two classes. Finally, we will average the accuracy values across languages to obtain the final ranking.
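The ranking metric is simple enough to sketch directly: per-language accuracy, then the unweighted mean over languages. The dict-based data layout here is an illustrative assumption, not the official evaluator:

```python
# Sketch of the ranking: accuracy per language, averaged over languages.
# gold_by_lang / pred_by_lang are {language: {author_id: 0 or 1}} dicts
# (an assumed layout for illustration).
def accuracy(gold, pred):
    correct = sum(1 for author, label in gold.items() if pred.get(author) == label)
    return correct / len(gold)

def final_score(gold_by_lang, pred_by_lang):
    accs = [accuracy(gold_by_lang[lang], pred_by_lang[lang]) for lang in gold_by_lang]
    return sum(accs) / len(accs)
```

For example, with English accuracy 0.5 and Spanish accuracy 1.0, the final score is 0.75.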

    Submission

    Once you have finished tuning your approach on the validation set, your software will be tested on the test set. During the competition, the test set will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

    We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the test corpus and (ii) an absolute path to an empty output directory:

    mySoftware -i INPUT-DIRECTORY -o OUTPUT-DIRECTORY

    Within OUTPUT-DIRECTORY, we require two subfolders, en and es, one per language. As the provided output directory is guaranteed to be empty, your software needs to create those subfolders. Within each of these subfolders, you need to create one XML file per author. The XML file looks like this:

      

    The naming of the output files is up to you. However, we recommend using the author id as filename and "xml" as extension.

    Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

    Related Work

  19. Citation Trends for "Construction on Framework of Rumor Detection and Warning System Based on Web Mining Technology"

    • shibatadb.com
    Updated Jun 15, 2018
    Cite
    Yubetsu (2018). Citation Trends for "Construction on Framework of Rumor Detection and Warning System Based on Web Mining Technology" [Dataset]. https://www.shibatadb.com/article/PCmoaJ7G
    Explore at:
    Dataset updated
    Jun 15, 2018
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2019 - 2022
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Construction on Framework of Rumor Detection and Warning System Based on Web Mining Technology".

  20. Newly Emerged Rumors in Twitter

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 24, 2020
    Cite
    Amirhosein Bodaghi; Amirhosein Bodaghi (2020). Newly Emerged Rumors in Twitter [Dataset]. http://doi.org/10.5281/zenodo.2563864
    Explore at:
    bin (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amirhosein Bodaghi; Amirhosein Bodaghi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    *** Newly Emerged Rumors in Twitter ***

    These 12 datasets are the results of an empirical study of the spreading process of newly emerged rumors on Twitter. Newly emerged rumors are those whose rise and fall happen in a short period of time, in contrast to long-standing rumors. In particular, we have focused on those newly emerged rumors which simultaneously gave rise to an anti-rumor spreading against them. The story of each rumor is as follows:

    1- Dataset_R1 : The National Football League team in Washington D.C. changed its name to Redhawks.

    2- Dataset_R2 : A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.

    3- Dataset_R3 : Facebook CEO Mark Zuckerberg bought a "super-yacht" for $150 million.

    4- Dataset_R4 : Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."

    5- Dataset_R5 : Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.

    6- Dataset_R6 : Harley-Davidson's chief executive officer Matthew Levatich called President Trump "a moron."

    7- Dataset_R7 : The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.

    8- Dataset_R8 : Michael Jordan resigned from the board at Nike and took his Air Jordan line of apparel with him.

    9- Dataset_R9 : In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.

    10- Dataset_R10 : During confirmation hearings for Supreme Court nominee Brett Kavanaugh, congressional Democrats demanded that the nominee undergo DNA testing to prove he is not Adolf Hitler.

    11- Dataset_R11 : Singer Michael Bublé's upcoming album will be his last, as he is retiring from making music.

    12- Dataset_R12 : A screenshot from MyLife.com confirms that mail bomb suspect Cesar Sayoc was registered as a Democrat.

    The structure of the Excel files for each dataset is as follows:

    - Each row belongs to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about that tweet/retweet. From left to right, the columns are:

    - User ID (user who has posted the current tweet/retweet)

    - The description sentence in the profile of the user who has published the tweet/retweet

    - The number of tweets/retweets published by the user at the time of posting the current tweet/retweet

    - Date and time of creation of the account from which the current tweet/retweet was posted

    - Language of the tweet/retweet

    - Number of followers

    - Number of followings (friends)

    - Date and time of posting the current tweet/retweet

    - Number of likes (favorites) the current tweet had acquired before crawling it

    - Number of times the current tweet had been retweeted before crawling it

    - Whether another tweet is embedded inside the current tweet/retweet (for example, when the current tweet is a quote, reply, or retweet)

    - The source (OS) of the device from which the current tweet/retweet was posted

    - Tweet/Retweet ID

    - Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)

    - Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)

    - Reply ID (if the post is a reply then this feature gives the ID of the tweet that is replied by the current post)

    - Frequency of tweet occurrences, i.e. the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet exists in the dataset in the form of retweets posted by others)

    - State of the tweet, which can be one of the following (decided by agreement between the annotators):

    r : The tweet/retweet is a rumor post

    a : The tweet/retweet is an anti-rumor post

    q : The tweet/retweet is a question about the rumor, neither confirming nor denying it

    n : The tweet/retweet is not related to the rumor (it contains the queries related to the rumor but does not actually refer to it)
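Since the columns are described only by position, one way to tally the annotation states is to read each Excel file without a header row and address the state column by index. A minimal sketch assuming pandas and the column order listed above (the file path and function names are placeholders):

```python
import pandas as pd

# Sketch: load a Dataset_R* Excel file (rows = tweets, columns positional,
# no header row assumed) and count the annotation states r / a / q / n,
# which occupy the last column per the description above.
def load_rumor_dataset(xlsx_path):
    return pd.read_excel(xlsx_path, header=None)

def state_counts(df):
    return df.iloc[:, -1].value_counts().to_dict()
```

Addressing the state column as `df.iloc[:, -1]` avoids hard-coding a column count, which differs only if a file deviates from the layout above.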
