77 datasets found
  1. CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.5775511
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please fill out the form and upload the Data Sharing Agreement via the Google Form.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Subtask 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. The training data will be released in batches of roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Input Data

    The data will be provided in the format id, title, text, rating, domain; the columns are described as follows:

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Output data format

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
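A valid submission in the format above can be written with Python's csv module. This is a minimal sketch; the `predictions` dict and the output file name are illustrative, not part of the task specification:

```python
import csv

# Hypothetical predictions: public_id -> predicted class label
predictions = {1: "false", 2: "true"}

# Write the submission file with the required header row
with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in sorted(predictions.items()):
        writer.writerow([public_id, rating])
```

The same pattern applies to the domain submission, swapping the header to `public_id, predicted_domain`.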

    Additional data for Training

    To train their models, participants can use additional data in a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:

    IMPORTANT!

    1. We have used data from 2010 to 2021, and the fake news content covers several topics, such as elections, COVID-19, etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for ranking teams. There is a limit of 5 runs (in total, not per day), and only one person per team is allowed to submit runs.
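For reference, F1-macro averages the per-class F1 scores with equal weight for each class, regardless of class frequency. The organisers' own evaluation script is not shown here; the following is a pure-Python sketch of the measure, with illustrative labels:

```python
def f1_macro(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative example, not task data
y_true = ["false", "true", "partially false", "false"]
y_pred = ["false", "true", "false", "false"]
score = f1_macro(y_true, y_pred)  # ≈ 0.6
```

In practice, scikit-learn's `f1_score(y_true, y_pred, average="macro")` computes the same quantity.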

    Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

    Submission Link: Coming soon

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
    • Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
  2. Repository of fake news detection datasets

    • data.4tu.nl
    • figshare.com
    txt
    Updated Mar 18, 2021
    Cite
    Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni (2021). Repository of fake news detection datasets [Dataset]. http://doi.org/10.4121/14151755.v1
    Dataset provided by
    4TU.ResearchData
    Authors
    Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection, analysed according to eleven main characteristics (i.e., news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time).

  3. Data from: Fake news detection based on news content and social contexts: a...

    • omicsdi.org
    xml
    Updated Feb 27, 2024
    Cite
    Raza S (2024). Fake news detection based on news content and social contexts: a transformer-based approach [Dataset]. https://www.omicsdi.org/dataset/biostudies-literature/S-EPMC8800852
    Authors
    Raza S
    Variables measured
    Unknown
    Description

    Fake news is a real problem in today’s world, and it has become more extensive and harder to identify. A major challenge in fake news detection is to detect it in the early phase. Another challenge in fake news detection is the unavailability or the shortage of labelled data for training the detection models. We propose a novel fake news detection framework that can address these challenges. Our proposed framework exploits the information from the news articles and the social contexts to detect fake news. The proposed model is based on a Transformer architecture, which has two parts: the encoder part to learn useful representations from the fake news data and the decoder part that predicts the future behaviour based on past observations. We also incorporate many features from the news content and social contexts into our model to help us classify the news better. In addition, we propose an effective labelling technique to address the label shortage problem. Experimental results on real-world data show that our model can detect fake news with higher accuracy within a few minutes after it propagates (early detection) than the baselines.

  4. ‘Fake News Content Detection 📰’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 30, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Fake News Content Detection 📰’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-fake-news-content-detection-735a/62dc1870/?iid=000-709&v=presentation
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘Fake News Content Detection 📰’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/anmolkumar/fake-news-content-detection on 30 September 2021.

    --- Dataset description provided by original source is as follows ---

    Overview

    Welcome to another weekend hackathon. This weekend we are giving machine hackers a great opportunity to flex their NLP muscles again by building a fake content detection algorithm. Fake content is everywhere, from social media platforms to news platforms, and the list goes on. Given the advancements in NLP, research institutes are putting a lot of sweat, blood, and tears into detecting fake content generated across platforms.

    Fake news, defined by the New York Times as “a made-up story with an intention to deceive”, often for a secondary gain, is arguably one of the most serious challenges facing the news industry today. In a December Pew Research poll, 64% of US adults said that “made-up news” has caused a “great deal of confusion” about the facts of current events.

    In this hackathon, your goal as a data scientist is to create an NLP model to combat fake content problems. We believe that these AI technologies hold promise for significantly automating parts of the procedure human fact-checkers use today to determine whether a story is real or a hoax.

    Dataset Description:

    • Train.csv - 10240 rows x 3 columns (includes the Labels column as the target)
    • Test.csv - 1267 rows x 2 columns
    • Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission

    Attribute Description:

    • Text - Raw content from social media/news platforms
    • Text_Tag - Different types of content tags (9 unique tags)
    • Labels - Represents the various label classes:
    • Barely-True - 0
    • False - 1
    • Half-True - 2
    • Mostly-True - 3
    • Not-Known - 4
    • True - 5
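The numeric codes above can be captured in a small lookup table for decoding model outputs back into label names. A minimal sketch (the dict and function names are illustrative):

```python
# Mapping from the dataset's numeric label codes to class names,
# as listed in the attribute description above.
LABELS = {
    0: "Barely-True",
    1: "False",
    2: "Half-True",
    3: "Mostly-True",
    4: "Not-Known",
    5: "True",
}

def decode(codes):
    """Translate a sequence of numeric predictions into label names."""
    return [LABELS[c] for c in codes]
```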

    Skills:

    • NLP, Sentiment Analysis
    • Feature extraction from raw text using TF-IDF, CountVectorizer
    • Using Word Embedding to represent words as vectors
    • Using Pretrained models like Transformers, BERT
    • Optimizing multi-class log loss to generalize well on unseen data

    --- Original source retains full ownership of the source dataset ---

  5. FakeNewsNet

    • kaggle.com
    • dataverse.harvard.edu
    zip
    Updated Nov 2, 2018
    Cite
    Deepak Mahudeswaran (2018). FakeNewsNet [Dataset]. https://www.kaggle.com/mdepak/fakenewsnet
    Authors
    Deepak Mahudeswaran
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    FakeNewsNet

    This is a repository for an ongoing data collection project for fake news research at ASU. We describe and compare FakeNewsNet with other existing datasets in "Fake News Detection on Social Media: A Data Mining Perspective". We also perform a detailed analysis of the FakeNewsNet dataset and build a fake news detection model on it in "Exploiting Tri-Relationship for Fake News Detection".

    A JSON version of this dataset is available on GitHub. The new version of this dataset described in the FakeNewsNet paper will be published soon, or you can email the authors for more information.

    News Content

    It includes all the fake news articles, with the following news content attributes:

    1. source: the author or publisher of the news article.
    2. headline: the short text that aims to catch readers' attention and relates to the major topic of the article.
    3. body_text: the details of the news story. Usually there is a major claim that shapes the angle of the publisher and is specifically highlighted and elaborated upon.
    4. image_video: an important part of the body content of a news article, providing visual cues to frame the story.

    Social Context

    It includes the social engagements of fake news articles from Twitter. We extract profiles, posts, and social network information for all relevant users.

    1. user_profile: a set of profile fields that describe the user's basic information.
    2. user_content: the user's recent posts on Twitter.
    3. user_followers: the follower list of the relevant users.
    4. user_followees: the list of users followed by the relevant users.

    References

    If you use this dataset, please cite the following papers:

    @article{shu2017fake,
     title={Fake News Detection on Social Media: A Data Mining Perspective},
     author={Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan},
     journal={ACM SIGKDD Explorations Newsletter},
     volume={19},
     number={1},
     pages={22--36},
     year={2017},
     publisher={ACM}
    }

    @article{shu2017exploiting,
     title={Exploiting Tri-Relationship for Fake News Detection},
     author={Shu, Kai and Wang, Suhang and Liu, Huan},
     journal={arXiv preprint arXiv:1712.07709},
     year={2017}
    }

    @article{shu2018fakenewsnet,
     title={FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media},
     author={Shu, Kai and Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan},
     journal={arXiv preprint arXiv:1809.01286},
     year={2018}
    }

  6. MM-COVID Dataset

    • paperswithcode.com
    Updated Nov 4, 2021
    Cite
    Yichuan Li; Bohan Jiang; Kai Shu; Huan Liu (2021). MM-COVID Dataset [Dataset]. https://paperswithcode.com/dataset/mm-covid
    Authors
    Yichuan Li; Bohan Jiang; Kai Shu; Huan Liu
    Description

    MM-COVID is a dataset for fake news detection related to COVID-19. It provides multilingual fake news and the relevant social context. It contains 3,981 pieces of fake news content and 7,192 pieces of trustworthy information in six different languages: English, Spanish, Portuguese, Hindi, French, and Italian.

  7. BanFakeNews

    • live.european-language-grid.eu
    csv
    Updated Dec 30, 2020
    Cite
    (2020). BanFakeNews [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/4959
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A dataset for detecting fake news in Bangla. News articles were scraped from news portals in Bangladesh.

  8. Data from: On the Role of Images for Analyzing Claims in Social Media

    • zenodo.org
    Updated Apr 23, 2021
    Cite
    Gullal S. Cheema; Sherzod Hakimov; Eric Müller-Budack; Ralph Ewerth (2021). On the Role of Images for Analyzing Claims in Social Media [Dataset]. http://doi.org/10.5281/zenodo.4592249
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gullal S. Cheema; Sherzod Hakimov; Eric Müller-Budack; Ralph Ewerth
    Description

    This is a multimodal dataset used in the paper "On the Role of Images for Analyzing Claims in Social Media", accepted at CLEOPATRA-2021 (2nd International Workshop on Cross-lingual Event-centric Open Analytics), co-located with The Web Conference 2021.

    The four datasets are curated for two different tasks that broadly come under fake news detection. Originally, the datasets were released as part of challenges or papers for text-based NLP tasks and are further extended here with corresponding images.

    1. clef_en and clef_ar are English and Arabic Twitter datasets for claim check-worthiness detection, released in CLEF CheckThat! 2020 by Barrón-Cedeño et al. [1].
    2. lesa is an English Twitter dataset for claim detection released by Gupta et al. [2].
    3. mediaeval is an English Twitter dataset for conspiracy detection released at the MediaEval 2020 Workshop by Pogorelov et al. [3].

    Dataset details, such as the data curation and annotation process, can be found in the cited papers.

    The datasets released here with corresponding images are smaller than the original text-only datasets. The data statistics (number of tweets) are as follows:
    1. clef_en: 281
    2. clef_ar: 2571
    3. lesa: 1395
    4. mediaeval: 1724

    Each folder has two sub-folders and a JSON file, data.json, that contains the crawled tweets. The two sub-folders are:
    1. images: contains crawled images named after the corresponding tweet-id in data.json.
    2. splits: contains the 5-fold splits used for training and evaluation in our paper. Each file in this folder is a CSV with two columns.
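Assuming the two split columns are a tweet identifier and a label (the column names below are hypothetical; check the actual files), one fold can be loaded like this:

```python
import csv
import io

# Hypothetical contents of one split file; the real column names may differ.
sample_split = "tweet_id,label\n123456,1\n789012,0\n"

def load_split(fileobj):
    """Read one fold's CSV into a list of (tweet_id, label) pairs."""
    reader = csv.DictReader(fileobj)
    return [(row["tweet_id"], int(row["label"])) for row in reader]

pairs = load_split(io.StringIO(sample_split))
```

With a real file, pass `open("splits/fold_0.csv")` (a hypothetical file name) instead of the `StringIO` stand-in.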

    Code for the paper: https://github.com/cleopatra-itn/image_text_claim_detection

    If you find the dataset and the paper useful, please cite our paper and the corresponding dataset papers [1,2,3]:
    Cheema, Gullal S., et al. "On the Role of Images for Analyzing Claims in Social Media." 2nd International Workshop on Cross-lingual Event-centric Open Analytics (CLEOPATRA), co-located with The Web Conference 2021.

    [1] Barrón-Cedeno, Alberto, et al. "Overview of CheckThat! 2020: Automatic identification and verification of claims in social media." International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham, 2020.
    [2] Gupta, Shreya, et al. "LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content." arXiv preprint arXiv:2101.11891 (2021).
    [3] Pogorelov, Konstantin, et al. "FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020." MediaEval 2020 Workshop. 2020.

  9. Some Like it Hoax

    • live.european-language-grid.eu
    json
    Updated Dec 30, 2017
    Cite
    (2017). Some Like it Hoax [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/5091
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The dataset contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific).

  10. CT-FAN: A Multilingual dataset for Fake News Detection

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.6555293
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel
    Description

    By downloading the data, you agree with the terms & conditions mentioned below:

    Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

    Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try to identify the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

    We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

    Citation

    Please cite our work as

    @InProceedings{clef-checkthat:2022:task3,
    author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
    title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
    year = {2022},
    booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
    series = {CLEF~'2022},
    address = {Bologna, Italy},}
    
    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Task 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches of roughly 1264 articles with their respective labels in English. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Cross-Lingual Task (German)

    Along with the multi-class task for English, we have introduced a task for a low-resource language. We will provide test data in German. The idea of the task is to use the English data and the concept of transfer learning to build a classification model for German.

    Input Data

    The data will be provided in the format id, title, text, rating, domain; the columns are described as follows:

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Output data format

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    IMPORTANT!

    1. We have used data from 2010 to 2022, and the fake news content covers several topics, such as elections, COVID-19, etc.

    Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
    • Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
  11. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement, sign it, and send the signed form to fakenewstask@gmail.com.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. The training data will be released in batches of roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem: the task is to categorise fake news articles into six topical categories like health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format id, title, text, rating, domain; the columns are described as follows:

    Task 3a

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Task 3b

    • public_id- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • domain - domain of the given news article (applicable only for task B)

    Output data format

    Task 3a

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id- Unique identifier of the news article
    • predicted_domain- predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
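    The two sample files above can be produced with Python's standard csv module; the file names and example predictions here are placeholders:

```python
import csv

def write_submission(path, header, rows):
    """Write a two-column submission file in the required CSV format."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Placeholder predictions for the two subtasks.
write_submission("subtask3a.csv", ["public_id", "predicted_rating"],
                 [(1, "false"), (2, "true")])
write_submission("subtask3b.csv", ["public_id", "predicted_domain"],
                 [(1, "health"), (2, "crime")])
```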

    Additional data for Training

    To train their models, participants may use additional data in a similar format; several such datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

    IMPORTANT!

    1. The fake news articles used for Task 3b are a subset of those used for Task 3a.
    2. The data span 2010 to 2021, and the fake news content covers several mixed topics such as elections and COVID-19.

    Evaluation Metrics

    This task is evaluated as a classification task. Teams will be ranked by the macro-averaged F1 measure. There is a limit of 5 runs in total (not per day), and only one person from each team is allowed to submit runs.
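    Macro-averaged F1 computes a per-class F1 score and averages the scores with equal weight per class, so rare classes count as much as frequent ones. A minimal pure-Python sketch (the official scorer may differ in edge-case handling):

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy gold labels and predictions using the four task classes.
gold = ["false", "true", "partially false", "other", "false"]
pred = ["false", "true", "partially false", "false", "true"]
score = f1_macro(gold, pred)
```

    In practice, scikit-learn's f1_score(y_true, y_pred, average="macro") computes the same measure.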

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi, G. K. (2020). AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. https://arxiv.org/pdf/2010.00502.pdf
    • Shahi, G. K., & Nandini, D. (2020). FakeCovid – A Multilingual Cross-domain Fact Check News Dataset for COVID-19. In Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of COVID-19 misinformation on Twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
  12. UPFD Dataset

    • paperswithcode.com
    • trends.openbayes.com
    Updated Apr 24, 2021
    Cite
    Yingtong Dou; Kai Shu; Congying Xia; Philip S. Yu; Lichao Sun (2021). UPFD Dataset [Dataset]. https://paperswithcode.com/dataset/upfd
    Explore at:
    Dataset updated
    Apr 24, 2021
    Authors
    Yingtong Dou; Kai Shu; Congying Xia; Philip S. Yu; Lichao Sun
    Description

    For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.

    The dataset has been integrated with PyTorch Geometric (PyG) and Deep Graph Library (DGL). You can load it after installing the latest version of either library.

    The UPFD dataset includes two sets of tree-structured graphs curated for evaluating binary graph classification, graph anomaly detection, and fake/real news detection tasks. The dataset is distributed as a PyTorch Geometric dataset object, so you can easily load the data and run various GNN models using PyG.

    The dataset includes fake and real news propagation (retweet) networks on Twitter, built according to fact-check information from Politifact and Gossipcop. The news retweet graphs were originally extracted by FakeNewsNet. Each graph is a hierarchical tree where the root node represents the news and the leaf nodes are Twitter users who retweeted it. A user node has an edge to the news node if the user retweeted the news tweet, and two user nodes have an edge if one retweeted the news tweet from the other.
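    The edge construction described above can be sketched as follows; the (user, source) record format is a hypothetical stand-in for FakeNewsNet's actual schema:

```python
def build_propagation_edges(retweets):
    """Build directed edges of a news propagation tree.

    Each record is a (user, source) pair, where source is either the news
    id (the user retweeted the original news tweet) or another user (the
    user retweeted that user's retweet). An edge points source -> user.
    """
    return [(source, user) for user, source in retweets]

# Hypothetical cascade: u1 and u2 retweet news n1; u3 retweets from u1.
edges = build_propagation_edges([("u1", "n1"), ("u2", "n1"), ("u3", "u1")])
```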

    We crawled nearly 20 million historical tweets from users who participated in fake news propagation in FakeNewsNet to generate node features. The dataset provides four node feature types: the 768-dimensional bert and 300-dimensional spacy features are encoded using pretrained BERT and spaCy word2vec, respectively; the 10-dimensional profile feature is obtained from a Twitter account's profile (see profile_feature.py for the extraction); and the 310-dimensional content feature concatenates a 300-dimensional user-comment word2vec (spaCy) embedding with the 10-dimensional profile feature.

    The dataset statistics are shown below:

    Data        #Graphs  #Fake News  #Total Nodes  #Total Edges  Avg. Nodes per Graph
    Politifact  314      157         41,054        40,740        131
    Gossipcop   5,464    2,732       314,262       308,798       58

    Please refer to the paper for more details about the UPFD dataset.

    Due to the Twitter policy, we could not release the crawled users' historical tweets publicly. To get the corresponding Twitter user information, you can refer to the news lists under \data in our github repo and map the news id to FakeNewsNet. Then, you can crawl the user information by following the instructions on FakeNewsNet. In the UPFD project, we use Tweepy and the Twitter Developer API to get the user information.

  13. WELFake dataset for fake news detection in text data

    • zenodo.org
    csv
    Updated Apr 9, 2021
    Cite
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan (2021). WELFake dataset for fake news detection in text data [Dataset]. http://doi.org/10.5281/zenodo.4561253
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles, comprising 35,028 real and 37,106 fake articles. For this, we merged four popular news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political) to prevent classifier over-fitting and to provide more text data for better ML training.

    The dataset contains four columns: Serial number (starting from 0), Title (the news heading), Text (the news content), and Label (0 = fake, 1 = real).

    The CSV file contains 78,098 rows, of which only 72,134 are accessible when loaded as a data frame.
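    Reading the file with the described label convention can be sketched as follows; the header names and sample rows below are assumptions for illustration, not the file's exact contents:

```python
import csv
import io

# Hypothetical two-row sample mirroring the described column layout
# (Serial number, Title, Text, Label with 0 = fake, 1 = real).
sample = """serial,title,text,label
0,Title A,Some fabricated story text,0
1,Title B,Some verified report text,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))
labels = ["fake" if r["label"] == "0" else "real" for r in rows]
n_fake = labels.count("fake")
```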

    This dataset is a part of our ongoing research on "Fake News Prediction on Social Media Website" as a doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union’s Horizon 2020 research and innovation program.

  14. Fake News Content Detection

    • kaggle.com
    zip
    Updated Sep 15, 2020
    Cite
    Ganesh (2020). Fake News Content Detection [Dataset]. https://www.kaggle.com/ganeshmg/fake-news-content-detection
    Explore at:
    Available download formats: zip (573472 bytes)
    Dataset updated
    Sep 15, 2020
    Authors
    Ganesh
    Description

    Dataset

    This dataset was created by Ganesh

    Contents

  15. fake_news_english

    • huggingface.co
    • opendatalab.com
    Cite
    The HF Datasets community, fake_news_english [Dataset]. https://huggingface.co/datasets/fake_news_english
    Explore at:
    Dataset authored and provided by
    The HF Datasets community
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Fake news has become a major societal issue and a technical challenge for social media companies to identify. This content is difficult to identify because the term "fake news" covers intentionally false, deceptive stories as well as factual errors, satire, and sometimes, stories that a person just does not like. Addressing the problem requires clear definitions and examples. In this work, we present a dataset of fake news and satire stories that are hand coded, verified, and, in the case of fake news, include rebutting stories. We also include a thematic content analysis of the articles, identifying major themes that include hyperbolic support or condemnation of a figure, conspiracy theories, racist themes, and discrediting of reliable sources. In addition to releasing this dataset for research use, we analyze it and show results based on language that are promising for classification purposes. Overall, our contribution of a dataset and initial analysis are designed to support future work by fake news researchers.

  16. Data from: Do You Speak Disinformation? Computational Detection of Deceptive...

    • tandf.figshare.com
    txt
    Updated Feb 15, 2024
    Cite
    Noëlle Lebernegg; Jakob-Moritz Eberl; Petro Tolochko; Hajo Boomgaarden (2024). Do You Speak Disinformation? Computational Detection of Deceptive News-Like Content Using Linguistic and Stylistic Features [Dataset]. http://doi.org/10.6084/m9.figshare.25225350.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Noëlle Lebernegg; Jakob-Moritz Eberl; Petro Tolochko; Hajo Boomgaarden
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Amid growing concerns about the proliferation and belief in false or misleading information, the study addresses the need for automated detection in the public domain. It revisits and replicates scattered findings using a comprehensive, content-oriented, and feature-based approach. This method reliably identifies deceptive news-like content and highlights the importance of individual features in guiding the prediction algorithm. Employing explainable machine learning, the study explores content patterns for disinformation detection. Results from a tree-based approach on real-world data indicate that content-related characteristics can—when used in combination—facilitate the early detection of deceptive news-like articles. The study concludes by discussing the practical implications of computationally detecting the malicious language of disinformation.

  17. Global Fake Image Detection Market Size By Component (Software, Services),...

    • verifiedmarketresearch.com
    Updated Apr 8, 2024
    Cite
    VERIFIED MARKET RESEARCH (2024). Global Fake Image Detection Market Size By Component (Software, Services), By Application (Incident Reporting, Cyber Defense), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/fake-image-detection-market/
    Explore at:
    Dataset updated
    Apr 8, 2024
    Dataset provided by
    Verified Market Research (https://www.verifiedmarketresearch.com/)
    Authors
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2024 - 2031
    Area covered
    Global
    Description

    Fake Image Detection Market size was valued at USD 964.45 Million in 2023 and is projected to reach USD 4,107.03 Million by 2031, growing at a CAGR of 23.00% from 2024 to 2031.

    Global Fake Image Detection Market Overview

    The widespread availability of image editing software and social media platforms has led to a surge in fake images, including digitally altered photos and manipulated visual content. This trend has fueled the demand for advanced detection solutions capable of identifying and flagging fake images in real-time. With the proliferation of fake news and misinformation online, there is an increasing awareness among consumers, businesses, and governments about the importance of combating digital fraud and preserving the authenticity of visual content. This heightened concern is driving investments in fake image detection technologies to mitigate the risks associated with misinformation.

    However, despite advancements in AI and ML, detecting fake images remains a complex and challenging task, especially when dealing with sophisticated techniques such as deepfakes and generative adversarial networks (GANs). Developing robust detection algorithms capable of identifying increasingly sophisticated forms of image manipulation poses a significant challenge for researchers and developers. The deployment of fake image detection technologies raises concerns about privacy and data ethics, particularly regarding the collection and analysis of visual content shared online. Balancing the need for effective detection with respect for user privacy and ethical considerations remains a key challenge for stakeholders in the Fake Image Detection Market.

  18. Fake News Detection

    • kaggle.com
    zip
    Updated Oct 12, 2021
    Cite
    Raj Jain (2021). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/rpjain55/fake-news-detection
    Explore at:
    Available download formats: zip (1487878 bytes)
    Dataset updated
    Oct 12, 2021
    Authors
    Raj Jain
    Description

    Dataset

    This dataset was created by Raj Jain

    Contents

  19. Multi-Fake-DetectiVE - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Jan 31, 2024
    Cite
    (2024). Multi-Fake-DetectiVE - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/957d18c1-c5b3-5b47-9524-95d9fea1d021
    Explore at:
    Dataset updated
    Jan 31, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset includes social media posts and news articles, each containing both a textual and a visual component, concerning the Ukrainian-Russian war that started in February 2022. The dataset was collected to support two distinct sub-tasks: Multimodal Fake News Detection, and Cross-modal Relation Classification in fake and real news. Given a piece of content (e.g., a social media post or a news article) that includes both a visual and a textual component, the first sub-task aims to detect whether the content is real or fake news. The second sub-task aims to understand how the visual and textual components of news can influence each other: given a text and an accompanying image, the sub-task determines whether their combination aims to mislead the reader's interpretation of one or the other, or not. The data for the two sub-tasks are stored in two separate sub-folders. Each sub-folder includes: (i) a training set, which contains data collected from February 2022 to September 2022, (ii) a contemporary test set, which includes data collected in the same time window as the training set, and (iii) a future test set, which contains data collected in a subsequent time window, specifically from October 2022 to December 2022.
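    The temporal partition described above can be sketched with standard-library dates. The record format and boundary handling below are illustrative assumptions, and the further split of the February-September window into training vs. contemporary test is not modeled here:

```python
from datetime import date

def temporal_split(records):
    """Split (date, item) records into the Feb-Sep 2022 collection window
    and the Oct-Dec 2022 future-test window; anything outside is dropped."""
    window_start, window_end = date(2022, 2, 1), date(2022, 9, 30)
    future_end = date(2022, 12, 31)
    in_window, future = [], []
    for d, item in records:
        if window_start <= d <= window_end:
            in_window.append(item)
        elif window_end < d <= future_end:
            future.append(item)
    return in_window, future

# Hypothetical posts with their collection dates.
recs = [(date(2022, 3, 5), "post1"), (date(2022, 11, 2), "post2")]
in_window, future = temporal_split(recs)
```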

  20. UPFD-GOS (User Preference-aware Fake News Detection)

    • opendatalab.com
    zip
    Updated Apr 18, 2023
    Cite
    Lehigh University (2023). UPFD-GOS (User Preference-aware Fake News Detection) [Dataset]. https://opendatalab.com/OpenDataLab/UPFD-GOS
    Explore at:
    Available download formats: zip (1601216611 bytes)
    Dataset updated
    Apr 18, 2023
    Dataset provided by
    Illinois Institute of Technology
    Lehigh University
    University of Illinois at Chicago
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The description and dataset statistics duplicate those of the UPFD Dataset entry above; see that entry for details on graph construction, node features, statistics, and how to obtain the Twitter user information.

Cite
Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni (2021). Repository of fake news detection datasets [Dataset]. http://doi.org/10.4121/14151755.v1

Repository of fake news detection datasets

Explore at:
5 scholarly articles cite this dataset
Available download formats: txt
Dataset updated
Mar 18, 2021
Dataset provided by
4TU.ResearchData
Authors
Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni
License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection, analysed according to eleven main characteristics (i.e., news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time).
