100+ datasets found
  1. Fake News Detection

    • kaggle.com
    zip
    Updated Dec 17, 2023
    Cite
    Bhavik Jikadara (2023). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/fake-news-detection
    Explore at:
    Available download formats: zip (42975911 bytes)
    Dataset updated
    Dec 17, 2023
    Authors
    Bhavik Jikadara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fake news detection is a process that involves analyzing news content to determine its truthfulness. It is a subtask of text classification, and is defined as the task of classifying news as real or fake.

    GitHub link: https://github.com/Bhavik-Jikadara/Fake-News-Detection

  2. English Fake News Detection Dataset

    • kaggle.com
    zip
    Updated Aug 7, 2025
    Cite
    Arif Miah (2025). English Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/miadul/english-fake-news-detection-dataset
    Explore at:
    Available download formats: zip (17019 bytes)
    Dataset updated
    Aug 7, 2025
    Authors
    Arif Miah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📰 English Fake News Detection Dataset (Synthetic, 2,212 rows)

    📌 Dataset Summary

    This is a synthetically generated but realistic dataset created for the purpose of training and evaluating machine learning models to detect fake vs real news articles in English. The dataset mimics real-world news reporting formats and includes fabricated content with varied sources and tones.

    📊 Dataset Size

    • Rows (News Articles): 2,212
    • Columns: 5

      • news_id: Unique identifier for each news article
      • headline: The title or headline of the article
      • body_text: The main content/body of the news
      • source: The source or publisher of the article (e.g., BBC, Unknown News)
      • label: Ground truth label — either "Fake" or "Real"

    📁 Column Descriptions

    Column Name | Type    | Description
    news_id     | Integer | Unique ID for each article
    headline    | String  | A short headline summarizing the news
    body_text   | String  | The full body or main content of the article
    source      | String  | The news publisher/source name (e.g., BBC, CNN, Unknown News)
    label       | String  | "Fake" or "Real"; indicates whether the article is fabricated or not

    🔍 Use Cases

    • Fake news detection using machine learning or NLP
    • Feature engineering on combined text fields (headline + body)
    • Model comparison: TF-IDF + RandomForest vs Deep Learning (LSTM, BERT)
    • Real vs Fake content classification using classical and modern techniques

    💡 Why This Dataset?

    • Clean, ready-to-use structure for binary classification tasks
    • Simulates realistic headline–body–source combinations
    • Can be expanded into multilingual datasets (Bangla, etc.)
    • Great for building ML/NLP portfolios

    📚 Example Use Case (ML Pipeline)

    1. Combine headline + body_text as input features
    2. Vectorize using TF-IDF or Word Embeddings
    3. Train classifiers like:

      • Random Forest
      • Logistic Regression
      • LSTM / GRU
      • BERT (fine-tuning with HuggingFace)
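    The three pipeline steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the dataset author's code: the four sample rows are invented for the example, the helper names (combine, fit_tfidf, transform, train_logreg, predict) are hypothetical, and logistic regression fitted by gradient descent stands in for whichever classifier is chosen. Only the column names headline, body_text, and label come from the description.

```python
import math
from collections import Counter

# Invented sample rows that reuse the dataset's column names (assumption).
rows = [
    {"headline": "Miracle cure shocks doctors",
     "body_text": "Secret miracle pill cures everything overnight", "label": "Fake"},
    {"headline": "Aliens endorse candidate",
     "body_text": "Secret sources claim shocking miracle endorsement", "label": "Fake"},
    {"headline": "Council approves budget",
     "body_text": "The city council approved the annual budget on Tuesday", "label": "Real"},
    {"headline": "Bank holds rates",
     "body_text": "The central bank kept interest rates unchanged on Tuesday", "label": "Real"},
]
LABELS = {"Fake": 1, "Real": 0}

def combine(row):
    """Step 1: join headline and body_text into one input string."""
    return f"{row['headline']} {row['body_text']}"

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def fit_tfidf(docs):
    """Learn a vocabulary index and smoothed IDF weights from tokenized docs."""
    df = Counter(tok for doc in docs for tok in set(doc))
    vocab = {tok: i for i, tok in enumerate(sorted(df))}
    n = len(docs)
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    return vocab, idf

def transform(doc, vocab, idf):
    """Step 2: map one tokenized doc to an L2-normalised TF-IDF vector."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(doc).items():
        if tok in vocab:
            vec[vocab[tok]] = count * idf[tok]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def train_logreg(X, y, epochs=300, lr=0.5):
    """Step 3: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(x, w, b):
    """Return 1 (Fake) when the linear score is positive, else 0 (Real)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

docs = [tokenize(combine(r)) for r in rows]
vocab, idf = fit_tfidf(docs)
X = [transform(d, vocab, idf) for d in docs]
y = [LABELS[r["label"]] for r in rows]
w, b = train_logreg(X, y)
predictions = [predict(x, w, b) for x in X]
print(predictions)
```

    In practice a library vectorizer and classifier would likely replace these hand-written helpers, and predictions would be checked on held-out rows rather than on the training rows used here.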

    ⚠️ Note

    This dataset is synthetic and should not be used for production-level decision-making. It is meant solely for research, academic projects, and model experimentation.

  3. Image and Text Fake News Detection Dataset

    • figshare.com
    zip
    Updated May 2, 2025
    Cite
    Esther Irawati Setiawan (2025). Image and Text Fake News Detection Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28735676.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 2, 2025
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Esther Irawati Setiawan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains multimodal content (images and text) from two sources:

    • Fakeddit Subset: A collection of social media posts (primarily from Reddit) that often include misleading or questionable content.
    • Snopes Crawled Data (Medical Fake News Only): Fact-checking information focused solely on medical misinformation, as curated and verified by Snopes.

  4. News Detection (Fake or Real) Dataset

    • kaggle.com
    zip
    Updated Apr 17, 2024
    Cite
    Nitish Jolly (2024). News Detection (Fake or Real) Dataset [Dataset]. https://www.kaggle.com/datasets/nitishjolly/news-detection-fake-or-real-dataset
    Explore at:
    Available download formats: zip (9823999 bytes)
    Dataset updated
    Apr 17, 2024
    Authors
    Nitish Jolly
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Fake News Detection Dataset is created to assist researchers, data scientists, and machine learning enthusiasts in tackling the challenge of distinguishing between genuine and false information in today's digital landscape inundated with social media and online channels. With thousands of news items labeled as either "Fake" or "Real," this dataset provides a robust foundation for training and testing machine learning models aimed at automatically detecting deceptive content.

    Each entry in the dataset contains the full text of a news article alongside its corresponding label, facilitating the development of supervised learning projects. The inclusion of various types of content within the news articles, ranging from factual reporting to potentially misleading information or falsehoods, offers a comprehensive resource for algorithmic training.

    The dataset's structure, with a clear binary classification of news articles as either "Fake" or "Real," enables the exploration of diverse machine learning approaches, from traditional methods to cutting-edge deep learning techniques.

    By offering an accessible and practical dataset, the Fake News Detection Dataset aims to stimulate innovation in the ongoing battle against online misinformation. It serves as a catalyst for research and development within the realms of text analysis, natural language processing, and machine learning communities. Whether it's refining feature engineering, experimenting with state-of-the-art transformer models, or creating educational tools to enhance understanding of fake news, this dataset serves as an invaluable starting point for a wide range of impactful projects.

  5. Machine Learning Frameworks for Fake News Detection and Datasets

    • dataverse.nl
    rar, text/markdown
    Updated Oct 30, 2024
    Cite
    Fadi Mohsen; Bedir Chaushi; Hamed Abdelhaq; Kevin Wang (2024). Machine Learning Frameworks for Fake News Detection and Datasets [Dataset]. http://doi.org/10.34894/CUCITF
    Explore at:
    Available download formats: rar (133821784), text/markdown (6091)
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    DataverseNL
    Authors
    Fadi Mohsen; Bedir Chaushi; Hamed Abdelhaq; Kevin Wang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is a web framework designed for researchers to perform comparative analysis of various machine learning algorithms in the context of fake news detection. The folder also includes several datasets for experimentation, alongside the source code.

    The rise of social media has transformed the landscape of news dissemination, presenting new challenges in combating the spread of fake news. This study addresses the automated detection of misinformation within written content, a task that has prompted extensive research efforts across various methodologies. We evaluate existing benchmarks, introduce a novel hybrid word embedding model, and implement a web framework for text classification. Our approach integrates traditional term frequency–inverse document frequency (TF–IDF) methods with sophisticated feature extraction techniques, considering linguistic, psychological, morphological, and grammatical aspects of the text. Through a series of experiments on diverse datasets, applying transfer and incremental learning techniques, we demonstrate the effectiveness of our hybrid model in surpassing benchmarks and outperforming alternative experimental setups. Furthermore, our findings emphasize the importance of dataset alignment and balance in transfer learning, as well as the utility of incremental learning in maintaining high detection performance while reducing runtime. This research offers promising avenues for further advancements in fake news detection methodologies, with implications for future research and development in this critical domain.

  6. CT-FAN: A Multilingual dataset for Fake News Detection

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.6555293
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel
    Description

    By downloading the data, you agree with the terms & conditions mentioned below:

    Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

    Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

    We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

    Citation

    Please cite our work as

    @InProceedings{clef-checkthat:2022:task3,
    author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
    title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
    year = {2022},
    booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
    series = {CLEF~'2022},
    address = {Bologna, Italy},}
    
    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Task 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem: given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 1,264 English-language articles with their respective labels. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
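    To make the category definitions above concrete, here is a small sketch that folds fact-checker verdicts into the four classes. The function name to_four_class and the exact verdict strings are hypothetical; the grouping follows the definitions in the list (for example, "mostly true" and "misleading" fall under Partially False), and any verdict not covered is rejected rather than guessed.

```python
# Verdicts that the definitions place under "Partially False".
PARTIALLY_FALSE = {
    "partially false", "partially true", "mostly true",
    "miscaptioned", "misleading",
}
# Verdicts that the definitions place under "Other" (exact strings are assumed).
OTHER = {"in dispute", "unproven"}

def to_four_class(verdict: str) -> str:
    """Map a raw fact-checker verdict to one of the four task labels."""
    key = verdict.strip().lower()
    if key in ("true", "false"):
        return key
    if key in PARTIALLY_FALSE:
        return "partially false"
    if key in OTHER:
        return "other"
    # Unknown verdicts need a human decision, so fail loudly.
    raise ValueError(f"unmapped verdict: {verdict!r}")

print(to_four_class("Mostly True"))
```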

    Cross-Lingual Task (German)

    Along with the multi-class task for the English language, we have introduced a task for a low-resource language. We will provide the test data in German. The idea of the task is to use the English data and the concept of transfer to build a classification model for the German language.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    • ID - Unique identifier of the news article
    • Title - Title of the news article
    • text - Text mentioned inside the news article
    • our rating - Class of the news article: false, partially false, true, or other

    Output data format

    • public_id - Unique identifier of the news article
    • predicted_rating - Predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true
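    A submission file in the format shown above could be produced along these lines. This is a sketch only: the function name write_submission is hypothetical, the set of allowed ratings is taken from the four categories defined earlier, and the csv module writes no space after the comma, so whether the organisers require that space is left open.

```python
import csv
import io

# The four classes named in the task description.
ALLOWED_RATINGS = {"false", "partially false", "true", "other"}

def write_submission(predictions, handle):
    """Write (public_id, predicted_rating) pairs as a CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        if rating not in ALLOWED_RATINGS:
            raise ValueError(f"unknown rating for id {public_id}: {rating!r}")
        writer.writerow([public_id, rating])

# Write to an in-memory buffer; a real run would pass an open file instead.
buffer = io.StringIO()
write_submission([(1, "false"), (2, "true")], buffer)
output = buffer.getvalue()
print(output)
```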

    IMPORTANT!

    1. We have used the data from 2010 to 2022, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

    Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
    • Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
  7. CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 6, 2022
    Cite
    Shahi Gautam Kishore; Struß Julia Maria; Thomas Mandl (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5775507
    Explore at:
    Dataset updated
    Jan 6, 2022
    Dataset provided by
    University of Applied Sciences Potsdam
    University of Hildesheim
    University of Duisburg-Essen, Germany
    Authors
    Shahi Gautam Kishore; Struß Julia Maria; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Subtask 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    False - The main claim made in an article is untrue.

    Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    True - This rating indicates that the primary elements of the main claim are demonstrably true.

    Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    Task 3

    ID- Unique identifier of the news article

    Title- Title of the news article

    text- Text mentioned inside the news article

    our rating - class of the news article as false, partially false, true, other

    Output data format

    Task 3

    public_id- Unique identifier of the news article

    predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Sample file

    public_id, predicted_domain
    1, health
    2, crime

    Additional data for Training

    To train your model, participants can use additional data with a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:

    Fakenews Classification Datasets

    Fake News Detection Challenge KDD 2020

    FakeNewsNet

    IMPORTANT!

    We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
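    For reference, the macro-averaged F1 measure mentioned above can be computed as in the sketch below: F1 is calculated per class and the per-class scores are averaged with equal weight. The function name macro_f1 and the toy labels are illustrative, and scoring a class with no true or predicted instances as 0 is an assumption; the organisers' scorer may handle that case differently.

```python
def macro_f1(y_true, y_pred, labels):
    """Average the per-class F1 scores, giving every class equal weight."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Assumption: a class that never occurs in either list scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

LABELS = ["false", "partially false", "true", "other"]
y_true = ["false", "false", "true", "other"]
y_pred = ["false", "true", "true", "other"]
score = macro_f1(y_true, y_pred, LABELS)
print(round(score, 4))
```

    Because every class counts equally, a rare class such as "other" influences the ranking as much as a frequent one.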

    Submission Link: Coming soon

    Related Work

    Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

    Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

    Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.

    Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

    G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

    Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

  8. Real vs Fake News Story Detection Dataset

    • kaggle.com
    zip
    Updated Aug 8, 2025
    Cite
    Barkat Ali Arbab (2025). Real vs Fake News Story Detection Dataset [Dataset]. https://www.kaggle.com/datasets/barkataliarbab/real-vs-fake-news-story-detection-dataset
    Explore at:
    Available download formats: zip (2535 bytes)
    Dataset updated
    Aug 8, 2025
    Authors
    Barkat Ali Arbab
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Real vs Fake News Story Detection Dataset is a curated collection of labeled news articles designed to support research and development of automated fake news detection systems. The dataset contains both real and fake news stories, enabling data scientists, researchers, and machine learning practitioners to build, train, and evaluate classification models for misinformation detection.

    This dataset is suitable for tasks such as binary classification, natural language processing (NLP), and text mining, and can be used to benchmark models in academic or applied settings.

  9. Fake News Detection

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    KranNaik777 (2025). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/krannaik777/train-news
    Explore at:
    Available download formats: zip (38846301 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    KranNaik777
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The fake news detection dataset used in this project contains labeled news articles categorized as either "fake" or "real." These articles have been collected from credible real-world sources and fact-checking websites, ensuring diverse and high-quality data. The dataset includes textual features such as the news content, along with metadata like publication date, author, and source details. On average, articles vary in length, providing a rich linguistic variety for model training. The dataset is balanced to minimize bias between fake and real news categories, supporting robust classification. It often contains thousands to hundreds of thousands of articles, enabling effective machine learning model development and evaluation. Additionally, some versions of the dataset may also include image URLs for multimodal analysis, expanding the detection capability beyond text alone. This comprehensive dataset plays a critical role in training and validating the fake news detection model used in this project.

    Here is a description for each column header of the fake news dataset:

    id: A unique identifier assigned to each news article in the dataset for easy reference and indexing.

    headline: The title or headline of the news article, summarizing the key news story in brief.

    written by: The author or journalist who wrote the news article; this may sometimes be missing or anonymized.

    news: The full text content of the news article, which is the main body used for analysis and classification.

    label: The classification label indicating the authenticity of the news article, typically a binary value such as "fake" or "real" (or 0 for real and 1 for fake), indicating whether the news is deceptive or truthful.

    This detailed column description provides clarity on the structure and contents of the dataset used for fake news detection modeling.

  10. Fake News detection

    • kaggle.com
    zip
    Updated Dec 7, 2017
    Cite
    jruvika (2017). Fake News detection [Dataset]. https://www.kaggle.com/datasets/jruvika/fake-news-detection
    Explore at:
    Available download formats: zip (5123662 bytes)
    Dataset updated
    Dec 7, 2017
    Authors
    jruvika
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by jruvika

    Released under Database: Open Database, Contents: © Original Authors


  11. Data from: On the Role of Images for Analyzing Claims in Social Media

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Apr 23, 2021
    Cite
    Cheema, Gullal S.; Hakimov, Sherzod; Müller-Budack, Eric; Ewerth, Ralph (2021). On the Role of Images for Analyzing Claims in Social Media [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4592248
    Explore at:
    Dataset updated
    Apr 23, 2021
    Dataset provided by
    TIB - Leibniz Information Centre for Science and Technology, Hannover, Germany
    Authors
    Cheema, Gullal S.; Hakimov, Sherzod; Müller-Budack, Eric; Ewerth, Ralph
    Description

    This is a multimodal dataset used in the paper "On the Role of Images for Analyzing Claims in Social Media", accepted at CLEOPATRA-2021 (2nd International Workshop on Cross-lingual Event-centric Open Analytics), co-located with The Web Conference 2021.

    The four datasets are curated for two different tasks that broadly come under fake news detection. Originally, the datasets were released as part of challenges or papers for text-based NLP tasks and are further extended here with corresponding images.

    1. clef_en and clef_ar are English and Arabic Twitter datasets for claim check-worthiness detection released in CLEF CheckThat! 2020 Barrón-Cedeno et al. [1].
    2. lesa is an English Twitter dataset for claim detection released by Gupta et al.[2]
    3. mediaeval is an English Twitter dataset for conspiracy detection released in MediaEval 2020 Workshop by Pogorelov et al.[3]

    The dataset details like data curation and annotation process can be found in the cited papers.

    Datasets released here with corresponding images are relatively smaller than the original text-based tweets. The data statistics are as follows:

    1. clef_en: 281
    2. clef_ar: 2571
    3. lesa: 1395
    4. mediaeval: 1724

    Each folder has two sub-folders and a JSON file, data.json, that consists of crawled tweets. The two sub-folders are:

    1. images: This contains crawled images with the same name as the tweet-id in data.json.
    2. splits: This contains the 5-fold splits used for training and evaluation in our paper. Each file in this folder is a csv with two columns.

    Code for the paper: https://github.com/cleopatra-itn/image_text_claim_detection

    If you find the dataset and the paper useful, please cite our paper and the corresponding dataset papers[1,2,3] Cheema, Gullal S., et al. "On the Role of Images for Analyzing Claims in Social Media" 2nd International Workshop on Cross-lingual Event-centric Open Analytics (CLEOPATRA) co-located with The Web Conf 2021.

    [1] Barrón-Cedeno, Alberto, et al. "Overview of CheckThat! 2020: Automatic identification and verification of claims in social media." International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham, 2020.
    [2] Gupta, Shreya, et al. "LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content." arXiv preprint arXiv:2101.11891 (2021).
    [3] Pogorelov, Konstantin, et al. "FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020." MediaEval 2020 Workshop. 2020.

  12. WELFake dataset for fake news detection in text data

    • zenodo.org
    • data.europa.eu
    csv
    Updated Apr 9, 2021
    Cite
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan (2021). WELFake dataset for fake news detection in text data [Dataset]. http://doi.org/10.5281/zenodo.4561253
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles with 35,028 real and 37,106 fake news. For this, we merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    Dataset contains four columns: Serial number (starting from 0); Title (about the text news heading); Text (about the news content); and Label (0 = fake and 1 = real).

    The CSV file contains 78,098 data entries, of which only 72,134 are accessed as per the data frame.
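    As a quick sanity check on the label encoding described above (0 = fake, 1 = real), the labels in the CSV could be tallied as sketched below. The header names in the sample (Serial, Title, Text, Label) and the function name count_labels are assumptions for illustration; the real file's headers may differ, so label_column is a parameter. Rows without a usable label are counted separately instead of being dropped silently.

```python
import csv
import io
from collections import Counter

# Encoding stated in the dataset description: 0 = fake, 1 = real.
LABEL_NAMES = {"0": "fake", "1": "real"}

def count_labels(handle, label_column="Label"):
    """Tally fake/real rows; rows with a missing or unknown label go to 'unusable'."""
    counts = Counter()
    for row in csv.DictReader(handle):
        value = (row.get(label_column) or "").strip()
        counts[LABEL_NAMES.get(value, "unusable")] += 1
    return counts

# Tiny in-memory stand-in for the real CSV (invented rows).
sample = io.StringIO(
    "Serial,Title,Text,Label\n"
    "0,A headline,Some text,1\n"
    "1,Another headline,More text,0\n"
    "2,Broken row,,\n"
)
counts = count_labels(sample)
print(dict(counts))
```

    For the real file, long article bodies can exceed the csv module's default field size limit, which may need to be raised with csv.field_size_limit before reading.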

    This dataset is a part of our ongoing research on "Fake News Prediction on Social Media Website" as a doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union’s Horizon 2020 research and innovation program.

  13. Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset...

    • figshare.com
    application/csv
    Updated Sep 24, 2025
    Cite
    Steve Nwaiwu (2025). Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset for Explainable Misinformation Detection [Dataset]. http://doi.org/10.6084/m9.figshare.29539820.v2
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Sep 24, 2025
    Dataset provided by
    figshare
    Authors
    Steve Nwaiwu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is associated with the research article titled "Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection".

    This corpus aggregates, harmonizes, and standardizes data from eight widely used fake news datasets. It supports multi-domain fake news detection with emphasis on explainability, cross-modal generalization, and robust performance.

    🗂️ Dataset Contents

    This repository contains the following resources:

    • Aggregated Raw Corpus (aggregated_raw.csv)
      • 286,260 samples across 8 datasets.
      • Binary labels (1 = Fake, 0 = Real).
      • Includes metadata: source dataset, topic (if available), speaker/source, etc.
    • Preprocessed Text Corpus (aggregated_cleaned.csv)
      • Includes standardized and cleaned cleaned_text column.
      • Text normalization applied using SpaCy (lowercasing, lemmatization, punctuation/URL/user removal).
    • Fully Encoded Feature Matrix (xframe_features_encoded.csv)
      • 104 structured features derived from communication theory and media psychology.
      • Includes source encoding, speaker credibility, social engagement, sentiment, subjectivity, sensationalism, and readability scores.
      • All numerical features scaled to [0, 1]; categorical features one-hot encoded.
    • Data Splits
      • train.csv, val.csv, test.csv: Stratified splits of the cleaned and encoded data.
    • Feature Metadata (feature_description.pdf)
      • Documentation of all 104 features with descriptions, data sources, and rationales.

    🔧 Preprocessing Overview

    To ensure robust and generalizable modeling, the following standardized pipeline was applied:

    • Text Preprocessing: Cleaned using SpaCy, lowercased, lemmatized, and stripped of stopwords, URLs, and usernames.
    • Label Mapping: Datasets with multiclass labels (e.g., LIAR, FNC-1) were mapped to a unified binary schema using theory-informed rules. 1 = Fake includes false, pants-on-fire, disagree, etc.; 0 = Real includes true, agree, mostly-true.
    • Deduplication: Removed near-duplicate entries across datasets using fuzzy string matching and content hashing.
    • Feature Engineering:
      • Source credibility features (e.g., speaker credibility from LIAR).
      • Social context (e.g., tweet volume, user mentions).
      • Framing indicators (e.g., sentiment, subjectivity, sensationalism, readability).
    • Feature Encoding: One-hot encoding for categorical attributes, Min-Max scaling for numerical features.

    📚 Original Data Sources

    This aggregated corpus was derived from the following datasets. Please cite them individually alongside this collection:

    • LIAR – Wang (2017): https://doi.org/10.18653/v1/P17-2067
    • FakeNewsNet (PolitiFact, BuzzFeed, GossipCop) – Shu et al.: https://doi.org/10.1145/3363574
    • ISOT – Ahmed et al.: https://doi.org/10.48550/arXiv.1708.07104
    • WELFake – Verma et al.: https://doi.org/10.1109/TCSS.2021.3068519
    • FNC-1: https://www.fakenewschallenge.org/
    • FakeNewsAMT – Pérez-Rosas et al.: https://doi.org/10.18653/v1/C18-1287
    • Celebrity Rumors – Horne & Adalı: https://doi.org/10.1609/icwsm.v11i1.15015
    • PHEME – Zubiaga et al.: https://doi.org/10.6084/m9.figshare.4010619.v1

    📖 How to Cite This Dataset

    Nwaiwu, S.; Jongsawat, N.; Tungkasthan, A. Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection. Appl. Sci. 2025, 15, 9498. https://doi.org/10.3390/app15179498
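
    The label mapping and Min-Max scaling steps described for this corpus can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names are hypothetical, and the mapping covers only the labels the description names explicitly. Labels whose treatment is not stated (for example "half-true" or "discuss") are left unmapped rather than guessed.

```python
# Hypothetical helper names; only labels named in the description are mapped.
FAKE_LABELS = {"false", "pants-on-fire", "disagree"}
REAL_LABELS = {"true", "agree", "mostly-true"}


def to_binary(label):
    """Map a source label to 1 (fake), 0 (real), or None if the rule is unknown."""
    key = label.strip().lower()
    if key in FAKE_LABELS:
        return 1
    if key in REAL_LABELS:
        return 0
    return None  # mapping for this label is not given in the description


def min_max_scale(values):
    """Rescale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to zeros by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(to_binary("Pants-on-Fire"), to_binary("mostly-true"), to_binary("half-true"))
print(min_max_scale([2.0, 4.0, 6.0]))
```

    Returning None for unlisted labels keeps the sketch honest about what the description specifies; a real pipeline would need the full rule set from the associated article.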

  14. Repository of fake news detection datasets

    • figshare.com
    txt
    Updated Mar 18, 2021
    Cite
    Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni (2021). Repository of fake news detection datasets [Dataset]. http://doi.org/10.4121/14151755.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 18, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The dataset lists twenty-seven freely available evaluation datasets for fake news detection, analysed according to eleven main characteristics (i.e., news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time).

  15. Fake News Classification

    • kaggle.com
    zip
    Updated Oct 8, 2023
    Cite
    Saurabh Shahane (2023). Fake News Classification [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification
    Explore at:
    Available download formats: zip (96615040 bytes)
    Dataset updated
    Oct 8, 2023
    Authors
    Saurabh Shahane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WELFake is a dataset of 72,134 news articles, of which 35,028 are real and 37,106 are fake. The authors merged four popular news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    The dataset contains four columns: Serial number (starting from 0), Title (the news headline), Text (the news content), and Label (0 = fake, 1 = real).

    The CSV file contains 78,098 data entries, of which only 72,134 are accessed as per the data frame.

    Published in: IEEE Transactions on Computational Social Systems: pp. 1-13 (doi: 10.1109/TCSS.2021.3068519).

  16. Fake News Challenge

    • kaggle.com
    zip
    Updated Apr 4, 2021
    Cite
    Abhinav Kumar Jha (2021). Fake News Challenge [Dataset]. https://www.kaggle.com/datasets/abhinavkrjha/fake-news-challenge
    Explore at:
    Available download formats: zip (5340415 bytes)
    Dataset updated
    Apr 4, 2021
    Authors
    Abhinav Kumar Jha
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The issue of “fake news” has arisen recently as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge was organized in early 2017 to encourage development of machine learning-based classification systems that perform “stance detection” -- i.e. identifying whether a particular news headline “agrees” with, “disagrees” with, “discusses,” or is unrelated to a particular news article -- in order to allow journalists and others to more easily find and investigate possible instances of “fake news.”

    Content

    The data provided is (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset is provided as two CSVs:

    train_bodies.csv

    This file contains the body text of articles (the articleBody column) with corresponding IDs (Body ID)

    train_stances.csv

    This file contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).

    Distribution of the data

    The distribution of Stance classes in train_stances.csv is as follows:

    rows     unrelated   discuss   agree       disagree
    49972    0.73131     0.17828   0.0736012   0.0168094

    There are 4 possible classifications:

    1. The article text agrees with the headline.
    2. The article text disagrees with the headline.
    3. The article text is a discussion of the headline, without taking a position on it.
    4. The article text is unrelated to the headline (i.e. it doesn’t address the same topic).
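
    The two-file layout described above (stances referring to bodies through Body ID) can be joined and summarised as in the sketch below. The column names (Body ID, articleBody, Headline, Stance) come from the description; the rows are made up for illustration and the function names are hypothetical. On the real files, the same functions would receive the text of train_bodies.csv and train_stances.csv.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for train_bodies.csv and train_stances.csv.
BODIES_CSV = '''Body ID,articleBody
0,Officials confirmed the bridge reopened on Monday.
1,The festival drew record crowds this year.
'''

STANCES_CSV = '''Headline,Body ID,Stance
Bridge reopens after repairs,0,agree
Bridge to stay closed all year,0,disagree
Festival attendance figures released,1,discuss
Local team wins championship,1,unrelated
Stock markets rally,0,unrelated
'''


def load_pairs(bodies_text, stances_text):
    """Join each (Headline, Body ID, Stance) row to its article body."""
    bodies = {row["Body ID"]: row["articleBody"]
              for row in csv.DictReader(io.StringIO(bodies_text))}
    return [(row["Headline"], bodies[row["Body ID"]], row["Stance"])
            for row in csv.DictReader(io.StringIO(stances_text))]


def stance_distribution(pairs):
    """Return the fraction of pairs carrying each stance label."""
    counts = Counter(stance for _, _, stance in pairs)
    return {stance: n / len(pairs) for stance, n in counts.items()}


pairs = load_pairs(BODIES_CSV, STANCES_CSV)
distribution = stance_distribution(pairs)
print(len(pairs), distribution)
```

    Because the "unrelated" class dominates the real training data, a distribution check like this is a useful first step before choosing an evaluation metric.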

    Acknowledgements

    For details of the task, see FakeNewsChallenge.org

  17. Data from: RealFakeNews

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Transformer Ensembles for Fake News Detection: A Multimodal Perspective with ViT, BERT, and DeBERTa (2025). RealFakeNews [Dataset]. https://huggingface.co/datasets/fauxNeuz/RealFakeNews
    Explore at:
    Dataset updated
    Jul 27, 2025
    Dataset authored and provided by
    Transformer Ensembles for Fake News Detection: A Multimodal Perspective with ViT, BERT, and DeBERTa
    Description

    RealFakeNews: A Dataset for Detecting Fake News

    RealFakeNews is a dataset of over 108,000 news samples, created to support the development of models that detect misinformation. Each entry contains a short news article along with a label indicating whether it’s real or fake.

    What's in the Dataset?

    Samples: 108,032
    Columns:
    text: News content (string)
    label: Classification label (string: REAL or FAKE)

    Language: English
    Format: CSV
    License: CC BY‑NC‑SA 4.0… See the full description on the dataset page: https://huggingface.co/datasets/fauxNeuz/RealFakeNews.

  18. Fake News detection

    • kaggle.com
    zip
    Updated Mar 7, 2024
    Cite
    Khalid Ashik (2024). Fake News detection [Dataset]. https://www.kaggle.com/datasets/dkhalidashik/fake-news-detection
    Explore at:
    Available download formats: zip (48725919 bytes)
    Dataset updated
    Mar 7, 2024
    Authors
    Khalid Ashik
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Khalid Ashik

    Released under Apache 2.0

    Contents

  19. Global On-Premise Fake Image Detection Market Research Report: By...

    • wiseguyreports.com
    Updated Sep 15, 2025
    Cite
    (2025). Global On-Premise Fake Image Detection Market Research Report: By Application (Social Media Monitoring, Intellectual Property Protection, Content Verification, Fraud Detection), By Technology (Machine Learning, Image Processing, Deep Learning, Computer Vision), By Deployment Type (Software, Hardware, Integrated Solutions), By End Use Industry (Media and Entertainment, E-commerce, Advertising, Government) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/on-premise-fake-image-detection-market
    Explore at:
    Dataset updated
    Sep 15, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Sep 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 1.3 (USD Billion)
    MARKET SIZE 2025: 1.47 (USD Billion)
    MARKET SIZE 2035: 5.0 (USD Billion)
    SEGMENTS COVERED: Application, Technology, Deployment Type, End Use Industry, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: rising concerns over misinformation, increasing digital content consumption, advancements in AI technologies, growing regulatory compliance requirements, demand for enhanced security measures
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Truepic, Verisk Analytics, OpenAI, NVIDIA, Kaspersky, Microsoft, DeepTrace Technologies, Cognitech, Symantec, Sensity Systems, Adobe, Inception Technologies
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: Growing demand for content authenticity, Increasing regulatory requirements on image verification, Rising threats of digital misinformation, Expanding applications in security sectors, Advancements in deep learning techniques
    COMPOUND ANNUAL GROWTH RATE (CAGR): 13.1% (2025 - 2035)

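
    The growth rate in this table can be sanity-checked against the listed market sizes using the standard compound annual growth rate formula, (end / start) ** (1 / periods) - 1. The sketch below assumes ten compounding periods between 2025 and 2035; the function name is hypothetical, and the report does not state how its figure was derived. The result lands close to the listed 13.1%, and a small gap is expected because the table values are rounded.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate as a fraction, e.g. 0.13 for 13%."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Market sizes in USD billion for 2025 and 2035, as listed in the table.
rate = cagr(1.47, 5.0, 2035 - 2025)
print(f"{rate:.1%}")
```
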
  20. Global Fake Image Detection Market Size By Component (Software, Services),...

    • verifiedmarketresearch.com
    Updated Apr 8, 2024
    Cite
    VERIFIED MARKET RESEARCH (2024). Global Fake Image Detection Market Size By Component (Software, Services), By Application (Incident Reporting, Cyber Defense), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/fake-image-detection-market/
    Explore at:
    Dataset updated
    Apr 8, 2024
    Dataset provided by
    Verified Market Research (https://www.verifiedmarketresearch.com/)
    Authors
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2024 - 2031
    Area covered
    Global
    Description

    Fake Image Detection Market size was valued at USD 276.65 Million in 2024 and is projected to reach USD 1417.59 Million by 2031, growing at a CAGR of 22.66% from 2024 to 2031.

    Global Fake Image Detection Market Overview

    The widespread availability of image editing software and social media platforms has led to a surge in fake images, including digitally altered photos and manipulated visual content. This trend has fueled the demand for advanced detection solutions capable of identifying and flagging fake images in real-time. With the proliferation of fake news and misinformation online, there is an increasing awareness among consumers, businesses, and governments about the importance of combating digital fraud and preserving the authenticity of visual content. This heightened concern is driving investments in fake image detection technologies to mitigate the risks associated with misinformation.

    However, despite advancements in AI and ML, detecting fake images remains a complex and challenging task, especially when dealing with sophisticated techniques such as deepfakes and generative adversarial networks (GANs). Developing robust detection algorithms capable of identifying increasingly sophisticated forms of image manipulation poses a significant challenge for researchers and developers. The deployment of fake image detection technologies raises concerns about privacy and data ethics, particularly regarding the collection and analysis of visual content shared online. Balancing the need for effective detection with respect for user privacy and ethical considerations remains a key challenge for stakeholders in the Fake Image Detection Market.

Cite
Bhavik Jikadara (2023). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/fake-news-detection

Fake News Detection

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (42975911 bytes)
Dataset updated
Dec 17, 2023
Authors
Bhavik Jikadara
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Fake news detection is a process that involves analyzing news content to determine its truthfulness. It is a subtask of text classification, and is defined as the task of classifying news as real or fake.

GitHub Link : https://github.com/Bhavik-Jikadara/Fake-News-Detection
