52 datasets found

CT-FAN: A Multilingual dataset for Fake News Detection
data.niaid.nih.gov
Updated Oct 23, 2022
Cite
Gautam Kishore Shahi (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4714516
Dataset updated
Oct 23, 2022
Dataset provided by
Thomas Mandl
Gautam Kishore Shahi
Julia Maria Struß
Michael Wiegand
Melanie Siegel
Juliane Köhler
Description
By downloading the data, you agree with the terms & conditions mentioned below:

Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

Citation

Please cite our work as

@InProceedings{clef-checkthat:2022:task3,
 author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
 title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
 year = {2022},
 booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
 series = {CLEF~'2022},
 address = {Bologna, Italy},
}

@article{shahi2021overview,
 title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
 author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
 journal={Working Notes of CLEF},
 year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Task 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 1,264 English-language articles with their respective labels. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
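
The category definitions above can be illustrated with a minimal sketch that collapses fact-checker ratings into the four task classes. The function name `normalise_rating` and the two rating sets are hypothetical; the rating strings are taken from the definitions above, and sending every unrecognised rating to "other" is an assumption, not something the organisers specify.

```python
# Hypothetical helper: map a fact-checker rating onto the four task classes.
# Rating strings come from the category definitions; the fallback to "other"
# for unknown ratings is an assumption made for this sketch.
PARTIALLY_FALSE_RATINGS = {
    "partially false",
    "partially true",
    "mostly true",
    "miscaptioned",
    "misleading",
}


def normalise_rating(raw: str) -> str:
    """Return one of: 'false', 'partially false', 'true', 'other'."""
    label = raw.strip().lower()
    if label in ("false", "true"):
        return label
    if label in PARTIALLY_FALSE_RATINGS:
        return "partially false"
    # Covers disputed and unproven claims, plus anything unrecognised.
    return "other"
```

For example, `normalise_rating("Mostly True")` yields `"partially false"`, matching the definition that such articles cannot be considered 100% true.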

Cross-Lingual Task (German)

Along with the multi-class task for English, we have introduced a task for a low-resource language. We will provide the test data in German. The idea of the task is to use the English data and the concept of transfer to build a classification model for German.

Input Data

The data will be provided with the columns ID, title, text, rating, and domain. The columns are described as follows:

ID - Unique identifier of the news article

Title - Title of the news article

text - Text of the news article

our rating - Class of the news article: false, partially false, true, or other

Output data format

public_id - Unique identifier of the news article

predicted_rating - Predicted class

Sample File

public_id, predicted_rating
1, false
2, true
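
A minimal sketch of producing a file in this output format with Python's standard `csv` module follows. The function name `write_submission` is hypothetical, and the exact whitespace the organisers expect after commas is not specified in the description, so this sketch writes plain comma-separated values.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (public_id, predicted_rating) pairs with the documented header."""
    writer = csv.writer(out)
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        writer.writerow([public_id, rating])


# Usage: write to an in-memory buffer; pass an open file handle instead
# (opened with newline="") to create a real submission file.
buffer = io.StringIO()
write_submission([(1, "false"), (2, "true")], buffer)
submission_text = buffer.getvalue()
```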

IMPORTANT!

We have used data from 2010 to 2022, and the fake news content covers a mix of topics such as elections and COVID-19.

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
Machine Learning Frameworks for Fake News Detection and Datasets
dataverse.nl
rar, text/markdown
Updated Oct 30, 2024
Cite
Fadi Mohsen; Bedir Chaushi; Hamed Abdelhaq; Kevin Wang (2024). Machine Learning Frameworks for Fake News Detection and Datasets [Dataset]. http://doi.org/10.34894/CUCITF
Available download formats: rar (133821784), text/markdown (6091)
Unique identifier
https://doi.org/10.34894/CUCITF
Dataset updated
Oct 30, 2024
Dataset provided by
DataverseNL
Authors
Fadi Mohsen; Bedir Chaushi; Hamed Abdelhaq; Kevin Wang
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
A web framework designed for researchers to perform comparative analysis of various machine learning algorithms in the context of fake news detection. The folder also includes several datasets for experimentation, alongside the source code. The rise of social media has transformed the landscape of news dissemination, presenting new challenges in combating the spread of fake news. This study addresses the automated detection of misinformation within written content, a task that has prompted extensive research efforts across various methodologies. We evaluate existing benchmarks, introduce a novel hybrid word embedding model, and implement a web framework for text classification. Our approach integrates traditional term frequency–inverse document frequency (TF–IDF) methods with sophisticated feature extraction techniques, considering linguistic, psychological, morphological, and grammatical aspects of the text. Through a series of experiments on diverse datasets, applying transfer and incremental learning techniques, we demonstrate the effectiveness of our hybrid model in surpassing benchmarks and outperforming alternative experimental setups. Furthermore, our findings emphasize the importance of dataset alignment and balance in transfer learning, as well as the utility of incremental learning in maintaining high detection performance while reducing runtime. This research offers promising avenues for further advancements in fake news detection methodologies, with implications for future research and development in this critical domain.
Repository of fake news detection datasets
figshare.com
data.4tu.nl
txt
Updated Mar 18, 2021
Cite
Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni (2021). Repository of fake news detection datasets [Dataset]. http://doi.org/10.4121/14151755.v1
Available download formats: txt
Unique identifier
https://doi.org/10.4121/14151755.v1
Dataset updated
Mar 18, 2021
Dataset provided by
4TU.ResearchData
Authors
Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection analysed according to eleven main characteristics (i.e., news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time)
Image and Text Fake News Detection Dataset
figshare.com
zip
Updated May 2, 2025
Cite
Esther Irawati Setiawan (2025). Image and Text Fake News Detection Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28735676.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28735676.v1
Dataset updated
May 2, 2025
Dataset provided by
figshare
Authors
Esther Irawati Setiawan
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset contains multimodal content (images and text) from two sources:

Fakeddit Subset: A collection of social media posts (primarily from Reddit) that often include misleading or questionable content.

Snopes Crawled Data (Medical Fake News Only): Fact-checking information focused solely on medical misinformation, as curated and verified by Snopes.
CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection
zenodo.org
data.niaid.nih.gov
Updated Oct 23, 2022
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.5775511
Unique identifier
https://doi.org/10.5281/zenodo.5775511
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please fill out the form and upload the Data Sharing Agreement at Google Form.

Citation

Please cite our work as

@article{shahi2021overview,
 title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
 author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
 journal={Working Notes of CLEF},
 year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Subtask 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Input Data

The data will be provided with the columns ID, title, text, rating, and domain. The columns are described as follows:

ID - Unique identifier of the news article

Title - Title of the news article

text - Text of the news article

our rating - Class of the news article: false, partially false, true, or other

Output data format

public_id - Unique identifier of the news article

predicted_rating - Predicted class

Sample File

public_id, predicted_rating
1, false
2, true

Sample file

public_id, predicted_domain
1, health
2, crime

Additional data for Training

Participants may use additional data in a similar format to train their models; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

We have used data from 2010 to 2021, and the fake news content covers a mix of topics such as elections and COVID-19.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
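
As an illustration of the ranking measure, the following is a minimal pure-Python sketch of macro-averaged F1: the per-class F1 scores are computed independently and then averaged with equal weight. The function name `macro_f1` and the toy labels are hypothetical, and scoring a class with no gold or predicted instances as 0.0 is an assumption of this sketch; the official scorer may handle that case differently.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over the given label set."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a class that never occurs in gold or pred scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


CLASSES = ["false", "partially false", "true", "other"]
gold = ["false", "false", "true", "other"]
pred = ["false", "true", "true", "other"]
example_score = macro_f1(gold, pred, CLASSES)
```

Because every class counts equally, a system that ignores a rare class is penalised more heavily than it would be under plain accuracy.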

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Submission Link: Coming soon

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
Datasets: fake news multimodal datasets (Twitter and Weibo). Credit: Data 1:...
figshare.com
zip
Updated Mar 1, 2025
Cite
Akinlolu Ojo (2025). Datasets: fake news multimodal datasets (Twitter and Weibo). Credit: Data 1: (Twitter dataset): The data that support the findings of this study are derived from “Detection and visualization of misleading content on Twitter” at https://github.com/MKLab-ITI/image-verification-corpus, DOI: "10.1007/s13735-017-0143-x." Data 2: (Weibo dataset): The data that support the findings of this study are derived from “EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection” at https://github.com/yaqingwang/EANN-KDD18?tab=readme-ov-file, DOI: “10.1145/3219819.3219903.” [Dataset]. http://doi.org/10.6084/m9.figshare.28516655.v2
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28516655.v2
Dataset updated
Mar 1, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Akinlolu Ojo
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This study proposes an innovative approach for multimodal fake news detection that utilizes a stick-breaking smoothed Dirichlet distribution. This approach enables the model to capture intricate, subtle interactions between modalities more effectively, thereby improving detection performance and enhancing the system's adaptability to various forms of fake news content
News detection
kaggle.com
Updated Jul 14, 2020
Cite
Shivam Chaurasia (2020). News detection [Dataset]. https://www.kaggle.com/shivamchaurasia/news-detection
Dataset updated
Jul 14, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Shivam Chaurasia
Description
Dataset

This dataset was created by Shivam Chaurasia

Data from: On the Role of Images for Analyzing Claims in Social Media
zenodo.org
data.niaid.nih.gov
Updated Apr 23, 2021
Cite
Gullal S. Cheema; Sherzod Hakimov; Eric Müller-Budack; Ralph Ewerth (2021). On the Role of Images for Analyzing Claims in Social Media [Dataset]. http://doi.org/10.5281/zenodo.4592249
Unique identifier
https://doi.org/10.5281/zenodo.4592249
Dataset updated
Apr 23, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gullal S. Cheema; Sherzod Hakimov; Eric Müller-Budack; Ralph Ewerth
Description
This is a multimodal dataset used in the paper "On the Role of Images for Analyzing Claims in Social Media", accepted at CLEOPATRA-2021 (2nd International Workshop on Cross-lingual Event-centric Open Analytics), co-located with The Web Conference 2021.

The four datasets are curated for two different tasks that broadly come under fake news detection. Originally, the datasets were released as part of challenges or papers for text-based NLP tasks and are further extended here with corresponding images.

1. clef_en and clef_ar are English and Arabic Twitter datasets for claim check-worthiness detection, released in CLEF CheckThat! 2020 by Barrón-Cedeño et al.^[1]
2. lesa is an English Twitter dataset for claim detection, released by Gupta et al.^[2]
3. mediaeval is an English Twitter dataset for conspiracy detection, released in the MediaEval 2020 Workshop by Pogorelov et al.^[3]

The dataset details like data curation and annotation process can be found in the cited papers.

Datasets released here with corresponding images are relatively smaller than the original text-based tweets. The data statistics are as follows:
1. clef_en: 281
2. clef_ar: 2571
3. lesa: 1395
4. mediaeval: 1724

Each folder has two sub-folders and a JSON file, data.json, that contains the crawled tweets. The two sub-folders are:
1. images: This contains the crawled images, each named after the corresponding tweet ID in data.json.
2. splits: This contains the 5-fold splits used for training and evaluation in our paper. Each file in this folder is a csv with two columns

Code for the paper: https://github.com/cleopatra-itn/image_text_claim_detection

If you find the dataset and the paper useful, please cite our paper and the corresponding dataset papers^[1,2,3]
Cheema, Gullal S., et al. "On the Role of Images for Analyzing Claims in Social Media." 2nd International Workshop on Cross-lingual Event-centric Open Analytics (CLEOPATRA), co-located with The Web Conf 2021.

[1] Barrón-Cedeño, Alberto, et al. "Overview of CheckThat! 2020: Automatic identification and verification of claims in social media." International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham, 2020.
[2] Gupta, Shreya, et al. "LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content." arXiv preprint arXiv:2101.11891 (2021).
[3] Pogorelov, Konstantin, et al. "FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020." MediaEval 2020 Workshop. 2020.
Machine Hack: Fake News Content Detection
kaggle.com
Updated Sep 11, 2020
Cite
Sumit Saha (2020). Machine Hack: Fake News Content Detection [Dataset]. https://www.kaggle.com/datasets/ssismasterchief/machine-hack-fake-news-content-detection/discussion
Dataset updated
Sep 11, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sumit Saha
Description
Dataset

This dataset was created by Sumit Saha

WELFake dataset for fake news detection in text data
zenodo.org
csv
Updated Apr 9, 2021
Cite
Pawan Kumar Verma; Prateek Agrawal; Radu Prodan (2021). WELFake dataset for fake news detection in text data [Dataset]. http://doi.org/10.5281/zenodo.4561253
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.4561253
Dataset updated
Apr 9, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Pawan Kumar Verma; Prateek Agrawal; Radu Prodan
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles with 35,028 real and 37,106 fake news. For this, we merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

Dataset contains four columns: Serial number (starting from 0); Title (about the text news heading); Text (about the news content); and Label (0 = fake and 1 = real).

The CSV file contains 78,098 data entries, of which only 72,134 are accessed as per the data frame.
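
A minimal sketch of tallying records per label with the standard `csv` module follows. The header names (`id`, `title`, `text`, `label`) and the sample rows are hypothetical placeholders; only the label encoding (0 = fake, 1 = real) comes from the description above. The sample also shows that a raw line count can exceed the parsed record count when a quoted text field spans several lines; whether that explains the difference between the two figures quoted above is an assumption, not something the description confirms.

```python
import csv
import io
from collections import Counter

# Hypothetical two-record sample; the first record's text spans two lines.
SAMPLE = (
    "id,title,text,label\n"
    '0,"Headline A","Body line one\nBody line two",1\n'
    '1,"Headline B","Body B",0\n'
)

LABEL_NAMES = {"0": "fake", "1": "real"}  # encoding from the description


def count_labels(handle):
    """Count parsed records per label name."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[LABEL_NAMES.get(row["label"], "unknown")] += 1
    return counts


label_counts = count_labels(io.StringIO(SAMPLE))
raw_line_count = len(SAMPLE.splitlines())  # counts physical lines, not records
```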

This dataset is a part of our ongoing research on "Fake News Prediction on Social Media Website" as a doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union’s Horizon 2020 research and innovation program.
MM-COVID Dataset
paperswithcode.com
Updated Apr 29, 2021
Cite
Yichuan Li; Bohan Jiang; Kai Shu; Huan Liu (2021). MM-COVID Dataset [Dataset]. https://paperswithcode.com/dataset/mm-covid
Dataset updated
Apr 29, 2021
Authors
Yichuan Li; Bohan Jiang; Kai Shu; Huan Liu
Description
MM-COVID is a dataset for fake news detection related to COVID-19. It provides multilingual fake news together with the relevant social context. It contains 3,981 pieces of fake news content and 7,192 pieces of trustworthy information in six languages: English, Spanish, Portuguese, Hindi, French, and Italian.
Twitter dataset
figshare.com
txt
Updated Dec 20, 2024
Cite
mehdi khalil (2024). Twitter dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28069163.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.28069163.v1
Dataset updated
Dec 20, 2024
Dataset provided by
figshare
Authors
mehdi khalil
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The Truth Seeker Dataset is designed to support research in the detection and classification of misinformation on social media platforms, particularly focusing on Twitter. This dataset is part of a broader initiative to enhance the understanding of how machine learning (ML) and natural language processing (NLP) can be leveraged to identify fake news and misleading content in real-time.

Dataset Composition

The Truth Seeker Dataset comprises a substantial collection of social media posts that have been meticulously labeled as either real or fake. It was constructed using advanced ML algorithms and NLP techniques to analyze the language patterns in social media communications. The dataset includes:

Raw Social Media Posts: A diverse range of tweets that reflect various topics and sentiments.

Labeling: Each post is annotated with binary labels indicating its authenticity (real or fake).

Feature Sets: Two distinct subsets of the dataset have been created using different NLP vectorization methods: Word2Vec and TF-IDF. This allows researchers to explore how different feature representations impact model performance.

Research Applications

The primary aim of the Truth Seeker Dataset is to facilitate the development and validation of models that can accurately classify social media content. Key applications include:

Fake News Detection: Utilizing various ML algorithms, including Random Forest and AdaBoost, which have demonstrated high F1 scores in preliminary evaluations.

Model Comparison: Researchers can compare the effectiveness of different ML approaches on the same dataset, enabling a clearer understanding of which methods yield the best results in detecting misinformation.

Algorithm Development: The dataset serves as a benchmark for developing new algorithms aimed at improving accuracy in fake news detection.
Fake News detection
kaggle.com
Updated Aug 4, 2020
Cite
Samrat Sinha (2020). Fake News detection [Dataset]. https://www.kaggle.com/samrat96/fake-news-detection
Dataset updated
Aug 4, 2020
Dataset provided by
Kaggle
Authors
Samrat Sinha
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The spread of fake news in our society is increasing drastically, leading people to believe in incidents that never happened. It is therefore essential to distinguish real news from fake news and to present the real news to society.

Content

There are three CSV files:
1. train.csv - 25117 rows and 5 columns named id, title, author, text, and label.
2. test.csv - 5881 rows and 4 columns named id, title, author, and text.
3. submit.csv - A sample file showing how the output file should look.

Inspiration

Everyone deserves to know the actual happenings of the world. A model should be developed which will be able to differentiate the fake news from the real ones. Use the train data to build your model and use the test data to evaluate that model.

BanFakeNews

kaggle.com

Updated Mar 10, 2021

Cite

Sudipta Kar (2021). BanFakeNews [Dataset]. https://www.kaggle.com/cryptexcode/banfakenews/metadata

Dataset updated

Mar 10, 2021

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Sudipta Kar

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

BanFakeNews: A Dataset for Detecting Fake News in Bangla

This work is accepted at LREC 2020. Paper is available at https://arxiv.org/pdf/2004.08789.pdf

Abstract

Observing the damages that can be done by the rapid propagation of fake news in various sectors like politics and finance, automatic identification of fake news using linguistic analysis has drawn the attention of the research community. However, such methods are largely being developed for English where low resource languages remain out of the focus. But the risks spawned by fake and manipulative news are not confined by languages. In this work, we propose an annotated dataset of ~50K news that can be used for building automated fake news detection systems for a low resource language like Bangla. Additionally, we provide an analysis of the dataset and develop a benchmark system with state of the art NLP techniques to identify Bangla fake news. To create this system, we explore traditional linguistic features and neural network based methods. We expect this dataset will be a valuable resource for building technologies to prevent the spreading of fake news and contribute in research with low resource languages.

List of files

Authentic-48K.csv
Fake-1K.csv
LabeledAuthentic-7K.csv
LabeledFake-1K.csv

File Format Authentic-48K.csv and Fake-1K.csv

Column Title	Description
articleID	ID of the news
domain	News publisher's site name
date	Published Date
category	Category of the news
headline	Headline of the news
content	Article or body of the news
label	1 or 0: '1' for authentic, '0' for fake

LabeledAuthentic-7K.csv, LabeledFake-1K.csv

Column Title	Description
articleID	ID of the news
domain	News publisher's site name
date	Published Date
category	Category of the news
source	Source of the news. (One who can verify the claim of the news)
relation	Related or Unrelated. Related if headline matches with content's claim otherwise it is labeled as Unrelated
headline	Headline of the news
content	Article or body of the news
label	1 or 0: '1' for authentic, '0' for fake
F-type	Type of fake news (Clickbait, Satire, Fake(Misleading or False Context))

F-type is only present in LabeledFake-1K.csv
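
A minimal sketch of reading the labeled file format and tallying fake-news types follows. The column names are taken from the table above; the function name `fake_type_counts`, the comma delimiter, and every value in the sample rows (site name, dates, categories) are hypothetical placeholders rather than real records.

```python
import csv
import io
from collections import Counter

# Hypothetical rows that follow the documented LabeledFake-1K.csv columns.
SAMPLE = (
    "articleID,domain,date,category,source,relation,headline,content,label,F-type\n"
    "1,example.com,2019-01-01,National,Reporter,Related,Headline,Body,0,Clickbait\n"
    "2,example.com,2019-01-02,Sports,Reporter,Unrelated,Headline,Body,0,Satire\n"
    "3,example.com,2019-01-03,National,Reporter,Related,Headline,Body,0,Clickbait\n"
)


def fake_type_counts(handle):
    """Count F-type values among rows labeled fake (label == '0')."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if row["label"] == "0":
            # F-type is absent from the authentic file, hence the fallback.
            counts[row.get("F-type") or "unspecified"] += 1
    return counts


ftype_counts = fake_type_counts(io.StringIO(SAMPLE))
```

To run this against the real file, pass a handle opened with `newline=""` and an encoding suited to Bangla text, such as UTF-8 (an assumption about the file's encoding).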

Bibtex for citation

@InProceedings{Hossain20.1084,
 author = {Md Zobaer Hossain and Md Ashraful Rahman and Md Saiful Islam and Sudipta Kar},
 title = "{BanFakeNews: A Dataset for Detecting Fake News in Bangla}",
 booktitle = {Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020)},
 year = {2020},
 publisher = {European Language Resources Association (ELRA)},
 language = {english}
}

Fake News Detection
kaggle.com
Updated Mar 23, 2022
Cite
Raj Jain (2022). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/rpjain55/fake-news-detection/data
Dataset updated
Mar 23, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Raj Jain
Description
Dataset

This dataset was created by Raj Jain

Fake news
kaggle.com
Updated May 23, 2019
Mohit (2019). Fake news [Dataset]. https://www.kaggle.com/mohit28rawat/fake-news
Dataset updated
May 23, 2019
Dataset provided by
Kaggle
Authors
Mohit
Description
Dataset

This dataset was created by Mohit

AI Detector Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 10, 2025
Market Report Analytics (2025). AI Detector Report [Dataset]. https://www.marketreportanalytics.com/reports/ai-detector-74126
Available download formats: pdf, doc, ppt
Dataset updated
Apr 10, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The AI content detection market is experiencing rapid growth, driven by the increasing prevalence of AI-generated content and the rising need for authenticity verification across various sectors. The market, estimated at $2 billion in 2025, is projected to exhibit a robust Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033. This expansion is fueled by several key factors. The educational sector is a significant driver, with institutions increasingly employing AI detection tools to combat plagiarism and ensure academic integrity. Furthermore, the news and media industries are adopting these technologies to identify and mitigate the spread of misinformation generated by AI. The development of sophisticated algorithms capable of detecting subtle nuances in AI-generated text, images, and audio is another significant contributing factor. Different types of detectors – text, image/video, and audio – cater to diverse needs, driving market segmentation. However, challenges remain, including the ongoing arms race between AI content generators and detectors, the potential for false positives, and concerns surrounding data privacy and ethical implications. The market's geographical distribution reflects the higher adoption in technologically advanced regions like North America and Europe, but rapid growth is anticipated in Asia Pacific, driven by rising internet penetration and increasing awareness of AI-generated content issues. The competitive landscape is dynamic, with both established players and emerging startups vying for market share. Companies like Turnitin and Copyleaks are well-positioned with their existing platforms, while newcomers are innovating with specialized detectors and AI-powered solutions. The market is characterized by both subscription-based models and one-time purchases, providing various options for users. 
Future growth will depend on ongoing technological advancements, the ability to adapt to evolving AI writing techniques, and the expansion into new applications and industries. The increasing integration of AI detection tools into existing platforms and workflows will further accelerate market adoption, making it a significant investment opportunity in the rapidly evolving technological landscape.
Data from: Do You Speak Disinformation? Computational Detection of Deceptive...
tandf.figshare.com
txt
Updated Jun 11, 2025
Noëlle Lebernegg; Jakob-Moritz Eberl; Petro Tolochko; Hajo Boomgaarden (2025). Do You Speak Disinformation? Computational Detection of Deceptive News-Like Content Using Linguistic and Stylistic Features [Dataset]. http://doi.org/10.6084/m9.figshare.25225350.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.25225350.v1
Dataset updated
Jun 11, 2025
Dataset provided by
Taylor & Francis
Authors
Noëlle Lebernegg; Jakob-Moritz Eberl; Petro Tolochko; Hajo Boomgaarden
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Amid growing concerns about the proliferation of, and belief in, false or misleading information, the study addresses the need for automated detection in the public domain. It revisits and replicates scattered findings using a comprehensive, content-oriented, and feature-based approach. This method reliably identifies deceptive news-like content and highlights the importance of individual features in guiding the prediction algorithm. Employing explainable machine learning, the study explores content patterns for disinformation detection. Results from a tree-based approach on real-world data indicate that content-related characteristics can, when used in combination, facilitate the early detection of deceptive news-like articles. The study concludes by discussing the practical implications of computationally detecting the malicious language of disinformation.
fake news detection
kaggle.com
Updated Dec 6, 2020
Sarthak malik (2020). fake news detection [Dataset]. https://www.kaggle.com/srthk5/fake-news-detection/code
Dataset updated
Dec 6, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sarthak malik
Description
Dataset

This dataset was created by Sarthak malik

Global Fake Image Detection Market Size By Component (Software, Services),...
verifiedmarketresearch.com
Updated Apr 8, 2024
VERIFIED MARKET RESEARCH (2024). Global Fake Image Detection Market Size By Component (Software, Services), By Application (Incident Reporting, Cyber Defense), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/fake-image-detection-market/
Dataset updated
Apr 8, 2024
Dataset provided by
Verified Market Research (https://www.verifiedmarketresearch.com/)
Authors
VERIFIED MARKET RESEARCH
License
https://www.verifiedmarketresearch.com/privacy-policy/
Time period covered
2024 - 2031
Area covered
Global
Description
Fake Image Detection Market size was valued at USD 276.65 Million in 2024 and is projected to reach USD 1417.59 Million by 2031, growing at a CAGR of 22.66% from 2024 to 2031.
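
The relation between the two valuations and the growth rate can be checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / periods) - 1. The sketch below is only an arithmetic check on the quoted figures. The number of compounding periods is an assumption, since the description does not state the base year; the function name implied_cagr is hypothetical.

```python
def implied_cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by two values `periods` years apart."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Figures quoted in the description above (USD million).
start, end = 276.65, 1417.59

# The number of compounding periods is an assumption, so both readings are shown.
for periods in (7, 8):
    print(periods, f"{implied_cagr(start, end, periods):.2%}")
```

Under these figures, eight compounding periods reproduce a rate of roughly 22.66%, while seven periods (2024 to 2031 counted directly) give roughly 26.3%.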

Global Fake Image Detection Market Overview

The widespread availability of image editing software and social media platforms has led to a surge in fake images, including digitally altered photos and manipulated visual content. This trend has fueled the demand for advanced detection solutions capable of identifying and flagging fake images in real-time. With the proliferation of fake news and misinformation online, there is an increasing awareness among consumers, businesses, and governments about the importance of combating digital fraud and preserving the authenticity of visual content. This heightened concern is driving investments in fake image detection technologies to mitigate the risks associated with misinformation.

However, despite advancements in AI and ML, detecting fake images remains a complex and challenging task, especially when dealing with sophisticated techniques such as deepfakes and generative adversarial networks (GANs). Developing robust detection algorithms capable of identifying increasingly sophisticated forms of image manipulation poses a significant challenge for researchers and developers. The deployment of fake image detection technologies raises concerns about privacy and data ethics, particularly regarding the collection and analysis of visual content shared online. Balancing the need for effective detection with respect for user privacy and ethical considerations remains a key challenge for stakeholders in the Fake Image Detection Market.

Gautam Kishore Shahi (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4714516

CT-FAN: A Multilingual dataset for Fake News Detection

Dataset updated

Oct 23, 2022

Dataset provided by

Thomas Mandl
Gautam Kishore Shahi
Julia Maria Struß
Michael Wiegand
Melanie Siegel
Juliane Köhler

Description

By downloading the data, you agree with the terms & conditions mentioned below:

Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

Citation

Please cite our work as

@InProceedings{clef-checkthat:2022:task3, author = {K{"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas}, title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection", year = {2022}, booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum", series = {CLEF~'2022}, address = {Bologna, Italy},}

@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially false, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Task 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially false, false, or other. The training data will be released in batches and comprises roughly 1,264 English-language articles with their respective labels. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
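
The category definitions above imply a mapping from the varied verdicts used by fact-checking services onto the four task classes. The following is a minimal sketch of such a normalisation, not the organisers' procedure. The verdict strings are taken from the Partially False definition above; the function name to_task_label and the choice to send every unrecognised verdict to "other" are assumptions.

```python
# Verdicts that the definition above groups under "partially false".
PARTIALLY_FALSE_VERDICTS = {
    "partially false",
    "partially true",
    "mostly true",
    "miscaptioned",
    "misleading",
}


def to_task_label(verdict):
    """Map a fact-checker verdict string onto one of the four task classes.

    Assumption: anything not recognised as true, false, or partially false
    (for example disputed or unproven claims) falls into "other".
    """
    normalized = verdict.strip().lower()
    if normalized in PARTIALLY_FALSE_VERDICTS:
        return "partially false"
    if normalized in {"true", "false"}:
        return normalized
    return "other"


for verdict in ["Mostly True", "FALSE", "unproven"]:
    print(verdict, "->", to_task_label(verdict))
```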

Cross-Lingual Task (German)

Along with the multi-class task for English, we have introduced a task for a low-resource language. We will provide the test data in German. The idea of the task is to use the English data and the concept of transfer learning to build a classification model for German.

Input Data

The data will be provided with the columns ID, title, text, rating, and domain; the columns are described as follows:

ID - Unique identifier of the news article

title - Title of the news article

text - Text mentioned inside the news article

our rating - Class of the news article: false, partially false, true, or other

Output data format

public_id - Unique identifier of the news article

predicted_rating - Predicted class

Sample File

public_id, predicted_rating
1, false
2, true
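
A submission file in the output format above could be produced along these lines. This is a minimal sketch rather than the official tooling: it assumes a plain comma-separated file with a header row, and it assumes the four class names are written in lower case. Whether a space after the comma is required, as in the sample file, is not stated, so the sketch omits it. The function name write_predictions is hypothetical.

```python
import csv
import io

# The four classes named in the task description.
VALID_RATINGS = {"false", "partially false", "true", "other"}


def write_predictions(predictions, handle):
    """Write (public_id, predicted_rating) pairs as CSV with a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        normalized = rating.strip().lower()
        if normalized not in VALID_RATINGS:
            raise ValueError(f"unknown rating: {rating!r}")
        writer.writerow([public_id, normalized])


# In-memory buffer standing in for a real output file.
buffer = io.StringIO()
write_predictions([(1, "false"), (2, "True")], buffer)
print(buffer.getvalue())
```

Rejecting unknown ratings before writing keeps a typo in a class name from silently reaching the submission file.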

IMPORTANT!

We have used the data from 2010 to 2022, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
