A survey conducted in December 2020 assessing if news consumers in the United States had ever unknowingly shared fake news or information on social media found that 38.2 percent had done so. A similar share had not, whereas seven percent were unsure if they had accidentally disseminated misinformation on social networks.
Fake news in the U.S.
Fake news, or news that contains misinformation, has become a prevalent issue within the American media landscape. Fake news can be circulated online as news stories with deliberately misleading headlines, or clickbait, but the rise of misinformation cannot be solely attributed to online social media. Forms of fake news are also found in print media, with 47 percent of Americans witnessing fake news in newspapers and magazines as of January 2019.
News consumers in the United States are aware of the spread of misinformation, with many Americans believing online news websites regularly report fake news stories. With such a high volume of online news websites publishing false information, it can be difficult to assess the credibility of a story. This can have damaging effects on society: the public struggles to stay informed, creating a great deal of confusion about even basic facts and contributing to incivility.
A 2024 study on news consumption among children in the United Kingdom found that ** percent of respondents aged 12 to 15 years old had come across deliberately untrue or misleading news online or on social media in the year before the survey was conducted. ** percent said they had not seen any false news.
In the spring of 2020, a simple tweet claimed that sipping hot water every 15 minutes could kill the coronavirus. No medical source backed it, yet the post quickly amassed over 150,000 shares. Fast forward to 2025, and we’ve learned that misinformation online is not a bug; it’s a system...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is provided to facilitate reproducibility of results presented in the following paper:
Chengcheng Shao, Pik-Mai Hui, Lei Wang, Xinwen Jiang, Alessandro Flammini, Filippo Menczer and Giovanni Luca Ciampaglia (2018): Anatomy of an online misinformation network. Preprint arXiv:1801.06122, arxiv.org/abs/1801.06122
Please read carefully both the paper and the README file attached to understand what is contained in this dataset before proceeding. These data are provided for non-commercial purposes only. If you use this dataset for research, please be sure to cite the above preprint, or preferably the final published version that will be shown on the arXiv.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is discussed in far more detail in the corresponding paper, AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild.
The rise of convincing, photorealistic AI-generated images and video has heightened already intense concern over online misinformation and its associated harms. However, despite huge coverage in the press and interest from the general public, it is not clear whether AI is widely used in misinformation. In fact, there is little systematic data available at all about the forms misinformation takes online, the use of images and video in misinformation contexts, and what types of manipulations are taking place.
The AMMeBa (Annotated Misinformation, Media-Based) dataset seeks to provide a survey of online misinformation, allowing first-of-its-kind quantification of manipulations like deepfakes and photoshopped media, as well as trends in how those populations are changing over time.
Recognizing the enormous value and work of fact checkers, AMMeBa uses publicly available fact checks to identify misinformation claims, which were annotated by highly trained human annotators, providing a detailed characterization of each misinformation claim. Media-based misinformation, which uses images, video and audio to bolster the claim, is a particular focus, especially images.
Annotations took place over two years. The resulting dataset comprises millions of individual hand-applied labels, applied to over a hundred thousand English-language fact checks published between 1995 and today. More than fifty thousand misinformation-associated images were identified and annotated.
Image URLs were obtained in a best-effort manner. We provide them as a possible pointer to the correct image; however, URLs are absent in some cases for several reasons.
In the majority of cases, though, the URL under misinfo_source in all provided CSVs will point to the page where the image occurred, and in general the images are still present (this is checked explicitly by raters when a fact check / source is passed to a subsequent stage, like Stage 1M → Stage 2M). If the entry is not "disqualified," then the image was present on the page at the time of subsequent annotation, and may still be fetchable by matching against the provided hashes.
To allow users to fetch the images themselves, we provide three hashes of the image data. These hashes use the open-source "imagehash" image hashing library from GitHub ([README, with explanat...
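As a rough illustration of the hash-matching step described above, the sketch below computes a perceptual hash for a fetched image and compares it against a provided hash using the imagehash library. It assumes hex-encoded perceptual hashes and an illustrative distance cutoff; the actual hash types, encodings, and field names should be taken from the README.

```python
# Hypothetical sketch: match a locally fetched image against a provided hash.
# Assumes hex-encoded perceptual hashes; check the README for the actual
# hash types and field names used in the CSVs.
import imagehash
from PIL import Image

def matches_provided_hash(image_path, provided_hex, max_distance=4):
    """True if the image's perceptual hash is within max_distance bits
    (Hamming distance) of the provided hash."""
    candidate = imagehash.phash(Image.open(image_path))
    provided = imagehash.hex_to_hash(provided_hex)
    # imagehash overloads '-' to return the Hamming distance between hashes
    return (candidate - provided) <= max_distance
```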
According to a survey conducted in the United Kingdom in May 2024, 75 percent of adults thought that digitally altered content contributed to the spread of online misinformation. Additionally, 67 percent felt that AI-generated content contributed to the spread of misinformation on online platforms.
In a digital news consumption survey conducted in India in March 2023, ** percent of respondents stated that observing how news spreads and its absence from other digital platforms was a common method they used to spot online misinformation. In comparison, ** percent of the surveyed consumers selected poorly designed graphics or one-sided news as common indicators of online misinformation.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a repository for an ongoing data collection project for fake news research at ASU. We describe and compare FakeNewsNet with other existing datasets in Fake News Detection on Social Media: A Data Mining Perspective. We also perform a detailed analysis of the FakeNewsNet dataset and build a fake news detection model on it in Exploiting Tri-Relationship for Fake News Detection.
A JSON version of this dataset is available on GitHub. The new version of this dataset, described in FakeNewsNet, will be published soon, or you can email the authors for more information.
It includes all the fake news articles, with the news content attributes as follows:
It includes the social engagements of fake news articles from Twitter. We extract profiles, posts and social network information for all relevant users.
If you use this dataset, please cite the following papers:
@article{shu2017fake,
title={Fake News Detection on Social Media: A Data Mining Perspective},
author={Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan},
journal={ACM SIGKDD Explorations Newsletter},
volume={19},
number={1},
pages={22--36},
year={2017},
publisher={ACM}
}
@article{shu2017exploiting,
title={Exploiting Tri-Relationship for Fake News Detection},
author={Shu, Kai and Wang, Suhang and Liu, Huan},
journal={arXiv preprint arXiv:1712.07709},
year={2017}
}
@article{shu2018fakenewsnet,
title={FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media},
author={Shu, Kai and Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan},
journal={arXiv preprint arXiv:1809.01286},
year={2018}
}
GNU Lesser General Public License v3.0: http://www.gnu.org/licenses/lgpl-3.0.html
A dataset containing 79k articles of misinformation, fake news and propaganda:
- 34,975 'true' articles → MisinfoSuperset_TRUE.csv
- 43,642 articles of misinfo, fake news or propaganda → MisinfoSuperset_FAKE.csv
The 'true' articles come from a variety of sources, such as Reuters, the New York Times, the Washington Post and more.
The 'fake' articles are sourced from:
1. American right-wing extremist websites (such as Redflag Newsdesk, Breitbart, Truth Broadcast Network)
2. A previously released public dataset described in the following article: Ahmed H, Traore I, Saad S. (2017) "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques." In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).
3. Disinformation and propaganda cases collected by the EUvsDisinfo project, a project started in 2015 that identifies and fact-checks disinformation cases originating from pro-Kremlin media that are spread across the EU.
The articles have all information except the actual text removed and are split into a set with all the fake news / misinformation and one with all the true articles.
For those only interested in Russian propaganda (and not so much misinformation in general), the Russian propaganda cases have been added in a separate CSV called 'EXTRA_RussianPropagandaSubset.csv'.
--
Note: while this might immediately seem like a great classification task, I would suggest also considering clustering / topic modelling. Why clustering? Because clustering gives us a model that can match a newly written article to a previously debunked lie or misinformation narrative. A new article can then be debunked immediately (or at least linked to an actual fact-checked statement) without using an algorithm's verdict as the argument, and without the time delay of waiting for confirmation from a fact-checking organisation. A sketch of this matching idea follows.
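A minimal sketch of the narrative-matching idea, under stated assumptions: embed the debunked articles with TF-IDF and link a new article to its nearest neighbour. The "text" column name and the similarity threshold are assumptions, not part of the dataset description.

```python
# Hedged sketch: match a new article to the closest previously debunked one.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

fake = pd.read_csv("MisinfoSuperset_FAKE.csv")
vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
fake_vecs = vectorizer.fit_transform(fake["text"])  # "text" column assumed

def nearest_debunked(new_article, threshold=0.35):
    """Return (row index, similarity) of the closest debunked article,
    or None if nothing clears the (illustrative) threshold."""
    sims = cosine_similarity(vectorizer.transform([new_article]), fake_vecs)[0]
    best = sims.argmax()
    return (int(best), float(sims[best])) if sims[best] >= threshold else None
```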
An example disinformation project using this dataset can be found on https://stevenpeutz.com/disinformation/
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Fake News Detection Dataset is created to assist researchers, data scientists, and machine learning enthusiasts in tackling the challenge of distinguishing between genuine and false information in today's digital landscape inundated with social media and online channels. With thousands of news items labeled as either "Fake" or "Real," this dataset provides a robust foundation for training and testing machine learning models aimed at automatically detecting deceptive content.
Each entry in the dataset contains the full text of a news article alongside its corresponding label, facilitating the development of supervised learning projects. The inclusion of various types of content within the news articles, ranging from factual reporting to potentially misleading information or falsehoods, offers a comprehensive resource for algorithmic training.
The dataset's structure, with a clear binary classification of news articles as either "Fake" or "Real," enables the exploration of diverse machine learning approaches, from traditional methods to cutting-edge deep learning techniques.
By offering an accessible and practical dataset, the Fake News Detection Dataset aims to stimulate innovation in the ongoing battle against online misinformation. It serves as a catalyst for research and development within the realms of text analysis, natural language processing, and machine learning communities. Whether it's refining feature engineering, experimenting with state-of-the-art transformer models, or creating educational tools to enhance understanding of fake news, this dataset serves as an invaluable starting point for a wide range of impactful projects.
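As a hedged starting point for the kind of supervised project described above, the sketch below trains a TF-IDF + logistic regression baseline on the binary labels. The filename and the "text"/"label" column names are assumptions; adapt them to the actual files.

```python
# Minimal baseline sketch: TF-IDF features + logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("fake_news.csv")  # hypothetical filename
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42)

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```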
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as
@article{shahi2021overview,
title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
journal={Working Notes of CLEF},
year={2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.
Subtask 3A: Multi-class fake news detection of news articles (English). This subtask is designed as a four-class classification problem. The training data will be released in batches of roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and this categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem: the task is to categorise fake news articles into six topical categories such as health, election, crime, climate, and education. This subtask will be offered for a subset of the data of Subtask 3A.
Input Data
The data will be provided in the format of id, title, text, rating, and domain; the description of the columns is as follows:
Task 3a
Task 3b
Output data format
Task 3a
Sample File
public_id, predicted_rating
1, false
2, true
Task 3b
Sample File
public_id, predicted_domain
1, health
2, crime
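For concreteness, a small helper below writes runs in the sample format just shown; the example IDs and labels are placeholders, and the exact delimiter and header conventions should be checked against the organisers' sample files.

```python
# Sketch: write Task 3a / 3b predictions in the sample CSV format above.
import csv

def write_run(path, header, rows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

write_run("task3a_run.csv", ["public_id", "predicted_rating"],
          [(1, "false"), (2, "true")])
write_run("task3b_run.csv", ["public_id", "predicted_domain"],
          [(1, "health"), (2, "crime")])
```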
Additional data for Training
To train your model, participants can use additional data in a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
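The F1-macro ranking measure can be computed locally with scikit-learn; the label lists below are illustrative only.

```python
# F1-macro: the unweighted mean of the per-class F1 scores.
from sklearn.metrics import f1_score

y_true = ["false", "true", "partially false", "other", "false"]
y_pred = ["false", "true", "false", "other", "false"]
print(f1_score(y_true, y_pred, average="macro"))
```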
Submission Link: https://competitions.codalab.org/competitions/31238
Related Work
The internet and social media have led to a major problem: fake news. Fake news is false information presented as real news, often with the goal of tricking or influencing people. It is difficult to identify fake news because it can look very similar to real news. The Fake News Detection dataset deals with the problem indirectly by using tabular summary statistics about each news article to attempt to predict whether the article is real or fake. The dataset is in a tabular format and contains features such as word count, sentence length, unique words, average word length, and a label indicating whether the article is fake or real.
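To make the feature set concrete, here is a sketch of deriving this kind of tabular summary statistic from raw article text; the dataset's exact feature definitions may differ.

```python
# Sketch: compute simple summary statistics of the kind this dataset uses.
def summary_features(text):
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "word_count": len(words),
        "unique_words": len({w.lower() for w in words}),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }

print(summary_features("Fake news is false information presented as real news."))
```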
In response to a survey conducted in ************, ** percent of social media users in India reported having been misled by fake news circulated online about once or twice, which was slightly higher than among active internet users. Meanwhile, ** percent of all internet users had experienced this a few times. Notably, more than half the respondents claimed to have never been misled by fake news online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.
Usage notes:
- Misinformation detection, classification, tracking, prediction.
- Misinformation sentiment analysis.
- Rumor veracity classification, comment stance classification.
- Rumor tracking, social network analysis.
Data pre-processing and data analysis code is available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see full info in our GitHub link.
Cite us: Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.
@article{cheng2021covid,
title={A COVID-19 Rumor Dataset},
author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul},
journal={Frontiers in Psychology},
volume={12},
pages={1566},
year={2021},
publisher={Frontiers}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Today, we are producing more information than ever before, but not all of it is true. Some of it is actually malicious and harmful, and that makes it harder for us to trust any piece of information we come across. Not only that: bad actors can now use language modelling tools like OpenAI's GPT-2 to generate fake news too. Ever since its initial release, there has been discussion of how it could be misused for generating misleading news articles, automating the production of abusive or fake content for social media, and automating the creation of spam and phishing content.
How do we figure out what is true and what is fake? Can we do something about it?
The dataset consists of around 387,000 pieces of text sourced from various news articles on the web as well as texts generated by OpenAI's GPT-2 language model.
The dataset is split into train, validation and test such that each of the sets has an equal split of the two classes.
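One way to reproduce an equal class split across train/validation/test is a stratified two-step split, as in the sketch below; the filename and "label" column name are assumptions.

```python
# Sketch: stratified 70/15/15 train/validation/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("real_vs_generated.csv")  # hypothetical filename
train, rest = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=0)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=0)
print(len(train), len(val), len(test))
```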
This dataset was published on AIcrowd as part of the KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd that aims to make it easy for anyone to get started with the world of AI. This particular AI Blitz⚡ challenge was exclusive to the students and faculty of the Kalinga Institute of Industrial Technology.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains all replication files to perform the analysis in the manuscript and the online appendix.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Survey data collected in 2019 in Canada (n=1,539), covering seeing misinformation online, trust in the federal government, political ideology, and Facebook use.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset captures realistic simulations of news articles and social media posts circulating in 2024–2025, labeled for potential AI-generated misinformation.
It includes 500 rows × 31 columns, combining:
- Temporal features → date, time, month, day of week
- Text-based metadata → platform, region, language, topic
- Quantitative engagement metrics → likes, shares, comments, CTR, views
- Content quality indicators → sentiment polarity, toxicity score, readability index
- Fact-checking signals → credibility source score, manual check flag, claim verification status
- Target variable → is_misinformation (0 = authentic, 1 = misinformation)
This dataset is designed for machine learning, deep learning, NLP, data visualization, and predictive analysis research.
This dataset can be applied to multiple domains:
- 🧠 Machine Learning / Deep Learning: Binary classification of misinformation
- 📊 Data Visualization: Engagement trends, regional misinformation heatmaps
- 🔍 NLP Research: Fake news detection, text classification, sentiment-based filtering
- 🌐 PhD & Academic Research: AI misinformation studies, disinformation propagation models
- 📈 Model Evaluation: Feature engineering, ROC-AUC, precision-recall tradeoff
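As a hedged baseline for the classification and evaluation uses listed above, the sketch below fits a model on the quantitative columns and reports ROC-AUC. The filename and exact column names are assumptions; map them to the real schema before running.

```python
# Sketch: baseline misinformation classifier on the engagement/quality features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("ai_misinformation_2024_2025.csv")  # hypothetical filename
features = ["likes", "shares", "comments", "ctr", "views", "sentiment_polarity",
            "toxicity_score", "readability_index", "credibility_source_score"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["is_misinformation"], test_size=0.25,
    stratify=df["is_misinformation"], random_state=42)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)
print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```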
In Copyright (InC 1.0): http://rightsstatements.org/vocab/InC/1.0/
The replication material includes the do-files and datasets to replicate the results in the tables, figures and text of the main manuscript and the appendix. The code constructs the results from the field data and additional experiments we ran on Prolific and MTurk. The material contains 4 code files, all ending with ".do". The code was last run using Stata (version 18.0) on macOS. The replicator should expect the code to run in under 5 minutes on a standard (2024) desktop machine.
Background: The spread of misinformation has been linked to increased social divisions and adverse health outcomes, but less is known about the production of disinformation, which is misinformation intended to mislead.
Method: The main data used in this paper has been collected by the authors using the MTurk interface (Field Experiment) or Qualtrics (Manipulation Check, Downstream Consequences, and Platform Interventions). It is available in the replication package. Our survey design and selection eligibility are included in the Supplementary Document in this depository.
Results: In a field experiment on MTurk (N=1,197), we found that while 70% of workers accepted a control job, 61% accepted a disinformation job requiring them to manipulate COVID-19 data. To quantify the trade-off between ethical and financial considerations in job acceptance, we introduced a lower-pay condition offering half the wage of the control job; 51% of workers accepted this job, suggesting that the ethical compromise in the disinformation task reduced the acceptance rate by about the same amount as a 25% wage reduction. A survey experiment with a nationally representative sample shows that viewing a disinformation graph from the field experiment negatively affected people's beliefs and behavioral intentions related to the COVID-19 pandemic, including increased vaccine hesitancy.
Conclusion: Using a "wisdom-of-crowds" approach, we highlight how online labor markets can introduce features, such as increased worker accountability, to reduce the likelihood of workers engaging in the production of disinformation. Our findings emphasize the importance of addressing the supply side of disinformation in online labor markets to mitigate its harmful societal effects.
By downloading the data, you agree to the terms & conditions mentioned below:
Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.
Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try to identify the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics, or to share it with anyone else.
We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works from, distribute copies of, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access it. The data is provided free of charge.
Citation
Please cite our work as
@InProceedings{clef-checkthat:2022:task3,
author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
year = {2022},
booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
series = {CLEF~'2022},
address = {Bologna, Italy},
}
@article{shahi2021overview,
title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
journal={Working Notes of CLEF},
year={2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.
Task 3: Multi-class fake news detection of news articles (English). This task is designed as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches of roughly 1,264 articles with their respective labels in the English language. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other- An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Cross-Lingual Task (German)
Along with the multi-class task for the English language, we have introduced a task for a low-resource language: we will provide the test data in German. The idea of the task is to use the English data and the concept of transfer learning to build a classification model for the German language.
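One way to realise this transfer idea, as a sketch under stated assumptions: encode English and German articles into a shared multilingual embedding space, fit a classifier on the English labels, and predict on German. The checkpoint named below is a real sentence-transformers model, but using it here is an illustrative assumption, not the official baseline.

```python
# Sketch: cross-lingual transfer via multilingual sentence embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

en_texts = ["English article one...", "English article two..."]  # placeholders
en_labels = ["false", "true"]                                     # placeholders
de_texts = ["Deutscher Artikel..."]                               # placeholder

clf = LogisticRegression(max_iter=1000).fit(encoder.encode(en_texts), en_labels)
print(clf.predict(encoder.encode(de_texts)))
```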
Input Data
The data will be provided in the format of id, title, text, rating, and domain; the description of the columns is as follows:
Output data format
Sample File
public_id, predicted_rating
1, false
2, true
IMPORTANT!
Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498
Related Work