100+ datasets found
  1. Sharing of made-up news on social networks in the U.S. 2020

    • statista.com
    Cite
    Statista, Sharing of made-up news on social networks in the U.S. 2020 [Dataset]. https://www.statista.com/statistics/657111/fake-news-sharing-online/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 8, 2020
    Area covered
    United States
    Description

    A survey conducted in December 2020 assessing if news consumers in the United States had ever unknowingly shared fake news or information on social media found that 38.2 percent had done so. A similar share had not, whereas seven percent were unsure if they had accidentally disseminated misinformation on social networks.

    Fake news in the U.S.

    Fake news, or news that contains misinformation, has become a prevalent issue within the American media landscape. Fake news can be circulated online as news stories with deliberately misleading headlines, or clickbait, but the rise of misinformation cannot be solely attributed to online social media. Forms of fake news are also found in print media, with 47 percent of Americans witnessing fake news in newspapers and magazines as of January 2019.

    News consumers in the United States are aware of the spread of misinformation, with many Americans believing online news websites regularly report fake news stories. With such a high volume of online news websites publishing false information, it can be difficult to assess the credibility of a story. This can have damaging effects on society, as the public struggles to stay informed, creating a great deal of confusion about even basic facts and contributing to incivility.

  2. Children reading fake news online United Kingdom (UK) 2024

    • statista.com
    Updated Nov 27, 2025
    Cite
    Statista (2025). Children reading fake news online United Kingdom (UK) 2024 [Dataset]. https://www.statista.com/statistics/1268671/children-reading-fake-news-online-united-kingdom-uk/
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United Kingdom
    Description

    A 2024 study on news consumption among children in the United Kingdom found that ** percent of respondents aged 12 to 15 years old had come across deliberately untrue or misleading news online or on social media in the year before the survey was conducted. ** percent said they had not seen any false news.

  3. Social Media Misinformation Statistics 2025: How Social Platforms Amplify...

    • sqmagazine.co.uk
    Updated Oct 3, 2025
    Cite
    SQ Magazine (2025). Social Media Misinformation Statistics 2025: How Social Platforms Amplify False Content (with Data) [Dataset]. https://sqmagazine.co.uk/social-media-misinformation-statistics/
    Explore at:
    Dataset updated
    Oct 3, 2025
    Dataset authored and provided by
    SQ Magazine
    License

    https://sqmagazine.co.uk/privacy-policy/

    Time period covered
    Jan 1, 2024 - Dec 31, 2025
    Area covered
    Global
    Description

    In the spring of 2020, a simple tweet claimed that sipping hot water every 15 minutes could kill the coronavirus. No medical source backed it, yet the post quickly amassed over 150,000 shares. Fast forward to 2025, and we’ve learned that misinformation online is not a bug; it’s a system...

  4. Data from: Anatomy of an online misinformation network

    • data.niaid.nih.gov
    Updated Aug 3, 2021
    Cite
    Chengcheng Shao; Pik-Mai Hui; Lei Wang; Xinwen Jiang; Alessandro Flammini; Filippo Menczer; Giovanni Luca Ciampaglia (2021). Anatomy of an online misinformation network [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1154839
    Explore at:
    Dataset updated
    Aug 3, 2021
    Dataset provided by
    The MOE Key Laboratory of Intelligent Computing and Information Processing, Xiangtan University, China
    School of Informatics, Computing, and Engineering, Indiana University, Bloomington, USA
    Indiana University Network Science Institute, Bloomington, USA
    College of Computer, National University of Defense Technology, China
    Authors
    Chengcheng Shao; Pik-Mai Hui; Lei Wang; Xinwen Jiang; Alessandro Flammini; Filippo Menczer; Giovanni Luca Ciampaglia
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset is provided to facilitate reproducibility of results presented in the following paper:

    Chengcheng Shao, Pik-Mai Hui, Lei Wang, Xinwen Jiang, Alessandro Flammini, Filippo Menczer and Giovanni Luca Ciampaglia (2018): Anatomy of an online misinformation network. Preprint arXiv:1801.06122, arxiv.org/abs/1801.06122

    Please read carefully both the paper and the README file attached to understand what is contained in this dataset before proceeding. These data are provided for non-commercial purposes only. If you use this dataset for research, please be sure to cite the above preprint, or preferably the final published version that will be shown on the arXiv.

  5. AMMeBa: Annotated Misinformation, Media-Based

    • kaggle.com
    zip
    Updated Apr 24, 2024
    Cite
    Google AI (2024). AMMeBa: Annotated Misinformation, Media-Based [Dataset]. https://www.kaggle.com/datasets/googleai/in-the-wild-misinformation-media
    Explore at:
    zip (48436539 bytes)
    Dataset updated
    Apr 24, 2024
    Dataset authored and provided by
    Google AI (http://ai.google/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is discussed in far more detail in the corresponding paper, AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild.

    Background

    The rise of convincing, photorealistic AI-generated images and video has heightened already intense concern over online misinformation and its associated harms. However, despite huge coverage in the press and interest from the general public, it's not clear whether AI is widely used in misinformation. In fact, there is little systematic data available whatsoever about the forms misinformation takes online, the use of images and video in misinformation contexts, and what types of manipulations are taking place.


    The AMMeBa (Annotated Misinformation, Media-Based) dataset seeks to provide a survey of online misinformation, allowing first-of-its-kind quantification of manipulations like deepfakes and photoshopped media as well as trends in how those populations are changing over time.

    Recognizing the enormous value and work of fact checkers, AMMeBa uses publicly available fact checks to identify misinformation claims, which were annotated by highly trained human annotators to provide a detailed characterization of each claim. Media-based misinformation, which uses images, video and audio to bolster a claim, is a particular focus, especially images.

    Annotations took place over two years. The resulting dataset comprises millions of individual hand-applied labels, applied to over a hundred thousand English-language fact checks published between 1995 and today. More than fifty thousand misinfo-associated images were identified and annotated.

    Findings

    • Online misinformation is popularly conceptualized as false claims and rumors rendered in text. Our data indicates that the majority of misinformation (recently, about 80%) involves media of some kind: images, video, or audio.
    • Images are historically the most common type of media associated with misinformation. However, in the past two years, video-based misinformation has become increasingly common and is now the most common type of media associated with misinformation.
    • Among images, screenshots are common, peaking at about 1/5th of misinformation-associated images. The majority of these are screenshots of social media posts; nearly 20% are screenshots of fake social media posts.
    • While image-based misinformation is commonly thought of as consisting of photoshop-like manipulations, or, more recently, AI-generated content, our data show that the most common type historically is context manipulation without any pixel manipulation, i.e., the original, unedited image is shown alongside a false claim about what that image shows.
      • The prevalence of technologically simple context manipulations underscores the fact that misinformation does not need to be sophisticated or elaborate to be effective.
    • While widespread concern around the use of deepfakes in misinformation began in 2018, our data show that AI-generated content was a negligible proportion of overall image-based misinfo until early 2023, when it exploded in popularity. By the time data annotation ended, it accounted for nearly 30% of all fact checked content manipulations.

    Dataset Notes

    Image URLs

    Image URLs were obtained in a best-effort manner. We provide them as a possible pointer to the correct image. However, URLs are absent for several reasons:

    1. Attrition: The image has been removed from that location; see "Data Attrition" in the paper. We are working to identify other versions of the images, if available, and will make them available in dataset updates.
    2. URL Dynamism: The images were obtained by following a fact check link to the original page or an archived version of it. Some pages, particularly archival services, dynamically generate image URLs on load or update the URLs periodically. This instability in the URL means collected URLs are soon useless for these images.

    In the majority of cases, though, the URL under misinfo_source in all provided CSVs will point to the page where the image occurred, and in general the images are still present (this is checked explicitly by raters when a fact check / source is passed to a subsequent stage, like Stage 1M → Stage 2M). If the entry is not "disqualified," then the image was present on the page at the time of subsequent annotation, and may still be fetchable by matching against the provided hashes.

    Image Hashes

    To allow users to fetch the images themselves, we provide three hashes of the image data. These hashes use the open-source "imagehash" Image Hashing Library from Github ([README, with explanat...
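
    To make the hash matching concrete, here is a minimal sketch using the open-source "imagehash" library named above (pip install ImageHash). The CSV path, the hash column name ("phash") and the distance threshold are assumptions for illustration; the dataset README documents the actual hash types and field names. The misinfo_source column is the one described above.

    import imagehash
    import pandas as pd
    from PIL import Image

    rows = pd.read_csv("ammeba_images.csv")  # hypothetical local export of the provided CSVs
    candidate = imagehash.phash(Image.open("downloaded_image.jpg"))

    # ImageHash objects support subtraction, which returns the Hamming distance;
    # a small distance (e.g. <= 4) usually means the same underlying image.
    rows["phash_distance"] = rows["phash"].map(lambda s: candidate - imagehash.hex_to_hash(s))
    matches = rows[rows["phash_distance"] <= 4]
    print(matches[["misinfo_source", "phash_distance"]])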

  6. UK: digitally-altered and AI generated content and online misinformation...

    • statista.com
    Cite
    Statista, UK: digitally-altered and AI generated content and online misinformation 2024 [Dataset]. https://www.statista.com/statistics/1489655/uk-digitally-altered-ai-generated-content-online-misinformation/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 1, 2024 - May 2, 2024
    Area covered
    United Kingdom
    Description

    According to a survey conducted in the United Kingdom in May 2024, 75 percent of adults thought that digitally-altered content contributed to the spread of online misinformation. Additionally, 67 percent felt that AI-generated content contributed to the spread of misinformation on online platforms.

  7. Ways that consumers identify online misinformation India 2023

    • statista.com
    Updated May 15, 2023
    Cite
    Statista (2023). Ways that consumers identify online misinformation India 2023 [Dataset]. https://www.statista.com/statistics/1406290/india-fake-news-indicators/
    Explore at:
    Dataset updated
    May 15, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Mar 2023
    Area covered
    India
    Description

    In a digital news consumption survey conducted in India in March 2023, ** percent of respondents stated that observing how news spreads and its absence from other digital platforms was a common method they used to spot online misinformation. In comparison, ** percent of the surveyed consumers selected poorly designed graphics or one-sided news as common indicators of online misinformation.

  8. FakeNewsNet

    • kaggle.com
    • dataverse.harvard.edu
    zip
    Updated Nov 2, 2018
    Cite
    Deepak Mahudeswaran (2018). FakeNewsNet [Dataset]. https://www.kaggle.com/mdepak/fakenewsnet
    Explore at:
    zip (17409594 bytes)
    Dataset updated
    Nov 2, 2018
    Authors
    Deepak Mahudeswaran
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    FakeNewsNet

    This is a repository for an ongoing data collection project for fake news research at ASU. We describe and compare FakeNewsNet with other existing datasets in Fake News Detection on Social Media: A Data Mining Perspective. We also perform a detailed analysis of the FakeNewsNet dataset and build a fake news detection model on it in Exploiting Tri-Relationship for Fake News Detection.

    A JSON version of this dataset is available on GitHub here. The new version of this dataset, described in FakeNewsNet, will be published soon; you can also email the authors for more information.

    News Content

    It includes all the fake news articles, with the news content attributes as follows:

    1. source: the author or publisher of the news article.
    2. headline: short text that aims to catch the attention of readers and relates to the major topic of the news story.
    3. body_text: the details of the news story; usually there is a major claim that shapes the publisher's angle and is specifically highlighted and elaborated upon.
    4. image_video: visual content within the body of the news article, which provides visual cues to frame the story.

    Social Context

    It includes the social engagements of fake news articles from Twitter. We extract profiles, posts and social network information for all relevant users; a brief loading sketch follows the list below.

    1. user_profile: a set of profile fields that describe each user's basic information.
    2. user_content: the users' recent posts on Twitter.
    3. user_followers: the follower list of the relevant users.
    4. user_followees: the list of users that are followed by the relevant users.
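
    A minimal loading sketch for the structure described above, assuming records shaped like the JSON version of the dataset; the file name and exact nesting are illustrative assumptions, so consult the repository README for the authoritative schema.

    import json

    with open("fakenewsnet_sample.json") as fh:  # hypothetical local export
        articles = json.load(fh)

    for article in articles:
        # News content attributes listed above
        source = article.get("source")
        headline = article.get("headline") or ""
        body_text = article.get("body_text") or ""

        # Social context: engagements collected from Twitter
        for user in article.get("social_context", {}).get("user_profile", []):
            print(source, headline[:40], user.get("id"))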

    References

    If you use this dataset, please cite the following papers:

    @article{shu2017fake, title={Fake News Detection on Social Media: A Data Mining Perspective}, author={Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan}, journal={ACM SIGKDD Explorations Newsletter}, volume={19}, number={1}, pages={22--36}, year={2017}, publisher={ACM} }

    @article{shu2017exploiting, title={Exploiting Tri-Relationship for Fake News Detection}, author={Shu, Kai and Wang, Suhang and Liu, Huan}, journal={arXiv preprint arXiv:1712.07709}, year={2017} }

    @article{shu2018fakenewsnet, title={FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media}, author={Shu, Kai and Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan}, journal={arXiv preprint arXiv:1809.01286}, year={2018} }

  9. Misinformation & Fake News text dataset 79k

    • kaggle.com
    zip
    Updated May 9, 2022
    Cite
    steven (2022). Misinformation & Fake News text dataset 79k [Dataset]. https://www.kaggle.com/datasets/stevenpeutz/misinformation-fake-news-text-dataset-79k
    Explore at:
    zip (88691612 bytes)
    Dataset updated
    May 9, 2022
    Authors
    steven
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Misinformation, fake news & propaganda data set

    A dataset containing 79k articles of misinformation, fake news and propaganda:

    • 34,975 'true' articles → MisinfoSuperset_TRUE.csv
    • 43,642 articles of misinfo, fake news or propaganda → MisinfoSuperset_FAKE.csv

    The 'true' articles come from a variety of sources, such as Reuters, the New York Times, the Washington Post and more.

    The 'fake' articles are sourced from:

    1. American right-wing extremist websites (such as Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    2. A previously published public dataset described in the following article: Ahmed H., Traore I., Saad S. (2017) "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques." In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).
    3. Disinformation and propaganda cases collected by the EUvsDisinfo project, started in 2015, which identifies and fact-checks disinformation cases originating from pro-Kremlin media and spread across the EU.

    All information except the actual text has been removed from the articles, which are split into one set containing the fake news / misinformation and one containing all the true articles.

    // For those only interested in Russian propaganda (and not so much misinformation in general), I have added the Russian propaganda in a separate csv called 'EXTRA_RussianPropagandaSubset.csv'.

    --

    Note: While this might immediately seem like a classification task, I would suggest also considering clustering / topic modelling. Why clustering? Because clustering lets us build a model that matches a newly written article to a previously debunked lie or misinformation narrative. That way a new article can be debunked immediately (or at least linked to an actual fact-checked statement) without using "an algorithm said so" as the argument, and without the time delay of waiting for confirmation from a fact-checking organisation.
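
    A minimal sketch of that matching idea, assuming a column named "text" in MisinfoSuperset_FAKE.csv (adjust to the actual column name): embed the debunked articles with TF-IDF and retrieve the nearest previously seen narratives for a newly written article.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    fake = pd.read_csv("MisinfoSuperset_FAKE.csv")
    vectorizer = TfidfVectorizer(max_features=50_000, stop_words="english")
    fake_vectors = vectorizer.fit_transform(fake["text"])

    # Cosine nearest-neighbour index over the debunked articles
    index = NearestNeighbors(n_neighbors=3, metric="cosine").fit(fake_vectors)

    new_article = ["Text of a newly written article goes here ..."]
    distances, neighbors = index.kneighbors(vectorizer.transform(new_article))
    print(fake.iloc[neighbors[0]]["text"])  # candidate matches to debunked narratives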

    An example disinformation project using this dataset can be found on https://stevenpeutz.com/disinformation/

    Enjoy! You have chosen an incredibly important topic for your project!

  10. News Detection (Fake or Real) Dataset

    • kaggle.com
    zip
    Updated Apr 17, 2024
    Cite
    Nitish Jolly (2024). News Detection (Fake or Real) Dataset [Dataset]. https://www.kaggle.com/datasets/nitishjolly/news-detection-fake-or-real-dataset
    Explore at:
    zip (9823999 bytes)
    Dataset updated
    Apr 17, 2024
    Authors
    Nitish Jolly
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Fake News Detection Dataset is created to assist researchers, data scientists, and machine learning enthusiasts in tackling the challenge of distinguishing between genuine and false information in today's digital landscape inundated with social media and online channels. With thousands of news items labeled as either "Fake" or "Real," this dataset provides a robust foundation for training and testing machine learning models aimed at automatically detecting deceptive content.

    Each entry in the dataset contains the full text of a news article alongside its corresponding label, facilitating the development of supervised learning projects. The inclusion of various types of content within the news articles, ranging from factual reporting to potentially misleading information or falsehoods, offers a comprehensive resource for algorithmic training.

    The dataset's structure, with a clear binary classification of news articles as either "Fake" or "Real," enables the exploration of diverse machine learning approaches, from traditional methods to cutting-edge deep learning techniques.
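
    As a starting point, a minimal baseline for that binary split might look like the sketch below, assuming columns named "text" and "label" and an illustrative file name; check the actual schema on Kaggle before running.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("fake_or_real_news.csv")  # hypothetical file name
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=0
    )

    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))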

    By offering an accessible and practical dataset, the Fake News Detection Dataset aims to stimulate innovation in the ongoing battle against online misinformation. It serves as a catalyst for research and development within the realms of text analysis, natural language processing, and machine learning communities. Whether it's refining feature engineering, experimenting with state-of-the-art transformer models, or creating educational tools to enhance understanding of fake news, this dataset serves as an invaluable starting point for a wide range of impactful projects.

  11. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    + more versions
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and this categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain. This is a classification problem: the task is to categorise fake news articles into six topical categories (e.g., health, election, crime, climate, education). This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    Task 3a

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Task 3b

    • public_id- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • domain - domain of the given news article (applicable only for task B)

    Output data format

    Task 3a

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id- Unique identifier of the news article
    • predicted_domain- predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime

    Additional data for Training

    To train your model, participants can use additional data with a similar format; some datasets are available over the web. We don't provide the background truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

    IMPORTANT!

    1. The fake news articles used for task 3b are a subset of those used for task 3a.
    2. We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
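
    For orientation, here is a minimal sketch of the ranking measure (macro-averaged F1) and of writing a run file in the sample layout shown above; the gold and predicted labels below are placeholders, not task data.

    import pandas as pd
    from sklearn.metrics import f1_score

    gold = ["false", "true", "partially false", "other"]
    pred = ["false", "true", "false", "other"]
    print("F1-macro:", f1_score(gold, pred, average="macro"))

    # Run file in the "public_id, predicted_rating" layout from the sample above
    pd.DataFrame({"public_id": [1, 2, 3, 4], "predicted_rating": pred}).to_csv(
        "subtask3a_run.csv", index=False
    )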

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, "FakeCovid – a multilingual cross-domain fact check news dataset for covid-19," in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

  12. Fake News Detection Data

    • kaggle.com
    zip
    Updated Apr 27, 2024
    Cite
    Tasnim Niger (2024). Fake News Detection Data [Dataset]. https://www.kaggle.com/datasets/tasnimniger/fake-news-detection-data
    Explore at:
    zip (55829 bytes)
    Dataset updated
    Apr 27, 2024
    Authors
    Tasnim Niger
    Description

    The internet and social media have led to a major problem—fake news. Fake news is false information presented as real news, often with the goal of tricking or influencing people. It's difficult to identify fake news because it can look very similar to real news. The Fake News detection dataset deals with the problem indirectly by using tabular summary statistics about each news article to attempt to predict whether the article is real or fake. This dataset is in a tabular format and contains features such as word count, sentence length, unique words, average word length, and a label indicating whether the article is fake or real.
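
    For readers who want to apply a model trained on this dataset to new articles, a minimal sketch of deriving comparable summary statistics from raw text follows; the exact feature definitions used in the dataset may differ, so treat these as illustrative.

    def summary_features(text: str) -> dict:
        """Compute simple tabular statistics of the kind described above."""
        words = text.split()
        sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
        return {
            "word_count": len(words),
            "unique_words": len(set(w.lower() for w in words)),
            "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
            "avg_sentence_length": len(words) / max(len(sentences), 1),
        }

    print(summary_features("Fake news is false information presented as real news."))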

  13. Experience of being misled by misinformation online India 2022

    • statista.com
    Updated Jun 12, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2023). Experience of being misled by misinformation online India 2022 [Dataset]. https://www.statista.com/statistics/1388664/india-frequency-of-being-misled-by-fake-news-online/
    Explore at:
    Dataset updated
    Jun 12, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2022
    Area covered
    India
    Description

    In response to a survey conducted in ************, ** percent of social media users in India reported having been misled by fake news circulated online about once or twice, a share slightly higher than among active internet users. Meanwhile, ** percent of all internet users had experienced this a few times. Notably, more than half of the respondents claimed never to have been misled by fake news online.

  14. COVID-19 rumor dataset

    • figshare.com
    html
    Updated Jun 10, 2023
    Cite
    cheng (2023). COVID-19 rumor dataset [Dataset]. http://doi.org/10.6084/m9.figshare.14456385.v2
    Explore at:
    html
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    cheng
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.

    Usage notes:

    • Misinformation detection, classification, tracking, prediction.
    • Misinformation sentiment analysis.
    • Rumor veracity classification, comment stance classification.
    • Rumor tracking, social network analysis.

    Data pre-processing and data analysis codes available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see full info in our GitHub link.

    Cite us: Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.

    @article{cheng2021covid, title={A COVID-19 Rumor Dataset}, author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul}, journal={Frontiers in Psychology}, volume={12}, pages={1566}, year={2021}, publisher={Frontiers} }

  15. Fake News data set

    • kaggle.com
    zip
    Updated Dec 17, 2021
    Cite
    Bjørn-Jostein (2021). Fake News data set [Dataset]. https://www.kaggle.com/datasets/bjoernjostein/fake-news-data-set
    Explore at:
    zip (56446259 bytes)
    Dataset updated
    Dec 17, 2021
    Authors
    Bjørn-Jostein
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Today we are producing more information than ever before, but not all of it is true; some of it is actually malicious and harmful, which makes it harder for us to trust any piece of information we come across. On top of that, bad actors can now use language-modelling tools like OpenAI's GPT-2 to generate fake news. Ever since its initial release, there have been concerns about how it could be misused to generate misleading news articles, automate the production of abusive or fake content for social media, and automate the creation of spam and phishing content.

    How do we figure out what is true and what is fake? Can we do something about it?

    Content

    The dataset consists of around 387,000 pieces of text, which have been sourced from various news articles on the web as well as texts generated by OpenAI's GPT-2 language model.

    The dataset is split into train, validation and test such that each of the sets has an equal split of the two classes.

    Acknowledgements

    This dataset was published on AIcrowd as part of the KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd that aims to make it easy for anyone to get started with the world of AI. This particular AI Blitz⚡ challenge was exclusive to the students and faculty of the Kalinga Institute of Industrial Technology.

  16. Replication Data for: Trends in the Diffusion of Misinformation on Social...

    • dataverse.harvard.edu
    Updated Jun 12, 2023
    Cite
    Hunt Allcott; Matthew Gentzkow; Chuan Yu (2023). Replication Data for: Trends in the Diffusion of Misinformation on Social Media [Dataset]. http://doi.org/10.7910/DVN/YAR9FU
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hunt Allcott; Matthew Gentzkow; Chuan Yu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains all replication files to perform the analysis in the manuscript and the online appendix.

  17. Replication Data for: Seeing Misinformation and Trust, Political Ideology...

    • borealisdata.ca
    • search.dataone.org
    Updated Apr 17, 2023
    Cite
    Trish Anderson (2023). Replication Data for: Seeing Misinformation and Trust, Political Ideology and Facebook Use [Dataset]. http://doi.org/10.5683/SP3/MHNHBV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Borealis
    Authors
    Trish Anderson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Survey data collected in 2019 in Canada (n=1539). Seeing misinformation online, trust in federal government, political ideology and Facebook use.

  18. Gen AI Misinformation Detection Data (2024–2025)

    • kaggle.com
    zip
    Updated Sep 23, 2025
    Cite
    Atharva Soundankar (2025). Gen AI Misinformation Detection Data (2024–2025) [Dataset]. https://www.kaggle.com/datasets/atharvasoundankar/gen-ai-misinformation-detection-datase-20242025
    Explore at:
    zip (32023 bytes)
    Dataset updated
    Sep 23, 2025
    Authors
    Atharva Soundankar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset captures realistic simulations of news articles and social media posts circulating between 2024–2025, labeled for potential AI-generated misinformation.

    It includes 500 rows × 31 columns, combining:
    - Temporal features → date, time, month, day of week
    - Text-based metadata → platform, region, language, topic
    - Quantitative engagement metrics → likes, shares, comments, CTR, views
    - Content quality indicators → sentiment polarity, toxicity score, readability index
    - Fact-checking signals → credibility source score, manual check flag, claim verification status
    - Target variable → is_misinformation (0 = authentic, 1 = misinformation)

    This dataset is designed for machine learning, deep learning, NLP, data visualization, and predictive analysis research.
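
    A minimal classification sketch along those lines, assuming an illustrative CSV file name and, for simplicity, using only the numeric columns; is_misinformation is the target column named above.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("gen_ai_misinformation_2024_2025.csv")  # hypothetical file name
    y = df["is_misinformation"]
    X = df.drop(columns=["is_misinformation"]).select_dtypes("number")

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
    print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))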

    🎯 Use Cases

    This dataset can be applied to multiple domains:
    - 🧠 Machine Learning / Deep Learning: Binary classification of misinformation
    - 📊 Data Visualization: Engagement trends, regional misinformation heatmaps
    - 🔍 NLP Research: Fake news detection, text classification, sentiment-based filtering
    - 🌐 PhD & Academic Research: AI misinformation studies, disinformation propagation models
    - 📈 Model Evaluation: Feature engineering, ROC-AUC, precision-recall tradeoff

  19. Disinformation for Hire

    • datarepository.eur.nl
    pdf
    Updated Dec 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jan Stoop; Alain Cohn (2024). Disinformation for Hire [Dataset]. http://doi.org/10.25397/eur.27868341.v3
    Explore at:
    pdf
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Erasmus University Rotterdam (EUR)
    Authors
    Jan Stoop; Alain Cohn
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    The replication material includes the do files and datasets to replicate the results in tables, figures and text of the main manuscript and the appendix. The code constructs the results from the field data and additional experiments we ran on Prolific and MTurk. The material contains 4 code files, all ending with ".do". The code was last run using Stata (version 18.0) on MacOS. The replicator should expect the code to run under 5 minutes on a standard (2024) desktop machine.

    Background: The spread of misinformation has been linked to increased social divisions and adverse health outcomes, but less is known about the production of disinformation, which is misinformation intended to mislead.

    Method: The main data used in this paper has been collected by the authors using the MTurk interface (Field Experiment) or Qualtrics (Manipulation Check, Downstream Consequences, and Platform Interventions). It is available in the replication package. Our survey design and selection eligibility are included in the Supplementary Document in this depository.

    Results: In a field experiment on MTurk (N=1,197), we found that while 70% of workers accepted a control job, 61% accepted a disinformation job requiring them to manipulate COVID-19 data. To quantify the trade-off between ethical and financial considerations in job acceptance, we introduced a lower-pay condition offering half the wage of the control job; 51% of workers accepted this job, suggesting that the ethical compromise in the disinformation task reduced the acceptance rate by about the same amount as a 25% wage reduction. A survey experiment with a nationally representative sample shows that viewing a disinformation graph from the field experiment negatively affected people's beliefs and behavioral intentions related to the COVID-19 pandemic, including increased vaccine hesitancy.

    Conclusion: Using a "wisdom-of-crowds" approach, we highlight how online labor markets can introduce features, such as increased worker accountability, to reduce the likelihood of workers engaging in the production of disinformation. Our findings emphasize the importance of addressing the supply side of disinformation in online labor markets to mitigate its harmful societal effects.

  20. CT-FAN: A Multilingual dataset for Fake News Detection

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 23, 2022
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.6555293
    Explore at:
    zip
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Juliane Köhler; Michael Wiegand; Melanie Siegel
    Description

    By downloading the data, you agree with the terms & conditions mentioned below:

    Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

    Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

    We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

    Citation

    Please cite our work as

    @InProceedings{clef-checkthat:2022:task3,
    author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
    title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
    year = {2022},
    booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
    series = {CLEF~'2022},
    address = {Bologna, Italy},}
    
    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Task 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 1,264 English-language articles with their respective labels. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other- An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Cross-Lingual Task (German)

    Along with the multi-class task for the English language, we have introduced a task for a low-resourced language. We will provide the test data in German. The idea of the task is to use the English data and the concept of transfer learning to build a classification model for the German language.
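
    One possible sketch of that transfer setup (not the official baseline): encode articles with a multilingual sentence-embedding model, fit a classifier on the English training data, and predict on German test articles. It assumes the sentence-transformers package; the texts and labels below are placeholders.

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    english_texts = ["First English training article ...", "Second English training article ..."]
    english_labels = ["false", "true"]  # labels drawn from: false / partially false / true / other
    german_texts = ["Deutscher Testartikel ..."]

    clf = LogisticRegression(max_iter=1000).fit(encoder.encode(english_texts), english_labels)
    print(clf.predict(encoder.encode(german_texts)))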

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Output data format

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    IMPORTANT!

    1. We have used the data from 2010 to 2022, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

    Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, "FakeCovid – a multilingual cross-domain fact check news dataset for covid-19," in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
    • Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
    • Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.