CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection, analysed according to eleven main characteristics: news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This service detects fake news in German texts about COVID-19. It uses a German BERT model as a binary text classifier. The result is given as a probability between 0 and 1: how likely is the information in the text to be reliable, i.e., free of fake news?
The model was trained on the FANG-COVID dataset. The dataset contains 41,242 documents labeled as either real (68%) or fake (32%). The ground truth was derived from automatic annotation based on the publication platform of each text (newspapers, websites, etc.). The publication platforms were assigned global labels (real or fake) as introduced by independent organizations such as Correctiv or NewsGuard.
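A minimal sketch of how such a binary classifier's raw score can be turned into the reliability probability described above. The sigmoid squashing is the standard choice for binary heads; the logit values below are invented for illustration:

```python
import math

def reliability_probability(logit: float) -> float:
    """Map a raw binary-classifier logit to a probability in (0, 1).

    Higher values mean the text is more likely to be reliable,
    i.e., free of fake news.
    """
    return 1.0 / (1.0 + math.exp(-logit))

# A positive logit maps above 0.5 (leans "reliable"),
# a negative one below 0.5 (leans "fake").
print(reliability_probability(2.0))   # ~0.88
print(reliability_probability(-2.0))  # ~0.12
```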
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles, with 35,028 real and 37,106 fake items. For this, we merged four popular news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political) to prevent overfitting of classifiers and to provide more text data for better ML training.
The dataset contains four columns: serial number (starting from 0), title (the news headline), text (the news content), and label (0 = fake, 1 = real).
The CSV file contains 78,098 entries, of which only 72,134 are usable once loaded into a data frame.
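The gap between raw rows and usable entries is typical of merged CSVs: rows with missing fields get dropped on load. A minimal sketch using the four-column layout described above (the sample rows are invented):

```python
import csv
import io

# Invented sample rows in the WELFake column layout:
# serial number, title, text, label (0 = fake, 1 = real).
raw = io.StringIO(
    "0,Headline A,Full article text A,1\n"
    "1,Headline B,,0\n"  # missing text -> dropped on load
    "2,Headline C,Full article text C,0\n"
)

rows = list(csv.reader(raw))
usable = [r for r in rows if all(field.strip() for field in r)]

print(len(rows), len(usable))  # 3 2
```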
This dataset is part of our ongoing research on "Fake News Prediction on Social Media Website", conducted as part of the doctoral degree program of Mr. Pawan Kumar Verma, and is partially supported by the ARTICONF project funded by the European Union's Horizon 2020 research and innovation program.
This dataset was created by Ganesh
Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so they must be used for research purposes only. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .
Citation
Please cite our work as
@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially false, false, or other (e.g., claims in dispute), and detect the topical domain of the article. This task will run in English and German.
Subtask 3: Multi-class fake news detection of news articles (English). This subtask frames fake news detection as a four-class classification problem. The training data will be released in batches of roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially false, false, or other. Our definitions of the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes disputed and unproven articles.
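Fact-checking services use many fine-grained verdicts, so a normalisation step is usually needed before training. A sketch mapping verdict strings onto the four classes above (the verdict names follow the examples given in the "Partially False" definition; the mapping itself is illustrative, not official):

```python
# Map fine-grained fact-checker verdicts onto the task's four classes.
# Per the category definitions, verdicts like "mostly true" or
# "miscaptioned" all fall under "partially false".
VERDICT_TO_CLASS = {
    "true": "true",
    "partially false": "partially false",
    "partially true": "partially false",
    "mostly true": "partially false",
    "miscaptioned": "partially false",
    "misleading": "partially false",
    "false": "false",
    "in dispute": "other",
    "unproven": "other",
}

def normalise(verdict: str) -> str:
    # Unknown verdicts fall back to "other" (lack of evidence).
    return VERDICT_TO_CLASS.get(verdict.strip().lower(), "other")

print(normalise("Mostly True"))  # partially false
print(normalise("some unseen verdict"))  # other
```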
Input Data
The data will be provided in the format id, title, text, rating, domain; the columns are described as follows:
Task 3
Output data format
Task 3
Sample File
public_id, predicted_rating
1, false
2, true
Sample File
public_id, predicted_domain
1, health
2, crime
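Both sample files above share the same two-column shape; writing a submission with the csv module is straightforward (the predictions here are placeholders):

```python
import csv
import io

# Placeholder predictions keyed by public_id.
ratings = {1: "false", 2: "true"}

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["public_id", "predicted_rating"])
for public_id, rating in sorted(ratings.items()):
    writer.writerow([public_id, rating])

print(buf.getvalue())
```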
Additional data for Training
To train your model, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs in total (not per day), and only one person per team is allowed to submit runs.
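Macro-averaged F1 weights all four classes equally regardless of their frequency. A self-contained version for checking local runs (the example labels are invented):

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

y_true = ["false", "true", "partially false", "false", "other"]
y_pred = ["false", "true", "false", "false", "other"]
print(f1_macro(y_true, y_pred))  # 0.7
```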
Submission Link: Coming soon
Related Work
MM-COVID is a dataset for fake news detection related to COVID-19. It provides multilingual fake news together with the relevant social context. It contains 3,981 pieces of fake news content and 7,192 pieces of trustworthy information in six languages: English, Spanish, Portuguese, Hindi, French, and Italian.
For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.
The dataset has been integrated with PyTorch Geometric (PyG) and the Deep Graph Library (DGL). You can load the dataset after installing the latest version of either PyG or DGL.
The UPFD dataset includes two sets of tree-structured graphs curated for evaluating binary graph classification, graph anomaly detection, and fake/real news detection tasks. The dataset is provided as a PyTorch Geometric dataset object, so you can easily load the data and run various GNN models using PyG.
The dataset includes fake and real news propagation (retweet) networks on Twitter, built according to fact-check information from Politifact and Gossipcop. The news retweet graphs were originally extracted by FakeNewsNet. Each graph is a hierarchical tree-structured graph where the root node represents the news and the leaf nodes are Twitter users who retweeted it. A user node has an edge to the news node if they retweeted the news tweet. Two user nodes have an edge if one user retweeted the news tweet from the other.
We crawled nearly 20 million historical tweets from users who participated in fake news propagation in FakeNewsNet to generate node features for the dataset. We incorporate four node feature types: the 768-dimensional BERT and 300-dimensional spaCy features are encoded using pretrained BERT and spaCy word2vec, respectively; the 10-dimensional profile feature is obtained from a Twitter account's profile (see profile_feature.py for the extraction); and the 310-dimensional content feature is composed of a 300-dimensional user-comment word2vec (spaCy) embedding concatenated with the 10-dimensional profile feature.
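The 310-dimensional content feature is a plain concatenation of the two pieces described above; schematically (random placeholder vectors stand in for the real embeddings):

```python
import random

random.seed(0)

# Placeholder stand-ins for the real embeddings described above.
comment_word2vec = [random.random() for _ in range(300)]  # spaCy word2vec
profile_feature = [random.random() for _ in range(10)]    # Twitter profile

# The 310-dimensional content feature is their concatenation.
content_feature = comment_word2vec + profile_feature
print(len(content_feature))  # 310
```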
The dataset statistics are shown below:
| Data | #Graphs | #Fake News | #Total Nodes | #Total Edges | #Avg. Nodes per Graph |
|---|---|---|---|---|---|
| Politifact | 314 | 157 | 41,054 | 40,740 | 131 |
| Gossipcop | 5,464 | 2,732 | 314,262 | 308,798 | 58 |
Please refer to the paper for more details about the UPFD dataset.
Due to Twitter policy, we could not release the crawled users' historical tweets publicly. To get the corresponding Twitter user information, refer to the news lists under \data in our GitHub repo and map the news ids to FakeNewsNet. Then you can crawl the user information by following the instructions in FakeNewsNet. In the UPFD project, we use Tweepy and the Twitter Developer API to get the user information.
This dataset was created by Marvel Samuel
https://www.verifiedmarketresearch.com/privacy-policy/
Fake Image Detection Market size was valued at USD 964.45 Million in 2023 and is projected to reach USD 4,107.03 Million by 2031, growing at a CAGR of 23.00% from 2024 to 2031.
Global Fake Image Detection Market Overview
The widespread availability of image editing software and social media platforms has led to a surge in fake images, including digitally altered photos and manipulated visual content. This trend has fueled the demand for advanced detection solutions capable of identifying and flagging fake images in real-time. With the proliferation of fake news and misinformation online, there is an increasing awareness among consumers, businesses, and governments about the importance of combating digital fraud and preserving the authenticity of visual content. This heightened concern is driving investments in fake image detection technologies to mitigate the risks associated with misinformation.
However, despite advancements in AI and ML, detecting fake images remains a complex and challenging task, especially when dealing with sophisticated techniques such as deepfakes and generative adversarial networks (GANs). Developing robust detection algorithms capable of identifying increasingly sophisticated forms of image manipulation poses a significant challenge for researchers and developers. The deployment of fake image detection technologies raises concerns about privacy and data ethics, particularly regarding the collection and analysis of visual content shared online. Balancing the need for effective detection with respect for user privacy and ethical considerations remains a key challenge for stakeholders in the Fake Image Detection Market.
This dataset was created by kalaivani
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The BuzzFeed-Webis Fake News Corpus 16 comprises the output of nine publishers during a week close to the 2016 US elections. Among the selected publishers are six prolific hyperpartisan ones (three left-wing and three right-wing) and three mainstream publishers (see Table 1). All publishers earned Facebook's blue checkmark, indicating authenticity and an elevated status within the network. For seven weekdays (September 19 to 23 and September 26 and 27), every post and linked news article of the nine publishers was fact-checked by professional journalists at BuzzFeed. In total, 1,627 articles were checked: 826 mainstream, 256 left-wing, and 545 right-wing. The imbalance between categories results from differing publication frequencies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes social media posts and news articles, each containing both a textual and a visual component, concerning the Ukrainian-Russian war that started in February 2022. The dataset was collected to perform two distinct sub-tasks: Multimodal Fake News Detection, and Cross-modal Relation Classification in fake and real news. Given a piece of content (e.g., a social media post or a news article) that includes both a visual and a textual component, the first sub-task aims to detect whether the content is real or fake news. The second sub-task aims to understand how the visual and textual components of news can influence each other. Given a text and an accompanying image, this sub-task intends to determine whether the combination of the two aims to mislead the reader's interpretation of one or the other, or not. The data for the two sub-tasks are stored in two separate sub-folders. Each sub-folder includes: (i) a training set, which contains data collected from February 2022 to September 2022; (ii) a contemporary test set, which includes data collected in the same time window as the training set; and (iii) a future test set, which contains data collected in a subsequent time window, specifically from October 2022 to December 2022.
https://www.verifiedmarketresearch.com/privacy-policy/
The AI Content Detector Market is growing at a moderate pace, with substantial growth rates over the last few years, and it is estimated that the market will grow significantly in the forecast period, i.e., 2024 to 2031.
Global AI Content Detector Market Drivers
Rising Concerns Over Misinformation: The proliferation of fake news, misinformation, and inappropriate content on digital platforms has led to increased demand for AI content detectors. These systems can identify and flag misleading or harmful content, helping to combat the spread of misinformation online.
Regulatory Compliance Requirements: Stringent regulations and legal obligations regarding content moderation, data privacy, and online safety drive the adoption of AI content detectors. Organizations need to comply with regulations such as the General Data Protection Regulation (GDPR) and the Digital Millennium Copyright Act (DMCA), spurring investment in AI-powered content moderation solutions.
Growing Volume of User-Generated Content: The exponential growth of user-generated content on social media platforms, forums, and websites has overwhelmed traditional moderation methods. AI content detectors offer scalable and efficient solutions for analyzing vast amounts of content in real-time, enabling platforms to maintain a safe and healthy online environment for users.
Advancements in AI and Machine Learning Technologies: Continuous advancements in artificial intelligence and machine learning algorithms have enhanced the capabilities of content detection systems. AI models trained on large datasets can accurately identify various types of content, including text, images, videos, and audio, with high precision and speed.
Brand Protection and Reputation Management: Businesses prioritize brand protection and reputation management in the digital age, as negative content or misinformation can severely impact brand image and consumer trust. AI content detectors help organizations identify and address potentially damaging content proactively, safeguarding their reputation and brand integrity.
Demand for Personalized User Experiences: Consumers increasingly expect personalized online experiences tailored to their preferences and interests. AI content detectors analyze user behavior and content interactions to deliver relevant and engaging content, driving user engagement and satisfaction.
Adoption of AI-Powered Moderation Tools by Social Media Platforms: Major social media platforms and online communities are investing in AI-powered moderation tools to enforce community guidelines, prevent abuse and harassment, and maintain a positive user experience. The need to address content moderation challenges at scale drives the adoption of AI content detectors.
Mitigation of Online Risks and Threats: Online platforms face various risks and threats, including cyberbullying, hate speech, terrorist propaganda, and child exploitation content. AI content detectors help mitigate these risks by identifying and removing harmful content, thereby creating a safer online environment for users.
Cost and Resource Efficiency: Traditional content moderation methods, such as manual review by human moderators, are time-consuming, labor-intensive, and costly. AI content detectors automate the moderation process, reducing the need for human intervention and minimizing operational expenses for organizations.
This dataset was created by Sabriar Bishal
This dataset was created by Sumit Saha
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The model classifies the political bias of a German text into 5 classes: far-left, center-left, center, center-right, far-right. It uses a TF-IDF vectorizer to preprocess documents. Then, a Random Forest classifier is applied to the resulting vectors to determine the final class.
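That pipeline shape can be sketched with scikit-learn. The toy documents and labels below are invented English stand-ins (the real model is trained on German text), and the hyperparameters are defaults, not the model's actual configuration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

CLASSES = ["far-left", "center-left", "center", "center-right", "far-right"]

# Tiny invented corpus, one document per class; real training
# data would be a large labeled set of German news texts.
docs = [
    "radical collective revolution now",
    "expand social programs gradually",
    "balanced budget moderate reform",
    "lower taxes traditional values",
    "nationalist hardline closed borders",
]

pipeline = make_pipeline(
    TfidfVectorizer(),                       # documents -> TF-IDF vectors
    RandomForestClassifier(random_state=0),  # vectors -> one of 5 classes
)
pipeline.fit(docs, CLASSES)

print(pipeline.predict(["balanced budget moderate reform"])[0])
```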
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To investigate how people assess whether politically consistent news is real or fake, two studies (N = 1,008; N = 1,397) with adult American participants conducted in 2020 and 2022 utilized a within-subjects experimental design to investigate perceptions of news accuracy. When a mock Facebook post with either fake (Study 1) or real (Study 2) news content was attributed to an alternative (vs. a mainstream) news outlet, it was, on average, perceived to be less accurate. Those with beliefs reflecting News Media Literacy demonstrated greater sensitivity to the outlet’s status. This relationship was itself contingent on the strength of the participant’s partisan identity. Strong partisans high in News Media Literacy defended the accuracy of politically consistent content, even while recognizing that an outlet was unfamiliar. These results highlight the fundamental importance of looking at the interaction between user-traits and features of social media news posts when examining learning from political news on social media.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Ahmed Khursheed
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract:
Analyzing the spread of information related to a specific event in the news has many potential applications. Consequently, various systems have been developed to facilitate the analysis of information spreading, such as the detection of disease propagation and the identification of fake news spreading through social media. There are several open challenges in the process of discerning information propagation, among them the lack of resources for training and evaluation. This paper describes the process of compiling a corpus from the EventRegistry global media monitoring system. We focus on information spreading in three domains: sports (the FIFA World Cup), natural disasters (earthquakes), and climate change (global warming). This corpus is a valuable addition to the currently available datasets for examining the spreading of information about various kinds of events.
Introduction:
Domain-specific gaps in information spreading are ubiquitous and may exist due to economic conditions, political factors, or linguistic, geographical, time-zone, cultural, and other barriers. These factors potentially contribute to obstructing the flow of local as well as international news. We believe there is a lack of research studies that examine, identify, and uncover the reasons for barriers in information spreading. Additionally, there is limited availability of datasets containing news text and metadata including time, place, source, and other relevant information. When a piece of information starts spreading, it implicitly raises questions such as: How far does the information, in the form of news, reach out to the public? Does the content of the news remain the same or change to a certain extent? Do cultural values impact the information, especially when the same news gets translated into other languages?
Statistics about datasets:
| # | Domain | Event Type | Articles Per Language | Total Articles |
|---|---|---|---|---|
| 1 | Sports | FIFA World Cup | 983-en, 762-sp, 711-de, 10-sl, 216-pt | 2679 |
| 2 | Natural Disaster | Earthquake | 941-en, 999-sp, 937-de, 19-sl, 251-pt | 3194 |
| 3 | Climate Changes | Global Warming | 996-en, 298-sp, 545-de, 8-sl, 97-pt | 1945 |