19 datasets found
  1. Repository of fake news detection datasets

    • data.4tu.nl
    zip
    Updated Mar 18, 2021
    Cite
    Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni (2021). Repository of fake news detection datasets [Dataset]. http://doi.org/10.4121/14151755.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 18, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Arianna D'Ulizia; Maria Chiara Caschera; Fernando Ferri; Patrizia Grifoni
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    2000 - 2019
    Description

    The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection, analysed according to eleven main characteristics (i.e., news domain, application purpose, type of disinformation, language, size, news content, rating scale, spontaneity, media platform, availability, and extraction time).

  2. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    + more versions
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches of roughly 900 articles, each with its label. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
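The four-class setup above can be sketched with a deliberately tiny baseline. The following is an illustrative Laplace-smoothed Naive Bayes over the task's label set; the toy training sentences are invented for this sketch and are not part of the task data:

```python
from collections import Counter
import math

LABELS = ["true", "partially false", "false", "other"]

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: iterable of (text, label) pairs; returns token counts and label frequencies."""
    counts = {lab: Counter() for lab in LABELS}
    label_freq = Counter()
    for text, lab in examples:
        counts[lab].update(tokenize(text))
        label_freq[lab] += 1
    return counts, label_freq

def predict(text, counts, label_freq):
    """Multinomial Naive Bayes with Laplace smoothing over the four task labels."""
    vocab = set().union(*counts.values())
    total = sum(label_freq.values())
    best, best_score = None, float("-inf")
    for lab in LABELS:
        if label_freq[lab] == 0:
            continue  # label unseen in training data
        score = math.log(label_freq[lab] / total)  # log prior
        denom = sum(counts[lab].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((counts[lab][tok] + 1) / denom)
        if score > best_score:
            best, best_score = lab, score
    return best

# Toy training pairs (invented for illustration only)
toy = [
    ("the claim is verified and accurate", "true"),
    ("completely fabricated hoax story", "false"),
]
counts, freq = train(toy)
pred = predict("fabricated hoax story", counts, freq)  # -> "false"
```

A real submission would of course train on the released article batches rather than toy sentences; the point here is only the four-way label space.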

    Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and this categorisation helps to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem: the task is to categorise fake news articles into six topical categories (e.g., health, election, crime, climate, education). This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided with the columns ID, title, text, rating, and domain; the description of the columns is as follows:

    Task 3a

    • ID - unique identifier of the news article
    • Title - title of the news article
    • Text - text mentioned inside the news article
    • Our rating - class of the news article: false, partially false, true, or other

    Task 3b

    • public_id - unique identifier of the news article
    • Title - title of the news article
    • Text - text mentioned inside the news article
    • Domain - domain of the given news article (applicable only for Task 3b)

    Output data format

    Task 3a

    • public_id - unique identifier of the news article
    • predicted_rating - predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id - unique identifier of the news article
    • predicted_domain - predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
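The two sample files above share the same two-column shape, so producing them can be sketched with the standard library's csv module. The file names and toy predictions below are illustrative, not prescribed by the task:

```python
import csv

def write_submission(path, value_column, predictions):
    """Write a two-column submission file: public_id plus the predicted value."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["public_id", value_column])
        w.writerows(predictions)

# Toy predictions matching the sample files above (values are illustrative)
write_submission("task3a_submission.csv", "predicted_rating", [(1, "false"), (2, "true")])
write_submission("task3b_submission.csv", "predicted_domain", [(1, "health"), (2, "crime")])
```

The same helper serves both subtasks because only the second column's name and value set differ.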

    Additional data for Training

    To train your model, participants may use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

    IMPORTANT!

    1. The fake news articles used for Task 3b are a subset of those in Task 3a.
    2. We have used data from 2010 to 2021; the fake news content covers several topics, such as elections and COVID-19.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs in total (not per day), and only one person from a team is allowed to submit runs.
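The F1-macro ranking measure is the unweighted mean of per-class F1 scores, which a short stdlib sketch makes concrete (the label lists in the usage below are invented examples):

```python
def f1_macro(y_true, y_pred, labels):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Invented example: two classes, one misclassified "false" article
score = f1_macro(
    ["false", "false", "true", "true"],
    ["false", "true", "true", "true"],
    ["false", "true"],
)
```

Because each class contributes equally regardless of its frequency, F1-macro rewards systems that handle the rarer labels (e.g., "other") rather than only the dominant class.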

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi, G. K. (2020). AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. https://arxiv.org/pdf/2010.00502.pdf
    • Shahi, G. K., & Nandini, D. (2020). FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19. In Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of COVID-19 misinformation on Twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
  3. CT-FAN: A Multilingual dataset for Fake News Detection

    • data.niaid.nih.gov
    Updated Oct 23, 2022
    Cite
    Melanie Siegel (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4714516
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Gautam Kishore Shahi
    Thomas Mandl
    Julia Maria Struß
    Michael Wiegand
    Melanie Siegel
    Juliane Köhler
    Description

    By downloading the data, you agree with the terms & conditions mentioned below:

    Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

    Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

    We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

    Citation

    Please cite our work as

    @InProceedings{clef-checkthat:2022:task3,
      author    = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
      title     = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
      year      = {2022},
      booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
      series    = {CLEF~'2022},
      address   = {Bologna, Italy},
    }

    @article{shahi2021overview,
      title   = {Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
      author  = {Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
      journal = {Working Notes of CLEF},
      year    = {2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

    Task 3: Multi-class fake news detection of news articles (English). The task frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches of roughly 1264 English-language articles, each with its label. Our definitions for the categories are as follows:

    False - The main claim made in an article is untrue.

    Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    True - This rating indicates that the primary elements of the main claim are demonstrably true.

    Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Cross-Lingual Task (German)

    Along with the multi-class task for English, we have introduced a task for a low-resourced language. We will provide test data in German. The idea of the task is to use the English data and transfer learning to build a classification model for German.
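The cross-lingual setup can be sketched with a deliberately naive baseline: character-trigram class centroids built from English training text and applied directly to German, exploiting shared character patterns in cognates. Real submissions used multilingual transfer models; everything below, including the toy sentences, is invented for illustration:

```python
from collections import Counter

def trigrams(text):
    """Character trigrams with boundary padding; language-agnostic features."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(examples):
    """Sum trigram counts per label to form one centroid vector per class."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(trigrams(text))
    return centroids

def classify(text, centroids):
    vec = trigrams(text)
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))

# English training sentences (toy data); the test sentence is German
english_train = [
    ("the pandemic virus infection claim is fabricated", "false"),
    ("official statistics confirmed the reported figures", "true"),
]
centroids = train_centroids(english_train)
german_pred = classify("pandemie virus infektion", centroids)
```

The German test sentence shares many trigrams with the English "false" example (e.g., from "virus", "pandem-", "infe-"), which is the only signal this toy transfer exploits; a serious system would fine-tune a multilingual encoder instead.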

    Input Data

    The data will be provided with the columns ID, title, text, rating, and domain; the description of the columns is as follows:

    ID - unique identifier of the news article

    Title - title of the news article

    Text - text mentioned inside the news article

    Our rating - class of the news article: false, partially false, true, or other

    Output data format

    public_id - unique identifier of the news article

    predicted_rating - predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    IMPORTANT!

    We have used data from 2010 to 2022; the fake news content covers several topics, such as elections and COVID-19.

    Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

    Related Work

    Shahi, G. K. (2020). AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. https://arxiv.org/pdf/2010.00502.pdf

    Shahi, G. K., & Nandini, D. (2020). FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19. In Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

    Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

    Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

    Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

    Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.

  4. fake-news-detection-dataset-English

    • huggingface.co
    Cite
    Erfan Moosavi Monazzah, fake-news-detection-dataset-English [Dataset]. https://huggingface.co/datasets/ErfanMoosaviMonazzah/fake-news-detection-dataset-English
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Authors
    Erfan Moosavi Monazzah
    License

    https://choosealicense.com/licenses/openrail/

    Description

    This is a cleaned and split version of this dataset: https://www.kaggle.com/datasets/sadikaljarif/fake-news-detection-dataset-english

    Labels:

    Fake News: 0
    Real News: 1

    You can find the cleaning script at: https://github.com/ErfanMoosaviMonazzah/Fake-News-Detection

  5. Data from: Supersharers of fake news on Twitter

    • dataone.org
    • datadryad.org
    Updated Jul 31, 2025
    Cite
    Sahar Baribi-Bartov; Briony Swire-Thompson; Nir Grinberg (2025). Supersharers of fake news on Twitter [Dataset]. http://doi.org/10.5061/dryad.44j0zpcmq
    Explore at:
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Sahar Baribi-Bartov; Briony Swire-Thompson; Nir Grinberg
    Time period covered
    Jan 1, 2024
    Description

    Governments may have the capacity to flood social media with fake news, but little is known about the use of flooding by ordinary voters. In this work, we identify 2,107 registered US voters that account for 80% of the fake news shared on Twitter during the 2020 US presidential election by an entire panel of 664,391 voters. We find that supersharers are important members of the network, reaching a sizable 5.2% of registered voters on the platform. Supersharers have a significant overrepresentation of women, older adults, and registered Republicans. Supersharers' massive volume does not seem automated but is rather generated through manual and persistent retweeting. These findings highlight a vulnerability of social media for democracy, where a small group of people distort the political reality for many.

    This dataset contains aggregated information necessary to replicate the results reported in our work on Supersharers of Fake News on Twitter while respecting and preserving the privacy expectations of individuals included in the analysis. No individual-level data is provided as part of this dataset. The data collection process that enabled the creation of this dataset leveraged a large-scale panel of registered U.S. voters matched to Twitter accounts. We examined the activity of 664,391 panel members who were active on Twitter during the months of the 2020 U.S. presidential election (August to November 2020, inclusive), and identified a subset of 2,107 supersharers, the most prolific sharers of fake news in the panel, who together account for 80% of fake news content shared on the platform. We rely on a source-level definition of fake news that uses the manually-labeled list of fake news sites by Grinberg et al. 2019 and an updated list based on NewsGuard ratings (commercial...

    # Supersharers of Fake News on Twitter

    This repository contains data and code for replication of the results presented in the paper.

    The folders are mostly organized by research questions as detailed below. Each folder contains the code and publicly available data necessary for the replication of results. Importantly, no individual-level data is provided as part of this repository. De-identified individual-level data can be attained for IRB-approved uses under the terms and conditions specified in the paper. Once access is granted, the restricted-access data is expected to be located under ./restricted_data.

    The folders in this repository are the following:

    Preprocessing

    Code under the preprocessing folder contains the following:

    1. source classifier - the code used to train a classifier based on NewsGuard domain flags to match the fake news source labels used in Grinberg et al. 2019.
    2. political classifier - the code used to identify political tweets, i...
  6. Social media as a news outlet worldwide 2024

    • statista.com
    • es.statista.com
    • +1 more
    + more versions
    Cite
    Amy Watson, Social media as a news outlet worldwide 2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Amy Watson
    Description

    During a 2024 survey, 77 percent of respondents from Nigeria stated that they used social media as a source of news. In comparison, just 23 percent of Japanese respondents said the same. Large portions of social media users around the world admit that they do not trust social platforms either as media sources or as a way to get news, and yet they continue to access such networks on a daily basis.

    Social media: trust and consumption

    Despite the majority of adults surveyed in each country reporting that they used social networks to keep up to date with news and current affairs, a 2018 study showed that social media is the least trusted news source in the world. Less than 35 percent of adults in Europe considered social networks to be trustworthy in this respect, yet more than 50 percent of adults in Portugal, Poland, Romania, Hungary, Bulgaria, Slovakia and Croatia said that they got their news on social media.

    What is clear is that we live in an era where social media is such an enormous part of daily life that consumers will still use it in spite of their doubts or reservations. Concerns about fake news and propaganda on social media have not stopped billions of users accessing their favorite networks on a daily basis. Most Millennials in the United States use social media for news every day, and younger consumers in European countries are much more likely to use social networks for national political news than their older peers. Like it or not, reading news on social media is fast becoming the norm for younger generations, and this form of news consumption will likely increase further regardless of whether consumers fully trust their chosen network or not.
  7. Multimodal fake news dataset Weibo23

    • ieee-dataport.org
    Updated Dec 23, 2023
    Cite
    wanlong bing (2023). Multimodal fake news dataset Weibo23 [Dataset]. https://ieee-dataport.org/documents/multimodal-fake-news-dataset-weibo23
    Explore at:
    Dataset updated
    Dec 23, 2023
    Authors
    wanlong bing
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We formed a comprehensive collection of fake news for Weibo23. The fabricated news articles were thoroughly examined and authenticated by certified experts.

  8. WELFake dataset for fake news detection in text data

    • zenodo.org
    • data.europa.eu
    csv
    Updated Apr 9, 2021
    Cite
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan (2021). WELFake dataset for fake news detection in text data [Dataset]. http://doi.org/10.5281/zenodo.4561253
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles: 35,028 real and 37,106 fake. For this, we merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    Dataset contains four columns: Serial number (starting from 0); Title (about the text news heading); Text (about the news content); and Label (0 = fake and 1 = real).

    The CSV file contains 78,098 entries, of which only 72,134 are accessible through the data frame.
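The four-column layout described above can be exercised with the stdlib csv module. The rows and header names below are invented to mirror the documented schema (serial number, title, text, label with 0 = fake and 1 = real); the headers in the distributed file may differ:

```python
import csv
import io
from collections import Counter

# Toy CSV following the documented column layout (rows are invented)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["serial", "title", "text", "label"])
writer.writerows([
    ["0", "Headline A", "Body of a real article.", "1"],
    ["1", "Headline B", "Body of a fake article.", "0"],
    ["2", "Headline C", "Another fake article.", "0"],
])

# Count fake (0) vs real (1) labels, as one would to verify the class balance
buf.seek(0)
label_counts = Counter(row["label"] for row in csv.DictReader(buf))
```

Running the same count over the real file is one quick way to confirm the stated 35,028 real / 37,106 fake split.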

    This dataset is a part of our ongoing research on "Fake News Prediction on Social Media Website" as a doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union’s Horizon 2020 research and innovation program.

  9. UPFD (User Preference-aware Fake News Detection)

    • opendatalab.com
    zip
    Updated Apr 18, 2023
    + more versions
    Cite
    Lehigh University (2023). UPFD (User Preference-aware Fake News Detection) [Dataset]. https://opendatalab.com/OpenDataLab/UPFD
    Explore at:
    Available download formats: zip (1601216611 bytes)
    Dataset updated
    Apr 18, 2023
    Dataset provided by
    Illinois Institute of Technology
    University of Illinois at Chicago
    Lehigh University
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.

    The dataset has been integrated with PyTorch Geometric (PyG) and Deep Graph Library (DGL); you can load it after installing the latest version of either library. The UPFD dataset includes two sets of tree-structured graphs curated for evaluating binary graph classification, graph anomaly detection, and fake/real news detection tasks. The dataset is dumped in the form of a PyTorch Geometric dataset object, so you can easily load the data and run various GNN models using PyG.

    The dataset includes fake and real news propagation (retweet) networks on Twitter, built according to fact-check information from Politifact and Gossipcop. The news retweet graphs were originally extracted by FakeNewsNet. Each graph is a hierarchical tree-structured graph where the root node represents the news and the leaf nodes are Twitter users who retweeted the root news. A user node has an edge to the news node if he/she retweeted the news tweet. Two user nodes have an edge if one user retweeted the news tweet from the other user.

    We crawled nearly 20 million historical tweets from users who participated in fake news propagation in FakeNewsNet to generate node features in the dataset. We incorporate four node feature types: the 768-dimensional bert and 300-dimensional spacy features are encoded using pretrained BERT and spaCy word2vec, respectively; the 10-dimensional profile feature is obtained from a Twitter account's profile (see profile_feature.py for profile feature extraction); and the 310-dimensional content feature is composed of a 300-dimensional user comment word2vec (spaCy) embedding plus the 10-dimensional profile feature.

    The dataset statistics are shown below:

    Data         Graphs   Fake News   Total Nodes   Total Edges   Avg. Nodes per Graph
    Politifact   314      157         41,054        40,740        131
    Gossipcop    5,464    2,732       314,262       308,798       58

    Please refer to the paper for more details about the UPFD dataset. Due to the Twitter policy, we could not release the crawled users' historical tweets publicly. To get the corresponding Twitter user information, you can refer to the news lists under \data in our github repo and map the news id to FakeNewsNet. Then, you can crawl the user information by following the instructions on FakeNewsNet. In the UPFD project, we use Tweepy and the Twitter Developer API to get the user information.

  10. Data from: Three Models for Audio-Visual Data in Politics

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Lucas, Christopher (2023). Three Models for Audio-Visual Data in Politics [Dataset]. http://doi.org/10.7910/DVN/FHD6M2
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lucas, Christopher
    Description

    Audio-visual data is ubiquitous in politics. Campaign advertisements, political debates, and the news cycle all constantly generate sound bites and imagery, which in turn inform and affect voters. Though these sources of information have been a topic of research in political science for decades, their study has been limited by the cost of human coding. To name but one example, to answer questions about the effects of negative campaign advertisements, humans must watch tens of thousands of advertisements and manually label them. And even if the necessary resources can be mustered for such a study, future researchers may be interested in a different set of labels, and so must either recode every advertisement or discard the exercise entirely. Through three separate models, this dissertation resolves this limitation by developing automated methods to study the most common types of audio-video data in political science. The first two models are neural networks, the third a hierarchical hidden Markov model. In Chapter 1, I introduce neural networks and their complications to political science, building up from familiar statistical methods. I then develop a novel neural network for classifying newspaper articles, using both the text of the article and the imagery as data. The model is applied to an original data set of articles about fake news, which I collected by developing and deploying bots to concurrently crawl the online pages of newspapers and download news text and images. This is a novel engineering effort that future researchers can leverage to collect effectively limitless amounts of data about the news. Building on the methodological foundations established in Chapter 1, in Chapter 2 I develop a second neural network for classifying political video and demonstrate that the model can automate classification of campaign advertisements, using both the visual and the audio information. 
In Chapter 3 (joint with Dean Knox), I develop a hierarchical hidden Markov model for speech classification and demonstrate it with an application to speech on the Supreme Court. Finally, in Chapter 4 (joint with Volha Charnysh and Prerna Singh), I demonstrate the behavioral effects of imagery through a dictator game in which a visual image reduces out-group bias. In sum, this dissertation introduces a new type of data to political science, validates its substantive importance, and develops models for its study in the substantive context of politics.

  11. NCRB: Year-wise Court disposal of cases of fake/false news and rumours...

    • dataful.in
    Updated Sep 12, 2025
    + more versions
    Cite
    Dataful (Factly) (2025). NCRB: Year-wise Court disposal of cases of fake/false news and rumours registered under Section 505 of IPC [Dataset]. https://dataful.in/datasets/459
    Explore at:
    Available download formats: application/x-parquet, xlsx, csv
    Dataset updated
    Sep 12, 2025
    Dataset authored and provided by
    Dataful (Factly)
    License

    https://dataful.in/terms-and-conditions

    Area covered
    India
    Variables measured
    cases
    Description

    The dataset contains the overall conviction rate and pendency percentage of cases of fake/false news and rumours registered under Section 505 of the IPC, as recorded by states in the National Crime Records Bureau's annual Crime in India report. NCRB has published this data since 2017, which is helpful in tracking the courts' disposal of cases. The conviction rate is defined as the share of convicted cases out of the total cases in which the trial was completed in a particular year. The pendency percentage is the share of cases pending trial at the end of the year as against the total cases for trial.
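The two measures defined above are simple ratios, which a short sketch makes explicit. The counts used below are hypothetical, not NCRB figures:

```python
def conviction_rate(convicted, trials_completed):
    """Percent of convicted cases among cases whose trial was completed in the year."""
    return 100.0 * convicted / trials_completed

def pendency_percentage(pending_at_year_end, total_cases_for_trial):
    """Percent of cases still pending trial at year end, out of all cases for trial."""
    return 100.0 * pending_at_year_end / total_cases_for_trial

# Hypothetical counts for illustration: 50 trials completed, 20 convictions;
# 500 cases for trial in total, 450 still pending at year end.
rate = conviction_rate(20, 50)            # 40.0
pendency = pendency_percentage(450, 500)  # 90.0
```

Note the two denominators differ: the conviction rate is computed only over completed trials, while the pendency percentage is computed over all cases for trial, so the two figures need not sum to anything meaningful.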

  12. INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Nafiz Sadman
    Nishat Anjum
    Kishor Datta Gupta
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States, Bangladesh
    Description

    Introduction

    There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Standford NLP and SVM by Rameshbhaiet. Al.compared several algorithms on a small and large dataset. Rubinet. al., in their paper [ 2 ], created a mechanism to differentiate fake news from real ones by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumitet. al.in [ 3 ] have implemented LDA, a topic modeling approach to study bias present in online news media.

    However, not much NLP research has been invested in studying COVID-19. Most applications involve classifying chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine; this research is crucial in battling the pandemic. The few NLP-based publications include sentiment classification of tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus, and similar work using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively large dataset of newspaper reports on COVID-19, contributing to awareness of the virus.

    2 Data-set Introduction

    2.1 Data Collection

    We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19, drawn from The Washington Post (USA) and the StarTribune (USA), and named the collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the same issue, drawn from The Daily Star (BD) and Prothom Alo (BD), and named it “Covid-News-BD-NNK”. All of these newspapers are among the top providers and most widely read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. Compared to automation, this approach ensured the news was highly relevant to the subject: the newspapers' online sites had dynamic content with advertisements in no particular order, so automated scrapers would have had a high chance of collecting inaccurate news reports. One challenge while collecting the data was the subscription requirement; each newspaper required $1 per subscription. The guidelines given to the human data collectors for selecting news reports were as follows:

    The headline must have one or more words directly or indirectly related to COVID-19.

    The content of each news item must have 5 or more keywords directly or indirectly related to COVID-19.

    The genre of the news can be anything as long as it is relevant to the topic; political, social, and economic genres are to be prioritized.

    Avoid taking duplicate reports.

    Maintain a consistent time frame across the above-mentioned newspapers.

    To collect these data we used a Google Form for each of the USA and BD collections. Two human editors went through each entry to check for spam or troll entries.

    2.2 Data Pre-processing and Statistics

    Some pre-processing steps performed on the newspaper report dataset are as follows:

    Remove hyperlinks.

    Remove non-English alphanumeric characters.

    Remove stop words.

    Lemmatize text.

    While more pre-processing could have been applied, we kept the data as unchanged as possible, since altering sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for the presence of the above-mentioned criteria.

    The primary data statistics of the two datasets are shown in Tables 1 and 2.

    Table 1: Covid-News-USA-NNK data statistics

        No of words per headline        7 to 20
        No of words per body content    150 to 2100

    Table 2: Covid-News-BD-NNK data statistics

        No of words per headline        10 to 20
        No of words per body content    100 to 1500

    2.3 Dataset Repository

    We used GitHub as our primary data repository under the account name NKK^1, with two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats; we regularly update the CSV files and regenerate the JSON using a Python script, and we provide a Python script file for essential operations. We welcome outside collaboration to enrich the dataset.
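The CSV-to-JSON regeneration mentioned above can be done entirely with the standard library; the column names below are hypothetical, since the repositories' actual schema is not described here:

```python
import csv, io, json

def csv_to_json(csv_text):
    """Convert a CSV export of the dataset into a JSON array of records."""
    reader = csv.DictReader(io.StringIO(csv_text))  # header row -> field names
    return json.dumps(list(reader), indent=2)

# Hypothetical two-column layout for illustration.
sample = "headline,body\nCovid cases rise,Hospitals report a surge in admissions\n"
print(csv_to_json(sample))
```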

    3 Literature Review

    Natural Language Processing (NLP) deals with text (i.e., categorical) data in computer science, utilizing numerous diverse methods, such as one-hot encoding and word embeddings, that transform text into a machine-readable form which can be fed to machine learning and deep learning algorithms.

    Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms for Natural Language Processing itself. The two most prominent are BERT [14], a Transformer model with a bidirectional encoder that achieves state-of-the-art results on classification and masked-word prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. These are typically used as pre-trained models, since training them carries a huge computational cost. Information extraction is the general concept of retrieving information from a dataset. Information extraction from an image could mean retrieving vital feature spaces or targeted portions of the image; information extraction from speech could mean retrieving names, places, etc. [16]; information extraction from text could mean identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a form of information extraction: it clusters words and phrases of the same context into groups. Topic modeling is an unsupervised learning method that gives a brief idea of what a set of texts is about. One commonly used topic model is Latent Dirichlet Allocation, or LDA [17].

    Keyword extraction is a form of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword-extraction technique that builds a graph over the words, computes a weight for each word, and picks the words with the highest weights.
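The core idea of TextRank, running PageRank over a word co-occurrence graph, can be sketched in plain Python. This toy version (fixed co-occurrence window, uniform edge weights) illustrates the technique; it is not the reference implementation:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iterations=30):
    """Rank words by running PageRank over a word co-occurrence graph."""
    # Build an undirected graph: words co-occurring within `window` are linked.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Power iteration with uniform edge weights (classic PageRank update).
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        score = {w: (1 - damping) + damping * sum(score[u] / len(neighbors[u])
                                                  for u in neighbors[w])
                 for w in neighbors}
    return sorted(score, key=score.get, reverse=True)

tokens = "virus spread fast virus cases rise cases spread".split()
print(textrank_keywords(tokens)[:3])  # the three best-connected words rank first
```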

    Word clouds are a useful visualization technique for understanding the overall ’talk of the topic’: the clustered words give a quick understanding of the content.
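The input to a word-cloud renderer is just a term-frequency table, which can be computed with the standard library (the actual figures were generated with the third-party wordcloud package, as noted in the next section):

```python
from collections import Counter

def top_terms(text, n=3):
    """Term frequencies that a word-cloud renderer would size words by."""
    words = [w.strip(".,").lower() for w in text.split()]
    return Counter(words).most_common(n)

sample = "Virus spreads as virus cases rise, cases rise again"
print(top_terms(sample))  # the most frequent terms dominate the cloud
```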

    4 Our experiments and Result analysis

    We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month, from February to May. From Figures 1, 2, and 3, we can note the following:

    In February, both newspapers talked about China and the source of the outbreak.

    StarTribune emphasized Minnesota as the most concerned state, and appeared even more concerned in April.

    Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.

    The Washington Post discussed global issues more than the StarTribune.

    In February, the StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

    While both newspapers mentioned the outbreak in China in February, the spread within the United States is more heavily weighted from March through May, displaying the critical impact of the virus.

    We used a script to extract all numbers related to keywords like ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports, producing a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction we can observe that April was the peak month, with cases rising gradually from February; both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus, though the steep fall from April to May also shows a positive response to the outbreak.

    We used VADER sentiment analysis to extract the sentiment of the headlines and bodies. On average, the sentiments ranged from -0.5 to -0.9 on the VADER scale, which runs from -1 (highly negative) to 1 (highly positive). In some cases the sentiment scores of the headline and the body contradicted each other, i.e., the sentiment of the headline was negative while the sentiment of the body was slightly positive. Overall, sentiment analysis can help sort the most concerning (most negative) news from the positive, from which we can learn more about the indicators related to COVID-19 and its serious impact; it can also tell us how a state or country is reacting to the pandemic.

    We used the PageRank algorithm to extract keywords from the headlines as well as the body content; PageRank efficiently highlights important, relevant keywords in the text. Some frequently occurring keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, ’Response’. Keyword extraction acts as a filter, allowing quick searches for indicators such as the state of the economy.
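The extraction of numbers around case-related keywords can be sketched with a regular expression; the authors do not give their exact patterns, so the one below is illustrative:

```python
import re

# Illustrative keyword list drawn from the ones named in the text.
KEYWORDS = r"(?:deaths|died|infected|infections|quarantined|diagnosed)"

def extract_counts(report):
    """Find numbers immediately before, or shortly after, a COVID keyword."""
    pattern = re.compile(
        rf"(\d[\d,]*)\s+{KEYWORDS}|{KEYWORDS}\D{{0,20}}?(\d[\d,]*)",
        re.IGNORECASE,
    )
    counts = []
    for m in pattern.finditer(report):
        num = m.group(1) or m.group(2)      # whichever alternative matched
        counts.append(int(num.replace(",", "")))
    return counts

text = "Officials said 1,200 infected across the state; deaths rose to 45."
print(extract_counts(text))  # → [1200, 45]
```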

  13. Data from: FakeNewsPerception: An Eye Movement Dataset on the Perceived...

    • dataverse.harvard.edu
    Updated Dec 1, 2020
    Cite
    Ömer Sümer; Efe Bozkir; Thomas Kübler; Sven Grüner; Sonja Utz; Enkelejda Kasneci (2020). FakeNewsPerception: An Eye Movement Dataset on the Perceived Believability of News Stories [Dataset]. http://doi.org/10.7910/DVN/C1UD2A
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 1, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Ömer Sümer; Efe Bozkir; Thomas Kübler; Sven Grüner; Sonja Utz; Enkelejda Kasneci
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    FakeNewsPerception: An Eye Movement Dataset on the Perceived Believability of News Stories. Extensive use of the internet has enabled easy access to many different sources, such as news and social media. Content shared on the internet cannot be fully fact-checked, and as a result misinformation can spread quickly and easily. Recently, psychologists and economists have shown in many experiments that prior beliefs, knowledge, and the willingness to think deliberately are important determinants of who falls for fake news. Many of these studies rely only on self-reports, which suffer from social-desirability bias, so more objective measures of information processing, such as eye movements during reading, are needed. To give the research community the opportunity to study human behavior with respect to news truthfulness, we propose the FakeNewsPerception dataset. FakeNewsPerception consists of eye movements during reading, perceived believability scores, questionnaires including the Cognitive Reflection Test (CRT) and News-Find-Me (NFM) perception, and political orientation, collected from 25 participants with 60 news items. Initial analyses of the eye movements revealed that human perception differs when viewing true and fake news.

  14. Data from: A Data set for Information Spreading over the News

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Apr 11, 2021
    Cite
    Abdul Sittar (2021). A Data set for Information Spreading over the News [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3950064
    Explore at:
    Dataset updated
    Apr 11, 2021
    Dataset provided by
    Dunja Mladenic
    Abdul Sittar
    Tomaz Erjavec
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract:

    Analyzing the spread of information related to a specific event in the news has many potential applications. Consequently, various systems have been developed to facilitate the analysis of information spreading, such as detecting disease propagation and identifying the spread of fake news through social media. Several open challenges remain in discerning information propagation, among them the lack of resources for training and evaluation. This paper describes the process of compiling a corpus from the EventRegistry global media monitoring system. We focus on information spreading in three domains: sports (the FIFA World Cup), natural disasters (earthquakes), and climate change (global warming). This corpus is a valuable addition to the currently available datasets for examining how information about various kinds of events spreads.

    Introduction:

    Domain-specific gaps in information spreading are ubiquitous and may exist due to economic conditions, political factors, or linguistic, geographical, time-zone, cultural, and other barriers. These factors potentially obstruct the flow of local as well as international news. We believe there is a lack of research studies that examine, identify, and uncover the reasons for barriers to information spreading. Additionally, there is limited availability of datasets containing news text and metadata, including time, place, source, and other relevant information. When a piece of information starts spreading, it implicitly raises questions such as:

    How far does the information in the form of news reach out to the public?

    Does the content of the news remain the same, or does it change to a certain extent?

    Do cultural values impact the information, especially when the same news is translated into other languages?

    Statistics about datasets:

        #  Domain            Event Type      Articles per Language                  Total Articles
        1  Sports            FIFA World Cup  983-en, 762-sp, 711-de, 10-sl, 216-pt  2679
        2  Natural Disaster  Earthquake      941-en, 999-sp, 937-de, 19-sl, 251-pt  3194
        3  Climate Change    Global Warming  996-en, 298-sp, 545-de, 8-sl, 97-pt    1945

  15. Content Analysis of Misinformation on Facebook by Alternative Media in CH,...

    • b2find.eudat.eu
    Updated Jun 15, 2024
    Cite
    (2024). Content Analysis of Misinformation on Facebook by Alternative Media in CH, DE, FR, UK, and US in 2020 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/0c8f0359-dee0-57b2-ad4d-c5e79fecf2ed
    Explore at:
    Dataset updated
    Jun 15, 2024
    Area covered
    United Kingdom, United States
    Description

    This project aims to assess the extent of the online disinformation problem (“fake news”) in Western Europe compared to the US. Although scholarship on disinformation has increased substantially since 2016, there is a lack of work comparing these findings with the situation in Western democracies outside the US. The main goal of this project is to understand which contextual and individual factors foster the dissemination and consumption of online disinformation, and with what effects: which actors spread false information, how is it consumed and perceived, and which societal groups are most susceptible to it? Switching perspective, we are also interested in the factors enabling the resilience of countries facing the problem of online disinformation. The project entails two datasets, analyzing the producers' as well as the consumers' side. First, a representative online survey in six countries (Belgium (Flanders): n = 1,063; France: n = 1,255; Germany: n = 1,019; Switzerland (German-speaking regions): n = 1,251; UK: n = 1,380; US: n = 1,038) that incorporates an experimental part (N = 7,006). To investigate interaction with disinformation, we created stimuli consisting of false claims covering three issues (Covid-19, climate change, and migration), all designed as social media posts varying in reporting-style elements and in popularity cues (number of likes, shares, comments). Key variables and concepts measured prior to the stimuli are: socio-demographics, issue salience, political attitude, news consumption, social media usage, exposure to politicians' posts, and media trust.
    Key variables and concepts measured after the experimental treatment are: emotional reaction, agreement with the false claim, willingness to interact with the false post (like, share, comment, no reaction) and motivations for that interaction, personality traits, political interest and orientation, disinformation concerns and exposure, and (mis-)information literacy. Second, a manual quantitative content analysis of Facebook posts published by alternative media in five countries (N = 1,661): France, n = 350; Germany, n = 350; Switzerland, n = 261; UK, n = 350; US, n = 350. This dataset focuses on different disinformation types and styles, grouped into disinformation categories (i.e., fabricated falsehoods, false connections, ideological biases) and genre-typical features (i.e., conspiracy rhetoric, calls for skepticism, use of pseudo-experts, clickbait journalism, emotionality, sensationalism, counter-positioning, criticism of various targets, and political topics).

  16. Average daily time spent on social media worldwide 2012-2024

    • statista.com
    • es.statista.com
    • +1more
    Cite
    Stacy Jo Dixon, Average daily time spent on social media worldwide 2012-2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    How much time do people spend on social media?

                  As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. The country with the most time spent on social media per day is currently Brazil, where online users spend an average of three hours and 49 minutes on social media each day; in comparison, daily time spent with social media in the U.S. was just two hours and 16 minutes.

                  Global social media usage. The global social network penetration rate is currently 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region, while Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons: users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with friends and current events.

                  Global impact of social media. Social media has a wide-reaching and significant impact not only on online activities but also on offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased political polarization, and heightened everyday distractions.
    
  17. Number of global social network users 2017-2028

    • statista.com
    • es.statista.com
    • +1more
    Cite
    Stacy Jo Dixon, Number of global social network users 2017-2028 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    How many people use social media?

                  Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
    
                  Who uses social media?
                  Social networking is one of the most popular digital activities worldwide and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as lesser developed digital markets catch up with other regions
                  when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
    
                  How much time do people spend on social media?
                  Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. On average, internet users in Latin America had the highest average time spent per day on social media.
    
                  What are the most popular social media platforms?
                  Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
    
  18. Global social media subscriptions comparison 2023

    • statista.com
    • es.statista.com
    • +1more
    Cite
    Stacy Jo Dixon, Global social media subscriptions comparison 2023 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    Social media companies are starting to offer users the option to subscribe to their platforms in exchange for monthly fees. Until recently, social media has been predominantly free to use, with tech companies relying on advertising as their main revenue generator. However, advertising revenues have been dropping following the COVID-induced boom. As of July 2023, Meta Verified is the most costly of the subscription services, setting users back almost 15 U.S. dollars per month on iOS or Android. Twitter Blue costs between eight and 11 U.S. dollars per month and ensures users will receive the blue check mark, and have the ability to edit tweets and have NFT profile pictures. Snapchat+, drawing in four million users as of the second quarter of 2023, boasts a Story re-watch function, custom app icons, and a Snapchat+ badge.

  19. Leading social media platforms used by marketers worldwide 2024

    • statista.com
    • de.statista.com
    • +1more
    Cite
    Christopher Ross, Leading social media platforms used by marketers worldwide 2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Christopher Ross
    Description

    During a 2024 survey among marketers worldwide, around 86 percent reported using Facebook for marketing purposes. Instagram and LinkedIn followed, respectively mentioned by 79 and 65 percent of the respondents.

                  The global social media marketing segment
    
                  According to the same study, 59 percent of responding marketers intended to increase their organic use of YouTube for marketing purposes throughout that year. LinkedIn and Instagram followed with similar shares, rounding out the top three social media platforms attracting planned growth in organic use among global marketers in 2024. The main driver is increasing brand exposure and traffic, which led the ranking of benefits of social media marketing worldwide.
    
                  Social media for B2B marketing
    
                  Social media platform adoption rates among business-to-consumer (B2C) and business-to-business (B2B) marketers vary according to each subsegment's focus. While B2C professionals prioritize Facebook and Instagram – both run by Meta, Inc. – due to their popularity among online audiences, B2B marketers concentrate their endeavors on Microsoft-owned LinkedIn due to its goal to connect people and companies in a corporate context.
    
