100+ datasets found
  1. Covid-19 Fake News Infodemic Research Dataset (CoVID19-FNIR Dataset)

    • ieee-dataport.org
    Updated Jul 29, 2025
    Cite
    DIKSHA SHUKLA (2025). Covid-19 Fake News Infodemic Research Dataset (CoVID19-FNIR Dataset) [Dataset]. https://ieee-dataport.org/open-access/covid-19-fake-news-infodemic-research-dataset-covid19-fnir-dataset
    Explore at:
    Dataset updated
    Jul 29, 2025
    Authors
    DIKSHA SHUKLA
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The United States of America

  2. Covid-19 News Dataset Both Fake and Real

    • zenodo.org
    • explore.openaire.eu
    csv
    Updated Jul 2, 2021
    Cite
    Shagoto Rahman; M. Raihan; Laboni Akter; Md. Mohsin Sarker Raihan (2021). Covid-19 News Dataset Both Fake and Real [Dataset]. http://doi.org/10.5281/zenodo.4722484
    Explore at:
    Available download formats: csv
    Dataset updated
    Jul 2, 2021
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Shagoto Rahman; M. Raihan; Laboni Akter; Md. Mohsin Sarker Raihan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains fake and real news. There are 16898 unique rows, each corresponding to one news item. The dataset is merged from two datasets: one drawn from different CBC News sources (link: https://zenodo.org/record/4722470) and the other from different web portals (link: https://zenodo.org/record/4282522).

    Data Description:

    Text: Text contains the news that is either fake or real.

    Outcome: Contains either fake or real which is the status of the news.
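
    As a minimal sketch of reading a file with this layout, the snippet below parses CSV text with the two columns described above and drops exact duplicate rows. The header names `Text` and `Outcome` follow the description; the sample rows and the function name `load_news` are hypothetical, and the real file's header spelling may differ.

```python
import csv
import io
from collections import Counter


def load_news(handle):
    """Read (text, outcome) pairs from a CSV handle, dropping exact duplicates."""
    seen = set()
    rows = []
    for record in csv.DictReader(handle):
        # Assumed header names; adjust if the real file spells them differently.
        pair = (record["Text"].strip(), record["Outcome"].strip().lower())
        if pair not in seen:
            seen.add(pair)
            rows.append(pair)
    return rows


# Hypothetical sample standing in for the downloaded CSV file.
sample = io.StringIO(
    "Text,Outcome\n"
    "Claim A about a cure,fake\n"
    "Report B on case counts,real\n"
    "Claim A about a cure,fake\n"
)
rows = load_news(sample)
label_counts = Counter(outcome for _, outcome in rows)
```

    Passing an open file object instead of the `io.StringIO` sample should work the same way, provided the file uses these two column names.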

  3. Data from: COVID-19 News Articles

    • ieee-dataport.org
    Updated May 18, 2022
    Cite
    Piyush Ghasiya (2022). COVID-19 News Articles [Dataset]. https://ieee-dataport.org/documents/covid-19-news-articles
    Explore at:
    Dataset updated
    May 18, 2022
    Authors
    Piyush Ghasiya
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    India

  4. COVID-19 Fake News Dataset

    • data.mendeley.com
    Updated Feb 22, 2021
    Cite
    Abhishek Koirala (2021). COVID-19 Fake News Dataset [Dataset]. http://doi.org/10.17632/zwfdmp5syg.1
    Explore at:
    Dataset updated
    Feb 22, 2021
    Authors
    Abhishek Koirala
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of a collection of true and fake news related to COVID-19. It covers news from December 2019 to July 2020.

  5. COVID Fake News Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Nov 27, 2020
    Cite
    Sumit Banik (2020). COVID Fake News Dataset [Dataset]. http://doi.org/10.5281/zenodo.4282522
    Explore at:
    Dataset updated
    Nov 27, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Sumit Banik
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The dataset contains a list of COVID fake news items and claims shared across the internet.

    Content

    1. Headlines: String attribute consisting of the headlines/fact shared.
    2. Outcome: It is binary data where 0 means the headline is fake and 1 means that it is true.
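
    The binary encoding above can be turned into readable labels with a small lookup. This is only a sketch: the function name `decode_outcome` and the sample headlines are hypothetical, and it assumes the field may arrive either as an integer or as text read from a file.

```python
# Mapping taken from the description: 0 means fake, 1 means true.
OUTCOME_LABELS = {0: "fake", 1: "true"}


def decode_outcome(value):
    """Map the binary Outcome field (0 or 1, possibly read as text) to a label."""
    code = int(str(value).strip())
    if code not in OUTCOME_LABELS:
        raise ValueError(f"unexpected Outcome value: {value!r}")
    return OUTCOME_LABELS[code]


# Made-up records for illustration only.
records = [("Headline one", "0"), ("Headline two", 1)]
decoded = [(headline, decode_outcome(outcome)) for headline, outcome in records]
```

    Raising on unexpected values, rather than silently defaulting, makes stray or missing codes visible early.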

    Inspiration

    On many research portals, a common question was whether a combined fake news dataset is available. This led to the publication of this dataset.

  6. Covid-19 and vaccine news dataset

    • data.mendeley.com
    Updated Oct 27, 2021
    + more versions
    Cite
    Rajat Thakur (2021). Covid-19 and vaccine news dataset [Dataset]. http://doi.org/10.17632/hwrdzw26vk.1
    Explore at:
    Dataset updated
    Oct 27, 2021
    Authors
    Rajat Thakur
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the latest world news related to Covid-19 and Covid vaccine with the news article's available metadata.

  7. Covid-19 latest news dataset

    • data.mendeley.com
    Updated Oct 27, 2021
    + more versions
    Cite
    Rajat Thakur (2021). Covid-19 latest news dataset [Dataset]. http://doi.org/10.17632/8rbm7d874k.1
    Explore at:
    Dataset updated
    Oct 27, 2021
    Authors
    Rajat Thakur
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Coronavirus disease 2019 (COVID-19) time series that lists confirmed cases, reported deaths, and reported recoveries. Data is broken down by country (and sometimes by sub-region).

    Coronavirus disease (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had an effect worldwide. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, citing more than 118,000 cases of coronavirus disease in more than 110 countries and territories around the world.

    This dataset contains the latest news related to Covid-19 and it was fetched with the help of Newsdata.io news API.

  8. COVID-19 Fake News Dataset (COVID19 Fake News Detection in English)

    • opendatalab.com
    zip
    Updated Oct 1, 2020
    Cite
    International Institute for Information Technology, Hyderabad (2020). COVID-19 Fake News Dataset (COVID19 Fake News Detection in English) [Dataset]. https://opendatalab.com/OpenDataLab/COVID-19_Fake_News_Dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 1, 2020
    Dataset provided by
    Wipro Research
    International Institute for Information Technology, Hyderabad
    License

    https://competitions.codalab.org/competitions/26655#learn_the_details-terms_and_conditions

    Description

    Along with the COVID-19 pandemic, we are also fighting an 'infodemic'. Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm. This is further exacerbated at the time of a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines: Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM). We obtain the best performance of 93.46% F1-score with SVM.

  9. COVID-19 Fake News Dataset

    • kaggle.com
    zip
    Updated Nov 4, 2020
    Cite
    Möbius (2020). COVID-19 Fake News Dataset [Dataset]. https://www.kaggle.com/arashnic/covid19-fake-news
    Explore at:
    Available download formats: zip (3948402 bytes)
    Dataset updated
    Nov 4, 2020
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    As the COVID-19 virus quickly spreads around the world, unfortunately, misinformation related to COVID-19 also gets created and spreads like wildfire. Such misinformation has caused confusion among people, disruptions in society, and even deadly consequences in health problems. Being able to understand, detect, and mitigate such COVID-19 misinformation therefore has not only deep intellectual value but also huge societal impact. To help researchers combat COVID-19 health misinformation, this dataset was created.

    Content

    The dataset is a diverse COVID-19 healthcare misinformation dataset, including fake news on websites and social platforms, along with users' social engagement with such news. It includes 4,251 news items, 296,000 related user engagements, 926 social platform posts about COVID-19, and ground truth labels.

    • Version 0.1 (05/17/2020) initial version corresponding to arXiv paper CoAID: COVID-19 HEALTHCARE MISINFORMATION DATASET

    • Version 0.2 (08/03/2020) added data from May 1, 2020 through July 1, 2020

    • Version 0.3 (11/03/2020) added data from July 1, 2020 through September 1, 2020

    Acknowledgements

    Limeng Cui and Dongwon Lee, Pennsylvania State University.

  10. covid_fake_news

    • huggingface.co
    Updated Mar 7, 2024
    Cite
    Yiyang Nan (2024). covid_fake_news [Dataset]. https://huggingface.co/datasets/nanyy1025/covid_fake_news
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 7, 2024
    Authors
    Yiyang Nan
    Description

    Constraint@AAAI2021 - COVID19 Fake News Detection in English

    @misc{patwa2020fighting,
     title={Fighting an Infodemic: COVID-19 Fake News Dataset},
     author={Parth Patwa and Shivam Sharma and Srinivas PYKL and Vineeth Guptha and Gitanjali Kumari and Md Shad Akhtar and Asif Ekbal and Amitava Das and Tanmoy Chakraborty},
     year={2020},
     eprint={2011.03327},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
    }

  11. COVID-19 rumor dataset

    • figshare.com
    html
    Updated Jun 10, 2023
    Cite
    cheng (2023). COVID-19 rumor dataset [Dataset]. http://doi.org/10.6084/m9.figshare.14456385.v2
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    cheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.

    Usage note:

    • Misinformation detection, classification, tracking, prediction.
    • Misinformation sentiment analysis.
    • Rumor veracity classification, comment stance classification.
    • Rumor tracking, social network analysis.

    Data pre-processing and data analysis codes available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset

    Please see full info in our GitHub link.

    Cite us:

    Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.

    @article{cheng2021covid,
     title={A COVID-19 Rumor Dataset},
     author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul},
     journal={Frontiers in Psychology},
     volume={12},
     pages={1566},
     year={2021},
     publisher={Frontiers}
    }

  12. Covid_News.json

    • figshare.com
    txt
    Updated Oct 26, 2021
    + more versions
    Cite
    Rajat Thakur (2021). Covid_News.json [Dataset]. http://doi.org/10.6084/m9.figshare.16871881.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 26, 2021
    Dataset provided by
    figshare
    Authors
    Rajat Thakur
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Track and monitor Covid-19 related news from the world.

  13. COVIFN : Fake News on COVID19

    • ieee-dataport.org
    Updated Nov 3, 2021
    Cite
    Isha Agarwal (2021). COVIFN : Fake News on COVID19 [Dataset]. https://ieee-dataport.org/documents/covifn-fake-news-covid19
    Explore at:
    Dataset updated
    Nov 3, 2021
    Authors
    Isha Agarwal
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    the removal of special characters and non-vital information is performed.

    The file contains columns such as:

    • Date: publish date of news article
    • country: country the article is about
    • text: the news article content
    • label: fake or real news label
    • URL: the fact-checked site
    • source: original news source site

  14. FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19...

    • service.tib.eu
    Updated Dec 16, 2024
    + more versions
    Cite
    (2024). FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19 - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/fakecovid---a-multilingual-cross-domain-fact-check-news-dataset-for-covid-19
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    The FakeCovid dataset contains 5182 fact-checked news articles for COVID-19 collected from January to May 2020.

  15. INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Nafiz Sadman
    Nishat Anjum
    Kishor Datta Gupta
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh, United States
    Description

    Introduction

    Several works apply Natural Language Processing to newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.

    However, not much NLP research has been devoted to studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight the virus and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, which contributed to awareness of the virus.

    2 Data-set Introduction

    2.1 Data Collection

    We accumulated 1000 online newspaper reports on COVID-19 from the United States of America (USA). The newspapers include The Washington Post (USA) and StarTribune (USA). We have named the collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named the collection “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in their respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferred over automation to ensure the news items were highly relevant to the subject. The newspapers' online sites had dynamic content with advertisements in no particular order, so online scrapers would have been likely to collect inaccurate news reports. One of the challenges in collecting the data was the subscription requirement: each newspaper required $1 per subscription. Some criteria provided as guidelines to the human data-collectors were as follows:

    The headline must have one or more words directly or indirectly related to COVID-19.

    The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

    The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.

    Avoid taking duplicate reports.

    Maintain a time frame for the above mentioned newspapers.

    To collect these data we used a Google Form for USA and BD. Two human editors went through each entry to check for spam or troll entries.

    2.2 Data Pre-processing and Statistics

    Some pre-processing steps performed on the newspaper report dataset are as follows:

    Remove hyperlinks.

    Remove non-English alphanumeric characters.

    Remove stop words.

    Lemmatize text.

    While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any presence of the above-mentioned criteria.
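
    The authors' script is not shown, so the following is only a sketch of how the first three listed steps might look using the Python standard library. The stop-word list, the regular expressions, the lower-casing, and the function name `preprocess` are all assumptions made for illustration; lemmatization is left out because the source does not say which lemmatizer was used.

```python
import re

# Tiny illustrative stop-word list; the authors' actual list is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_PATTERN = re.compile(r"[^A-Za-z0-9\s]")


def preprocess(text):
    """Strip hyperlinks, non-English alphanumeric characters, and stop words."""
    text = URL_PATTERN.sub(" ", text)
    text = NON_ALNUM_PATTERN.sub(" ", text)
    # Lower-casing is an assumption so that stop-word matching ignores case.
    tokens = [tok for tok in text.lower().split() if tok not in STOP_WORDS]
    # Lemmatization is omitted: it needs a lexical resource that the
    # source does not name.
    return tokens


cleaned = preprocess("Masks are required in the city: see https://example.com/rules")
```

    Links are removed before the character filter so that URL fragments do not survive as stray tokens.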

    The primary data statistics of the two datasets are shown in Tables 1 and 2.

    Table 1: Covid-News-USA-NNK data statistics

    No of words per headline        7 to 20
    No of words per body content    150 to 2100

    Table 2: Covid-News-BD-NNK data statistics

    No of words per headline        10 to 20
    No of words per body content    100 to 1500

    2.3 Dataset Repository

    We used GitHub as our primary data repository, under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.

    3 Literature Review

    Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

    Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing tasks. The two most prominent are BERT [14], which uses a bidirectional Transformer encoder and performs strongly on classification and masked-word prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are all pre-trained models, since training them carries a huge computation cost. Information extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction from text could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation, or LDA [17].

    Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses a graph to calculate the weight of each word and picks the words with the highest weights.
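
    To illustrate the graph-based weighting, here is a simplified TextRank-style sketch: words that co-occur within a small window are linked in an undirected graph, and a PageRank iteration scores each word. It is not the authors' implementation; the window size, damping factor, iteration count, and the function name `textrank_keywords` are assumptions, and the token list is made up.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=30, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it within the window.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            # Each neighbor passes on its score split evenly among its links.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Made-up, already pre-processed tokens for illustration.
tokens = ["virus", "hits", "economy", "virus", "closes", "schools",
          "virus", "strains", "hospitals"]
keywords = textrank_keywords(tokens)
```

    In this toy input, "virus" is linked to every other word, so it receives the highest score; real text would normally be filtered to nouns and adjectives first.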

    Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

    4 Our experiments and Result analysis

    We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can point out a few observations:

    In February, both newspapers talked about China and the source of the outbreak.

    StarTribune emphasized Minnesota as the state of most concern. In April, it seemed to be even more concerned.

    Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, markets.

    Washington Post discussed global issues more than StarTribune.

    StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

    While both newspapers mentioned the outbreak in China in February, the spread in the United States is given more weight from March through May, displaying the critical impact caused by the virus.

    We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports and created a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, as the count gradually rose from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.

    We used Vader Sentiment Analysis to extract the sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. The Vader sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also tell us how a state or country is reacting to the pandemic.

    We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, ’Response’. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,

  16. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    + more versions
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English) Sub-task A is designed as a four-class fake news classification problem. The training data will be released in batches and will comprise roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem. The task is to categorise fake news articles into six topical categories such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    Task 3a

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Task 3b

    • public_id- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • domain - domain of the given news article(applicable only for task B)

    Output data format

    Task 3a

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id- Unique identifier of the news article
    • predicted_domain- predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
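
    A submission file in the layout shown in the sample files can be produced with the standard `csv` module. This is a sketch only: the function name `write_submission` and the predictions are hypothetical, and whether the organisers expect a space after each comma (as printed in the samples) is not stated, so the code writes plain comma-separated values.

```python
import csv
import io


def write_submission(predictions, handle, label_column):
    """Write (public_id, label) rows under the header used in the sample files."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["public_id", label_column])
    for public_id, label in predictions:
        writer.writerow([public_id, label])


# Hypothetical predictions mirroring the Task 3a sample file.
buffer = io.StringIO()
write_submission([(1, "false"), (2, "true")], buffer, "predicted_rating")
submission_text = buffer.getvalue()
```

    For Task 3b, the same function can be called with `"predicted_domain"` as the column name and domain labels as values.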

    Additional data for Training

    To train their models, participants can use additional data with a similar format; some datasets are available over the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:

    IMPORTANT!

    1. The fake news articles used for task 3b are a subset of those used for task 3a.
    2. We have used data from 2010 to 2021, and the fake news content covers a mix of topics such as elections, COVID-19, etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
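
    For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so each of the four classes counts equally regardless of how many articles it has. The sketch below computes it in plain Python; the function name `macro_f1` and the example labels are made up, and scoring a class with no gold or predicted instances as 0 is a convention chosen here, not something the task description specifies.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1 scores."""
    per_class = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class that never appears scores 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# The four classes of Subtask 3A; the gold and predicted labels are invented.
LABELS = ["false", "partially false", "true", "other"]
gold = ["false", "true", "other", "partially false", "false"]
pred = ["false", "true", "false", "partially false", "true"]
score = macro_f1(gold, pred, LABELS)
```

    Here the per-class scores are 1/2, 1, 2/3, and 0, so a single missed "other" article pulls the average down as much as a whole class would.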

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
  17. Actions taken after reading fake COVID-19 news in the UK 2020-2021

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Actions taken after reading fake COVID-19 news in the UK 2020-2021 [Dataset]. https://www.statista.com/statistics/1113700/coronavirus-fake-news-actions-uk/
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista http://statista.com/
    Area covered
    United Kingdom
    Description

    A survey carried out in the United Kingdom in September 2021 found that ** percent of respondents did not take any action after encountering what they believed to be false or misleading information on the COVID-19 outbreak. Whilst this figure was lower than the share who said the same in the 2020 survey, taking no action remained the most common response to fake coronavirus news. Meanwhile, ** percent used a fact checking site or tool to determine whether or not the information they found was true, and ** percent turned to family or friends for help in confirming the legitimacy of news they suspected to be false.

    For further information about the coronavirus (COVID-19) pandemic, please visit our dedicated Facts and Figures page.

  18. Share of online fake news related to coronavirus (COVID-19) in Italy 2020

    • statista.com
    Statista, Share of online fake news related to coronavirus (COVID-19) in Italy 2020 [Dataset]. https://www.statista.com/statistics/1109490/share-of-coronavirus-fake-news-italy/
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2020 - May 2020
    Area covered
    Italy
    Description

    In May 2020, up to six percent of all online news and posts related to the coronavirus (COVID-19) released in Italy were false or inaccurate. The percentage was calculated from the average volume of posts and articles published by Italian media outlets, including posts on social media. The release of fake news peaked in the early stage of the pandemic, at the end of January 2020, when it accounted for 7.3 percent of coronavirus-related information.

    For further information about the coronavirus (COVID-19) pandemic, please visit our dedicated Facts and Figures page.

  19. Mexico: social networks in which users saw more COVID-19 fake news

    • statista.com
    Updated Jul 9, 2025
    Statista (2025). Mexico: social networks in which users saw more COVID-19 fake news [Dataset]. https://www.statista.com/statistics/1136738/social-networks-users-received-more-false-coronavirus-information-mexico/
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Mar 18, 2020 - Mar 25, 2020
    Area covered
    Mexico
    Description

    In March 2020, nearly **** percent of social media users surveyed in Mexico said that WhatsApp was the platform on which they received the most false information about COVID-19, while **** percent of respondents said they saw the most fake news on the matter on Facebook.

  20. Spanish Fake News Dataset

    • produccioncientifica.ucm.es
    • zenodo.org
    Updated 2025
    Tretiakov, Arsenii; D'Antonio Maceiras, Sergio; Martín, Alejandro (2025). Spanish Fake News Dataset [Dataset]. https://produccioncientifica.ucm.es/documentos/685699246364e456d3a66786
    Dataset updated
    2025
    Authors
    Tretiakov, Arsenii; D'Antonio Maceiras, Sergio; Martín, Alejandro
    Description

    Spanish Fake News Dataset

    This dataset contains a structured and annotated collection of false news items in Spanish (Castilian), gathered and processed for academic research on misinformation.

    Dataset Scope

    The dataset represents most of the recorded false news messages and their variations up to 01.02.2021.

    Content Description

    The dataset includes samples of false information in various formats:

    • News articles and headlines
    • Tweets and Facebook/Instagram/Telegram posts
    • YouTube video captions
    • WhatsApp text and voice message transcripts
    • Transcribed video/audio fragments with false claims
    • Fake government documents
    • Captions from photos and memes
    • Text extracted from images using OCR

    Only Spanish (Castilian) texts were used, excluding regional variants (e.g., Catalan, Basque, Galician) for consistency.

    Sources

    The data was collected from the following verified fact-checking initiatives:

    • Maldito Bulo
    • Newtral
    • AFP Factual

    Fact-checkers from these organizations provide detailed articles identifying and explaining falsehoods, often including:

    • General context of the event
    • Quotes or links to false claims
    • Analysis and explanation of why the claims are false
    • Verified information or corrections

    Collection Method

    The dataset was built using both manual extraction (e.g., identifying and quoting false statements) and automated parsing:

    • MyNews service: an archive of Spanish mass media
    • Custom scripts: for parsing and extracting structured data
    • OCR tools: for extracting text from images (e.g., memes and screenshots)

    Fields Description

    Column Name: Description

    • Topic: The thematic category of the news item (e.g., Politics, Health, COVID-19, Crime). Normalized and translated to English.
    • Link source: URL to the original news piece, fact-check report, or source of the claim. Invalid links were removed.
    • Media: The platform or outlet where the false claim appeared (e.g., Facebook, YouTube, WhatsApp). Normalized for consistent spelling and language.
    • Date: Publication or verification date of the news item, in YYYY-MM-DD format.
    • Author: (Optional) Author of the news or platform source, if available. May be empty.
    • Headlines: Title or summary of the news item or article containing the false information.
    • Fake statement: Quoted false claim or misinformation as cited in the verification article.
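
    The fields above could be read into typed records along the following lines. This is a minimal sketch that assumes the data is distributed as a CSV file whose header matches the column names listed; the file format, the `FakeNewsItem` class, the `load_items` helper, and the sample row are all invented for illustration and are not taken from the dataset.

```python
import csv
import datetime
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeNewsItem:
    topic: str
    link_source: str
    media: str
    date: datetime.date
    author: Optional[str]  # the Author column may be empty
    headline: str
    fake_statement: str


def load_items(csv_text: str) -> List[FakeNewsItem]:
    """Parse CSV text whose header uses the column names listed above."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        author = (row.get("Author") or "").strip()
        items.append(FakeNewsItem(
            topic=row["Topic"].strip(),
            link_source=row["Link source"].strip(),
            media=row["Media"].strip(),
            # Dates are documented as YYYY-MM-DD, which fromisoformat accepts
            date=datetime.date.fromisoformat(row["Date"].strip()),
            author=author or None,
            headline=row["Headlines"].strip(),
            fake_statement=row["Fake statement"].strip(),
        ))
    return items


# Invented sample row, for illustration only
SAMPLE_CSV = (
    "Topic,Link source,Media,Date,Author,Headlines,Fake statement\n"
    "COVID-19,https://example.org/fact-check,WhatsApp,2020-03-14,,"
    "Invented headline,Invented claim text\n"
)

for item in load_items(SAMPLE_CSV):
    print(item.date.isoformat(), item.media, item.author)
```

    Parsing the date eagerly means a malformed value raises a `ValueError` at load time rather than surfacing later during analysis.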

    ⚠️ Notes

    • The dataset was preprocessed to remove duplicates, invalid links, and non-textual clutter.
    • Field values were normalized to support multilingual and cross-platform analysis.
    • Only Castilian Spanish was retained for consistency and clarity.
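
    The authors' preprocessing scripts are not published in this description. Purely as an illustration of what duplicate and invalid-link removal might look like, the sketch below treats two rows as duplicates when their "Fake statement" text matches after lowercasing and whitespace collapsing, and blanks any "Link source" that is not an absolute http(s) URL. Both criteria and both function names are assumptions made for this example.

```python
from urllib.parse import urlparse


def is_valid_link(url: str) -> bool:
    """Assumed rule: accept only absolute http(s) URLs with a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_rows(rows):
    """Drop repeated statements and blank out invalid links."""
    seen = set()
    cleaned = []
    for row in rows:
        # Assumed duplicate key: case-insensitive, whitespace-collapsed text
        key = " ".join(row["Fake statement"].lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        link = row.get("Link source", "")
        cleaned.append({**row, "Link source": link if is_valid_link(link) else ""})
    return cleaned


# Invented rows, for illustration only
rows = [
    {"Fake statement": "Invented claim", "Link source": "https://example.org/a"},
    {"Fake statement": "  invented   CLAIM ", "Link source": "https://example.org/b"},
    {"Fake statement": "Another invented claim", "Link source": "not a url"},
]
print(clean_rows(rows))
```

    Keeping a row while blanking its link follows the field description, which says invalid links were removed rather than the items that carried them.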

    📚 License & Use

    This dataset is intended for non-commercial academic and research purposes. Please cite the original fact-checking organizations and this dataset if used in publications or analysis.
