100+ datasets found

Covid-19 Fake News Infodemic Research Dataset (CoVID19-FNIR Dataset)
ieee-dataport.org
Updated Jul 29, 2025
Cite
DIKSHA SHUKLA (2025). Covid-19 Fake News Infodemic Research Dataset (CoVID19-FNIR Dataset) [Dataset]. https://ieee-dataport.org/open-access/covid-19-fake-news-infodemic-research-dataset-covid19-fnir-dataset
Explore at:
Dataset updated
Jul 29, 2025
Authors
DIKSHA SHUKLA
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The United States of America
Covid-19 News Dataset Both Fake and Real
zenodo.org
explore.openaire.eu
csv
Updated Jul 2, 2021
Cite
Shagoto Rahman; M. Raihan; Laboni Akter; Md. Mohsin Sarker Raihan (2021). Covid-19 News Dataset Both Fake and Real [Dataset]. http://doi.org/10.5281/zenodo.4722484
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.4722484
Dataset updated
Jul 2, 2021
Dataset provided by
Zenodo http://zenodo.org/
Authors
Shagoto Rahman; M. Raihan; Laboni Akter; Md. Mohsin Sarker Raihan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains fake and real news. It has 16,898 unique rows, each corresponding to one news item. It was merged from two datasets: one drawn from various CBC News sources (link: https://zenodo.org/record/4722470) and the other from various web portals (link: https://zenodo.org/record/4282522).

Data Description:

Text: Text contains the news that is either fake or real.

Outcome: Contains either fake or real which is the status of the news.
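
As a minimal sketch of how a file with these two columns might be loaded and de-duplicated, the following uses only the Python standard library. The sample rows and the helper name `unique_rows` are hypothetical, and the exact label spellings in the real file are an assumption.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the merged CSV (columns: Text, Outcome).
SAMPLE_CSV = """Text,Outcome
Vaccines are being tested in several countries.,real
Drinking hot water cures the virus.,fake
Drinking hot water cures the virus.,fake
"""

def unique_rows(csv_text):
    """Return (Text, Outcome) pairs with exact duplicates removed, order kept."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    rows = []
    for record in reader:
        key = (record["Text"].strip(), record["Outcome"].strip().lower())
        if key not in seen:
            seen.add(key)
            rows.append(key)
    return rows

rows = unique_rows(SAMPLE_CSV)
label_counts = Counter(outcome for _, outcome in rows)
print(len(rows), dict(label_counts))
```

Counting labels after de-duplication gives a quick check of class balance before any model is trained.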
Data from: COVID-19 News Articles
ieee-dataport.org
Updated May 18, 2022
Cite
Piyush Ghasiya (2022). COVID-19 News Articles [Dataset]. https://ieee-dataport.org/documents/covid-19-news-articles
Explore at:
Dataset updated
May 18, 2022
Authors
Piyush Ghasiya
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
India
COVID Fake News Dataset
zenodo.org
data.niaid.nih.gov
Updated Nov 27, 2020
Cite
Sumit Banik (2020). COVID Fake News Dataset [Dataset]. http://doi.org/10.5281/zenodo.4282522
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4282522
Dataset updated
Nov 27, 2020
Dataset provided by
Zenodo http://zenodo.org/
Authors
Sumit Banik
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

The dataset contains a list of COVID fake news items and claims shared across the internet.

Content

Headlines: String attribute consisting of the headlines/fact shared.

Outcome: It is binary data where 0 means the headline is fake and 1 means that it is true.
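
A short sketch of decoding this binary column into readable labels follows. The function name `decode_outcome` and the sample rows are hypothetical; only the 0/1 meaning comes from the description above.

```python
# Encoding taken from the description: 0 = fake, 1 = true.
OUTCOME_LABELS = {0: "fake", 1: "true"}

def decode_outcome(value):
    """Map a raw Outcome value (int or numeric string) to a readable label."""
    code = int(value)
    if code not in OUTCOME_LABELS:
        raise ValueError(f"unexpected Outcome value: {value!r}")
    return OUTCOME_LABELS[code]

# Hypothetical rows in (Headlines, Outcome) form.
rows = [("Garlic prevents infection", "0"), ("WHO declares a pandemic", "1")]
decoded = [(headline, decode_outcome(outcome)) for headline, outcome in rows]
print(decoded)
```

Raising on unexpected values, rather than defaulting to one class, keeps malformed rows from silently skewing the labels.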

Inspiration

On many research portals, a common question was whether a combined fake news dataset was available. This led to the publication of this dataset.
COVID-19 Fake News Dataset
data.mendeley.com
Updated Feb 22, 2021
Cite
Abhishek Koirala (2021). COVID-19 Fake News Dataset [Dataset]. http://doi.org/10.17632/zwfdmp5syg.1
Explore at:
Unique identifier
https://doi.org/10.17632/zwfdmp5syg.1
Dataset updated
Feb 22, 2021
Authors
Abhishek Koirala
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset consists of a collection of true and fake news related to COVID-19. The news items span the period from December 2019 to July 2020.
Covid-19 and vaccine news dataset
data.mendeley.com
Updated Oct 27, 2021
+ more versions
Cite
Rajat Thakur (2021). Covid-19 and vaccine news dataset [Dataset]. http://doi.org/10.17632/hwrdzw26vk.1
Explore at:
Unique identifier
https://doi.org/10.17632/hwrdzw26vk.1
Dataset updated
Oct 27, 2021
Authors
Rajat Thakur
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the latest world news related to Covid-19 and Covid vaccine with the news article's available metadata.
Covid-19 latest news dataset
data.mendeley.com
Updated Oct 27, 2021
+ more versions
Cite
Rajat Thakur (2021). Covid-19 latest news dataset [Dataset]. http://doi.org/10.17632/8rbm7d874k.1
Explore at:
Unique identifier
https://doi.org/10.17632/8rbm7d874k.1
Dataset updated
Oct 27, 2021
Authors
Rajat Thakur
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Coronavirus disease 2019 (COVID19) time series that lists confirmed cases, reported deaths, and reported recoveries. Data is broken down by country (and sometimes by sub-region).

Coronavirus disease (COVID19) is caused by severe acute respiratory syndrome Coronavirus 2 (SARSCoV2) and has had an effect worldwide. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, currently indicating more than 118,000 cases of coronavirus disease in more than 110 countries and territories around the world.

This dataset contains the latest news related to Covid-19 and it was fetched with the help of Newsdata.io news API.
COVID-19 Fake News Dataset (COVID19 Fake News Detection in English)
opendatalab.com
zip
Updated Oct 1, 2020
Cite
International Institute for Information Technology, Hyderabad (2020). COVID-19 Fake News Dataset (COVID19 Fake News Detection in English) [Dataset]. https://opendatalab.com/OpenDataLab/COVID-19_Fake_News_Dataset
Explore at:
Available download formats: zip
Dataset updated
Oct 1, 2020
Dataset provided by
Wipro Research
International Institute for Information Technology, Hyderabad
License
https://competitions.codalab.org/competitions/26655#learn_the_details-terms_and_conditions
Description
Along with the COVID-19 pandemic, we are also fighting an 'infodemic'. Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm. This is further exacerbated at the time of a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines: Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM). We obtain the best performance of 93.46% F1-score with SVM.
COVIFN : Fake News on COVID19
ieee-dataport.org
Updated Nov 3, 2021
Cite
Isha Agarwal (2021). COVIFN : Fake News on COVID19 [Dataset]. https://ieee-dataport.org/documents/covifn-fake-news-covid19
Explore at:
Dataset updated
Nov 3, 2021
Authors
Isha Agarwal
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
the removal of special characters and non-vital information is performed. The file contains columns such as:
Date: publish date of news article
country: country the article is about
text: the news article content
label: fake or real news label
URL: the fact-checked site
source: original news source site
COVID-19 Fake News Dataset
kaggle.com
zip
Updated Nov 4, 2020
Cite
Möbius (2020). COVID-19 Fake News Dataset [Dataset]. https://www.kaggle.com/arashnic/covid19-fake-news
Explore at:
Available download formats: zip (3948402 bytes)
Dataset updated
Nov 4, 2020
Authors
Möbius
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

As the COVID-19 virus quickly spreads around the world, misinformation related to COVID-19 is unfortunately also created and spreads like wildfire. Such misinformation has caused confusion among people, disruptions in society, and even deadly health consequences. Being able to understand, detect, and mitigate such COVID-19 misinformation therefore has not only deep intellectual value but also huge societal impact. This dataset was created to help researchers combat COVID-19 health misinformation.

Content

The dataset is a diverse COVID-19 healthcare misinformation dataset, including fake news on websites and social platforms, along with users' social engagement with such news. It includes 4,251 news items, 296,000 related user engagements, 926 social platform posts about COVID-19, and ground truth labels.

Version 0.1 (05/17/2020) initial version corresponding to arXiv paper CoAID: COVID-19 HEALTHCARE MISINFORMATION DATASET

Version 0.2 (08/03/2020) added data from May 1, 2020 through July 1, 2020

Version 0.3 (11/03/2020) added data from July 1, 2020 through September 1, 2020

Acknowledgements

Limeng Cui and Dongwon Lee, Pennsylvania State University.
covid_fake_news
huggingface.co
Updated Mar 7, 2024
Cite
Yiyang Nan (2024). covid_fake_news [Dataset]. https://huggingface.co/datasets/nanyy1025/covid_fake_news
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 7, 2024
Authors
Yiyang Nan
Description
Constraint@AAAI2021 - COVID19 Fake News Detection in English

@misc{patwa2020fighting,
  title={Fighting an Infodemic: COVID-19 Fake News Dataset},
  author={Parth Patwa and Shivam Sharma and Srinivas PYKL and Vineeth Guptha and Gitanjali Kumari and Md Shad Akhtar and Asif Ekbal and Amitava Das and Tanmoy Chakraborty},
  year={2020},
  eprint={2011.03327},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19...
service.tib.eu
Updated Dec 16, 2024
+ more versions
Cite
(2024). FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19 - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/fakecovid---a-multilingual-cross-domain-fact-check-news-dataset-for-covid-19
Explore at:
Dataset updated
Dec 16, 2024
Description
The FakeCovid dataset contains 5182 fact-checked news articles for COVID-19 collected from January to May 2020.
Salient sentence extraction dataset from COVID-19 news reports
ieee-dataport.org
Updated Jan 31, 2023
Cite
Sumanta Banerjee (2023). Salient sentence extraction dataset from COVID-19 news reports [Dataset]. https://ieee-dataport.org/documents/salient-sentence-extraction-dataset-covid-19-news-reports
Explore at:
Dataset updated
Jan 31, 2023
Authors
Sumanta Banerjee
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
(3) deaths
INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET
data.niaid.nih.gov
Updated Jul 19, 2024
Cite
Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Nafiz Sadman
Kishor Datta Gupta
Nishat Anjum
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bangladesh, United States
Description
Introduction

There are several works based on Natural Language Processing on newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.

However, not much NLP research has been invested in studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight the virus and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NNK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.

2 Data-set Introduction

2.1 Data Collection

We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We named this collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named the collection “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and the most read in their respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferred over automation to ensure that the news reports were highly relevant to the subject. The newspapers' online sites had dynamic content with advertisements in no particular order, so there was a high chance that online scrapers would collect inaccurate news reports. One of the challenges in collecting the data was the subscription requirement: each newspaper required $1 per subscription. Some criteria provided as guidelines to the human data-collectors were as follows:

The headline must have one or more words directly or indirectly related to COVID-19.

The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.

Avoid taking duplicate reports.

Maintain a time frame for the above mentioned newspapers.

To collect these data, we used a Google Form for USA and BD. Two human editors went through each entry to check for spam or troll entries.

2.2 Data Pre-processing and Statistics

Some pre-processing steps performed on the newspaper report dataset are as follows:

Remove hyperlinks.

Remove non-English alphanumeric characters.

Remove stop words.

Lemmatize text.

While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any presence of the above-mentioned criteria.
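
The pre-processing steps listed above can be sketched as follows. This is a minimal illustration, not the authors' script: the stop-word list is a tiny hypothetical sample, lowercasing is an added assumption, "non-English alphanumeric characters" is interpreted as anything outside A-Z, a-z, 0-9, and whitespace, and lemmatization is left out because the source does not name the tool used.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ENGLISH_ALNUM = re.compile(r"[^A-Za-z0-9\s]")

def preprocess(text):
    """Apply hyperlink removal, character filtering, and stop-word removal."""
    text = URL_PATTERN.sub(" ", text)
    text = NON_ENGLISH_ALNUM.sub(" ", text)
    # Lowercasing is an assumption made here so stop-word matching is simple.
    tokens = [tok for tok in text.lower().split() if tok not in STOP_WORDS]
    # Lemmatization is intentionally omitted: it needs a lexical resource
    # from an NLP library, and the source does not say which one was used.
    return tokens

sample = "The outbreak is spreading in 2020: see https://example.com/report 新闻"
print(preprocess(sample))
```

Removing hyperlinks before filtering characters matters: otherwise the URL would be broken into stray word fragments that survive as tokens.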

The primary data statistics of the two datasets are shown in Tables 1 and 2.

Table 1: Covid-News-USA-NNK data statistics

No. of words per headline	7 to 20
No. of words per body content	150 to 2100

Table 2: Covid-News-BD-NNK data statistics

No. of words per headline	10 to 20
No. of words per body content	100 to 1500

2.3 Dataset Repository

We used GitHub as our primary data repository, under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
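
Regenerating JSON from a CSV file, as described above, could look like the sketch below. It is not the repository's actual script; the function name `csv_to_json` and the column names in the sample are hypothetical.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(records, ensure_ascii=False, indent=2)

# Hypothetical two-column sample; the real files' columns are not listed here.
sample = "headline,body\nVirus spreads,Officials report new cases\n"
json_text = csv_to_json(sample)
print(json_text)
```

Treating the CSV as the single source of truth and always deriving the JSON from it avoids the two formats drifting apart.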

3 Literature Review

Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and machine translation as used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms designed solely for Natural Language Processing tasks. The two most prominent are BERT [14], which uses a bidirectional Transformer encoder and performs near-perfectly on classification and word-prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are all pre-trained models, since training them carries a huge computational cost. Information extraction is a general concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction from text could be identifying named entities and locations or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation, or LDA [17].

Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses a graph to calculate the weight of each word and picks the words with the highest weights.
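
To make the graph-based idea concrete, here is a simplified sketch in the spirit of TextRank: words become nodes, words that co-occur within a small window are linked, and a PageRank-style iteration scores each node. The function name, window size, damping factor, and sample tokens are all assumptions for illustration, and the sketch omits the part-of-speech filtering and phrase merging of the published method.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            # Each neighbor shares its score equally among its own links.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

tokens = "virus economy virus masks virus election economy jobs".split()
print(textrank_keywords(tokens))
```

Note that the score reflects how well connected a word is in the graph, not simply how often it occurs, so a frequent word with few distinct neighbors can rank below a rarer but better-connected one.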

Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

4 Our experiments and Result analysis

We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can point out a few observations:

In February, both newspapers talked about China and the source of the outbreak.

StarTribune emphasized Minnesota as the state of most concern. In April, it seemed to have been even more concerned.

Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.

Washington Post discussed global issues more than StarTribune.

StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

While both newspapers mentioned the outbreak in China in February, the spread in the United States is highlighted more heavily from March through May, displaying the critical impact caused by the virus.

We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports and created a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, with counts rising gradually from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.

We used Vader Sentiment Analysis to extract the sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. The Vader Sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide us information about how a state or country is reacting to the pandemic.

We used the PageRank algorithm to extract keywords from headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, ’Response’. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
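
The keyword-based number extraction described in this section can be illustrated with a small regular-expression sketch. The authors' actual script is not shown in the source, so the matching rule here (a number followed within a few words by a keyword) is an assumption, as are the function name and the sample sentence.

```python
import re

# Keywords taken from the description; the matching rule itself is an assumption.
KEYWORDS = ["deaths", "infected", "died", "infections", "quarantined", "diagnosed"]

# A number (with optional thousands separators) followed within
# at most three intervening words by one of the keywords.
PATTERN = re.compile(
    r"(\d[\d,]*)\s+(?:\w+\s+){0,3}?(" + "|".join(KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def extract_counts(text):
    """Return (keyword, number) pairs found in a news report."""
    results = []
    for number, keyword in PATTERN.findall(text):
        results.append((keyword.lower(), int(number.replace(",", ""))))
    return results

report = "Officials confirmed 1,250 new infections and 37 deaths on Monday."
print(extract_counts(report))
```

A rule this simple will also pick up numbers that are not case counts (for example, ages or dates that happen to precede a keyword), so extracted series would need manual spot checks.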
Coronavirus (Covid-19) Data in the United States
nytimes.com
openicpsr.org
+2more
Cite
New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
Explore at:
Dataset provided by
New York Times
Description
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
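Because the files hold cumulative counts, per-day increases have to be derived by differencing. A minimal sketch follows, using a hypothetical five-day series; the function name `daily_new_cases` is an assumption and no real file is read.

```python
def daily_new_cases(cumulative):
    """Convert a cumulative case series into per-day increases."""
    daily = []
    previous = 0
    for total in cumulative:
        # A cumulative series may be revised downward; this sketch keeps the
        # negative difference visible rather than clipping it to zero.
        daily.append(total - previous)
        previous = total
    return daily

# Hypothetical cumulative counts for one state over five days.
cumulative_cases = [1, 1, 4, 9, 15]
print(daily_new_cases(cumulative_cases))
```

The first value is treated as the increase from zero, which fits a series that starts at the first reported case.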
Actions taken after reading fake COVID-19 news in the UK 2020-2021
statista.com
Updated Jul 9, 2025
Cite
Statista (2025). Actions taken after reading fake COVID-19 news in the UK 2020-2021 [Dataset]. https://www.statista.com/statistics/1113700/coronavirus-fake-news-actions-uk/
Explore at:
Dataset updated
Jul 9, 2025
Dataset authored and provided by
Statista http://statista.com/
Area covered
United Kingdom
Description
A survey carried out in the United Kingdom in September 2021 found that ** percent of respondents did not take any action after encountering what they believed to be false or misleading information on the COVID-19 outbreak. Whilst this figure was lower than the share who said the same in the 2020 survey, taking no action remained the most common response to fake coronavirus news. Meanwhile, ** percent used a fact checking site or tool to determine whether or not the information they found was true, and ** percent turned to family or friends for help in confirming the legitimacy of news they suspected to be false.

For further information about the coronavirus (COVID-19) pandemic, please visit our dedicated Facts and Figures page.
CMU-MisCov19: A Novel Twitter Dataset for Characterizing COVID-19...
zenodo.org
data.niaid.nih.gov
pdf, zip
Updated Jul 19, 2024
Cite
Shahan Ali Memon; Kathleen M. Carley (2024). CMU-MisCov19: A Novel Twitter Dataset for Characterizing COVID-19 Misinformation [Dataset]. http://doi.org/10.5281/zenodo.4024154
Explore at:
Available download formats: zip, pdf
Unique identifier
https://doi.org/10.5281/zenodo.4024154
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Shahan Ali Memon; Kathleen M. Carley
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
From conspiracy theories to fake cures and fake treatments, COVID-19 has become a hotbed for the spread of misinformation online. It is more important than ever to identify methods to debunk and correct false information online. Detection and characterization of misinformation requires the availability of annotated datasets. Most of the published COVID-19 Twitter datasets are generic, lack annotations or labels, employ automated annotations using transfer learning or semi-supervised methods, or are not specifically designed for misinformation. Annotated datasets are either focused only on "fake news", are small in size, or have less diversity in terms of classes.

Here, we present a novel Twitter misinformation dataset called "CMU-MisCov19" with 4573 annotated tweets over 17 themes around the COVID-19 discourse. We also present our annotation codebook for the different COVID-19 themes on Twitter, along with their descriptions and examples, for the community to use for collecting further annotations. Further details related to the dataset and our analysis based on it can be found at https://arxiv.org/abs/2008.00791. In adherence to Twitter’s terms and conditions, we do not provide the full tweet JSONs but provide a ".csv" file with the tweet IDs so that the tweets can be rehydrated. We also provide the annotations and the date of creation for each tweet for the reproduction of the results of our analyses.
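
Rehydration itself requires looking tweets up by ID through the platform's API, which is not shown here. The sketch below covers only the preparatory step of reading the IDs and grouping them into batches for lookup. The column names `tweet_id` and `annotation`, the sample values, and the batch size are all assumptions, not details of the released file.

```python
import csv
import io

def read_tweet_ids(csv_text, column="tweet_id"):
    """Read tweet IDs, keeping them as strings so they pass through unchanged."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if row[column].strip()]

def batched(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]

# Hypothetical file content; the real column names and labels may differ.
sample = (
    "tweet_id,annotation\n"
    "1234567890123456789,label_a\n"
    "1234567890123456790,label_b\n"
    "1234567890123456791,label_c\n"
)
ids = read_tweet_ids(sample)
batches = batched(ids, size=2)
print(batches)
```

Tweets that have since been deleted or made private will not be returned by a lookup, so the rehydrated set can be smaller than the ID list.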

Note: If for any reason, you are not able to rehydrate all the tweets, reach out to Shahan Ali Memon at (shahan@nyu.edu).

If you use this data, please cite our paper as follows:

"Shahan Ali Memon and Kathleen M. Carley. Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset, In Proceedings of The 5th International Workshop on Mining Actionable Insights from Social Networks (MAISoN 2020), co-located with CIKM, virtual event due to COVID-19, 2020."
Coronavirus COVID-19 Global Cases
redivis.com
application/jsonl +7
Updated Jul 13, 2020
Cite
Stanford Center for Population Health Sciences (2020). Coronavirus COVID-19 Global Cases [Dataset]. http://doi.org/10.57761/pyf5-4e40
Explore at:
Available download formats: sas, csv, application/jsonl, spss, stata, parquet, arrow, avro
Unique identifier
https://doi.org/10.57761/pyf5-4e40
Dataset updated
Jul 13, 2020
Dataset provided by
Redivis Inc.
Authors
Stanford Center for Population Health Sciences
Time period covered
Jan 22, 2020 - Jul 12, 2020
Description
Abstract

JHU Coronavirus COVID-19 Global Cases, by country

Documentation

PHS updates the Coronavirus Global Cases dataset three times a week (Monday, Wednesday, and Friday) from Cloud Marketplace.

This data comes from the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). This database was created in response to the Coronavirus public health emergency to track reported cases in real-time. The data include the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries, aggregated at the appropriate province or state. It was developed to enable researchers, public health authorities and the general public to track the outbreak as it unfolds. Additional information is available in the blog post.

Visual Dashboard (desktop): https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

Section 2

Included Data Sources are:

World Health Organization (WHO): https://www.who.int/

DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia.

BNO News: https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/

National Health Commission of the People’s Republic of China (NHC): http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml

China CDC (CCDC): http://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm

Hong Kong Department of Health: https://www.chp.gov.hk/en/features/102465.html

Macau Government: https://www.ssm.gov.mo/portal/

Taiwan CDC: https://sites.google.com/cdc.gov.tw/2019ncov/taiwan?authuser=0

US CDC: https://www.cdc.gov/coronavirus/2019-ncov/index.html

Government of Canada: https://www.canada.ca/en/public-health/services/diseases/coronavirus.html

Australia Government Department of Health: https://www.health.gov.au/news/coronavirus-update-at-a-glance

European Centre for Disease Prevention and Control (ECDC): https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases

Ministry of Health Singapore (MOH): https://www.moh.gov.sg/covid-19

Italy Ministry of Health: http://www.salute.gov.it/nuovocoronavirus

1Point3Acres: https://coronavirus.1point3acres.com/en

WorldoMeters: https://www.worldometers.info/coronavirus/

Section 3

Terms of Use:

This GitHub repo and its contents herein, including all data, mapping, and analysis, copyright 2020 Johns Hopkins University, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources, that do not always agree. The Johns Hopkins University hereby disclaims any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.

Section 4

U.S. county-level characteristics relevant to COVID-19

Chin, Kahn, Krieger, Buckee, Balsari and Kiang (forthcoming) show that counties differ significantly in biological, demographic and socioeconomic factors that are associated with COVID-19 vulnerability. A range of publicly available county-specific data identifying these key factors, guided by international experiences and consideration of epidemiological parameters of importance, have been combined by the authors and are available for use:

https://github.com/mkiang/county_preparedness/

Spanish Fake News Dataset

zenodo.org
produccioncientifica.ucm.es

csv, txt

Updated Jun 4, 2025

Cite

Arsenii Tretiakov; Sergio D'Antonio Maceiras; Alejandro Martín (2025). Spanish Fake News Dataset [Dataset]. http://doi.org/10.5281/zenodo.15592391

Explore at:

Available download formats: txt, csv

Unique identifier

https://doi.org/10.5281/zenodo.15592391

Dataset updated

Jun 4, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Arsenii Tretiakov; Sergio D'Antonio Maceiras; Alejandro Martín

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Feb 2021

Description

Spanish Fake News Dataset

This dataset contains a structured and annotated collection of false news items in Spanish (Castilian), gathered and processed for academic research on misinformation.

Dataset Scope

The dataset represents most of the recorded false news messages and their variations up to 1 February 2021.

Content Description

The dataset includes samples of false information in various formats:

News articles and headlines
Tweets and Facebook/Instagram/Telegram posts
YouTube video captions
WhatsApp text and voice message transcripts
Transcribed video/audio fragments with false claims
Fake government documents
Captions from photos and memes
Text extracted from images using OCR

Only Spanish (Castilian) texts were used, excluding regional variants (e.g., Catalan, Basque, Galician) for consistency.

Sources

The data was collected from the following verified fact-checking initiatives:

Fact-checkers from these organizations provide detailed articles identifying and explaining falsehoods, often including:

General context of the event
Quotes or links to false claims
Analysis and explanation of why the claims are false
Verified information or corrections

Collection Method

The dataset was built using both manual extraction (e.g., identifying and quoting false statements) and automated parsing:

MyNews service: an archive of Spanish mass media
Custom scripts: for parsing and extracting structured data
OCR tools: for extracting text from images (e.g., memes and screenshots)

Fields Description

Column Name	Description
Topic	The thematic category of the news item (e.g., Politics, Health, COVID-19, Crime). Normalized and translated to English.
Link source	URL to the original news piece, fact-check report, or source of the claim. Invalid links were removed.
Media	The platform or outlet where the false claim appeared (e.g., Facebook, YouTube, WhatsApp). Normalized for consistent spelling and language.
Date	Publication or verification date of the news item, in YYYY-MM-DD format.
Author	(Optional) Author of the news or platform source, if available. May be empty.
Headlines	Title or summary of the news item or article containing the false information.
Fake statement	Quoted false claim or misinformation as cited in the verification article.
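
As a minimal sketch of how the fields above might be consumed, the following Python snippet reads rows with these column names, parses the Date field, and counts items per Topic. The comma delimiter, the UTF-8 encoding, and the sample rows are assumptions made for illustration; the rows are invented placeholders and do not come from the dataset, and the helper names `load_records` and `count_by_topic` are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import date

# Column names as listed in the field description above.
EXPECTED_COLUMNS = [
    "Topic", "Link source", "Media", "Date",
    "Author", "Headlines", "Fake statement",
]


def load_records(handle):
    """Read rows from a CSV file object and check them against the listed columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    records = []
    for row in reader:
        # Date is described as YYYY-MM-DD; fromisoformat raises ValueError
        # for values that are not valid ISO dates.
        row["Date"] = date.fromisoformat(row["Date"])
        # Author is optional and may be empty; normalise empty strings to None.
        row["Author"] = row["Author"] or None
        records.append(row)
    return records


def count_by_topic(records):
    """Return a Counter mapping each Topic value to its number of rows."""
    return Counter(r["Topic"] for r in records)


# Placeholder rows invented for this example (not real dataset content).
SAMPLE = """Topic,Link source,Media,Date,Author,Headlines,Fake statement
Health,https://example.org/check-1,WhatsApp,2020-04-01,,Placeholder headline 1,Placeholder claim 1
Politics,https://example.org/check-2,Facebook,2020-11-15,Placeholder author,Placeholder headline 2,Placeholder claim 2
Health,https://example.org/check-3,YouTube,2021-01-20,,Placeholder headline 3,Placeholder claim 3
"""

records = load_records(io.StringIO(SAMPLE))
print(count_by_topic(records))
```

To run this against a downloaded file, pass a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved; the actual file name and delimiter should be checked against the Zenodo record.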

⚠️ Notes

The dataset was preprocessed to remove duplicates, invalid links, and non-textual clutter.
Field values were normalized to support multilingual and cross-platform analysis.
Only Castilian Spanish was retained for consistency and clarity.

📚 License & Use

This dataset is intended for non-commercial academic and research purposes. Please cite the original fact-checking organizations and this dataset if used in publications or analysis.

COVID-19 in Italy
kaggle.com
Updated Dec 7, 2020
Cite
SRK (2020). COVID-19 in Italy [Dataset]. https://www.kaggle.com/datasets/sudalairajkumar/covid19-in-italy
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Dec 7, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
SRK
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Italy
Description
Context

Coronaviruses are a large family of viruses which may cause illness in animals or humans. In humans, several coronaviruses are known to cause respiratory infections ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). The most recently discovered coronavirus causes coronavirus disease COVID-19 - WHO

People can catch COVID-19 from others who have the virus. The disease has been spreading rapidly around the world, and Italy is one of the most affected countries.

On March 8, 2020 - Italy’s prime minister announced a sweeping coronavirus quarantine early Sunday, restricting the movements of about a quarter of the country’s population in a bid to limit contagions at the epicenter of Europe’s outbreak. - TIME

Content

This dataset comes from https://github.com/pcm-dpc/COVID-19, where it is collected by Sito del Dipartimento della Protezione Civile - Emergenza Coronavirus: la risposta nazionale.

This dataset has two files:

covid19_italy_province.csv - Province level data of COVID-19 cases

covid_italy_region.csv - Region level data of COVID-19 cases

Acknowledgements

The data is collected by Sito del Dipartimento della Protezione Civile - Emergenza Coronavirus: la risposta nazionale and uploaded to the GitHub repo https://github.com/pcm-dpc/COVID-19.

A dashboard for the data is also available. The picture is courtesy of the dashboard.

Inspiration

Insights on:

* Spread to various regions over time
* Predicting the spread of COVID-19 ahead of time to take preventive measures

Covid-19 Fake News Infodemic Research Dataset (CoVID19-FNIR Dataset)

5 scholarly articles cite this dataset.
