100+ datasets found

Data from: COVID-19 News Articles
ieee-dataport.org
Updated May 18, 2022
Cite
Piyush Ghasiya (2022). COVID-19 News Articles [Dataset]. https://ieee-dataport.org/documents/covid-19-news-articles
Explore at:
Dataset updated
May 18, 2022
Authors
Piyush Ghasiya
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
India
COVID Fake News Dataset
zenodo.org
data.niaid.nih.gov
Updated Nov 27, 2020
Cite
Sumit Banik; Sumit Banik (2020). COVID Fake News Dataset [Dataset]. http://doi.org/10.5281/zenodo.4282522
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4282522
Dataset updated
Nov 27, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sumit Banik; Sumit Banik
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

The dataset contains a list of COVID fake news items and claims shared across the internet.

Content

Headlines: String attribute consisting of the headlines/fact shared.

Outcome: It is binary data where 0 means the headline is fake and 1 means that it is true.
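As a rough illustration of the two attributes described above, the following sketch counts fake and true headlines. The sample rows are invented for illustration, and the column names "Headlines" and "Outcome" are assumed to match the description; the real file may differ.

```python
import csv
import io

# Hypothetical sample rows; the column names "Headlines" and "Outcome"
# are assumed from the description and may differ in the real file.
SAMPLE_CSV = """Headlines,Outcome
Drinking hot water cures the virus,0
Health agency publishes updated mask guidance,1
Garlic prevents infection,0
"""


def count_labels(csv_text):
    """Return (fake_count, true_count), reading 0 as fake and 1 as true."""
    fake = true = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["Outcome"]) == 0:
            fake += 1
        else:
            true += 1
    return fake, true


fake_count, true_count = count_labels(SAMPLE_CSV)
print(fake_count, true_count)
```

Checking the label balance this way is a common first step before training a classifier on binary data.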

Inspiration

On many research portals, a common question was whether a combined fake news dataset was available. This led to the publication of this dataset.
Covid-19 News Dataset Both Fake and Real
explore.openaire.eu
zenodo.org
Updated Apr 27, 2021
Cite
Shagoto Rahman; M. M. Raihan; Laboni Akter; Md. Mohsin Sarker Raihan (2021). Covid-19 News Dataset Both Fake and Real [Dataset]. http://doi.org/10.5281/zenodo.4722483
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4722483
Dataset updated
Apr 27, 2021
Authors
Shagoto Rahman; M. M. Raihan; Laboni Akter; Md. Mohsin Sarker Raihan
Description
The dataset contains fake and real news. There are 16,898 unique rows, each corresponding to one news item. It was merged from two datasets: one drawn from CBC News sources (link: https://zenodo.org/record/4722470) and the other from various web portals (link: https://zenodo.org/record/4282522).

Data Description:
Text: The news text, which is either fake or real.
Outcome: The status of the news, either fake or real.

Data source link 1: https://www.kaggle.com/ryanxjhan/cbc-news-coronavirusarticles-march-26
Data source link 2: https://zenodo.org/record/4722470
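A merge of two labelled sources into unique rows could look roughly like the sketch below. This is not the authors' procedure: the normalisation rule (lower-casing and collapsing whitespace) and the function name are assumptions made for illustration.

```python
def merge_unique(real_texts, fake_texts):
    """Combine two lists of news texts into labelled rows, dropping repeats.

    Texts are compared after lower-casing and collapsing whitespace, which
    is an assumed normalisation. If the same text appears in both sources,
    the first label seen ("real") wins.
    """
    seen = set()
    merged = []
    for label, texts in (("real", real_texts), ("fake", fake_texts)):
        for text in texts:
            key = " ".join(text.lower().split())
            if key not in seen:
                seen.add(key)
                merged.append({"Text": text, "Outcome": label})
    return merged


rows = merge_unique(["A b", "a  B", "C"], ["D", "c"])
print(len(rows))
```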
Covid-19 Fake News Infodemic Research Dataset (CoVID19-FNIR Dataset)
ieee-dataport.org
Updated Jul 29, 2025
Cite
DIKSHA SHUKLA (2025). Covid-19 Fake News Infodemic Research Dataset (CoVID19-FNIR Dataset) [Dataset]. https://ieee-dataport.org/open-access/covid-19-fake-news-infodemic-research-dataset-covid19-fnir-dataset
Explore at:
Dataset updated
Jul 29, 2025
Authors
DIKSHA SHUKLA
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The United States of America
Coronavirus COVID-19 Global Cases
redivis.com
application/jsonl +7
Updated Jul 13, 2020
Cite
Stanford Center for Population Health Sciences (2020). Coronavirus COVID-19 Global Cases [Dataset]. http://doi.org/10.57761/pyf5-4e40
Explore at:
Available download formats: sas, csv, application/jsonl, spss, stata, parquet, arrow, avro
Unique identifier
https://doi.org/10.57761/pyf5-4e40
Dataset updated
Jul 13, 2020
Dataset provided by
Redivis Inc.
Authors
Stanford Center for Population Health Sciences
Time period covered
Jan 22, 2020 - Jul 12, 2020
Description
Abstract

JHU Coronavirus COVID-19 Global Cases, by country

Documentation

PHS updates the Coronavirus Global Cases dataset three times a week (Monday, Wednesday and Friday) from Cloud Marketplace.

This data comes from the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). This database was created in response to the Coronavirus public health emergency to track reported cases in real-time. The data include the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries, aggregated at the appropriate province or state. It was developed to enable researchers, public health authorities and the general public to track the outbreak as it unfolds. Additional information is available in the blog post.

Visual Dashboard (desktop): https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

Section 2

Included Data Sources are:

World Health Organization (WHO): https://www.who.int/

DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia.

BNO News: https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/

National Health Commission of the People’s Republic of China (NHC): http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml

China CDC (CCDC): http://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm

Hong Kong Department of Health: https://www.chp.gov.hk/en/features/102465.html

Macau Government: https://www.ssm.gov.mo/portal/

Taiwan CDC: https://sites.google.com/cdc.gov.tw/2019ncov/taiwan?authuser=0

US CDC: https://www.cdc.gov/coronavirus/2019-ncov/index.html

Government of Canada: https://www.canada.ca/en/public-health/services/diseases/coronavirus.html

Australia Government Department of Health: https://www.health.gov.au/news/coronavirus-update-at-a-glance

European Centre for Disease Prevention and Control (ECDC): https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases

Ministry of Health Singapore (MOH): https://www.moh.gov.sg/covid-19

Italy Ministry of Health: http://www.salute.gov.it/nuovocoronavirus

1Point3Acres: https://coronavirus.1point3acres.com/en

Worldometers: https://www.worldometers.info/coronavirus/

Section 3

Terms of Use:

This GitHub repo and its contents herein, including all data, mapping, and analysis, copyright 2020 Johns Hopkins University, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources, that do not always agree. The Johns Hopkins University hereby disclaims any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.

Section 4

U.S. county-level characteristics relevant to COVID-19

Chin, Kahn, Krieger, Buckee, Balsari and Kiang (forthcoming) show that counties differ significantly in biological, demographic and socioeconomic factors that are associated with COVID-19 vulnerability. A range of publicly available county-specific data identifying these key factors, guided by international experiences and consideration of epidemiological parameters of importance, have been combined by the authors and are available for use:

https://github.com/mkiang/county_preparedness/
Covid-19 and vaccine news dataset
ieee-dataport.org
Updated Oct 27, 2021
Cite
Rajat Thakur (2021). Covid-19 and vaccine news dataset [Dataset]. https://ieee-dataport.org/documents/covid-19-and-vaccine-news-dataset
Explore at:
Dataset updated
Oct 27, 2021
Authors
Rajat Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains world news related to Covid-19 and vaccines, along with each article's available metadata.
COVID-19 Fake News Dataset
data.mendeley.com
Updated Feb 22, 2021
Cite
Abhishek Koirala (2021). COVID-19 Fake News Dataset [Dataset]. http://doi.org/10.17632/zwfdmp5syg.1
Explore at:
Unique identifier
https://doi.org/10.17632/zwfdmp5syg.1
Dataset updated
Feb 22, 2021
Authors
Abhishek Koirala
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is a collection of true and fake news related to COVID-19. It covers news from December 2019 to July 2020.
Problems with finding coronavirus news worldwide 2020
statista.com
Updated Jul 9, 2025
Cite
Statista (2025). Problems with finding coronavirus news worldwide 2020 [Dataset]. https://www.statista.com/statistics/1104506/coronavirus-news-opinions-worldwide/
Explore at:
Dataset updated
Jul 9, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Mar 6, 2020 - Mar 10, 2020
Area covered
Worldwide
Description
A global study conducted in March 2020 gathered data on consumers' attitudes to, experiences of, and issues with news consumption regarding the coronavirus pandemic, and found that ** percent of respondents were concerned about the amount of fake news being spread about the virus, which would impede their efforts to find out the facts that they need to stay updated. Others were met with challenges when seeking out trustworthy and reliable information, and ** percent felt that the public should be given more coronavirus news and updates from scientists and less from politicians.
News articles and front pages from 19 Swedish news sites during the...
researchdata.se
demo.researchdata.se
Updated Nov 2, 2021
Cite
Peter M. Dahlgren (2021). News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021 [Dataset]. http://doi.org/10.5878/d18f-q220
Explore at:
Available download formats: (477962370), (255819)
Unique identifier
https://doi.org/10.5878/d18f-q220
Dataset updated
Nov 2, 2021
Dataset provided by
University of Gothenburg
Authors
Peter M. Dahlgren
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2021 - Apr 26, 2021
Area covered
Sweden
Description
This dataset contains news articles from Swedish news sites during the covid-19 corona pandemic 2020–2021. The purpose was to develop and test new methods for collection and analyses of large news corpora by computational means. In total, there are 677,151 articles collected from 19 news sites during 2020-01-01 to 2021-04-26. The articles were collected by scraping all links on the homepages and main sections of each site every two hours, day and night.

The dataset also includes about 45 million timestamps at which the articles were present on the front pages (homepages and main sections of each news site, such as domestic news, sports, editorials, etc.). This allows for detailed analysis of what articles any reader likely was exposed to when visiting a news site. The time resolution is (as stated previously) two hours, meaning that you can detect changes in which articles were on the front pages every two hours.
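Given the two-hour resolution described above, a rough estimate of how long an article stayed on the front pages can be sketched as follows. The pair layout (article id, timestamp) is an assumption; the actual file has four variables whose names are not given here.

```python
from collections import Counter

SCRAPE_INTERVAL_HOURS = 2  # scraping interval stated in the description


def estimated_frontpage_hours(observations):
    """Estimate front-page exposure per article.

    observations: iterable of (article_id, timestamp) pairs, one per scrape
    in which the article was seen (assumed layout). Duplicate pairs, such as
    the same article seen in two sections during one scrape, count once.
    """
    distinct = set(observations)
    counts = Counter(article_id for article_id, _ in distinct)
    return {a: n * SCRAPE_INTERVAL_HOURS for a, n in counts.items()}


sample = [
    ("a1", "2020-03-01T00:00"),
    ("a1", "2020-03-01T02:00"),
    ("a1", "2020-03-01T02:00"),
    ("a2", "2020-03-01T00:00"),
]
hours = estimated_frontpage_hours(sample)
print(hours)
```

The result is an upper-bound style estimate: an article seen in one scrape is credited with a full two-hour interval.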

The 19 news sites are aftonbladet.se, arbetet.se, da.se, di.se, dn.se, etc.se, expressen.se, feministisktperspektiv.se, friatider.se, gp.se, nyatider.se, nyheteridag.se, samnytt.se, samtiden.nu, svd.se, sverigesradio.se, svt.se, sydsvenskan.se and vlt.se.

Due to copyright, the full text is not available but instead transformed into a document-term matrix (in long format) which contains the frequency of all words for each article (in total, 80 million words). Each article also includes extensive metadata that was extracted from the articles themselves (URL, document title, article heading, author, publish date, edit date, language, section, tags, category) and metadata that was inferred by simple heuristic algorithms (page type, article genre, paywall).

The dataset consists of the following: article_metadata.csv (53 MB): The file contains information about each news article, one article per row. In total, there are 677,151 observations and 17 variables.

article_text.csv (236 MB): The file contains the id of each news article and how many times (count) a specific word occurs in the news article. The file contains 80,090,784 observations and 3 variables in long format.

frontpage_timestamps.csv (175 MB): The file contains when each news article was found on the front page (homepage and main sections) of the news sites. The file contains 45,337,740 observations and 4 variables in long format.

More information about the content in the files is found in the README-file. In it you will also find the R-script for using the data.
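To illustrate the long-format document-term matrix, the sketch below regroups (id, word, count) rows into one bag of words per article. It is written in Python rather than the R script mentioned in the README, and the three-field row layout is assumed from the description of article_text.csv.

```python
from collections import defaultdict


def long_to_bags(rows):
    """Turn long-format rows into {article_id: {word: count}}.

    rows: iterable of (article_id, word, count) tuples (assumed layout).
    Counts for a repeated (article_id, word) pair are summed.
    """
    bags = defaultdict(dict)
    for article_id, word, count in rows:
        bags[article_id][word] = bags[article_id].get(word, 0) + int(count)
    return dict(bags)


sample_rows = [
    (1, "corona", 3),
    (1, "vaccin", 1),
    (2, "corona", 2),
]
bags = long_to_bags(sample_rows)
print(bags)
```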
Coronavirus (Covid-19) Data in the United States
nytimes.com
openicpsr.org
Cite
New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
Explore at:
Dataset provided by
New York Times
Description
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
COVID-19 Worldwide Daily Data
kaggle.com
Updated Aug 28, 2020
Cite
Altadata (2020). COVID-19 Worldwide Daily Data [Dataset]. https://www.kaggle.com/altadata/covid19/code
Explore at:
Dataset updated
Aug 28, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Altadata
Description

ALTADATA is a curated data marketplace where our subscribers and our data partners can easily exchange ready-to-analyze datasets and create insights with EPO, our visual data analytics platform.

COVID-19 Worldwide Daily Data

Daily global COVID-19 data for all countries, provided by Johns Hopkins University (JHU) Center for Systems Science and Engineering (CSSE). If you want to use the updated version of the data, you can access our daily updated data through Altadata using an API key.

Overview

In this data product, you may find the latest and historical global daily data on the COVID-19 pandemic for all countries.

The COVID‑19 pandemic, also known as the coronavirus pandemic, is an ongoing global pandemic of coronavirus disease 2019 (COVID‑19), caused by severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2). The outbreak was first identified in December 2019 in Wuhan, China. The World Health Organization declared the outbreak a Public Health Emergency of International Concern on 30 January 2020 and a pandemic on 11 March. As of 12 August 2020, more than 20.2 million cases of COVID‑19 have been reported in more than 188 countries and territories, resulting in more than 741,000 deaths; more than 12.5 million people have recovered.

The Johns Hopkins Coronavirus Resource Center is a continuously updated source of COVID-19 data and expert guidance. They aggregate and analyze the best data available on COVID-19 - including cases, as well as testing, contact tracing and vaccine efforts - to help the public, policymakers and healthcare professionals worldwide respond to the pandemic.

Methodology

Cases and Death counts include confirmed and probable (where reported)

Recovered cases are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from COVID Tracking Project.

Active cases = total cases - total recovered - total deaths

Incidence Rate = cases per 100,000 persons

Case-Fatality Ratio (%) = Number of recorded deaths * 100 / Number of cases

Country Population represents 2019 projections by UN Population Division, integrated to the JHU CSSE's COVID-19 data by ALTADATA

Data Source

Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)

United Nations Population Division

Related Data Products

COVID-19 US Daily Data

OECD, EU28, G20 Life Expectancy and Mortality Indicators

Suggested Blog Posts

Bayesian Thinking During the Pandemic

Impact of COVID-19 on California Electricity Demand

Keep Calm and Look At The Fundamentals

Markets In The Corona Virus Crisis

Data Dictionary

Reported Date (reported_date) : Covid-19 Report Date

Country_Region (country_region) : Country, region or sovereignty name

Population (population) : Country populations as per United Nations Population Division

Confirmed Case (confirmed) : Confirmed cases include presumptive positive cases and probable cases

Active cases (active) : Active cases = total confirmed - total recovered - total deaths

Deaths (deaths) : Death cases counts

Recovered (recovered) : Recovered cases counts

Mortality Rate (mortality_rate) : Number of recorded deaths * 100 / Number of confirmed cases

Incident Rate (incident_rate) : Confirmed cases per 100,000 persons
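The derived fields in the data dictionary can be recomputed from the raw counts. The following is a minimal sketch using the formulas listed above; the function name and the zero-division handling are assumptions, not part of the data product.

```python
def derived_metrics(confirmed, recovered, deaths, population):
    """Recompute active, mortality_rate and incident_rate from raw counts.

    Formulas follow the data dictionary. Returning 0.0 when a denominator
    is zero is an assumption made for this sketch.
    """
    active = confirmed - recovered - deaths
    mortality_rate = deaths * 100 / confirmed if confirmed else 0.0
    incident_rate = confirmed * 100_000 / population if population else 0.0
    return {
        "active": active,
        "mortality_rate": mortality_rate,
        "incident_rate": incident_rate,
    }


metrics = derived_metrics(confirmed=1000, recovered=600, deaths=50, population=2_000_000)
print(metrics)
```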
Share of online fake news related to coronavirus (COVID-19) in Italy 2020
statista.com
Cite
Statista, Share of online fake news related to coronavirus (COVID-19) in Italy 2020 [Dataset]. https://www.statista.com/statistics/1109490/share-of-coronavirus-fake-news-italy/
Explore at:
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 2020 - May 2020
Area covered
Italy
Description
In May 2020, up to six percent of all online news and posts related to the coronavirus (COVID-19) and released in Italy were false or not accurate. The percentage was calculated on the average volume of posts and articles published by the Italian media outlets, including posts on social media. The peak in the release of fake news was registered in the early stage of the pandemic at the end of January 2020, with 7.3 percent of the coronavirus-related information.

For further information about the coronavirus (COVID-19) pandemic, please visit our dedicated Fact and Figures page.
free dataset from news/message boards/blogs about CoronaVirus (4 month of...
ieee-dataport.org
Updated Apr 7, 2020
Cite
Ran Geva (2020). free dataset from news/message boards/blogs about CoronaVirus (4 month of data - 5.2M posts) [Dataset]. https://ieee-dataport.org/open-access/free-dataset-newsmessage-boardsblogs-about-coronavirus-4-month-data-52m-posts
Explore at:
Dataset updated
Apr 7, 2020
Authors
Ran Geva
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Free dataset from news, message boards and blogs about CoronaVirus (4 months of data, 5.2M posts). The time frame of the data is Dec/2019 - March/2020. The posts are in English and mention at least one of the following: "Covid" OR CoronaVirus OR "Corona Virus".
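The keyword condition above can be approximated with a regular expression. This is only a sketch: case-insensitive substring matching is an assumption, and the provider's actual query engine may tokenize differently.

```python
import re

# Approximates: "Covid" OR CoronaVirus OR "Corona Virus"
# Case-insensitive matching is an assumption of this sketch.
MENTION_PATTERN = re.compile(r"covid|corona\s?virus", re.IGNORECASE)


def mentions_coronavirus(post_text):
    """Return True if the post mentions at least one of the query terms."""
    return MENTION_PATTERN.search(post_text) is not None


print(mentions_coronavirus("New COVID-19 cases reported"))
print(mentions_coronavirus("Corona beer sales fall"))
```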
Covid-19 latest news dataset
data.mendeley.com
Updated Oct 27, 2021
Cite
Rajat Thakur (2021). Covid-19 latest news dataset [Dataset]. http://doi.org/10.17632/8rbm7d874k.1
Explore at:
Unique identifier
https://doi.org/10.17632/8rbm7d874k.1
Dataset updated
Oct 27, 2021
Authors
Rajat Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Coronavirus disease 2019 (COVID19) time series that lists confirmed cases, reported deaths, and reported recoveries. Data is broken down by country (and sometimes by sub-region).

Coronavirus disease (COVID19) is caused by severe acute respiratory syndrome Coronavirus 2 (SARSCoV2) and has had an effect worldwide. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, currently indicating more than 118,000 cases of coronavirus disease in more than 110 countries and territories around the world.

This dataset contains the latest news related to Covid-19 and it was fetched with the help of Newsdata.io news API.
INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET
data.niaid.nih.gov
Updated Jul 19, 2024
Cite
Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Kishor Datta Gupta
Nishat Anjum
Nafiz Sadman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bangladesh, United States
Description
Introduction

Several works apply Natural Language Processing to newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al. [2] created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.

However, not much NLP research has been invested in studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight the virus and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NNK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.

2 Data-set Introduction

2.1 Data Collection

We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We named this dataset “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named this dataset “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in their respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferred over automation to ensure the news reports were highly relevant to the subject. The newspaper sites had dynamic content with advertisements in no particular order, so online scrapers were likely to collect inaccurate news reports. One of the challenges while collecting the data was the subscription requirement: each newspaper required $1 per subscription. The criteria provided as guidelines to the human data-collectors were as follows:

The headline must have one or more words directly or indirectly related to COVID-19.

The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

The genre of the news can be anything as long as it is relevant to the topic. Political, social and economic genres are to be prioritized.

Avoid taking duplicate reports.

Maintain a time frame for the above mentioned newspapers.
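The first two criteria above were applied by human collectors, but an automated pre-check might look like the sketch below. The keyword list is hypothetical (the authors' list is not given), and counting every keyword occurrence rather than distinct keywords is an assumption.

```python
import re

# Hypothetical keyword list; the collectors' actual list is not published here.
KEYWORDS = {"covid", "coronavirus", "pandemic", "quarantine",
            "lockdown", "virus", "outbreak"}


def count_keyword_hits(text):
    """Count keyword occurrences in text (each occurrence counts once)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(1 for token in tokens if token in KEYWORDS)


def meets_criteria(headline, body):
    """Headline needs at least 1 keyword hit; body needs at least 5."""
    return count_keyword_hits(headline) >= 1 and count_keyword_hits(body) >= 5


ok = meets_criteria(
    "Covid cases rise",
    "The virus outbreak led to a lockdown and quarantine during the pandemic",
)
print(ok)
```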

To collect these data we used a Google form for USA and BD. Two human editors went through each entry to check for spam or troll entries.

2.2 Data Pre-processing and Statistics

Some pre-processing steps performed on the newspaper report dataset are as follows:

Remove hyperlinks.

Remove non-English alphanumeric characters.

Remove stop words.

Lemmatize text.

While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any presence of the items listed above.
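The first three pre-processing steps can be sketched with the standard library alone. This is not the authors' script: the stop-word list is a small hypothetical sample, "non-English alphanumeric characters" is interpreted as anything outside ASCII letters and digits, and lemmatization is left out because it needs an external NLP library.

```python
import re

# Small hypothetical stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "on", "for"}


def preprocess(text):
    """Remove hyperlinks, non-ASCII-alphanumeric characters and stop words.

    Lemmatization is intentionally omitted from this sketch.
    """
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # hyperlinks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # other characters
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess(
    "The virus is spreading (新闻); see https://example.com/news for details!"
)
print(cleaned)
```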

The primary data statistics of the two datasets are shown in Tables 1 and 2.

Table 1: Covid-News-USA-NNK data statistics

No. of words per headline: 7 to 20

No. of words per body content: 150 to 2100

Table 2: Covid-News-BD-NNK data statistics

No. of words per headline: 10 to 20

No. of words per body content: 100 to 1500

2.3 Dataset Repository

We used GitHub as our primary data repository, under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
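Regenerating JSON from a CSV file, as described above, can be done with the standard library. The sketch below is not the repository's script; the column names in the sample are hypothetical.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Hypothetical columns; the real files may use different names.
sample_csv = "headline,body\nVirus spreads,Cases rose this week\n"
json_text = csv_to_json(sample_csv)
print(json_text)
```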

3 Literature Review

Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms designed solely for Natural Language Processing tasks. The two most prominent are BERT [14], which uses a bidirectional Transformer encoder architecture and performs very well on classification and prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are all pre-trained models, since training them carries a huge computation cost.

Information extraction is the general concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction from text could be identifying named entities and locations or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation, or LDA [17].

Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses a graph to calculate the weight of each word and selects the words with the highest weights.
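The graph-based weighting idea behind TextRank can be sketched in a few lines. This is a simplified, unweighted variant written for illustration, not the authors' implementation; the window size, damping factor and iteration count are assumed defaults.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by a PageRank-style score on a co-occurrence graph.

    Words that appear within `window` positions of each other are linked
    by an undirected, unweighted edge (a simplification of TextRank).
    """
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbours:
            # Each neighbour shares its score equally among its own links.
            rank = sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


tokens = ["virus", "china", "virus", "masks", "virus", "economy", "virus", "jobs"]
top = textrank_keywords(tokens, window=1, top_k=1)
print(top)
```

In this toy input, "virus" is linked to every other word, so it receives the highest score.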

Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

4 Our experiments and Result analysis

We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2 and 3, we can make a few observations:

In February, both newspapers talked about China and the source of the outbreak.

StarTribune emphasized Minnesota as the state of most concern. In April, it appeared even more concerned.

Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations and markets.

The Washington Post discussed global issues more than StarTribune.

In February, StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

While both newspapers mentioned the outbreak in China in February, the spread in the United States was highlighted more heavily from March through May, displaying the critical impact caused by the virus.

We used a script to extract all numbers related to certain keywords such as ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports and created a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, which rose gradually from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.

We used Vader Sentiment Analysis to extract the sentiment of the headlines and the body. On average, the sentiments ranged from -0.5 to -0.9. The Vader Sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic.

We used the PageRank algorithm to extract keywords from headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, ’Response’. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
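The keyword-linked number extraction mentioned in this description could be approximated as below. This is a hypothetical sketch, not the authors' script: it only finds a number that precedes a keyword by at most three words, and the keyword list is a subset of the one quoted in the text.

```python
import re

# A number, up to three intervening alphabetic words, then a keyword.
# The pattern and the keyword subset are assumptions of this sketch.
COUNT_PATTERN = re.compile(
    r"\b(\d[\d,]*)\s+(?:[A-Za-z]+\s+){0,3}?"
    r"(deaths|infected|died|infections|quarantined|diagnosed)\b",
    re.IGNORECASE,
)


def extract_counts(text):
    """Return a list of (keyword, number) pairs found in the text."""
    results = []
    for match in COUNT_PATTERN.finditer(text):
        number = int(match.group(1).replace(",", ""))
        results.append((match.group(2).lower(), number))
    return results


counts = extract_counts(
    "The state reported 1,200 new infections and 35 deaths on Monday."
)
print(counts)
```

A pattern like this produces false positives and misses many phrasings, so the resulting series is best read as a rough signal.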
COVID-19 Country Level Timeseries
kaggle.com
Updated Mar 29, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arpan Das (2020). COVID-19 Country Level Timeseries [Dataset]. https://www.kaggle.com/arpandas65/covid19-country-level-timeseries/metadata
Dataset updated
Mar 29, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Arpan Das
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

Amid the COVID-19 outbreak, the world is facing a great crisis in every way. The values and things we have built as a human race are being severely challenged. This is a small effort to provide a curated data set on the novel coronavirus to accelerate forecasting and analytical experiments that help cope with this critical situation. It will help to visualize the country-level outbreak and to keep track of regularly added new incidents.

COVID-19 Country Level Timeseries Dataset

This dataset contains country-wise, public-domain time series information on the COVID-19 outbreak. The data is sorted alphabetically by country name and by date of observation.

Column Descriptions

The data set contains the following columns:
ObservationDate: The date on which the incidents are observed
country: Country of the outbreak
Confirmed: Number of confirmed cases up to the observation date
Deaths: Number of deaths up to the observation date
Recovered: Number of recovered cases up to the observation date
New Confirmed: Number of new confirmed cases on the observation date
New Deaths: Number of new deaths on the observation date
New Recovered: Number of new recovered cases on the observation date
latitude: Latitude of the affected country
longitude: Longitude of the affected country
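
The relationship between the cumulative and the per-day columns can be sketched in Python as follows. This is only an illustration under stated assumptions: it treats each row as a dictionary keyed by the column names above, assumes `ObservationDate` is stored as an ISO `YYYY-MM-DD` string (the actual date format in the file is not stated here), and counts the first observed day of each country as entirely new cases. The helper name `add_new_confirmed` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def add_new_confirmed(rows):
    """Derive 'New Confirmed' from cumulative 'Confirmed' counts per country.

    Assumes ISO date strings, so sorting the strings sorts chronologically.
    """
    ordered = sorted(rows, key=itemgetter("country", "ObservationDate"))
    result = []
    for _, group in groupby(ordered, key=itemgetter("country")):
        previous = 0  # assumption: no cases before the first observed date
        for row in group:
            new_cases = row["Confirmed"] - previous
            result.append({**row, "New Confirmed": new_cases})
            previous = row["Confirmed"]
    return result
```

The same pattern would apply to `Deaths` and `Recovered`. Comparing the derived values against the published `New Confirmed` column is one simple consistency check on the data.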

Acknowledgements

This data set is a cleaner version of the https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset data set, with added geolocation information and regularly added incident counts. I would like to thank SRK for this great effort.

Original Data Source

Johns Hopkins University MoBS lab - https://www.mobs-lab.org/2019ncov.html
World Health Organization (WHO): https://www.who.int/
DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia.
BNO News: https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
National Health Commission of the People’s Republic of China (NHC): http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml
China CDC (CCDC): http://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm
Hong Kong Department of Health: https://www.chp.gov.hk/en/features/102465.html
Macau Government: https://www.ssm.gov.mo/portal/
Taiwan CDC: https://sites.google.com/cdc.gov.tw/2019ncov/taiwan?authuser=0
US CDC: https://www.cdc.gov/coronavirus/2019-ncov/index.html
Government of Canada: https://www.canada.ca/en/public-health/services/diseases/coronavirus.html
Australia Government Department of Health: https://www.health.gov.au/news/coronavirus-update-at-a-glance
European Centre for Disease Prevention and Control (ECDC): https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases
Ministry of Health Singapore (MOH): https://www.moh.gov.sg/covid-19
Italy Ministry of Health: http://www.salute.gov.it/nuovocoronavirus
Dutch covid news headlines sentiment analysis
figshare.com
xlsx
Updated Feb 10, 2022
Rein Wieringa (2022). Dutch covid news headlines sentiment analysis [Dataset]. http://doi.org/10.6084/m9.figshare.19154102.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.19154102.v1
Dataset updated
Feb 10, 2022
Dataset provided by
figshare
Authors
Rein Wieringa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the results of a sentiment analysis of the headlines of all search results for pieces containing [corona] OR [coronavirus] from 1 March 2020 to 30 November 2021, for five media outlets: NRC, Telegraaf, Volkskrant, NOS and Trouw. Headlines were removed from this data for copyright reasons. For the complete data set, contact me via e-mail (reinwieringa[AT]gmail[DOT]com).
Most trusted sources of coronavirus news U.S. 2020
statista.com
Updated Jul 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Most trusted sources of coronavirus news U.S. 2020 [Dataset]. https://www.statista.com/statistics/1104557/coronavirus-trusted-news-sources-by-us/
Dataset updated
Jul 10, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Mar 13, 2020 - Mar 16, 2020
Area covered
United States
Description
As the United States battles the coronavirus, news consumers across the country have been attempting to keep themselves updated on how the pandemic is progressing. A survey held in March 2020 revealed that the most trusted news source for details on COVID-19 was the CDC, with ** percent of respondents saying that they trusted the centers to provide accurate information on the topic. Following closely behind were the World Health Organization and then the state government, but just ** percent of consumers said that they trusted social media sites to publish reliable and accurate news about the coronavirus outbreak.
Covid19 Dataset (Worldwide cases 2019-20)
kaggle.com
Updated Dec 31, 2020
Vivekkumar Gediya (2020). Covid19 Dataset (Worldwide cases 2019-20) [Dataset]. https://www.kaggle.com/vivekgediya/covid19-case-worldwide-cases-till-30th-dec20/code
Dataset updated
Dec 31, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Vivekkumar Gediya
Description
Context

From World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people.

So daily level information on the affected people can give some interesting insights when it is made available to the broader data science community.

Johns Hopkins University has made an excellent dashboard using the affected cases data. Data is extracted from the associated Google Sheets and made available here.

Edited

The data is now available as CSV files in the Johns Hopkins GitHub repository. Please refer to the GitHub repository for the Terms of Use details. It is uploaded here for use in Kaggle kernels and to gather insights from the broader DS community.

Content

2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC

This dataset has daily-level information on the number of affected cases, deaths and recoveries from the 2019 novel coronavirus. Please note that this is time series data, so the number of cases on any given day is the cumulative number.

The data is available from 22 Jan, 2020 to 30 Dec, 2020.

Sources

JHU confirmed covid datasets.
Coronavirus news sources worldwide 2020, by age group
statista.com
Updated Jul 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Coronavirus news sources worldwide 2020, by age group [Dataset]. https://www.statista.com/statistics/1104381/coronavirus-news-sources-by-age-worldwide/
Dataset updated
Jul 9, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Mar 6, 2020 - Mar 10, 2020
Area covered
Worldwide
Description
According to a study conducted in March 2020, ** percent of adults worldwide aged between 18 and 35 years old were getting most of their information about the coronavirus pandemic via social media, compared to ** percent of those aged 55 or above. Major news organizations were overall a more popular source of information about COVID-19, but younger consumers were more evenly split in terms of which platforms they were using the most to keep themselves updated about the virus, whereas older adults were far more likely to turn to major news outlets.
