100+ datasets found
  1. Novel Covid-19 Dataset

    • kaggle.com
    Updated Sep 18, 2025
    Cite
    GHOST5612 (2025). Novel Covid-19 Dataset [Dataset]. https://www.kaggle.com/datasets/ghost5612/novel-covid-19-dataset
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    GHOST5612
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context:

    From World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people.

    Daily-level information on affected people can therefore yield useful insights when it is made available to the broader data science community.

    Johns Hopkins University has built an excellent dashboard from the case data. The data was extracted from the associated Google Sheets and made available here.

    Edited:

    The data is now available as CSV files in the Johns Hopkins GitHub repository. Please refer to that repository for the Terms of Use. It is uploaded here for use in Kaggle kernels and to gather insights from the broader data science community.

    Content

    2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC

    This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus. Note that this is time-series data, so the number of cases on any given day is the cumulative total up to that day.

    The data is available from 22 Jan, 2020.

    Dataset Description

    This dataset contains time-series and case-level records of the COVID-19 pandemic. The primary file is covid_19_data.csv, with supporting files for earlier records and individual-level line list data.

    Files and Columns

    1. covid_19_data.csv (Main File)

    This is the primary dataset and contains aggregated COVID-19 statistics by location and date.

    • Sno – Serial number of the record
    • ObservationDate – Date of the observation (MM/DD/YYYY)
    • Province/State – Province or state of the observation (may be missing for some entries)
    • Country/Region – Country of the observation
    • Last Update – Timestamp (UTC) when the record was last updated (not standardized, requires cleaning before use)
    • Confirmed – Cumulative number of confirmed cases on that date
    • Deaths – Cumulative number of deaths on that date
    • Recovered – Cumulative number of recoveries on that date
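Because Confirmed, Deaths, and Recovered are cumulative, daily figures have to be derived by differencing. The following is a minimal sketch of that step, assuming the column names listed above; the sample rows and their values are made up for illustration and do not come from the dataset.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; only the column names follow the list above.
SAMPLE = """Sno,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020,Hubei,Mainland China,1/22/2020 17:00,100,2,1
2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,10,0,0
3,01/23/2020,Hubei,Mainland China,1/23/20 17:00,150,3,2
4,01/23/2020,Beijing,Mainland China,1/23/20 17:00,15,0,0
"""


def daily_new_confirmed(csv_text):
    """Return {country: [(date, new_cases), ...]} from cumulative counts."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row["ObservationDate"], "%m/%d/%Y").date()
        # Sum provinces so each country has one cumulative value per date.
        totals[row["Country/Region"]][day] += int(float(row["Confirmed"]))

    result = {}
    for country, by_day in totals.items():
        previous = 0
        series = []
        for day in sorted(by_day):
            # The first entry equals the cumulative count on the first date.
            series.append((day, by_day[day] - previous))
            previous = by_day[day]
        result[country] = series
    return result


new_cases = daily_new_confirmed(SAMPLE)
```

The Last Update column is deliberately ignored here, since the description notes it is not standardized.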

    2. 2019_ncov_data.csv (Legacy File)

    This file contains earlier COVID-19 records. It is no longer updated and is provided only for historical reference. For current analysis, please use covid_19_data.csv.

    3. COVID_open_line_list_data.csv

    This file provides individual-level case information, obtained from an open data source. It includes patient demographics, travel history, and case outcomes.

    4. COVID19_line_list_data.csv

    Another individual-level case dataset, also obtained from public sources, with detailed patient-level information useful for micro-level epidemiological analysis.

    ✅ Use covid_19_data.csv for up-to-date aggregated global trends.

    ✅ Use the line list datasets for detailed, individual-level case analysis.

    Country level datasets:

    If you are interested in knowing country level data, please refer to the following Kaggle datasets:

    India - https://www.kaggle.com/sudalairajkumar/covid19-in-india

    South Korea - https://www.kaggle.com/kimjihoo/coronavirusdataset

    Italy - https://www.kaggle.com/sudalairajkumar/covid19-in-italy

    Brazil - https://www.kaggle.com/unanimad/corona-virus-brazil

    USA - https://www.kaggle.com/sudalairajkumar/covid19-in-usa

    Switzerland - https://www.kaggle.com/daenuprobst/covid19-cases-switzerland

    Indonesia - https://www.kaggle.com/ardisragen/indonesia-coronavirus-cases

    Acknowledgements:

    Johns Hopkins University for making the data available for educational and academic research purposes

    MoBS lab - https://www.mobs-lab.org/2019ncov.html

    World Health Organization (WHO): https://www.who.int/

    DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia.

    BNO News: https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/

    National Health Commission of the People’s Republic of China (NHC): http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml

    China CDC (CCDC): http://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm

    Hong Kong Department of Health: https://www.chp.gov.hk/en/features/102465.html

    Macau Government: https://www.ssm.gov.mo/portal/

    Taiwan CDC: https://sites.google....

  2. Country data on COVID-19

    • kaggle.com
    zip
    Updated Aug 6, 2023
    Cite
    Carla Oliveira (2023). Country data on COVID-19 [Dataset]. https://www.kaggle.com/datasets/carlaoliveira/country-data-on-covid19
    Available download formats: zip (8634707 bytes)
    Dataset updated
    Aug 6, 2023
    Authors
    Carla Oliveira
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The data is in CSV format and includes all historical data on the pandemic up to 03/01/2023, following a 1-line format per country and date.

    In pre-processing, the data were checked for missing values. For example, new_cases was missing where the total number of cases had not changed, and most of the remaining missing values related to vaccination, for which no data existed at the beginning of the pandemic. Missing values ("NaN") were therefore replaced with zero.

    Some features were combined to generate new ones. This process of creating new features from existing data, with the aim of improving the data before applying machine learning algorithms, is called feature engineering. The new features are:

    • Vaccination rate (vaccination_ratio): total number of people who received at least one dose of vaccine divided by the population at risk. This dose count was chosen because it has a higher correlation with new deaths.
    • Prevalence: existing cases of the disease at a given time divided by the population at risk of having the disease. Formula: COVID-19 cases ÷ Population at risk * 100. Example: 168,331 ÷ 210,000,000 * 100 = 0.08.
    • Incidence: new cases of the disease in a defined population during a specific period (one day, for example) divided by the population at risk. Formula: New COVID-19 cases in one day ÷ (Population - Total cases) * 100. Example: 5,632 ÷ 209,837,301 * 100 ≈ 0.0027.
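The engineered features can be written as small functions. This is only a sketch of the formulas as stated in the description, not the author's code. The function and parameter names are hypothetical, and the prior total of 162,699 cases is inferred from the denominator in the incidence example (210,000,000 minus 209,837,301).

```python
def vaccination_ratio(people_vaccinated, population_at_risk):
    """People with at least one dose divided by the population at risk."""
    return people_vaccinated / population_at_risk


def prevalence(total_cases, population_at_risk):
    """Existing cases as a percentage of the population at risk."""
    return total_cases / population_at_risk * 100


def incidence(new_cases, population, prior_total_cases):
    """New cases as a percentage of people not yet counted as cases."""
    at_risk = population - prior_total_cases
    return new_cases / at_risk * 100


# Figures from the description's worked examples.
prev = prevalence(168_331, 210_000_000)
inc = incidence(5_632, 210_000_000, 162_699)
print(round(prev, 2))  # 0.08
print(round(inc, 4))   # 0.0027
```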

  3. Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First...

    • data.mendeley.com
    Updated Jul 20, 2020
    Cite
    Hasmot Ali (2020). Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First Case And First Death [Dataset]. http://doi.org/10.17632/vw427wzzkk.5
    Dataset updated
    Jul 20, 2020
    Authors
    Hasmot Ali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contains data related to the COVID-19 pandemic, in particular the first case and first death in every country. The datasets focus on two major fields. The first, First Case, consists of the Date of First Case(s), Number of Confirmed Case(s) on the First Day, Age of the Patient(s) of the First Case, and Last Visited Country. The second, First Death, consists of the Date of First Death and the Age of the Patient who died first, for every country, along with the corresponding continent. The datasets also contain a binary matrix of the spread chain among different countries and regions.

    *This is not a country. This is a ship. The name of the cruise ship was not given by the government.
    "N+": the age is not specified but is greater than N
    “No Trace”: some data was not found
    “Unspecified”: not available from the authority
    “N/A”: in the “Last Visited Country(s) of Confirmed Case(s)” column, “N/A” indicates that the confirmed case(s) of those countries have no recent travel history; in the “Age of First Death(s)” column, “N/A” indicates that those countries had no deaths as of May 16, 2020.
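The legend above suggests that age cells need normalizing before analysis. Here is a rough sketch of one way to do that, assuming each cell holds a single value spelled exactly as in the legend; the function name is hypothetical, and cells listing several patients are not handled.

```python
def parse_age(raw):
    """Return (age, note) for one age cell.

    age is an int (exact value or lower bound) or None when unknown.
    """
    text = raw.strip()
    if text in {"No Trace", "Unspecified", "N/A"}:
        # Keep the legend marker so the reason for the gap is not lost.
        return None, text
    if text.endswith("+") and text[:-1].isdigit():
        # "N+": age not specified but greater than N.
        return int(text[:-1]), "greater than"
    if text.isdigit():
        return int(text), "exact"
    return None, "unparsed"


print(parse_age("60+"))
print(parse_age("N/A"))
```

Keeping the marker alongside the value preserves the distinction the legend draws between data that was not found and data the authority did not release.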

  4. COVID-19 Coronavirus data - weekly (from 17 December 2020)

    • data.europa.eu
    csv, excel xlsx, html +3
    Updated Dec 17, 2020
    Cite
    European Centre for Disease Prevention and Control (2020). COVID-19 Coronavirus data - weekly (from 17 December 2020) [Dataset]. https://data.europa.eu/data/datasets/covid-19-coronavirus-data-weekly-from-17-december-2020?locale=en
    Available download formats: html, csv, json, unknown, xml, excel xlsx
    Dataset updated
    Dec 17, 2020
    Dataset authored and provided by
    European Centre for Disease Prevention and Control
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains a weekly situation update on COVID-19, the epidemiological curve and the global geographical distribution (EU/EEA and the UK, worldwide).

    Since the beginning of the coronavirus pandemic, ECDC’s Epidemic Intelligence team has collected the number of COVID-19 cases and deaths, based on reports from health authorities worldwide. This comprehensive and systematic process was carried out on a daily basis until 14/12/2020. See the discontinued daily dataset: COVID-19 Coronavirus data - daily. ECDC’s decision to discontinue daily data collection is based on the fact that the daily number of cases reported or published by countries is frequently subject to retrospective corrections, delays in reporting and/or clustered reporting of data for several days. Therefore, the daily number of cases may not reflect the true number of cases at EU/EEA level on a given day of reporting. Consequently, day-to-day variations in the number of cases do not constitute a valid basis for policy decisions.

    ECDC continues to monitor the situation. Every week between Monday and Wednesday, a team of epidemiologists screen up to 500 relevant sources to collect the latest figures for publication on Thursday. The data screening is followed by ECDC’s standard epidemic intelligence process for which every single data entry is validated and documented in an ECDC database. An extract of this database, complete with up-to-date figures and data visualisations, is then shared on the ECDC website, ensuring a maximum level of transparency.

    ECDC receives regular updates from EU/EEA countries through the Early Warning and Response System (EWRS), The European Surveillance System (TESSy), the World Health Organization (WHO) and email exchanges with other international stakeholders. This information is complemented by screening up to 500 sources every day to collect COVID-19 figures from 196 countries. This includes websites of ministries of health (43% of the total number of sources), websites of public health institutes (9%), websites from other national authorities (ministries of social services and welfare, governments, prime minister cabinets, cabinets of ministries, websites on health statistics and official response teams) (6%), WHO websites and WHO situation reports (2%), and official dashboards and interactive maps from national and international institutions (10%). In addition, ECDC screens social media accounts maintained by national authorities on for example Twitter, Facebook, YouTube or Telegram accounts run by ministries of health (28%) and other official sources (e.g. official media outlets) (2%). Several media and social media sources are screened to gather additional information which can be validated with the official sources previously mentioned. Only cases and deaths reported by the national and regional competent authorities from the countries and territories listed are aggregated in our database.

    Disclaimer: National updates are published at different times and in different time zones. This, and the time ECDC needs to process these data, might lead to discrepancies between the national numbers and the numbers published by ECDC. Users are advised to use all data with caution and awareness of their limitations. Data are subject to retrospective corrections; corrected datasets are released as soon as processing of updated national data has been completed.

    If you reuse or enrich this dataset, please share it with us.

  5. The World Dataset of COVID-19

    • kaggle.com
    zip
    Updated May 25, 2021
    Cite
    C-3PO (2021). The World Dataset of COVID-19 [Dataset]. https://www.kaggle.com/aditeloo/the-world-dataset-of-covid19
    Available download formats: zip (24211978 bytes)
    Dataset updated
    May 25, 2021
    Authors
    C-3PO
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    World
    Description

    Context

    These datasets are from Our World in Data. Their complete COVID-19 dataset is a collection of the COVID-19 data maintained by Our World in Data. It is updated daily and includes data on confirmed cases, deaths, hospitalizations, testing, and vaccinations as well as other variables of potential interest.

    Content

    Confirmed cases and deaths:

    our data comes from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). We discuss how and when JHU collects and publishes this data. The cases & deaths dataset is updated daily. Note: the number of cases or deaths reported by any institution—including JHU, the WHO, the ECDC, and others—on a given day does not necessarily represent the actual number on that date. This is because of the long reporting chain that exists between a new case/death and its inclusion in statistics. This also means that negative values in cases and deaths can sometimes appear when a country corrects historical data because it had previously overestimated the number of cases/deaths. Alternatively, large changes can sometimes (although rarely) be made to a country's entire time series if JHU decides (and has access to the necessary data) to correct values retrospectively.

    Hospitalizations and intensive care unit (ICU) admissions:

    our data comes from the European Centre for Disease Prevention and Control (ECDC) for a select number of European countries; the government of the United Kingdom; the Department of Health & Human Services for the United States; the COVID-19 Tracker for Canada. Unfortunately, we are unable to provide data on hospitalizations for other countries: there is currently no global, aggregated database on COVID-19 hospitalization, and our team at Our World in Data does not have the capacity to build such a dataset.

    Testing for COVID-19:

    this data is collected by the Our World in Data team from official reports; you can find further details in our post on COVID-19 testing, including our checklist of questions to understand testing data, information on geographical and temporal coverage, and detailed country-by-country source information. The testing dataset is updated around twice a week.

    Acknowledgements

    Our World in Data GitHub repository for covid-19.

    Inspiration

    We all love data because we love to dig into it and discover the truth; that is my main inspiration.

  6. CORONAVIRUS DEATHS by Country Dataset

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Mar 4, 2020
    Cite
    TRADING ECONOMICS (2020). CORONAVIRUS DEATHS by Country Dataset [Dataset]. https://tradingeconomics.com/country-list/coronavirus-deaths
    Available download formats: csv, excel, xml, json
    Dataset updated
    Mar 4, 2020
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025
    Area covered
    World
    Description

    This dataset provides values for CORONAVIRUS DEATHS reported in several countries. The data includes current values, previous releases, historical highs and record lows, release frequency, reported unit and currency.

  7. Dataset 1: Bilateral Travel Restriction Database v1.0

    • borealisdata.ca
    • dataone.org
    Updated Mar 16, 2023
    Cite
    The Global Strategy Lab (2023). Dataset 1: Bilateral Travel Restriction Database v1.0 [Dataset]. http://doi.org/10.5683/SP2/5E4OA8
    Dataset updated
    Mar 16, 2023
    Dataset provided by
    Borealis
    Authors
    The Global Strategy Lab
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Earlier this year, Dr. Hoffman and Dr. Fafard published a book chapter on the efficacy and legality of border closures enacted by governments in response to changing COVID-19 conditions. The authors concluded that border closures are, at best, regarded as powerful symbolic acts taken by governments to show they are acting forcefully, even if the actions lack an epidemiological impact and breach international law. This COVID-19 travel restriction project was developed out of a necessity and desire to further examine the empirical implications of border closures.

    The current dataset contains bilateral travel restriction information on the status of 179 countries between 1 January 2020 and 8 June 2020. The data was extracted from the ‘international controls’ column of the Oxford COVID-19 Government Response Tracker (OxCGRT), which outlines a country’s change in border control status as a response to COVID-19 conditions. Accompanying source links were further verified through random selection and comparison with external news sources. Greater weight is given to official national government sources, then to provincial and municipal news-affiliated agencies.

    The database is presented in matrix form for each country-pair and date. Each cell is represented by a datum X_dmn, which indicates the border closure status on date d by country m on country n. The coding is as follows: no border closure (code = 0), targeted border closure (= 1), and a total border closure (= 99). The dataset provides further details in the ‘notes’ column if the type of closure is a modified form of a targeted closure, either as a land or port closure, flight or visa suspension, or a re-opening of borders to select countries. Visa suspensions and closure of land borders were coded separately as de facto border closures and analyzed as targeted border closures in quantitative analyses.

    The file titled ‘BTR Supplementary Information’ covers supplemental details to the database. The tabs cover the following: 1) Codebook: variable name, format, source links, and description; 2) Sources, Access dates: dates of access for the individual source links with additional notes; 3) Country groups: breakdown of EEA, EU, SADC, Schengen groups with source links; 4) Newly added sources: for missing countries with a population greater than 1 million (meeting the inclusion criteria), relevant news sources were added for analysis; 5) Corrections: external news sources correcting for errors in the coding of international controls retrieved from the OxCGRT dataset.

    At the time of our study inception, there was no existing dataset which recorded the bilateral decisions of travel restrictions between countries. We hope this dataset will be useful in the study of the impact of border closures in the COVID-19 pandemic and widen the capabilities of studying border closures on a global scale, due to its interconnected nature and impact, rather than being limited in analysis to a single country or region only.

    Statement of contributions: Data entry and verification was performed mainly by GL, with assistance from MJP and RN. MP and IW provided further data verification on the nine countries purposively selected for the exploratory analysis of political decision-making.
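The matrix coding described above (status on date d by country m on country n, with codes 0, 1, and 99) might be held in memory as follows. This is an illustrative sketch only: the country identifiers "AAA" and "BBB", the sample entries, and the function names are hypothetical, and the real file layout is documented in the dataset's codebook.

```python
from datetime import date

# Code meanings as given in the description.
CODES = {
    0: "no border closure",
    1: "targeted border closure",
    99: "total border closure",
}

# Hypothetical entries keyed by (date d, imposing country m, target country n).
restrictions = {
    (date(2020, 3, 1), "AAA", "BBB"): 1,
    (date(2020, 3, 1), "BBB", "AAA"): 0,
    (date(2020, 3, 20), "AAA", "BBB"): 99,
}


def closure_status(d, m, n):
    """Describe the restriction country m applies to country n on date d."""
    code = restrictions.get((d, m, n))
    return CODES.get(code, "not recorded")


def countries_closed_to(d, n):
    """List countries with any closure (code other than 0) against n on date d."""
    return sorted(
        m
        for (day, m, target), code in restrictions.items()
        if day == d and target == n and code != 0
    )


print(closure_status(date(2020, 3, 20), "AAA", "BBB"))
```

Note that the relation is directional: an entry for (m, n) says nothing about what n imposes on m.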

  8. Covid-19 Country Level Social Science Dataset

    • dataverse.no
    • dataverse.azure.uit.no
    application/dbf +10
    Updated Oct 20, 2020
    Cite
    Øystein Solvang; Kari Elida Eriksen; Jonas Stein; Camilla Brattland (2020). Covid-19 Country Level Social Science Dataset [Dataset]. http://doi.org/10.18710/VMUP44
    Available download formats: type/x-r-syntax(11257), csv(36577), application/prj(146), type/x-r-syntax(12007), application/shx(2140), application/dbf(323441), bin(5), txt(9844), application/sbn(2796), application/prj(145), application/shp(8800376), type/x-r-syntax(4038), application/sbx(349), pdf(189956), bin(6), csv(41050), pdf(138533), application/dbf(10298), application/sbx(348), pdf(339251)
    Dataset updated
    Oct 20, 2020
    Dataset provided by
    DataverseNO
    Authors
    Øystein Solvang; Kari Elida Eriksen; Jonas Stein; Camilla Brattland
    License

    https://dataverse.no/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.18710/VMUP44

    Time period covered
    Jan 1, 2020 - Jul 15, 2020
    Area covered
    Covers 199 countries
    Description

    The dataset is a cross-sectional dataset covering social and public health data pertaining to the Covid-19 outbreak in 199 countries. The dataset was compiled from public registers and other openly available sources. Data on Covid-19 cases and related fatalities is current as of mid-July 2020. Data on other variables is mainly from the last three years, depending on data availability. Standardized unique unit identifiers (ISO-3166-1 Alpha-3) are included, enabling merging with other data. The dataset was assembled concurrently with a similar one on the Norwegian municipal level, as part of the project «Ressurs for studentaktiv læring i undervisning i statistisk og romlig analyse for samfunnsfag», at the Department of Social Science and The Norwegian College of Fishery Science, UiT.

  9. Coronavirus COVID-19 Global Cases by the Center for Systems Science and...

    • github.com
    • systems.jhu.edu
    Cite
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Dataset]. https://github.com/CSSEGISandData/COVID-19
    Dataset provided by
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
    Area covered
    Global
    Description

    2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
    https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

    • Confirmed Cases by Country/Region/Sovereignty
    • Confirmed Cases by Province/State/Dependency
    • Deaths
    • Recovered

    Downloadable data:
    https://github.com/CSSEGISandData/COVID-19

    Additional Information about the Visual Dashboard:
    https://systems.jhu.edu/research/public-health/ncov

  10. Spatiotemporal data for 2019-Novel Coronavirus Covid-19 Cases and deaths

    • data.amerigeoss.org
    csv, pdf, txt
    Updated Jan 4, 2022
    Cite
    UN Humanitarian Data Exchange (2022). Spatiotemporal data for 2019-Novel Coronavirus Covid-19 Cases and deaths [Dataset]. https://data.amerigeoss.org/it/dataset/2019-novel-coronavirus-cases
    Available download formats: txt(23645), csv(4916), pdf(15032), txt(7422), csv(795112664)
    Dataset updated
    Jan 4, 2022
    Dataset provided by
    UN Humanitarian Data Exchange
    Description

    Data Overview

    This repository contains spatiotemporal data from many official sources for 2019-Novel Coronavirus beginning 2019 in Hubei, China ("nCoV_2019")

    You may not use this data for commercial purposes. If there is a need for commercial use of the data, please contact Metabiota at info@metabiota.com to obtain a commercial use license.

    The incidence data are in a CSV file format. One row in an incidence file contains a piece of epidemiological data extracted from the specified source.

    The file contains data from multiple sources at multiple spatial resolutions in cumulative and non-cumulative formats by confirmation status. To select a single time series of case or death data, filter the incidence dataset by source, spatial resolution, location, confirmation status, and cumulative flag.

    Data are collected, structured, and validated by Metabiota’s digital surveillance experts. The data structuring process is designed to produce the most reliable estimates of reported cases and deaths over space and time. The data are cleaned and provided in a uniform format such that information can be compared across multiple sources. Data are collected at the time of publication in the highest geographic and temporal resolutions available in the original report.

    This repository is intended to provide a single access point for data from a wide range of data sources. Data will be updated periodically with the latest epidemiological data. Metabiota maintains a database of epidemiological information for over two thousand high-priority infectious disease events. Please contact us (info@metabiota.com) if you are interested in licensing the complete dataset.

    Cumulative vs. Non-Cumulative Incidence

    Reporting sources provide either cumulative incidence, non-cumulative incidence, or both. If the source only provides a non-cumulative incidence value, the cumulative values are inferred using prior reports from the same source. Use the CUMULATIVE FLAG variable to subset the data to cumulative (TRUE) or non-cumulative (FALSE) values.
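The relationship between the two formats is a running sum. The snippet below sketches only that basic idea with hypothetical counts; the description does not spell out Metabiota's actual inference procedure, so this should not be read as their method.

```python
from itertools import accumulate

# Hypothetical non-cumulative (per-report) counts from a single source.
non_cumulative = [3, 0, 5, 2]

# Cumulative values are the running total of the earlier reports.
cumulative = list(accumulate(non_cumulative))

# Going the other way: difference consecutive cumulative values.
recovered = [
    current - previous
    for previous, current in zip([0] + cumulative[:-1], cumulative)
]

print(cumulative)
```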

    Case Confirmation Status

    The incidence datasets include the confirmation status of cases and deaths when this information is provided by the reporting source. Subset the data by the CONFIRMATION_STATUS variable to either TOTAL, CONFIRMED, SUSPECTED, or PROBABLE to obtain the data of your choice.

    Total incidence values include confirmed, suspected, and probable incidence values. If a source only provides suspected, probable, or confirmed incidence, the total incidence is inferred to be the sum of the provided values. If the report does not specify confirmation status, the value is included in the "total" confirmation status value.

    The data provided under the "Metabiota Composite Source" often does not include suspected incidence due to inconsistencies in reporting cases and deaths with this confirmation status.

    Outcome - Cases vs. Deaths

    The incidence datasets include cases and deaths. Subset the data to either CASE or DEATH using the OUTCOME variable. It should be noted that deaths are included in case counts.

    Spatial Resolution

    Data are provided at multiple spatial resolutions. Data should be subset to a single spatial resolution of interest using the SPATIAL_RESOLUTION variable.

    Information is included at the finest spatial resolution provided in the original epidemic report. We also aggregate incidence to coarser geographic resolutions. For example, if a source only provides data at the province level, then province-level data are included in the dataset as well as country-level totals. Users should avoid summing all cases or deaths in a given country for a given date without specifying the SPATIAL_RESOLUTION value. For example, subset the data to SPATIAL_RESOLUTION equal to “AL0” in order to view only the aggregated country-level data.

    There are differences in administrative division naming practices by country. Administrative levels in this dataset are defined using the Google Geolocation API (https://developers.google.com/maps/documentation/geolocation/). For example, the data for the 2019-nCoV from one source provides information for the city of Beijing, which Google Geolocations indicates is a “locality.” Beijing is also the name of the municipality where the city Beijing is located. Thus, the 2019-nCoV dataset includes rows of data for both the city Beijing, as well as the municipality of the same name. If additional cities in the Beijing municipality reported data, those data would be aggregated with the city Beijing data to form the municipality Beijing data.

    Sources

    Data sources in this repository were selected to provide comprehensive spatiotemporal data for each outbreak. Data from a specific source can be selected using the SOURCE variable.

    In addition to the original reporting sources, Metabiota compiles multiple sources to generate the most comprehensive view of an outbreak. This compilation is stored in the database under the source name “Metabiota Composite Source.” The purpose of generating this new view of the outbreak is to provide the most accurate and precise spatiotemporal data for the outbreak. At this time, Metabiota does not incorporate unofficial - including media - sources into the “Metabiota Composite Source” dataset.

    Quality Assurance

    Data are collected by a team of digital surveillance experts and undergo many quality assurance tests. After data are collected, they are independently verified by at least one additional analyst. The data also pass an automated validation program to ensure data consistency and integrity.

    NonCommercial Use License

    • Creative Commons License Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

    • This is a human-readable summary of the Legal Code.

    • You are free:

      to Share — to copy, distribute and transmit the work

      to Remix — to adapt the work

    • Under the following conditions:

      Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

      Noncommercial — You may not use this work for commercial purposes.

      Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.

    • With the understanding that:

      Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.

      Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.

      Other Rights — In no way are any of the following rights affected by the license: Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations; The author's moral rights; Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

      Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.

    For details and the full license text, see http://creativecommons.org/licenses/by-nc-sa/3.0/

    Liability

    Metabiota shall in no event be liable for any decision taken by the user based on the data made available. Under no circumstances, shall Metabiota be liable for any damages (whatsoever) arising out of the use or inability to use the database. The entire risk arising out of the use of the database remains with the user.

  11. COVID-19 Government Measures Dataset

    • data.amerigeoss.org
    pdf, xlsx
    Updated Jul 15, 2021
    + more versions
    Cite
    UN Humanitarian Data Exchange (2021). COVID-19 Government Measures Dataset [Dataset]. https://data.amerigeoss.org/ar/dataset/acaps-covid19-government-measures-dataset
    Explore at:
    Available download formats: pdf (222717), xlsx (3988532)
    Dataset updated
    Jul 15, 2021
    Dataset provided by
    UN Humanitarian Data Exchange
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The #COVID19 Government Measures Dataset puts together all the measures implemented by governments worldwide in response to the Coronavirus pandemic. Data collection includes secondary data review. The researched information available falls into five categories:

    • Social distancing
    • Movement restrictions
    • Public health measures
    • Social and economic measures
    • Lockdowns

    Each category is broken down into several types of measures.

    ACAPS consulted sources from governments, media, the United Nations, and other organisations.

    For any comments, please contact us at info@acaps.org

    Please note that some measures, together with non-compliance policies, may not be recorded, and the exact date of implementation may not be accurate in some cases, because the primary data sources we used report in different ways.

  12. COVID-19 mortality correlation with cloudiness, sunlight, latitude in...

    • data.niaid.nih.gov
    Updated Jul 16, 2024
    Cite
    Iftime Adrian; Omer Secil; Burcea Victor (2024). COVID-19 mortality correlation with cloudiness, sunlight, latitude in European countries [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4266757
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    University of Medicine and Pharmacy "Carol Davila", Romania
    University of Medicine and Pharmacy "Carol Davila", Biophysics Department, Romania
    Authors
    Iftime Adrian; Omer Secil; Burcea Victor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Europe
    Description

    "COVID-19 mortality correlation with cloudiness, sunlight, latitude in European countries"

    Dataset for preprint titled "COVID-19 mortality: positive correlation with cloudiness but no correlation with sunlight and latitude in Europe" https://doi.org/10.1101/2021.01.27.21250658

    by SECIL OMER, ADRIAN IFTIME, VICTOR BURCEA

    Corresponding author: A. Iftime, University of Medicine and Pharmacy "Carol Davila", Biophysics Department, 8 Blvd. Eroii Sanitari, 050474 Bucharest, Romania. Email address: adrian.iftime [at] umfcd.ro.

    ===========

    Dataset file: 2.0.0.COVID-19_Mortality_Cloudiness_Insolation_EUROPE_March_December_2020.csv

    Dataset graphical preview: 2.0.0.INFOGRAPHIC_CloudFraction_vs_COVID-19_mortality_Europe_March-December_2020.png

    DATASET: 444 rows (records), with the following fields:

    "Country" : Country name; 37 European countries included.

    "Date": Date stamp at the collection time. Data collection was performed in the last day of every month. Date format: YYYY-MM-DD

    "Month_Key" : Date stamp at the collection time, formatted for easier monthly time series analysis. Date format: YYYY-MM

    "Month_Fct2020" : Date stamp at the collection time, formatted for easier graphing, as a string with names of the months (in English).

    "Deaths_per_1Mpop" : Monthly mortality from COVID-19 reported in the country, expressed as the number of COVID-19 deaths per 1 million population of the country, in that particular month / country. NB: it is reported per million population, not per million patients.

    "LogDeaths_per_1Mpop" : Log10 transformation of "Deaths_per_1Mpop"

    "Insolation_Average" : Insolation average (solar irradiance at ground level), in that particular month / country. It is expressed in Watt / square meter of the ground surface. Data derived from data available at NASA Langley Research Center, NASA’s Earth Observatory, CERES / FLASHFlux team, 2020, https://neo.gsfc.nasa.gov/view.php?datasetId=CERES_INSOL_M (old link: https://neo.sci.gsfc.nasa.gov/view.php?datasetId=CERES_INSOL_M )

    "Cloud_Fraction" : Cloudiness (also known as cloud fraction, cloud cover, cloud amount or sky cover), as decimal fraction of the sky obscured by clouds, in that particular month / country. Data derived from NASA Goddard Space Flight Center, NASA’s Earth Observatory, MODIS Atmosphere Science Team, 2020, https://neo.gsfc.nasa.gov/view.php?datasetId=MODAL2_M_CLD_FR (old link: https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MODAL2_M_CLD_FR )

    "CENTR_latitude" and "CENTR_longitude" : Latitude and Longitude of the country centroid, for each country. Data derived from Google LLC, "Dataset publishing language: country centroids", https://developers.google.com/public-data/docs/canonical/countries_csv
    NOTE: This is identical in every month (obviously); it is redundantly included for easier monthly sectional analysis of the data.
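    As a rough illustration of how the derived fields relate to the base fields, the sketch below rebuilds a "Month_Key" value from a "Date" value and a "LogDeaths_per_1Mpop" value from a "Deaths_per_1Mpop" value. The helper names and sample values are hypothetical, and returning None for a zero mortality value is an assumption; the description does not say how the authors handle that case.

```python
import math
from datetime import date


def month_key(collection_date):
    """Format a collection date (YYYY-MM-DD) as a YYYY-MM key."""
    return collection_date.strftime("%Y-%m")


def log_deaths_per_1mpop(deaths_per_1mpop):
    """Log10 transform; None for non-positive input (assumed handling, not from the source)."""
    if deaths_per_1mpop <= 0:
        return None
    return math.log10(deaths_per_1mpop)


# Hypothetical record: data collected on the last day of March 2020.
record_date = date.fromisoformat("2020-03-31")
key = month_key(record_date)
log_value = log_deaths_per_1mpop(100.0)
```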

    ===========

    Versioning of the dataset:
    MAJOR: changes yearly; 1 = 2020
    MINOR: changes if new monthly data is added in that particular year.
    PATCH: changes only if errors or minor edits were performed.

    ===========

    CHANGELOG:

    Version 2.0.0.COVID-19_Mortality_Cloudiness_Insolation_EUROPE_March_December_2020.csv
    - CERES/FLASHFLUX data for August-December 2020 became available at new links at nasa.gov
    - These data were gathered, analyzed and introduced in this dataset (2.0.0).
    - updated links for CERES/FLASHFLUX and MODIS dataset
    - added DOI link for preprint
    - minor edits on text.
    - Dataset file source for this version (internal analysis source file): db_covid_all-ANALYSIS.2020-all-year_versiunea18d.csv

    Version 1.0.0.COVID-19_Mortality_Cloudiness_Insolation_EUROPE_March_August_2020.csv
    - First version
    - Dataset file source for this version (internal analysis source file): db_covid_all-ANALYSIS.2020-09-22_r10.csv

  13. Coronavirus Panoply.io for Database Warehousing and Post Analysis using...

    • data.mendeley.com
    Updated Feb 4, 2020
    + more versions
    Cite
    Pranav Pandya (2020). Coronavirus Panoply.io for Database Warehousing and Post Analysis using Sequal Language (SQL) [Dataset]. http://doi.org/10.17632/4gphfg5tgs.2
    Explore at:
    Dataset updated
    Feb 4, 2020
    Authors
    Pranav Pandya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Solving database-related problems with SQL has never been easier, and the following gives you an opportunity to understand how I was able to figure out some of the interline relationships between databases using the Panoply.io tool.

    I was able to insert the coronavirus dataset and create a submittable, reusable result. I hope it helps you work in a Data Warehouse environment.

    The following is a list of SQL commands performed on the dataset attached below, with the final output stored in the Exports folder.

    Query 1 SELECT "Province/State" As "Region", Deaths, Recovered, Confirmed FROM "public"."coronavirus_updated" WHERE Recovered>(Deaths/2) AND Deaths>0

    Description: How can we estimate where Coronavirus has infiltrated but there is effective recovery amongst patients? We can view those places by selecting rows where the Recovered count exceeds half the Death Toll.

    Query 2 SELECT country, sum(confirmed) as "Confirmed Count", sum(Recovered) as "Recovered Count", sum(Deaths) as "Death Toll" FROM "public"."coronavirus_updated" WHERE Recovered>(Deaths/2) AND Confirmed>0 GROUP BY country

    Description: Coronavirus Epidemic has infiltrated multiple countries, and the only way to be safe is by knowing the countries which have confirmed Coronavirus Cases. So here is a list of those countries

    Query 3 SELECT country as "Countries where Coronavirus has reached" FROM "public"."coronavirus_updated" WHERE confirmed>0 GROUP BY country

    Description: Coronavirus Epidemic has infiltrated multiple countries, and the only way to be safe is by knowing the countries which have confirmed Coronavirus Cases. So here is a list of those countries.

    Query 4 SELECT country, sum(suspected) as "Suspected Cases under potential CoronaVirus outbreak" FROM "public"."coronavirus_updated" WHERE suspected>0 AND deaths=0 AND confirmed=0 GROUP BY country ORDER BY sum(suspected) DESC

    Description: Coronavirus is spreading at an alarming rate. Knowing which countries are newly getting the virus is important because, if timely measures are taken in these countries, casualties could be prevented. Here is a list of suspected cases with no virus-related deaths.

    Query 5 SELECT country, sum(suspected) as "Coronavirus uncontrolled spread count and human life loss", 100*sum(suspected)/(SELECT sum((suspected)) FROM "public"."coronavirus_updated") as "Global suspected Exposure of Coronavirus in percentage" FROM "public"."coronavirus_updated" WHERE suspected>0 AND deaths=0 GROUP BY country ORDER BY sum(suspected) DESC

    Description: Coronavirus is getting stronger in particular countries, but how will we measure that? We can measure it by knowing the percentage of suspected patients amongst countries which still don’t have any Coronavirus related deaths. The following is a list.
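    To try a query of this shape without a warehouse account, the sketch below runs the filter and grouping from Query 4 against an in-memory SQLite table through Python's sqlite3 module. SQLite stands in for Panoply.io here, so the "public" schema prefix is dropped; the column types, the shortened column alias and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the warehouse table; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coronavirus_updated ("
    "country TEXT, confirmed INTEGER, deaths INTEGER, "
    "recovered INTEGER, suspected INTEGER)"
)
conn.executemany(
    "INSERT INTO coronavirus_updated VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 0, 0, 0, 5),
        ("A", 0, 0, 0, 7),
        ("B", 3, 0, 1, 4),   # excluded: has confirmed cases
        ("C", 0, 0, 0, 20),
    ],
)

# Same WHERE / GROUP BY / ORDER BY logic as Query 4, without the schema prefix.
query4 = """
SELECT country, sum(suspected) AS suspected_total
FROM coronavirus_updated
WHERE suspected > 0 AND deaths = 0 AND confirmed = 0
GROUP BY country
ORDER BY sum(suspected) DESC
"""
result = conn.execute(query4).fetchall()
conn.close()
```

    With these sample rows, country C is listed first with 20 suspected cases, followed by country A with 12.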

    Data Provided by: SRK, Data Scientist at H2O.ai, Chennai, India

  14. [Archived] COVID-19 cases in Pacific Island Countries and Territories

    • pacificdata.org
    • pacific-data.sprep.org
    csv
    Updated May 22, 2025
    Cite
    SPC (2025). [Archived] COVID-19 cases in Pacific Island Countries and Territories [Dataset]. https://pacificdata.org/data/dataset/archived-covid-19-cases-in-pacific-island-countries-and-territories-df-covid
    Explore at:
    Available download formats: csv
    Dataset updated
    May 22, 2025
    Dataset provided by
    SPC
    Time period covered
    Jan 1, 2020 - May 31, 2024
    Description

    Disclaimer: As of January 2025, SPC will no longer provide updated information on COVID-19 cases and deaths. The information presented on this page is for reference only. For current epidemic and emerging disease alerts in the Pacific region, please visit: https://www.spc.int/epidemics/

    Statistics from SPC's Public Health Division (PHD) on the number of cases of COVID-19 and the number of deaths attributed to COVID-19 in Pacific Island Countries and Territories.

    Find more Pacific data on PDH.stat.

  15. World Coronavirus COVID-19 Cases

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Mar 9, 2020
    + more versions
    Cite
    TRADING ECONOMICS (2020). World Coronavirus COVID-19 Cases [Dataset]. https://tradingeconomics.com/world/coronavirus-cases
    Explore at:
    Available download formats: csv, excel, xml, json
    Dataset updated
    Mar 9, 2020
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 4, 2020 - May 17, 2023
    Area covered
    World
    Description

    The World Health Organization reported 766440796 Coronavirus Cases since the epidemic began. In addition, countries reported 6932591 Coronavirus Deaths. This dataset provides - World Coronavirus Cases- actual values, historical data, forecast, chart, statistics, economic calendar and news.

  16. INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Silicon Orchard Lab, Bangladesh
    University of Memphis, USA
    Independent University, Bangladesh
    Authors
    Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh, United States
    Description

    Introduction

    There are several works based on Natural Language Processing on newspaper reports. Rameshbhai et al. [ 1 ] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [ 2 ], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumit et al. in [ 3 ] have implemented LDA, a topic modeling approach, to study bias present in online news media.

    However, there is not much NLP research invested in studying COVID-19. Most applications include classification of chest X-rays and CT-scans to detect the presence of pneumonia in lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus[ 5 ][ 6 ][ 7 ] and replicating its structure to fight the virus and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [ 8 ] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [ 9 ]. To the best of our knowledge, the NNK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.

    2 Data-set Introduction

    2.1 Data Collection

    We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named the collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named the collection “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in their respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferred over automation to ensure the news reports were highly relevant to the subject. The newspaper websites had dynamic content with advertisements in no particular order, so online scrapers were likely to collect inaccurate news reports. One of the challenges while collecting the data was the subscription requirement: each newspaper required $1 per subscription. Some criteria provided as guidelines to the human data-collectors were as follows:

    The headline must have one or more words directly or indirectly related to COVID-19.

    The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

    The genre of the news can be anything as long as it is relevant to the topic. Political, social, economical genres are to be more prioritized.

    Avoid taking duplicate reports.

    Maintain a time frame for the above mentioned newspapers.

    To collect these data we used a Google Form for USA and BD. Two human editors went through each entry to check for spam or troll entries.

    2.2 Data Pre-processing and Statistics

    Some pre-processing steps performed on the newspaper report dataset are as follows:

    Remove hyperlinks.

    Remove non-English alphanumeric characters.

    Remove stop words.

    Lemmatize text.

    While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any presence of the above-mentioned criteria.
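    The pre-processing steps listed above can be approximated with the standard library alone. The sketch below is not the authors' script: the function name, the tiny stop-word list, the lowercasing step, and the reading of "non-English alphanumeric characters" as anything outside A-Z, a-z and 0-9 are all assumptions, and lemmatization is left out because it needs an external NLP library.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}


def preprocess(text):
    """Remove hyperlinks, non-English alphanumeric characters, and stop words."""
    text = re.sub(r"https?://\S+", " ", text)      # remove hyperlinks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)    # keep only English letters and digits
    tokens = text.lower().split()                  # lowercasing is an added assumption
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess(
    "The number of cases in Minnesota is rising! See https://example.com/a?b=1"
)
```

    For the sample sentence this leaves the tokens number, cases, minnesota, rising and see.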

    The primary data statistics of the two datasets are shown in Tables 1 and 2.

    Table 1: Covid-News-USA-NNK data statistics

    Statistic                       Range
    No of words per headline        7 to 20
    No of words per body content    150 to 2100

    Table 2: Covid-News-BD-NNK data statistics

    Statistic                       Range
    No of words per headline        10 to 20
    No of words per body content    100 to 1500

    2.3 Dataset Repository

    We used GitHub as our primary data repository under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.

    3 Literature Review

    Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

    Some well-known applications of NLP include fraud detection on online media sites[ 10 ], authorship attribution in fallback authentication systems[ 11 ], intelligent conversational agents or chatbots[ 12 ], and the machine translation used by Google Translate[ 13 ]. While these are all downstream tasks, several exciting developments have been made in algorithms designed solely for Natural Language Processing tasks. The two most prominent are BERT[ 14 ], which uses a bidirectional Transformer encoder and performs very well on classification and prediction tasks, and the GPT-3 models released by OpenAI[ 15 ], which can generate almost human-like text. However, these are typically used as pre-trained models because training them carries a huge computation cost.

    Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc[ 16 ]. Information extraction in texts could be identifying named entities and locations or essential data.

    Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of text. One commonly used topic modeling method is Latent Dirichlet Allocation or LDA[ 17 ].

    Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and picks the words with the highest weight.

    Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

    4 Our experiments and Result analysis

    We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2 and 3, we can make the following observations:

    In February, both newspapers talked about China and the source of the outbreak.

    StarTribune emphasized Minnesota as the most concerned state. In April, it seemed to have been even more concerned.

    Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, markets.

    The Washington Post discussed global issues more than StarTribune.

    StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

    While both newspapers mentioned the outbreak in China in February, the spread in the United States is highlighted more heavily from March through May, displaying the critical impact caused by the virus.

    We used a script to extract all numbers related to certain keywords such as ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports and created a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, as the count gradually rose from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.

    We used Vader Sentiment Analysis to extract the sentiment of the headlines and the body. On average, the sentiments ranged from -0.5 to -0.9. The Vader Sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide us information about how a state or country is reacting to the pandemic.

    We used the PageRank algorithm to extract keywords from headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, ’Response’. Keywords extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,

  17. COVID-19 First Case Date By Country (Coronavirus)

    • kaggle.com
    zip
    Updated May 20, 2020
    Cite
    Joseph Glynn (2020). COVID-19 First Case Date By Country (Coronavirus) [Dataset]. https://www.kaggle.com/datasets/josephglynn/covid19-first-case-date-by-country-coronavirus/code
    Explore at:
    Available download formats: zip (3258 bytes)
    Dataset updated
    May 20, 2020
    Authors
    Joseph Glynn
    Description

    Context

    This data was collected as part of a university research paper where COVID-19 cases were analysed using a cross-sectional regression model as at 17th May 2020. In order to better understand COVID-19 cases growth at a country level I decided to create a dataset containing key dates in the progression of the virus globally.

    Content

    210 rows, 6 columns.

    This dataset contains data relating to COVID-19 cases for 210 countries globally. Data was collected using the most recent and reliable information as at 17th May 2020. The majority of data was collected from Worldometer. https://www.worldometers.info/coronavirus/#countries

    This dataset contains dates for the 1st coronavirus case, 100th coronavirus case, and (50th coronavirus case per 1 million people) for 210 countries. Data is also provided for the number of days between the 1st case and the 100th as well as the 1st case and the 50th per 1 million people.

    Data prior to 15th February 2020 was not easily accessible at the country level from Worldometer. Therefore any dates prior to 15th February 2020 were not sourced from Worldometer but from reputable government and local media sources.

    Blanks (null values) indicate that the country in question has not reached either 50 coronavirus cases per 1 million people or 100 coronavirus cases. These were left blank.
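    A minimal sketch of how the day-count columns relate to the date columns, including the blank case described above. The function name and the sample dates are hypothetical and are not taken from the file; a blank cell is represented here as an empty string.

```python
from datetime import date


def days_between(first_case, milestone):
    """Days from the 1st case to a milestone date, or None if either value is blank."""
    if not first_case or not milestone:
        return None
    start = date.fromisoformat(first_case)
    end = date.fromisoformat(milestone)
    return (end - start).days


# Hypothetical country that reached 100 cases but not 50 cases per 1 million people.
days_to_100th = days_between("2020-01-31", "2020-03-05")
days_to_50_per_million = days_between("2020-01-31", "")
```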

    Acknowledgements

    I would like to acknowledge Worldometer for providing the vast majority of the data in this file. Worldometer is a website that provides real time statistics on topics such as coronavirus cases. Its sources include government official reports as well as trusted local media sources all of which are referenced on their website.

    Inspiration

    Hopefully this data can be used to better understand the growth of COVID-19 cases globally.

  18. MD COVID-19 - Vaccination Percent Age Group Population

    • catalog.data.gov
    • opendata.maryland.gov
    • +1more
    Updated Jun 21, 2025
    + more versions
    Cite
    opendata.maryland.gov (2025). MD COVID-19 - Vaccination Percent Age Group Population [Dataset]. https://catalog.data.gov/dataset/md-covid-19-vaccination-percent-age-group-population
    Explore at:
    Dataset updated
    Jun 21, 2025
    Dataset provided by
    opendata.maryland.gov
    Description

    Regarding all Vaccination Data

    The date of Last Update is 4/21/2023. Additionally on 4/27/2023 several COVID-19 datasets were retired and no longer included in public COVID-19 data dissemination. See this link for more information https://imap.maryland.gov/pages/covid-data

    Summary

    The cumulative number of COVID-19 vaccinations percent age group population: 16-17; 18-49; 50-64; 65 Plus.

    Description

    COVID-19 - Vaccination Percent Age Group Population data layer is a collection of COVID-19 vaccinations that have been reported each day into ImmuNet. COVID-19 is a disease caused by a respiratory virus first identified in Wuhan, Hubei Province, China in December 2019. COVID-19 is a new virus that hasn't caused illness in humans before. Worldwide, COVID-19 has resulted in thousands of infections, causing illness and in some cases death. Cases have spread to countries throughout the world, with more cases reported daily. The Maryland Department of Health reports daily on COVID-19 cases by county.

    Terms of Use

    The Spatial Data, and the information therein, (collectively the Data) is provided as is without warranty of any kind, either expressed, implied, or statutory. The user assumes the entire risk as to quality and performance of the Data. No guarantee of accuracy is granted, nor is any responsibility for reliance thereon assumed. In no event shall the State of Maryland be liable for direct, indirect, incidental, consequential or special damages of any kind. The State of Maryland does not accept liability for any damages or misrepresentation caused by inaccuracies in the Data or as a result to changes to the Data, nor is there responsibility assumed to maintain the Data in any manner or form. The Data can be freely distributed as long as the metadata entry is not modified or deleted. Any data derived from the Data must acknowledge the State of Maryland in the metadata. This map is for planning purposes only. MEMA does not guarantee the accuracy of any forecast or predictive elements.

  19. COVID-19 Pandemic Wikipedia Readership

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Isaac Johnson; Leila Zia; Joseph Allemandou; Marcel Ruiz Forns; Nuria Ruiz; Fabian Kaelin (2023). COVID-19 Pandemic Wikipedia Readership [Dataset]. http://doi.org/10.6084/m9.figshare.14548032.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Isaac Johnson; Leila Zia; Joseph Allemandou; Marcel Ruiz Forns; Nuria Ruiz; Fabian Kaelin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data release includes two Wikipedia datasets related to the readership of the project during the early COVID-19 pandemic period. The first dataset is COVID-19 article page views by country; the second is one-hop navigation where one of the two pages is COVID-19 related. The data covers roughly the first six months of the pandemic, more specifically from January 1st 2020 to June 30th 2020. For more background on the pandemic in those months, see English Wikipedia's Timeline of the COVID-19 pandemic.

    Wikipedia articles are considered COVID-19 related according to the methodology described here; the list of COVID-19 articles used for the released datasets is available in covid_articles.tsv. For simplicity and transparency, the same list of articles from 20 April 2020 was used for the entire dataset, though in practice new COVID-19-relevant articles were constantly being created as the pandemic evolved.

    Privacy considerations

    While this data is considered valuable for the insight that it can provide about information-seeking behaviors around the pandemic in its early months across diverse geographies, care must be taken to not inadvertently reveal information about the behavior of individual Wikipedia readers. We put in place a number of filters to release as much data as we can while minimizing the risk to readers.

    The Wikimedia Foundation started to release most viewed articles by country from Jan 2021. At the beginning of the COVID-19 pandemic an exemption was made to store reader data about the pandemic with additional privacy protections:

    • exclude the page views from users engaged in an edit session
    • exclude reader data from specific countries (with a few exceptions)
    • the aggregated statistics are based on 50% of reader sessions that involve a pageview to a COVID-19-related article (see covid_pages.tsv). As a control, a 1% random sample of reader sessions that have no pageviews to COVID-19-related articles was kept. In aggregate, we make sure this 1% non-COVID-19 sample and 50% COVID-19 sample represent less than 10% of pageviews for a country for that day. The randomization and filters occur on a daily cadence with all timestamps in UTC.
    • exclude power users, i.e. userhashes with greater than 500 pageviews in a day. This doubles as another form of likely bot removal, protects very heavy users of the project, and also in theory would help reduce the chance of a single user heavily skewing the data.
    • exclude readership from users of the iOS and Android Wikipedia apps.

    In effect, the view counts in this dataset represent comparable trends rather than the total amount of traffic from a given country. For more background on per-country readership data, and the COVID-19 privacy protections in particular, see this phabricator.

    To further minimize privacy risks, a k-anonymity threshold of 100 was applied to the aggregated counts. For example, a page needs to be viewed at least 100 times in a given country and week in order to be included in the dataset. In addition, the view counts are floored to a multiple of 100.

    Datasets

    The datasets published in this release are derived from a reader session dataset generated by the code in this notebook with the filtering described above. The raw reader session data itself will not be publicly available due to privacy considerations. The datasets described below are similar to the pageviews and clickstream data that the Wikimedia Foundation publishes already, with the addition of the country-specific counts.

    COVID-19 pageviews

    The file covid_pageviews.tsv contains:

    • pageview counts for COVID-19 related pages, aggregated by week and country
    • k-anonymity threshold of 100
    • example: In the 13th week of 2020 (23 March - 29 March 2020), the page 'Pandémie_de_Covid-19_en_Italie' on French Wikipedia was visited 11700 times from readers in Belgium
    • as a control bucket, we include pageview counts to all pages aggregated by week and country. Due to privacy considerations during the collection of the data, the control bucket was sampled at ~1% of all view traffic. The view counts for the control title are thus proportional to the total number of pageviews to all pages.

    The file is ~8 MB and contains ~134000 data points across the 27 weeks, 108 countries, and 168 projects.

    Covid reader session bigrams

    The file covid_session_bigrams.tsv contains:

    • number of occurrences of visits to pages A -> B, where either A or B is a COVID-19 related article. Note that the bigrams are tuples (from, to) of articles viewed in succession; the underlying mechanism can be clicking on a link in an article, but it may also have been a new search or reading both articles based on links from third source articles. In contrast, the clickstream data is based on referral information only
    • aggregated by month and country
    • k-anonymity threshold of 100
    • example: In March of 2020, there were 1000 occurrences of readers accessing the page es.wikipedia/SARS-CoV-2 followed by es.wikipedia/Orthocoronavirinae from Chile

    The file is ~10 MB and contains ~90000 bigrams across the 6 months, 96 countries, and 56 projects.

    Contact

    Please reach out to research-feedback@wikimedia.org for any questions.
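The k-anonymity threshold and flooring described for this release can be illustrated with a minimal sketch. This is not the code from the notebook the description refers to; the function name, the key layout, and the raw counts below are hypothetical and chosen only to show the two steps (drop groups under 100 views, then floor the rest to a multiple of 100).

```python
THRESHOLD = 100  # minimum views for a group to be released (from the description)
BUCKET = 100     # released counts are floored to a multiple of this value


def anonymize_counts(raw_counts, threshold=THRESHOLD, bucket=BUCKET):
    """Apply a k-anonymity style filter, then floor the surviving counts."""
    released = {}
    for key, count in raw_counts.items():
        if count < threshold:
            continue  # below the threshold: the group is not released at all
        released[key] = (count // bucket) * bucket  # floor to a multiple of bucket
    return released


# Hypothetical raw counts keyed by (project, page, country, week); values are invented.
raw = {
    ("fr.wikipedia", "Page_A", "Belgium", "2020-W13"): 11742,
    ("es.wikipedia", "Page_B", "Chile", "2020-W13"): 99,
    ("en.wikipedia", "Page_C", "Kenya", "2020-W13"): 100,
}
released = anonymize_counts(raw)
for key, count in released.items():
    print(key, count)
```

Under these assumptions the 99-view group is dropped, while the other two are released as 11700 and 100. A practical consequence for analysis is that released counts understate true traffic by up to 99 views per group.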

  20. COVID-19 Country Data

    • kaggle.com
    zip
    Updated May 3, 2020
    Patrick (2020). COVID-19 Country Data [Dataset]. https://www.kaggle.com/datasets/bitsnpieces/covid19-country-data/code
    Explore at:
    zip (190821 bytes), available download formats
    Dataset updated
    May 3, 2020
    Authors
    Patrick
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Motivation

    Why did I create this dataset? This is my first time creating a notebook in Kaggle and I am interested in learning more about COVID-19 and how different countries are affected by it and why. It might be useful to compare different metrics between different countries. And I also wanted to participate in a challenge, and I've decided to join the COVID-19 datasets challenge. While looking through the projects, I noticed https://www.kaggle.com/koryto/countryinfo and it inspired me to start this project.

    Method

    My approach is to scour the Internet and Kaggle looking for country data that can potentially have an impact on how the COVID-19 pandemic spreads. I ended up with the following for each country:

    • Monthly temperature and precipitation from Worldbank
    • Latitude and longitude
    • Population, density, gender and age
    • Airport traffic from Worldbank
    • COVID-19 date of first case and number of cases and deaths as of March 26, 2020
    • 2009 H1N1 flu pandemic cases and deaths obtained from Wikipedia
    • Property affordability index and Health care index from Numbeo
    • Number of hospital beds and ICU beds from Wikipedia
    • Flu and pneumonia death rate from Worldlifeexpectancy.com (Age Adjusted Death Rate Estimates: 2017)
    • School closures due to COVID-19
    • Number of COVID-19 tests done
    • Number of COVID-19 genetic strains
    • US Social Distancing Policies from COVID19StatePolicy’s SocialDistancing repository on GitHub
    • DHL Global Connectedness Index 2018 (People Breadth scores)
    • Datasets have been merged by country name whenever possible. I needed to rename some countries by hand (e.g. US to United States), but it's possible that I might have missed some. See the output file covid19_merged.csv for the merged result.

    See covid19_data - data_sources.csv for data source details.
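Merging by country name with hand-made renames, as described in the method, might look like the following sketch. It is not taken from the author's notebook: the alias table, the column names, and the values are hypothetical, and the helper that reports gaps is added only to show how missed renames could be spotted.

```python
# Hypothetical alias table; the description only mentions renaming "US" by hand.
ALIASES = {"US": "United States", "USA": "United States"}


def canonical(name):
    """Map a country name to one agreed spelling."""
    name = name.strip()
    return ALIASES.get(name, name)


def merge_by_country(*tables):
    """Outer-join several {country: {column: value}} tables on the canonical name."""
    merged = {}
    for table in tables:
        for country, columns in table.items():
            merged.setdefault(canonical(country), {}).update(columns)
    return merged


def missing_columns(merged, expected):
    """List, per country, the expected columns that no source table supplied."""
    gaps = {}
    for country, columns in merged.items():
        absent = [c for c in expected if c not in columns]
        if absent:
            gaps[country] = absent
    return gaps


# Made-up values, for illustration only.
cases = {"US": {"cases": 10}, "Country Z": {"cases": 5}}
beds = {"United States": {"beds": 3}}

merged = merge_by_country(cases, beds)
print(merged)
print(missing_columns(merged, ["cases", "beds"]))
```

Countries that show up with missing columns are either absent from one source or spelled differently there, which is the case the description warns may have been missed.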

    Notebook: https://www.kaggle.com/bitsnpieces/covid19-data

    Caveats

    Since I did not personally collect each datapoint, and because each datasource is different with different objectives, collected at different times, measured in different ways, any inferences from this dataset will need further investigation.

    Other interesting sources of information

    Acknowledgements

    I want to acknowledge the authors of the datasets that made their data publicly available which has made this project possible. Banner image is by Brian.

    I hope that the community finds this dataset useful. Feel free to recommend other datasets that you think will be useful / relevant! Thanks for looking.

GHOST5612 (2025). Novel Covid-19 Dataset [Dataset]. https://www.kaggle.com/datasets/ghost5612/novel-covid-19-dataset

Novel Covid-19 Dataset

Day level Info On Covid-19 affected cases Worldwide

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 18, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
GHOST5612
License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Context:

From World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people.

So daily level information on the affected people can give some interesting insights when it is made available to the broader data science community.

Johns Hopkins University has made an excellent dashboard using the affected cases data. The data is extracted from the associated Google Sheets and made available here.

Edited:

The data is now available as CSV files in the Johns Hopkins GitHub repository. Please refer to the GitHub repository for the Terms of Use details. It is uploaded here for use in Kaggle kernels and to gather insights from the broader DS community.

Content

2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC

This dataset has daily level information on the number of affected cases, deaths and recoveries from the 2019 novel coronavirus. Please note that this is time series data, so the number of cases on any given day is the cumulative number.
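Because the counts are cumulative, day-to-day changes have to be derived by differencing. The following is a minimal sketch with a made-up series; the function name is hypothetical.

```python
def daily_from_cumulative(cumulative):
    """Convert a cumulative series into per-day increments."""
    daily = []
    previous = 0
    for total in cumulative:
        daily.append(total - previous)
        previous = total
    return daily


# Made-up cumulative counts for four consecutive days.
cumulative_confirmed = [1, 3, 3, 7]
print(daily_from_cumulative(cumulative_confirmed))  # [1, 2, 0, 4]
```

The first increment equals the first cumulative value, so it covers everything reported up to the first day in the series. If a source revises a cumulative total downward, the corresponding increment comes out negative and may need separate handling.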

The data is available from 22 Jan, 2020.


Dataset Description

This dataset contains time-series and case-level records of the COVID-19 pandemic. The primary file is covid_19_data.csv, with supporting files for earlier records and individual-level line list data.

Files and Columns

1. covid_19_data.csv (Main File)

This is the primary dataset and contains aggregated COVID-19 statistics by location and date.

  • Sno – Serial number of the record
  • ObservationDate – Date of the observation (MM/DD/YYYY)
  • Province/State – Province or state of the observation (may be missing for some entries)
  • Country/Region – Country of the observation
  • Last Update – Timestamp (UTC) when the record was last updated (not standardized, requires cleaning before use)
  • Confirmed – Cumulative number of confirmed cases on that date
  • Deaths – Cumulative number of deaths on that date
  • Recovered – Cumulative number of recoveries on that date
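As a rough sketch of how these columns might be read, the example below parses ObservationDate in the stated MM/DD/YYYY form, sums Confirmed across provinces per country and day, and tries a few timestamp layouts for Last Update. The sample rows are invented, the candidate Last Update formats are assumptions (the description only says the column is not standardized), and reading counts through float() assumes they may be stored with a decimal point.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Invented rows using the column names listed above.
SAMPLE = """Sno,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020,Province A,Country X,1/22/2020 17:00,1.0,0.0,0.0
2,01/22/2020,Province B,Country X,1/22/2020 17:00,2.0,0.0,0.0
3,01/23/2020,,Country Y,2020-01-23T10:00:00,5.0,0.0,0.0
"""

# Assumed layouts for the non-standardized "Last Update" column.
CANDIDATE_FORMATS = ("%m/%d/%Y %H:%M", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S")


def parse_last_update(text):
    """Return a datetime for the first matching layout, or None if none match."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def confirmed_by_country(csv_file):
    """Sum cumulative Confirmed over provinces for each (country, date) pair."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        day = datetime.strptime(row["ObservationDate"], "%m/%d/%Y").date()
        totals[(row["Country/Region"], day)] += int(float(row["Confirmed"]))
    return dict(totals)


totals = confirmed_by_country(io.StringIO(SAMPLE))
for (country, day), confirmed in sorted(totals.items()):
    print(country, day.isoformat(), confirmed)
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle for covid_19_data.csv. Rows whose Last Update value returns None from `parse_last_update` are the ones that need manual cleaning.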

2. 2019_ncov_data.csv (Legacy File)

This file contains earlier COVID-19 records. It is no longer updated and is provided only for historical reference. For current analysis, please use covid_19_data.csv.

3. COVID_open_line_list_data.csv

This file provides individual-level case information, obtained from an open data source. It includes patient demographics, travel history, and case outcomes.

4. COVID19_line_list_data.csv

Another individual-level case dataset, also obtained from public sources, with detailed patient-level information useful for micro-level epidemiological analysis.

✅ Use covid_19_data.csv for up-to-date aggregated global trends.

✅ Use the line list datasets for detailed, individual-level case analysis.

Country level datasets:

If you are interested in knowing country level data, please refer to the following Kaggle datasets:

India - https://www.kaggle.com/sudalairajkumar/covid19-in-india

South Korea - https://www.kaggle.com/kimjihoo/coronavirusdataset

Italy - https://www.kaggle.com/sudalairajkumar/covid19-in-italy

Brazil - https://www.kaggle.com/unanimad/corona-virus-brazil

USA - https://www.kaggle.com/sudalairajkumar/covid19-in-usa

Switzerland - https://www.kaggle.com/daenuprobst/covid19-cases-switzerland

Indonesia - https://www.kaggle.com/ardisragen/indonesia-coronavirus-cases

Acknowledgements :

Johns Hopkins University for making the data available for educational and academic research purposes

MoBS lab - https://www.mobs-lab.org/2019ncov.html

World Health Organization (WHO): https://www.who.int/

DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia.

BNO News: https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/

National Health Commission of the People’s Republic of China (NHC): http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml

China CDC (CCDC): http://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm

Hong Kong Department of Health: https://www.chp.gov.hk/en/features/102465.html

Macau Government: https://www.ssm.gov.mo/portal/

Taiwan CDC: https://sites.google....
