MM-COVID is a dataset for fake news detection related to COVID-19. It provides multilingual fake news together with the relevant social context. It contains 3,981 pieces of fake news content and 7,192 pieces of trustworthy information in six languages: English, Spanish, Portuguese, Hindi, French and Italian.
A full description of this dataset along with updated information can be found here.
In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of scholarly articles, including full text content, about COVID-19 and the coronavirus family of viruses for use by the global research community.
This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. The corpus will be updated weekly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.
By downloading this dataset you are agreeing to the Dataset license. Specific licensing information for individual articles in the dataset is available in the metadata file.
Additional licensing information is available on the PMC website, medRxiv website and bioRxiv website.
Dataset content:
Commercial use subset
Non-commercial use subset
PMC custom license subset
bioRxiv/medRxiv subset (pre-prints that are not peer reviewed)
Metadata file
Readme
Each paper is represented as a single JSON object (see schema file for details).
Description:
The dataset contains all COVID-19 and coronavirus-related research (e.g. SARS, MERS, etc.) from the following sources:
PubMed's PMC open access corpus using this query (COVID-19 and coronavirus research)
Additional COVID-19 research articles from a corpus maintained by the WHO
bioRxiv and medRxiv pre-prints using the same query as PMC (COVID-19 and coronavirus research)
We also provide a comprehensive metadata file of coronavirus and COVID-19 research articles with links to PubMed, Microsoft Academic and the WHO COVID-19 database of publications (includes articles without open access full text).
We recommend using metadata from the comprehensive file when available, instead of parsed metadata in the dataset. Please note the dataset may contain multiple entries for individual PMC IDs in cases when supplementary materials are available.
This repository is linked to the WHO database of publications on coronavirus disease and other resources, such as Microsoft Academic Graph, PubMed, and Semantic Scholar. A coalition including the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine of the National Institutes of Health came together to provide this service.
Citation:
When including CORD-19 data in a publication or redistribution, please cite the dataset as follows:
In bibliography:
COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-MM-DD. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed YYYY-MM-DD. 10.5281/zenodo.3715505
In text:
(CORD-19, 2020)
The Allen Institute for AI and particularly the Semantic Scholar team will continue to provide updates to this dataset as the situation evolves and new research is released.
Ref: https://github.com/CSSEGISandData/COVID-19
Daily reports (csse_covid_19_daily_reports)
This folder contains daily case reports. All timestamps are in UTC (GMT+0).
File naming convention: MM-DD-YYYY.csv in UTC.
Field description
- Province/State: China - province name; US/Canada/Australia - city name, state/province name; Others - name of the event (e.g., "Diamond Princess" cruise ship); other countries - blank.
- Country/Region: country/region name conforming to WHO (will be updated).
- Last Update: MM/DD/YYYY HH:mm (24 hour format, in UTC).
- Confirmed: the number of confirmed cases. For Hubei Province: from Feb 13 (GMT +8), we report both clinically diagnosed and lab-confirmed cases. For lab-confirmed cases only (before Feb 17), please refer to who_covid_19_situation_reports. For Italy, the diagnosis standard might have changed since Feb 27 to "slow the growth of new case numbers." (Source)
- Deaths: the number of deaths.
- Recovered: the number of recovered cases.
Update frequency
- Files after Feb 1 (UTC): once a day around 23:59 (UTC).
- Files on and before Feb 1 (UTC): the last updated files before 23:59 (UTC). Sources: archived_data and dashboard.
Data sources
Refer to the mainpage.
Why create this new folder?
- Unifying all timestamps to UTC, including the file name and the "Last Update" field.
- Pushing only one file every day.
- All historic data is archived in archived_data.
Time series summary (csse_covid_19_time_series)
This folder contains daily time series summary tables, including confirmed, deaths and recovered. All data are from the daily case report.
Field description
- Province/State: same as above.
- Country/Region: same as above.
- Lat and Long: a coordinates reference for the user.
- Date fields: M/DD/YYYY (UTC), the same data as the MM-DD-YYYY.csv file.
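The naming and timestamp conventions described above can be illustrated with a small Python sketch. The helper names daily_report_filename and parse_last_update are hypothetical and not part of the repository; the sketch assumes the "Last Update" value follows exactly the MM/DD/YYYY HH:mm pattern stated in the field description.

```python
from datetime import date, datetime, timezone


def daily_report_filename(day: date) -> str:
    """Build the daily-report file name (MM-DD-YYYY.csv) for a UTC date."""
    return day.strftime("%m-%d-%Y") + ".csv"


def parse_last_update(value: str) -> datetime:
    """Parse a 'Last Update' value given as MM/DD/YYYY HH:mm (24-hour, UTC)."""
    parsed = datetime.strptime(value, "%m/%d/%Y %H:%M")
    # The folder description states that all timestamps are in UTC.
    return parsed.replace(tzinfo=timezone.utc)


print(daily_report_filename(date(2020, 3, 1)))
print(parse_last_update("03/01/2020 23:45").isoformat())
```

Individual files may deviate from the documented pattern, so real code would probably need a fallback for other timestamp layouts.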
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multivariate regression for prediction of severe COVID-19.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The peer-reviewed publication for this dataset was presented at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this when using the dataset.
This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim.
The claims have been obtained from online fact-checking sources, existing datasets and research challenges. It combines different data sources with different foci, thus enabling a comprehensive approach that combines different media (Twitter, Facebook, general websites, academia), information domains (health, scholar, media), information types (news, claims) and applications (information retrieval, veracity evaluation).
The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding respectively claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019).
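As an illustration of the first retrieval stage mentioned above, the following is a minimal BM25 sketch in Python. It is not the authors' pipeline: the whitespace tokenisation, the parameter values k1 and b, the idf variant, and the example claims are all assumptions, and the MonoT5 re-ranking and BERTScore steps are not shown.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document in docs against the tokenised query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed idf that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up claims, used only to show the ranking behaviour.
claims = [
    "vitamin c cures covid".split(),
    "vitamin c can cure covid".split(),
    "masks reduce transmission".split(),
]
scores = bm25_scores(claims[0], claims[1:])
print(scores)
```

In a de-duplication setting, the highest-scoring candidates for each claim would then be passed to a stronger model to decide whether two claims are similar enough to drop one of them.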
The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.
The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON − People, including fictional; ORGANIZATION − Companies, agencies, institutions, etc.; GPE − Countries, cities, states; FACILITY − Buildings, highways, etc. These entities have been detected with spaCy, using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013).
The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.
The data sources used are:
The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/
CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID
MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID
CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data
TREC Health Misinformation track https://trec-health-misinfo.github.io/
TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html
The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True).
The entries in the dataset contain the following information:
Claim. Text of the claim.
Claim label. The labels are: False, and True.
Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.
Original information source. Information about which general information source was used to obtain the claim.
Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by the UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1).
References
Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022 https://arxiv.org/abs/2205.02596
Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at trec-3. Nist Special Publication Sp,109:109.
Fabio Crestani, Mounia Lalmas, Cornelis J Van Rijsbergen, and Iain Campbell. 1998. “is this document relevant?. . . probably” a survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA, 23.
Limeng Cui and Dongwon Lee. 2020. Coaid: Covid-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.
Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. Mm-covid: A multilingual and multimodal data repository for combating covid-19 disinformation.
Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.
Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. Trec-covid: constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM New York, NY, USA.
https://www.immport.org/agreement
COVID-19 mRNA vaccines are highly efficacious in preventing COVID-19 morbidity and mortality in phase 3 clinical studies as well as in real-world settings. Emerging evidence suggests that some individuals with underlying comorbidities may mount suboptimal antibody responses to SARS-CoV-2 immunization (Addeo et al., 2021; Monin et al., 2021; Thakkar et al., 2021). Indeed, patients with multiple myeloma (MM) are immuno-compromised due to defects in humoral and cellular immunity as well as due to immunosuppressive therapy. Preliminary reports indicate that the antibody response in MM after the initial dose of SARS-CoV-2 mRNA vaccine is attenuated and delayed compared to healthy controls (Bird et al., 2021; Terpos et al., 2021). Moreover, MM patients who receive anti-CD38 monoclonal antibodies may have poorer vaccine-induced antibody responses even after completion of the full two-dose mRNA vaccine regimen (Pimpinelli et al., 2021). The kinetics of the vaccine responses in MM patients with prior COVID-19 infection and the impact of treatments, including BCMA-targeting agents, to vaccine response remain unknown.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets in this publication report the number of diagnoses with coronavirus disease (COVID-19) as reported by the Department of Health in Ireland. This includes new cases diagnosed per day and cumulative cases, as well as cases across age groups. The latter also include population estimates by age group for 2019 from Ireland's Central Statistics Office, in order to express cases per million population.
For the files YYYYMMDD_covid_ie_age_groups.csv, variable descriptions are as follows:
For the files YYYYMMDD_covid_ie_daily_cases, variable descriptions are as follows:
https://ottawa.ca/en/city-hall/get-know-your-city/open-data#open-data-licence-version-2-0
Effective June 7th, 2024, this dataset will no longer be updated. This file contains data for the last 6 weeks on: (1) weekly counts and rates of Ottawa residents with laboratory-confirmed COVID-19 by episode date (i.e. the earliest of symptom onset, testing or reported date) and age; and (2) weekly counts and rates of Ottawa residents with laboratory-confirmed COVID-19 by reported date. Data are from the Ontario Ministry of Health Public Health Case and Contact Management Solution (CCM).
Accuracy: Points of consideration for interpretation of the data:
- Data are entered into and extracted by Ottawa Public Health from the Ontario Ministry of Health Public Health Case and Contact Management Solution (CCM). The COD is a dynamic disease reporting system that allows for ongoing updates; data represent a snapshot at the time of extraction and may differ from previous or subsequent reports.
- As the cases are investigated and more information is available, the dates are updated.
- A person’s exposure may have occurred up to 14 days prior to onset of symptoms. Symptomatic cases occurring in approximately the last 14 days are likely under-reported due to the time for individuals to seek medical assessment, availability of testing, and receipt of test results.
- Confirmed cases are those with a confirmed COVID-19 laboratory result as per the Ministry of Health Public health management of cases and contacts of COVID-19 in Ontario. March 25, 2020 version 6.0.
- Counts will be subject to varying degrees of underreporting due to a variety of factors, such as disease awareness and medical care seeking behaviours, which may depend on severity of illness, clinical practice, changes in laboratory testing, and reporting behaviours.
- Surveillance testing for COVID-19 began in long term care facilities on April 25, 2020.
Update Frequency: Tuesdays and Fridays
Attributes: Data fields:
- Week – Date of the first day of the episode week (i.e. the week during which the case first developed symptoms, got tested or was reported to OPH, whichever was earliest). Date in format YYYY-MM-DD H:MM.
- Weekly Rate of COVID-19 by 20-year Age Groupings (per 100,000 pop) and Episode Date – The number of Ottawa residents with confirmed COVID-19 within an age group (e.g. 0-9 years) divided by the total Ottawa population for that age group. This fraction is then multiplied by 100,000 to get a rate of COVID-19 per 100,000 population for that age group.
- Weekly Total of Cases by Episode Date – number of Ottawa residents with laboratory-confirmed COVID-19 by episode date.
- Weekly Total of Cases by Reported Date – number of Ottawa residents with laboratory-confirmed COVID-19 by reported date.
- Weekly Rate of COVID-19 (per 100,000 pop) by Reported Date – number of Ottawa residents with laboratory-confirmed COVID-19 by reported date divided by the total Ottawa population and multiplied by 100,000.
Contact: OPH Epidemiology Team | Epidemiology & Evidence, Ottawa Public Health
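The rate calculation described in the attributes can be written as a short Python sketch. The function name rate_per_100k and the example figures are made up for illustration; they are not taken from the Ottawa data.

```python
def rate_per_100k(cases: int, population: int) -> float:
    """Cases divided by population, scaled to a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so that round numbers stay exact in floating point.
    return cases * 100_000 / population


# Hypothetical example: 50 cases in an age group of 1,000,000 residents.
print(rate_per_100k(50, 1_000_000))
```

The same function applies to the age-specific rate (age-group cases over age-group population) and to the overall rate by reported date (all cases over the total population).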
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Myopic shift had been observed during the COVID-19 lockdown in young school children. It remains unknown whether myopic shift is accompanied with increase in axial length. We aimed to evaluate the impact of the COVID-19 lockdown on myopia and axial length of school children in China by comparing them before, during and after the lockdown.
Methods
In this population-based cross-sectional study, school-based myopia screenings were conducted in the Fall of 2019, 2020, and 2021 (representing before, during and after COVID-19 lockdown respectively) in Chengdu, China. Myopia screenings were performed on 83,132 students aged 6 to 12 years. Non-cycloplegic refractive error was examined using NIDEK auto-refractor (ARK-510A; NIDEK Corp., Tokyo, Japan) and axial length was measured using AL-Scan (NIDEK Corp., Tokyo, Japan). Spherical equivalent (SER, calculated as sphere + 0.5*cylinder), prevalence of myopia (SER ≤ -0.50 D), and axial length were compared across 3 years stratified by age.
Results
Myopia prevalence rate was 45.0% (95% CI: 44.6–45.5%) in 2019, 48.7% (95% CI: 48.3–49.1%) in 2020, and 47.5% (95% CI: 47.1–47.9%) in 2021 (p < 0.001). The mean non-cycloplegic SER (SD) was −0.70 (1.39) D, −0.78 (1.44) D, and −0.78 (1.47) D respectively (p < 0.001). The mean (SD) axial length was 23.41 (1.01) mm, 23.45 (1.03) mm, and 23.46 (1.03) mm across 3 years respectively (p < 0.001). From the multivariable models, the risk ratio (RR) of myopia was 1.07 (95% CI: 1.06–1.08) times, the SER was 0.05 D (95% CI: 0.04 D to 0.06 D) more myopic and the mean axial length increased by 0.01 mm (95% CI: 0.01 mm to 0.02 mm) in 2020 compared to 2019. In 2021, the risk ratio (RR) of myopia was 1.05 (95% CI: 1.04–1.06), the mean SER was 0.06 D (95% CI: 0.05 D to 0.07 D) more myopic, and the mean axial length increased by 0.03 mm (95% CI: 0.02 mm to 0.04 mm) compared to 2019.
Conclusions
The COVID-19 lockdown had significant impact on myopia development and axial length, and these impacts remained 1 year after the lockdown. Further longitudinal studies following-up with these students are needed to help understand the long-term effects of COVID-19 lockdown on myopia.
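The two definitions given in the Methods (spherical equivalent and the myopia cut-off) can be expressed as a small Python sketch. The function names and the example refractions are hypothetical and serve only to illustrate the arithmetic; they are not values from the study.

```python
def spherical_equivalent(sphere: float, cylinder: float) -> float:
    """Spherical equivalent refraction (SER) in dioptres: sphere + 0.5 * cylinder."""
    return sphere + 0.5 * cylinder


def is_myopic(ser: float, threshold: float = -0.50) -> bool:
    """Myopia as defined in the abstract: SER at or below -0.50 D."""
    return ser <= threshold


# Hypothetical refractions (sphere, cylinder) in dioptres.
for sphere, cylinder in [(-0.25, -0.50), (0.0, -0.50)]:
    ser = spherical_equivalent(sphere, cylinder)
    print(sphere, cylinder, ser, is_myopic(ser))
```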
Late in December 2019, the World Health Organisation (WHO) China Country Office obtained information about severe pneumonia of an unknown cause, detected in the city of Wuhan in Hubei province, China. This later turned out to be the novel coronavirus disease (COVID-19), an infectious disease caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) of the coronavirus family. The disease causes respiratory illness characterized by primary symptoms like cough, fever, and in more acute cases, difficulty in breathing. WHO later declared COVID-19 a pandemic because of its fast rate of spread across the globe.
The COVID-19 datasets organized by continent contain daily level information about the COVID-19 cases in the different continents of the world. The data are time series, and the number of cases on any given day is cumulative. The original datasets can be found on this Johns Hopkins University Github repository. I will be updating the COVID-19 datasets on a regular basis with every update from Johns Hopkins University. I have also included the world COVID-19 tests data and the 2020 world population, both scraped from Worldometer.
COVID-19 cases
covid19_world.csv. It contains the cumulative number of COVID-19 cases from around the world since January 22, 2020, as compiled by Johns Hopkins University.
covid19_asia.csv, covid19_africa.csv, covid19_europe.csv, covid19_northamerica.csv, covid19.southamerica.csv, covid19_oceania.csv, and covid19_others.csv. These contain the cumulative number of COVID-19 cases organized by continent.
Field description
- ObservationDate: Date of observation in YY/MM/DD
- Country_Region: name of Country or Region
- Province_State: name of Province or State
- Confirmed: the number of COVID-19 confirmed cases
- Deaths: the number of deaths from COVID-19
- Recovered: the number of recovered cases
- Active: the number of people still infected with COVID-19
Note: Active = Confirmed - (Deaths + Recovered)
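The note on the Active field can be checked with a short Python sketch. The column names follow the field description above; the sample row (country name and counts) is invented for illustration, and the helper name active_cases is hypothetical.

```python
import csv
import io

# Made-up sample in the documented column layout, not real data.
SAMPLE = """ObservationDate,Country_Region,Province_State,Confirmed,Deaths,Recovered
20/06/01,Exampleland,,100,5,60
"""


def active_cases(confirmed: int, deaths: int, recovered: int) -> int:
    """Active = Confirmed - (Deaths + Recovered)."""
    return confirmed - (deaths + recovered)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    row["Active"] = active_cases(
        int(row["Confirmed"]), int(row["Deaths"]), int(row["Recovered"])
    )
    print(row["Country_Region"], row["Active"])
```

Reading a real file would only require replacing the in-memory sample with an opened CSV file.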
COVID-19 tests
covid19_tests.csv. It contains the cumulative number of COVID-19 tests conducted since the onset of the pandemic, as reported by Worldometer. Data are available from June 01, 2020.
Field description
- Date: date in YY/MM/DD
- Country, Other: Country, Region, or dependency
- TotalTests: cumulative number of tests up till that date
- Population: population of Country, Region, or dependency
- Tests/1M pop: tests per 1 million of the population
- 1 Testevery X ppl: 1 test for every X number of people
2020 world population
world_population(2020).csv. It contains the 2020 world population as reported by Worldometer.
Field description
- Country (or dependency): Country or dependency
- Population (2020): population in 2020
- Yearly Change: yearly change in population as a percentage
- Net Change: the net change in population
- Density(P/km2): population density
- Land Area(km2): land area
- Migrants(net): net number of migrants
- Fert. Rate: Fertility Rate
- Med. Age: median age
- Urban pop: urban population
- World Share: share of the world population as a percentage
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Covid-19 reporting of SARS-CoV-2 variants in the Netherlands through the random sample of RT-PCR positive samples in the national surveillance of virus variants.
This file contains the following numbers:
- Number per VOC, VOI and VUM detected per week
- Total number of measurements, the denominator, per weekly sample
This is split into the WHO (https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/) and/or ECDC (https://www.ecdc.europa.eu/en/covid-19/variants-concern) designated Variant of Concern (VOC), Variant of Interest (VOI) and Variant Under Monitoring (VUM). The week to which a sample belongs is based on the date of sampling. The numbers are based on the random sample from the virus variant surveillance, which means that samples belonging to outbreaks are not included in the data.
The file is structured as follows:
- One record per VOC, VOI and VUM designated SARS-CoV-2 variant per week.
This file is updated weekly on Fridays.
The way this information is generated is different from the rapid tests and PCR tests. More advanced machines are used that have a longer run time than, for example, the machines used for PCR testing. Due to all the logistics processes, it is therefore not feasible to form a representative picture of the most recent two weeks: these are not reported for that reason. Additionally, the virus variant surveillance project has been operational since October 2020 with an increasing number of weekly samples until mid-early January 2021, therefore older data is not available.
For all reported data, the instructions, definitions and footnotes as stated on https://www.rivm.nl/coronavirus-covid-19/virus/varianten take precedence.
Please note: due to internationally changing variant name definitions based on advancing scientific insight, the records in the data presented here can be adjusted.
Changelog:
Version 2 update (October 29, 2021):
- A WHO_category column has been added with the current variant category (VOC/VOI/VUM) as assigned by the WHO.
- In addition to the VOC and VOI categories, the VUM category is now also included in the file.
Version 3 update (December 10, 2021):
- A column May_include_samples_listed_before has been added with a value TRUE whenever it is possible for the reported Variant_cases to aggregate samples that have already been included in a previous variant in the table. When this is not possible, the value is FALSE.
Version 4 update (July 8, 2022):
- The May_include_samples_listed_before column has been replaced by an Is_subvariant_of column. If this variant is a subvariant of another variant mentioned, this column contains a value that corresponds to the Variant_code of the other variant. The numbers (Variant_cases) of this subvariant are a subset of those of the other variant.
Description of the variables:
- Version: Version number of the dataset. When the content of the dataset is structurally changed (so not the weekly update or a correction at record level), the version number will be adjusted (+1) and also the corresponding metadata in RIVM data (data.rivm.nl).
- Date_of_report: Date and time when the database was last updated by the RIVM. Notation: YYYY-MM-DD hh:mm:ss.
- Date_of_statistics_week_start: The date of the Monday - first day of that week - for which the numbers per week are presented. The last day of the week is Sunday. Notation: YYYY-MM-DD.
- Variant_code: Scientific name of SARS-CoV-2 variant based on Pangolin nomenclature. Can contain letters, numbers and periods.
- Variant_name: Current WHO label of SARS-CoV-2 variant. Consists of letters only.
- ECDC_category: Indicates whether it is a Variant of Concern (VOC), Variant of Interest (VOI), Variant under Monitoring (VUM), or De-escalated Variant (DEV) according to ECDC's current definitions. For more information see also: https://www.ecdc.europa.eu/en/covid-19/variants-concern.
- WHO_category: Indicates whether it is a Variant of Concern (VOC), Variant of Interest (VOI) or Variant under Monitoring (VUM) according to the current WHO definitions. For more information see also: https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/
- Is_subvariant_of: If this variant is a subvariant of another variant that has been mentioned, this column contains a value that corresponds to the Variant_code of the other variant. The numbers (Variant_cases) of this subvariant are a subset of those of the other variant.
- Sample_size: Shows the total sample size in that week. Consists of whole numbers only.
- Variant_cases: Shows for how many cases from the sample from that week the specific VOC, VOI or VUM was found. Consists of whole numbers only.
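Because subvariant counts are a subset of their parent variant's counts, summing Variant_cases over all records of a week would double count. The Python sketch below shows one way to compute weekly shares and a total without double counting. The records are invented, and the assumption that an empty string marks "not a subvariant" is hypothetical; the actual file may encode this differently.

```python
# Invented records for a single week, using the documented column names.
records = [
    {"Variant_code": "B.1.1.529", "Is_subvariant_of": "",
     "Sample_size": 200, "Variant_cases": 150},
    {"Variant_code": "BA.2", "Is_subvariant_of": "B.1.1.529",
     "Sample_size": 200, "Variant_cases": 50},
]


def variant_share(variant_cases: int, sample_size: int) -> float:
    """Fraction of the weekly sample in which the variant was found."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return variant_cases / sample_size


shares = {
    r["Variant_code"]: variant_share(r["Variant_cases"], r["Sample_size"])
    for r in records
}

# Count only top-level variants, since subvariant cases are already
# contained in the counts of the variant they belong to.
top_level_cases = sum(
    r["Variant_cases"] for r in records if not r["Is_subvariant_of"]
)

print(shares)
print(top_level_cases)
```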
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brazil COVID-19: No. of Tests: M&M: New: Undefined data was reported at 0.000 Unit on 22 Mar 2024, unchanged from the previous value of 0.000 Unit on 21 Mar 2024. The series is updated daily, with a median of 0.000 Unit from Feb 2020 to 22 Mar 2024 across 1488 observations. It reached an all-time high of 10,342.000 Unit on 17 Jul 2020 and a record low of 0.000 Unit on 22 Mar 2024. The series remains in active status in CEIC and is reported by the Ministry of Health. It is categorized under the High Frequency Database’s Disease Outbreaks – Table BR.HLA002: Disease Outbreaks: COVID-19: Number of Tests: Mild to Moderate Cases.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is based on the GitHub repository maintained by OpenZH. Data has been enriched with geographical data for the cantons, in order to produce visualisations.

Field Name | Description | Format | Note |
---|---|---|---|
update | Date and time of notification | YYYY-MM-DD-HH-MM | |
name | Name of the reporting canton | Text | |
abbreviation_canton_and_fl | Abbreviation of the reporting canton | Text | |
ncumul_tested | Reported number of tests performed as of date | Number | Irrespective of canton of residence |
ncumul_conf | Reported number of confirmed cases as of date | Number | Only cases that reside in the current canton |
current_hosp (formerly ncumul_hosp) * | Reported number of hospitalised patients on date | Number | Irrespective of canton of residence |
current_icu (formerly ncumul_icu) * | Reported number of hospitalised patients in ICUs on date | Number | Irrespective of canton of residence |
current_vent (formerly ncumul_vent) * | Reported number of patients requiring ventilation on date | Number | Irrespective of canton of residence |
ncumul_released | Reported number of patients released from hospitals or reported recovered as of date | Number | Irrespective of canton of residence |
ncumul_deceased | Reported number of deceased as of date | Number | Only cases that reside in the current canton |
new_hosp * | Number of new hospitalisations since last date | Number | Irrespective of canton of residence |
source | Source of the information | URL link | |
geo_point_2d | Geographical centroid of the canton | geo_point_2d | |
current_isolated | Reported number of isolated persons on date | Number | Infected persons who are not hospitalised |
current_quarantined | Reported number of quarantined persons on date | Number | Persons who were in 'close contact' with an infected person while that person was infectious, and are not hospitalised themselves |
current_quarantined_riskareatravel | Reported number of quarantined persons on date | Number | People arriving in Switzerland from certain countries and areas, required to go into quarantine (introduced in May 2021) |

* These variables were affected by the format change on April 9th, 2020, which consists of:
- a new variable "new_hosp"
- the variables "ncumul_hosp", "ncumul_icu" and "ncumul_vent" have been renamed to "current_hosp", "current_icu" and "current_vent", to fit with their nature. To ensure compatibility with existing dashboards or reuses, these fields have been duplicated to avoid errors when their old names are used, but we strongly recommend replacing the old names with the new ones as soon as possible.
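
For reuses that still refer to the old field names, the rename described in the note on the April 9th, 2020 format change can be handled with a small mapping. The sketch below is only an illustration: the function name and the example record are hypothetical, and it assumes each row is available as a dictionary keyed by field name.

```python
# Old field names and their replacements, as listed in the note above.
LEGACY_TO_CURRENT = {
    "ncumul_hosp": "current_hosp",
    "ncumul_icu": "current_icu",
    "ncumul_vent": "current_vent",
}


def normalise_record(record):
    """Return a copy of `record` that uses only the current field names.

    If both the old and the new name are present, the value stored under
    the new name wins.
    """
    out = {}
    for key, value in record.items():
        new_key = LEGACY_TO_CURRENT.get(key, key)
        if key != new_key and new_key in out:
            continue  # the current name was already seen, keep its value
        out[new_key] = value
    return out


# Made-up record using one legacy name.
legacy = {"name": "ZH", "ncumul_conf": "340", "ncumul_hosp": "12"}
print(normalise_record(legacy))
```

Normalising once at load time keeps the rest of a dashboard or script free of references to the duplicated legacy columns.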
This data was gathered as part of the data mining project for the General Assembly Data Science course, using the API from https://rapidapi.com/astsiatsko/api/coronavirus-monitor .
COVID-19 is a contagious coronavirus disease that originated in Wuhan, China. This new strain of the virus has struck fear in many countries as cities are quarantined and hospitals are overcrowded. This dataset will help us understand how COVID-19 has spread in Italy.
On March 8, 2020, Italy’s prime minister announced a sweeping coronavirus quarantine early Sunday, restricting the movements of about a quarter of the country’s population in a bid to limit contagion at the epicenter of Europe’s outbreak.
### Highlights
- Spread over time in Italy
- Try to predict the spread of COVID-19 ahead of time to take preventive measures
https://www.livescience.com/why-italy-coronavirus-deaths-so-high.html
Data set from the article Mazzaccaro D, Giacomazzi F, Giannetta M, Varriale A, Scaramuzzo R, Modafferi A, Malacrida G, Righini P, Marrocco-Trischitta MM, Nano G. Non-Overt Coagulopathy in Non-ICU Patients with Mild to Moderate COVID-19 Pneumonia. J Clin Med. 2020 Jun 8;9(6):1781. doi: 10.3390/jcm9061781. PMID: 32521707; PMCID: PMC7355651.
Abstract
Introduction: Aim of the study is to assess the occurrence of early stage coagulopathy and disseminated intravascular coagulation (DIC) in patients with mild to moderate respiratory distress secondary to SARS-CoV-2 infection.
Materials and methods: Data of patients hospitalized from 18 March 2020 to 20 April 2020 were retrospectively reviewed. Two scores for the screening of coagulopathy (SIC and non-overt DIC scores) were calculated. The occurrence of thrombotic complication, death, and worsening respiratory function requiring non-invasive ventilation (NIV) or admission to ICU were recorded, and these outcomes were correlated with the results of each score. Chi-square test, receiver-operating characteristic curve, and logistic regression analysis were used as appropriate. p Values < 0.05 were considered statistically significant.
Results: Data of 32 patients were analyzed. Overt-DIC was diagnosed in two patients (6.2%), while 26 (81.2%) met the criteria for non-overt DIC. Non-overt DIC score values ≥4 significantly correlated with the need of NIV/ICU (p = 0.02) and with the occurrence of thrombotic complications (p = 0.04). A score ≥4 was the optimal cut-off value, performing better than SIC score (p = 0.0018). Values ≥4 in patients with thrombotic complications were predictive of death (p = 0.03).
Conclusions: Overt DIC occurred in 6.2% of non-ICU patients hospitalized for a mild to moderate COVID-19 respiratory distress, while 81.2% fulfilled the criteria for non-overt DIC. The non-overt DIC score performed better than the SIC score in predicting the need of NIV/ICU and the occurrence of thrombotic complications, as well as in predicting mortality in patients with thrombotic complications, with a score ≥4 being detected as the optimal cut-off.
The COVID-19 Numerical Claims Open Research Dataset (CONCORD) is an open-source collection of numerical claims meticulously extracted from academic papers focused on COVID-19 research. This dataset contains approximately 203,000 numerical claims, derived from over 57,000 scientific research articles published between January 2020 and May 2022. The claims were extracted from full-text research articles using a white box, weakly supervised model, with the CORD-19 repository serving as the raw dataset. The inclusion of numerical entities significantly enhances the credibility of claims, offering fine-grained, tangible, and valuable information, particularly beneficial within the biomedical domain.
The dataset features the following columns:
* claim_uid: A unique identifier for each individual numerical claim.
* cord_uid: An identifier for the research paper from which the claims were extracted, similar to those found in CORD-19.
* title: A string field representing the title of the research paper.
* doi: A string field for the Digital Object Identifier (DOI) of the paper.
* numerical_claims: A string field containing the numerical claim sentence itself.
* publish_time: A datetime field indicating the published date of the paper in yyyy-mm-dd format. Note that this field may not always be accurate, as some publishers denote unknown dates with future dates (e.g., yyyy-12-31).
* authors: A list of strings, where each string represents an author of the paper in 'Last, First Middle' format, semicolon-separated.
* journal: A string field for the journal in which the paper was published. Journal strings are not normalised (e.g., "BMJ" and "British Medical Journal" may both exist). This field can be empty if unknown.
* country: A string field indicating the author's country. Country strings are not normalised (e.g., "USA" and "United States of America" may both exist). This field can be empty if unknown.
* institution: A string field for the author's institute of affiliation. This field can be empty if unknown.
This dataset is typically provided in a CSV data file format. It comprises approximately 203,000 unique numerical claims, each represented by a record in the dataset. It is structured in a tabular format, suitable for analytical processing.
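As a rough illustration of how these columns might be read, the sketch below parses one made-up CSV row with Python's standard library. The sample row, the function name, and the rule of flagging any 31 December date as a possible placeholder are all assumptions for the example; the description only says that some publishers use dates such as yyyy-12-31 for unknown dates.

```python
import csv
import io
from datetime import date

# One made-up record with the columns listed above (not real data).
SAMPLE = '''claim_uid,cord_uid,title,doi,numerical_claims,publish_time,authors,journal,country,institution
c1,x1,Example title,10.0000/example,"A made-up claim mentioning 42%.",2022-12-31,"Doe, Jane A; Roe, Richard",,,
'''


def parse_claim(row):
    """Convert one CSV row into typed Python values."""
    published = date.fromisoformat(row["publish_time"])
    return {
        "claim_uid": row["claim_uid"],
        # Authors are semicolon-separated 'Last, First Middle' strings.
        "authors": [a.strip() for a in row["authors"].split(";") if a.strip()],
        "publish_time": published,
        # Assumption: treat 31 December as a possible placeholder date.
        "suspect_date": (published.month, published.day) == (12, 31),
        # Empty strings mean the value is unknown.
        "journal": row["journal"] or None,
    }


parsed = [parse_claim(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(parsed[0]["authors"], parsed[0]["suspect_date"])
```

Because journal and country strings are not normalised, any grouping on those columns would need its own mapping of name variants, which this sketch does not attempt.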
This dataset is ideally suited for various applications, including:
* Biomedical research and analysis.
* Public health studies and insights into COVID-19.
* Natural Language Processing (NLP) tasks, such as information extraction and entity recognition.
* Binary classification problems within text analysis.
* Developing and training AI and Large Language Models (LLMs) that require fine-grained numerical information from scientific literature.

The dataset's content covers scientific research articles published within the time range of January 2020 to May 2022. The publish_time data specifically spans from 2020-01-01 to 2022-12-31. While the dataset is global in scope, author affiliations show geographical distribution, with 16% from the USA and 10% from China, among others. Journal representation includes Int J Environ Res Public Health (5%) and PLoS One (4%), with the majority falling under other journals. Institution data indicates that 1% of authors are affiliated with the University of California, with the remainder distributed among other institutions or unknown.

CC BY 4.0

This dataset is valuable for a wide array of users, including:
* Researchers and academics in medical sciences, public health, and epidemiology, seeking quantitative evidence from COVID-19 literature.
* Data scientists and AI developers focusing on extracting structured information from unstructured text, especially in the health domain.
* Healthcare professionals and policymakers who need specific, numerically supported insights into the pandemic's various aspects.
* Anyone requiring tangible and credible numerical information from the biomedical research field to inform their models or analyses.
Original Data Source: COVID-19 Numerical Claims Open Research Dataset
This dataset was collected from data received via this API.
“[Recovered cases are a] more important metric to track than Confirmed cases.”— Researchers for the University of Virginia’s COVID-19 dashboard
If the number of total cases were accurately known for every country then the number of cases per million people would be a good indicator as to how well various countries are handling the pandemic.
№ | column name | Dtype | description |
---|---|---|---|
0 | index | int64 | index |
1 | continent | object | Any of the world's main continuous expanses of land (Europe, Asia, Africa, North and South America, Oceania) |
2 | country | object | A country is a distinct territorial body |
3 | population | float64 | The total number of people in the country |
4 | day | object | YYYY-mm-dd |
5 | time | object | YYYY-mm-ddTHH:MM:SS+UTC |
6 | cases_new | object | The change in total cases relative to the previous record |
7 | cases_active | float64 | Total number of current patients |
8 | cases_critical | float64 | Total number of current seriously ill |
9 | cases_recovered | float64 | Total number of recovered cases |
10 | cases_1M_pop | object | The number of cases per million people |
11 | cases_total | int64 | Records of all cases |
12 | deaths_new | object | The change in total deaths relative to the previous record |
13 | deaths_1M_pop | object | The number of deaths per million people |
14 | deaths_total | float64 | Records of all deaths |
15 | tests_1M_pop | object | The number of tests per million people |
16 | tests_total | float64 | Records of all tests |
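
The per-million columns can be recomputed from the totals and the population, which is a useful sanity check given that several of them are stored with dtype object. The sketch below is a minimal example under assumptions: the helper names are hypothetical, and the idea that object-typed cells may carry a leading "+" or thousands separators is a guess about the raw values, not something the table states.

```python
def to_number(value):
    """Convert a cell such as '+120' or '1,234' to float; None if missing."""
    if value is None:
        return None
    text = str(value).replace(",", "").replace("+", "").strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None  # unparseable cell, treat as missing


def per_million(count, population):
    """Scale a count to 'per one million people'."""
    if count is None or not population:
        return None
    return count * 1_000_000 / population


# Made-up values, not taken from the dataset.
cases_total = to_number("1,500")
print(per_million(cases_total, 3_000_000.0))
```

As the text above notes, such a ratio is only as good as the reported totals, so it compares countries fairly only when case counts are known with similar accuracy.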
The dataset contains COVID-19 data from 232 countries: Afghanistan, Albania, Algeria, Andorra, Angola, Anguilla, Antigua-and-Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bermuda, Bhutan, Bolivia, Bosnia-and-Herzegovina, Botswana, Brazil, British-Virgin-Islands, Brunei, Bulgaria, Burkina-Faso, Burundi, Cabo-Verde, Cambodia, Cameroon, Canada, CAR, Caribbean-Netherlands, Cayman-Islands, Chad, Channel-Islands, Chile, China, Colombia, Comoros, Congo, Cook-Islands, Costa-Rica, Croatia, Cuba, Curaçao, Cyprus, Czechia, Denmark, Diamond-Princess, Diamond-Princess-, Djibouti, Dominica, Dominican-Republic, DRC, Ecuador, Egypt, El-Salvador, Equatorial-Guinea, Eritrea, Estonia, Eswatini, Ethiopia, Faeroe-Islands, Falkland-Islands, Fiji, Finland, France, French-Guiana, French-Polynesia, Gabon, Gambia, Georgia, Germany, Ghana, Gibraltar, Greece, Greenland, Grenada, Guadeloupe, Guam, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Honduras, Hong-Kong, Hungary, Iceland, India, Indonesia, Iran, Iraq, Ireland, Isle-of-Man, Israel, Italy, Ivory-Coast, Jamaica, Japan, Jordan, Kazakhstan, Kenya, Kiribati, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macao, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall-Islands, Martinique, Mauritania, Mauritius, Mayotte, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Montserrat, Morocco, Mozambique, MS-Zaandam, MS-Zaandam-, Myanmar, Namibia, Nepal, Netherlands, New-Caledonia, New-Zealand, Nicaragua, Niger, Nigeria, Niue, North-Macedonia, Norway, Oman, Pakistan, Palau, Palestine, Panama, Papua-New-Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Puerto-Rico, Qatar, Réunion, Romania, Russia, Rwanda, S-Korea, Saint-Helena, Saint-Kitts-and-Nevis, Saint-Lucia, Saint-Martin, Saint-Pierre-Miquelon, Samoa, San-Marino, Sao-Tome-and-Principe, Saudi-Arabia, Senegal, Serbia, Seychelles, Sierra-Leone, Singapore, Sint-Maarten, Slovakia, Slovenia, Solomon-Islands, Somalia, South-Africa, South-Sudan, Spain, Sri-Lanka, St-Barth, St-Vincent-Grenadines, Sudan, Suriname, Sweden, Switzerland, Syria, Taiwan, Tajikistan, Tanzania, Thailand, Timor-Leste, Togo, Tonga, Trinidad-and-Tobago, Tunisia, Turkey, Turks-and-Caicos, UAE, Uganda, UK, Ukraine, Uruguay, US-Virgin-Islands, USA, Uzbekistan, Vanuatu, Vatican-City, Venezuela, Vietnam, Wallis-and-Futuna, Western-Sahara, Yemen, Zambia, Zimbabwe
https://ottawa.ca/en/city-hall/get-know-your-city/open-data#open-data-licence-version-2-0
This file contains data regarding a 7-day average of the estimated instantaneous reproduction number, R(t), of COVID-19 in Ottawa. The reproduction number, R, is the average number of secondary cases of disease caused by a single infected individual over his or her infectious period. R(t) values greater than 1 indicate the virus is spreading faster and each case infects more than one contact, and less than 1 indicates the spread is slowing and the epidemic is coming under control.
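The reading of R(t) described above can be written as a small helper. This is a sketch of one possible interpretation rule, not part of the Ottawa Public Health methodology: the function name and labels are hypothetical, and using the 95% confidence interval bounds that the file also provides to separate clear from uncertain cases is an assumption added for illustration.

```python
def interpret_rt(estimate, lower=None, upper=None):
    """Describe an R(t) value relative to the threshold of 1."""
    if lower is not None and lower > 1:
        return "spreading (interval above 1)"
    if upper is not None and upper < 1:
        return "slowing (interval below 1)"
    if estimate > 1:
        return "likely spreading"
    if estimate < 1:
        return "likely slowing"
    return "stable"


# Made-up values, not taken from the Ottawa file.
print(interpret_rt(1.3, lower=1.1, upper=1.5))
print(interpret_rt(0.9, lower=0.7, upper=1.2))
```

When the interval straddles 1, the point estimate alone cannot say with confidence whether the epidemic is growing or shrinking, which is why the second example is labelled only as "likely".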
R(t) was calculated using the EpiEstim package, developed by Cori et al. (2013; DOI: 10.1093/aje/kwt133), in the R software environment for statistical computing and graphics. Accurate episode date was used as the time anchor and cases were assigned as having a local or travel-related source of infection.
Accuracy: Points of consideration for interpretation of the data:
* Data are entered into and extracted by Ottawa Public Health from the Public Health Case and Contact Management Solution (CCM; in French, la Solution de gestion des cas et des contacts pour la santé publique, or Solution GCC). The CCM is a dynamic disease reporting system that allows for ongoing updates; data represent a snapshot at the time of extraction and may differ from previous or subsequent reports.
* As the cases are investigated and more information becomes available, the dates are updated.
* A person’s exposure may have occurred up to 14 days prior to onset of symptoms. Symptomatic cases occurring in approximately the last 14 days are likely under-reported due to the time needed for individuals to seek medical assessment, availability of testing, and receipt of test results.
* Confirmed cases are those with a confirmed COVID-19 laboratory result as per the Ministry of Health document Public health management of cases and contacts of COVID-19 in Ontario, March 25, 2020, version 6.0.
* Counts will be subject to varying degrees of underreporting due to a variety of factors, such as disease awareness and medical care seeking behaviours, which may depend on severity of illness, clinical practice, changes in laboratory testing, and reporting behaviours.
* Surveillance testing for COVID-19 began in long term care facilities on April 25, 2020.

Attributes: Data fields:
* Date – the earliest of symptom onset, test or reported date for cases (YYYY-MM-DD H:MM).
* Lower Bound - 95% Confidence Interval – lower bound of the 95% confidence interval for the 7-day average of the R(t) estimate.
* Upper Bound - 95% Confidence Interval – upper bound of the 95% confidence interval for the 7-day average of the R(t) estimate.
* Estimate of R(t) (7 Day Average) – 7-day average of the estimated instantaneous reproduction number, R(t), of COVID-19 in Ottawa.
* Nowcasting Adjusted Cases by Episode Date – number of Ottawa residents with confirmed COVID-19 by episode date. Counts for the most recent 14 days represent a nowcasting adjusted estimate developed by R. Imgrund in 2020. The model uses linear regression to estimate the number of future cases expected to have an accurate episode date within that 14-day window.

Update Frequency: As of March 2022, the dataset is no longer updated. Historical data only.
Contact: OPH Epidemiology Team
Late in December 2019, the World Health Organisation (WHO) China Country Office obtained information about severe pneumonia of an unknown cause, detected in the city of Wuhan in Hubei province, China. This later turned out to be the novel coronavirus disease (COVID-19), an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) of the coronavirus family. The disease causes respiratory illness characterized by primary symptoms such as cough and fever and, in more acute cases, difficulty in breathing. WHO later declared COVID-19 a pandemic because of its rapid spread across the globe, with over 5.9 million confirmed cases and over 365,000 deaths as of May 30, 2020. Some countries on the African continent started confirming their first cases of COVID-19 in late January and early February of 2020. The disease has since spread across all 54 African countries, with over 135,000 confirmed cases and over 3,900 deaths as of May 30, 2020.
The COVID-19 Africa dataset contains daily level information about the COVID-19 cases in Africa since January 27th, 2020. It is time-series data, and the number of cases on any given day is cumulative. The original datasets can be found on this Johns Hopkins University GitHub repository. The R script that I used to prepare this dataset is also available on my GitHub repository. I will be updating the COVID-19 Africa dataset on a daily basis, with every update from Johns Hopkins University.
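Because the counts are cumulative, daily new cases have to be derived by differencing consecutive days. The snippet below is a minimal sketch of that step; the function name and the numbers are made up, and it assumes the series is sorted by date and starts at or before the first reported case.

```python
def daily_new(cumulative):
    """Turn a cumulative series into day-by-day increments."""
    new = []
    previous = 0
    for total in cumulative:
        new.append(total - previous)
        previous = total
    return new


# Made-up cumulative counts for four consecutive days.
print(daily_new([1, 3, 3, 7]))  # → [1, 2, 0, 4]
```

A negative increment would indicate that the source revised an earlier cumulative total downwards, so such values are worth checking rather than silently clipping to zero.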
Possible Insights
1. The current number of COVID-19 cases in Africa
2. The current number of COVID-19 cases by country
3. The number of COVID-19 cases in Africa or in individual African countries by a given future date (for example, May 30, 2020)