100+ datasets found

m
Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First...
data.mendeley.com
Updated Jul 20, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hasmot Ali (2020). Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First Case And First Death [Dataset]. http://doi.org/10.17632/vw427wzzkk.5
Explore at:
Unique identifier
https://doi.org/10.17632/vw427wzzkk.5
Dataset updated
Jul 20, 2020
Authors
Hasmot Ali
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Contain informative data related to COVID-19 pandemic. Specially, figure out about the First Case and First Death information for every single country. The datasets mainly focus on two major fields first one is First Case which consists of information of Date of First Case(s), Number of confirm Case(s) at First Day, Age of the patient(s) of First Case, Last Visited Country and the other one First Death information consist of Date of First Death and Age of the Patient who died first for every Country mentioning corresponding Continent. The datasets also contain the Binary Matrix of spread chain among different country and region.

*This is not a country. This is a ship. The name of the Cruise Ship was not given from the government.
"N+": the age is not specified but greater than N
“No Trace”: some data was not found
“Unspecified”: not available from the authority
“N/A”: for “Last Visited Country(s) of Confirmed Case(s)” column, “N/A” indicates that the confirmed case(s) of those countries do not have any travel history in recent past; in “Age of First Death(s)” column “N/A” indicates that those countries do not have may death case till May 16, 2020.
Parameterizing Spatial Models of Infectious Disease Transmission that...
plos.figshare.com
pdf
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rajat Malik; Rob Deardon; Grace P. S. Kwong (2023). Parameterizing Spatial Models of Infectious Disease Transmission that Incorporate Infection Time Uncertainty Using Sampling-Based Likelihood Approximations [Dataset]. http://doi.org/10.1371/journal.pone.0146253
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0146253
Dataset updated
Jun 4, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Rajat Malik; Rob Deardon; Grace P. S. Kwong
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A class of discrete-time models of infectious disease spread, referred to as individual-level models (ILMs), are typically fitted in a Bayesian Markov chain Monte Carlo (MCMC) framework. These models quantify probabilistic outcomes regarding the risk of infection of susceptible individuals due to various susceptibility and transmissibility factors, including their spatial distance from infectious individuals. The infectious pressure from infected individuals exerted on susceptible individuals is intrinsic to these ILMs. Unfortunately, quantifying this infectious pressure for data sets containing many individuals can be computationally burdensome, leading to a time-consuming likelihood calculation and, thus, computationally prohibitive MCMC-based analysis. This problem worsens when using data augmentation to allow for uncertainty in infection times. In this paper, we develop sampling methods that can be used to calculate a fast, approximate likelihood when fitting such disease models. A simple random sampling approach is initially considered followed by various spatially-stratified schemes. We test and compare the performance of our methods with both simulated data and data from the 2001 foot-and-mouth disease (FMD) epidemic in the U.K. Our results indicate that substantial computation savings can be obtained—albeit, of course, with some information loss—suggesting that such techniques may be of use in the analysis of very large epidemic data sets.
d
Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends,...
datarade.ai
.json, .csv
Updated Aug 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataplex (2024). Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-consumer-behavior-data-2-1m-subred-dataplex
Explore at:
.json, .csvAvailable download formats
Dataset updated
Aug 7, 2024
Dataset authored and provided by
Dataplex
Area covered
Cuba, Tunisia, Saint Barthélemy, Netherlands, Lithuania, Togo, Cocos (Keeling) Islands, Burkina Faso, Croatia, Belize
Description
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

Dataset Overview:

This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide: - Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions. - Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions. - Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

Sourced Directly from Reddit:

All data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

Key Features:

Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.

User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.

Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.

AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

Use Cases:

Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.

Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.

Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.

Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

Data Quality and Reliability:

The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

Integration and Usability:

The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

User-Friendly Structure and Metadata:

The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

Ideal For:

Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.

Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.

Researchers: Explore consumer behavior data of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conducting acade...
Worldwide COVID-19 Data from WHO (2025 Edition)
kaggle.com
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adil Shamim (2025). Worldwide COVID-19 Data from WHO (2025 Edition) [Dataset]. https://www.kaggle.com/datasets/adilshamim8/worldwide-covid-19-data-from-who
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 3, 2025
Dataset provided by
Kaggle
Authors
Adil Shamim
Description
Dataset Overview

This dataset contains global COVID-19 case and death data by country, collected directly from the official World Health Organization (WHO) COVID-19 Dashboard. It provides a comprehensive view of the pandemic’s impact worldwide, covering the period up to 2025. The dataset is intended for researchers, analysts, and anyone interested in understanding the progression and global effects of COVID-19 through reliable, up-to-date information.

Source Information

Website: WHO COVID-19 Dashboard

Organization: World Health Organization (WHO)

Data Coverage: Global (by country/territory)

Time Period: Up to 2025

The World Health Organization is the United Nations agency responsible for international public health. The WHO COVID-19 Dashboard is a trusted source that aggregates official reports from countries and territories around the world, providing daily updates on cases, deaths, and other key metrics related to COVID-19.

Dataset Contents

Country/Region: The name of the country or territory.

Date: Reporting date.

New Cases: Number of new confirmed COVID-19 cases.

Cumulative Cases: Total confirmed COVID-19 cases to date.

New Deaths: Number of new confirmed deaths due to COVID-19.

Cumulative Deaths: Total deaths reported to date.

Additional fields may include population, rates per 100,000, and more (see data files for details).

How to Use

This dataset can be used for: - Tracking the spread and trends of COVID-19 globally and by country - Modeling and forecasting pandemic progression - Comparative analysis of the pandemic’s impact across countries and regions - Visualization and reporting

Data Reliability

The data is sourced from the WHO, widely regarded as the most authoritative source for global health statistics. However, reporting practices and data completeness may vary by country and may be subject to revision as new information becomes available.

Acknowledgements

Special thanks to the WHO for making this data publicly available and to all those working to collect, verify, and report COVID-19 statistics.
n
ECMWF ERA5.1: 10 ensemble member surface level analysis parameter data for...
data-search.nerc.ac.uk
catalogue.ceda.ac.uk
Updated Dec 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). ECMWF ERA5.1: 10 ensemble member surface level analysis parameter data for 2000-2006 [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=ensemble%20runs
Explore at:
Dataset updated
Dec 8, 2023
Description
This dataset contains ERA5.1 surface level analysis parameter data for the period 2000-2006 from 10 member ensemble runs. ERA5.1 is the European Centre for Medium-Range Weather Forecasts (ECWMF) ERA5 reanalysis project re-run for 2000-2006 to improve upon the cold bias in the lower stratosphere seen in ERA5 (see technical memorandum 859 in the linked documentation section for further details). Ensemble means and spreads are calculated from these 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables and have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the linked Copernicus Data Store (CDS) data tool, linked to from this record. Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10-members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus were calculated by dividing by 10 rather than 9 (N-1). See linked datasets for ensemble mean and ensemble spread data. The main ERA5 global atmospheric reanalysis of the covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 rand ERA-interim re-analysis projects. An initial release of ERA5 data, ERA5t, are also available upto 5 days behind the present. A limited selection of data from these runs are also available via CEDA, whilst full access is available via the Copernicus Data Store.
Z
Data from: WildfireSpreadTS: A dataset of multi-modal time series for...
data.niaid.nih.gov
zenodo.org
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gerard, Sebastian (2024). WildfireSpreadTS: A dataset of multi-modal time series for wildfire spread prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8006176
Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Sullivan, Josephine
Gerard, Sebastian
Zhao, Yu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a multi-temporal, multi-modal remote-sensing dataset for predicting how active wildfires will spread at a resolution of 24 hours. The dataset consists of 13.607 images across 607 fire events in the United States from January 2018 to October 2021. For each fire event, the dataset contains a full time series of daily observations, containing detected active fires and variables related to fuel, topography and weather conditions. Documentation WildfireSpreadTS_Documentation.pdf includes further details about the dataset, following Gebru et al.'s "Datasheets for Datasets" framework. This documentation is similar to the supplementary material of the associated NeurIPS paper, excluding only information about experimental setup and results. For full details, please refer to the associated paper. Code: Getting started Get started working with the dataset at https://github.com/SebastianGer/WildfireSpreadTS. The code includes a PyTorch Dataset and Lightning DataModule to allow for easy access. We recommend converting the GeoTIFF files provided here to HDF5 files (bigger files, but much faster). The necessary code is also available in the repository.

This work is funded by Digital Futures in the project EO-AI4GlobalChange. The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at C3SE partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
Z
Data from: A dataset of Covid-related misinformation videos and their spread...
data.niaid.nih.gov
Updated Feb 24, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Knuutila, Aleksi (2021). A dataset of Covid-related misinformation videos and their spread on social media [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4557827
Explore at:
Dataset updated
Feb 24, 2021
Dataset authored and provided by
Knuutila, Aleksi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains metadata about all Covid-related YouTube videos which circulated on public social media, but which YouTube eventually removed because they contained false information. It describes 8,122 videos that were shared between November 2019 and June 2020. The dataset contains unique identifiers for the videos and social media accounts that shared the videos, statistics on social media engagement and metadata such as video titles and view counts where they were recoverable. We publish the data alongside the code used to produce on Github. The dataset has reuse potential for research studying narratives related to the coronavirus, the impact of social media on knowledge about health and the politics of social media platforms.
n
ECMWF ERA5t: 10 ensemble member surface level analysis parameter data
data-search.nerc.ac.uk
catalogue.ceda.ac.uk
Updated Dec 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). ECMWF ERA5t: 10 ensemble member surface level analysis parameter data [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=ensemble%20runs
Explore at:
Dataset updated
Dec 8, 2023
Description
This dataset contains ERA5 initial release (ERA5t) surface level analysis parameter data from 10 member ensemble runs. ERA5t is the European Centre for Medium-Range Weather Forecasts (ECWMF) ERA5 reanalysis project initial release available upto 5 days behind the present data. CEDA will maintain a 6 month rolling archive of these data with overlap to the verified ERA5 data - see linked datasets on this record. Ensemble means and spreads were calculated from the ERA5t 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables and have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the linked Copernicus Data Store (CDS) data tool, linked to from this record. See linked datasets for ensemble member and spread data. Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10-members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus were calculated by dividing by 10 rather than 9 (N-1). See linked datasets for ensemble mean and ensemble spread data. The ERA5 global atmospheric reanalysis of the covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 rand ERA-interim re-analysis projects. An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These will be subsequently reviewed and, if required, amended before the full ERA5 release. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record.
A
‘Pakistan Corona Virus Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Sep 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Pakistan Corona Virus Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pakistan-corona-virus-dataset-7f50/f59c6dcf/?iid=027-428&v=presentation
Explore at:
Dataset updated
Sep 30, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Pakistan
Description
Analysis of ‘Pakistan Corona Virus Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/pakistan-corona-virus-citywise-data on 30 September 2021.

--- Dataset description provided by original source is as follows ---

Context

Pakistan witnessed its first Corona virus patient on February 26th 2020. It's a bumpy ride since then. The cases are increasing gradually and we haven't seen the worst yet. While, there are few government resources for cumulative updates, there is no place where you can find the city level patients data. It's also not possible to find the running chronological tally of patients as they test positive. We have decided to create our own dataset for all the researchers out there with such details so we can model the infection spread and forecast the situation in coming days. We hope, by doing so, we will be able to inform policy makers on various intervention models, and healthcare professionals to be ready for the influx of new patients. We certainly hope, that this little contribution will go a long way for saving lives in Pakistan

Content

The dataset contains seven columns for date, number of cases, number of deaths, number of people recovered, travel history of those cases, and location of the cases (province and city).

The first version has the data from first case of February 26 2020 to April 19, 2020. We intend to publish weekly updates

Acknowledgements

Users are allowed to use, copy, distribute and cite the dataset as follows: “Zeeshan-ul-hassan Usmani, Sana Rasheed, Pakistan Corona Virus Data, Kaggle Dataset Repository, April 19, 2020.”

Inspiration

Some ideas worth exploring:

Can we find the spread factor for the Corona virus in Pakistan?

How long it takes for a positive case to infect another in Pakistan?

How we can use this data to simulate lock down scenarios and find its impact on country's economy? Here is a good
read to get started - http://zeeshanusmani.com/urdu/corona-economic-impact/

How does Pakistan Corona virus spread compare against its neighbors and other developed counties?

What would be the impact of this infection spread on country's economy and people living under poverty? Here are two briefs to get you started

http://zeeshanusmani.com/urdu/corona/ http://zeeshanusmani.com/urdu/corona-what-to-learn/

How do we visualize this dataset to inform policy makers? Here is one example https://zeeshanusmani.com/corona/

Can we predict the number of cases in next 10 days and a month?

--- Original source retains full ownership of the source dataset ---
c
Witty Worm dataset
catalog.caida.org
Updated Feb 25, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CAIDA (2021). Witty Worm dataset [Dataset]. https://catalog.caida.org/dataset/telescope_witty_worm
Explore at:
Dataset updated
Feb 25, 2021
Dataset authored and provided by
CAIDA
License
https://www.caida.org/about/legal/aua/public_aua/https://www.caida.org/about/legal/aua/public_aua/
https://www.caida.org/about/legal/aua/https://www.caida.org/about/legal/aua/
Time period covered
Mar 19, 2004 - Mar 25, 2004
Description
This dataset contains information useful for studying the spread of the Witty worm. Data were collected on the UCSD Network Telescope between Fri Mar 19 20:01:40 PST 2004 and Wed Mar 25 00:01:40 PST 2004. The dataset is exclusively available through Impact. Up until Feb 2014 the dataset was online in two portions, one public, one restricted. After Feb 2014 the whole Witty dataset was public until mid-2016. The publicly available set of files contains summarized information that does not individually identify infected computers. These are the files in witty/summaries/public. The restricted-access set of files that do contain more sensitive information, including packet traces containing complete IP and UDP headers and partial payload received from hosts spreading the Witty worm March 19-24, 2004. It also includes routing tables and summaries. These are the files in witty/data and witty/summaries/restricted/.
News Popularity in Multiple Social Media Platforms
kaggle.com
zip
Updated Oct 28, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nikhil John (2020). News Popularity in Multiple Social Media Platforms [Dataset]. https://www.kaggle.com/nikhiljohnk/news-popularity-in-multiple-social-media-platforms
Explore at:
zip(10881978 bytes)Available download formats
Dataset updated
Oct 28, 2020
Authors
Nikhil John
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Social Media has been taking up everything on the Internet. People getting the latest news, useful resources, life partner and what not. In a world where Social media plays a big role in giving news, we must also know that news which affects our sentiments are going to get spread like a wildfire. Based on the Headline and the title, and according to the date given and the Social media platforms, you have to predict how it has affected the human sentiment scores. You have to predict the column “SentimentTitle” and “SentimentHeadline”.

Content

This is a subset of the dataset of the same name available in the UCI Machine Learning Repository The collected data relates to a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: economy, microsoft, obama and palestine.

Dataset Information

The attributes for each of the dataset are : - IDLink (numeric): Unique identifier of news items - Title (string): Title of the news item according to the official media sources - Headline (string): Headline of the news item according to the official media sources - Source (string): Original news outlet that published the news item - Topic (string): Query topic used to obtain the items in the official media sources - Publish-Date (timestamp): Date and time of the news items' publication - Facebook (numeric): Final value of the news items' popularity according to the social media source Facebook - Google-Plus (numeric): Final value of the news items' popularity according to the social media source Google+ - LinkedIn (numeric): Final value of the news items' popularity according to the social media source LinkedIn - SentimentTitle: Sentiment score of the title, Higher the score, better is the impact or +ve sentiment and vice-versa. (Target Variable 1) - SentimentHeadline: Sentiment score of the text in the news items' headline. Higher the score, better is the impact or +ve sentiment. (Target Variable 2)
h
open-prices-hazelnut-spread-challenge
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alex B, open-prices-hazelnut-spread-challenge [Dataset]. https://huggingface.co/datasets/TTx-hf/open-prices-hazelnut-spread-challenge
Explore at:
Authors
Alex B
License
https://choosealicense.com/licenses/odbl/https://choosealicense.com/licenses/odbl/
Description
This dataset is an extended extract of the main Open Prices Dataset https://huggingface.co/datasets/openfoodfacts/open-prices

What is Open Prices?

Open Prices is a project to collect and share prices of products around the world. It's a publicly available dataset that can be used for research, analysis, and more. Open Prices is developed and maintained by Open Food Facts. There are currently few companies that own large databases of product prices at the barcode level. These prices… See the full description on the dataset page: https://huggingface.co/datasets/TTx-hf/open-prices-hazelnut-spread-challenge.
SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest...
zenodo.org
bin
Updated Dec 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhengpeng Feng; Yihang She; Keshav Srinivasan; Zhengpeng Feng; Yihang She; Keshav Srinivasan (2024). SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest Vision Tasks (Part II) [Dataset]. http://doi.org/10.5281/zenodo.14525290
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14525290
Dataset updated
Dec 19, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Zhengpeng Feng; Yihang She; Keshav Srinivasan; Zhengpeng Feng; Yihang She; Keshav Srinivasan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This page only provides the drone-view image dataset.

For the ground-level image dataset, please visit SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest Vision Tasks (Part I).

For the point clouds, please visit SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest Vision Tasks (Part III).

The dataset contains drone-view RGB images, depth maps and instance segmentation labels collected from different scenes. Data from each scene is stored in a separate .7z file, along with a color_palette.xlsx file, which contains the RGB_id and corresponding RGB values.

All files follow the naming convention: {central_tree_id}_{timestamp}, where {central_tree_id} represents the ID of the tree centered in the image, which is typically in a prominent position, and timestamp indicates the time when the data was collected.

Specifically, each 7z file includes the following folders:

rgb: This folder contains the RGB images (PNG) of the scenes and their metadata (TXT). The metadata describes the weather conditions and the world time when the image was captured. An example metadata entry is: Weather:Snow_Blizzard,Hour:10,Minute:56,Second:36.

depth_pfm: This folder contains absolute depth information of the scenes, which can be used to reconstruct the point cloud of the scene through reprojection.

instance_segmentation: This folder stores instance segmentation labels (PNG) for each tree in the scene, along with metadata (TXT) that maps tree_id to RGB_id. The tree_id can be used to look up detailed information about each tree in obj_info_final.xlsx, while the RGB_id can be matched to the corresponding RGB values in color_palette.xlsx. This mapping allows for identifying which tree corresponds to a specific color in the segmentation image.

obj_info_final.xlsx: This file contains detailed information about each tree in the scene, such as position, scale, species, and various parameters, including trunk diameter (in cm), tree height (in cm), and canopy diameter (in cm).

landscape_info.txt: This file contains the ground location information within the scene, sampled every 0.5 meters.

For birch_forest, broadleaf_forest, redwood_forest and rainforest, we also provided COCO-format annotation files (.json). Two such files can be found in these datasets:

{name}_coco.json: This file contains the annotation of each tree in the scene.

{name}_filtered.json: This file is derived from the previous one, but filtering is applied to rule out overlapping instances.

⚠️: 7z files that begin with "!" indicate that the RGB values in the images within the instance_segmentation folder cannot be found in color_palette.xlsx. Consequently, this prevents matching the trees in the segmentation images to their corresponding tree information, which may hinder the application of the dataset to certain tasks. This issue is related to a bug in Colossium/AirSim, which has been reported in link1 and link2.
h
next-day-wildfire-spread
huggingface.co
Updated Aug 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrzej Szablewski (2024). next-day-wildfire-spread [Dataset]. https://huggingface.co/datasets/TheRootOf3/next-day-wildfire-spread
Explore at:
Dataset updated
Aug 9, 2024
Authors
Andrzej Szablewski
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Next Day Wildfire Spread Dataset

This dataset is an xarray version of the original Next Day Wildfire Spread dataset. It comes in three splits: train, eval and test. Note: Given the original dataset does not contain spatio-temporal information, the xarray coordinates has been set to arbitrary ranges (0-63 for spatial dimensions and 0-number_of_samples for the temporal dimension).

Example

To open a train split of the dataset and show an elevation plot at time=2137:… See the full description on the dataset page: https://huggingface.co/datasets/TheRootOf3/next-day-wildfire-spread.
H
Replication Data for: The Spread of Synthetic Media on X
dataverse.harvard.edu
search.dataone.org
Updated Nov 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giulio Corsi; Bill Marino; Willow Wong (2024). Replication Data for: The Spread of Synthetic Media on X [Dataset]. http://doi.org/10.7910/DVN/QYS1VH
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/QYS1VH
Dataset updated
Nov 21, 2024
Dataset provided by
Harvard Dataverse
Authors
Giulio Corsi; Bill Marino; Willow Wong
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset supports the research article "The Spread of Synthetic Media on X". In compliance with X's terms of service, we are only able to provide the tweet IDs used in our analysis rather than the full annotated tweet data. However, to enable reproducibility and further research, we are also making available the Python Jupyter notebook used for data collection and analysis, as well as the validation dataset used to assess the performance of our classification models. Additional, the Community Notes datasets are public and accessible at: https://communitynotes.x.com/guide/en/under-the-hood/download-data
A Twitter Dataset of 70+ million tweets related to COVID-19
zenodo.org
csv, tsv, zip
Updated Apr 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juan M. Banda; Juan M. Banda; Ramya Tekumalla; Ramya Tekumalla; Gerardo Chowell; Gerardo Chowell (2023). A Twitter Dataset of 70+ million tweets related to COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.3732460
Explore at:
csv, tsv, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3732460
Dataset updated
Apr 17, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Juan M. Banda; Juan M. Banda; Ramya Tekumalla; Ramya Tekumalla; Gerardo Chowell; Gerardo Chowell
Description
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts as we filtered other data we were collecting for other research purposes, however, one can see the dramatic increase as the awareness for the virus spread. Dedicated data gathering started from March 11th to March 29th which yielded over 4 million tweets a day.

The data collected from the stream captures all languages, but the higher prevalence are: English, Spanish, and French. We release all tweets and retweets on the full_dataset.tsv file (70,569,368 unique tweets), and a cleaned version with no retweets on the full_dataset-clean.tsv file (13,535,912 unique tweets). There are several practical reasons for us to leave the retweets, tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.

More details can be found (and will be updated faster at: https://github.com/thepanacealab/covid19_twitter)

As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter to re-distribute Twitter data. The need to be hydrated to be used.
Z
INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET
data.niaid.nih.gov
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Kishor Datta Gupta
Nishat Anjum
Nafiz Sadman
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bangladesh, United States
Description
Introduction

There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Standford NLP and SVM by Rameshbhaiet. Al.compared several algorithms on a small and large dataset. Rubinet. al., in their paper [ 2 ], created a mechanism to differentiate fake news from real ones by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumitet. al.in [ 3 ] have implemented LDA, a topic modeling approach to study bias present in online news media.

However, there are not many NLP research invested in studying COVID-19. Most applications include classification of chest X-rays and CT-scans to detect presence of pneumonia in lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus[ 5 ][ 6 ][ 7 ] and replicating its structure to fight and find a vaccine. This research is crucial in battling the pandemic. The few NLP based research publications are sentiment classification of online tweets by Samuel et el [ 8 ] to understand fear persisting in people due to the virus. Similar work has been done using the LSTM network to classify sentiments from online discussion forums by Jelodaret. al.[ 9 ]. NKK dataset is the first study on a comparatively larger dataset of a newspaper report on COVID-19, which contributed to the virus’s awareness to the best of our knowledge.

2 Data-set Introduction

2.1 Data Collection

We accumulated 1000 online newspaper report from United States of America (USA) on COVID-19. The newspaper includes The Washington Post (USA) and StarTribune (USA). We have named it as “Covid-News-USA-NNK”. We also accumulated 50 online newspaper report from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspaper includes The Daily Star (BD) and Prothom Alo (BD). All these newspapers are from the top provider and top read in the respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was suitable compared to automation to ensure the news were highly relevant to the subject. The newspaper online sites had dynamic content with advertisements in no particular order. Therefore there were high chances of online scrappers to collect inaccurate news reports. One of the challenges while collecting the data is the requirement of subscription. Each newspaper required $1 per subscriptions. Some criteria in collecting the news reports provided as guideline to the human data-collectors were as follows:

The headline must have one or more words directly or indirectly related to COVID-19.

The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

The genre of the news can be anything as long as it is relevant to the topic. Political, social, economical genres are to be more prioritized.

Avoid taking duplicate reports.

Maintain a time frame for the above mentioned newspapers.

To collect these data we used a google form for USA and BD. We have two human editor to go through each entry to check any spam or troll entry.

2.2 Data Pre-processing and Statistics

Some pre-processing steps performed on the newspaper report dataset are as follows:

Remove hyperlinks.

Remove non-English alphanumeric characters.

Remove stop words.

Lemmatize text.

While more pre-processing could have been applied, we tried to keep the data as much unchanged as possible since changing sentence structures could result us in valuable information loss. While this was done with help of a script, we also assigned same human collectors to cross check for any presence of the above mentioned criteria.

The primary data statistics of the two dataset are shown in Table 1 and 2.

Table 1: Covid-News-USA-NNK data statistics

No of words per headline

7 to 20

No of words per body content

150 to 2100

Table 2: Covid-News-BD-NNK data statistics No of words per headline

10 to 20

No of words per body content

100 to 1500

2.3 Dataset Repository

We used GitHub as our primary data repository in account name NKK^1. Here, we created two repositories USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON format. We are regularly updating the CSV files and regenerating JSON using a py script. We provided a python script file for essential operation. We welcome all outside collaboration to enrich the dataset.

3 Literature Review

Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

Some well-known applications of NLP includes fraud detection on online media sites[ 10 ], using authorship attribution in fallback authentication systems[ 11 ], intelligent conversational agents or chatbots[ 12 ] and machine translations used by Google Translate[ 13 ]. While these are all downstream tasks, several exciting developments have been made in the algorithm solely for Natural Language Processing tasks. The two most trending ones are BERT[ 14 ], which uses bidirectional encoder-decoder architecture to create the transformer model, that can do near-perfect classification tasks and next-word predictions for next generations, and GPT-3 models released by OpenAI[ 15 ] that can generate texts almost human-like. However, these are all pre-trained models since they carry huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc[ 16 ]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of text. One commonly used topic modeling is Latent Dirichlet Allocation or LDA[17].

Keyword extraction is a process of information extraction and sub-task of NLP to extract essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and pick the words with more weight to it.

Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

4 Our experiments and Result analysis

We used the wordcloud library^4 to create the word clouds. Figure 1 and 3 presents the word cloud of Covid-News-USA- NNK dataset by month from February to May. From the figures 1,2,3, we can point few information:

In February, both the news paper have talked about China and source of the outbreak.

StarTribune emphasized on Minnesota as the most concerned state. In April, it seemed to have been concerned more.

Both the newspaper talked about the virus impacting the economy, i.e, bank, elections, administrations, markets.

Washington Post discussed global issues more than StarTribune.

StarTribune in February mentioned the first precautionary measurement: wearing masks, and the uncontrollable spread of the virus throughout the nation.

While both the newspaper mentioned the outbreak in China in February, the weight of the spread in the United States are more highlighted through out March till May, displaying the critical impact caused by the virus.

We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’ , ’Infections’, ’Quarantined’, Lock-down’, ’Diagnosed’ etc from the news reports and created a number of cases for both the newspaper. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for the covid cases as it gradually rose from February. Both the newspaper clearly shows us that the rise in covid cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack. We used Vader Sentiment Analysis to extract sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. Vader Sentiment scale ranges from -1(highly negative to 1(highly positive). There were some cases

where the sentiment scores of the headline and body contradicted each other,i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can assist us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide us information about how a state or country is reacting to the pandemic. We used PageRank algorithm to extract keywords from headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both the datasets are: ’China’, Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’ , ’Stock market’ , ’Jobs’ , ’Election’, ’Missteps’, ’Health’, ’Response’. Keywords extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
ECE657AW20-ASG4-Coronavirus
kaggle.com
Updated Jul 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MarkCrowley (2025). ECE657AW20-ASG4-Coronavirus [Dataset]. https://www.kaggle.com/markcrowley/ece657aw20asg4coronavirus/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 9, 2025
Dataset provided by
Kaggle
Authors
MarkCrowley
Description
COVID-19 Data for Analysis and Machine Learning

There are lots of datasets online, more growing every day, to help us all get a handle on this pandemic. Here are just a few links to data we've found that students in ECE 657A, and anyone else who finds their way here, can play with and practice their machine learning skills. The main dataset is the COVID-19 dataset from John Hopkins university. This data is perfect for time series analysis and Recurrent Neural Networks, the final topic in the course. This dataset will be left public so anyone can see it but to join you must request the link from Prof. Crowley or be in the ECE 657A W20 course at the University of Waterloo.

For ECE 657A W20 Students

Your bonus grade for assignment 4 comes from creating a kernel from this dataset and writing up some useful analysis and publishing that notebook. You can do any kind of analysis you like but some good places to start are - Analysis: feature extraction and analysis of the data to look for patterns that aren't evident from the original features (this is hard for the simple spread/infection/death data since there aren't that many features) - Other Data: utilize any other datasets in your kernels by loading data about the countries themselves (population, density, wealthy etc.) or their responses to the situation. Tip: If you open a New Notebook related to this dataset you can easily add new data available on Kaggle and link that to you analysis. - HOW'S MY FLATTENING COVID19 DATASET - This dataset has a lot more files and includes a lot of what I was talking about, so if you produce good kernels there you can also count them for your asg4 grade. https://www.kaggle.com/howsmyflattening/covid19-challenges - Predict: make predictions about confirmed cases, deaths, recoveries or other metrics for the future. You can test you models by training on the past and predicting on the following days, then post a prediction for tomorrow or the next few days given ALL the data up to this point. Hopefully the datasets we've linked here will updated automatically so your kernels would update as well. - Create Tasks: you can make your own "Tasks" as part of this kaggle and propose your own solution to it. Then others can try solving it as well. - Groups: students can do this assignment either in the same groups they had for assignment 3 or individually.

Suggest other datasets

We're happy to add other relevant data to this Kaggle, in particular it would be great to integrate live data on the following: - Progression of each country/region/city in "days since X Level" such as Days since 100 confirmed cases, see the link for a great example such a dataset being plotted. I haven't see a live link to a csv of that data, but we could generate. - Mitigation Policies enacted by local governments in each city/region/country. These are dates when that region first enacted Level 1, 2, 3, 4 containment, or started encouraging social distancing or the date when they closed different levels of schools, pubs, restaurants etc. - The hidden positives: this would be a dataset, or method for estimating, as described by Emtiyaz Khan in this twitter thread. The idea is, how many unreported or unconfirmed cases are there in any region, and can we build an estimate of that number using other regions with widespread testing as a baseline and the death rates which are like an observation of a process with a hidden variable or true infection rate. - Paper discussing one way to compute this : https://cmmid.github.io/topics/covid19/severity/global_cfr_estimates.html
d
Johns Hopkins COVID-19 Case Tracker
data.world
csv, zip
Updated Jul 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Associated Press (2025). Johns Hopkins COVID-19 Case Tracker [Dataset]. https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker
Explore at:
zip, csvAvailable download formats
Dataset updated
Jul 14, 2025
Authors
The Associated Press
Time period covered
Jan 22, 2020 - Mar 9, 2023
Area covered
Description
Updates

Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.

CDC Weekly case and death counts (national and state level)

CDC County level cases and deaths

HHS New hospital admissions

CDC NowCast COVID variant proportions (national and regional level)

April 9, 2020

The population estimate data for New York County, NY has been updated to include all five New York City counties (Kings County, Queens County, Bronx County, Richmond County and New York County). This has been done to match the Johns Hopkins COVID-19 data, which aggregates counts for the five New York City counties to New York County.

April 20, 2020

Johns Hopkins death totals in the US now include confirmed and probable deaths in accordance with CDC guidelines as of April 14. One significant result of this change was an increase of more than 3,700 deaths in the New York City count. This change will likely result in increases for death counts elsewhere as well. The AP does not alter the Johns Hopkins source data, so probable deaths are included in this dataset as well.

April 29, 2020

The AP is now providing timeseries data for counts of COVID-19 cases and deaths. The raw counts are provided here unaltered, along with a population column with Census ACS-5 estimates and calculated daily case and death rates per 100,000 people. Please read the updated caveats section for more information.

September 1st, 2020

Johns Hopkins is now providing counts for the five New York City counties individually.

February 12, 2021

The Ohio Department of Health recently announced that as many as 4,000 COVID-19 deaths may have been underreported through the state’s reporting system, and that the "daily reported death counts will be high for a two to three-day period."

Because deaths data will be anomalous for consecutive days, we have chosen to freeze Ohio's rolling average for daily deaths at the last valid measure until Johns Hopkins is able to back-distribute the data. The raw daily death counts, as reported by Johns Hopkins and including the backlogged death data, will still be present in the new_deaths column.

February 16, 2021

- Johns Hopkins has reconciled Ohio's historical deaths data with the state.

Overview

The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.

The AP is updating this dataset hourly at 45 minutes past the hour.

To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

Queries

Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic

Filter cases by state here

Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac

Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases by capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true

Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.

Pull the 100 counties with the highest per-capita confirmed cases here

Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take into account raw caseload figures as well.

Interactive

The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.

@(https://datawrapper.dwcdn.net/nRyaf/15/)

Interactive Embed Code

<iframe title="USA counties (2018) choropleth map Mapping COVID-19 cases by county" aria-describedby="" id="datawrapper-chart-nRyaf" src="https://datawrapper.dwcdn.net/nRyaf/10/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important;" height="400"></iframe><script type="text/javascript">(function() {'use strict';window.addEventListener('message', function(event) {if (typeof event.data['datawrapper-height'] !== 'undefined') {for (var chartId in event.data['datawrapper-height']) {var iframe = document.getElementById('datawrapper-chart-' + chartId) || document.querySelector("iframe[src*='" + chartId + "']");if (!iframe) {continue;}iframe.style.height = event.data['datawrapper-height'][chartId] + 'px';}}});})();</script>

Caveats

This data represents the number of cases and deaths reported by each state and has been collected by Johns Hopkins from a number of sources cited on their website.

In some cases, deaths or cases of people who've crossed state lines -- either to receive treatment or because they became sick and couldn't return home while traveling -- are reported in a state they aren't currently in, because of state reporting rules.

In some states, there are a number of cases not assigned to a specific county -- for those cases, the county name is "unassigned to a single county"

This data should be credited to Johns Hopkins University's COVID-19 tracking project. The AP is simply making it available here for ease of use for reporters and members.

Caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

Population estimates at the county level are drawn from 2014-18 5-year estimates from the American Community Survey.

The Urban/Rural classification scheme is from the Center for Disease Control and Preventions's National Center for Health Statistics. It puts each county into one of six categories -- from Large Central Metro to Non-Core -- according to population and other characteristics. More details about the classifications can be found here.

Johns Hopkins timeseries data - Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count. - Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here

Attribution

This data should be credited to Johns Hopkins University COVID-19 tracking project
Data Scientists vs Size of Datasets
kaggle.com
Updated Oct 18, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Laurae (2016). Data Scientists vs Size of Datasets [Dataset]. https://www.kaggle.com/datasets/laurae2/data-scientists-vs-size-of-datasets/suggestions?status=pending
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 18, 2016
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Laurae
Description
This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. 100 data scientists from France between Jan-2016 and Aug-2016 were interviewed in order to have exploitable data. Therefore, this sample might not be representative of the true population.

What can you do with the data?

Look up whether Kagglers has "stronger" hardware than non-Kagglers

Whether there is a correlation between a preferred data set size and hardware

Is proficiency a predictor of specific preferences?

Are data scientists more Intel or AMD?

How spread is GPU computing, and is there any relationship with Kaggling?

Are you able to predict the amount of euros a data scientist might invest, provided their current workstation details?

I did not find any past research on a similar scale. You are free to play with this data set. For re-usage of this data set out of Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:

Your intended usage (research? business use? blogging?...)

Your first/last name

Arbitrarily, we chose characteristics to describe Data Scientists and data set sizes.

Data set size:

Small: under 1 million values

Medium: between 1 million and 1 billion values

Large: over 1 billion values

For the data, it uses the following fields (DS = Data Scientist, W = Workstation):

DS_1 = Are you working with "large" data sets at work? (large = over 1 billion values) => Yes or No

DS_2 = Do you enjoy working with large data sets? => Yes or No

DS_3 = Would you rather have small, medium, or large data sets for work? => Small, Medium, or Large

DS_4 = Do you have any presence at Kaggle or any other Data Science platforms? => Yes or No

DS_5 = Do you view yourself proficient at working in Data Science? => Yes, A bit, or No

W_1 = What is your CPU brand? => Intel or AMD

W_2 = Do you have access to a remote server to perform large workloads? => Yes or No

W_3 = How much Euros would you invest in Data Science brand new hardware? => numeric output, rounded by 100s

W_4 = How many cores do you have to work with data sets? => numeric output

W_5 = How much RAM (in GB) do you have to work with data sets? => numeric output

W_6 = Do you do GPU computing? => Yes or No

W_7 = What programming languages do you use for Data Science? => R or Python (any other answer accepted)

W_8 = What programming languages do you use for pure statistical analysis? => R or Python (any other answer accepted)

W_9 = What programming languages do you use for training models? => R or Python (any other answer accepted)

You should expect potential noise in the data set. It might not be "free" of internal contradictions, as with all researches.

Facebook

Twitter

Click to copy link

Link copied

Cite

Hasmot Ali (2020). Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First Case And First Death [Dataset]. http://doi.org/10.17632/vw427wzzkk.5

Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First Case And First Death

Explore at:

Unique identifier

https://doi.org/10.17632/vw427wzzkk.5

Dataset updated

Jul 20, 2020

Authors

Hasmot Ali

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Contain informative data related to COVID-19 pandemic. Specially, figure out about the First Case and First Death information for every single country. The datasets mainly focus on two major fields first one is First Case which consists of information of Date of First Case(s), Number of confirm Case(s) at First Day, Age of the patient(s) of First Case, Last Visited Country and the other one First Death information consist of Date of First Death and Age of the Patient who died first for every Country mentioning corresponding Continent. The datasets also contain the Binary Matrix of spread chain among different country and region.

*This is not a country. This is a ship. The name of the Cruise Ship was not given from the government.
"N+": the age is not specified but greater than N
“No Trace”: some data was not found
“Unspecified”: not available from the authority
“N/A”: for “Last Visited Country(s) of Confirmed Case(s)” column, “N/A” indicates that the confirmed case(s) of those countries do not have any travel history in recent past; in “Age of First Death(s)” column “N/A” indicates that those countries do not have may death case till May 16, 2020.

Clear search

Close search

Google apps

Main menu

Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First...

Parameterizing Spatial Models of Infectious Disease Transmission that...

Dataplex: Reddit Data | Consumer Behavior Data | 2.1M+ subreddits: trends,...

Worldwide COVID-19 Data from WHO (2025 Edition)

Dataset Overview

Source Information

Dataset Contents

How to Use

Data Reliability

Acknowledgements

ECMWF ERA5.1: 10 ensemble member surface level analysis parameter data for...

Data from: WildfireSpreadTS: A dataset of multi-modal time series for...

Data from: A dataset of Covid-related misinformation videos and their spread...

ECMWF ERA5t: 10 ensemble member surface level analysis parameter data

‘Pakistan Corona Virus Dataset’ analyzed by Analyst-2

Context

Content

Acknowledgements

Inspiration

Witty Worm dataset

News Popularity in Multiple Social Media Platforms

Context

Content

Dataset Information

open-prices-hazelnut-spread-challenge

SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest...

next-day-wildfire-spread

Replication Data for: The Spread of Synthetic Media on X

A Twitter Dataset of 70+ million tweets related to COVID-19

INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

ECE657AW20-ASG4-Coronavirus

COVID-19 Data for Analysis and Machine Learning

For ECE 657A W20 Students

Suggest other datasets

Johns Hopkins COVID-19 Case Tracker

Updates

- Johns Hopkins has reconciled Ohio's historical deaths data with the state.

Overview

Queries

Interactive

Interactive Embed Code

Caveats

Attribution

Data Scientists vs Size of Datasets

Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First Case And First DeathSee More Versions

Data for: COVID-19 Dataset: Worldwide Spread Log Including Countries First Case And First Death