The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time.
https://github.com/nytimes/covid-19-data
U.S. National-Level Data
The daily number of cases and deaths nationwide, including states, U.S. territories and the District of Columbia, can be found in the covid_us table. (Raw CSV file here.)
date,cases,deaths
2020-01-21,1,0
...
State-Level Data
State-level data can be found in the covid_us_states table. (Raw CSV file here.)
date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
...
County-Level Data
County-level data can be found in the covid_us_counties table. (Raw CSV file here.)
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
...
In some cases, the geographies where cases are reported do not map to standard county boundaries. See the list of geographic exceptions for more detail on these.
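As a minimal sketch of working with this data (a couple of in-memory rows in the us-counties.csv schema stand in for the full download), the file can be parsed with Python's standard csv module, keeping fips as a string so leading zeros survive:

```python
import csv
import io

# Two sample rows in the us-counties.csv schema; in practice, read the raw
# file from https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
sample = """date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Convert counts to integers; fips stays a string to preserve leading zeros
    row["cases"] = int(row["cases"])
    row["deaths"] = int(row["deaths"])

print(rows[0]["county"], rows[0]["cases"])  # Snohomish 1
```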
https://github.com/nytimes/covid-19-data/blob/master/LICENSE
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
From the New York Times GitHub source (CSV, US counties): "The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
United States Data
Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.
Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information."
The specific data here is the data per US county.
The CSV link for counties is: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
NEW: We are publishing the data behind our excess deaths tracker in order to provide researchers and the public with a better record of the true toll of the pandemic. This data is compiled from official national and municipal data for 24 countries. See the data and documentation in the excess-deaths/ directory.
[ U.S. Data (Raw CSV) | U.S. State-Level Data (Raw CSV) | U.S. County-Level Data (Raw CSV) ]
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography.
The historical data files are at the top level of the directory and contain data up to, but not including the current day. The live data files are in the live/ directory.
A key difference between the historical and live files is that the numbers in the historical files are the final counts at the end of each day, while the live files contain figures that may be a partial count released during the day and cannot necessarily be considered the final, end-of-day tally.
The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties.
Each row of data reports the cumulative number of coronavirus cases and deaths based on our best reporting up to the moment we publish an update. Our counts include both laboratory-confirmed and probable cases, using criteria developed by states and the federal government. Not all geographies report probable cases, and some provide confirmed and probable cases only as a single total. Please read here for a full discussion of this issue.
We do our best to revise earlier entries in the data when we receive new information. If a county is not listed for a date, then there were zero reported confirmed cases and deaths.
State and county files contain FIPS codes, a standard geographic identifier, to make it easier for an analyst to combine this data with other data sets like a map file or population data.
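A tiny illustration of such a join, using made-up population and case figures keyed by FIPS code (the real inputs would come from the county file and, say, a census population table):

```python
# Illustrative (not real) cumulative case counts keyed by FIPS code
cases = {"53061": 1200, "06037": 45000}

# Illustrative county populations keyed by the same FIPS codes
population = {"53061": 822083, "06037": 10039107}

# Join the two data sets on the shared FIPS key: cases per 100,000 residents
per_100k = {
    fips: cases[fips] / population[fips] * 100_000
    for fips in cases
    if fips in population
}
print(per_100k)
```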
Download all the data or clone this repository by clicking the green "Clone or download" button above.
The daily number of cases and deaths nationwide, including states, U.S. territories and the District of Columbia, can be found in the us.csv file. (Raw CSV file here.)
date,cases,deaths
2020-01-21,1,0
...
State-level data can be found in the states.csv file. (Raw CSV file here.)
date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
...
County-level data can be found in the counties.csv file. (Raw CSV file here.)
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
...
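Because each row reports cumulative counts, daily new cases are recovered by differencing consecutive days for a given geography; a minimal sketch with toy numbers (note that revisions to earlier entries can make real-world differences negative):

```python
# Toy cumulative case counts for one geography, ordered by date
cumulative = [1, 1, 2, 5, 5, 9]

# Daily new cases: day-over-day differences, with the first day kept as-is
daily = [cumulative[0]] + [
    today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])
]
print(daily)  # [1, 0, 1, 3, 0, 4]
```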
https://creativecommons.org/publicdomain/zero/1.0/
The New York Times is one of the most widely read online news platforms in the world, notable for how actively it engages and connects with its readers. This dataset offers a wealth of information, presenting a valuable opportunity to analyze the extensive collection of news articles available through The New York Times and to study news trends and patterns.
This dataset contains a comprehensive collection of articles from The New York Times, spanning from January 1, 2000, to the present day. The dataset, titled "**The New York Times Articles Metadata**," includes over 2.1 million articles, capturing a vast range of topics and stories. It is updated daily, ensuring that the latest articles from The New York Times are included and providing an up-to-date and evolving resource for analysis. If you want to know how I update the dataset daily, you can refer to my notebook, Scraping New York Times Articles (Daily Updated), for the code template.
The dataset includes key features:
1. Abstract: A brief summary of the article's content.
2. Web URL: The article's web address.
3. Headline: The title or heading of the article.
4. Keywords: Tags associated with the article, providing insights into its content.
5. Pub Date: The publication date of the article.
6. News Desk: The department responsible for the article.
7. Section Name: The section or category of the article.
8. Byline: The author or authors of the article.
9. Word Count: The number of words in the article.
And many more features...
This dataset opens up various possibilities for analysis and exploration, such as:
These are just a few examples to inspire you. Enjoy exploring the rich dataset and discovering valuable insights from The New York Times articles!
Collected COVID-19 datasets from various sources as part of DAAN-888 course, Penn State, Spring 2022. Collaborators: Mohamed Abdelgayed, Heather Beckwith, Mayank Sharma, Suradech Kongkiatpaiboon, and Alex Stroud
**1 - COVID-19 Data in the United States**
Source: The data is collected from multiple public health official sources by NY Times journalists and compiled in one single file.
Description: Daily count of new COVID-19 cases and deaths for each state. Data is updated daily and runs from 1/21/2020 to 2/4/2022.
URL: https://github.com/nytimes/covid-19-data/blob/master/us-states.csv
Data size: 38,814 rows and 5 columns.
**2 - Mask-Wearing Survey Data**
Source: The New York Times is releasing estimates of mask usage by county in the United States.
Description: This data comes from a large number of interviews conducted online by the global data and survey firm Dynata, at the request of The New York Times. The firm asked a question about mask usage to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level.
URL: https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv
Data size: 3,142 rows and 6 columns
**3a - Vaccine Data – Global**
Source: This data comes from the US Centers for Disease Control and Prevention (CDC), Our World in Data (OWiD) and the World Health Organization (WHO).
Description: Time series data of vaccine doses administered and the number of fully and partially vaccinated people by country. This data was last updated on February 3, 2022.
URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/global_data/time_series_covid19_vaccine_global.csv
Data Size: 162,521 rows and 8 columns
**3b - Vaccine Data – United States**
Source: The data comprises individual states' public dashboards and data from the US Centers for Disease Control and Prevention (CDC).
Description: Time series data of the total vaccine doses shipped and administered by manufacturer, the dose number (first or second) by state. This data was last updated on February 3, 2022.
URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/us_data/time_series/vaccine_data_us_timeline.csv
Data Size: 141,503 rows and 13 columns
**4 - Testing Data**
Source: The data comprises individual states' public dashboards and data from the U.S. Department of Health & Human Services.
Description: Time series data of total tests administered by county and state. This data was last updated on January 25, 2022.
URL: https://github.com/govex/COVID-19/blob/master/data_tables/testing_data/county_time_series_covid19_US.csv
Data size: 322,154 rows and 8 columns
**5 – US State and Territorial Public Mask Mandates**
Source: Data from state and territory executive orders, administrative orders, resolutions, and proclamations is gathered from government websites and cataloged and coded by one coder using Microsoft Excel, with quality checking provided by one or more other coders.
Description: US State and Territorial Public Mask Mandates from April 10, 2020 through August 15, 2021, by county by day.
URL: https://data.cdc.gov/Policy-Surveillance/U-S-State-and-Territorial-Public-Mask-Mandates-Fro/62d6-pm5i
Data Size: 1,593,869 rows and 10 columns
**6 – Case Counts & Transmission Level**
Source: This open-source dataset contains seven data items that describe community transmission levels across all counties. It provides the same numbers used to show transmission maps on the COVID Data Tracker and contains reported daily transmission levels at the county level. The dataset is updated every day to include the most current day's data. Calculation procedures are used to classify the transmission level as low, moderate, substantial, or high.
Description: US State and County case counts and transmission level from 16-Aug-2021 to 03-Feb-2022
URL: https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-County-Level-of-Community-T/8396-v7yb
Data Size: 550,702 rows and 7 columns
**7 - World Cases & Vaccination Counts**
Source: This is an open-source dataset collected and maintained by Our World in Data (OWID), which provides research and data to help address the world's largest problems.
Description: This dataset includes vaccinations, tests & positivity, hospital & ICU, confirmed cases, confirmed deaths, reproduction rate, policy responses and other variables of interest.
URL: https://github.com/owid/covid-19-data/tree/master/public/data
Data Size: 157,000 rows and 67 columns
**8 - COVID-19 Data in the European Union**
Source: This is an open-source dataset collected and maintained by the ECDC, an EU agency aimed at strengthening Europe's defenses against infectious diseases.
Description: This dataset co...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OpenAI's text-embedding-ada-002 embeddings aren't the best out there, but they're certainly easy and cheap to obtain. Here are such vectors in a CSV file for 9100 New York Times (NYT) articles from January 2022 to mid-April 2022.
Column definitions:
- id: An identifier for the article. If you go to https://twitter.com/nytimes/status/{id}, you will find a Twitter tweet that references the NYT article. Like, retweet, and reply statistics for each associated tweet can be found at https://www.kaggle.com/datasets/dilwong/newspopularity
- title: The title of the news article
- full_url: A URL to the NYT article
- comments: If comments are enabled for the article, the number of comments
- has_video: Does the article have a video?
- has_audio: Does the article have audio?
- n_tokens: Number of cl100k_base tokens in the article
- embedding: A 1536-dimensional list of floats that provides a semantic representation of the article
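As a rough sketch of how such embeddings are typically compared, here is cosine similarity over toy 4-dimensional vectors standing in for the 1536-dimensional ada-002 ones:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for two articles' embedding vectors
article_a = [0.1, 0.3, -0.2, 0.4]
article_b = [0.1, 0.25, -0.1, 0.5]
print(round(cosine_similarity(article_a, article_b), 3))  # 0.966
```

A value near 1 indicates semantically similar articles; near 0, unrelated ones.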
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Publicly available geocoded social determinants of health and mobility datasets used in the analysis of "Chronic Acid Suppression and Social Determinants of COVID-19 Infection". These datasets are required for the analytical workflow shared on GitHub, which demonstrates how the analysis in the manuscript was done using randomly generated samples to protect patient privacy.
- zcta_county_rel_10.txt: Population and housing density from the 2010 decennial census. Obtained from: https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt
- cre-2018-a11.csv: Community Resilience Estimates, which measure the capacity of individuals and households to absorb, endure, and recover from the health, social, and economic impacts of a disaster such as a hurricane or pandemic. Obtained from: https://www.census.gov/data/experimental-data-products/community-resilience-estimates.html
- zcta_tract_rel_10.txt: Relationship between ZCTA and US Census tracts (used to map census tracts to ZCTA). Obtained from: https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622
- mask-use-by-county.txt: Mask Use By County comes from a large number of interviews conducted online by the global data and survey firm Dynata at the request of The New York Times. The firm asked a question about mask use to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. Obtained from: https://github.com/nytimes/covid-19-data/tree/master/mask-use
- mobility_report_US.txt: Google mobility report, which charts movement trends over time by geography, across categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. Obtained from: https://github.com/ActiveConclusion/COVID19_mobility/blob/master/google_reports/mobility_report_US.csv
- ACS2015_zctaallvars.csv: Social Deprivation Index, a composite measure of area-level deprivation based on seven demographic characteristics collected in the American Community Survey (https://www.census.gov/programs-surveys/acs/) and used to quantify socio-economic variation in health outcomes. Factors are: Income, Education, Employment, Housing, Household Characteristics, Transportation, Demographics. Obtained from: https://www.graham-center.org/rgc/maps-data-tools/sdi/social-deprivation-index.html
This is the first version of the English dataset for VecTop, containing more than 250k articles (2018-10-01 through 2023-10-23) from the NY Times, embedded with OpenAI's text-embedding-ada-002. This corpus is used within VecTop to extract the topics and subtopics of a given text. Please refer to the GitHub page for more information, and to the live demo here for quick evaluation.
This dataset is also supplied as a PostgreSQL backup. It is advisable to import the dataset into a proper database with vector functionality; see the GitHub repo for instructions.
A German version with Spiegel Online has already been released here.
Given a small or large chunk of text, it is useful to categorize it into topics. VecTop uses this dataset within a PostgreSQL database to first summarize the unlabeled text (if it is determined to be too long) and then create embeddings of it. These embeddings are compared against the dataset, and VecTop determines the topics and subtopics by looking at the labels of the closest embeddings under cosine similarity. As a result, the text is categorized into topics and subtopics.
The dataset can be used to search for similarities in texts.
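A toy sketch of that nearest-neighbor lookup, with made-up 3-dimensional vectors and topic labels rather than VecTop's actual corpus or schema:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy labeled corpus of (topic, embedding) pairs; the real entries are
# 1536-dimensional ada-002 vectors stored in PostgreSQL
corpus = [
    ("politics", [0.9, 0.1, 0.0]),
    ("sports",   [0.0, 0.9, 0.2]),
    ("science",  [0.1, 0.1, 0.9]),
]

def nearest_topic(query):
    # Return the topic whose embedding has the highest cosine similarity
    return max(corpus, key=lambda item: cosine(query, item[1]))[0]

print(nearest_topic([0.2, 0.0, 0.8]))  # science
```

In the actual PostgreSQL setup, a vector extension would perform this comparison inside the database rather than in application code.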
Legal: VecTop will be used to research legal activities. For that, a legal corpus is being built. (Coming soon)
VecTop, and therefore this dataset, is licensed under the Apache-2.0 license.
This dataset comprises current news headlines, links, descriptions, publication dates and categories collected from the RSS feeds of Sky News and The New York Times, spanning a wide range of categories.
It includes content from Home, UK, World, US, Business, Politics, Technology, Entertainment, Odd News, Sports, Science, Health, Arts, Job Listings, Most Viewed, Sunday Review, and Television.
This dataset is a resource for news analysis, tracking content trends, media research, and projects in artificial intelligence and natural language processing. Each entry contains the headline, URL, brief description, publication date, and category of the related news item.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a CSV file containing Twitter retweet, reply, and like counts for 9100 New York Times (NYT) articles from January 2022 to mid-April 2022. These counts can be used as a measure of how popular an individual article is.
Column definitions:
- id: Twitter ID. The tweet from which the retweet, reply, and like counts were obtained is at https://twitter.com/nytimes/status/{id}
- retweet_count: Number of retweets
- reply_count: Number of replies
- like_count: Number of likes
- url: A URL to the NYT article
- date: Timestamp for the tweet
- bag_of_phrases: A list of the words/phrases that appear in the NYT article. The text of each article is stored in the CSV file as a bag of lemmatized words, but since some words tend to occur together, those words are instead stored as phrases in which the constituent words are separated by underscores (e.g. "european_union").
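A quick illustration of undoing that encoding, using a made-up bag (a real CSV cell would first need to be parsed into a Python list):

```python
# A hypothetical bag_of_phrases value: multi-word phrases are underscore-joined
bag = ["european_union", "economy", "interest_rate"]

# Split each phrase back into its constituent words
words = [word for phrase in bag for word in phrase.split("_")]
print(words)  # ['european', 'union', 'economy', 'interest', 'rate']
```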
Credit for the photograph here: https://unsplash.com/photos/WYd_PkCa1BY
Code for the data scraping here: https://github.com/dilwong/NewsPopularity/blob/master/0%20Data%20Scraping.ipynb
This dataset was created by Venkatesh Vaishnav
Released under Other (specified in description)
This data was collected and created for a project in a data science course I took in college in the Spring of 2020. I have updated the data to include more dates into the summer and decided to share it and the code so others can explore it.
Available here: https://hifld-geoplatform.opendata.arcgis.com/datasets/hospitals
Information on hospitals in the United States.
Available here: https://github.com/nytimes/covid-19-data
Daily COVID-19 case and death data for US counties.
Available here: https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/
Data sheet available here: https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/co-est2019-alldata.pdf
2019 county level census estimates.
Available here: https://covidtracking.com/api/v1/states/daily.csv
Daily state-level COVID-19 testing data.
Uploaded with Git LFS
Interim data views created by me to hold cleaned data and used to create the final dataset.
Final combined dataset: a days × 3,142 (number of US counties + DC) time series with variables stored as a proportion of population.
Uploaded with Git LFS
The python scripts have comments to explain which datasets they're responsible for generating.
Feel free to use and edit them to tailor the datasets generated to your liking.
There is also a helper function library in the main directory.
Scripts can be run by calling >python
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The New York Times is one of the most popular online news platforms in the world. What sets the Times apart from other publications is the ability to engage and connect with its readers. Readers who visit the site can provide their thoughts and reactions to published content in the form of comments, and have been doing so increasingly over the last few years.
This dataset contains all comments and articles from January 1, 2020 - December 31, 2020. The articles .csv file contains 16K+ articles with 11 features, and the comments .csv file contains nearly 5M comments with 23 features.
There's a ton of things you can do with this dataset, including:
1. Predict the number of comments that an article will receive -- you can use n_comments as a target variable or convert it to a binary classification variable. You can use the train/test .csv files for this task.
2. Predict how many recommendations a comment will receive using recommendations as a target variable.
3. Predict whether a comment will be selected as a Times Pick using editorsSelection as a target variable.
4. Identify the most popular topics based on article headlines -- you could try using something like KMeans clustering or Latent Dirichlet Allocation (LDA) clustering.
5. Generate news headlines using a Long Short-Term Memory (LSTM) neural network.
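For the first task, for example, the comment count can be reduced to a binary target; a minimal sketch with hypothetical values:

```python
# Hypothetical n_comments values for six articles
n_comments = [0, 12, 3, 250, 0, 41]

# Binary classification target: did the article receive any comments?
has_comments = [1 if n > 0 else 0 for n in n_comments]
print(has_comments)  # [0, 1, 1, 1, 0, 1]
```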
This data was accessed through the New York Times API with nytimes-scraper. A detailed look at the data cleaning process can be found here. I'd like to acknowledge two invaluable sources of inspiration -- Aashita Kersawani's 2018 dataset, and The Analytics Edge 2015 competition.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The New York Times has a wide audience and plays a prominent role in shaping people's opinions and outlook on current affairs, and in setting the tone of public discourse, especially in the USA. The comment section in the articles is very active, and it gives a glimpse of readers' take on the matters concerning the articles.
The data contains information about the comments made on the articles published in New York Times in Jan-May 2017 and Jan-April 2018. The month-wise data is given in two csv files - one each for the articles on which comments were made and for the comments themselves. The csv files for comments contain over 2 million comments in total with 34 features and those for articles contain 16 features about more than 9,000 articles.
The data set is rich in information, containing comments' texts, which are largely very well written, along with contextual information such as the section/topic of the article, as well as features indicating how well the comment was received by the readers, such as editorsSelection and recommendations. This data can serve the purpose of understanding and analyzing the public mood.
The exploratory kernel here can be used for a review of the features of the dataset and the NB-Logistic model kernel for predicting NYT's pick can be used as a starter for building models on a range of ideas, some of which are:
- recommendations as the target variable. With enough training data for the model, we can guess how a hypothetical comment on a certain topic will be received by the community of NYT readers, and this can be considered a tool to gauge public opinion. The design of this model will be very similar to the ones used in ranking reviews by guessing how many upvotes the reviews will receive.
- editorsSelection as the target variable. It gives a clue to what NYT considers worth promoting.
- sectionName and/or newDesk as the target variable) of the article.
- replyCount feature as the target variable).
- sectionName and/or newDesk).

The Python package here, written to supplement this dataset, can be used to retrieve comments from a customized search of the NYT articles concerning a specific topic, for example the Iraq war or ObamaCare, in a given timeline. The tutorial here gives detailed information about the use of the package with the help of examples.
https://creativecommons.org/publicdomain/zero/1.0/
These are all clues and answers for words in the New York Times crossword from 11/21/93 through 10/31/21.
This data was compiled to create these two quizzes: https://hugequiz.com/quizzes/most-common-new-york-times-crossword-answers/ https://hugequiz.com/quizzes/most-common-new-york-times-crossword-answers-by-letter/
https://creativecommons.org/publicdomain/zero/1.0/
The data is collected from various media houses' home pages to see which news media write articles with fewer gory words.
The data source comprises these websites, downloaded over the period Oct 2017 to Nov 2017:
1. "http://www.nytimes.com/"
2. "http://www.foxnews.com/"
3. "http://www.reuters.com/"
4. "http://www.cnn.com/"
5. "http://www.huffingtonpost.com/"
Each folder is named in the mmddyyyy convention, and each CSV file has the media house name as the file name (e.g., reuters.csv). The CSV has the following columns:
- TITLE: the title of the article.
- SUMMARY: the first few lines of the article's text.
- TEXT: the full text of the article.
- URL: web link to the article.
- KEYWORDS: important words in the article.

This dataset is under the CC0: Public Domain license.
All around the world, both good and bad things happen, and we get to know only those that are exposed to us. Exposing them is the primary responsibility of the media. But the bigger responsibility of these media houses is the way in which they express the content to the people.
A responsible media house's content should be original, unbiased, free of exaggeration, and very sensitive in handling the emotions of its readers and viewers. The same story could be told in different ways, and these different ways could trigger different emotions among its readers.
It is known that we become who we are by what we say and what we read. Reading a story filled with positive words makes us feel more positive, and vice versa. So the wording of a piece plays as great a role as the content itself.
This dataset stands as a sample to find out which media house conveys the news in a more optimistic way.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This project creates and provides, in tabular form, a complete list of books from the New York Times Bestseller Lists (1931–2024) in the Fiction and Non-fiction categories. The motivation for this project was curiosity. As a reader, I wanted to see historic bestseller trends, identify how many of the bestseller books I have read, and see which authors consistently appear on the NYT list. I also think this project provides a good opportunity for literary research in the future. I was surprised to find that only lists specific to one genre (fiction) or within a limited time frame had been created and made publicly available. This dataset is unique because it provides both fiction and non-fiction data, as well as some book descriptions, across the entire current (November 2024) history of the NYT bestseller list.
The scraping and analysis were conducted using Python scripts to extract, clean and process data from PDFs available on Hawes.com, from Hawes Publications.
Fiction Data (fiction_all.csv):
Date, Rank, Title, Author, Publisher, Description, and Genre.

Non-fiction Data (non_fiction_all.csv):

Merged Data (merged_genres.csv):
Genre column to identify the category of each book.

Author Appearance Data (author_appearances.csv):

Book Appearance Data (book_appearances.csv):
A power analysis was not applicable for my project because it does not involve hypothesis testing or sampling methodologies requiring statistical power computations.
The bar chart visualizations are saved as interactive HTML files (authors.html and books.html) and can be found in this repository.
All scripts for data cleaning, preprocessing, and visualization are publicly available in the GitHub repository: https://github.com/breese5/NYTBestseller1931-2024
This dataset and analysis were developed for educational and exploratory purposes. While efforts were made to ensure the accuracy of the data, there may be inconsistencies introduced during preprocessing or due to the nature of scraping from PDF-turned-TXT files. Some additional manual cleaning to clear out abnormal spacing could improve the dataset; however, this was difficult given its size.
The dataset reflects historical bestseller lists, so it is not necessarily representative of the "best" books; that is a matter of opinion.
This dataset and all associated scripts are released under the MIT License, allowing for open use, modification, and sharing with proper attribution.