19 datasets found
  1. NYT COVID US Cases & Deaths

    • redivis.com
    application/jsonl +7
    Updated Feb 19, 2021
    + more versions
    Cite
    Redivis Demo Organization (2021). NYT COVID US Cases & Deaths [Dataset]. https://redivis.com/datasets/28ec-fsftysdhj
    Available download formats: avro, csv, stata, application/jsonl, spss, sas, parquet, arrow
    Dataset updated
    Feb 19, 2021
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Time period covered
    Jan 21, 2020 - Feb 16, 2021
    Area covered
    United States
    Description

    Abstract

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time.

    Documentation

    https://github.com/nytimes/covid-19-data

    U.S. National-Level Data

    The daily number of cases and deaths nationwide, including states, U.S. territories and the District of Columbia, can be found in the covid_us table. (Raw CSV file here.)

    date,cases,deaths
    2020-01-21,1,0
    ...

    State-Level Data

    State-level data can be found in the covid_us_states table. (Raw CSV file here.)

    date,state,fips,cases,deaths
    2020-01-21,Washington,53,1,0
    ...

    County-Level Data

    County-level data can be found in the covid_us_counties table. (Raw CSV file here.)

    date,county,state,fips,cases,deaths
    2020-01-21,Snohomish,Washington,53061,1,0
    ...

    In some cases, the geographies where cases are reported do not map to standard county boundaries. See the list of geographic exceptions for more detail on these.
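A minimal sketch of reading rows in the county-level layout shown above, using only the Python standard library. The helper names (`to_int`, `read_county_rows`) are illustrative, and the handling of blank values is an assumption: rows for geographic exceptions may not carry a usable `fips` value, so the code keeps `fips` as a string and tolerates empty fields.

```python
import csv
import io

# Sample rows in the county-level layout shown above.
# A real analysis would read the full table or CSV file instead.
SAMPLE = """date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
"""


def to_int(value):
    """Convert a CSV field to int, returning None when the field is blank."""
    value = value.strip()
    return int(value) if value else None


def read_county_rows(text):
    """Parse county-level CSV text into a list of dicts with typed fields."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "date": rec["date"],
            "county": rec["county"],
            "state": rec["state"],
            # Keep FIPS as a string so leading zeros survive (assumption:
            # it can be blank for geographies that are not standard counties).
            "fips": rec["fips"].strip() or None,
            "cases": to_int(rec["cases"]),
            "deaths": to_int(rec["deaths"]),
        })
    return rows


rows = read_county_rows(SAMPLE)
print(rows[0])
```

Keeping `fips` as text rather than an integer is a deliberate choice here, since it is an identifier and not a quantity.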

  2. Coronavirus (Covid-19) Data in the United States

    • github.com
    • openicpsr.org
    • +4 more
    csv
    Cite
    New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
    Available download formats: csv
    Dataset provided by
    New York Times
    License

    https://github.com/nytimes/covid-19-data/blob/master/LICENSE

    Description

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

  3. US counties COVID 19 dataset

    • kaggle.com
    zip
    Updated Dec 24, 2021
    + more versions
    Cite
    MyrnaMFL (2021). US counties COVID 19 dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/2966461
    Available download formats: zip (23213106 bytes)
    Dataset updated
    Dec 24, 2021
    Authors
    MyrnaMFL
    Area covered
    United States
    Description

    From the New York Times GITHUB source: CSV US counties "The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

    United States Data

    Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.

    Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information."

    The specific data here is the per-county data for the US.

    The CSV link for counties is: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

  4. NY-TIMES COVID-19 USA dataset

    • kaggle.com
    zip
    Updated Mar 20, 2024
    Cite
    Eisa (2024). NY-TIMES COVID-19 USA dataset [Dataset]. https://www.kaggle.com/imoore/us-covid19-dataset-live-hourlydaily-updates
    Available download formats: zip (29335111 bytes)
    Dataset updated
    Mar 20, 2024
    Authors
    Eisa
    Area covered
    United States
    Description

    Historical Coronavirus (Covid-19) Data for the United States

    NEW: We are publishing the data behind our excess deaths tracker in order to provide researchers and the public with a better record of the true toll of the pandemic. This data is compiled from official national and municipal data for 24 countries. See the data and documentation in the excess-deaths/ directory.

    [ U.S. Data (Raw CSV) | U.S. State-Level Data (Raw CSV) | U.S. County-Level Data (Raw CSV) ]

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

    Live and Historical Data

    We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography.

    The historical data files are at the top level of the directory and contain data up to, but not including the current day. The live data files are in the live/ directory.

    A key difference between the historical and live files is that the numbers in the historical files are the final counts at the end of each day, while the live files have figures that may be a partial count released during the day but cannot necessarily be considered the final, end-of-day tally.

    The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties.

    Each row of data reports the cumulative number of coronavirus cases and deaths based on our best reporting up to the moment we publish an update. Our counts include both laboratory confirmed and probable cases using criteria that were developed by states and the federal government. Not all geographies are reporting probable cases and yet others are providing confirmed and probable as a single total. Please read here for a full discussion of this issue.

    We do our best to revise earlier entries in the data when we receive new information. If a county is not listed for a date, then there were zero reported confirmed cases and deaths.
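Because each row reports a cumulative total, daily new counts have to be derived by differencing consecutive days. The following is a minimal sketch of that step; the function name `daily_new` and the input values are made up for illustration and are not taken from the data files.

```python
def daily_new(cumulative):
    """Turn a list of cumulative counts (one per day) into daily differences.

    The first day's new count is the cumulative value itself, since the
    series is assumed to start from zero.
    """
    new = []
    previous = 0
    for total in cumulative:
        new.append(total - previous)
        previous = total
    return new


# Made-up cumulative case counts for five consecutive days.
cumulative_cases = [1, 1, 2, 5, 5]
print(daily_new(cumulative_cases))  # [1, 0, 1, 3, 0]
```

Since earlier entries may be revised, a difference can come out negative on a day when a total was corrected downward; whether to clip, smooth, or keep such values is an analysis decision this sketch leaves open.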

    State and county files contain FIPS codes, a standard geographic identifier, to make it easier for an analyst to combine this data with other data sets like a map file or population data.
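As an illustration of combining this data with another table by FIPS code, here is a minimal sketch in plain Python. The population figure is a made-up placeholder, and `normalize_fips` and `cases_per_100k` are hypothetical helpers; the only assumption about the codes themselves is that county FIPS codes are five digits and may lose a leading zero if a tool reads them as numbers.

```python
def normalize_fips(code, width=5):
    """Return a FIPS code as a zero-padded string, e.g. 1001 -> '01001'."""
    return str(code).strip().zfill(width)


def cases_per_100k(case_rows, population_by_fips):
    """Join case rows to a population lookup on FIPS and compute a rate."""
    rates = {}
    for row in case_rows:
        fips = normalize_fips(row["fips"])
        population = population_by_fips.get(fips)
        if population:  # skip rows with no matching population entry
            rates[fips] = row["cases"] / population * 100_000
    return rates


# One row in the county layout shown in this listing.
case_rows = [{"fips": "53061", "cases": 1}]

# Placeholder population value, for illustration only.
population_by_fips = {normalize_fips(53061): 800_000}

print(cases_per_100k(case_rows, population_by_fips))  # {'53061': 0.125}
```

Normalizing both sides of the join to the same string form avoids silent mismatches between a code stored as text in one source and as a number in another.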

    Download all the data or clone this repository by clicking the green "Clone or download" button above.

    Historical Data

    U.S. National-Level Data

    The daily number of cases and deaths nationwide, including states, U.S. territories and the District of Columbia, can be found in the us.csv file. (Raw CSV file here.)

    date,cases,deaths
    2020-01-21,1,0
    ...
    

    State-Level Data

    State-level data can be found in the states.csv file. (Raw CSV file here.)

    date,state,fips,cases,deaths
    2020-01-21,Washington,53,1,0
    ...
    

    County-Level Data

    County-level data can be found in the counties.csv file. (Raw CSV file here.)

    date,county,state,fips,c...
    
  5. NYT Articles: 2.1M+ (2000-Present) Daily Updated

    • kaggle.com
    zip
    Updated May 31, 2025
    Cite
    Aryan Singh (2025). NYT Articles: 2.1M+ (2000-Present) Daily Updated [Dataset]. https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present
    Available download formats: zip (917844941 bytes)
    Dataset updated
    May 31, 2025
    Authors
    Aryan Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    As one of the most renowned online news platforms globally, The New York Times stands out for its exceptional ability to engage and connect with its readers. What sets this publication apart from others is its unique capacity to foster meaningful interactions with its audience. This dataset offers a wealth of information, presenting a valuable opportunity to analyze and gain insights from the extensive collection of news articles available through The New York Times. Explore the data and unlock the potential for in-depth analysis and understanding of news trends and patterns.

    Content

    This dataset contains a comprehensive collection of articles from The New York Times, spanning from January 1, 2000, to the present day. The dataset, titled "The New York Times Articles Metadata," includes over 2.1 million articles, capturing a vast range of topics and stories. It is updated daily, so the latest articles from The New York Times are included, providing an up-to-date and evolving resource for analysis. To see how I update the dataset daily, refer to my notebook "Scraping New York Times Articles (Daily Updated)" for the code template.

    Features

    The dataset includes key features:

    1. Abstract: A brief summary of the article's content.
    2. Web URL: The article's web address.
    3. Headline: The title or heading of the article.
    4. Keywords: Tags associated with the article, providing insights into its content.
    5. Pub Date: The publication date of the article.
    6. News Desk: The department responsible for the article.
    7. Section Name: The section or category of the article.
    8. Byline: The author or authors of the article.
    9. Word Count: The number of words in the article.

    And many more features...

    Inspiration

    This dataset opens up various possibilities for analysis and exploration, such as:

    1. Trend Analysis: Identify emerging topics and popular themes by analyzing the frequency of keywords and categories over time.
    2. User Engagement: Explore reader comments and reactions to gain insights into public sentiment and opinions on various articles.
    3. Sentiment Analysis: Analyze the emotional tone of news articles using sentiment analysis techniques on headings, snippets, or full text to understand public perception.
    4. Content Recommendation: Build a recommendation system that suggests relevant articles based on user preferences, article content, and historical patterns.
    5. Journalistic Styles: Examine the evolution of writing styles and journalistic preferences over time and across different sections or authors.
    6. Data Visualization: Create visually compelling graphs, word clouds, and interactive dashboards to present meaningful insights and trends derived from the dataset.
    7. Topic Modeling: Employ techniques such as Latent Dirichlet Allocation (LDA) to identify key topics and themes within the articles, providing a deeper understanding of the content.
    8. Social Network Analysis: Uncover connections and influence networks between authors, articles, and readers, revealing patterns of collaboration and engagement.
    9. Geographical Analysis: Explore geographical patterns by analyzing the distribution of news articles based on locations mentioned or covered.
    10. Text Classification: Classify articles into different genres or categories using machine learning models to understand the diversity and distribution of content.
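To make the first idea in the list above concrete, here is a minimal sketch of counting keyword frequency per year. The record layout (`pub_date` as an ISO date string, `keywords` as a list of strings) and the sample records are assumptions for illustration; the actual column names and encodings in the dataset may differ.

```python
from collections import Counter, defaultdict

# Made-up records; field names are assumed, not taken from the dataset.
articles = [
    {"pub_date": "2020-03-01", "keywords": ["Coronavirus", "Economy"]},
    {"pub_date": "2020-07-15", "keywords": ["Coronavirus"]},
    {"pub_date": "2021-01-20", "keywords": ["Inauguration", "Economy"]},
]


def keyword_counts_by_year(records):
    """Count how often each keyword appears, grouped by publication year."""
    counts = defaultdict(Counter)
    for record in records:
        year = record["pub_date"][:4]  # first four characters of YYYY-MM-DD
        counts[year].update(record["keywords"])
    return counts


trend = keyword_counts_by_year(articles)
print(trend["2020"].most_common(1))  # [('Coronavirus', 2)]
```

Comparing the per-year counters for the same keyword gives a rough view of how a topic rises or fades over time.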

    These are just a few examples to inspire you. Enjoy exploring the rich dataset and discovering valuable insights from The New York Times articles!

  6. COVID19_datasets

    • kaggle.com
    zip
    Updated Apr 2, 2022
    Cite
    Suradech Kongkiatpaiboon (2022). COVID19_datasets [Dataset]. https://www.kaggle.com/datasets/suradechk/covid19-datasets/discussion
    Available download formats: zip (136322570 bytes)
    Dataset updated
    Apr 2, 2022
    Authors
    Suradech Kongkiatpaiboon
    Description

    Collected COVID-19 datasets from various sources as part of DAAN-888 course, Penn State, Spring 2022. Collaborators: Mohamed Abdelgayed, Heather Beckwith, Mayank Sharma, Suradech Kongkiatpaiboon, and Alex Stroud

    1 - COVID-19 Data in the United States
    Source: The data is collected from multiple public health official sources by NY Times journalists and compiled in one single file.
    Description: Daily count of new COVID-19 cases and deaths for each state. Data is updated daily and runs from 1/21/2020 to 2/4/2022.
    URL: https://github.com/nytimes/covid-19-data/blob/master/us-states.csv
    Data size: 38,814 rows and 5 columns

    2 - Mask-Wearing Survey Data
    Source: The New York Times is releasing estimates of mask usage by county in the United States.
    Description: This data comes from a large number of interviews conducted online by the global data and survey firm Dynata, at the request of The New York Times. The firm asked a question about mask usage to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level.
    URL: https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv
    Data size: 3,142 rows and 6 columns

    3a - Vaccine Data – Global
    Source: This data comes from the US Centers for Disease Control and Prevention (CDC), Our World in Data (OWiD) and the World Health Organization (WHO).
    Description: Time series data of vaccine doses administered and the number of fully and partially vaccinated people by country. This data was last updated on February 3, 2022.
    URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/global_data/time_series_covid19_vaccine_global.csv
    Data size: 162,521 rows and 8 columns

    3b - Vaccine Data – United States
    Source: The data is compiled from individual states' public dashboards and data from the US Centers for Disease Control and Prevention (CDC).
    Description: Time series data of the total vaccine doses shipped and administered by manufacturer, and the dose number (first or second) by state. This data was last updated on February 3, 2022.
    URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/us_data/time_series/vaccine_data_us_timeline.csv
    Data size: 141,503 rows and 13 columns

    4 - Testing Data
    Source: The data is compiled from individual states' public dashboards and data from the U.S. Department of Health & Human Services.
    Description: Time series data of total tests administered by county and state. This data was last updated on January 25, 2022.
    URL: https://github.com/govex/COVID-19/blob/master/data_tables/testing_data/county_time_series_covid19_US.csv
    Data size: 322,154 rows and 8 columns

    5 – US State and Territorial Public Mask Mandates
    Source: Data from state and territory executive orders, administrative orders, resolutions, and proclamations is gathered from government websites and cataloged and coded by one coder using Microsoft Excel, with quality checking provided by one or more other coders.
    Description: US State and Territorial Public Mask Mandates from April 10, 2020 through August 15, 2021 by County by Day
    URL: https://data.cdc.gov/Policy-Surveillance/U-S-State-and-Territorial-Public-Mask-Mandates-Fro/62d6-pm5i
    Data size: 1,593,869 rows and 10 columns

    6 – Case Counts & Transmission Level
    Source: This open-source dataset contains seven data items that describe community transmission levels across all counties. This dataset provides the same numbers used to show transmission maps on the COVID Data Tracker and contains reported daily transmission levels at the county level. The dataset is updated every day to include the most current day's data. The calculating procedures below are used to adjust the transmission level to low, moderate, considerable, or high.
    Description: US State and County case counts and transmission level from 16-Aug-2021 to 03-Feb-2022
    URL: https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-County-Level-of-Community-T/8396-v7yb
    Data size: 550,702 rows and 7 columns

    7 - World Cases & Vaccination Counts
    Source: This is an open-source dataset collected and maintained by Our World in Data. OWID provides research and data to make progress against the world’s largest problems.
    Description: This dataset includes vaccinations, tests & positivity, hospital & ICU, confirmed cases, confirmed deaths, reproduction rate, policy responses and other variables of interest.
    URL: https://github.com/owid/covid-19-data/tree/master/public/data
    Data size: 67 columns and 157,000 rows

    8 - COVID-19 Data in the European Union
    Source: This is an open-source dataset collected and maintained by ECDC. It is an EU agency aimed at strengthening Europe's defenses against infectious diseases.
    Description: This dataset co...

  7. OpenAI Embeddings for New York Times Articles

    • kaggle.com
    zip
    Updated Aug 13, 2023
    Cite
    Dillon Wong (2023). OpenAI Embeddings for New York Times Articles [Dataset]. https://www.kaggle.com/datasets/dilwong/openai-embeddings-for-new-york-times-articles
    Available download formats: zip (123819767 bytes)
    Dataset updated
    Aug 13, 2023
    Authors
    Dillon Wong
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OpenAI's text-embedding-ada-002 embeddings aren't the best out there, but they're certainly easy and cheap to obtain. Here are such vectors in a CSV file for 9100 New York Times (NYT) articles from January 2022 to mid-April 2022.

    Column definitions:
    - id: An identifier for the article. If you go to https://twitter.com/nytimes/status/{id}, you will find a tweet that references the NYT article. Like, retweet, and reply statistics for each associated tweet can be found at https://www.kaggle.com/datasets/dilwong/newspopularity
    - title: The title of the news article
    - full_url: A URL to the NYT article
    - comments: If comments are enabled for the article, the number of comments
    - has_video: Does the article have a video?
    - has_audio: Does the article have audio?
    - n_tokens: Number of cl100k_base tokens in the article
    - embedding: A 1536-dimensional list of floats that provides a semantic representation of the article

  8. Datasets supporting analytical workflow of: Chronic Acid Suppression and...

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Bing Zhang; Anna Silverman; Saroja Bangaru; Douglas Arneson; Sonya Dasharathy; Nghia Nguyen; Diane Rodden; Jonathan Shih; Atul Butte; Wael El-Nachef; Brigid Boland; Vivek Rudrapatna (2023). Datasets supporting analytical workflow of: Chronic Acid Suppression and Social Determinants of COVID-19 Infection [Dataset]. http://doi.org/10.6084/m9.figshare.13380356.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Bing Zhang; Anna Silverman; Saroja Bangaru; Douglas Arneson; Sonya Dasharathy; Nghia Nguyen; Diane Rodden; Jonathan Shih; Atul Butte; Wael El-Nachef; Brigid Boland; Vivek Rudrapatna
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Publicly available geocoded social determinants of health and mobility datasets used in the analysis of "Chronic Acid Suppression and Social Determinants of COVID-19 Infection". These datasets are required for the analytical workflow shared on Github, which demonstrates how the analysis in the manuscript was done using randomly generated samples to protect patient privacy.

    zcta_county_rel_10.txt - Population and housing density from the 2010 decennial census. Obtained from: https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt

    cre-2018-a11.csv - Community Resilience Estimates, which measure the capacity of individuals and households to absorb, endure, and recover from the health, social, and economic impacts of a disaster such as a hurricane or pandemic. Data obtained from: https://www.census.gov/data/experimental-data-products/community-resilience-estimates.html

    zcta_tract_rel_10.txt - Relationship between ZCTA and US Census tracts (used to map census tracts to ZCTA). Data obtained from: https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622

    mask-use-by-county.txt - Mask Use By County comes from a large number of interviews conducted online by the global data and survey firm Dynata at the request of The New York Times. The firm asked a question about mask use to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. Data obtained from: https://github.com/nytimes/covid-19-data/tree/master/mask-use

    mobility_report_US.txt - Google mobility report, which charts movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. Data obtained from: https://github.com/ActiveConclusion/COVID19_mobility/blob/master/google_reports/mobility_report_US.csv

    ACS2015_zctaallvars.csv - Social Deprivation Index, a composite measure of area-level deprivation based on seven demographic characteristics collected in the American Community Survey (https://www.census.gov/programs-surveys/acs/) and used to quantify the socio-economic variation in health outcomes. Factors are: Income, Education, Employment, Housing, Household Characteristics, Transportation, Demographics. Data obtained from: https://www.graham-center.org/rgc/maps-data-tools/sdi/social-deprivation-index.html

  9. NY Times Vector Corpus

    • kaggle.com
    zip
    Updated Nov 7, 2023
    Cite
    TheItCrow (2023). NY Times Vector Corpus [Dataset]. https://www.kaggle.com/datasets/kevinbnisch/ny-times-vectordb-for-topic-extraction
    Available download formats: zip (3676887181 bytes)
    Dataset updated
    Nov 7, 2023
    Authors
    TheItCrow
    Description

    About

    This is the first version of the English dataset for VecTop which contains >250k (2018-10-01 -> 2023-10-23) articles from NY Times which have been embedded with OpenAI's text-embedding-ada-002. This corpus is being used within VecTop to extract the topics and subtopics of a given text. Please refer to the GitHub page for more information and refer to the live demo here for quick evaluation.

    This dataset is also supplied via a PostgreSQL backup. It is advisable to import the dataset into a proper database with vector functionality for instant results. See the GitHub Repo for that.

    German Version

    A German version with Spiegel Online has already been released here.

    Use Cases

    Topic Extraction

    Given a small or large chunk of text, it is useful to categorize the text into topics. VecTop uses this dataset within a PostgreSQL database to first summarize the unlabeled text (if determined to be too long) and then create embeddings of it. These embeddings are then compared with those in the dataset, and VecTop determines the topics and subtopics by looking at the topics and subtopics of the closest embeddings by cosine similarity. As a result, the text is categorized into topics and subtopics.
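The comparison step described above can be sketched roughly as follows. This is not VecTop's implementation: it is a plain-Python illustration that ranks labeled vectors by cosine similarity to a query and lets the closest ones vote on a topic. The corpus entries, the 3-dimensional toy vectors, and the names `cosine_similarity` and `nearest_topics` are all assumptions for illustration; real embeddings would come from an embedding model and be much higher-dimensional.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_topics(query, corpus, k=3):
    """Tally the topics of the k corpus entries most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query, item["embedding"]),
        reverse=True,
    )
    votes = Counter(item["topic"] for item in ranked[:k])
    return votes.most_common()


# Toy labeled corpus with made-up vectors and topics.
corpus = [
    {"topic": "Health", "embedding": [0.9, 0.1, 0.0]},
    {"topic": "Health", "embedding": [0.8, 0.2, 0.1]},
    {"topic": "Sports", "embedding": [0.0, 0.1, 0.9]},
    {"topic": "Politics", "embedding": [0.1, 0.9, 0.1]},
]

query = [1.0, 0.1, 0.0]
print(nearest_topics(query, corpus, k=2))  # [('Health', 2)]
```

A linear scan like this is fine for a toy corpus; for hundreds of thousands of articles, a database or index with vector search support would be the more practical choice.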

    Searching

    The dataset can be used to search for similarities in texts.

    Legal Research

    Legal VecTop will be used to research legal activities. For that, a legal corpus is being built. (Coming soon)

    License

    VecTop, and therefore this dataset, is licensed under the Apache-2.0 license.

  10. RSS News Feed Collection: Sky News & NYT

    • kaggle.com
    zip
    Updated Dec 21, 2023
    Cite
    bkcoban (2023). RSS News Feed Collection: Sky News & NYT [Dataset]. https://www.kaggle.com/datasets/bkcoban/rss-news-feed-collection-sky-news-and-nyt
    Available download formats: zip (46622 bytes)
    Dataset updated
    Dec 21, 2023
    Authors
    bkcoban
    Description

    This dataset comprises current news headlines, links, descriptions, publication dates and categories collected from the RSS feeds of Sky News and The New York Times, spanning a wide range of categories.

    It includes content from Home, UK, World, US, Business, Politics, Technology, Entertainment, Odd News, Sports, Science, Health, Arts, Job Listings, Most Viewed, Sunday Review, and Television.

    This dataset is a resource for news analysis, tracking content trends, media research, and projects in artificial intelligence and natural language processing. Each entry contains the headline, URL, brief description, publication date, and category of the related news item.

  11. News Popularity

    • kaggle.com
    zip
    Updated Nov 27, 2022
    Cite
    Dillon Wong (2022). News Popularity [Dataset]. https://www.kaggle.com/datasets/dilwong/newspopularity/code
    Available download formats: zip (10524129 bytes)
    Dataset updated
    Nov 27, 2022
    Authors
    Dillon Wong
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a CSV file containing Twitter retweet, reply, and like counts for 9100 New York Times (NYT) articles from January 2022 to mid-April 2022. These counts can be used as a measure of how popular an individual article is.

    Column definitions:
    - id: Twitter ID. The tweet from which the retweet, reply, and like numbers were obtained is at https://twitter.com/nytimes/status/{id}
    - retweet_count: Number of retweets
    - reply_count: Number of replies
    - like_count: Number of likes
    - url: A URL to the NYT article
    - date: Timestamp for the tweet
    - bag_of_phrases: A list of the words/phrases that appear in the NYT article. The text of each article is stored in the CSV file as a bag of lemmatized words, but since some words tend to occur together, those words are instead stored as phrases in which the constituent words are separated by underscores (e.g. "european_union").
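To illustrate the bag_of_phrases idea, here is a minimal sketch that joins adjacent words into an underscore-separated phrase when the pair appears in a set of known phrases. How the dataset author actually detected phrases is not stated here, so the function `to_bag_of_phrases`, the two-word limit, and the hand-written phrase set are all assumptions.

```python
def to_bag_of_phrases(tokens, phrases):
    """Join adjacent token pairs found in `phrases` with an underscore.

    `tokens` is a list of (already lemmatized) words and `phrases` is a set
    of two-word tuples that should be treated as single units.
    """
    bag = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            bag.append("_".join(pair))
            i += 2  # both words were consumed by the phrase
        else:
            bag.append(tokens[i])
            i += 1
    return bag


# Made-up tokens and phrase set, for illustration only.
tokens = ["european", "union", "leader", "meet"]
known_phrases = {("european", "union")}
print(to_bag_of_phrases(tokens, known_phrases))
# ['european_union', 'leader', 'meet']
```

Treating frequent word pairs as single tokens keeps terms like "european_union" from being split into two unrelated words in later counting or modeling.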

    Credit for the photograph here: https://unsplash.com/photos/WYd_PkCa1BY

    Code for the data scraping here: https://github.com/dilwong/NewsPopularity/blob/master/0%20Data%20Scraping.ipynb

  12. U.S. States COVID-19 Cases & Deaths (NYT)

    • kaggle.com
    zip
    Updated Aug 18, 2025
    Cite
    Venkatesh Vaishnav (2025). U.S. States COVID-19 Cases & Deaths (NYT) [Dataset]. https://www.kaggle.com/datasets/godeyeisnub/u-s-states-covid-19-cases-and-deaths-nyt
    Available download formats: zip (610420 bytes)
    Dataset updated
    Aug 18, 2025
    Authors
    Venkatesh Vaishnav
    Area covered
    United States
    Description

    Dataset

    This dataset was created by Venkatesh Vaishnav

    Released under Other (specified in description)

    Contents

  13. Covid-19 and Hospitals US County Time Series

    • kaggle.com
    zip
    Updated Oct 23, 2025
    Cite
    Jacob Zahn (2025). Covid-19 and Hospitals US County Time Series [Dataset]. https://www.kaggle.com/datasets/jmzahn/covid19-and-hospitals-us-county-time-series
    Available download formats: zip (6221564 bytes)
    Dataset updated
    Oct 23, 2025
    Authors
    Jacob Zahn
    Area covered
    United States
    Description

    Context

    This data was collected and created for a project in a data science course I took in college in the Spring of 2020. I have updated the data to include more dates into the summer and decided to share it and the code so others can explore it.

    Content

    Data

    Hospitals.csv

    Available here: https://hifld-geoplatform.opendata.arcgis.com/datasets/hospitals

    Information on hospitals in the United States.

    us-counties.csv

    Available here: https://github.com/nytimes/covid-19-data

    Daily covid cases and death data for us counties.

    co-est2019-alldata.csv

    Available here: https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/

    Data sheet available here: https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/co-est2019-alldata.pdf

    2019 county level census estimates.

    daily.csv

    Available here: https://covidtracking.com/api/v1/states/daily.csv

    Daily state-level COVID-19 testing data.

    Uploaded with Git LFS

    CountyHospitalCombined.csv, CovCountyHospitalTimeSeries.csv, and StateTestingTimeSeries.csv

    Interim data views that I created to hold cleaned data and used to build the final dataset.

    MasterTimeSeries.csv

    Final combined dataset: a time series of length (number of days) × 3142 (the number of U.S. counties plus DC), with variables stored as a proportion of population.

    Uploaded with Git LFS
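
    As a rough illustration of what "stored as a proportion of population" could mean, the sketch below joins county case counts to a census population table and divides. It is not the author's script. The county columns follow the NYT us-counties.csv header (date, county, state, fips, cases, deaths); the census columns STATE, COUNTY and POPESTIMATE2019 are assumed from the co-est2019-alldata layout, and the population value is a made-up placeholder.

```python
import csv
import io

# Tiny inline samples standing in for us-counties.csv and co-est2019-alldata.csv.
# The population figure is a placeholder, not the real census estimate.
COUNTY_CASES = """date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
"""

CENSUS = """STATE,COUNTY,POPESTIMATE2019
53,061,800000
"""


def load_population(census_csv):
    """Map 5-digit county FIPS -> population, built from state + county codes."""
    population = {}
    for row in csv.DictReader(io.StringIO(census_csv)):
        fips = row["STATE"].zfill(2) + row["COUNTY"].zfill(3)
        population[fips] = int(row["POPESTIMATE2019"])
    return population


def per_capita(cases_csv, population):
    """Yield one record per county-day with cases and deaths as a share of population."""
    for row in csv.DictReader(io.StringIO(cases_csv)):
        fips = row["fips"].zfill(5)
        if fips not in population:
            continue  # skip rows whose geography has no matching census entry
        n = population[fips]
        yield {
            "date": row["date"],
            "fips": fips,
            "cases_per_capita": int(row["cases"]) / n,
            "deaths_per_capita": int(row["deaths"]) / n,
        }


population = load_population(CENSUS)
records = list(per_capita(COUNTY_CASES, population))
```

    Rows without a matching FIPS code are skipped here for simplicity; how the original scripts handle them is not stated.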

    Code

    The python scripts have comments to explain which datasets they're responsible for generating.

    Feel free to use and edit them to tailor the datasets generated to your liking.

    There is also a helper function library in the main directory.

    Scripts can be run by calling >python

  14. NYT Full Text Articles 1985-2018

    • kaggle.com
    zip
    Updated Jan 24, 2025
    Cite
    SKparey (2025). NYT Full Text Articles 1985-2018 [Dataset]. https://www.kaggle.com/datasets/skparey/nyt-full-text-articles-195-2018
    Explore at:
    Available download formats
    zip (235705007 bytes)
    Dataset updated
    Jan 24, 2025
    Authors
    SKparey
    Description

    A collection of Job Market, Business, World, Business Day, and Technology articles from the New York Times. The dataset includes the full-text articles in CSV and JSON formats.

  15. New York Times Articles & Comments (2020)

    • kaggle.com
    zip
    Updated Jul 20, 2021
    Cite
    Benjamin Dornel (2021). New York Times Articles & Comments (2020) [Dataset]. https://www.kaggle.com/benjaminawd/new-york-times-articles-comments-2020
    Explore at:
    Available download formats
    zip (2091289314 bytes)
    Dataset updated
    Jul 20, 2021
    Authors
    Benjamin Dornel
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    The New York Times is one of the most popular online news platforms in the world. What sets the Times apart from other publications is the ability to engage and connect with its readers. Readers who visit the site can provide their thoughts and reactions to published content in the form of comments, and have been doing so increasingly over the last few years.

    Content

    This dataset contains all comments and articles from January 1, 2020 - December 31, 2020. The articles .csv file contains 16K+ articles with 11 features, and the comments .csv file contains nearly 5M comments with 23 features.

    Inspiration

    There's a lot you can do with this dataset, including:

    1. Predict the number of comments that an article will receive. You can use n_comments as a target variable or convert it to a binary classification variable. You can use the train / test .csv files for this task.
    2. Predict how many recommendations a comment will receive using recommendations as a target variable.
    3. Predict whether a comment will be selected as a Times Pick using editorsSelection as a target variable.
    4. Identify the most popular topics based on article headlines. You could try using something like KMeans clustering or Latent Dirichlet Allocation (LDA).
    5. Generate news headlines using a Long Short-Term Memory (LSTM) neural network.
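
    Converting n_comments into a binary classification target, as the first idea suggests, might look like the sketch below. The median threshold is an assumption made for illustration (the description does not say how to split), and the comment counts are invented sample values.

```python
from statistics import median


def binarize(counts, threshold=None):
    """Label each article 1 if its comment count is above the threshold, else 0.

    If no threshold is given, the median of the counts is used (an assumption).
    """
    if threshold is None:
        threshold = median(counts)
    return [1 if c > threshold else 0 for c in counts]


# Invented n_comments values for five hypothetical articles.
n_comments = [12, 340, 57, 980, 5]
labels = binarize(n_comments)
```

    A fixed threshold (for example, binarize(n_comments, threshold=100)) works the same way if a specific cut-off matters more than balanced classes.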

    Acknowledgements

    This data was accessed through the New York Times API with nytimes-scraper. A detailed look at the data cleaning process can be found here. I'd like to acknowledge two invaluable sources of inspiration: Aashita Kesarwani's 2018 dataset and The Analytics Edge 2015 competition.

  16. New York Times Comments

    • kaggle.com
    zip
    Updated May 2, 2018
    Cite
    Aashita Kesarwani (2018). New York Times Comments [Dataset]. https://www.kaggle.com/aashita/nyt-comments
    Explore at:
    Available download formats
    zip (502973613 bytes)
    Dataset updated
    May 2, 2018
    Authors
    Aashita Kesarwani
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    The New York Times has a wide audience and plays a prominent role in shaping people's opinions and outlook on current affairs, and in setting the tone of public discourse, especially in the USA. The comment section of its articles is very active and gives a glimpse of readers' take on the matters the articles cover.

    Content

    The data contains information about comments made on articles published in the New York Times in Jan-May 2017 and Jan-April 2018. The month-wise data is given in two CSV files, one for the articles on which comments were made and one for the comments themselves. The CSV files for comments contain over 2 million comments in total with 34 features, and those for articles contain 16 features on more than 9,000 articles.

    Inspiration

    The dataset is rich in information: it contains the comments' text, which is largely very well written, along with contextual information such as the section/topic of the article, as well as features indicating how well the comment was received by readers, such as editorsSelection and recommendations. This data can be used to understand and analyze the public mood.
    The exploratory kernel here can be used to review the features of the dataset, and the NB-Logistic model kernel for predicting NYT's pick can be used as a starter for building models on a range of ideas, some of which are:

    1. Predicting the number of upvotes a comment will receive using the feature recommendations as the target variable. With enough training data, the model can estimate how a hypothetical comment on a certain topic will be received by the community of NYT readers, which can serve as a tool to gauge public opinion. The design of this model will be very similar to those used to rank reviews by guessing how many upvotes they will receive.
    2. Predicting whether a comment will be editor's pick using feature editorsSelection as the target variable. It gives a clue to what NYT considers worth promoting.
    3. Based on a comment, guessing the topic (using sectionName and/or newDesk as the target variable) of the article.
    4. Predicting how likely it is for a comment to get replies (using replyCount feature as the target variable).
    5. Predicting how likely it is for an article to initiate discussion and get comments and upvotes as well as sentiment analysis of the comments' text.
    6. Predicting the same as above for topics (indicated by the features sectionName and/or newDesk).
    7. Analyzing the behavior of the top commenters, such as which topics they are most likely to comment on, and the sentiment of their comments.

    Data collection

    The Python package here, written to supplement this dataset, can be used to retrieve comments from a customized search of NYT articles on a specific topic (for example, the Iraq war or ObamaCare) within a given time frame. The tutorial here gives detailed information about using the package, with examples.

    Acknowledgements

    • The data was collected with the help of New York Times API to retrieve URL of the articles.
    • The URL that the package uses to retrieve comments for a given article is taken from the blog by Neal Caren.

  17. New York Times Crossword Clues & Answers 1993-2021

    • kaggle.com
    zip
    Updated Nov 4, 2021
    Cite
    hugequiz.com (2021). New York Times Crossword Clues & Answers 1993-2021 [Dataset]. https://www.kaggle.com/datasets/darinhawley/new-york-times-crossword-clues-answers-19932021
    Explore at:
    Available download formats
    zip (12086306 bytes)
    Dataset updated
    Nov 4, 2021
    Authors
    hugequiz.com
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    These are all clues and answers for words in the New York Times crossword from 11/21/93 through 10/31/21.

    This data was compiled to create these two quizzes:

    • https://hugequiz.com/quizzes/most-common-new-york-times-crossword-answers/
    • https://hugequiz.com/quizzes/most-common-new-york-times-crossword-answers-by-letter/

  18. News Articles

    • kaggle.com
    zip
    Updated May 6, 2018
    Cite
    harishaaram (2018). News Articles [Dataset]. https://www.kaggle.com/harishcscode/all-news-articles-from-home-page-media-house
    Explore at:
    Available download formats
    zip (327948548 bytes)
    Dataset updated
    May 6, 2018
    Authors
    harishaaram
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The data was collected from the home pages of various media houses to see which news outlets share or write articles with fewer gory words.

    Content

    The data was downloaded from the following websites between Oct 2017 and Nov 2017:

    1. "http://www.nytimes.com/"
    2. "http://www.foxnews.com/"
    3. "http://www.reuters.com/"
    4. "http://www.cnn.com/"
    5. "http://www.huffingtonpost.com/"
    

    Each folder is named using the mmddyyyy convention, and each CSV file is named after its media house (e.g., reuters.csv). The CSV has the following columns:

    • TITLE: the Title of the article.
    • SUMMARY: first few lines of the article's text.
    • TEXT: Full text inside the article
    • URL: web link to the article.
    • KEYWORDS: important words in the article.
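
    The folder and file layout described above can be read along the lines of the sketch below. It assumes a relative path of the form mmddyyyy/outlet.csv and a header row matching the five listed columns; the example path and the sample row are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from datetime import datetime
from pathlib import PurePosixPath


def parse_path(path):
    """Split '<mmddyyyy>/<outlet>.csv' into (download date, outlet name)."""
    p = PurePosixPath(path)
    day = datetime.strptime(p.parent.name, "%m%d%Y").date()
    return day, p.stem


def read_articles(csv_text):
    """Parse CSV text into a list of dicts keyed by the column names."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Invented one-row sample using the documented column names.
SAMPLE = """TITLE,SUMMARY,TEXT,URL,KEYWORDS
Example headline,First lines,Full article text,http://example.com/a,"example, text"
"""

day, outlet = parse_path("10152017/reuters.csv")
articles = read_articles(SAMPLE)
```

    With real files, the text passed to read_articles would come from opening the CSV on disk rather than from an inline string.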

    Acknowledgements

    This dataset is under CC0: public domain license.

    Inspiration

    All around the world, both good and bad things happen, and we get to know only those that are exposed to us. Exposing them is the primary responsibility of the media. But the bigger responsibility of these media houses lies in the way they express the content to the people.

    A responsible media house’s content should be original, unbiased, and free of exaggeration, and should be very sensitive in handling the emotions of its readers and viewers. The same story can be told in different ways, and these different ways can trigger different emotions among its readers.

    It is known that we become who we are by what we say and what we read. Reading a story that’s filled with positive words would make us feel more positive, and vice versa. So the wording of a piece of content plays as large a role as the content itself.

    This dataset stands as a sample for finding out which media house conveys the news in a more optimistic way!

  19. NYT Bestsellers 1931-2024 (Fiction/non-Fiction)

    • kaggle.com
    zip
    Updated Nov 26, 2024
    Cite
    Bryant Reese (2024). NYT Bestsellers 1931-2024 (Fiction/non-Fiction) [Dataset]. https://www.kaggle.com/datasets/bryantreese/nyt-bestsellers-1931-2024-fictionnon-fiction/versions/1
    Explore at:
    Available download formats
    zip (5604008 bytes)
    Dataset updated
    Nov 26, 2024
    Authors
    Bryant Reese
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    NYT Bestseller Books from 1931-2024

    Executive Summary

    This project creates and provides in tabular form a complete list of books from the New York Times Bestseller Lists (1931–2024) in the Fiction and Non-fiction categories. The motivation for this project was curiosity. As a reader, I wanted to see historic bestseller trends and also identify how many of the bestseller books I have read and which authors consistently appear on the NYT list. I also think this project provides a good opportunity for literary research in the future. I was surprised to find that only lists specific to one genre (fiction) or within a limited time frame had been created and made publicly available. This dataset is unique because it provides both fiction and non-fiction data, as well as some book descriptions, across the entire current (November 2024) history of the NYT bestseller list.

    The scraping and analysis were conducted using Python scripts to extract, clean, and process data from PDFs available on Hawes.com from Hawes Publications.

    Description of Data

    • Fiction Data (fiction_all.csv):

      • Contains bestseller data for books in the Fiction category.
      • Columns: Date, Rank, Title, Author, Publisher, Description, and Genre.
    • Non-fiction Data (non_fiction_all.csv):

      • Contains bestseller data for books in the Non-fiction category.
      • Columns: Same as Fiction data.
    • Merged Data (merged_genres.csv):

      • Combines Fiction and Non-fiction datasets with a Genre column to identify the category of each book.
    • Author Appearance Data (author_appearances.csv):

      • Groups data by authors, with a count of the number of times each author appeared on the list and their dominant genre.
    • Book Appearance Data (book_appearances.csv):

      • Groups data by book titles, with a count of the number of times each book appeared on the list and its associated genre.
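
    The author grouping described above could be computed roughly as in the sketch below. It is a sketch only, not the repository's code: the rows are placeholders shaped like merged_genres.csv (using the Author, Title and Genre columns), and "dominant genre" is assumed to mean the genre an author appears under most often.

```python
from collections import Counter, defaultdict

# Placeholder rows shaped like merged_genres.csv; not real bestseller data.
ROWS = [
    {"Author": "Author A", "Title": "Book 1", "Genre": "Fiction"},
    {"Author": "Author A", "Title": "Book 1", "Genre": "Fiction"},
    {"Author": "Author A", "Title": "Book 2", "Genre": "Non-fiction"},
    {"Author": "Author B", "Title": "Book 3", "Genre": "Non-fiction"},
]


def author_appearances(rows):
    """Count list appearances per author and pick each author's most frequent genre."""
    per_author = defaultdict(Counter)
    for row in rows:
        per_author[row["Author"]][row["Genre"]] += 1
    return {
        author: {
            "appearances": sum(genres.values()),
            "dominant_genre": genres.most_common(1)[0][0],
        }
        for author, genres in per_author.items()
    }


summary = author_appearances(ROWS)
```

    Grouping by the Title column instead of Author would give the per-book counts in the same way.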

    Power Analysis Results

    A power analysis was not applicable for my project because it does not involve hypothesis testing or sampling methodologies requiring statistical power computations.

    Exploratory Data Analysis

    Top 20 Authors by Number of List Appearances

    • A bar chart visualization highlights the most frequently appearing authors.
    • Key Insight: Fiction authors dominate this category, with 18 out of 20 authors classified as Fiction. Danielle Steel and Stephen King appeared most frequently. I would hypothesize that fiction authors appeared the most because they often create series with strong fanbase retention, which would push multiple books onto the bestseller list over multiple weeks.

    Top 20 Books by Number of List Appearances

    • A bar chart visualization displays the most frequently appearing books.
    • Key Insight: Non-fiction books are more frequent in this category, with 13 out of 20 entries being Non-fiction. Books turned into movies, like Tuesdays with Morrie and Unbroken, are at the top of list longevity. It is certainly interesting that more non-fiction books are able to last on the list for more weeks, considering fiction authors' dominance over total weeks on the bestseller lists.

    The bar chart visualizations are saved as interactive HTML files (authors.html and books.html) and can be found in this repository.

    Link to Code Repository

    All scripts for data cleaning, preprocessing, and visualization are publicly available in the GitHub repository: https://github.com/breese5/NYTBestseller1931-2024

    Ethics Statement

    This dataset and analysis were developed for educational and exploratory purposes. While efforts were made to ensure the accuracy of the data, there may be inconsistencies introduced during preprocessing or due to the nature of scraping from PDFs converted to TXT files. Some additional manual cleaning to clear out abnormal spacing could improve the dataset; however, this was difficult given its size.

    The dataset:

    • Reflects historical bestseller lists, so it is not necessarily representative of the "best" books, which is a matter of opinion.

    Open Source License

    This dataset and all associated scripts are released under the MIT License, allowing for open use, modification, and sharing with proper attribution.

