The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time.
https://github.com/nytimes/covid-19-data
U.S. National-Level Data
The daily number of cases and deaths nationwide, including states, U.S. territories and the District of Columbia, can be found in the covid_us table. (Raw CSV file here.)
date,cases,deaths
2020-01-21,1,0
...
State-Level Data
State-level data can be found in the covid_us_states table. (Raw CSV file here.)
date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
...
County-Level Data
County-level data can be found in the covid_us_counties table. (Raw CSV file here.)
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
...
In some cases, the geographies where cases are reported do not map to standard county boundaries. See the list of geographic exceptions for more detail on these.
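As a minimal sketch of working with this data (a couple of in-memory rows in the us-counties.csv schema stand in for the full download), the file can be parsed with Python's standard csv module, keeping fips as a string so leading zeros survive:

```python
import csv
import io

# Two sample rows in the us-counties.csv schema; in practice, read the raw
# file from https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
sample = """date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Convert counts to integers; fips stays a string to preserve leading zeros
    row["cases"] = int(row["cases"])
    row["deaths"] = int(row["deaths"])

print(rows[0]["county"], rows[0]["cases"])  # Snohomish 1
```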
https://github.com/nytimes/covid-19-data/blob/master/LICENSE
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
From the New York Times GitHub source (CSV, US counties): "The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
United States Data
Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.
Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information."
The specific data here is the data per US county.
The CSV link for counties is: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
NEW: We are publishing the data behind our excess deaths tracker in order to provide researchers and the public with a better record of the true toll of the pandemic. This data is compiled from official national and municipal data for 24 countries. See the data and documentation in the excess-deaths/ directory.
[ U.S. Data (Raw CSV) | U.S. State-Level Data (Raw CSV) | U.S. County-Level Data (Raw CSV) ]
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography.
The historical data files are at the top level of the directory and contain data up to, but not including the current day. The live data files are in the live/ directory.
A key difference between the historical and live files is that the numbers in the historical files are the final counts at the end of each day, while the live files contain figures that may be a partial count released during the day and cannot necessarily be considered the final, end-of-day tally.
The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties.
Each row of data reports the cumulative number of coronavirus cases and deaths based on our best reporting up to the moment we publish an update. Our counts include both laboratory-confirmed and probable cases, using criteria developed by states and the federal government. Not all geographies report probable cases, and some provide confirmed and probable cases only as a single total. Please read here for a full discussion of this issue.
We do our best to revise earlier entries in the data when we receive new information. If a county is not listed for a date, then there were zero reported confirmed cases and deaths.
State and county files contain FIPS codes, a standard geographic identifier, to make it easier for an analyst to combine this data with other data sets like a map file or population data.
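A tiny illustration of such a join, using made-up population and case figures keyed by FIPS code (the real inputs would come from the county file and, say, a census population table):

```python
# Illustrative (not real) cumulative case counts keyed by FIPS code
cases = {"53061": 1200, "06037": 45000}

# Illustrative county populations keyed by the same FIPS codes
population = {"53061": 822083, "06037": 10039107}

# Join the two data sets on the shared FIPS key: cases per 100,000 residents
per_100k = {
    fips: cases[fips] / population[fips] * 100_000
    for fips in cases
    if fips in population
}
print(per_100k)
```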
Download all the data or clone this repository by clicking the green "Clone or download" button above.
The daily number of cases and deaths nationwide, including states, U.S. territories and the District of Columbia, can be found in the us.csv file. (Raw CSV file here.)
date,cases,deaths
2020-01-21,1,0
...
State-level data can be found in the states.csv file. (Raw CSV file here.)
date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
...
County-level data can be found in the counties.csv file. (Raw CSV file here.)
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
...
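Because each row reports cumulative counts, daily new cases are recovered by differencing consecutive days for a given geography; a minimal sketch with toy numbers (note that revisions to earlier entries can make real-world differences negative):

```python
# Toy cumulative case counts for one geography, ordered by date
cumulative = [1, 1, 2, 5, 5, 9]

# Daily new cases: day-over-day differences, with the first day kept as-is
daily = [cumulative[0]] + [
    today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])
]
print(daily)  # [1, 0, 1, 3, 0, 4]
```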
https://creativecommons.org/publicdomain/zero/1.0/
The New York Times is one of the most widely read online news platforms in the world, notable for how actively it engages and connects with its readers. This dataset offers a wealth of information, presenting a valuable opportunity to analyze the extensive collection of news articles available through The New York Times and to study news trends and patterns.
This dataset contains a comprehensive collection of articles from The New York Times, spanning from January 1, 2000, to the present day. The dataset, titled "**The New York Times Articles Metadata**," includes over 2.1 million articles, capturing a vast range of topics and stories. It is updated daily, ensuring that the latest articles from The New York Times are included and providing an up-to-date and evolving resource for analysis. If you want to know how I update the dataset daily, you can refer to my notebook, Scraping New York Times Articles (Daily Updated), for the code template.
The dataset includes key features:
1. Abstract: A brief summary of the article's content.
2. Web URL: The article's web address.
3. Headline: The title or heading of the article.
4. Keywords: Tags associated with the article, providing insights into its content.
5. Pub Date: The publication date of the article.
6. News Desk: The department responsible for the article.
7. Section Name: The section or category of the article.
8. Byline: The author or authors of the article.
9. Word Count: The number of words in the article.
And many more features...
This dataset opens up various possibilities for analysis and exploration, such as:
These are just a few examples to inspire you. Enjoy exploring the rich dataset and discovering valuable insights from The New York Times articles!
Collected COVID-19 datasets from various sources as part of DAAN-888 course, Penn State, Spring 2022. Collaborators: Mohamed Abdelgayed, Heather Beckwith, Mayank Sharma, Suradech Kongkiatpaiboon, and Alex Stroud
**1 - COVID-19 Data in the United States**
Source: The data is collected from multiple public health official sources by NY Times journalists and compiled in one single file.
Description: Daily count of new COVID-19 cases and deaths for each state. Data is updated daily and runs from 1/21/2020 to 2/4/2022.
URL: https://github.com/nytimes/covid-19-data/blob/master/us-states.csv
Data size: 38,814 rows and 5 columns.
**2 - Mask-Wearing Survey Data**
Source: The New York Times is releasing estimates of mask usage by county in the United States.
Description: This data comes from a large number of interviews conducted online by the global data and survey firm Dynata, at the request of The New York Times. The firm asked a question about mask usage to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level.
URL: https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv
Data size: 3,142 rows and 6 columns
**3a - Vaccine Data – Global**
Source: This data comes from the US Centers for Disease Control and Prevention (CDC), Our World in Data (OWiD) and the World Health Organization (WHO).
Description: Time series data of vaccine doses administered and the number of fully and partially vaccinated people by country. This data was last updated on February 3, 2022.
URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/global_data/time_series_covid19_vaccine_global.csv
Data Size: 162,521 rows and 8 columns
**3b - Vaccine Data – United States**
Source: The data comprises individual states' public dashboards and data from the US Centers for Disease Control and Prevention (CDC).
Description: Time series data of the total vaccine doses shipped and administered by manufacturer, the dose number (first or second) by state. This data was last updated on February 3, 2022.
URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/us_data/time_series/vaccine_data_us_timeline.csv
Data Size: 141,503 rows and 13 columns
**4 - Testing Data**
Source: The data comprises individual states' public dashboards and data from the U.S. Department of Health & Human Services.
Description: Time series data of total tests administered by county and state. This data was last updated on January 25, 2022.
URL: https://github.com/govex/COVID-19/blob/master/data_tables/testing_data/county_time_series_covid19_US.csv
Data size: 322,154 rows and 8 columns
**5 – US State and Territorial Public Mask Mandates**
Source: Data from state and territory executive orders, administrative orders, resolutions, and proclamations is gathered from government websites and cataloged and coded by one coder using Microsoft Excel, with quality checking provided by one or more other coders.
Description: US State and Territorial Public Mask Mandates from April 10, 2020 through August 15, 2021, by county by day.
URL: https://data.cdc.gov/Policy-Surveillance/U-S-State-and-Territorial-Public-Mask-Mandates-Fro/62d6-pm5i
Data Size: 1,593,869 rows and 10 columns
**6 – Case Counts & Transmission Level**
Source: This open-source dataset contains seven data items that describe community transmission levels across all counties. It provides the same numbers used to show transmission maps on the COVID Data Tracker and contains reported daily transmission levels at the county level. The dataset is updated every day to include the most current day's data. Calculation procedures are used to classify the transmission level as low, moderate, substantial, or high.
Description: US State and County case counts and transmission level from 16-Aug-2021 to 03-Feb-2022
URL: https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-County-Level-of-Community-T/8396-v7yb
Data Size: 550,702 rows and 7 columns
**7 - World Cases & Vaccination Counts**
Source: This is an open-source dataset collected and maintained by Our World in Data (OWID), which provides research and data to help address the world's largest problems.
Description: This dataset includes vaccinations, tests & positivity, hospital & ICU, confirmed cases, confirmed deaths, reproduction rate, policy responses and other variables of interest.
URL: https://github.com/owid/covid-19-data/tree/master/public/data
Data Size: 157,000 rows and 67 columns
**8 - COVID-19 Data in the European Union**
Source: This is an open-source dataset collected and maintained by the ECDC, an EU agency aimed at strengthening Europe's defenses against infectious diseases.
Description: This dataset co...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OpenAI's text-embedding-ada-002 embeddings aren't the best out there, but they're certainly easy and cheap to obtain. Here are such vectors in a CSV file for 9100 New York Times (NYT) articles from January 2022 to mid-April 2022.
Column definitions:
- id: An identifier for the article. If you go to https://twitter.com/nytimes/status/{id}, you will find a Twitter tweet that references the NYT article. Like, retweet, and reply statistics for each associated tweet can be found at https://www.kaggle.com/datasets/dilwong/newspopularity
- title: The title of the news article
- full_url: A URL to the NYT article
- comments: If comments are enabled for the article, the number of comments
- has_video: Does the article have a video?
- has_audio: Does the article have audio?
- n_tokens: Number of cl100k_base tokens in the article
- embedding: A 1536-dimensional list of floats that provides a semantic representation of the article
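As a rough sketch of how such embeddings are typically compared, here is cosine similarity over toy 4-dimensional vectors standing in for the 1536-dimensional ada-002 ones:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for two articles' embedding vectors
article_a = [0.1, 0.3, -0.2, 0.4]
article_b = [0.1, 0.25, -0.1, 0.5]
print(round(cosine_similarity(article_a, article_b), 3))  # 0.966
```

A value near 1 indicates semantically similar articles; near 0, unrelated ones.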
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Publicly available geocoded social determinants of health and mobility datasets used in the analysis of "Chronic Acid Suppression and Social Determinants of COVID-19 Infection". These datasets are required for the analytical workflow shared on GitHub, which demonstrates how the analysis in the manuscript was done using randomly generated samples to protect patient privacy.
- zcta_county_rel_10.txt: Population and housing density from the 2010 decennial census. Obtained from: https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt
- cre-2018-a11.csv: Community Resilience Estimates, which measure the capacity of individuals and households to absorb, endure, and recover from the health, social, and economic impacts of a disaster such as a hurricane or pandemic. Obtained from: https://www.census.gov/data/experimental-data-products/community-resilience-estimates.html
- zcta_tract_rel_10.txt: Relationship between ZCTA and US Census tracts (used to map census tracts to ZCTA). Obtained from: https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622
- mask-use-by-county.txt: Mask Use By County comes from a large number of interviews conducted online by the global data and survey firm Dynata at the request of The New York Times. The firm asked a question about mask use to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. Obtained from: https://github.com/nytimes/covid-19-data/tree/master/mask-use
- mobility_report_US.txt: Google mobility report, which charts movement trends over time by geography, across categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. Obtained from: https://github.com/ActiveConclusion/COVID19_mobility/blob/master/google_reports/mobility_report_US.csv
- ACS2015_zctaallvars.csv: Social Deprivation Index, a composite measure of area-level deprivation based on seven demographic characteristics collected in the American Community Survey (https://www.census.gov/programs-surveys/acs/) and used to quantify socio-economic variation in health outcomes. Factors are: Income, Education, Employment, Housing, Household Characteristics, Transportation, Demographics. Obtained from: https://www.graham-center.org/rgc/maps-data-tools/sdi/social-deprivation-index.html
This is the first version of the English dataset for VecTop, containing more than 250k articles (2018-10-01 through 2023-10-23) from the NY Times, embedded with OpenAI's text-embedding-ada-002. This corpus is used within VecTop to extract the topics and subtopics of a given text. Please refer to the GitHub page for more information, and to the live demo here for quick evaluation.
This dataset is also supplied as a PostgreSQL backup. It is advisable to import the dataset into a proper database with vector functionality; see the GitHub repo for instructions.
A German version with Spiegel Online has already been released here.
Given a small or large chunk of text, it is useful to categorize it into topics. VecTop uses this dataset within a PostgreSQL database to first summarize the unlabeled text (if it is determined to be too long) and then create embeddings of it. These embeddings are compared against the dataset, and VecTop determines the topics and subtopics by looking at the labels of the closest embeddings under cosine similarity. As a result, the text is categorized into topics and subtopics.
The dataset can be used to search for similarities in texts.
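A toy sketch of that nearest-neighbor lookup, with made-up 3-dimensional vectors and topic labels rather than VecTop's actual corpus or schema:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy labeled corpus of (topic, embedding) pairs; the real entries are
# 1536-dimensional ada-002 vectors stored in PostgreSQL
corpus = [
    ("politics", [0.9, 0.1, 0.0]),
    ("sports",   [0.0, 0.9, 0.2]),
    ("science",  [0.1, 0.1, 0.9]),
]

def nearest_topic(query):
    # Return the topic whose embedding has the highest cosine similarity
    return max(corpus, key=lambda item: cosine(query, item[1]))[0]

print(nearest_topic([0.2, 0.0, 0.8]))  # science
```

In the actual PostgreSQL setup, a vector extension would perform this comparison inside the database rather than in application code.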
Legal: VecTop will be used to research legal activities. For that, a legal corpus is being built. (Coming soon)
VecTop, and therefore this dataset, is licensed under the Apache-2.0 license.
This dataset comprises current news headlines, links, descriptions, publication dates and categories collected from the RSS feeds of Sky News and The New York Times, spanning a wide range of categories.
It includes content from Home, UK, World, US, Business, Politics, Technology, Entertainment, Odd News, Sports, Science, Health, Arts, Job Listings, Most Viewed, Sunday Review, and Television.
This dataset is a resource for news analysis, tracking content trends, media research, and projects in artificial intelligence and natural language processing. Each entry contains the headline, URL, brief description, publication date, and category of the related news item.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a CSV file containing Twitter retweet, reply, and like counts for 9100 New York Times (NYT) articles from January 2022 to mid-April 2022. These counts can be used as a measure of how popular an individual article is.
Column definitions:
- id: Twitter ID. The tweet from which the retweet, reply, and like counts were obtained is at https://twitter.com/nytimes/status/{id}
- retweet_count: Number of retweets
- reply_count: Number of replies
- like_count: Number of likes
- url: A URL to the NYT article
- date: Timestamp for the tweet
- bag_of_phrases: A list of the words/phrases that appear in the NYT article. The text of each article is stored in the CSV file as a bag of lemmatized words, but since some words tend to occur together, those words are instead stored as phrases in which the constituent words are separated by underscores (e.g. "european_union").
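A quick illustration of undoing that encoding, using a made-up bag (a real CSV cell would first need to be parsed into a Python list):

```python
# A hypothetical bag_of_phrases value: multi-word phrases are underscore-joined
bag = ["european_union", "economy", "interest_rate"]

# Split each phrase back into its constituent words
words = [word for phrase in bag for word in phrase.split("_")]
print(words)  # ['european', 'union', 'economy', 'interest', 'rate']
```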
Credit for the photograph here: https://unsplash.com/photos/WYd_PkCa1BY
Code for the data scraping here: https://github.com/dilwong/NewsPopularity/blob/master/0%20Data%20Scraping.ipynb
This dataset was created by Venkatesh Vaishnav
Released under Other (specified in description)
This data was collected and created for a project in a data science course I took in college in the Spring of 2020. I have updated the data to include more dates into the summer and decided to share it and the code so others can explore it.
Available here: https://hifld-geoplatform.opendata.arcgis.com/datasets/hospitals
Information on hospitals in the United States.
Available here: https://github.com/nytimes/covid-19-data
Daily COVID-19 case and death data for US counties.
Available here: https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/
Data sheet available here: https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/co-est2019-alldata.pdf
2019 county level census estimates.
Available here: https://covidtracking.com/api/v1/states/daily.csv
Daily state-level COVID-19 testing data.
Uploaded with Git LFS
Interim data views created by me to hold cleaned data and used to create the final dataset.
Final combined dataset: a days × 3,142 (number of US counties + DC) time series with variables stored as a proportion of population.
Uploaded with Git LFS
The python scripts have comments to explain which datasets they're responsible for generating.
Feel free to use and edit them to tailor the datasets generated to your liking.
There is also a helper function library in the main directory.
Scripts can be run by calling >python
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The New York Times is one of the most popular online news platforms in the world. What sets the Times apart from other publications is the ability to engage and connect with its readers. Readers who visit the site can provide their thoughts and reactions to published content in the form of comments, and have been doing so increasingly over the last few years.
This dataset contains all comments and articles from January 1, 2020 - December 31, 2020. The articles .csv file contains 16K+ articles with 11 features, and the comments .csv file contains nearly 5M comments with 23 features.
There's a ton of things you can do with this dataset, including:
1. Predict the number of comments that an article will receive -- you can use n_comments as a target variable or convert it to a binary classification variable. You can use the train/test .csv files for this task.
2. Predict how many recommendations a comment will receive using recommendations as a target variable.
3. Predict whether a comment will be selected as a Times Pick using editorsSelection as a target variable.
4. Identify the most popular topics based on article headlines -- you could try using something like KMeans clustering or Latent Dirichlet Allocation (LDA) clustering.
5. Generate news headlines using a Long Short-Term Memory (LSTM) neural network.
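For the first task, for example, the comment count can be reduced to a binary target; a minimal sketch with hypothetical values:

```python
# Hypothetical n_comments values for six articles
n_comments = [0, 12, 3, 250, 0, 41]

# Binary classification target: did the article receive any comments?
has_comments = [1 if n > 0 else 0 for n in n_comments]
print(has_comments)  # [0, 1, 1, 1, 0, 1]
```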
This data was accessed through the New York Times API with nytimes-scraper. A detailed look at the data cleaning process can be found here. I'd like to acknowledge two invaluable sources of inspiration -- Aashita Kersawani's 2018 dataset, and The Analytics Edge 2015 competition.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The New York Times has a wide audience and plays a prominent role in shaping people's opinions and outlook on current affairs, and in setting the tone of public discourse, especially in the USA. The comment section in the articles is very active, and it gives a glimpse of readers' take on the matters concerning the articles.
The data contains information about the comments made on the articles published in New York Times in Jan-May 2017 and Jan-April 2018. The month-wise data is given in two csv files - one each for the articles on which comments were made and for the comments themselves. The csv files for comments contain over 2 million comments in total with 34 features and those for articles contain 16 features about more than 9,000 articles.
The data set is rich in information, containing comments' texts, which are largely very well written, along with contextual information such as the section/topic of the article, as well as features indicating how well the comment was received by the readers, such as editorsSelection and recommendations. This data can serve the purpose of understanding and analyzing the public mood.
The exploratory kernel here can be used for a review of the features of the dataset and the NB-Logistic model kernel for predicting NYT's pick can be used as a starter for building models on a range of ideas, some of which are:
- recommendations as the target variable. With enough training data for the model, we can guess how a hypothetical comment on a certain topic will be received by the community of NYT readers, and this can be considered a tool to gauge public opinion. The design of this model will be very similar to the ones used in ranking reviews by guessing how many upvotes the reviews will receive.
- editorsSelection as the target variable. It gives a clue to what NYT considers worth promoting.
- sectionName and/or newDesk as the target variable) of the article.
- replyCount feature as the target variable).
- sectionName and/or newDesk).

The Python package here, written to supplement this dataset, can be used to retrieve comments from a customized search of the NYT articles concerning a specific topic, for example the Iraq war or ObamaCare, in a given timeline. The tutorial here gives detailed information about the use of the package with the help of examples.
https://creativecommons.org/publicdomain/zero/1.0/
These are all clues and answers for words in the New York Times crossword from 11/21/93 through 10/31/21.
This data was compiled to create these two quizzes: https://hugequiz.com/quizzes/most-common-new-york-times-crossword-answers/ https://hugequiz.com/quizzes/most-common-new-york-times-crossword-answers-by-letter/
https://creativecommons.org/publicdomain/zero/1.0/
The data is collected from various media houses' home pages to see which news media write articles with fewer gory words.
The data source comprises these websites, downloaded over the period Oct 2017 to Nov 2017:
1. "http://www.nytimes.com/"
2. "http://www.foxnews.com/"
3. "http://www.reuters.com/"
4. "http://www.cnn.com/"
5. "http://www.huffingtonpost.com/"
Each folder is named in the mmddyyyy convention, and each CSV file has the media house name as the file name (e.g., reuters.csv). The CSV has the following columns:
- TITLE: the title of the article.
- SUMMARY: the first few lines of the article's text.
- TEXT: the full text of the article.
- URL: web link to the article.
- KEYWORDS: important words in the article.

This dataset is under the CC0: Public Domain license.
All around the world, both good and bad things happen, and we get to know only those that are exposed to us. Exposing them is the primary responsibility of the media. But the bigger responsibility of these media houses is the way in which they express the content to the people.
A responsible media house's content should be original, unbiased, free of exaggeration, and very sensitive in handling the emotions of its readers and viewers. The same story could be told in different ways, and these different ways could trigger different emotions among its readers.
It is known that we become who we are by what we say and what we read. Reading a story filled with positive words makes us feel more positive, and vice versa. So the wording of a piece plays as great a role as the content itself.
This dataset stands as a sample to find out which media house conveys the news in a more optimistic way.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This project creates and provides, in tabular form, a complete list of books from the New York Times Bestseller Lists (1931–2024) in the Fiction and Non-fiction categories. The motivation for this project was curiosity. As a reader, I wanted to see historic bestseller trends, identify how many of the bestseller books I have read, and see which authors consistently appear on the NYT list. I also think this project provides a good opportunity for literary research in the future. I was surprised to find that only lists specific to one genre (fiction) or within a limited time frame had been created and made publicly available. This dataset is unique because it provides both fiction and non-fiction data, as well as some book descriptions, across the entire current (November 2024) history of the NYT bestseller list.
The scraping and analysis were conducted using Python scripts to extract, clean and process data from PDFs available on Hawes.com, from Hawes Publications.
Fiction Data (fiction_all.csv):
Date, Rank, Title, Author, Publisher, Description, and Genre.

Non-fiction Data (non_fiction_all.csv):

Merged Data (merged_genres.csv):
Genre column to identify the category of each book.

Author Appearance Data (author_appearances.csv):

Book Appearance Data (book_appearances.csv):
A power analysis was not applicable for my project because it does not involve hypothesis testing or sampling methodologies requiring statistical power computations.
The bar chart visualizations are saved as interactive HTML files (authors.html and books.html) and can be found in this repository.
All scripts for data cleaning, preprocessing, and visualization are publicly available in the GitHub repository: https://github.com/breese5/NYTBestseller1931-2024
This dataset and analysis were developed for educational and exploratory purposes. While efforts were made to ensure the accuracy of the data, there may be inconsistencies introduced during preprocessing or due to the nature of scraping from PDF-turned-TXT files. Some additional manual cleaning to clear out abnormal spacing could improve the dataset; however, this was difficult given its size.
The dataset reflects historical bestseller lists, so it is not necessarily representative of the "best" books; that is a matter of opinion.
This dataset and all associated scripts are released under the MIT License, allowing for open use, modification, and sharing with proper attribution.