100+ datasets found

Coronavirus (Covid-19) Data in the United States
github.com
openicpsr.org
+2more
csv
Cite
New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
Available download formats: csv
Dataset provided by
New York Times
License
https://github.com/nytimes/covid-19-data/blob/master/LICENSE
Description
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
Manually Labeled Data Set for the Ongoing Event Detection Task (2,200 news...
ri.conicet.gov.ar
datosdeinvestigacion.conicet.gov.ar
+2more
Updated Apr 19, 2023
Cite
Maisonnave, Mariano; Delbianco, Fernando Andrés; Tohmé, Fernando Abel; Maguitman, Ana Gabriela (2023). Manually Labeled Data Set for the Ongoing Event Detection Task (2,200 news extracts from the NYT Annotated Corpus with manually labeled ongoing event triggers) [Dataset]. http://doi.org/10.17632/7d54rvzxkr.1
Unique identifier
https://doi.org/10.17632/7d54rvzxkr.1
Dataset updated
Apr 19, 2023
Authors
Maisonnave, Mariano; Delbianco, Fernando Andrés; Tohmé, Fernando Abel; Maguitman, Ana Gabriela
License
Attribution-NonCommercial-ShareAlike 2.5 (CC BY-NC-SA 2.5) https://creativecommons.org/licenses/by-nc-sa/2.5/
License information was derived automatically
Dataset funded by
Universidad Nacional del Sur
Description
This is a manually labeled data set for the task of Event Detection (ED). The task of ED consists of identifying event triggers: the words that most clearly indicate the occurrence of an event. The data set consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).

Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish those events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In our data set we labeled as events only the first type. Based on this criterion, some words that are typically considered events are labeled as non-event triggers if they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit" because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.

Further information: For a more detailed description of the data set and the data collection process please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.

Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files. The second folder contains 200 XML files. Each XML file has the following format. YYYYMMDDTHHMMSS ... ... ... The first three tags (pubdate, file-id and sent-idx) contain metadata. The first one is the publication date of the news article that contained the text extract. The next two tags together form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx is the index that identifies the text extract inside the full article. The last tag (sentence) marks the beginning and end of the text extract. Inside that text are the trigger tags; each of them surrounds one word that was manually labeled as an event trigger.
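As a rough illustration of the format described above, the following Python sketch reads one extract with the standard library. The tag names pubdate, file-id, sent-idx and sentence come from the description; the root tag `extract`, the trigger tag `event`, and the pubdate, file-id and sent-idx values in the sample are assumptions made up for this example, since the listing does not show them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file. The <extract> root, the <event> trigger tag and the
# metadata values are assumptions for illustration; only pubdate, file-id,
# sent-idx and sentence are named in the dataset description.
SAMPLE = (
    "<extract>"
    "<pubdate>19870101T000000</pubdate>"
    "<file-id>12345</file-id>"
    "<sent-idx>3</sent-idx>"
    "<sentence>devaluation is not a realistic option to the current account "
    "<event>deficit</event> since it would only contribute to weakening the "
    "credibility of economic policies as it did during the last crisis.</sentence>"
    "</extract>"
)


def parse_extract(xml_text, trigger_tag="event"):
    """Return metadata, plain text and labeled trigger words for one extract."""
    root = ET.fromstring(xml_text)
    sentence = root.find("sentence")
    return {
        "pubdate": root.findtext("pubdate"),
        "file_id": root.findtext("file-id"),
        "sent_idx": int(root.findtext("sent-idx")),
        # itertext() walks text and tails in document order, so the
        # trigger markup is dropped while the wording is preserved.
        "text": "".join(sentence.itertext()),
        "triggers": [el.text for el in sentence.iter(trigger_tag)],
    }


record = parse_extract(SAMPLE)
print(record["triggers"])
```

If the real files use a different trigger tag, passing its name as `trigger_tag` should be the only change needed.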
The New York Times US Coronavirus Database
console.cloud.google.com
Updated Jun 26, 2020
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:The%20New%20York%20Times&inv=1&invt=Ab3YfA (2020). The New York Times US Coronavirus Database [Dataset]. https://console.cloud.google.com/marketplace/product/the-new-york-times/covid19_us_cases
Dataset updated
Jun 26, 2020
Dataset provided by
Google http://google.com/
Area covered
United States
Description
This is the US Coronavirus data repository from The New York Times. This data includes COVID-19 cases and deaths reported by state and county. The New York Times compiled this data based on reports from state and local health agencies. More information on the data repository is available here. For additional reporting and data visualizations, see The New York Times’ U.S. coronavirus interactive site. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery. This dataset has significant public interest in light of the COVID-19 crisis. All bytes processed in queries against this dataset will be zeroed out, making this part of the query free. Data joined with the dataset will be billed at the normal rate to prevent abuse. After September 15, queries over these datasets will revert to the normal billing rate. Users of The New York Times public-use data files must comply with data use restrictions to ensure that the information will be used solely for noncommercial purposes.
The New York Times Coronavirus (Covid-19) Cases and Deaths in the United...
data.amerigeoss.org
csv
Updated Jun 5, 2025
+ more versions
Cite
UN Humanitarian Data Exchange (2025). The New York Times Coronavirus (Covid-19) Cases and Deaths in the United States [Dataset]. https://data.amerigeoss.org/es/dataset/nyt-covid-19-data
Available download formats: csv
Dataset updated
Jun 5, 2025
Dataset provided by
UN Humanitarian Data Exchange
Area covered
United States
Description
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

United States Data

Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.

Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information.

Both files contain FIPS codes, a standard geographic identifier, to make it easier for an analyst to combine this data with other data sets like a map file or population data.

State-Level Data

State-level data can be found in the us-states.csv file.

date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
...

County-Level Data

County-level data can be found in the us-counties.csv file.

date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
...

In some cases, the geographies where cases are reported do not map to standard county boundaries. See the list of geographic exceptions for more detail on these.
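Because each row holds a cumulative count, per-day figures have to be derived by differencing. The Python sketch below shows one way to do that for the county file layout, using an inline sample instead of a download. Only the first data row comes from the description above; the other two rows are invented for illustration, and `daily_new_cases` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# First data row is from the dataset description; the next two rows are
# made-up values used only to demonstrate the differencing step.
SAMPLE_CSV = """date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
2020-01-23,Snohomish,Washington,53061,3,0
"""


def daily_new_cases(csv_text):
    """Turn cumulative case counts into per-day increases for each FIPS code."""
    rows = sorted(
        csv.DictReader(io.StringIO(csv_text)),
        key=lambda r: (r["fips"], r["date"]),  # ISO dates sort correctly as text
    )
    previous = defaultdict(int)  # last cumulative count seen per FIPS code
    result = []
    for row in rows:
        fips = row["fips"]  # kept as a string so leading zeros survive
        cases = int(row["cases"])
        result.append((row["date"], fips, cases - previous[fips]))
        previous[fips] = cases
    return result


for date, fips, new_cases in daily_new_cases(SAMPLE_CSV):
    print(date, fips, new_cases)
```

Since earlier entries may be revised, a difference can come out negative on real data; rows that fall under the geographic exceptions may also need separate handling before joining on FIPS codes.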

GitHub Repository

This dataset contains COVID-19 data for the United States of America made available by The New York Times on GitHub at https://github.com/nytimes/covid-19-data
Dataset of books called The get over : a short story prequel to the New York...
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called The get over : a short story prequel to the New York Times bestselling novel Monster [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=The+get+over+%3A+a+short+story+prequel+to+the+New+York+Times+bestselling+novel+Monster
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is The get over : a short story prequel to the New York Times bestselling novel Monster. It features 7 columns including author, publication date, language, and book publisher.
Supporting Data For "Journalism, advertising, or something in between? How...
datahub.hku.hk
Updated Sep 15, 2021
Cite
Samantha Marie Stanley (2021). Supporting Data For "Journalism, advertising, or something in between? How The School of the New York Times Teaches Journalists About Native Advertising" [Dataset]. http://doi.org/10.25442/hku.16532811.v1
Unique identifier
https://doi.org/10.25442/hku.16532811.v1
Dataset updated
Sep 15, 2021
Dataset provided by
HKU Data Repository
Authors
Samantha Marie Stanley
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset supplements the manuscript, “Journalism, advertising, or something in between? How The School of the New York Times Teaches Journalists About Native Advertising” submitted as part of the PhD requirement in Social Sciences.

Part 1
Dataset 1.1: Blank Interview Questionnaire (Raw) - This is the questionnaire and interviewer instructions used in the expert interview. The questionnaire has been anonymized to protect the identity of the interviewee.
Dataset 1.2: Interview Transcript (Processed) - This is the full transcript of the expert interview, which was transcribed using Otter.ai and edited by the principal investigator. The transcript has been anonymized to protect the identity of the interviewee.

Part 2
Dataset 2.1: Blank Field Note Worksheet (Raw) - This worksheet was copied and used for each participant observation session.
Dataset 2.2: Field Note Workbook (Processed) - This is the combined set of participant observation field notes generated as part of the study.

Part 3
Dataset 3.1: Course Video Files (Raw) - This file is a compilation of all 64 videos offered in the course studied. This data is restricted to protect the identity of the course instructor, who appears in each video and, in most videos, throughout the full length.
Dataset 3.2: Course Video Transcripts (Processed) - This is the combined transcript of all 64 videos offered in the course studied, which was transcribed using Otter.ai and edited by the principal investigator. The transcript has been anonymized to protect the identity of the course instructor, who is not a participant in this study.

Attachment 1: Human Research Ethics Committee Application
Attachment 2: Human Research Ethics Committee Approval Letter
Attachment 3: Study Participation Consent Form
Attachment 4: Data Management Plan
The New York Times Annotated Corpus
data.wu.ac.at
abacus.library.ubc.ca
+1more
Updated Oct 10, 2013
+ more versions
Cite
Global (2013). The New York Times Annotated Corpus [Dataset]. https://data.wu.ac.at/schema/datahub_io/ZWMzMGMzOTMtZjQwNS00ZDM1LTlkNjktNGYyYzBhMjhlZWM3
Dataset updated
Oct 10, 2013
Dataset provided by
Global
Description
About

From website:

The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:

Over 1.8 million articles (excluding wire services articles that appeared during the covered period).

Over 650,000 article summaries written by library scientists.

Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.

Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com.

Java tools for parsing corpus documents from .xml into a memory resident object.

As part of the New York Times' indexing procedures, most articles are manually summarized and tagged by a staff of library scientists. This collection contains over 650,000 article-summary pairs which may prove to be useful in the development and evaluation of algorithms for automated document summarization. Also, over 1.5 million documents have at least one tag. Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance if one article mentions "Bill Clinton" and another refers to "President William Jefferson Clinton", both articles will be tagged with "CLINTON, BILL".

The New York Times has established a community website for researchers working on the data set at http://groups.google.com/group/nytnlp and encourages feedback and discussion about the corpus.

Access/re-use

Not open. Available on DVD for $300 from LDC Catalog, which states:

Portions © 1987-2008 New York Times, © 2008 Trustees of the University of Pennsylvania
Collection of NYT's most viewed articles between 2015-05-30T07:13:57.0+1:00...
figshare.com
txt
Updated May 31, 2023
Cite
Markus Fischer (2023). Collection of NYT's most viewed articles between 2015-05-30T07:13:57.0+1:00 and 2015-06-02T17:44:11.0+1:00 [Dataset]. http://doi.org/10.6084/m9.figshare.1434028.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.1434028.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Markus Fischer
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data was acquired through the NYT's REST API at http://api.nytimes.com/svc/mostpopular/v2/mostviewed/all-sections/1?api-key={your_api_key}. The intervals for data retrieval were irregular; retrieval times are stored in the column 'INSERT_DATE'. Each row also contains the raw JSON object retrieved by the API ('JSON'), a SHA-512 hash of it ('HASH'), and several parsed fields from the object. In total, there are 10 retrievals. The dataset can be used to monitor changes in the most viewed articles, query the changing number of hits for a keyword in the title, etc.
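One use of a per-row hash is spotting which retrievals actually returned a different payload. The Python sketch below illustrates the idea with the 'INSERT_DATE' and 'JSON' column names from the description. How the dataset's own 'HASH' column was computed is not stated, so hashing the UTF-8 bytes of the raw JSON string is an assumption here, and the sample rows and helper names are hypothetical.

```python
import hashlib


def sha512_hex(raw_json):
    """SHA-512 hex digest of a raw JSON string (UTF-8 encoding is assumed)."""
    return hashlib.sha512(raw_json.encode("utf-8")).hexdigest()


def changed_retrievals(rows):
    """Return INSERT_DATE values whose payload differs from the previous retrieval."""
    changed = []
    last_hash = None
    # Timestamps are assumed to share one ISO-like format, so text order is time order.
    for row in sorted(rows, key=lambda r: r["INSERT_DATE"]):
        digest = sha512_hex(row["JSON"])
        if last_hash is not None and digest != last_hash:
            changed.append(row["INSERT_DATE"])
        last_hash = digest
    return changed


# Made-up rows for illustration; real rows would come from the dataset file.
rows = [
    {"INSERT_DATE": "2015-05-30T07:13:57", "JSON": '{"results": ["a", "b"]}'},
    {"INSERT_DATE": "2015-05-31T09:00:00", "JSON": '{"results": ["a", "b"]}'},
    {"INSERT_DATE": "2015-06-01T12:00:00", "JSON": '{"results": ["b", "a"]}'},
]
print(changed_retrievals(rows))
```

Comparing digests rather than full payloads keeps the check cheap, at the cost of treating any byte-level difference (including key order or whitespace) as a change.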

Copyright (c) 2015 The New York Times Company. All Rights Reserved.
US counties COVID 19 dataset
kaggle.com
Updated Aug 11, 2020
+ more versions
Cite
MyrnaMFL (2020). US counties COVID 19 dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1412810
Unique identifier
https://doi.org/10.34740/kaggle/dsv/1412810
Dataset updated
Aug 11, 2020
Dataset provided by
Kaggle http://kaggle.com/
Authors
MyrnaMFL
Area covered
United States
Description
From the New York Times GitHub source (CSV, US counties): "The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.

United States Data

Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.

Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information."

The specific data here is the data per US county.

The CSV link for counties is: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
New York County, NY annual median income by work experience and sex dataset:...
neilsberg.com
csv, json
Updated Feb 27, 2025
+ more versions
Cite
Neilsberg Research (2025). New York County, NY annual median income by work experience and sex dataset: Aged 15+, 2010-2023 (in 2023 inflation-adjusted dollars) // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/new-york-county-ny-income-by-gender/
Available download formats: json, csv
Dataset updated
Feb 27, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Manhattan, New York, New York
Variables measured
Income for Male Population, Income for Female Population, Income for Male Population working full time, Income for Male Population working part time, Income for Female Population working full time, Income for Female Population working part time
Measurement technique
The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 5-Year Estimates. The dataset covers the years 2010 to 2023, representing 14 years of data. To analyze income differences between genders (male and female), we conducted an initial data analysis and categorization. Subsequently, we adjusted these figures for inflation using the Consumer Price Index retroactive series (R-CPI-U-RS) based on current methodologies. For additional information about these estimations, please contact us via email at research@neilsberg.com
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset presents median income data over a decade or more for males and females categorized by Total, Full-Time Year-Round (FT), and Part-Time (PT) employment in New York County. It showcases annual income, providing insights into gender-specific income distributions and the disparities between full-time and part-time work. The dataset can be utilized to gain insights into gender-based pay disparity trends and explore the variations in income for male and female individuals.

Key observations: Insights from 2023

Based on our analysis of ACS 2019-2023 5-Year Estimates, we present the following observations:
- All workers, aged 15 years and older: In New York County, the median income for all workers aged 15 years and older, regardless of work hours, was $72,134 for males and $54,928 for females. These income figures indicate a substantial gender-based pay disparity, with a gap of approximately 24% between the median incomes of males and females in New York County. With women, regardless of work hours, earning 76 cents for each dollar earned by men, this income disparity reveals a concerning trend toward wage inequality that demands attention in New York County.
- Full-time workers, aged 15 years and older: In New York County, among full-time, year-round workers aged 15 years and older, males earned a median income of $119,785, while females earned $96,975, leading to a 19% gender pay gap among full-time workers. This illustrates that women earn 81 cents for each dollar earned by men in full-time roles. This analysis indicates a widening gender pay gap, showing a substantial income disparity where women, despite working full-time, face a more significant wage discrepancy compared to men in the same roles. Remarkably, across all roles, including non-full-time employment, women displayed a similar gender pay gap percentage. This indicates a consistent gender pay gap scenario across various employment types in New York County, showcasing a consistent income pattern irrespective of employment status.
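The "cents per dollar" and percentage-gap figures quoted above follow directly from the two medians. The short Python sketch below reproduces that arithmetic from the New York County numbers; `pay_ratio_and_gap` is a hypothetical helper name, and rounding to whole cents and whole percent is an assumption about how the published figures were rounded.

```python
def pay_ratio_and_gap(male_median, female_median):
    """Return (cents earned by women per dollar earned by men, gap in percent)."""
    ratio = female_median / male_median
    return round(ratio * 100), round((1 - ratio) * 100)


# Medians taken from the key observations for New York County.
all_workers = pay_ratio_and_gap(72_134, 54_928)
full_time = pay_ratio_and_gap(119_785, 96_975)

print("All workers (cents per dollar, gap %):", all_workers)
print("Full-time workers (cents per dollar, gap %):", full_time)
```

With these inputs the helper gives 76 cents and a 24% gap for all workers, and 81 cents and a 19% gap for full-time workers, matching the figures in the text.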

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusted for inflation and are presented in 2023-inflation-adjusted dollars.

Gender classifications include:

Male

Female

Employment type classifications include:

Full-time, year-round: A full-time, year-round worker is a person who worked full time (35 or more hours per week) and 50 or more weeks during the previous calendar year.

Part-time: A part-time worker is a person who worked less than 35 hours per week during the previous calendar year.

Variables / Data Columns

Year: This column presents the data year. Expected values are 2010 to 2023

Male Total Income: Annual median income, for males regardless of work hours

Male FT Income: Annual median income, for males working full time, year-round

Male PT Income: Annual median income, for males working part time

Female Total Income: Annual median income, for females regardless of work hours

Female FT Income: Annual median income, for females working full time, year-round

Female PT Income: Annual median income, for females working part time

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for New York County median household income by race.
Stanford, New York annual income distribution by work experience and gender...
neilsberg.com
csv, json
Updated Feb 27, 2025
+ more versions
Cite
Neilsberg Research (2025). Stanford, New York annual income distribution by work experience and gender dataset: Number of individuals ages 15+ with income, 2023 // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/stanford-ny-income-by-gender/
Available download formats: csv, json
Dataset updated
Feb 27, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Stanford, New York
Variables measured
Income for Male Population, Income for Female Population, Income for Male Population working full time, Income for Male Population working part time, Income for Female Population working full time, Income for Female Population working part time, Number of males working full time for a given income bracket, Number of males working part time for a given income bracket, Number of females working full time for a given income bracket, Number of females working part time for a given income bracket
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To portray the number of individuals of each gender (male and female) within each income bracket, we conducted an initial analysis and categorization of the American Community Survey data. Households are categorized, and median incomes are reported based on the self-identified gender of the head of the household. For additional information about these estimations, please contact us via email at research@neilsberg.com
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset presents a detailed breakdown of the count of individuals within distinct income brackets, categorizing them by gender (men and women) and employment type, full-time (FT) and part-time (PT), offering insights into the diverse income landscapes within Stanford town. The dataset can be utilized to gain insights into gender-based income distribution within the Stanford town population, aiding in data analysis and decision-making.

Key observations

Employment patterns: Within Stanford town, among individuals aged 15 years and older with income, there were 1,600 men and 1,358 women in the workforce. Among them, 762 men were engaged in full-time, year-round employment, while 483 women were in full-time, year-round roles.

Annual income under $24,999: Of the male population working full-time, 2.10% fell within the income range of under $24,999, while 2.28% of the female population working full-time was represented in the same income bracket.

Annual income above $100,000: 48.69% of men in full-time roles earned incomes exceeding $100,000, while 20.91% of women in full-time positions earned within this income bracket.

Refer to the research insights for more key observations on more income brackets (Annual income under $24,999, Annual income between $25,000 and $49,999, Annual income between $50,000 and $74,999, Annual income between $75,000 and $99,999 and Annual income above $100,000) and employment types (full-time year-round and part-time).

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Income brackets:

$1 to $2,499 or loss

$2,500 to $4,999

$5,000 to $7,499

$7,500 to $9,999

$10,000 to $12,499

$12,500 to $14,999

$15,000 to $17,499

$17,500 to $19,999

$20,000 to $22,499

$22,500 to $24,999

$25,000 to $29,999

$30,000 to $34,999

$35,000 to $39,999

$40,000 to $44,999

$45,000 to $49,999

$50,000 to $54,999

$55,000 to $64,999

$65,000 to $74,999

$75,000 to $99,999

$100,000 or more

Variables / Data Columns

Income Bracket: This column showcases 20 income brackets ranging from $1 to $100,000+.

Full-Time Males: The count of males employed full-time year-round and earning within a specified income bracket

Part-Time Males: The count of males employed part-time and earning within a specified income bracket

Full-Time Females: The count of females employed full-time year-round and earning within a specified income bracket

Part-Time Females: The count of females employed part-time and earning within a specified income bracket

Employment type classifications include:

Full-time, year-round: A full-time, year-round worker is a person who worked full time (35 or more hours per week) and 50 or more weeks during the previous calendar year.

Part-time: A part-time worker is a person who worked less than 35 hours per week during the previous calendar year.

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Stanford town median household income by race.
Napoli, New York annual median income by work experience and sex dataset:...
neilsberg.com
csv, json
Updated Feb 27, 2025
+ more versions
Cite
Neilsberg Research (2025). Napoli, New York annual median income by work experience and sex dataset: Aged 15+, 2010-2023 (in 2023 inflation-adjusted dollars) // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/a52ab4ea-f4ce-11ef-8577-3860777c1fe6/
Available download formats: csv, json
Dataset updated
Feb 27, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
New York, Napoli
Variables measured
Income for Male Population, Income for Female Population, Income for Male Population working full time, Income for Male Population working part time, Income for Female Population working full time, Income for Female Population working part time
Measurement technique
The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 5-Year Estimates. The dataset covers the years 2010 to 2023, representing 14 years of data. To analyze income differences between genders (male and female), we conducted an initial data analysis and categorization. Subsequently, we adjusted these figures for inflation using the Consumer Price Index retroactive series (R-CPI-U-RS) based on current methodologies. For additional information about these estimations, please contact us via email at research@neilsberg.com
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset presents median income data over a decade or more for males and females categorized by Total, Full-Time Year-Round (FT), and Part-Time (PT) employment in Napoli town. It showcases annual income, providing insights into gender-specific income distributions and the disparities between full-time and part-time work. The dataset can be utilized to gain insights into gender-based pay disparity trends and explore the variations in income for male and female individuals.

Key observations: Insights from 2023

Based on our analysis of ACS 2019-2023 5-Year Estimates, we present the following observations:
- All workers, aged 15 years and older: In Napoli town, the median income for all workers aged 15 years and older, regardless of work hours, was $42,386 for males and $27,230 for females.
These income figures highlight a substantial gender-based income gap in Napoli town. Women, regardless of work hours, earn 64 cents for each dollar earned by men. This gender pay gap, approximately 36%, underscores concerning gender-based income inequality in Napoli town.
- Full-time workers, aged 15 years and older: In Napoli town, among full-time, year-round workers aged 15 years and older, males earned a median income of $53,795, while females earned $63,750.
Surprisingly, within the subset of full-time workers, women earn a higher income than men, earning 1.19 dollars for every dollar earned by men. This suggests that within full-time roles, women's median incomes surpass men's, contrary to broader workforce trends.
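
The cents-per-dollar figures quoted above follow from dividing the female median by the male median. The short Python sketch below reproduces that arithmetic from the four medians in the text; the function name `earnings_ratio` is illustrative and is not part of the dataset.

```python
def earnings_ratio(female_median: float, male_median: float) -> float:
    """Return female median income per one dollar of male median income."""
    if male_median <= 0:
        raise ValueError("male_median must be positive")
    return female_median / male_median


# Medians quoted in the observations (2023 inflation-adjusted dollars).
all_workers = earnings_ratio(27_230, 42_386)
full_time = earnings_ratio(63_750, 53_795)

# The pay gap is the shortfall relative to one dollar of male income.
print(f"All workers: {all_workers:.2f} per dollar, gap {1 - all_workers:.0%}")
print(f"Full-time:   {full_time:.2f} per dollar")
```

Because these medians are survey estimates, a ratio computed this way inherits their margin of error.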

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusted for inflation and are presented in 2023 inflation-adjusted dollars.

Gender classifications include:

Male

Female

Employment type classifications include:

Full-time, year-round: A full-time, year-round worker is a person who worked full time (35 or more hours per week) and 50 or more weeks during the previous calendar year.

Part-time: A part-time worker is a person who worked less than 35 hours per week during the previous calendar year.

Variables / Data Columns

Year: This column presents the data year. Expected values are 2010 to 2023

Male Total Income: Annual median income, for males regardless of work hours

Male FT Income: Annual median income, for males working full time, year-round

Male PT Income: Annual median income, for males working part time

Female Total Income: Annual median income, for females regardless of work hours

Female FT Income: Annual median income, for females working full time, year-round

Female PT Income: Annual median income, for females working part time

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for Napoli town median household income by race. You can refer to the same here.
NYC Wi-Fi Hotspot Locations
catalog.data.gov
data.cityofnewyork.us
+5more
Updated Sep 30, 2022
Cite
data.cityofnewyork.us (2022). NYC Wi-Fi Hotspot Locations [Dataset]. https://catalog.data.gov/dataset/nyc-wi-fi-hotspot-locations-df7c0
Explore at:
Dataset updated
Sep 30, 2022
Dataset provided by
data.cityofnewyork.us
Area covered
New York
Description
NYC Wi-Fi Hotspot Locations

Wi-Fi Providers:

CityBridge, LLC (Free Beta): LinkNYC 1 gigabyte (GB), Free Wi-Fi Internet Kiosks

Spot On Networks (Free): NYC HOUSING AUTHORITY (NYCHA) Properties

Fiberless (Free): Wi-Fi access on Governors Island. Free - up to 5 Mbps for users as the part of Governors Island Trust Governors Island Connectivity Challenge

AT&T (Free): Wi-Fi access is free for all users at all times.

Partners: In several parks, the NYC partner organizations provide publicly accessible Wi-Fi. Visit these parks to learn more about their Wi-Fi service and how to connect.

Cable (Limited-Free): In NYC Parks provided by NYC DoITT Cable television franchisees. ALTICEUSA previously known as “Cablevision” and SPECTRUM previously known as “Time Warner Cable” (Limited Free). Connect for 3 free 10 minute sessions every 30 days or purchase a 99 cent day pass through midnight. Wi-Fi service is free at all times to Cablevision’s Optimum Online and Time Warner Cable broadband subscribers.

Wi-Fi Provider: Chelsea Wi-Fi (Free). Wi-Fi access is free for all users at all times. Chelsea Improvement Company has partnered with Google to provide a free wireless Internet zone, a broadband region bounded by West 19th Street, Gansevoort Street, Eighth Avenue, and the High Line Park.

Wi-Fi Provider: Downtown Brooklyn Wi-Fi (Free). The Downtown Brooklyn Partnership - the New York City Economic Development Corporation to provide Wi-Fi to the area bordered by Schermerhorn Street, Cadman Plaza West, Flatbush Avenue, and Tillary Street, along with select public spaces in the NYCHA Ingersoll and Whitman Houses.

Wi-Fi Provider: Manhattan Downtown Alliance Wi-Fi (Free). Lower Manhattan: several public spaces all along Water Street, Front Street and the East River Esplanade south of Fulton Street and in several other locations throughout Lower Manhattan.

Wi-Fi Provider: Harlem Wi-Fi (Free). The free outdoor public wireless network will extend 95 city blocks, from 110th to 138th Streets between Frederick Douglass Boulevard and Madison Avenue.

Wi-Fi Provider: Transit Wireless (Free). Wi-Fi service in the New York City subway system is available in certain underground stations. For more information visit http://www.transitwireless.com/stations/.

Wi-Fi Provider: Public Pay Telephone Franchisees (Free). Using existing payphone infrastructure, the City of New York has teamed up with private partners to provide free Wi-Fi service at public payphone kiosks across the five boroughs at no cost to taxpayers.

Wi-Fi Provider: New York Public Library. Using Wireless Internet Access (Wi-Fi): All Library locations offer free wireless access (Wi-Fi) in public areas at all times the libraries are open.

Connecting to the Library's Wireless Network
• You must have a computer or other device equipped with an 802.11b-compatible wireless card.
• Using your computer's network utilities, look for the wireless network named "NYPL."
• The "NYPL" wireless network does not require a password to connect.

Limitations and Disclaimers Regarding Wireless Access
• The Library's wireless network is not secure. Information sent from or to your laptop can be captured by anyone else with a wireless device and the appropriate software, within three hundred feet.
• Library staff is not able to provide technical assistance and no guarantee can be provided that you will be able to make a wireless connection.
• The Library assumes no responsibility for the safety of equipment or for laptop configurations, security, or data files resulting from connection to the Library's network
About COVID-19 Public Datasets
console.cloud.google.com
Updated Jun 19, 2022
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Datasets%20Program&inv=1&invt=Ab2YUw (2022). About COVID-19 Public Datasets [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-public-data-program
Explore at:
Dataset updated
Jun 19, 2022
Dataset provided by
BigQuery https://cloud.google.com/bigquery
Google http://google.com/
Description
In an effort to help combat COVID-19, we created a COVID-19 Public Datasets program to make data more accessible to researchers, data scientists and analysts. The program will host a repository of public datasets that relate to the COVID-19 crisis and make them free to access and analyze. These include datasets from the New York Times, European Centre for Disease Prevention and Control, Google, Global Health Data from the World Bank, and OpenStreetMap.

Free hosting and queries of COVID datasets

As with all data in the Google Cloud Public Datasets Program, Google pays for storage of datasets in the program. BigQuery also provides free queries over certain COVID-related datasets to support the response to COVID-19. Queries on COVID datasets will not count against the BigQuery sandbox free tier, where you can query up to 1TB free each month.

Limitations and duration

Queries of COVID data are free. If, during your analysis, you join COVID datasets with non-COVID datasets, the bytes processed in the non-COVID datasets will be counted against the free tier, then charged accordingly, to prevent abuse. Queries of COVID datasets will remain free until Sept 15, 2021.

The contents of these datasets are provided to the public strictly for educational and research purposes only. We are not onboarding or managing PHI or PII data as part of the COVID-19 Public Dataset Program. Google has practices & policies in place to ensure that data is handled in accordance with widely recognized patient privacy and data security policies.

See the list of all datasets included in the program
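
The accounting rule described above (COVID bytes are free, non-COVID bytes draw down the monthly free tier before anything is charged) can be illustrated with a small Python sketch. This is not BigQuery's billing code: the function name `billable_bytes` is hypothetical, and reading "1TB" as 10^12 bytes is an assumption made only for the example.

```python
# Assumption: the "1TB" monthly free tier is taken as 10**12 bytes here.
FREE_TIER_BYTES = 10**12


def billable_bytes(covid_bytes: int, other_bytes: int, free_tier_used: int = 0) -> int:
    """Bytes of one query that fall outside the free allowance.

    covid_bytes is accepted only to make the rule explicit: bytes scanned
    in COVID datasets never count. Bytes scanned in other datasets use up
    whatever remains of the free tier first.
    """
    remaining_free = max(FREE_TIER_BYTES - free_tier_used, 0)
    return max(other_bytes - remaining_free, 0)


# A pure COVID query is free regardless of size.
print(billable_bytes(covid_bytes=5 * 10**11, other_bytes=0))
# A join that also scans non-COVID data, with most of the free tier already used.
print(billable_bytes(covid_bytes=10**11, other_bytes=3 * 10**11, free_tier_used=8 * 10**11))
```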
Data from: CBS News/New York Times National Surveys, 1983
icpsr.umich.edu
ascii, sas, spss
Updated Jan 18, 2006
Cite
Inter-university Consortium for Political and Social Research [distributor] (2006). CBS News/New York Times National Surveys, 1983 [Dataset]. http://doi.org/10.3886/ICPSR08243.v2
Explore at:
sas, spss, ascii (available download formats)
Unique identifier
https://doi.org/10.3886/ICPSR08243.v2
Dataset updated
Jan 18, 2006
Dataset provided by
Inter-university Consortium for Political and Social Research https://www.icpsr.umich.edu/web/pages/
License
https://www.icpsr.umich.edu/web/ICPSR/studies/8243/terms
Time period covered
Jan 1983 - Oct 1983
Area covered
United States
Description
These seven datasets are part of an ongoing data collection effort in which The New York Times and CBS News are equal partners. Each survey includes questions about President Ronald Reagan's performance in office, especially with respect to economic and foreign affairs. In addition, each survey provides information on respondents' views concerning other social and political issues, as well as respondents' personal backgrounds. The surveys were conducted in January, April, June, September (twice), and October (twice). The October surveys took place before and after President Reagan's speech about Grenada on October 27, 1983. The October samples are weighted separately, and two discrete datasets, which may be analyzed separately or combined, are available (Parts 6 and 7). Topics covered in Part 1, January Survey, include Reagan's handling of economic and foreign affairs, various proposals to reduce the federal deficit, unemployment, and Social Security. In Part 2, April Survey, individuals responded to questions about Reagan's handling of economic and foreign affairs, the environment, and defense policy, and were also asked about their willingness to vote for a Black candidate, candidates endorsed by labor unions, and candidates endorsed by homosexual organizations. Two versions of the questionnaire were used, to test alternative question wording. For Part 3, June Survey, questions were asked on Reagan's presidency, possible presidential candidates in 1984, foreign policy, economic policy, merit pay for public school teachers, federal spending on education, and tennis. Part 4, Plane Survey, queried respondents about the Korean passenger plane shot down by the Soviet Union in September 1983, including their opinions on the American response to the attack. The questionnaire also included questions about Reagan's handling of foreign and economic policy. 
Part 5, September Survey, covered telephone service, United States troops in Lebanon, possible presidential candidates, and President Reagan's handling of economic and foreign policy. Two versions of the questionnaire were used, to test alternative question wording. A question about the cease-fire agreement in Lebanon was included in only one of those versions. Part 6, October (Prespeech) Survey, was conducted before President Reagan gave his speech on Grenada. Respondents were asked their opinions on having United States troops in Grenada and Lebanon, the attack on the Marine barracks in Lebanon, and Reagan's handling of foreign policy. Part 7, October (Postspeech) Survey, was conducted after President Reagan's speech on Grenada and concerned the same issues that were covered in the Prespeech Survey.
gnoss.com, a Linked Open Data knowledge network
data.wu.ac.at
example/rdf+xml +3
Updated Aug 29, 2014
Cite
(2014). gnoss.com, a Linked Open Data knowledge network [Dataset]. https://data.wu.ac.at/schema/linkeddatacatalog_dws_informatik_uni-mannheim_de/NTkxYTU5N2QtOGFlOC00ZGZlLWFkMDgtMGUxODllYWU1ODEy
Explore at:
example/rdf+xml, example/rdfa, meta/void, meta/owl (available download formats)
Dataset updated
Aug 29, 2014
License
Attribution-NonCommercial 2.0 (CC BY-NC 2.0) https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
Description
gnoss.com is a society on the Net: it enables people, companies and any human group or organization to connect, interact and work according to their interests in a linked open knowledge network. The project has been in public beta since September 2009. It already has 28,000 subscribers and a knowledge network of more than 3,000 communities (November 2012) on topics related to innovation, technology, business and education, among others. gnoss.com works within the philosophy of Open Data and proposes a solution so that data can be linked (Linked Data). Data are expressed in OWL-RDF files.

Currently, more than 100,000 resources have been published on the platform. For each resource, the platform makes it possible to download an RDF document describing its content and metadata. Different languages such as FOAF, SIOC and SKOS are used in this description. Our whole dataset has about 4,500,000 RDF triples. In addition, most of these resources are linked to one or more resources of the Freebase and New York Times datasets. We have about 20,000 links to these datasets.

GNOSS is a software platform, created by RIAM Intelearning LAB S.L., to build specialized online social networks with dynamic semantic publishing. GNOSS integrates knowledge management, informal learning and collaborative work in a Linked Data environment. Every GNOSS space incorporates semantic facet-based searches and semantic context creation which drastically improves user experience. GNOSS runs on technologies and standards of the semantic web, which makes it possible to structure and link all kinds of content, among them and with other Open Data (Linked Data), and to reinforce and to amplify the knowledge management processes with facet-based searches and generation of documentary and personal contexts for specific information. GNOSS provides people, groups and organizations with the necessary tools to create and develop their digital identity, connect their intelligences, create communities based on their interests and motivations, and activate thriving processes of collective creativity, brainpower, debate and thinking.

For further information: http://noticias.gnoss.com

gnoss.com is a society on the Net: it lets people, companies, and any group connect, interact, and work according to their interests in an open knowledge network. The project has been in public beta since September 2009. Gnoss.com has 27,000 registered members and a knowledge network of more than 3,000 communities (October 2012) on topics such as innovation, technology, business, and education, among others. gnoss.com works within the Open Data philosophy and proposes a solution in which data can be linked (Linked Data). Data are expressed in OWL-RDF files.

At present, more than 100,000 resources have been published on the platform. For each resource, the platform offers an RDF file describing its content and metadata. Different languages such as FOAF, SIOC, and SKOS are used in this description. Our database has more than 4,500,000 triples. In addition, most gnoss.com resources are linked to other resources from Freebase and The New York Times. We have more than 20,000 links to these datasets.

GNOSS is a software platform, created by RIAM Intelearning LAB SL, for building specialized social networks through dynamic semantic publishing of content. GNOSS integrates knowledge management, informal learning, and collaborative work in a Linked Data environment. Every GNOSS space incorporates faceted semantic search and semantic context generation, which considerably improves the user experience.

GNOSS runs on the technologies and standards of the semantic web, which on the one hand makes it possible to structure and link all kinds of content with each other and with people's interests (Linked Data), and on the other reinforces and amplifies knowledge management processes with faceted searches and the generation of documentary and personal contexts for a given piece of information.

GNOSS provides people, groups, and organizations with the tools needed to create and deploy their digital identity; connect intelligences; create communities based on their interests and motivations; and activate robust processes of creativity, intelligence, deliberation, and collective thinking.

Latest GNOSS news: http://noticias.gnoss.com
Message Understanding Conference (MUC) 7
borealisdata.ca
search.dataone.org
Updated Apr 17, 2023
Cite
Nancy Chinchor (2023). Message Understanding Conference (MUC) 7 [Dataset]. http://doi.org/10.5683/SP2/EQ7FTZ
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.5683/SP2/EQ7FTZ
Dataset updated
Apr 17, 2023
Dataset provided by
Borealis
Authors
Nancy Chinchor
License
https://borealisdata.ca/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.5683/SP2/EQ7FTZ
Description
Introduction

Message Understanding Conference (MUC) 7 was produced by Linguistic Data Consortium (LDC) catalog number LDC2001T02 and ISBN 1-58563-205-8. In the 1990s, the MUC evaluations funded the development of metrics and statistical algorithms to support government evaluations of emerging information extraction technologies. Additional information from NIST can be found here.

Data

The following list shows the correspondence between versions of the IE task definition and stages of the MUC-7 evaluation.

Version    Stage
4.1        training and dryrun
4.2        formalrun
5.1        final

The dryrun and formalrun have different domains; the dryrun (and training) consists of aircrashes scenarios and the formalrun consists of missile launches scenarios. The final version updates especially the Template Relations portion of the guidelines. Normally, for each scenario, two datasets are provided: training and test. When the evaluation cycle begins, the label for the scenario dataset is training. Then the corresponding test dataset for that same scenario is used for the dryrun testing. For the formal run, a formal training set is given out four weeks before the test answers are due. The formal test is given out one week before the test answers are due. After the entire evaluation and meeting have been held, final edits are made if necessary.

Samples

Please view this text sample.

Updates

August 22, 2001: This publication was inadvertently released without the guidelines documentation and the scoring software. These documents and programs have now been added to the publication and if you previously purchased this corpus and would like to download a complete copy of the corpus please contact ldc@ldc.upenn.edu.

Copyright

Portions © 1996 New York Times, © 2001 Trustees of the University of Pennsylvania
Dataset of book subjects that contain Following 9/11 : religion coverage in...
workwithdata.com
Updated Nov 7, 2024
Cite
Work With Data (2024). Dataset of book subjects that contain Following 9/11 : religion coverage in the New York times [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Following+9/11+:+religion+coverage+in+the+New+York+times&j=1&j0=books
Explore at:
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects. It has 3 rows and is filtered to the book Following 9/11 : religion coverage in the New York times. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
CrosswordQA
huggingface.co
Updated Apr 29, 2022
Cite
Albert Xu (2022). CrosswordQA [Dataset]. https://huggingface.co/datasets/albertxu/CrosswordQA
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 29, 2022
Authors
Albert Xu
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for CrosswordQA

Dataset Summary

The CrosswordQA dataset is a set of over 6 million clue-answer pairs scraped from the New York Times and many other crossword publishers. The dataset was created to train the Berkeley Crossword Solver's QA model. See our paper for more information. Answers are automatically segmented (e.g., BUZZLIGHTYEAR -> Buzz Lightyear), and thus may occasionally be segmented incorrectly.
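
The dataset card does not say how answers were segmented. As one illustration of why automatic segmentation can go wrong, here is a minimal dictionary-based sketch in Python; the `segment` function and the tiny `VOCAB` set are hypothetical and are not the Berkeley Crossword Solver's method.

```python
from __future__ import annotations


def segment(answer: str, vocabulary: set[str]) -> list[str] | None:
    """Split an unspaced answer into vocabulary words, or return None."""
    text = answer.lower()
    # best[i] holds one segmentation of text[:i], or None if none was found.
    best: list[list[str] | None] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Trying the smallest start first prefers the longest final word.
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                best[end] = best[start] + [word]
                break
    return best[len(text)]


# Hypothetical vocabulary; a real segmenter would need a large word list.
VOCAB = {"buzz", "light", "year", "lightyear"}

words = segment("BUZZLIGHTYEAR", VOCAB)
print(" ".join(w.capitalize() for w in words))
```

With a vocabulary that lacks "lightyear", the same input would come back as three words, which is the kind of incorrect segmentation the summary warns about.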

Supported Tasks and Leaderboards

[Needs… See the full description on the dataset page: https://huggingface.co/datasets/albertxu/CrosswordQA.
The Dynamics of Collective Action Corpus
zenodo.org
data.niaid.nih.gov
bin
Updated Oct 7, 2023
Cite
Dustin S. Stoltz; Marshall A. Taylor; Jennifer S.K. Dudley (2023). The Dynamics of Collective Action Corpus [Dataset]. http://doi.org/10.5281/zenodo.8414335
Explore at:
bin (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.8414335
Dataset updated
Oct 7, 2023
Dataset provided by
Zenodo http://zenodo.org/
Authors
Dustin S. Stoltz; Marshall A. Taylor; Jennifer S.K. Dudley
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository includes two datasets, a Document-Term Matrix and associated metadata, for 17,493 New York Times articles covering protest events, both saved as single R objects.

These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text of the news articles was not included in the original DoCA data.

We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.

We then subset the original DoCA dataset to include only rows that match a recollected article. The "20231006_dca_metadata_subset.Rds" file contains all of the metadata variables from the original DoCA dataset (see Codebook), with the addition of "pdf_file" and "pub_title", the latter being the title of the recollected article (which may differ from the "title" variable in the original dataset), for a total of 106 variables and 21,126 rows (noting that a row is a distinct protest event and one article may cover more than one protest event).

Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: We removed headers and footers that were consistent across all digitized stories and any web links or HTML; added a single space before an uppercase letter when it was flush against a lowercase letter to its left (e.g., turning "JohnKennedy" into "John Kennedy"); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to "Basic Latin" ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's" into "it is"); removed punctuation; removed capitalization; removed numbers; fixed word kerning; applied a final extra round of whitespace removal.
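
Most of the steps listed above can be sketched with Python's standard library. This is an illustration, not the authors' code (their pipeline targets R): the `CONTRACTIONS` table is a tiny hypothetical stand-in, header/footer removal and kerning fixes are omitted because they depend on the source scans, punctuation is replaced by a space as an assumption, and curly quotes are normalized before ASCII transliteration so that the simple encoder used here does not drop them.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny contraction table for illustration.
CONTRACTIONS = {"it's": "it is", "don't": "do not"}
CURLY_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def preprocess(text: str) -> str:
    """Apply a simplified version of the cleaning steps described above."""
    text = re.sub(r"https?://\S+|<[^>]+>", " ", text)   # web links and HTML tags
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)     # "JohnKennedy" -> "John Kennedy"
    text = re.sub(r"\s+", " ", text).strip()             # excess whitespace
    text = text.translate(CURLY_QUOTES)                  # curly quotes -> ASCII quotes
    # Decompose accented characters, then keep only their ASCII base letters.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    for short, full in CONTRACTIONS.items():             # expand contractions
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.IGNORECASE)
    text = re.sub(r"[^\w\s]|_", " ", text)               # punctuation
    text = text.lower()                                  # capitalization
    text = re.sub(r"\d+", " ", text)                     # numbers
    return re.sub(r"\s+", " ", text).strip()             # final whitespace pass


print(preprocess("JohnKennedy said it\u2019s 1963! See http://example.com"))
```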

We then tokenized them by following the rule that each word is a character string surrounded by a single space. At this step, each document is then a list of tokens. We count each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times each token occurred in each article. Finally, we removed words (i.e., columns in the DTM) that occurred fewer than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCRing process). The final DTM has 66,552 unique words, 10,134,304 total tokens, and 17,493 articles (rows). The "20231006_dca_dtm.Rds" file is a sparse matrix class object from the Matrix R package.
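
The tokenize, count, and filter steps described above can be sketched in Python with plain dictionaries standing in for a sparse matrix. The function name `build_dtm` and the toy documents are hypothetical; the published object is an R sparse matrix, not this structure.

```python
from __future__ import annotations

from collections import Counter


def build_dtm(docs: dict[str, str], min_count: int = 4) -> tuple[list[str], dict[str, Counter]]:
    """Return (vocabulary, rows), where rows maps a document id to token counts.

    Tokens are split on single spaces. Terms seen fewer than min_count
    times in the whole corpus, or only one character long, are dropped.
    """
    counts = {doc_id: Counter(text.split(" ")) for doc_id, text in docs.items()}
    totals: Counter = Counter()
    for doc_counts in counts.values():
        totals.update(doc_counts)
    vocab = sorted(t for t, n in totals.items() if n >= min_count and len(t) > 1)
    keep = set(vocab)
    rows = {
        doc_id: Counter({t: n for t, n in doc_counts.items() if t in keep})
        for doc_id, doc_counts in counts.items()
    }
    return vocab, rows


# Toy corpus with hypothetical file names as document ids.
docs = {"a.pdf": "protest march protest x", "b.pdf": "protest protest march x x x"}
vocab, rows = build_dtm(docs)
print(vocab, dict(rows["a.pdf"]))
```

Storing only nonzero counts per document mirrors the reason the original DTM is kept as a sparse matrix: most terms do not occur in most articles.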

In R, use the load() function to load the objects `dca_dtm` and `dca_meta`. To associate `dca_meta` with `dca_dtm`, match the "pdf_file" variable in `dca_meta` to the rownames of `dca_dtm`.
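
The matching idea is sketched below in Python rather than R, purely to show the logic: each metadata row (one protest event) is pointed at the DTM row of the article it came from, so several events can share one row. The function name `rows_for_events` and the file names are hypothetical.

```python
from __future__ import annotations


def rows_for_events(meta_rows: list[dict], dtm_rownames: list[str]) -> list[int | None]:
    """For each metadata row, return the index of its article's DTM row.

    Matching uses the "pdf_file" field; None marks a row with no match.
    """
    index = {name: i for i, name in enumerate(dtm_rownames)}
    return [index.get(row["pdf_file"]) for row in meta_rows]


# Hypothetical example: two events reported in the same article.
dtm_rownames = ["article_001.pdf", "article_002.pdf"]
meta_rows = [
    {"pdf_file": "article_002.pdf"},
    {"pdf_file": "article_001.pdf"},
    {"pdf_file": "article_002.pdf"},
]
print(rows_for_events(meta_rows, dtm_rownames))
```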
