https://github.com/nytimes/covid-19-data/blob/master/LICENSE
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
Attribution-NonCommercial-ShareAlike 2.5 (CC BY-NC-SA 2.5) https://creativecommons.org/licenses/by-nc-sa/2.5/
License information was derived automatically
The present is a manually labeled data set for the task of Event Detection (ED). The task of ED consists of identifying event triggers, the words that most clearly indicate the occurrence of an event. The data set consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).
Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In our data set we labeled as events only the first type. Based on this criterion, some words that are typically considered events are labeled as non-event triggers if they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.
Further information: For a more detailed description of the data set and the data collection process please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.
Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files and the second contains 200 XML files. Each XML file has the following format. YYYYMMDDTHHMMSS ... ... ... The first three tags (pubdate, file-id and sent-idx) contain metadata. The first is the publication date of the news article that contained the text extract. The next two together form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx is the index that identifies the text extract inside the full article. The last tag (sentence) defines the beginning and end of the text extract. Inside that text are the tags that mark event triggers; each one surrounds one word that was manually labeled as an event trigger.
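As a rough illustration of the format described above, the following sketch reads one extract with Python's standard library. Only the tag names pubdate, file-id, sent-idx and sentence come from the description; the root tag `extract`, the trigger tag `event`, and all values in the sample are assumptions made for this example, so the real files may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file. The <extract> root, the <event> trigger tag and
# every value below are invented for illustration; only the names pubdate,
# file-id, sent-idx and sentence are taken from the dataset description.
SAMPLE = """<extract>
  <pubdate>20070619T000000</pubdate>
  <file-id>1234567</file-id>
  <sent-idx>3</sent-idx>
  <sentence>The current account <event>deficit</event> keeps growing.</sentence>
</extract>"""


def parse_extract(xml_text, trigger_tag="event"):
    """Return the metadata, plain text and labeled triggers of one extract."""
    root = ET.fromstring(xml_text)
    sentence = root.find("sentence")
    return {
        "pubdate": root.findtext("pubdate"),
        "file_id": root.findtext("file-id"),
        "sent_idx": int(root.findtext("sent-idx")),
        # itertext() concatenates the text inside and around the trigger tags.
        "text": "".join(sentence.itertext()),
        # Each trigger tag wraps exactly one labeled word.
        "triggers": [el.text for el in sentence.iter(trigger_tag)],
    }


record = parse_extract(SAMPLE)
print(record["triggers"])
```

If the actual trigger tag has a different name, passing it as `trigger_tag` should be the only change needed.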
This is the US Coronavirus data repository from The New York Times. This data includes COVID-19 cases and deaths reported by state and county. The New York Times compiled this data based on reports from state and local health agencies. More information on the data repository is available here. For additional reporting and data visualizations, see The New York Times’ U.S. coronavirus interactive site.
This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery.
This dataset has significant public interest in light of the COVID-19 crisis. All bytes processed in queries against this dataset will be zeroed out, making this part of the query free. Data joined with the dataset will be billed at the normal rate to prevent abuse. After September 15, queries over these datasets will revert to the normal billing rate. Users of The New York Times public-use data files must comply with data use restrictions to ensure that the information will be used solely for noncommercial purposes.
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.
Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information.
Both files contain FIPS codes, a standard geographic identifier, to make it easier for an analyst to combine this data with other data sets like a map file or population data.
State-level data can be found in the us-states.csv file.
date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
...
County-level data can be found in the us-counties.csv file.
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
...
In some cases, the geographies where cases are reported do not map to standard county boundaries. See the list of geographic exceptions for more detail on these.
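Because each row holds a cumulative count, per-day figures have to be derived by differencing consecutive rows for the same FIPS code. The sketch below shows one way to do that with the standard library. The header and first row match the us-states.csv sample above; the two later rows are invented for illustration, and the function name is hypothetical.

```python
import csv
import io

# Header and first row follow the us-states.csv sample in the text.
# The second and third rows are made-up values used only for this example.
SAMPLE = """date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
2020-01-22,Washington,53,1,0
2020-01-23,Washington,53,3,0
"""


def daily_new_cases(csv_text):
    """Turn cumulative case counts into per-day increases for each FIPS code."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # FIPS codes stay strings so that leading zeros (e.g. "01") are preserved.
    rows.sort(key=lambda r: (r["fips"], r["date"]))
    previous = {}
    result = []
    for row in rows:
        cases = int(row["cases"])
        new_cases = cases - previous.get(row["fips"], 0)
        previous[row["fips"]] = cases
        result.append((row["date"], row["state"], new_cases))
    return result


increments = daily_new_cases(SAMPLE)
for date, state, new_cases in increments:
    print(date, state, new_cases)
```

Since earlier entries may be revised, a difference can come out negative; whether to clip such values or keep them is a choice left to the analyst.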
This dataset contains COVID-19 data for the United States of America made available by The New York Times on GitHub at https://github.com/nytimes/covid-19-data
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is The get over : a short story prequel to the New York Times bestselling novel Monster. It features 7 columns including author, publication date, language, and book publisher.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset supplements the manuscript, “Journalism, advertising, or something in between? How The School of the New York Times Teaches Journalists About Native Advertising” submitted as part of the PhD requirement in Social Sciences.
Part 1
Dataset 1.1: Blank Interview Questionnaire (Raw) - This is the questionnaire and interviewer instructions used in the expert interview. The questionnaire has been anonymized to protect the identity of the interviewee.
Dataset 1.2: Interview Transcript (Processed) - This is the full transcript of the expert interview, which was transcribed using Otter.ai and edited by the principal investigator. The transcript has been anonymized to protect the identity of the interviewee.
Part 2
Dataset 2.1: Blank Field Note Worksheet (Raw) - This worksheet was copied and used for each participant observation session.
Dataset 2.2: Field Note Workbook (Processed) - This is the combined set of participant observation field notes generated as part of the study.
Part 3
Dataset 3.1: Course Video Files (Raw) - This file is a compilation of all 64 videos offered in the course studied. This data is restricted to protect the identity of the course instructor, who appears in every video and, in most of them, throughout the full length.
Dataset 3.2: Course Video Transcripts (Processed) - This is the combined transcript of all 64 videos offered in the course studied, which was transcribed using Otter.ai and edited by the principal investigator. The transcript has been anonymized to protect the identity of the course instructor, who is not a participant in this study.
Attachment 1: Human Research Ethics Committee Application
Attachment 2: Human Research Ethics Committee Approval Letter
Attachment 3: Study Participation Consent Form
Attachment 4: Data Management Plan
From website:
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:
As part of the New York Times' indexing procedures, most articles are manually summarized and tagged by a staff of library scientists. This collection contains over 650,000 article-summary pairs which may prove to be useful in the development and evaluation of algorithms for automated document summarization. Also, over 1.5 million documents have at least one tag. Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance if one article mentions "Bill Clinton" and another refers to "President William Jefferson Clinton", both articles will be tagged with "CLINTON, BILL".
The New York Times has established a community website for researchers working on the data set at http://groups.google.com/group/nytnlp and encourages feedback and discussion about the corpus.
Not open. Available on DVD for $300 from LDC Catalog, which states:
Portions © 1987-2008 New York Times, © 2008 Trustees of the University of Pennsylvania
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data was acquired through the NYT's REST API at http://api.nytimes.com/svc/mostpopular/v2/mostviewed/all-sections/1?api-key={your_api_key}. Data was retrieved at irregular intervals; the retrieval time is stored in the column 'INSERT_DATE'. Each row also contains the raw JSON object returned by the API ('JSON'), a SHA-512 hash of it ('HASH') and several fields parsed from the object. In total, there are 10 retrievals. The dataset can be used to monitor changes in the most viewed articles, query the changing number of hits for a keyword in the title, etc.
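The 'HASH' column makes it possible to check that a stored JSON payload has not been altered. A minimal sketch follows, assuming the hash is the hexadecimal SHA-512 digest of the UTF-8 encoded 'JSON' string; the dataset description does not state the encoding or digest format, so both are assumptions, and the sample row is invented.

```python
import hashlib


def sha512_hex(raw_json: str) -> str:
    """Hex SHA-512 digest of the raw JSON text (UTF-8 encoding assumed)."""
    return hashlib.sha512(raw_json.encode("utf-8")).hexdigest()


def row_is_intact(row: dict) -> bool:
    """Compare the stored 'HASH' value with a freshly computed digest."""
    return sha512_hex(row["JSON"]) == row["HASH"].lower()


# Hypothetical row: the payload and timestamp are invented; only the column
# names INSERT_DATE, JSON and HASH come from the dataset description.
payload = '{"status": "OK", "num_results": 0, "results": []}'
row = {"INSERT_DATE": "2015-01-01 00:00:00", "JSON": payload, "HASH": sha512_hex(payload)}
tampered = dict(row, JSON=payload.replace("OK", "KO"))

print(row_is_intact(row), row_is_intact(tampered))
```

Identical hashes across two retrievals would also indicate that the most-viewed list did not change between them.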
Copyright (c) 2015 The New York Times Company. All Rights Reserved.
From The New York Times GitHub source (CSV, US counties): "The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
United States Data
Data on cumulative coronavirus cases and deaths can be found in two files for states and counties.
Each row of data reports cumulative counts based on our best reporting up to the moment we publish an update. We do our best to revise earlier entries in the data when we receive new information."
The data here is the per-county data for the United States.
The CSV link for counties is: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median income data over a decade or more for males and females categorized by Total, Full-Time Year-Round (FT), and Part-Time (PT) employment in New York County. It showcases annual income, providing insights into gender-specific income distributions and the disparities between full-time and part-time work. The dataset can be utilized to gain insights into gender-based pay disparity trends and explore the variations in income for male and female individuals.
Key observations: Insights from 2023
Based on our analysis of ACS 2019-2023 5-Year Estimates, we present the following observations:
- All workers, aged 15 years and older: In New York County, the median income for all workers aged 15 years and older, regardless of work hours, was $72,134 for males and $54,928 for females.
These income figures indicate a substantial gender-based pay disparity, with a gap of approximately 24% between the median incomes of males and females in New York County. Women, regardless of work hours, earn 76 cents for each dollar earned by men, an income disparity that points to wage inequality in New York County.
- Full-time workers, aged 15 years and older: In New York County, among full-time, year-round workers aged 15 years and older, males earned a median income of $119,785, while females earned $96,975, leading to a 19% gender pay gap among full-time workers. Women therefore earn 81 cents for each dollar earned by men in full-time roles. This indicates a substantial income disparity in which women, despite working full-time, earn less than men in the same employment category. Across all roles, including non-full-time employment, the gender pay gap is of similar magnitude (24% versus 19%), which indicates a consistent pattern across employment types in New York County.
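The "cents per dollar" and "pay gap" figures quoted above follow from the ratio of the two medians. The short sketch below reproduces that arithmetic from the medians given in the text; the function name is hypothetical and the rounding convention (two decimals for the ratio, whole percent for the gap) is an assumption about how the published figures were rounded.

```python
def pay_gap(male_median: float, female_median: float):
    """Return (female earnings per male dollar, pay gap in whole percent)."""
    ratio = female_median / male_median
    return round(ratio, 2), round((1 - ratio) * 100)


# Medians quoted in the text for New York County (ACS 2019-2023).
all_workers = pay_gap(72134, 54928)
full_time = pay_gap(119785, 96975)

print(all_workers)  # → (0.76, 24)
print(full_time)    # → (0.81, 19)
```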
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusted for inflation and are presented in 2023 inflation-adjusted dollars.
Gender classifications include:
Employment type classifications include:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for New York County median household income by race. You can refer to the main dataset here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents a detailed breakdown of the count of individuals within distinct income brackets, categorized by gender (men and women) and employment type, full-time (FT) and part-time (PT), offering insight into the income landscape of Stanford town. The dataset can be used to study gender-based income distribution within the Stanford town population, aiding data analysis and decision-making.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Income brackets:
Variables / Data Columns
Employment type classifications include:
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Stanford town median household income by race. You can refer to the main dataset here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median income data over a decade or more for males and females categorized by Total, Full-Time Year-Round (FT), and Part-Time (PT) employment in Napoli town. It showcases annual income, providing insights into gender-specific income distributions and the disparities between full-time and part-time work. The dataset can be utilized to gain insights into gender-based pay disparity trends and explore the variations in income for male and female individuals.
Key observations: Insights from 2023
Based on our analysis of ACS 2019-2023 5-Year Estimates, we present the following observations:
- All workers, aged 15 years and older: In Napoli town, the median income for all workers aged 15 years and older, regardless of work hours, was $42,386 for males and $27,230 for females.
These income figures highlight a substantial gender-based income gap in Napoli town. Women, regardless of work hours, earn 64 cents for each dollar earned by men. This gender pay gap of approximately 36% points to significant gender-based income inequality in Napoli town.
- Full-time workers, aged 15 years and older: In Napoli town, among full-time, year-round workers aged 15 years and older, males earned a median income of $53,795, while females earned $63,750. Within the subset of full-time workers, women earn a higher income than men: 1.19 dollars for every dollar earned by men. This suggests that within full-time roles, women's median incomes surpass men's, contrary to the trend in the broader workforce.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusted for inflation and are presented in 2023 inflation-adjusted dollars.
Gender classifications include:
Employment type classifications include:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Napoli town median household income by race. You can refer to the main dataset here.
NYC Wi-Fi Hotspot Locations
Wi-Fi Providers:
CityBridge, LLC (Free Beta): LinkNYC 1 gigabyte (GB), Free Wi-Fi Internet Kiosks.
Spot On Networks (Free): NYC Housing Authority (NYCHA) properties.
Fiberless (Free): Wi-Fi access on Governors Island. Free, up to 5 Mbps for users, as part of the Governors Island Trust Governors Island Connectivity Challenge.
AT&T (Free): Wi-Fi access is free for all users at all times.
Partners: In several parks, NYC partner organizations provide publicly accessible Wi-Fi. Visit these parks to learn more about their Wi-Fi service and how to connect.
Cable (Limited-Free): In NYC Parks, provided by NYC DoITT cable television franchisees ALTICEUSA, previously known as “Cablevision”, and SPECTRUM, previously known as “Time Warner Cable” (Limited Free). Connect for 3 free 10-minute sessions every 30 days or purchase a 99 cent day pass valid through midnight. Wi-Fi service is free at all times to Cablevision’s Optimum Online and Time Warner Cable broadband subscribers.
Wi-Fi Provider: Chelsea Wi-Fi (Free). Wi-Fi access is free for all users at all times. Chelsea Improvement Company has partnered with Google to provide a free wireless Internet zone, a broadband region bounded by West 19th Street, Gansevoort Street, Eighth Avenue, and the High Line Park.
Wi-Fi Provider: Downtown Brooklyn Wi-Fi (Free). The Downtown Brooklyn Partnership, with the New York City Economic Development Corporation, provides Wi-Fi to the area bordered by Schermerhorn Street, Cadman Plaza West, Flatbush Avenue, and Tillary Street, along with select public spaces in the NYCHA Ingersoll and Whitman Houses.
Wi-Fi Provider: Manhattan Downtown Alliance Wi-Fi (Free). Lower Manhattan: several public spaces along Water Street, Front Street and the East River Esplanade south of Fulton Street, and several other locations throughout Lower Manhattan.
Wi-Fi Provider: Harlem Wi-Fi (Free). The free outdoor public wireless network will extend 95 city blocks, from 110th to 138th Streets between Frederick Douglass Boulevard and Madison Avenue.
Wi-Fi Provider: Transit Wireless (Free). Wi-Fi service in the New York City subway system is available in certain underground stations. For more information visit http://www.transitwireless.com/stations/.
Wi-Fi Provider: Public Pay Telephone Franchisees (Free). Using existing payphone infrastructure, the City of New York has teamed up with private partners to provide free Wi-Fi service at public payphone kiosks across the five boroughs at no cost to taxpayers.
Wi-Fi Provider: New York Public Library. Using Wireless Internet Access (Wi-Fi): All Library locations offer free wireless access (Wi-Fi) in public areas at all times the libraries are open.
Connecting to the Library's Wireless Network
• You must have a computer or other device equipped with an 802.11b-compatible wireless card.
• Using your computer's network utilities, look for the wireless network named "NYPL."
• The "NYPL" wireless network does not require a password to connect.
Limitations and Disclaimers Regarding Wireless Access
• The Library's wireless network is not secure. Information sent from or to your laptop can be captured by anyone else with a wireless device and the appropriate software, within three hundred feet.
• Library staff is not able to provide technical assistance, and no guarantee can be provided that you will be able to make a wireless connection.
• The Library assumes no responsibility for the safety of equipment or for laptop configurations, security, or data files resulting from connection to the Library's network.
In an effort to help combat COVID-19, we created a COVID-19 Public Datasets program to make data more accessible to researchers, data scientists and analysts. The program will host a repository of public datasets that relate to the COVID-19 crisis and make them free to access and analyze. These include datasets from the New York Times, European Centre for Disease Prevention and Control, Google, Global Health Data from the World Bank, and OpenStreetMap.
Free hosting and queries of COVID datasets
As with all data in the Google Cloud Public Datasets Program, Google pays for storage of datasets in the program. BigQuery also provides free queries over certain COVID-related datasets to support the response to COVID-19. Queries on COVID datasets will not count against the BigQuery sandbox free tier, where you can query up to 1TB free each month.
Limitations and duration
Queries of COVID data are free. If, during your analysis, you join COVID datasets with non-COVID datasets, the bytes processed in the non-COVID datasets will be counted against the free tier, then charged accordingly, to prevent abuse. Queries of COVID datasets will remain free until Sept 15, 2021.
The contents of these datasets are provided to the public strictly for educational and research purposes only. We are not onboarding or managing PHI or PII data as part of the COVID-19 Public Dataset Program. Google has practices & policies in place to ensure that data is handled in accordance with widely recognized patient privacy and data security policies. See the list of all datasets included in the program.
https://www.icpsr.umich.edu/web/ICPSR/studies/8243/terms
These seven datasets are part of an ongoing data collection effort in which The New York Times and CBS News are equal partners. Each survey includes questions about President Ronald Reagan's performance in office, especially with respect to economic and foreign affairs. In addition, each survey provides information on respondents' views concerning other social and political issues, as well as respondents' personal backgrounds. The surveys were conducted in January, April, June, September (twice), and October (twice). The October surveys took place before and after President Reagan's speech about Grenada on October 27, 1983. The October samples are weighted separately, and two discrete datasets, which may be analyzed separately or combined, are available (Parts 6 and 7). Topics covered in Part 1, January Survey, include Reagan's handling of economic and foreign affairs, various proposals to reduce the federal deficit, unemployment, and Social Security. In Part 2, April Survey, individuals responded to questions about Reagan's handling of economic and foreign affairs, the environment, and defense policy, and were also asked about their willingness to vote for a Black candidate, candidates endorsed by labor unions, and candidates endorsed by homosexual organizations. Two versions of the questionnaire were used, to test alternative question wording. For Part 3, June Survey, questions were asked on Reagan's presidency, possible presidential candidates in 1984, foreign policy, economic policy, merit pay for public school teachers, federal spending on education, and tennis. Part 4, Plane Survey, queried respondents about the Korean passenger plane shot down by the Soviet Union in September 1983, including their opinions on the American response to the attack. The questionnaire also included questions about Reagan's handling of foreign and economic policy. 
Part 5, September Survey, covered telephone service, United States troops in Lebanon, possible presidential candidates, and President Reagan's handling of economic and foreign policy. Two versions of the questionnaire were used, to test alternative question wording. A question about the cease-fire agreement in Lebanon was included in only one of those versions. Part 6, October (Prespeech) Survey, was conducted before President Reagan gave his speech on Grenada. Respondents were asked their opinions on having United States troops in Grenada and Lebanon, the attack on the Marine barracks in Lebanon, and Reagan's handling of foreign policy. Part 7, October (Postspeech) Survey, was conducted after President Reagan's speech on Grenada and concerned the same issues that were covered in the Prespeech Survey.
Attribution-NonCommercial 2.0 (CC BY-NC 2.0) https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
gnoss.com is a society on the Net: it enables people, companies and any human group or organization to connect, interact and work according to their interests in a linked open knowledge network. The project has been in public beta since September 2009. It already has 28,000 subscribers and a knowledge network of more than 3,000 communities (November 2012) on topics related to innovation, technology, business and education, among others. gnoss.com works within the philosophy of Open Data and proposes a solution so that data can be linked (Linked Data). Data are expressed in OWL-RDF files.
Currently, more than 100,000 resources have been published on the platform. For each resource, the platform lets users download an RDF document describing its content and metadata. Different languages such as FOAF, SIOC and SKOS are used in this description. Our whole dataset has about 4,500,000 RDF triples. In addition, most of these resources are linked to one or more resources of the Freebase and New York Times datasets. We have about 20,000 links to these datasets.
GNOSS is a software platform, created by RIAM Intelearning LAB S.L., to build specialized online social networks with dynamic semantic publishing. GNOSS integrates knowledge management, informal learning and collaborative work in a Linked Data environment. Every GNOSS space incorporates semantic facet-based searches and semantic context creation which drastically improves user experience. GNOSS runs on technologies and standards of the semantic web, which makes it possible to structure and link all kinds of content, among them and with other Open Data (Linked Data), and to reinforce and to amplify the knowledge management processes with facet-based searches and generation of documentary and personal contexts for specific information. GNOSS provides people, groups and organizations with the necessary tools to create and develop their digital identity, connect their intelligences, create communities based on their interests and motivations, and activate thriving processes of collective creativity, brainpower, debate and thinking.
For further information: http://noticias.gnoss.com
gnoss.com is a networked society: it lets people, companies and any other group connect, interact and work according to their interests in an open knowledge network. The project has been in public beta since September 2009. Gnoss.com has 27,000 registered members and a knowledge network of more than 3,000 communities (October 2012) on topics such as innovation, technology, business and education, among others. gnoss.com follows the Open Data philosophy and proposes a solution in which data can be linked (Linked Data). The data is expressed in OWL-RDF files.
At present, more than 100,000 resources have been published on the platform. For each resource, the platform offers an RDF file describing its content and metadata. Different languages such as FOAF, SIOC and SKOS are used in this description. Our database has more than 4,500,000 triples. In addition, most gnoss.com resources are linked to other resources in Freebase and The New York Times. We have more than 20,000 links to these datasets.
GNOSS is a software platform, created by RIAM Intelearning LAB SL, for building specialized social networks through dynamic semantic publishing of content. GNOSS integrates knowledge management, informal learning and collaborative work in a Linked Data environment. Every GNOSS space incorporates faceted semantic searches and semantic context generation, which considerably improves the user experience.
GNOSS runs on semantic web technologies and standards, which on the one hand makes it possible to structure all kinds of content and link it both to other content and to people's interests (Linked Data), and on the other hand reinforces and amplifies knowledge management processes with faceted searches and the generation of documentary and personal contexts for a given piece of information.
GNOSS provides people, groups and organizations with the tools they need to create and deploy their digital identity; connect intelligences; create communities based on their interests and motivations; and activate robust processes of collective creativity, intelligence, deliberation and thinking.
Latest GNOSS news: http://noticias.gnoss.com
https://borealisdata.ca/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.5683/SP2/EQ7FTZ
Introduction
Message Understanding Conference (MUC) 7 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2001T02 and ISBN 1-58563-205-8. In the 1990s, the MUC evaluations funded the development of metrics and statistical algorithms to support government evaluations of emerging information extraction technologies. Additional information from NIST can be found here.
Data
The following list shows the correspondence between versions of the IE task definition and stages of the MUC-7 evaluation.
Version   Stage
4.1       training and dryrun
4.2       formalrun
5.1       final
The dryrun and formalrun have different domains: the dryrun (and training) consists of air crash scenarios and the formalrun consists of missile launch scenarios. The final version mainly updates the Template Relations portion of the guidelines. Normally, for each scenario, two datasets are provided: training and test. When the evaluation cycle begins, the label for the scenario dataset is training. Then the corresponding test dataset for that same scenario is used for the dryrun testing. For the formal run, a formal training set is given out four weeks before the test answers are due. The formal test is given out one week before the test answers are due. After the entire evaluation and meeting have been held, final edits are made if necessary.
Samples
Please view this text sample.
Updates
August 22, 2001: This publication was inadvertently released without the guidelines documentation and the scoring software. These documents and programs have now been added to the publication. If you previously purchased this corpus and would like to download a complete copy, please contact ldc@ldc.upenn.edu.
Copyright
Portions © 1996 New York Times, © 2001 Trustees of the University of Pennsylvania
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered to the book "Following 9/11: Religion Coverage in the New York Times". It features 10 columns, including number of authors, number of books, earliest publication date, and latest publication date.
https://choosealicense.com/licenses/unknown/
Dataset Card for CrosswordQA
Dataset Summary
The CrosswordQA dataset is a set of over 6 million clue-answer pairs scraped from the New York Times and many other crossword publishers. The dataset was created to train the Berkeley Crossword Solver's QA model. See our paper for more information. Answers are automatically segmented (e.g., BUZZLIGHTYEAR -> Buzz Lightyear), and thus may occasionally be segmented incorrectly.
Supported Tasks and Leaderboards
[Needs… See the full description on the dataset page: https://huggingface.co/datasets/albertxu/CrosswordQA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes two datasets, a Document-Term Matrix and associated metadata, for 17,493 New York Times articles covering protest events, both saved as single R objects.
These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text of the news articles was not included in the original DoCA data.
We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.
We then subset the original DoCA dataset to include only rows that match a recollected article. The file "20231006_dca_metadata_subset.Rds" contains all of the metadata variables from the original DoCA dataset (see Codebook), plus two added variables: "pdf_file" and "pub_title", the latter being the title of the recollected article (which may differ from the "title" variable in the original dataset). This gives a total of 106 variables and 21,126 rows. Note that each row is a distinct protest event, and one article may cover more than one protest event.
Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: We removed headers and footers that were consistent across all digitized stories and any web links or HTML; added a single space before an uppercase letter when it was flush against a lowercase letter to its right (e.g., turning "JohnKennedy" into "John Kennedy"); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to "Basic Latin" ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's" into "it is"); removed punctuation; removed capitalization; removed numbers; fixed word kerning; applied a final extra round of whitespace removal.
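The cleaning steps above can be approximated in a few lines of Python. This is only a minimal sketch under stated assumptions, not the authors' pipeline: the function name `clean_text` and the tiny `CONTRACTIONS` table are hypothetical, the header/footer removal and kerning fix are omitted because the description does not specify them, and curly quotes are replaced before the ASCII transliteration (a deviation from the listed order) so that they are not silently dropped.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny contraction table; the real list is not given.
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}

# Curly quotes mapped to their ASCII counterparts.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_text(text: str) -> str:
    # Strip web links and HTML tags.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    # Insert a space where a lowercase letter runs directly into an uppercase one.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Remove excess whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Replace curly quotes first (assumption: done before transliteration here).
    text = text.translate(QUOTES)
    # Transliterate to ASCII: decompose accented characters, drop the rest.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Expand contractions, ignoring case.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text, flags=re.IGNORECASE)
    # Remove punctuation, capitalization and numbers.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    # Final round of whitespace removal.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_text("JohnKennedy said it\u2019s over!")` yields a lowercase, punctuation-free string in which the name is split into two words and the contraction is expanded.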
We then tokenized them by following the rule that each word is a character string surrounded by a single space. At this step, each document is a list of tokens. We count each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times that token occurred in that article. Finally, we removed words (i.e., columns in the DTM) that occurred fewer than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCRing process). The final DTM has 66,552 unique words, 10,134,304 total tokens and 17,493 documents. The "20231006_dca_dtm.Rds" file is a sparse matrix class object from the Matrix R package.
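The tokenization and filtering described above can be illustrated with a small, standard-library Python sketch. It is an illustration of the general technique rather than the code used to build the released matrix; the function name `build_dtm`, its parameters, and the dictionary-per-row sparse representation are assumptions made for the example.

```python
from collections import Counter


def build_dtm(docs, min_count=4, min_length=2):
    """Return (vocab, rows) for a list of already-cleaned document strings.

    rows[i] maps a column index in vocab to the token's count in document i,
    which is a simple sparse representation of the document-term matrix.
    """
    # Tokenize on whitespace and count tokens per document.
    counts = [Counter(doc.split()) for doc in docs]

    # Corpus-wide totals, used for the frequency filter.
    totals = Counter()
    for doc_counts in counts:
        totals.update(doc_counts)

    # Keep tokens seen at least min_count times and at least min_length characters long.
    vocab = sorted(
        tok for tok, n in totals.items() if n >= min_count and len(tok) >= min_length
    )
    index = {tok: j for j, tok in enumerate(vocab)}

    # One sparse row per document, restricted to the kept vocabulary.
    rows = [
        {index[tok]: n for tok, n in doc_counts.items() if tok in index}
        for doc_counts in counts
    ]
    return vocab, rows
```

With the default arguments, a token must appear at least four times across the corpus and be at least two characters long to keep its column, mirroring the thresholds stated in the text.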
In R, use the load() function to load the objects `dca_dtm` and `dca_meta`. To associate `dca_meta` with `dca_dtm`, match the "pdf_file" variable in `dca_meta` to the rownames of `dca_dtm`.
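The matching logic itself is language-neutral. As a rough illustration in Python (not the R workflow the files are distributed for), the sketch below looks up, for each metadata record, the position of its article in the DTM row names. The function name `rows_for_events` and the list-of-dicts stand-in for `dca_meta` are hypothetical; only the "pdf_file" key comes from the description above. Because one article may cover several protest events, several metadata records can point at the same DTM row.

```python
def rows_for_events(row_names, meta, key="pdf_file"):
    """For each metadata record, return the index of its article's DTM row.

    Records whose key value is not among row_names map to None.
    """
    position = {name: i for i, name in enumerate(row_names)}
    return [position.get(record[key]) for record in meta]
```

The returned list has one entry per metadata record, so it can be used to pull the matching DTM row for every protest event.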