https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Newspaper Navigator
Dataset Summary
This dataset provides a Parquet-converted version of the Newspaper Navigator dataset from the Library of Congress. Originally released as JSON, Newspaper Navigator contains over 16 million pages of historic US newspapers annotated with bounding boxes, predicted visual types (e.g., photographs, maps), and OCR content. This work was carried out as part of a project by Benjamin Germain Lee et al. This version of the… See the full description on the dataset page: https://huggingface.co/datasets/biglam/newspaper-navigator.
The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/).
[The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project.
One of these categories is 'maps'. In the original training data for Newspaper Navigator, there were relatively few labelled examples of maps. The model's map predictions have an Average Precision of 69.5%, and the validation data contains only 34 map images.
This dataset contains a sample of these images which have been predicted as 'maps'. It also includes additional labels which indicate whether the predicted map image is a 'map' or 'not a map'.
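As a rough illustration of what an Average Precision figure summarises, the sketch below computes the simple, non-interpolated ranking form of the metric from confidence scores and 'map' / 'not a map' labels. This is an assumption-laden simplification: the evaluation protocol behind the 69.5% figure is not described here (object detection metrics also usually involve bounding-box overlap thresholds), and the scores and labels in the example are made up.

```python
def average_precision(scores, labels):
    """Non-interpolated average precision for one class.

    scores: model confidence for each predicted 'map' image.
    labels: 1 if the image really is a map, 0 if it is not.
    """
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, (_, is_map) in enumerate(ranked, start=1):
        if is_map:
            hits += 1
            # Precision at the rank where each true map is found.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Hypothetical scores and labels, for illustration only.
example_scores = [0.9, 0.8, 0.7, 0.6]
example_labels = [1, 0, 1, 0]
ap = average_precision(example_scores, example_labels)
```

With few positive examples, a single misranked image shifts the value noticeably, which is one reason a score based on 34 validation images should be read with caution.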
The data is organised as follows:
The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/).
[The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project.
One of these categories is 'advertisements'. This dataset contains a sample of these images with additional labels indicating if the advert is 'illustrated' or 'not illustrated'.
The data is organised as follows:
This dataset was created for use in an under-review Programming Historian tutorial (http://programminghistorian.github.io/ph-submissions/lessons/computer-vision-deep-learning-pt1). The primary aim of the data was to provide a realistic example dataset for teaching computer vision for working with digitised heritage material. The data is shared here since it may be useful for others. This data documentation is a work in progress and will be updated when the Programming Historian tutorial is released publicly.
The metadata CSV file contains the following columns:
- filepath
- pub_date
- page_seq_num
- edition_seq_num
- batch
- lccn
- box
- score
- ocr
- place_of_publication
- geographic_coverage
- name
- publisher
- url
- page_url
- month
- year
- iiif_url
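A minimal sketch of how the column list above could be used to sanity-check a metadata CSV before working with it. It relies only on the Python standard library; the helper name `missing_columns` is hypothetical, and the in-memory sample stands in for the real file, whose name is not given for this dataset.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "filepath", "pub_date", "page_seq_num", "edition_seq_num", "batch",
    "lccn", "box", "score", "ocr", "place_of_publication",
    "geographic_coverage", "name", "publisher", "url", "page_url",
    "month", "year", "iiif_url",
]


def missing_columns(csv_file):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [column for column in EXPECTED_COLUMNS if column not in header]


# In-memory stand-in for the metadata file (header row only).
sample = io.StringIO(",".join(EXPECTED_COLUMNS) + "\n")
print(missing_columns(sample))  # prints []
```

For a real file, the same function can be called on an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded metadata CSV.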
The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/).
[The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project.
source: https://news-navigator.labs.loc.gov/
One of these categories is 'photographs'. This dataset contains a sample of these images with additional labels indicating if the photograph has one or more of the following labels: "human", "animal", "human-structure" and "landscape".
The data is organised as follows:
- images.zip contains the images themselves
- newspaper-navigator-sample-metadata.csv contains metadata about each image drawn from the Newspaper Navigator Dataset
- multi_label.csv contains the labels for the images as a CSV file
- annotations.csv contains the labels for the images with additional metadata
This dataset was created for use in an under-review Programming Historian tutorial (http://programminghistorian.github.io/ph-submissions/lessons/computer-vision-deep-learning-pt2). The primary aim of the data was to provide a realistic example dataset for teaching computer vision for working with digitised heritage material. The data is shared here since it may be useful for others. This data documentation is a work in progress and will be updated when the Programming Historian tutorial is released publicly.
The metadata CSV file contains the following columns:
- filepath
- pub_date
- page_seq_num
- edition_seq_num
- batch
- lccn
- box
- score
- ocr
- place_of_publication
- geographic_coverage
- name
- publisher
- url
- page_url
- month
- year
- iiif_url
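Because each photograph can carry more than one of the four labels, a common way to prepare such data for a classifier is a multi-hot vector with one slot per label. The sketch below shows that encoding. It assumes, without confirmation from the description above, that the labels for one image are stored as a single string joined by a `|` delimiter; the function name `to_multi_hot` is hypothetical.

```python
# The four labels named in the dataset description, in a fixed order.
LABELS = ["human", "animal", "human-structure", "landscape"]


def to_multi_hot(label_field, delimiter="|"):
    """Convert a delimited label string into a 0/1 vector ordered like LABELS.

    The delimiter is an assumption; adjust it to match the actual file.
    """
    present = {part.strip() for part in label_field.split(delimiter) if part.strip()}
    unknown = present - set(LABELS)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    return [1 if name in present else 0 for name in LABELS]


print(to_multi_hot("human|landscape"))  # prints [1, 0, 0, 1]
print(to_multi_hot(""))                 # prints [0, 0, 0, 0]
```

An image with no labels maps to an all-zero vector, which keeps unlabelled photographs usable as negative examples for every class.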
https://choosealicense.com/licenses/undefined/
davanstrien/newspaper_navigator dataset hosted on Hugging Face and contributed by the HF Datasets community
https://github.com/LibraryOfCongress/newspaper-navigator/blob/master/LICENSE
The goal of Newspaper Navigator is to re-imagine searching over the visual content in Chronicling America. The project consists of two stages:
- Creating the Newspaper Navigator dataset by extracting headlines, photographs, illustrations, maps, comics, cartoons, and advertisements from 16.3 million historic newspaper pages in Chronicling America using emerging machine learning techniques. In addition to the visual content, the dataset includes captions and other relevant text derived from the METS/ALTO OCR, as well as image embeddings for fast similarity querying.
- Creating an exploratory search application for the Newspaper Navigator dataset in order to enable new ways for the American public to navigate Chronicling America.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Beyond Words
Dataset Summary
The Beyond Words dataset is a crowdsourced collection of bounding box annotations on World War I-era historical newspaper pages from the Library of Congress’s Chronicling America collection. Volunteers marked seven types of visual content — photographs, illustrations, maps, comics, editorial cartoons, headlines, and advertisements — enabling the training of the visual content recognition model behind the Newspaper Navigator… See the full description on the dataset page: https://huggingface.co/datasets/biglam/loc_beyond_words.