7 datasets found
  1. Data from: newspaper-navigator

    • huggingface.co
    Updated May 20, 2025
    Cite
    BigLAM: BigScience Libraries, Archives and Museums (2025). newspaper-navigator [Dataset]. https://huggingface.co/datasets/biglam/newspaper-navigator
    Dataset updated
    May 20, 2025
    Dataset authored and provided by
    BigLAM: BigScience Libraries, Archives and Museums
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Newspaper Navigator

      Dataset Summary
    

    This dataset provides a Parquet-converted version of the Newspaper Navigator dataset from the Library of Congress. Originally released as JSON, Newspaper Navigator contains over 16 million pages of historic US newspapers annotated with bounding boxes, predicted visual types (e.g., photographs, maps), and OCR content. This work was carried out as part of a project by Benjamin Germain Lee et al. This version of the… See the full description on the dataset page: https://huggingface.co/datasets/biglam/newspaper-navigator.

  2. Images from Newspaper Navigator predicted as maps, with human corrected...

    • zenodo.org
    csv, json, txt, zip
    Updated Mar 15, 2021
    Cite
    Daniel van Strien (2021). Images from Newspaper Navigator predicted as maps, with human corrected labels [Dataset]. http://doi.org/10.5281/zenodo.4156510
    Available download formats: txt, json, zip, csv
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel van Strien
    Description

    The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/).

    [The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project.

    source: https://news-navigator.labs.loc.gov/

    One of these categories is 'maps'. In the original training data for Newspaper Navigator, there were relatively few labelled examples of maps. The predictions for maps have an average precision of 69.5%, and the validation data contains 34 map images.

    This dataset contains a sample of these images which have been predicted as 'maps'. It also includes additional labels which indicate whether the predicted map image is a 'map' or 'not a map'.

    The data is organised as follows:

    • The images themselves can be found in `newspaper_maps.zip`
    • `2020_30_10_13_19_228_sample.json` contains metadata about each image drawn from the Newspaper Navigator Dataset.
    • `map_labels.csv` contains the labels for the images as a CSV file
  3. 19th Century United States Newspaper Advert images with 'illustrated' or...

    • zenodo.org
    • data.niaid.nih.gov
    csv, zip
    Updated Jan 12, 2022
    Cite
    Daniel van Strien (2022). 19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels [Dataset]. http://doi.org/10.5281/zenodo.5838410
    Available download formats: csv, zip
    Dataset updated
    Jan 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel van Strien
    Area covered
    United States
    Description

    The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/).

    [The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project.

    source: https://news-navigator.labs.loc.gov/

    One of these categories is 'advertisements'. This dataset contains a sample of these images with additional labels indicating if the advert is 'illustrated' or 'not illustrated'.

    The data is organised as follows:

    • The images themselves can be found in `images.zip`
    • `newspaper-navigator-sample-metadata.csv` contains metadata about each image drawn from the Newspaper Navigator Dataset.
    • `ads.csv` contains the labels for the images as a CSV file
    • `sample.csv` contains additional metadata about the images (based on the newspapers those images came from).

    This dataset was created for use in an under-review Programming Historian tutorial (http://programminghistorian.github.io/ph-submissions/lessons/computer-vision-deep-learning-pt1). The primary aim of the data was to provide a realistic example dataset for teaching computer vision for working with digitised heritage material. The data is shared here since it may be useful for others. This data documentation is a work in progress and will be updated when the Programming Historian tutorial is released publicly.

    The metadata CSV file contains the following columns:

    - filepath
    - pub_date
    - page_seq_num
    - edition_seq_num
    - batch
    - lccn
    - box
    - score
    - ocr
    - place_of_publication
    - geographic_coverage
    - name
    - publisher
    - url
    - page_url
    - month
    - year
    - iiif_url
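
As a minimal sketch of working with the metadata columns listed above, the following reads a CSV and checks that every listed column is present. The column names are taken from the description; the function name `read_metadata` and the one-row sample are made up for illustration, so the code runs without downloading the dataset. The actual value formats in the real file (for example how `box` is stored) are not assumed here.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "filepath", "pub_date", "page_seq_num", "edition_seq_num", "batch",
    "lccn", "box", "score", "ocr", "place_of_publication",
    "geographic_coverage", "name", "publisher", "url", "page_url",
    "month", "year", "iiif_url",
]


def read_metadata(handle):
    """Read metadata rows and check that every listed column is present."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up single-row sample (hypothetical values, not real data).
sample_values = {name: "" for name in EXPECTED_COLUMNS}
sample_values.update({"filepath": "example.jpg", "pub_date": "1900-01-01", "year": "1900"})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
writer.writeheader()
writer.writerow(sample_values)
buffer.seek(0)

rows = read_metadata(buffer)
print(len(rows), rows[0]["filepath"])
```

To try this on the real file, one would pass an open handle for `newspaper-navigator-sample-metadata.csv` (opened with `newline=""`) instead of the in-memory buffer.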

  4. 19th Century United States Newspaper images predicted as Photographs with...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Jan 11, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Daniel van Strien (2022). 19th Century United States Newspaper images predicted as Photographs with labels for "human", "animal", "human-structure" and "landscape" [Dataset]. http://doi.org/10.5281/zenodo.4487141
    Dataset updated
    Jan 11, 2022
    Authors
    Daniel van Strien
    Area covered
    United States
    Description

    The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/).

    [The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project.

    source: https://news-navigator.labs.loc.gov/

    One of these categories is 'photographs'. This dataset contains a sample of these images with additional labels indicating if the photograph has one or more of the following labels: "human", "animal", "human-structure" and "landscape".

    The data is organised as follows:

    • The images themselves can be found in `images.zip`
    • `newspaper-navigator-sample-metadata.csv` contains metadata about each image drawn from the Newspaper Navigator Dataset.
    • `multi_label.csv` contains the labels for the images as a CSV file
    • `annotations.csv` contains the labels for the images with additional metadata

    This dataset was created for use in an under-review Programming Historian tutorial (http://programminghistorian.github.io/ph-submissions/lessons/computer-vision-deep-learning-pt2). The primary aim of the data was to provide a realistic example dataset for teaching computer vision for working with digitised heritage material. The data is shared here since it may be useful for others. This data documentation is a work in progress and will be updated when the Programming Historian tutorial is released publicly.

    The metadata CSV file contains the following columns:

    - filepath
    - pub_date
    - page_seq_num
    - edition_seq_num
    - batch
    - lccn
    - box
    - score
    - ocr
    - place_of_publication
    - geographic_coverage
    - name
    - publisher
    - url
    - page_url
    - month
    - year
    - iiif_url
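
Because a photograph in this dataset can carry one or more of the four labels named in the description ("human", "animal", "human-structure", "landscape"), one common way to prepare such labels for a classifier is a multi-hot vector. The sketch below is illustrative only: the function name `encode_labels` and the `|` separator are assumptions, and the way labels are actually stored in the label CSV is not stated in the description.

```python
# The four label names come from the dataset description.
LABELS = ["human", "animal", "human-structure", "landscape"]


def encode_labels(label_string, separator="|"):
    """Turn a delimited label string into a multi-hot list ordered like LABELS.

    The separator is a hypothetical choice; adjust it to the real file.
    """
    present = {part.strip() for part in label_string.split(separator) if part.strip()}
    unknown = present - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in present else 0 for label in LABELS]


print(encode_labels("human|landscape"))  # [1, 0, 0, 1]
print(encode_labels(""))                 # [0, 0, 0, 0]
```

An image with no labels maps to an all-zero vector, which keeps unlabelled photographs distinguishable from any single-label case.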

  5. newspaper_navigator

    • huggingface.co
    Updated Oct 14, 2022
    + more versions
    Cite
    Daniel van Strien (2022). newspaper_navigator [Dataset]. https://huggingface.co/datasets/davanstrien/newspaper_navigator
    Available format: Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Oct 14, 2022
    Authors
    Daniel van Strien
    License

    https://choosealicense.com/licenses/undefined/

    Description

    davanstrien/newspaper_navigator dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Data from: Newspaper Navigator

    • opendatalab.com
    zip
    Updated Mar 24, 2023
    Cite
    Library of Congress (2023). Newspaper Navigator [Dataset]. https://opendatalab.com/OpenDataLab/Newspaper_Navigator
    Available download formats: zip
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    LC Labs
    University of Washington
    Library of Congress
    License

    https://github.com/LibraryOfCongress/newspaper-navigator/blob/master/LICENSE

    Description

    The goal of Newspaper Navigator is to re-imagine searching over the visual content in Chronicling America. The project consists of two stages:

    • Creating the Newspaper Navigator dataset by extracting headlines, photographs, illustrations, maps, comics, cartoons, and advertisements from 16.3 million historic newspaper pages in Chronicling America using emerging machine learning techniques. In addition to the visual content, the dataset includes captions and other relevant text derived from the METS/ALTO OCR, as well as image embeddings for fast similarity querying.
    • Creating an exploratory search application for the Newspaper Navigator dataset in order to enable new ways for the American public to navigate Chronicling America.

  7. loc_beyond_words

    • huggingface.co
    Updated Mar 2, 2023
    Cite
    BigLAM: BigScience Libraries, Archives and Museums (2023). loc_beyond_words [Dataset]. https://huggingface.co/datasets/biglam/loc_beyond_words
    Dataset updated
    Mar 2, 2023
    Dataset authored and provided by
    BigLAM: BigScience Libraries, Archives and Museums
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Beyond Words

      Dataset Summary
    

    The Beyond Words dataset is a crowdsourced collection of bounding box annotations on World War I-era historical newspaper pages from the Library of Congress’s Chronicling America collection. Volunteers marked seven types of visual content — photographs, illustrations, maps, comics, editorial cartoons, headlines, and advertisements — enabling the training of the visual content recognition model behind the Newspaper Navigator… See the full description on the dataset page: https://huggingface.co/datasets/biglam/loc_beyond_words.


