7 datasets found
  1. PubLayNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Xu Zhong; Jianbin Tang; Antonio Jimeno Yepes, PubLayNet Dataset [Dataset]. https://paperswithcode.com/dataset/publaynet
    Authors
    Xu Zhong; Jianbin Tang; Antonio Jimeno Yepes
    Description

    PubLayNet is a dataset for document layout analysis, built by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets: it contains over 360 thousand document images in which typical document layout elements are annotated.

  2. publaynet

    • huggingface.co
    Updated Jul 31, 2024
    eddie (2024). publaynet [Dataset]. https://huggingface.co/datasets/psyche/publaynet
    Dataset updated
    Jul 31, 2024
    Authors
    eddie
    Description

    psyche/publaynet dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. Zhong, X., Tang, J., Yepes, A.J. (2024). Dataset: PublayNet: largest dataset...

    • service.tib.eu
    Updated Dec 16, 2024
    (2024). Zhong, X., Tang, J., Yepes, A.J. (2024). Dataset: PublayNet: largest dataset ever for document layout analysis. https://doi.org/10.57702/f4kresfh [Dataset]. https://service.tib.eu/ldmservice/dataset/publaynet--largest-dataset-ever-for-document-layout-analysis
    Dataset updated
    Dec 16, 2024
    Description

    The PublayNet dataset is the largest dataset ever for the document layout analysis task.

  4. publaynet

    • huggingface.co
    Updated Aug 1, 1999
    Ma Wenkang (1999). publaynet [Dataset]. https://huggingface.co/datasets/Mwk19990801/publaynet
    Dataset updated
    Aug 1, 1999
    Authors
    Ma Wenkang
    Description

    Mwk19990801/publaynet dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. DocLayNet-base

    • huggingface.co
    • opendatalab.com
    Updated Apr 27, 2023
    Pierre Guillou (2023). DocLayNet-base [Dataset]. https://huggingface.co/datasets/pierreguillou/DocLayNet-base
    Dataset updated
    Apr 27, 2023
    Authors
    Pierre Guillou
    License

    https://choosealicense.com/licenses/other/

    Description

    Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
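The description above says the annotations are in COCO format, with labelled bounding boxes per page. As a rough illustration of what that implies for a reader of the files, the sketch below counts boxes per class in a COCO-style dictionary and converts a COCO `[x, y, width, height]` box to corner coordinates. The sample data, the file name, and the two class names are made up for the example; the real annotation files and their exact fields should be checked on the dataset page.

```python
from collections import Counter


def count_boxes_per_class(coco: dict) -> Counter:
    """Count annotated bounding boxes per category name in a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])


def bbox_to_corners(bbox):
    """Convert a COCO box [x, y, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


# Hypothetical, hand-made sample in COCO layout (not taken from the dataset).
sample = {
    "images": [{"id": 1, "file_name": "page_0001.png", "width": 1025, "height": 1025}],
    "categories": [{"id": 1, "name": "Text"}, {"id": 2, "name": "Table"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 60, 400, 120]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 200, 400, 90]},
        {"id": 12, "image_id": 1, "category_id": 2, "bbox": [50, 320, 500, 300]},
    ],
}

counts = count_boxes_per_class(sample)
corners = bbox_to_corners(sample["annotations"][0]["bbox"])
print(counts, corners)
```

For a real file, the same functions would be applied to the result of `json.load` on the annotation JSON.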

  6. DocLayNet-v1.1

    • huggingface.co
    Updated Mar 29, 2024
    Docling (2024). DocLayNet-v1.1 [Dataset]. https://huggingface.co/datasets/ds4sd/DocLayNet-v1.1
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Docling
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for DocLayNet v1.1

    Dataset Summary

    DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

    Human Annotation: DocLayNet is hand-annotated by well-trained experts, providing a gold-standard in layout segmentation through human recognition and interpretation of… See the full description on the dataset page: https://huggingface.co/datasets/ds4sd/DocLayNet-v1.1.

  7. Tesseract OCR of IIT-CDIP Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 13, 2022
    Brian Davis (2022). Tesseract OCR of IIT-CDIP Dataset [Dataset]. http://doi.org/10.5281/zenodo.6540454
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brian Davis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are Tesseract-generated transcriptions (no images) of most of the IIT-CDIP dataset. To download the images of the IIT-CDIP dataset, go to https://data.nist.gov/od/id/mds2-2531

    The directory structure of this dataset is the same as that of the IIT-CDIP dataset (although everything is in one tar, with "a.a", "a.b", ... directories), so it can be combined with the IIT-CDIP image dataset using rsync or a similar tool. This dataset contains an "X.layout.json" for each "X.png" in the IIT-CDIP dataset (it does not include sections 'a', 'w', 'x', 'y', and 'z').

    The jsons contain block/paragraph, line, and word bounding boxes, with transcriptions for the words, following the Tesseract format. The line and word annotations are taken directly from Tesseract. The block and paragraph output of Tesseract was discarded. The images were then run through both the Publaynet and PrimaNet models available on LayoutParser (https://layout-parser.github.io/). The combined output of these models became the block/paragraph annotations (we kept the Tesseract output format, but each block has 1 paragraph of exactly the same shape).

    Important: The json also contains a "rotation" value (0, 90, 180, or 270), indicating that the json may describe a version of the IIT-CDIP image rotated by that amount (documents were rotated toward an upright position to get better OCR results).
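The pairing of "X.png" with "X.layout.json" and the "rotation" value can be sketched as follows. This is only an illustration under assumptions: the image file name is invented, "rotation" is assumed to be a top-level key of the json, and the helper names are hypothetical. The direction of rotation is not stated in the description, so the sketch only derives the page size after rotation, which does not depend on direction.

```python
import json
from pathlib import Path


def layout_path_for(image_path):
    """Map '.../X.png' to the sibling '.../X.layout.json'."""
    p = Path(image_path)
    return p.with_name(p.stem + ".layout.json")


def rotated_size(width, height, rotation):
    """Return (width, height) of a page after rotating by 0, 90, 180, or 270 degrees."""
    if rotation not in (0, 90, 180, 270):
        raise ValueError(f"unexpected rotation value: {rotation}")
    # A quarter turn in either direction swaps width and height.
    return (height, width) if rotation in (90, 270) else (width, height)


# Hypothetical image path inside one of the "a.a", "a.b", ... directories.
layout_file = layout_path_for("a.a/example_page.png")

# Stand-in for the contents of a layout json; only "rotation" comes from the description.
layout = json.loads('{"rotation": 90}')
size = rotated_size(1000, 1400, layout["rotation"])
print(layout_file, size)
```

Box coordinates in the json would then be interpreted against the rotated page size rather than the original image size, if the assumption about the "rotation" key holds.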

    These are the annotations used to pre-train Dessurt (https://arxiv.org/abs/2203.16618).

    These annotations will be worse than those that would be obtained using a commercial OCR system (like those used to pre-train LayoutLMv2/v3).

    The code used to produce these annotations is available here: https://github.com/herobd/ocr
