7 datasets found

PubLayNet Dataset
paperswithcode.com
opendatalab.com
Cite
Xu Zhong; Jianbin Tang; Antonio Jimeno Yepes, PubLayNet Dataset [Dataset]. https://paperswithcode.com/dataset/publaynet
Authors
Xu Zhong; Jianbin Tang; Antonio Jimeno Yepes
Description
PubLayNet is a dataset for document layout analysis, built by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets: it contains over 360 thousand document images in which typical document layout elements are annotated.
publaynet
huggingface.co
Updated Jul 31, 2024
Cite
eddie (2024). publaynet [Dataset]. https://huggingface.co/datasets/psyche/publaynet
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 31, 2024
Authors
eddie
Description
psyche/publaynet dataset hosted on Hugging Face and contributed by the HF Datasets community
Zhong, X., Tang, J., Yepes, A.J. (2024). Dataset: PublayNet: largest dataset...
service.tib.eu
Updated Dec 16, 2024
+ more versions
Cite
Zhong, X., Tang, J., Yepes, A.J. (2024). PublayNet: largest dataset ever for document layout analysis [Dataset]. https://doi.org/10.57702/f4kresfh. https://service.tib.eu/ldmservice/dataset/publaynet--largest-dataset-ever-for-document-layout-analysis
Dataset updated
Dec 16, 2024
Description
The PublayNet dataset is the largest dataset ever for the document layout analysis task.
publaynet
huggingface.co
Updated Aug 1, 1999
+ more versions
Cite
Ma Wenkang (1999). publaynet [Dataset]. https://huggingface.co/datasets/Mwk19990801/publaynet
Dataset updated
Aug 1, 1999
Authors
Ma Wenkang
Description
Mwk19990801/publaynet dataset hosted on Hugging Face and contributed by the HF Datasets community
DocLayNet-base
huggingface.co
opendatalab.com
Updated Apr 27, 2023
+ more versions
Cite
Pierre Guillou (2023). DocLayNet-base [Dataset]. https://huggingface.co/datasets/pierreguillou/DocLayNet-base
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 27, 2023
Authors
Pierre Guillou
License
https://choosealicense.com/licenses/other/
Description
Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied to more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
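Since the description says the annotations are in COCO format with labelled bounding boxes, a minimal sketch of reading such a file might look like the following. It assumes only the standard COCO keys ("images", "annotations", "categories", with "bbox" given as [x, y, width, height]); the toy record, its file name, and the two class names are illustrative placeholders, not values taken from the dataset.

```python
from collections import defaultdict


def boxes_per_image(coco):
    """Group COCO annotations by image file name.

    Returns {file_name: [(class_name, [x, y, width, height]), ...]}.
    """
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        file_name = id_to_file[ann["image_id"]]
        grouped[file_name].append((id_to_name[ann["category_id"]], ann["bbox"]))
    return dict(grouped)


# Hypothetical toy record in COCO layout (not real dataset content).
toy = {
    "images": [{"id": 1, "file_name": "page_0.png", "width": 1025, "height": 1025}],
    "categories": [{"id": 1, "name": "Text"}, {"id": 2, "name": "Table"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 300, 40]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [10, 80, 300, 200]},
    ],
}

if __name__ == "__main__":
    for file_name, boxes in boxes_per_image(toy).items():
        print(file_name, boxes)
```

In practice the dictionary would come from json.load on an annotation file; the grouping step stays the same.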
DocLayNet-v1.1
huggingface.co
Updated Mar 29, 2024
Cite
Docling (2024). DocLayNet-v1.1 [Dataset]. https://huggingface.co/datasets/ds4sd/DocLayNet-v1.1
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Docling
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for DocLayNet v1.1

Dataset Summary

DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

Human Annotation: DocLayNet is hand-annotated by well-trained experts, providing a gold-standard in layout segmentation through human recognition and interpretation of… See the full description on the dataset page: https://huggingface.co/datasets/ds4sd/DocLayNet-v1.1.
Tesseract OCR of IIT-CDIP Dataset
zenodo.org
data.niaid.nih.gov
application/gzip
Updated May 13, 2022
Cite
Brian Davis (2022). Tesseract OCR of IIT-CDIP Dataset [Dataset]. http://doi.org/10.5281/zenodo.6540454
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.6540454
Dataset updated
May 13, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Brian Davis
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These are Tesseract-generated transcriptions (no images) of most of the IIT-CDIP dataset. To download the images of the IIT-CDIP dataset, go to https://data.nist.gov/od/id/mds2-2531

The directory structure of this dataset is the same as that of the IIT-CDIP dataset (although everything is in one tar, with "a.a", "a.b", ... directories), so it can be combined with the IIT-CDIP image dataset using rsync or a similar tool. This dataset contains an "X.layout.json" for each "X.png" in the IIT-CDIP dataset (it does not include sections 'a', 'w', 'x', 'y', and 'z').
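As a rough illustration of the "X.layout.json" for each "X.png" pairing, the sketch below maps an image path to the annotation path expected at the same relative location. The root directory names and the nesting below "a.b" are hypothetical; only the ".layout.json" suffix and the mirrored directory structure come from the description.

```python
from pathlib import Path


def layout_json_for(image_path, images_root, ocr_root):
    """Return the path where the layout JSON for an image would be expected.

    Assumes both trees share the same relative directory structure and that
    "X.png" corresponds to "X.layout.json".
    """
    rel = Path(image_path).relative_to(images_root)
    return Path(ocr_root) / rel.with_name(rel.stem + ".layout.json")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(layout_json_for("images/a.b/doc/X.png", "images", "ocr"))
```

If the two trees are merged into one directory, as the rsync suggestion implies, the same function works with images_root and ocr_root set to the same path.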

The JSON files contain block/paragraph, line, and word bounding boxes, with transcriptions for the words, following the Tesseract format. The line and word annotations are taken directly from Tesseract. The block and paragraph output of Tesseract was discarded. The images were then run through both the PubLayNet and PrimaNet models available on LayoutParser (https://layout-parser.github.io/). The combined output of these models became the block/paragraph annotations (we kept the Tesseract output format, but each block has 1 paragraph of exactly the same shape).

Important: There is also a "rotation" value in the JSON (0, 90, 180, or 270), indicating that the JSON may describe a version of the IIT-CDIP image rotated by the given amount (we attempted to rotate documents to an upright position to get better OCR results).
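One consequence of the rotation value is that the page dimensions the boxes refer to may be swapped relative to the original image. The helper below is a small sketch of that check; the function name is made up for illustration, and it deliberately does not assume a rotation direction, which the description does not specify.

```python
def rotated_size(width, height, rotation):
    """Return (width, height) of the page after rotating by `rotation` degrees.

    A quarter turn (90 or 270) swaps width and height; 0 and 180 leave them
    unchanged. The direction of rotation does not affect the resulting size.
    """
    if rotation not in (0, 90, 180, 270):
        raise ValueError(f"unexpected rotation value: {rotation!r}")
    if rotation in (90, 270):
        return (height, width)
    return (width, height)


if __name__ == "__main__":
    # Hypothetical image size, for illustration only.
    print(rotated_size(1000, 800, 90))
```

Comparing this size against the largest box coordinates in a file is one way to sanity-check that the annotations and the image are in the same orientation.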

These are the annotations used to pre-train Dessurt (https://arxiv.org/abs/2203.16618).

These annotations will be worse than those that would be obtained using a commercial OCR system (like those used to pre-train LayoutLMv2/v3).

The code used to produce these annotations is available here: https://github.com/herobd/ocr