https://choosealicense.com/licenses/other/
Dataset Card for PDF Association dataset (PDFA)
Dataset Summary
The PDFA dataset is a document dataset filtered from the SafeDocs corpus, also known as CC-MAIN-2021-31-PDF-UNTRUNCATED. That corpus was originally built for comprehensive analysis of PDF documents. This subset has a different purpose: it focuses on making the data machine-learning-ready for vision-language models.
An example page of one pdf document, with added bounding boxes… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/pdfa-eng-wds.
https://choosealicense.com/licenses/other/
Dataset Card for Industry Documents Library (IDL)
Dataset Summary
Industry Documents Library (IDL) is a document dataset filtered from the UCSF documents library, with 19 million pages kept as valid samples. Each document exists as a collection of a pdf, a tiff image with the same contents rendered, a json file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for DocVQA Dataset
Dataset Summary
The DocVQA dataset is a document dataset introduced in Mathew et al. (2021), consisting of 50,000 questions defined on 12,000+ document images. Please visit the challenge page (https://rrc.cvc.uab.es/?ch=17) and paper (https://arxiv.org/abs/2007.00398) for further information.
Usage
This dataset can be used with current releases of the Hugging Face datasets library. Here is an example using a custom collator to bundle… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/docvqa-single-page-questions.
https://choosealicense.com/licenses/other/
Dataset Card for Conceptual Captions (CC3M)
Dataset Summary
Conceptual Captions is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/cc3m-wds.
https://choosealicense.com/licenses/other/
Dataset Card for Conceptual Captions 12M (CC12M)
Dataset Summary
Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).
Usage
This instance of Conceptual Captions is in webdataset .tar format. It can be used with the webdataset library or upcoming releases of Hugging Face datasets.… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/cc12m-wds.
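As a rough illustration of what "webdataset .tar format" means, the sketch below builds a tiny shard in memory and reads it back using only the Python standard library. It relies on the general webdataset convention that files sharing a basename belong to one sample. The member names and the txt/jpg extensions are hypothetical; the actual layout of the cc12m-wds shards is not given in this summary and may differ.

```python
import io
import tarfile
from collections import defaultdict


def make_demo_shard() -> bytes:
    """Build a tiny in-memory .tar laid out like a webdataset shard.

    The member names and extensions are made up for illustration.
    """
    members = {
        "000000.txt": b"a caption",
        "000000.jpg": b"<jpeg bytes>",
        "000001.txt": b"another caption",
        "000001.jpg": b"<jpeg bytes>",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_samples(shard: bytes) -> dict:
    """Group tar members into samples keyed by the name before the first dot.

    Simplification: assumes flat member names with no directories.
    """
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


samples = read_samples(make_demo_shard())
print(sorted(samples))
print(samples["000000"]["txt"].decode())
```

In practice the webdataset library or Hugging Face datasets would handle this grouping, along with streaming and decoding; the sketch only shows the underlying file layout.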