5 datasets found

pdfa-eng-wds
huggingface.co
Updated Mar 30, 2024
Cite
Pixel Parsing (2024). pdfa-eng-wds [Dataset]. https://huggingface.co/datasets/pixparse/pdfa-eng-wds
Dataset updated
Mar 30, 2024
Dataset authored and provided by
Pixel Parsing
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for PDF Association dataset (PDFA)

Dataset Summary

The PDFA dataset is a document dataset filtered from the SafeDocs corpus, aka CC-MAIN-2021-31-PDF-UNTRUNCATED. The original purpose of that corpus is comprehensive analysis of PDF documents. This subset has a different purpose: the focus is on making the dataset machine learning-ready for vision-language models.

An example page of one pdf document, with added bounding boxes… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/pdfa-eng-wds.
idl-wds
huggingface.co
Updated Mar 30, 2024
Cite
Pixel Parsing (2024). idl-wds [Dataset]. https://huggingface.co/datasets/pixparse/idl-wds
Dataset updated
Mar 30, 2024
Dataset authored and provided by
Pixel Parsing
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for Industry Documents Library (IDL)

Dataset Summary

Industry Documents Library (IDL) is a document dataset filtered from the UCSF documents library, with 19 million pages kept as valid samples. Each document exists as a collection of a pdf, a tiff image with the same contents rendered, a json file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.
docvqa-single-page-questions
huggingface.co
Updated Mar 29, 2024
Cite
Pixel Parsing (2024). docvqa-single-page-questions [Dataset]. https://huggingface.co/datasets/pixparse/docvqa-single-page-questions
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Pixel Parsing
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Card for DocVQA Dataset

Dataset Summary

DocVQA dataset is a document dataset introduced in Mathew et al. (2021) consisting of 50,000 questions defined on 12,000+ document images. Please visit the challenge page (https://rrc.cvc.uab.es/?ch=17) and paper (https://arxiv.org/abs/2007.00398) for further information.

Usage

This dataset can be used with current releases of the Hugging Face datasets library. Here is an example using a custom collator to bundle… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/docvqa-single-page-questions.
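The card's own collator example is cut off in this listing, so it is not reproduced here. As a rough illustration of what a collator does in general, the sketch below bundles a list of per-sample dicts into one dict of lists. The function name `collate_samples` and the field names `image`, `question`, and `answers` are hypothetical and are not taken from the dataset card.

```python
from typing import Any


def collate_samples(batch: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Bundle a list of per-sample dicts into a single dict of lists."""
    if not batch:
        return {}
    # Use the first sample's keys; every sample is assumed to share them.
    keys = batch[0].keys()
    return {key: [sample[key] for sample in batch] for key in keys}


# Hypothetical samples; the real dataset's field names may differ.
batch = [
    {"image": "page_0.png", "question": "What is the date?", "answers": ["1999"]},
    {"image": "page_1.png", "question": "Who signed?", "answers": ["J. Doe", "John Doe"]},
]
bundled = collate_samples(batch)
```

A function of this shape could be passed as `collate_fn` to a PyTorch `DataLoader`, which is one common way to keep variable-length fields such as answer lists from being stacked into tensors.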
cc3m-wds
huggingface.co
Updated Dec 13, 2023
Cite
Pixel Parsing (2023). cc3m-wds [Dataset]. https://huggingface.co/datasets/pixparse/cc3m-wds
Dataset updated
Dec 13, 2023
Dataset authored and provided by
Pixel Parsing
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for Conceptual Captions (CC3M)

Dataset Summary

Conceptual Captions is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/cc3m-wds.
cc12m-wds
huggingface.co
Updated Jan 6, 2024
Cite
Pixel Parsing (2024). cc12m-wds [Dataset]. https://huggingface.co/datasets/pixparse/cc12m-wds
Dataset updated
Jan 6, 2024
Dataset authored and provided by
Pixel Parsing
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for Conceptual Captions 12M (CC12M)

Dataset Summary

Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).

Usage

This instance of Conceptual Captions is in webdataset .tar format. It can be used with the webdataset library or upcoming releases of Hugging Face datasets.… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/cc12m-wds.
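To make the webdataset .tar layout concrete, here is a minimal sketch that uses only Python's standard library rather than the webdataset library itself. It relies on the general webdataset convention that tar members sharing a key (the name up to the first dot) form one sample, with the extension naming the field. The helper names `write_shard` and `read_shard`, the shard file name, and the `jpg`/`txt` fields are assumptions for illustration and are not taken from the dataset card.

```python
import io
import os
import tarfile
import tempfile
from collections import defaultdict


def write_shard(path, samples):
    """Write samples as consecutive tar members named '<key>.<ext>'."""
    with tarfile.open(path, "w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Group tar members into samples by key (name up to the first dot)."""
    grouped = defaultdict(dict)
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Simplification: assumes member names carry no directory part.
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


# Hypothetical image-caption samples with placeholder bytes.
samples = [
    ("000000", {"jpg": b"fake-image-bytes-0", "txt": b"a dog on a beach"}),
    ("000001", {"jpg": b"fake-image-bytes-1", "txt": b"a red bicycle"}),
]
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard-00000.tar")
    write_shard(shard, samples)
    loaded = read_shard(shard)
```

Because samples are stored as plain sequential tar members, a shard can be streamed from start to finish without an index, which is the main reason the format suits large image-text corpora.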
pdfa-eng-wds

5 scholarly articles cite this dataset (View in Google Scholar)
