3 datasets found

invoices-donut-data-v1-with-ocr
huggingface.co
Updated Mar 23, 2019
Cite
Marco Pansa (2019). invoices-donut-data-v1-with-ocr [Dataset]. https://huggingface.co/datasets/MJPansa/invoices-donut-data-v1-with-ocr
Dataset updated
Mar 23, 2019
Authors
Marco Pansa
Description
The bbox column is [x, y, width, height].
ymean is the y position of the mean of the box.
line is the line number calculated using ymean.
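The relationship between bbox, ymean, and line can be illustrated with a minimal sketch. The dataset card does not give the exact formulas, so everything here is an assumption: ymean is taken to be the vertical centre of the box (y + height / 2), and line numbers are assigned by grouping boxes whose ymean values lie within a hypothetical pixel tolerance. The function names y_mean and assign_lines and the tolerance parameter are illustrative and not part of the dataset.

```python
from typing import List


def y_mean(bbox: List[float]) -> float:
    """Vertical centre of a box given as [x, y, width, height] (assumed definition)."""
    _, y, _, height = bbox
    return y + height / 2.0


def assign_lines(bboxes: List[List[float]], tolerance: float = 10.0) -> List[int]:
    """Assign a 0-based line number to each box by grouping on ymean.

    Boxes are visited from top to bottom. A new line starts whenever a box's
    ymean differs from the running mean of the current line by more than
    `tolerance` (a hypothetical threshold, in the same units as the bbox).
    """
    order = sorted(range(len(bboxes)), key=lambda i: y_mean(bboxes[i]))
    lines = [0] * len(bboxes)
    current_line = -1
    line_sum = 0.0
    line_count = 0
    for i in order:
        ym = y_mean(bboxes[i])
        if line_count == 0 or abs(ym - line_sum / line_count) > tolerance:
            # Start a new line and reset the running mean.
            current_line += 1
            line_sum, line_count = 0.0, 0
        line_sum += ym
        line_count += 1
        lines[i] = current_line
    return lines


boxes = [
    [10, 100, 50, 20],  # ymean 110
    [70, 102, 40, 20],  # ymean 112
    [10, 140, 60, 20],  # ymean 150
]
print([y_mean(b) for b in boxes])  # [110.0, 112.0, 150.0]
print(assign_lines(boxes))         # [0, 0, 1]
```

The tolerance would need tuning to the font size and image resolution of the actual invoices; the dataset's own line values may have been computed differently.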
synthdog-ko
huggingface.co
Updated Dec 13, 2024
Cite
NAVER CLOVA INFORMATION EXTRACTION (2024). synthdog-ko [Dataset]. https://huggingface.co/datasets/naver-clova-ix/synthdog-ko
Dataset updated
Dec 13, 2024
Dataset provided by
Naver Corporation (http://www.navercorp.com/)
Authors
NAVER CLOVA INFORMATION EXTRACTION
Description
Donut 🍩 : OCR-Free Document Understanding Transformer (ECCV 2022) -- SynthDoG datasets

For more information, please visit https://github.com/clovaai/donut

The links to the SynthDoG-generated datasets are here:

synthdog-en: English, 0.5M
synthdog-zh: Chinese, 0.5M
synthdog-ja: Japanese, 0.5M
synthdog-ko: Korean, 0.5M

To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.

How to Cite

If you find this work useful… See the full description on the dataset page: https://huggingface.co/datasets/naver-clova-ix/synthdog-ko.
donut_vqa
huggingface.co
Updated Jul 20, 2025
Cite
Jina AI (2025). donut_vqa [Dataset]. https://huggingface.co/datasets/jinaai/donut_vqa
Dataset updated
Jul 20, 2025
Dataset authored and provided by
Jina AI
Description
DonutVQA Dataset

This dataset is derived from the donut-vqa dataset, reformatting the test split with modified field names, so that it can be used in the ViDoRe benchmark. The text_description column contains OCR text extracted from the images using EasyOCR.

Disclaimer

This dataset may contain publicly available images or text data. All data is provided for research and educational purposes only. If you are the rights holder of any content and have concerns regarding… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/donut_vqa.