OmniDocBench
OmniDocBench is an evaluation dataset for diverse document parsing in real-world scenarios, with the following characteristics:
Diverse Document Types: The evaluation set contains 1355 PDF pages, covering 9 document types, 4 layout types and 3 language types. It has broad coverage including academic papers, financial reports, newspapers, textbooks, handwritten notes, etc.
Rich Annotations: Contains location information for 15 block-level (text paragraphs… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/OmniDocBench.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Forked from opendatalab/OmniDocBench.
Sampler
We have added a simple Python tool for filtering and performing stratified sampling on OmniDocBench data.
Features
Filter JSON entries based on custom criteria
Perform stratified sampling based on multiple categories
Handle nested JSON fields
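The three features above can be illustrated with a minimal sketch. This is not the fork's actual sampler: the function names (`get_nested`, `filter_entries`, `stratified_sample`), the dotted-path convention, and the example field names are all assumptions made for illustration, and the real tool's interface may differ.

```python
import random
from collections import defaultdict


def get_nested(entry, path, default=None):
    """Follow a dotted path (hypothetical convention) through nested dicts."""
    current = entry
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def filter_entries(entries, criteria):
    """Keep entries whose nested fields equal every value in `criteria`."""
    return [
        entry
        for entry in entries
        if all(get_nested(entry, path) == value for path, value in criteria.items())
    ]


def stratified_sample(entries, paths, fraction, seed=0):
    """Sample roughly `fraction` of each stratum, at least one entry per stratum.

    A stratum is the combination of values found at all `paths`.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for entry in entries:
        strata[tuple(get_nested(entry, path) for path in paths)].append(entry)

    sample = []
    # Sort strata by repr so the result is reproducible for a given seed.
    for key in sorted(strata, key=repr):
        group = strata[key]
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))
    return sample


# Illustrative entries; the field names here are assumed, not taken from the dataset.
entries = [
    {"id": i, "page_info": {"language": lang, "layout": layout}}
    for i, (lang, layout) in enumerate(
        (lang, layout)
        for lang in ("english", "chinese")
        for layout in ("single_column", "double_column")
        for _ in range(10)
    )
]

english_only = filter_entries(entries, {"page_info.language": "english"})
subset = stratified_sample(
    entries, ["page_info.language", "page_info.layout"], fraction=0.2
)
```

Grouping on the tuple of all category values keeps every combination represented in the sample, which is the usual reason to stratify on several categories at once rather than on each one separately.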
Installation
Local Development Install (Recommended)
git clone https://huggingface.co/Quivr/OmniDocBench.git
cd OmniDocBench
pip install -r requirements.txt
#… See the full description on the dataset page: https://huggingface.co/datasets/Quivr/OmniDocBench.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ZH_EN_RecBench
zh_en_rec_bench is a benchmark dataset designed to evaluate the robustness and generalization capabilities of text recognition models across multiple scenarios and both Chinese and English scripts. It is constructed by sampling and manually correcting subsets of data from OmniDocBench and TC-STR, with erroneous ground truth labels revised to ensure high-quality evaluation.
Dataset Overview
This benchmark includes four distinct text recognition scenarios:… See the full description on the dataset page: https://huggingface.co/datasets/puhuilab/zh_en_rec_bench.