3 datasets found
  1.

    OmniDocBench

    • huggingface.co
    Updated Dec 11, 2024
    Cite
    OpenDataLab (2024). OmniDocBench [Dataset]. https://huggingface.co/datasets/opendatalab/OmniDocBench
    Dataset updated
    Dec 11, 2024
    Dataset authored and provided by
    OpenDataLab
    Description

    OmniDocBench

    OmniDocBench is an evaluation dataset for diverse document parsing in real-world scenarios, with the following characteristics:

    Diverse Document Types: The evaluation set contains 1355 PDF pages, covering 9 document types, 4 layout types and 3 language types. It has broad coverage including academic papers, financial reports, newspapers, textbooks, handwritten notes, etc.

    Rich Annotations: Contains location information for 15 block-level (text paragraphs… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/OmniDocBench.

  2.

    OmniDocBench

    • huggingface.co
    Updated Jan 31, 2025
    Cite
    Quivr (2025). OmniDocBench [Dataset]. https://huggingface.co/datasets/Quivr/OmniDocBench
    Dataset updated
    Jan 31, 2025
    Dataset authored and provided by
    Quivr
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Forked from opendatalab/OmniDocBench.

    Sampler

    We have added a simple Python tool for filtering and performing stratified sampling on OmniDocBench data.

    Features

    • Filter JSON entries based on custom criteria
    • Perform stratified sampling based on multiple categories
    • Handle nested JSON fields
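
The three features listed above can be illustrated with a minimal sketch. This is not the Quivr tool itself, whose code is not shown here; the function names (`get_nested`, `filter_entries`, `stratified_sample`) and the field paths (`page_info.language`, `page_info.data_source`) are hypothetical and chosen only for illustration. It assumes each entry is a nested dictionary and that `fraction` lies in (0, 1].

```python
import random
from collections import defaultdict


def get_nested(entry, dotted_key, default=None):
    """Follow a dotted path such as 'page_info.language' through nested dicts."""
    current = entry
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def filter_entries(entries, criteria):
    """Keep entries whose nested fields equal every value in `criteria`."""
    return [
        e for e in entries
        if all(get_nested(e, key) == value for key, value in criteria.items())
    ]


def stratified_sample(entries, keys, fraction, seed=0):
    """Sample about `fraction` of each stratum (at least one entry per stratum).

    A stratum is one combination of values for the nested fields in `keys`.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for e in entries:
        strata[tuple(get_nested(e, k) for k in keys)].append(e)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample


# Hypothetical entries: 6 English academic pages and 4 Chinese newspaper pages.
entries = (
    [{"id": i, "page_info": {"language": "english", "data_source": "academic"}}
     for i in range(6)]
    + [{"id": 6 + i, "page_info": {"language": "chinese", "data_source": "newspaper"}}
       for i in range(4)]
)

english_only = filter_entries(entries, {"page_info.language": "english"})
sample = stratified_sample(
    entries, ["page_info.language", "page_info.data_source"], fraction=0.5
)
```

Sampling per stratum rather than over the whole list keeps the proportions of each category combination roughly intact in the subset, which is the usual reason to prefer stratified sampling for a benchmark with several document types.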

    Installation

    Local Development Install (Recommended)

    git clone https://huggingface.co/Quivr/OmniDocBench.git
    cd OmniDocBench
    pip install -r requirements.txt #…

    See the full description on the dataset page: https://huggingface.co/datasets/Quivr/OmniDocBench.

  3.

    zh_en_rec_bench

    • huggingface.co
    Updated Jun 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    puhuilab (2025). zh_en_rec_bench [Dataset]. https://huggingface.co/datasets/puhuilab/zh_en_rec_bench
    Dataset updated
    Jun 26, 2025
    Authors
    puhuilab
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    ZH_EN_RecBench

    zh_en_rec_bench is a benchmark dataset designed to evaluate the robustness and generalization capabilities of text recognition models across multiple scenarios and both Chinese and English scripts. It is constructed by sampling and manually correcting subsets of data from OmniDocBench and TC-STR, with erroneous ground truth labels revised to ensure high-quality evaluation.

    Dataset Overview

    This benchmark includes four distinct text recognition scenarios:… See the full description on the dataset page: https://huggingface.co/datasets/puhuilab/zh_en_rec_bench.


opendatalab/OmniDocBench: 20 scholarly articles cite this dataset, according to Google Scholar.