56 datasets found
  1. ocr-benchmark

    • huggingface.co
    Updated Feb 19, 2025
    Cite
    OmniAI (2025). ocr-benchmark [Dataset]. https://huggingface.co/datasets/getomni-ai/ocr-benchmark
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    OmniAI Technology, Inc.
    Authors
    OmniAI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    OmniAI OCR Benchmark

    A comprehensive benchmark that compares OCR and data extraction capabilities of different multimodal LLMs such as gpt-4o and gemini-2.0, evaluating both text and JSON extraction accuracy. Benchmark Results (Feb 2025) | Source Code

  2. OCR-benchmark

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    black (2025). OCR-benchmark [Dataset]. https://huggingface.co/datasets/blackcrow228/OCR-benchmark
    Explore at:
    Dataset updated
    Jul 27, 2025
    Authors
    black
    Description

    The blackcrow228/OCR-benchmark dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  3. Noisy OCR Dataset (NOD)

    • zenodo.org
    Updated Jul 6, 2021
    Cite
    Thomas Hegghammer (2021). Noisy OCR Dataset (NOD) [Dataset]. http://doi.org/10.5281/zenodo.5068735
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 6, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Thomas Hegghammer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).

    Source images

    The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.

    Artificial noise application

    The dataset was created as follows:
    - First, a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
    - Then six ideal types of image noise ("blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains") were applied to both the colour and the greyscale versions, creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
    - Lastly, all 15 pairwise combinations of the six noise filters were applied to the colour and greyscale images, for an additional 30 versions.

    This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.
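    The version counts above can be sanity-checked in a few lines of Python (a sketch; the filter names are taken from the description, and the two-filter versions are assumed to be unordered pairs of distinct filters):

```python
from itertools import combinations

filters = ["blur", "weak ink", "salt and pepper",
           "watermark", "scribbles", "ink stains"]

base = 2                                # colour and greyscale, no added noise
one_layer = base * len(filters)         # 6 filters x 2 base versions = 12
pairs = list(combinations(filters, 2))  # C(6, 2) = 15 unordered filter pairs
two_layer = base * len(pairs)           # 15 pairs x 2 base versions = 30
total = base + one_layer + two_layer    # 44 versions per source image

print(total)        # 44
print(322 * total)  # English corpus: 14168 documents
print(100 * total)  # Arabic corpus: 4400 documents
```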

    The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See this link for how to unzip .tar.lzma files.

    References:

    Barcha, Pedro. 2017. "Old Books Dataset." GitHub repository. https://github.com/PedroBarcha/old-books-dataset.

    Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. "Yarmouk Arabic OCR Dataset." In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.

    Hegghammer, Thomas. 2021. "OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment." SocArXiv. https://osf.io/preprints/socarxiv/6zfvs

  4. sensor-ocr-benchmark

    • huggingface.co
    Updated Jun 22, 2024
    Cite
    seafog winters (2024). sensor-ocr-benchmark [Dataset]. https://huggingface.co/datasets/famousdetectiveadrianmonk/sensor-ocr-benchmark
    Explore at:
    Croissant
    Dataset updated
    Jun 22, 2024
    Authors
    seafog winters
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    My Custom Dataset

      Description
    

    The original dataset was modified to insert fake sensor information at the bottom of each image.

      Usage
    

    from datasets import load_dataset

    dataset = load_dataset("famousdetectiveadrianmonk/sensor-ocr-benchmark")
    example = dataset['train'][0]
    img = example['pixel_values']
    sensor_zoomin = img.crop((600, 850, 1250, 1050))

      Attribution
    

    This dataset is based on the original dataset provided by Segments.ai. The original dataset can... See the full description on the dataset page: https://huggingface.co/datasets/famousdetectiveadrianmonk/sensor-ocr-benchmark.

  5. Synthetic Printed Words and Test Protocols Data for Bangla OCR

    • figshare.com
    Updated Jun 13, 2023
    Cite
    Koushik Roy; MD Sazzad Hossain; Pritom Saha; Shadman Rohan; Fuad Rahman; Imranul Ashrafi; Ifty Mohammad Rezwan; B M Mainul Hossain; Ahmedul Kabir; Nabeel Mohammed (2023). Synthetic Printed Words and Test Protocols Data for Bangla OCR [Dataset]. http://doi.org/10.6084/m9.figshare.20186825.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Koushik Roy; MD Sazzad Hossain; Pritom Saha; Shadman Rohan; Fuad Rahman; Imranul Ashrafi; Ifty Mohammad Rezwan; B M Mainul Hossain; Ahmedul Kabir; Nabeel Mohammed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic printed word images and test-protocol word images: the data repository for the paper "A Multifaceted Evaluation of Representation of Graphemes for Practically Effective Bangla OCR." In this paper, we utilized the popular Convolutional Recurrent Neural Network (CRNN) architecture and implemented our grapheme representation strategies to design the final labels of the model. Due to the absence of a large-scale Bangla word-level printed dataset, we created a synthetically generated Bangla corpus containing 2 million samples that are representative and sufficiently varied in terms of fonts, domain, and vocabulary size to train our Bangla OCR model. To test the various aspects of our model, we also created 6 test protocols. Finally, to establish the generalizability of our grapheme representation methods, we performed training and testing on external handwriting datasets.

    Updates: 10 June 2023: The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR).

  6. Benchmark for the evaluation of named entity recognition over ancient...

    • live.european-language-grid.eu
    • zenodo.org
    • +1more
    Updated Aug 20, 2023
    + more versions
    Cite
    (2023). Benchmark for the evaluation of named entity recognition over ancient documents [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7802
    Explore at:
    Available download formats: png
    Dataset updated
    Aug 20, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of multilingual noisy corpora for named entity recognition (NER). The noisy versions are simulated from the CoNLL-02 (Spanish and Dutch) and CoNLL-03 (English) NER corpora. The original collections are re-OCRed, and four types of noise at two different levels are added in order to simulate various OCR outputs. More precisely, we first extracted raw texts and converted them into images. These images were contaminated by adding some common noises that occur when using a scanner. We then extracted OCRed data using the Tesseract open-source OCR engine v3.04.01. As a consequence of the image noise insertions, the OCRed data contains degradations. Original and noisy texts are finally aligned.

    This archive contains three folders (one per language). The folders contain the degraded images, the noisy texts extracted by the OCR, and their aligned version with clean data.

  7. TibOCR-Bench: A Comprehensive Benchmark and Training Pipeline for Tibetan...

    • scidb.cn
    Updated Aug 11, 2025
    Cite
    Kuntharrgyal Khysru; LAMA Jie (2025). TibOCR-Bench: A Comprehensive Benchmark and Training Pipeline for Tibetan Multimodal OCR [Dataset]. http://doi.org/10.57760/sciencedb.28968
    Explore at:
    Croissant
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Kuntharrgyal Khysru; LAMA Jie
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    To effectively support the training and evaluation of Tibetan OCR models in practical application scenarios involving multiple fonts and complex text structures, we have constructed a multi-source, high-quality Tibetan text image dataset. The data construction follows two complementary strategies: forward construction and reverse construction. (1) Forward construction: first, collect Tibetan-language images from real scenes, then manually annotate the corresponding text content. This method ensures the authenticity and practical relevance of the data, effectively covering the diverse language-usage scenarios and inherent complexity of Tibetan OCR tasks. (2) Reverse construction: first, select text content suitable for OCR tasks (such as advertising copy, slogans, or standard documents), then choose appropriate background images and use multiple fonts and visual effects to synthesize text images. This method efficiently increases the structural diversity and scale of the dataset. The two strategies complement each other and together form a comprehensive resource for training and evaluating Tibetan OCR models.

  8. Benchmark for the evaluation of Named Entity Linking over ancient documents

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    Updated Jan 24, 2020
    Cite
    Elvys Linhares Pontes; Ahmed Hamdi; Nicolas Sidere; Antoine Doucet (2020). Benchmark for the evaluation of Named Entity Linking over ancient documents [Dataset]. http://doi.org/10.5281/zenodo.3490333
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Elvys Linhares Pontes; Ahmed Hamdi; Nicolas Sidere; Antoine Doucet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Benchmark for the evaluation of Named Entity Linking over ancient documents
    Elvys Linhares Pontes, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet
    University of Avignon: elvys.linhares-pontes@univ-avignon.fr; University of La Rochelle: {elvys.linhares_pontes,ahmed.hamdi,nicolas.sidere,antoine.doucet}@univ-lr.fr

    These are the supplementary materials for the ICADL 2019 paper Impact of OCR Quality on Named Entity Linking. If you end up using whole or parts of this resource, please use the following citation:

    • Linhares Pontes, E., Hamdi, A., Sidere, N., and Doucet, A. (2019). Impact of OCR Quality on Named Entity Linking. In Proceedings of 21st International Conference on Asia-Pacific Digital Libraries ICADL 2019, Kuala Lumpur, Malaysia.

    or alternatively use the following `bib`:

    @inproceedings{linhares2019icadl,
      title={Impact of OCR Quality on Named Entity Linking},
      author={Linhares Pontes, Elvys and Hamdi, Ahmed and Sidere, Nicolas and Doucet, Antoine},
      year={2019},
      booktitle={Proceedings of the 21st International Conference on Asia-Pacific Digital Libraries (ICADL 2019)}
    }

    Files
    This archive contains six folders -- one per dataset -- as well as this README. The folders contain the degraded images, the noisy texts extracted by the OCR and their aligned version with clean data. This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).

    Acknowledgments
    This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 [NewsEye](https://www.newseye.eu/).

  9. Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts...

    • data.niaid.nih.gov
    Updated Mar 22, 2025
    Cite
    SyedKhaleel Jageer (2025). Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts with Corresponding Ground Truth Annotations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15009379
    Explore at:
    Dataset updated
    Mar 22, 2025
    Dataset authored and provided by
    SyedKhaleel Jageer
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides a benchmark for Tamil Optical Character Recognition (OCR), covering both handwritten (Hangual) and printed Tamil text. It includes high-quality ground truth (GT) text files paired with corresponding TIFF images, making it valuable for training and evaluating OCR models, particularly for Tesseract, deep learning-based recognition, and AI research.

    Dataset Highlights

    Total Size: 15GB (Sample from the full 60GB dataset)

    Total Pairs: Approximately 1,903,284 text-image pairs

    Handwritten Fonts (9 Unicode Fonts):

    Aazhi, Gnani, Hemalatha, Indumathi, Kalayarasi, Siva_01, Siva_02, Sudeeptha, Yogeshwaran

    Printed Fonts (9 Unicode Fonts):

    AnekTamil, Arima, KarlaTamilInclined, TAU-Barathi, TAU-Kambar, TAU-Marutham, TAU-Mullai, TAU-Neythal, TAU-Valluvar

    Data Source:

    The text corpus (GT text files) is curated from Wikipedia and Wikisource, ensuring linguistic diversity.

    The fonts are publicly available Unicode Tamil fonts, sourced from Google Fonts and Tamil Virtual University.

    File Structure

    Tamil_OCR_Dataset/
    ├── Hangual_Fonts/
    │   ├── Aazhi/
    │   │   ├── gt/
    │   │   │   ├── 00001.gt.txt
    │   │   │   ├── 00002.gt.txt
    │   │   ├── images/
    │   │   │   ├── 00001.tiff
    │   │   │   ├── 00002.tiff
    │   ├── Gnani/
    │   ├── ...
    ├── Printed_Fonts/
    │   ├── AnekTamil/
    │   ├── HindMadurai/
    │   ├── ...

    Cite this work

    @dataset{tamilocr_dataset_2025,
      author    = {Syedkhaleel Jageer},
      title     = {Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts with Corresponding Ground Truth Annotations},
      year      = {2025},
      publisher = {Zenodo},
      doi       = {10.5281/zenodo.15009380},
      url       = {https://doi.org/10.5281/zenodo.15009380}
    }

  10. IIIT5K-Words

    • kaggle.com
    Updated May 12, 2023
    Cite
    Prathamesh Zade (2023). IIIT5K-Words [Dataset]. http://doi.org/10.34740/kaggle/dsv/5671242
    Explore at:
    Croissant
    Dataset updated
    May 12, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Prathamesh Zade
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    IIIT5K-Words

    The IIIT5K Words Dataset is a comprehensive collection of labeled word images, curated by the International Institute of Information Technology, Hyderabad (IIIT-H). It is designed to facilitate research and development in optical character recognition (OCR), word recognition, and related fields.

    The dataset contains a diverse set of 5,000 word images, covering various fonts, styles, and sizes. Each word image represents a single English word and is accompanied by its corresponding ground truth label, providing accurate transcription for training and evaluation purposes.

    Please refer: IIIT5K-Words official site

    Note: to inspect the .mat files, use the following code.

    Install requirements (shutil is part of the Python standard library, so only pymatreader needs to be installed):

    !pip install pymatreader

    Unpack the archive:

    import shutil

    shutil.unpack_archive('IIIT5K-Word_V3.0.tar.gz', 'data')

    Read the .mat files:

    from pymatreader import read_mat

    testdata_mat = read_mat('testdata.mat')

    testCharBound_mat = read_mat('testCharBound.mat')

    testdata_mat

    Key Features:
    - Size: The dataset comprises 5,000 word images, making it suitable for training and evaluating OCR algorithms.
    - Diversity: The dataset encompasses a wide range of fonts, styles, and sizes to ensure the inclusion of various challenges encountered in real-world scenarios.
    - Ground Truth Labels: Each word image is paired with its ground truth label, enabling supervised learning approaches and facilitating evaluation metrics calculation.
    - Quality Annotation: The dataset has been carefully curated by experts at IIIT-H, ensuring high-quality annotations and accurate transcription of the word images.
    - Research Applications: The dataset serves as a valuable resource for OCR, word recognition, text detection, and related research areas.

    Potential Use Cases:
    - Optical Character Recognition (OCR) Systems: The dataset can be employed to train and benchmark OCR models, improving their accuracy and robustness.
    - Word Recognition Algorithms: Researchers can utilize the dataset to develop and evaluate word recognition algorithms, including deep learning-based approaches.
    - Text Detection: The dataset can aid in the development and evaluation of algorithms for text detection in natural scenes.
    - Font and Style Analysis: Researchers can leverage the dataset to study font and style variations, character segmentation, and other related analyses.

    Citation:

    @InProceedings{MishraBMVC12,
      author    = "Mishra, A. and Alahari, K. and Jawahar, C.~V.",
      title     = "Scene Text Recognition using Higher Order Language Priors",
      booktitle = "BMVC",
      year      = "2012",
    }

  11. A BENCHMARK DATASET FOR MANIPURI MEETEI-MAYEK HANDWRITTEN CHARACTER...

    • dataverse.harvard.edu
    • search.dataone.org
    • +5more
    Updated Sep 28, 2019
    Cite
    Pangambam Singh (2019). A BENCHMARK DATASET FOR MANIPURI MEETEI-MAYEK HANDWRITTEN CHARACTER RECOGNITION [Dataset]. http://doi.org/10.7910/DVN/OMU2DV
    Explore at:
    Croissant
    Dataset updated
    Sep 28, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Pangambam Singh
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Manipur
    Description

    A benchmark dataset is always required for any classification or recognition system. To the best of our knowledge, no benchmark dataset for handwritten character recognition of the Manipuri Meetei-Mayek script exists in the public domain so far. Manipuri, also referred to as Meeteilon or sometimes Meiteilon, is a Sino-Tibetan language listed in the Eighth Schedule of the Indian Constitution. It is the official language and lingua franca of the southeastern Himalayan state of Manipur, in northeastern India. The language is also used as a communicating language by a significant number of people across north-east India and in parts of Bangladesh and Myanmar, and it is the most widely spoken language in Northeast India after Bengali and Assamese. In this work, we introduce a handwritten Manipuri Meetei-Mayek character dataset consisting of more than 5,000 samples, collected from a diverse population group spanning different age groups (from 4 to 60 years), genders, educational backgrounds, occupations, and communities from three districts of Manipur, India (Imphal East, Thoubal, and Kangpokpi) during March and April 2019. Each individual was asked to write down all the Manipuri characters on one A4-size sheet of paper. The responses were scanned, and each character was manually segmented from the scanned images. The dataset consists of segmented scanned images of handwritten Manipuri Meetei-Mayek characters (Mapi Mayek, Lonsum Mayek, Cheitap Mayek, Cheising Mayek, Khutam Mayek) of size 128×128 pixels, in .JPG as well as .MAT format.

  12. A benchmark dataset for Manipuri Meetei-Mayek handwritten character...

    • search.dataone.org
    Updated Jun 15, 2025
    + more versions
    Cite
    Pangambam Singh (2025). A benchmark dataset for Manipuri Meetei-Mayek handwritten character recognition [Dataset]. http://doi.org/10.5061/dryad.r4xgxd27w
    Explore at:
    Dataset updated
    Jun 15, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Pangambam Singh
    Time period covered
    Jan 1, 2019
    Description

    A benchmark dataset is always required for any classification or recognition system. To the best of our knowledge, no benchmark dataset exists for handwritten character recognition of Manipuri Meetei-Mayek script in public domain so far. Manipuri, also referred to as Meeteilon or sometimes Meiteilon, is a Sino-Tibetan language and also one of the Eight Scheduled languages of Indian Constitution. It is the official language and lingua franca of the southeastern Himalayan state of Manipur, in northeastern India. This language is also used by a significant number of people as their communicating language over the north-east India, and some parts of Bangladesh and Myanmar. It is the most widely spoken language in Northeast India after Bengali and Assamese languages. In this work, we introduce a handwritten Manipuri Meetei-Mayek character dataset which consists of more than 5000 data samples which were collected from a diverse population group that belongs to different age groups (from 4 yea...

  13. OCR-Reasoning

    • huggingface.co
    Updated May 21, 2025
    Cite
    Huang (2025). OCR-Reasoning [Dataset]. https://huggingface.co/datasets/mx262/OCR-Reasoning
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    Huang
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

    Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal... See the full description on the dataset page: https://huggingface.co/datasets/mx262/OCR-Reasoning.

  14. olmOCR-bench

    • huggingface.co
    Updated Jul 23, 2025
    Cite
    Ai2 (2025). olmOCR-bench [Dataset]. https://huggingface.co/datasets/allenai/olmOCR-bench
    Explore at:
    Dataset updated
    Jul 23, 2025
    Dataset provided by
    Allen Institute for AI: http://allenai.org/
    Authors
    Ai2
    License

    ODC-By: https://choosealicense.com/licenses/odc-by/

    Description

    olmOCR-bench

    olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information. Quick links:

    📃 Paper | 🛠️ Code | 🎮 Demo

      Table 1 (Distribution of Test Classes by Document Source) is truncated here. See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-bench.

  15. A dataset of Manchu ancient book word images for OCR tasks, China,...

    • scidb.cn
    Updated May 29, 2025
    Cite
    Sun Haipeng; Tao Wenhao; Bi Xiaojun (2025). A dataset of Manchu ancient book word images for OCR tasks, China, 1733โ€“1867. [Dataset]. http://doi.org/10.57760/sciencedb.25676
    Explore at:
    Croissant
    Dataset updated
    May 29, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Sun Haipeng; Tao Wenhao; Bi Xiaojun
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    China
    Description

    This dataset consists of 24,280 high-resolution word images extracted from Manchu ancient books dating from 1733 to 1867, collected within the present-day territory of China. The images were sourced from the Series of Rare Ancient Books in Manchu and Chinese curated by the National Library of China. Each of the 2,428 unique Manchu words in the dataset is represented by exactly 10 distinct image samples, resulting in a balanced and well-structured dataset suitable for training and evaluating deep learning models on the task of Manchu OCR (optical character recognition).

    This dataset was constructed using a semi-automated workflow to address the challenges posed by manual segmentation of historical scripts (such as high annotation costs and time-consuming processing) and to preserve the visual details of each page. The image acquisition process involved high-precision scanning at 600 dpi. Word regions were first identified using computer vision algorithms, followed by manual verification and correction to ensure the accuracy and completeness of the extracted samples.

    All images are stored in standard .jpg format with consistent resolution and naming conventions. The dataset is divided into structured folders by word category, and accompanying metadata files provide annotations, including word labels, file paths, and page source references. The released version has no missing data entries, and the dataset has been quality-checked to exclude samples with severe degradation, such as illegible characters, torn pages, or significant shadowing.

    To our knowledge, this is the largest publicly available Manchu word image dataset to date. It offers a valuable resource for researchers in historical document analysis, Manchu linguistics, and machine-learning-based OCR. The dataset can be used for model training and evaluation, benchmarking segmentation algorithms, and exploring multimodal representations of Manchu script.

  16. Data from: MDIW-13: New Database and Benchmark for Script Identification

    • zenodo.org
    Updated Jul 17, 2024
    Cite
    Miguel A. Ferrer; Abhijit Das; Moises Diaz; Aythami Morales; Cristina Carmona-Duarte; Umapada Pal (2024). MDIW-13: New Database and Benchmark for Script Identification [Dataset]. http://doi.org/10.5281/zenodo.6343658
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Miguel A. Ferrer; Abhijit Das; Moises Diaz; Aythami Morales; Cristina Carmona-Duarte; Umapada Pal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Script identification is a necessary step in some applications involving document analysis in a multi-script and multi-language environment. This paper provides a new database for benchmarking script identification algorithms, which contains both printed and handwritten documents collected from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents scanned from local newspapers and handwritten letters and notes from different native writers. Further, these documents are segmented into lines and words, comprising a total of 13,979 and 86,655 lines and words, respectively, in the dataset. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods. The benchmark includes results at the document, line, and word levels with printed and handwritten documents. Results of script identification independent of the document/line/word level and independent of the printed/handwritten letters are also given.

    https://www.dropbox.com/s/vtmy0l4gjxun0oe/Multiscript_SIW_Database_Feb25_acceptedPaper.zip?dl=0

    Please cite our work if you find the database useful:

    • M. A. Ferrer, A. Das, M. Diaz, A. Morales, C. Carmona-Duarte, U. Pal (2022), "MDIW-13: New Database and Benchmark for Script Identification", Multimedia Tools and Applications, Pages 1-14. Accepted
    • A. Das, M. A. Ferrer, A. Morales, M. Diaz, U. Pal, et al. "SIW 2021: ICDAR Competition on Script Identification in the Wild". 16th International Conference on Document Analysis and Recognition (ICDAR 2021). Lecture Notes in Computer Science, vol 12824. Springer. Sep. 5-10, 2021, Lausanne, Switzerland, pp. 738-753. doi: 10.1007/978-3-030-86337-1_49
  17. OCRFlux-pubtabnet-single

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    chatdoc.com (2025). OCRFlux-pubtabnet-single [Dataset]. https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-single
    Explore at:
    Dataset updated
    Jun 17, 2025
    Authors
    chatdoc.com
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OCRFlux-pubtabnet-single

    OCRFlux-pubtabnet-single is a benchmark of 9064 table images and their corresponding ground-truth HTML, which are derived from the public PubTabNet benchmark with some format transformations. This dataset can be used to measure the performance of OCR systems in single-page table parsing. Quick links:

    🤗 Model 🛠️ Code

      Data Mix

    Table 1: Tables breakdown by complexity (whether they contain rowspan or colspan cells)… See the full description on the dataset page: https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-single.
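Since the benchmark pairs each table image with ground-truth HTML, a system's output can be scored by comparing predicted and reference tables. Published table-parsing evaluations typically use structure-aware metrics such as TEDS; the sketch below is a deliberately simplified, flat cell-level comparison using only the standard library (the class and function names are illustrative, not part of the dataset's tooling):

```python
from html.parser import HTMLParser

class TableCells(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._buf, self._in_cell = [], None, [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._buf = True, []
    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row.append("".join(self._buf).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)

def cell_accuracy(pred_html: str, gt_html: str) -> float:
    """Fraction of ground-truth cells matched, in reading order.
    Ignores rowspan/colspan structure, unlike TEDS."""
    pred, gt = TableCells(), TableCells()
    pred.feed(pred_html)
    gt.feed(gt_html)
    gt_cells = [c for row in gt.rows for c in row]
    pred_cells = [c for row in pred.rows for c in row]
    hits = sum(p == g for p, g in zip(pred_cells, gt_cells))
    return hits / max(len(gt_cells), 1)

gt = "<table><tr><td>a</td><td>b</td></tr></table>"
pred = "<table><tr><td>a</td><td>x</td></tr></table>"
print(cell_accuracy(pred, gt))  # 0.5
```

Because the metric zips cells in reading order, a single inserted or dropped cell shifts every later comparison; that is exactly the weakness structure-aware metrics like TEDS are designed to avoid.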
    
  18. Handwritten Math Expressions Dataset

    • kaggle.com
    Updated Dec 31, 2024
    Cite
    GOVINDARAM SRIRAM (2024). Handwritten Math Expressions Dataset [Dataset]. https://www.kaggle.com/datasets/govindaramsriram/handwritten-math-expressions-dataset/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GOVINDARAM SRIRAM
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description:
    This dataset contains images of handwritten mathematical expressions paired with their corresponding textual representations and answers. The expressions include various arithmetic operations such as addition (+), subtraction (-), multiplication (*), division (÷), and parentheses for grouping operations. The dataset is designed to support tasks such as Optical Character Recognition (OCR), handwritten text recognition, and sequence modeling for solving mathematical expressions.

    Key Features:

    • Images: Contains high-quality images of handwritten mathematical equations.
    • Annotations: A CSV file with two columns:
      • Expression: The mathematical expression in text form.
      • Answer: The evaluated result of the expression.
    • Complexity: Includes basic operations, grouped expressions with parentheses, and diverse handwriting styles to simulate real-world challenges.
    • Applications: Ideal for developing and benchmarking OCR systems, training deep learning models, and fine-tuning pretrained models for handwritten text recognition.

    This dataset serves as a valuable resource for researchers and practitioners working on handwriting recognition and mathematical problem-solving automation.
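Because each annotation row pairs an Expression string with its evaluated Answer, the labels can be sanity-checked programmatically. A minimal sketch using a restricted AST evaluator (the Expression/Answer column names come from the CSV described above; the function name and sample row are illustrative):

```python
import ast
import operator

# Only the operators the dataset uses: +, -, *, ÷, and unary minus.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.USub: operator.neg}

def evaluate(expr: str) -> float:
    """Safely evaluate an arithmetic expression string."""
    expr = expr.replace("÷", "/")  # normalize the handwritten division sign
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

# Check one annotation row: does the recorded answer match the expression?
row = {"Expression": "(3 + 4) * 2", "Answer": "14"}
assert evaluate(row["Expression"]) == float(row["Answer"])
```

Restricting evaluation to a whitelisted set of AST nodes avoids the code-execution risk of calling `eval` on annotation strings.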

  19. textocr-gpt4v

    • huggingface.co
    Updated Apr 3, 2015
    + more versions
    Cite
    Jimmy Carter (2015). textocr-gpt4v [Dataset]. https://huggingface.co/datasets/jimmycarter/textocr-gpt4v
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 3, 2015
    Authors
    Jimmy Carter
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for TextOCR-GPT4V

      Dataset Summary
    

    TextOCR-GPT4V is Meta's TextOCR dataset captioned with an emphasis on text OCR using GPT4V. To get the images, you will need to agree to their terms of service.

      Supported Tasks
    

    The TextOCR-GPT4V dataset is intended for generating benchmarks that compare an MLLM to GPT4V.

      Languages
    

    The captions are in English, while the texts in the images are in many languages, such as Spanish and Japanese… See the full description on the dataset page: https://huggingface.co/datasets/jimmycarter/textocr-gpt4v.

  20. impresso-ocr-benchmark-impresso-nzz

    • huggingface.co
    Cite
    Emanuela Boros, impresso-ocr-benchmark-impresso-nzz [Dataset]. https://huggingface.co/datasets/emanuelaboros/impresso-ocr-benchmark-impresso-nzz
    Explore at:
    Authors
    Emanuela Boros
    Description

    emanuelaboros/impresso-ocr-benchmark-impresso-nzz dataset hosted on Hugging Face and contributed by the HF Datasets community
