56 datasets found
  1. ocr-benchmark

    • huggingface.co
    Updated Feb 19, 2025
    Cite
    OmniAI (2025). ocr-benchmark [Dataset]. https://huggingface.co/datasets/getomni-ai/ocr-benchmark
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    OmniAI Technology, Inc.
    Authors
    OmniAI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    OmniAI OCR Benchmark

    A comprehensive benchmark that compares OCR and data extraction capabilities of different multimodal LLMs such as gpt-4o and gemini-2.0, evaluating both text and JSON extraction accuracy. Benchmark Results (Feb 2025) | Source Code

  2. OCR-benchmark

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    black (2025). OCR-benchmark [Dataset]. https://huggingface.co/datasets/blackcrow228/OCR-benchmark
    Explore at:
    Dataset updated
    Jul 27, 2025
    Authors
    black
    Description

    The blackcrow228/OCR-benchmark dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  3. Noisy OCR Dataset (NOD)

    • zenodo.org
    Updated Jul 6, 2021
    Cite
    Thomas Hegghammer (2021). Noisy OCR Dataset (NOD) [Dataset]. http://doi.org/10.5281/zenodo.5068735
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 6, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Thomas Hegghammer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).

    Source images

    The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.

    Artificial noise application

    The dataset was created as follows:
    - First, a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
    - Then six ideal types of image noise ("blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains") were applied to both the colour and the greyscale versions, creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
    - Lastly, all 15 pairwise combinations of the six noise filters were applied to the colour and greyscale images, for an additional 30 versions.

    This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.
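    The version counts above can be sanity-checked in a few lines of Python (a sketch; the filter names are taken from the description, and the two-filter versions are assumed to be unordered pairs of distinct filters):

```python
from itertools import combinations

filters = ["blur", "weak ink", "salt and pepper",
           "watermark", "scribbles", "ink stains"]

base = 2                                # colour and greyscale, no added noise
one_layer = base * len(filters)         # 6 filters x 2 base versions = 12
pairs = list(combinations(filters, 2))  # C(6, 2) = 15 unordered filter pairs
two_layer = base * len(pairs)           # 15 pairs x 2 base versions = 30
total = base + one_layer + two_layer    # 44 versions per source image

print(total)        # 44
print(322 * total)  # English corpus: 14168 documents
print(100 * total)  # Arabic corpus: 4400 documents
```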

    The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See this link for how to unzip .tar.lzma files.

    References:

    Barcha, Pedro. 2017. "Old Books Dataset." GitHub repository. https://github.com/PedroBarcha/old-books-dataset.

    Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. "Yarmouk Arabic OCR Dataset." In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.

    Hegghammer, Thomas. 2021. "OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment." SocArXiv. https://osf.io/preprints/socarxiv/6zfvs

  4. sensor-ocr-benchmark

    • huggingface.co
    Updated Jun 22, 2024
    Cite
    seafog winters (2024). sensor-ocr-benchmark [Dataset]. https://huggingface.co/datasets/famousdetectiveadrianmonk/sensor-ocr-benchmark
    Explore at:
    Croissant
    Dataset updated
    Jun 22, 2024
    Authors
    seafog winters
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    My Custom Dataset

      Description
    

    The original dataset was modified to insert fake sensor information at the bottom of each image.

      Usage
    

    from datasets import load_dataset

    dataset = load_dataset("famousdetectiveadrianmonk/sensor-ocr-benchmark")
    example = dataset['train'][0]
    img = example['pixel_values']
    sensor_zoomin = img.crop((600, 850, 1250, 1050))

      Attribution
    

    This dataset is based on the original dataset provided by Segments.ai. The original dataset can... See the full description on the dataset page: https://huggingface.co/datasets/famousdetectiveadrianmonk/sensor-ocr-benchmark.

  5. Synthetic Printed Words and Test Protocols Data for Bangla OCR

    • figshare.com
    Updated Jun 13, 2023
    Cite
    Koushik Roy; MD Sazzad Hossain; Pritom Saha; Shadman Rohan; Fuad Rahman; Imranul Ashrafi; Ifty Mohammad Rezwan; B M Mainul Hossain; Ahmedul Kabir; Nabeel Mohammed (2023). Synthetic Printed Words and Test Protocols Data for Bangla OCR [Dataset]. http://doi.org/10.6084/m9.figshare.20186825.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Koushik Roy; MD Sazzad Hossain; Pritom Saha; Shadman Rohan; Fuad Rahman; Imranul Ashrafi; Ifty Mohammad Rezwan; B M Mainul Hossain; Ahmedul Kabir; Nabeel Mohammed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic printed word images and test-protocol word images: the data repository for the paper "A Multifaceted Evaluation of Representation of Graphemes for Practically Effective Bangla OCR." In this paper, we utilized the popular Convolutional Recurrent Neural Network (CRNN) architecture and implemented our grapheme representation strategies to design the final labels of the model. Due to the absence of a large-scale Bangla word-level printed dataset, we created a synthetically generated Bangla corpus containing 2 million samples that are representative and sufficiently varied in terms of fonts, domain, and vocabulary size to train our Bangla OCR model. To test the various aspects of our model, we also created 6 test protocols. Finally, to establish the generalizability of our grapheme representation methods, we performed training and testing on external handwriting datasets.

    Updates: 10 June 2023: The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR).

  6. Benchmark for the evaluation of named entity recognition over ancient...

    • live.european-language-grid.eu
    • zenodo.org
    • +1more
    Updated Aug 20, 2023
    + more versions
    Cite
    (2023). Benchmark for the evaluation of named entity recognition over ancient documents [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7802
    Explore at:
    Available download formats: png
    Dataset updated
    Aug 20, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of multilingual noisy corpora for named entity recognition (NER). The noisy versions are simulated from the CoNLL-02 (Spanish and Dutch) and CoNLL-03 (English) NER corpora. The original collections are re-OCRed, and four types of noise at two different levels are added in order to simulate various OCR outputs. More precisely, we first extracted raw texts and converted them into images. These images were contaminated by adding some common noises that occur when using a scanner. We then extracted OCRed data using the Tesseract open-source OCR engine v3.04.01. As a consequence of the image noise insertions, the OCRed data contains degradations. Original and noisy texts are finally aligned.

    This archive contains three folders (one per language). The folders contain the degraded images, the noisy texts extracted by the OCR, and their aligned version with clean data.

  7. TibOCR-Bench: A Comprehensive Benchmark and Training Pipeline for Tibetan...

    • scidb.cn
    Updated Aug 11, 2025
    Cite
    Kuntharrgyal Khysru; LAMA Jie (2025). TibOCR-Bench: A Comprehensive Benchmark and Training Pipeline for Tibetan Multimodal OCR [Dataset]. http://doi.org/10.57760/sciencedb.28968
    Explore at:
    Croissant
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Kuntharrgyal Khysru; LAMA Jie
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    To effectively support the training and evaluation of Tibetan OCR models in practical application scenarios involving multiple fonts and complex text structures, we have constructed a multi-source, high-quality Tibetan text image dataset. The data construction follows two complementary strategies: forward construction and reverse construction. (1) Forward construction: first, collect Tibetan-language images from real scenes, then manually annotate the corresponding text content. This method ensures the authenticity and practical relevance of the data, effectively covering the diverse language-usage scenarios and inherent complexity of Tibetan OCR tasks. (2) Reverse construction: first, select text content suitable for OCR tasks (such as advertising copy, slogans, or standard documents), then choose appropriate background images and use multiple fonts and visual effects to synthesize text images. This method efficiently increases the structural diversity and scale of the dataset. The two strategies complement each other and together form a comprehensive resource for training and evaluating Tibetan OCR models.

  8. Benchmark for the evaluation of Named Entity Linking over ancient documents

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    Updated Jan 24, 2020
    Cite
    Elvys Linhares Pontes; Ahmed Hamdi; Nicolas Sidere; Antoine Doucet (2020). Benchmark for the evaluation of Named Entity Linking over ancient documents [Dataset]. http://doi.org/10.5281/zenodo.3490333
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Elvys Linhares Pontes; Ahmed Hamdi; Nicolas Sidere; Antoine Doucet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Benchmark for the evaluation of Named Entity Linking over ancient documents
    Elvys Linhares Pontes, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet
    University of Avignon: elvys.linhares-pontes@univ-avignon.fr; University of La Rochelle: {elvys.linhares_pontes,ahmed.hamdi,nicolas.sidere,antoine.doucet}@univ-lr.fr

    These are the supplementary materials for the ICADL 2019 paper Impact of OCR Quality on Named Entity Linking. If you end up using whole or parts of this resource, please use the following citation:

    • Linhares Pontes, E., Hamdi, A., Sidere, N., and Doucet, A. (2019). Impact of OCR Quality on Named Entity Linking. In Proceedings of 21st International Conference on Asia-Pacific Digital Libraries ICADL 2019, Kuala Lumpur, Malaysia.

    or alternatively use the following `bib`:

    @inproceedings{linhares2019icadl,
      title={Impact of OCR Quality on Named Entity Linking},
      author={Linhares Pontes, Elvys and Hamdi, Ahmed and Sidere, Nicolas and Doucet, Antoine},
      year={2019},
      booktitle={Proceedings of the 21st International Conference on Asia-Pacific Digital Libraries (ICADL 2019)}
    }

    Files
    This archive contains six folders -- one per dataset -- as well as this README. The folders contain the degraded images, the noisy texts extracted by the OCR and their aligned version with clean data. This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).

    Acknowledgments
    This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 [NewsEye](https://www.newseye.eu/).

  9. Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts...

    • data.niaid.nih.gov
    Updated Mar 22, 2025
    Cite
    SyedKhaleel Jageer (2025). Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts with Corresponding Ground Truth Annotations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15009379
    Explore at:
    Dataset updated
    Mar 22, 2025
    Dataset authored and provided by
    SyedKhaleel Jageer
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides a benchmark for Tamil Optical Character Recognition (OCR), covering both handwritten (Hangual) and printed Tamil text. It includes high-quality ground truth (GT) text files paired with corresponding TIFF images, making it valuable for training and evaluating OCR models, particularly for Tesseract, deep learning-based recognition, and AI research.

    Dataset Highlights

    Total Size: 15GB (Sample from the full 60GB dataset)

    Total Pairs: Approximately 1,903,284 text-image pairs

    Handwritten Fonts (9 Unicode Fonts):

    Aazhi, Gnani, Hemalatha, Indumathi, Kalayarasi, Siva_01, Siva_02, Sudeeptha, Yogeshwaran

    Printed Fonts (9 Unicode Fonts):

    AnekTamil, Arima, KarlaTamilInclined, TAU-Barathi, TAU-Kambar, TAU-Marutham, TAU-Mullai, TAU-Neythal, TAU-Valluvar

    Data Source:

    The text corpus (GT text files) is curated from Wikipedia and Wikisource, ensuring linguistic diversity.

    The fonts are publicly available Unicode Tamil fonts, sourced from Google Fonts and Tamil Virtual University.

    File Structure

    Tamil_OCR_Dataset/
    ├── Hangual_Fonts/
    │   ├── Aazhi/
    │   │   ├── gt/
    │   │   │   ├── 00001.gt.txt
    │   │   │   ├── 00002.gt.txt
    │   │   ├── images/
    │   │   │   ├── 00001.tiff
    │   │   │   ├── 00002.tiff
    │   ├── Gnani/
    │   ├── ...
    ├── Printed_Fonts/
    │   ├── AnekTamil/
    │   ├── HindMadurai/
    │   ├── ...

    Cite this work

    @dataset{tamilocr_dataset_2025,
      author    = {Syedkhaleel Jageer},
      title     = {Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts with Corresponding Ground Truth Annotations},
      year      = {2025},
      publisher = {Zenodo},
      doi       = {10.5281/zenodo.15009380},
      url       = {https://doi.org/10.5281/zenodo.15009380}
    }

  10. IIIT5K-Words

    • kaggle.com
    Updated May 12, 2023
    Cite
    Prathamesh Zade (2023). IIIT5K-Words [Dataset]. http://doi.org/10.34740/kaggle/dsv/5671242
    Explore at:
    Croissant
    Dataset updated
    May 12, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Prathamesh Zade
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    IIIT5K-Words

    The IIIT5K Words Dataset is a comprehensive collection of labeled word images, curated by the International Institute of Information Technology, Hyderabad (IIIT-H). It is designed to facilitate research and development in optical character recognition (OCR), word recognition, and related fields.

    The dataset contains a diverse set of 5,000 word images, covering various fonts, styles, and sizes. Each word image represents a single English word and is accompanied by its corresponding ground truth label, providing accurate transcription for training and evaluation purposes.

    Please refer: IIIT5K-Words official site

    Note: to inspect the .mat files, use the following code.

    Install requirements (shutil is part of the Python standard library, so only pymatreader needs to be installed):

    !pip install pymatreader

    Unpack the archive:

    import shutil

    shutil.unpack_archive('IIIT5K-Word_V3.0.tar.gz', 'data')

    Read the .mat files:

    from pymatreader import read_mat

    testdata_mat = read_mat('testdata.mat')

    testCharBound_mat = read_mat('testCharBound.mat')

    testdata_mat

    Key Features:
    - Size: The dataset comprises 5,000 word images, making it suitable for training and evaluating OCR algorithms.
    - Diversity: The dataset encompasses a wide range of fonts, styles, and sizes to ensure the inclusion of various challenges encountered in real-world scenarios.
    - Ground Truth Labels: Each word image is paired with its ground truth label, enabling supervised learning approaches and facilitating evaluation metrics calculation.
    - Quality Annotation: The dataset has been carefully curated by experts at IIIT-H, ensuring high-quality annotations and accurate transcription of the word images.
    - Research Applications: The dataset serves as a valuable resource for OCR, word recognition, text detection, and related research areas.

    Potential Use Cases:
    - Optical Character Recognition (OCR) Systems: The dataset can be employed to train and benchmark OCR models, improving their accuracy and robustness.
    - Word Recognition Algorithms: Researchers can utilize the dataset to develop and evaluate word recognition algorithms, including deep learning-based approaches.
    - Text Detection: The dataset can aid in the development and evaluation of algorithms for text detection in natural scenes.
    - Font and Style Analysis: Researchers can leverage the dataset to study font and style variations, character segmentation, and other related analyses.

    Citation:

    @InProceedings{MishraBMVC12,
      author    = "Mishra, A. and Alahari, K. and Jawahar, C.~V.",
      title     = "Scene Text Recognition using Higher Order Language Priors",
      booktitle = "BMVC",
      year      = "2012",
    }

  11. A BENCHMARK DATASET FOR MANIPURI MEETEI-MAYEK HANDWRITTEN CHARACTER...

    • dataverse.harvard.edu
    • search.dataone.org
    • +5more
    Updated Sep 28, 2019
    Cite
    Pangambam Singh (2019). A BENCHMARK DATASET FOR MANIPURI MEETEI-MAYEK HANDWRITTEN CHARACTER RECOGNITION [Dataset]. http://doi.org/10.7910/DVN/OMU2DV
    Explore at:
    Croissant
    Dataset updated
    Sep 28, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Pangambam Singh
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Manipur
    Description

    A benchmark dataset is always required for any classification or recognition system. To the best of our knowledge, no benchmark dataset for handwritten character recognition of the Manipuri Meetei-Mayek script exists in the public domain so far. Manipuri, also referred to as Meeteilon or sometimes Meiteilon, is a Sino-Tibetan language listed in the Eighth Schedule of the Indian Constitution. It is the official language and lingua franca of the southeastern Himalayan state of Manipur, in northeastern India. The language is also used as a communicating language by a significant number of people across north-east India and in parts of Bangladesh and Myanmar, and it is the most widely spoken language in Northeast India after Bengali and Assamese. In this work, we introduce a handwritten Manipuri Meetei-Mayek character dataset consisting of more than 5,000 samples, collected from a diverse population group spanning different age groups (from 4 to 60 years), genders, educational backgrounds, occupations, and communities from three districts of Manipur, India (Imphal East, Thoubal, and Kangpokpi) during March and April 2019. Each individual was asked to write down all the Manipuri characters on one A4-size sheet of paper. The responses were scanned, and each character was manually segmented from the scanned images. The dataset consists of segmented scanned images of handwritten Manipuri Meetei-Mayek characters (Mapi Mayek, Lonsum Mayek, Cheitap Mayek, Cheising Mayek, Khutam Mayek) of size 128×128 pixels, in .JPG as well as .MAT format.

  12. A benchmark dataset for Manipuri Meetei-Mayek handwritten character...

    • search.dataone.org
    Updated Jun 15, 2025
    + more versions
    Cite
    Pangambam Singh (2025). A benchmark dataset for Manipuri Meetei-Mayek handwritten character recognition [Dataset]. http://doi.org/10.5061/dryad.r4xgxd27w
    Explore at:
    Dataset updated
    Jun 15, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Pangambam Singh
    Time period covered
    Jan 1, 2019
    Description

    A benchmark dataset is always required for any classification or recognition system. To the best of our knowledge, no benchmark dataset exists for handwritten character recognition of Manipuri Meetei-Mayek script in public domain so far. Manipuri, also referred to as Meeteilon or sometimes Meiteilon, is a Sino-Tibetan language and also one of the Eight Scheduled languages of Indian Constitution. It is the official language and lingua franca of the southeastern Himalayan state of Manipur, in northeastern India. This language is also used by a significant number of people as their communicating language over the north-east India, and some parts of Bangladesh and Myanmar. It is the most widely spoken language in Northeast India after Bengali and Assamese languages. In this work, we introduce a handwritten Manipuri Meetei-Mayek character dataset which consists of more than 5000 data samples which were collected from a diverse population group that belongs to different age groups (from 4 yea...

  13. OCR-Reasoning

    • huggingface.co
    Updated May 21, 2025
    Cite
    Huang (2025). OCR-Reasoning [Dataset]. https://huggingface.co/datasets/mx262/OCR-Reasoning
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    Huang
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

    Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal... See the full description on the dataset page: https://huggingface.co/datasets/mx262/OCR-Reasoning.

  14. olmOCR-bench

    • huggingface.co
    Updated Jul 23, 2025
    Cite
    Ai2 (2025). olmOCR-bench [Dataset]. https://huggingface.co/datasets/allenai/olmOCR-bench
    Explore at:
    Dataset updated
    Jul 23, 2025
    Dataset provided by
    Allen Institute for AI: http://allenai.org/
    Authors
    Ai2
    License

    ODC-By: https://choosealicense.com/licenses/odc-by/

    Description

    olmOCR-bench

    olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information. Quick links:

    📃 Paper | 🛠️ Code | 🎮 Demo

      Table 1 (Distribution of Test Classes by Document Source) is truncated here. See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-bench.

  15. A dataset of Manchu ancient book word images for OCR tasks, China,...

    • scidb.cn
    Updated May 29, 2025
    Cite
    Sun Haipeng; Tao Wenhao; Bi Xiaojun (2025). A dataset of Manchu ancient book word images for OCR tasks, China, 1733โ€“1867. [Dataset]. http://doi.org/10.57760/sciencedb.25676
    Explore at:
    Croissant
    Dataset updated
    May 29, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Sun Haipeng; Tao Wenhao; Bi Xiaojun
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    China
    Description

    This dataset consists of 24,280 high-resolution word images extracted from Manchu ancient books dating from 1733 to 1867, collected within the present-day territory of China. The images were sourced from the Series of Rare Ancient Books in Manchu and Chinese curated by the National Library of China. Each of the 2,428 unique Manchu words in the dataset is represented by exactly 10 distinct image samples, resulting in a balanced and well-structured dataset suitable for training and evaluating deep learning models on the task of Manchu OCR (optical character recognition).

    This dataset was constructed using a semi-automated workflow to address the challenges posed by manual segmentation of historical scripts (such as high annotation costs and time-consuming processing) and to preserve the visual details of each page. The image acquisition process involved high-precision scanning at 600 dpi. Word regions were first identified using computer vision algorithms, followed by manual verification and correction to ensure the accuracy and completeness of the extracted samples.

    All images are stored in standard .jpg format with consistent resolution and naming conventions. The dataset is divided into structured folders by word category, and accompanying metadata files provide annotations, including word labels, file paths, and page source references. The released version has no missing data entries, and the dataset has been quality-checked to exclude samples with severe degradation, such as illegible characters, torn pages, or significant shadowing.

    To our knowledge, this is the largest publicly available Manchu word image dataset to date. It offers a valuable resource for researchers in historical document analysis, Manchu linguistics, and machine-learning-based OCR. The dataset can be used for model training and evaluation, benchmarking segmentation algorithms, and exploring multimodal representations of Manchu script.

  16. Data from: MDIW-13: New Database and Benchmark for Script Identification

    • zenodo.org
    Updated Jul 17, 2024
    Cite
    Miguel A. Ferrer; Abhijit Das; Moises Diaz; Aythami Morales; Cristina Carmona-Duarte; Umapada Pal (2024). MDIW-13: New Database and Benchmark for Script Identification [Dataset]. http://doi.org/10.5281/zenodo.6343658
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Miguel A. Ferrer; Abhijit Das; Moises Diaz; Aythami Morales; Cristina Carmona-Duarte; Umapada Pal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Script identification is a necessary step in some applications involving document analysis in a multi-script and multi-language environment. This paper provides a new database for benchmarking script identification algorithms, which contains both printed and handwritten documents collected from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents scanned from local newspapers and handwritten letters and notes from different native writers. Further, these documents are segmented into lines and words, comprising a total of 13,979 and 86,655 lines and words, respectively, in the dataset. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods. The benchmark includes results at the document, line, and word levels with printed and handwritten documents. Results of script identification independent of the document/line/word level and independent of the printed/handwritten letters are also given.

    https://www.dropbox.com/s/vtmy0l4gjxun0oe/Multiscript_SIW_Database_Feb25_acceptedPaper.zip?dl=0

    Please cite our work if you find the database useful:

    • M. A. Ferrer, A. Das, M. Diaz, A. Morales, C. Carmona-Duarte, U. Pal (2022), "MDIW-13: New Database and Benchmark for Script Identification", Multimedia Tools and Applications, Pages 1-14. Accepted
    • A. Das, M. A. Ferrer, A. Morales, M. Diaz, U. Pal, et al. "SIW 2021: ICDAR Competition on Script Identification in the Wild". 16th International Conference on Document Analysis and Recognition (ICDAR 2021). Lecture Notes in Computer Science, vol 12824. Springer. Sep. 5-10, 2021, Lausanne, Switzerland, pp. 738-753. doi: 10.1007/978-3-030-86337-1_49
  17. OCRFlux-pubtabnet-single

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    chatdoc.com (2025). OCRFlux-pubtabnet-single [Dataset]. https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-single
    Explore at:
    Dataset updated
    Jun 17, 2025
    Authors
    chatdoc.com
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OCRFlux-pubtabnet-single

    OCRFlux-pubtabnet-single is a benchmark of 9064 table images and their corresponding ground-truth HTML, which are derived from the public PubTabNet benchmark with some format transformations. This dataset can be used to measure the performance of OCR systems in single-page table parsing. Quick links:

    🤗 Model 🛠️ Code

      Data Mix

    Table 1: Tables breakdown by complexity (whether they contain rowspan or colspan cells)… See the full description on the dataset page: https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-single.
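Since the benchmark pairs each table image with ground-truth HTML, a system's output can be scored by comparing predicted and reference tables. Published table-parsing evaluations typically use structure-aware metrics such as TEDS; the sketch below is a deliberately simplified, flat cell-level comparison using only the standard library (the class and function names are illustrative, not part of the dataset's tooling):

```python
from html.parser import HTMLParser

class TableCells(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._buf, self._in_cell = [], None, [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._buf = True, []
    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row.append("".join(self._buf).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)

def cell_accuracy(pred_html: str, gt_html: str) -> float:
    """Fraction of ground-truth cells matched, in reading order.
    Ignores rowspan/colspan structure, unlike TEDS."""
    pred, gt = TableCells(), TableCells()
    pred.feed(pred_html)
    gt.feed(gt_html)
    gt_cells = [c for row in gt.rows for c in row]
    pred_cells = [c for row in pred.rows for c in row]
    hits = sum(p == g for p, g in zip(pred_cells, gt_cells))
    return hits / max(len(gt_cells), 1)

gt = "<table><tr><td>a</td><td>b</td></tr></table>"
pred = "<table><tr><td>a</td><td>x</td></tr></table>"
print(cell_accuracy(pred, gt))  # 0.5
```

Because the metric zips cells in reading order, a single inserted or dropped cell shifts every later comparison; that is exactly the weakness structure-aware metrics like TEDS are designed to avoid.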
    
  18. Handwritten Math Expressions Dataset

    • kaggle.com
    Updated Dec 31, 2024
    Cite
    GOVINDARAM SRIRAM (2024). Handwritten Math Expressions Dataset [Dataset]. https://www.kaggle.com/datasets/govindaramsriram/handwritten-math-expressions-dataset/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GOVINDARAM SRIRAM
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description:
    This dataset contains images of handwritten mathematical expressions paired with their corresponding textual representations and answers. The expressions include various arithmetic operations such as addition (+), subtraction (-), multiplication (*), division (÷), and parentheses for grouping operations. The dataset is designed to support tasks such as Optical Character Recognition (OCR), handwritten text recognition, and sequence modeling for solving mathematical expressions.

    Key Features:

    • Images: Contains high-quality images of handwritten mathematical equations.
    • Annotations: A CSV file with two columns:
      • Expression: The mathematical expression in text form.
      • Answer: The evaluated result of the expression.
    • Complexity: Includes basic operations, grouped expressions with parentheses, and diverse handwriting styles to simulate real-world challenges.
    • Applications: Ideal for developing and benchmarking OCR systems, training deep learning models, and fine-tuning pretrained models for handwritten text recognition.

    This dataset serves as a valuable resource for researchers and practitioners working on handwriting recognition and mathematical problem-solving automation.
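Because each annotation row pairs an Expression string with its evaluated Answer, the labels can be sanity-checked programmatically. A minimal sketch using a restricted AST evaluator (the Expression/Answer column names come from the CSV described above; the function name and sample row are illustrative):

```python
import ast
import operator

# Only the operators the dataset uses: +, -, *, ÷, and unary minus.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.USub: operator.neg}

def evaluate(expr: str) -> float:
    """Safely evaluate an arithmetic expression string."""
    expr = expr.replace("÷", "/")  # normalize the handwritten division sign
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

# Check one annotation row: does the recorded answer match the expression?
row = {"Expression": "(3 + 4) * 2", "Answer": "14"}
assert evaluate(row["Expression"]) == float(row["Answer"])
```

Restricting evaluation to a whitelisted set of AST nodes avoids the code-execution risk of calling `eval` on annotation strings.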

  19. textocr-gpt4v

    • huggingface.co
    Updated Apr 3, 2015
    + more versions
    Cite
    Jimmy Carter (2015). textocr-gpt4v [Dataset]. https://huggingface.co/datasets/jimmycarter/textocr-gpt4v
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 3, 2015
    Authors
    Jimmy Carter
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for TextOCR-GPT4V

      Dataset Summary
    

    TextOCR-GPT4V is Meta's TextOCR dataset captioned with an emphasis on text OCR using GPT4V. To get the images, you will need to agree to their terms of service.

      Supported Tasks
    

    The TextOCR-GPT4V dataset is intended for generating benchmarks that compare an MLLM to GPT4V.

      Languages
    

    The captions are in English, while the texts in the images are in many languages, such as Spanish and Japanese… See the full description on the dataset page: https://huggingface.co/datasets/jimmycarter/textocr-gpt4v.

  20. impresso-ocr-benchmark-impresso-nzz

    • huggingface.co
    Cite
    Emanuela Boros, impresso-ocr-benchmark-impresso-nzz [Dataset]. https://huggingface.co/datasets/emanuelaboros/impresso-ocr-benchmark-impresso-nzz
    Explore at:
    Authors
    Emanuela Boros
    Description

    emanuelaboros/impresso-ocr-benchmark-impresso-nzz dataset hosted on Hugging Face and contributed by the HF Datasets community
