55 datasets found

OCR Document Text Recognition Dataset
kaggle.com
Updated Sep 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 7, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Training Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F6986071a88d8a9829fee98d5b49d9ff8%2FMacBook%20Air%20-%201%20(1).png?generation=1691059158337136&alt=media" alt="">

Dataset structure

images - contains of original images of documents

boxes - includes bounding box labeling for the original images

annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

Data Format

Each image from images folder is accompanied by an XML-annotation in the annotations.xml file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.

Labels for the text:

"Text Title" - corresponds to titles, the box is red

"Text Paragraph" - corresponds to paragraphs of text, the box is blue

"Table" - corresponds to the table, the box is green

"Handwritten" - corresponds to handwritten text, the box is purple

Example of XML file structure

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F38e02db515561a30e29faca9f5b176b0%2Fcarbon.png?generation=1691058761924879&alt=media" alt="">

Text Detection in the Documents might be made in accordance with your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
d
ID's photo Dataset | 67 countries | 11 types of documents | Document...
datarade.ai
.jpg, .jpeg, .png
Updated Jul 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FileMarket (2025). ID's photo Dataset | 67 countries | 11 types of documents | Document Recognition | OCR Training | Computer Vision [Dataset]. https://datarade.ai/data-products/id-s-photo-dataset-67-countries-11-types-of-documents-d-filemarket
Explore at:
.jpg, .jpeg, .pngAvailable download formats
Dataset updated
Jul 25, 2025
Dataset authored and provided by
FileMarket
Area covered
Bulgaria, Indonesia, Mexico, France, Sri Lanka, Cuba, Peru, Venezuela (Bolivarian Republic of), Egypt, Benin
Description
Total individuals: 1661 Total images: 3623 Images per users: 2.18

Top Countries: - Nigeria 44,6% - United States of America 7,2% - Bangladesh 7,1% - Ethiopia 6,7% - Indonesia 4,8% - India 4,8% - Kenya 2,4% - Iran 2,3% - Nepal 1,7% - Pakistan 1,4% (Total 67 countries)

Type of documents: - Identification Card (ID Card) 63,2% - Driver's License 6,4% - Student ID 4,9% - International passport 2,8% - Domestic passport 0,8% - Residence Permit 0,7% - Military ID 0,4% - Health Insurance Card 0,2%

Data is organized in per‑user folders and includes rich metadata.

Within a folder you may find: (a) multiple document categories for the same person, and/or (b) repeated captures of the same document against different backgrounds or lighting setups. The maximum volume per individual is 28 images.

Metadata includes country of document, type of document, created date, last name, first name, day of birth, month of birth and year of birth.

Every image was provided with explicit user consent. This ensures downstream use cases—such as training and evaluating document detection, classification, text extraction, and identity authentication models—are supported by legally sourced data.
i
OCR Telugu Image Dataset
ieee-dataport.org
Updated Dec 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kadavakollu Rao (2023). OCR Telugu Image Dataset [Dataset]. https://ieee-dataport.org/documents/ocr-telugu-image-dataset
Explore at:
Dataset updated
Dec 8, 2023
Authors
Kadavakollu Rao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The choice of the dataset is the key for OCR systems. Unfortunately
Optical Character Recognition
sdiinnovation-geoplatform.hub.arcgis.com
Updated May 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Esri (2023). Optical Character Recognition [Dataset]. https://sdiinnovation-geoplatform.hub.arcgis.com/content/8b56ed53e34b4304a5b8b826a7512ab0
Explore at:
Dataset updated
May 18, 2023
Dataset authored and provided by
Esrihttp://esri.com/
Description
Text labels are an integral part of cadastral maps and floor plans. Text is also prevalent in natural scenes around us in the form of road signs, billboards, house numbers and place names. Extracting this text can provide additional context and details about the places the text describes and the information it conveys. Digitization of documents and extracting texts from them helps in retrieving and archiving of important information.This deep learning model is based on the MMOCR model and uses optical character recognition (OCR) technology to detect text in images. This model was trained on a large dataset of different types and styles of text with diverse background and contexts, allowing for precise text extraction. It can be applied to various tasks such as automatically detecting and reading text from documents, sign boards, scanned maps, etc., thereby converting images containing text to actionable data.Using the modelFollow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.Fine-tuning the modelThis model cannot be fine-tuned using ArcGIS tools.InputHigh-resolution, 3-band street-level imagery/oriented imagery, scanned maps, or documents, with medium to large size text.OutputA feature layer with the recognized text and bounding box around it.Model architectureThis model is based on the open-source MMOCR model by MMLab.Sample resultsHere are a few results from the model.
S
A dataset of Manchu ancient book word images for OCR tasks, China,...
scidb.cn
Updated May 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sun Haipeng; Tao Wenhao; Bi Xiaojun (2025). A dataset of Manchu ancient book word images for OCR tasks, China, 1733–1867. [Dataset]. http://doi.org/10.57760/sciencedb.25676
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.25676
Dataset updated
May 29, 2025
Dataset provided by
Science Data Bank
Authors
Sun Haipeng; Tao Wenhao; Bi Xiaojun
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
China
Description
This dataset consists of 24,280 high-resolution word images extracted from Manchu ancient books dating from 1733 to 1867, collected within the present-day territory of China. The images were sourced from the Series of Rare Ancient Books in Manchu and Chinese curated by the National Library of China. Each of the 2,428 unique Manchu words in the dataset is represented by exactly 10 distinct image samples, resulting in a balanced and well-structured dataset suitable for training and evaluating deep learning models in the task of Manchu OCR (optical character recognition).This dataset was constructed using a semi-automated workflow to address the challenges posed by manual segmentation of historical scripts—such as high annotation costs and time-consuming processing—and to preserve the visual details of each page. The image acquisition process involved high-precision scanning at 600 dpi. Word regions were first identified using computer vision algorithms, followed by manual verification and correction to ensure the accuracy and completeness of the extracted samples.All images are stored in standard .jpg format with consistent resolution and naming conventions. The dataset is divided into structured folders by word category, and accompanying metadata files provide annotations, including word labels, file paths, and page source references. The released version has no missing data entries, and the dataset has been quality-checked to exclude samples with severe degradation, such as illegible characters, torn pages, or significant shadowing.To our knowledge, this is the largest publicly available Manchu word image dataset to date. It offers a valuable resource for researchers in historical document analysis, Manchu linguistics, and machine learning-based OCR. The dataset can be used for model training and evaluation, benchmarking segmentation algorithms, and exploring multimodal representations of Manchu script.
h
idl-wds
huggingface.co
Updated Mar 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pixel Parsing (2024). idl-wds [Dataset]. https://huggingface.co/datasets/pixparse/idl-wds
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 30, 2024
Dataset authored and provided by
Pixel Parsing
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Card for Industry Documents Library (IDL)

Dataset Summary

Industry Documents Library (IDL) is a document dataset filtered from UCSF documents library with 19 million pages kept as valid samples. Each document exists as a collection of a pdf, a tiff image with the same contents rendered, a json file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.
h
my-ocr-output
huggingface.co
Updated Aug 14, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
phuong khach (2025). my-ocr-output [Dataset]. https://huggingface.co/datasets/phuongkhanh123/my-ocr-output
Explore at:
Dataset updated
Aug 14, 2025
Authors
phuong khach
Description
Document OCR using Nanonets-OCR-s

This dataset contains markdown-formatted OCR results from images in /content/input using Nanonets-OCR-s.

Processing Details

Source Dataset: /content/input Model: nanonets/Nanonets-OCR-s Number of Samples: 32 Processing Time: 7.9 minutes Processing Date: 2025-08-14 04:32 UTC

Configuration

Image Column: image Output Column: markdown Dataset Split: train Batch Size: 32 Max Model Length: 8,192 tokens Max Output Tokens: 4,096… See the full description on the dataset page: https://huggingface.co/datasets/phuongkhanh123/my-ocr-output.
q
Arabic OCR Corpus v.2 (2,894 items from QNL Collection)
manara.qnl.qa
csv
Updated Nov 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qatar National Library (2024). Arabic OCR Corpus v.2 (2,894 items from QNL Collection) [Dataset]. http://doi.org/10.57945/manara.26984785.v2
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.57945/manara.26984785.v2
Dataset updated
Nov 12, 2024
Dataset provided by
Manara - Qatar Research Repository
Authors
Qatar National Library
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Dataset contentsThis dataset is an OCR text corpus of 2,984 printed works (monographs and serials) from the collection of the Qatar National Library. All works are mostly in Arabic language, but fragments of texts in other languages can also be found. Besides the OCR text, the basic descriptive metadata for each item is also provided.Release note for version 2 of the datasetThe dataset of OCRed Arabic books has been fully updated to ensure consistency and quality. All items in the dataset have now been processed using the latest retrained data. Furthermore, every item has undergone a thorough visual quality assurance check conducted using a representative sample of pages. This update has resulted in a significant enhancement of word-level accuracy across the entire dataset, ensuring higher reliability and usability.The exact list of files changed between version 1 and version 2 of the dataset can be determined by comparing the SHA256 checksums provided with each dataset version (see below for details).Dataset structureThe dataset consists of three files:QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata of 2,894 items from the Qatar National Library collection. Both files have the same content and are structured into the following columns:CALL #(ITEM) - Item call number in the QNL catalogRECORD #(ITEM) - Item record number in the QNL catalog (unique for each item)Repository URL - URL to digitized item content in the QNL repositoryCatalog URL - URL to the complete item metadata record in the QNL catalogAUTHOR - Main author information for the itemADD AUTHOR - Additional author information for the itemPUB INFO - Item publication infoTITLE - Item titleDESCRIPTION - Item descriptionVOLUME - Item volume information (in case of some serial publications)QNL_ArabicOCR_Corpus-v2.zip contains:2,894 text files with the following naming pattern: [unique item record number]-[unique item QNL repository id].txt. The unique item record number should be used to match each file with a related metadata record. Each file contains text extracted from a particular item using OCR software.checksums.sha256 - contains SHA256 checksums for all 2,894 text files
Dataset of invoices and receipts including annotation of relevant fields
zenodo.org
zip
Updated Apr 3, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Francisco Cruz; Francisco Cruz; Mauro Castelli; Mauro Castelli (2022). Dataset of invoices and receipts including annotation of relevant fields [Dataset]. http://doi.org/10.5281/zenodo.6371710
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6371710
Dataset updated
Apr 3, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Francisco Cruz; Francisco Cruz; Mauro Castelli; Mauro Castelli
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a dataset comprising 813 images of invoices and receipts of a private company in the Portuguese language. It also includes text files with the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference.
E
Dataset of ICDAR 2019 Competition on Post-OCR Text Correction
live.european-language-grid.eu
zenodo.org
+1more
txt
Updated Sep 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). Dataset of ICDAR 2019 Competition on Post-OCR Text Correction [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7738
Explore at:
txtAvailable download formats
Dataset updated
Sep 12, 2022
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Corpus for the ICDAR2019 Competition on Post-OCR Text Correction (October 2019)Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreuxhttp://l3i.univ-larochelle.fr/ICDAR2019PostOCR-------------------------------------------------------------------------------These are the supplementary materials for the ICDAR 2019 paper ICDAR 2019 Competition on Post-OCR Text CorrectionPlease use the following citation:@inproceedings{rigaud2019pocr,title=""ICDAR 2019 Competition on Post-OCR Text Correction"",author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},year={2019},booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}}
Description: The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource. Repartition of the dataset- ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.- ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset used for the evaluation (with Gold Standard made publicly after the competition).- ICDAR2019_Post_OCR_correction_full_22M: full dataset made publicly available after the competition. Special case for Finnish language Material from the National Library of Finland (Finnish dataset FI > FI1) are not allowed to be re-shared on other website. Please follow these guidelines to get and format the data from the original website.1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en;2. Download OCR Ground Truth Pages (Finnish Fraktur) [v1](4.8GB) from Digitalia (2015-17) package;3. Convert the Excel file ""~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx"" as Comma Separated Format (.csv) by using save as function in a spreadsheet software (e.g. Excel, Calc) and copy it into ""FI/FI1/HOWTO_get_data/input/"";4. Go to ""FI/FI1/HOWTO_get_data/"" and run ""script_1.py"" to generate the full ""FI1"" dataset in ""output/full/"";4. Run ""script_2.py"" to split the ""output/full/"" dataset into ""output/training/"" and ""output/evaluation/"" sub sets.At the end of the process, you should have a ""training"", ""evaluation"" and ""full"" folder with 1579528, 380817 and 1960345 characters respectively.
Licenses: free to use for non-commercial uses, according to sources in details- BG1: IMPACT - National Library of Bulgaria: CC BY NC ND- CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA- DE1: Front pages of Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)- DE2: IMPACT - German National Library: CC BY NC ND- DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)- DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)- DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)- DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)- DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)- EN1: IMPACT - British Library: CC BY NC SA 3.0- ES1: IMPACT - National Library of Spain: CC BY NC SA- FI1: National Library of Finland: no re-sharing allowed, follow the above section to get the data. (https://digi.kansalliskirjasto.fi/opendata)- FR1: HIMANIS Project: CC0 (https://www.himanis.org)- FR2: IMPACT - National Library of France: CC BY NC SA 3.0- FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)- NL1: IMPACT - National library of the Netherlands: CC BY- PL1: IMPACT - National Library of Poland: CC BY- SL1: IMPACT - Slovak National Library: CC BY NCText post-processing such as cleaning and alignment have been applied on the resources mentioned above, so that the Gold Standard and the OCRs provided are not necessarily identical to the originals.
Structure- **Content** [./lang_type/sub_folder/#.txt] - ""[OCR_toInput] "" => Raw OCRed text to be de-noised. - ""[OCR_aligned] "" => Aligned OCRed text. - ""[ GS_aligned] "" => Aligned Gold Standard text.The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using ""@"" symbols. ""#"" symbols correspond to the absence of GS either related to alignment uncertainties or related to unreadable characters in the source document. For a better view of the alignment, make sure to disable the ""word wrap"" option in your text editor.The Error Rate and the quality of the alignment vary according to the nature and the state of degradation of the source documents. Periodicals (mostly historical newspapers) for example, due to their complex layout and their original fonts have been reported to be especially challenging. In addition, it should be mentioned that the quality of Gold Standard also varies as the dataset aggregates resources from different projects that have their own annotation procedure, and obviously contains some errors.
ICDAR2019 competitionInformation related to the tasks, formats and the evaluation metrics are details on :https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation
References - IMPACT, European Commission's 7th Framework Program, grant agreement 215064 - Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. - https://digi.nationallibrary.fi , Wiipuri, 31.12.1904, Digital Collections of National Library of Finland- EU Horizon 2020 research and innovation programme grant agreement No 770299
Contact- christophe.rigaud(at)univ-lr.fr- antoine.doucet(at)univ-lr.fr- mickael.coustaty(at)univ-lr.fr- jean-philippe.moreux(at)bnf.frL3i - University of la Rochelle, http://l3i.univ-larochelle.frBnF - French National Library, http://www.bnf.fr
h
pdfa-eng-wds
huggingface.co
Updated Mar 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pixel Parsing (2024). pdfa-eng-wds [Dataset]. https://huggingface.co/datasets/pixparse/pdfa-eng-wds
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 30, 2024
Dataset authored and provided by
Pixel Parsing
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Card for PDF Association dataset (PDFA)

Dataset Summary

PDFA dataset is a document dataset filtered from the SafeDocs corpus, aka CC-MAIN-2021-31-PDF-UNTRUNCATED. The original purpose of that corpus is for comprehensive pdf documents analysis. The purpose of that subset differs in that regard, as focus has been done on making the dataset machine learning-ready for vision-language models.

An example page of one pdf document, with added bounding boxes… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/pdfa-eng-wds.
Synthetic dataset for multi-script text line recognition
zenodo.org
application/gzip
Updated Feb 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SVEN NAJEM-MEYER; SVEN NAJEM-MEYER (2025). Synthetic dataset for multi-script text line recognition [Dataset]. http://doi.org/10.5281/zenodo.14840349
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14840349
Dataset updated
Feb 9, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
SVEN NAJEM-MEYER; SVEN NAJEM-MEYER
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Optical Character Recognition (OCR) systems frequently encounter difficulties when processing rare or ancient scripts, especially when they occur in historical contexts involving multiple writing systems. These challenges often constrain researchers to fine-tune or to train new OCR models tailored to their specific needs. To support these efforts, we introduce a synthetic dataset comprising 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Being augmented with artificially degraded lines, the dataset bolsters strong results when used to train historical OCR models. This resource can be used both for training and testing purposes, and is particularly valuable for researchers working with ancient Greek and limited annotated data. The software used to generate this datasets is linked to below on our Git. This is a sample, but please contact us if you would like access to the whole dataset.
Synthetic Printed Words and Test Protocols Data for Bangla OCR
figshare.com
zip
Updated Jun 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Koushik Roy; MD Sazzad Hossain; Pritom Saha; Shadman Rohan; Fuad Rahman; Imranul Ashrafi; Ifty Mohammad Rezwan; B M Mainul Hossain; Ahmedul Kabir; Nabeel Mohammed (2023). Synthetic Printed Words and Test Protocols Data for Bangla OCR [Dataset]. http://doi.org/10.6084/m9.figshare.20186825.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.20186825.v1
Dataset updated
Jun 13, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Koushik Roy; MD Sazzad Hossain; Pritom Saha; Shadman Rohan; Fuad Rahman; Imranul Ashrafi; Ifty Mohammad Rezwan; B M Mainul Hossain; Ahmedul Kabir; Nabeel Mohammed
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Synthetic Printed word image data and test protocols word image Data repository for the paper "A Multifaceted Evaluation of Representation of Graphemes for Practically Effective Bangla OCR." In this paper, we have utilized the popular Convolutional Recurrent Neural Network (CRNN) architecture and implemented our grapheme representation strategies to design the final labels of the model. Due to the absence of a large-scale Bangla word-level printed dataset, we created a synthetically generated Bangla corpus containing 2 million samples that are representative and sufficiently varied in terms of fonts, domain, and vocabulary size to train our Bangla OCR model. To test the various aspects of our model, we have also created 6 test protocols. Finally, to establish the generalizability of our grapheme representation methods, we have performed training and testing on external handwriting datasets. Updates: 10 June 2023: The paper has been accepted for publication in International Journal on Document Analysis and Recognition (IJDAR).
h
india-medical-ocr-test
huggingface.co
Updated Aug 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel van Strien (2025). india-medical-ocr-test [Dataset]. https://huggingface.co/datasets/davanstrien/india-medical-ocr-test
Explore at:
Dataset updated
Aug 7, 2025
Authors
Daniel van Strien
Description
Document OCR using NuMarkdown-8B-Thinking

This dataset contains markdown-formatted OCR results from images in davanstrien/india-medical-test using NuMarkdown-8B-Thinking.

Processing Details

Source Dataset: davanstrien/india-medical-test Model: numind/NuMarkdown-8B-Thinking Number of Samples: 50 Processing Time: 13.3 minutes Processing Date: 2025-08-07 08:04 UTC

Configuration

Image Column: image Output Column: markdown Dataset Split: train Batch Size: 16… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/india-medical-ocr-test.
Smart Document Scanner OCR App Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Jun 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Smart Document Scanner OCR App Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/smart-document-scanner-ocr-app-market
Explore at:
pdf, pptx, csvAvailable download formats
Dataset updated
Jun 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Smart Document Scanner OCR App Market Outlook

According to our latest research, the global Smart Document Scanner OCR App market size reached USD 3.85 billion in 2024, exhibiting robust growth driven by the rapid digitization of workflows and the increasing need for document automation across various sectors. The market is projected to grow at a CAGR of 13.7% from 2025 to 2033, with the market size forecasted to reach USD 11.89 billion by 2033. This significant expansion is primarily attributed to the widespread adoption of mobile devices, advancements in artificial intelligence and machine learning, and the growing demand for efficient document management solutions in both personal and professional environments.

One of the primary growth factors fueling the Smart Document Scanner OCR App market is the accelerating pace of digital transformation across industries such as healthcare, finance, education, and government. Organizations are increasingly seeking ways to streamline their document handling processes, reduce manual data entry errors, and improve operational efficiency. The integration of Optical Character Recognition (OCR) technology into smart document scanning apps enables users to quickly convert paper documents into editable and searchable digital formats, significantly enhancing productivity. Furthermore, the proliferation of remote work and the need for secure, cloud-based document sharing have further heightened the demand for advanced OCR-enabled scanning solutions.

Another significant driver is the continuous innovation in artificial intelligence and machine learning algorithms, which are making OCR technology more accurate, reliable, and versatile. Modern Smart Document Scanner OCR Apps can now recognize a wide range of fonts, languages, and complex layouts, including tables and handwritten notes, with remarkable precision. This technological evolution has broadened the application scope of these apps, allowing them to be used not only for basic document digitization but also for tasks such as invoice processing, identity verification, and compliance management. The incorporation of AI-powered features such as automatic document detection, real-time translation, and advanced data extraction is further propelling market growth.

The increasing penetration of smartphones and mobile devices globally has also played a crucial role in the expansion of the Smart Document Scanner OCR App market. With the majority of the population now having access to high-resolution cameras and powerful processing capabilities on their mobile devices, scanning and digitizing documents has become more convenient than ever. This trend is particularly pronounced in emerging markets, where mobile-first solutions are often preferred over traditional desktop-based applications. Additionally, the growing emphasis on paperless offices and environmental sustainability is encouraging both individuals and enterprises to adopt digital document management practices, thereby boosting the market for OCR-enabled scanner apps.

From a regional perspective, North America currently dominates the global Smart Document Scanner OCR App market, accounting for the largest share in 2024. This is largely due to the high adoption rate of advanced technologies, a mature IT infrastructure, and the presence of leading solution providers in the region. However, Asia Pacific is expected to witness the fastest growth over the forecast period, driven by rapid urbanization, increasing smartphone penetration, and rising investments in digital transformation initiatives across countries such as China, India, and Japan. Europe also presents significant growth opportunities, supported by stringent regulatory requirements for data management and a strong focus on innovation in document processing technologies.

Component Analysis

The Component segment of the Smart Document Scanner OCR App market is bifurcated into Software and Services. The Software sub-segment holds the lion’s share of the market, as the co
d
Data for Optical Character Recognition Applied to Hieratic: Sign...
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tabin, Julius A. (2023). Data for Optical Character Recognition Applied to Hieratic: Sign Identification and Broad Analysis [Dataset]. http://doi.org/10.7910/DVN/D8CWVZ
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/D8CWVZ
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Tabin, Julius A.
Description
This data consists of a number of .zip files containing everything needed to run the hieratic optical character recognition program presented at https://github.com/jtabin/PaPYrus. The files included are as follows: 1. "Dataset By Sign": This is all 13,134 data set images, categorized in folders by their Gardiner sign. Each image is a black and white .png image of a hieratic sign. The signs are labeled with unique identifiers, corresponding in order to their placement in a text from the 1st (0001) to the 9999th (9999), facsimile maker (1 for Möller, 2 for Poe, 3 for Tabin), provenance (1: Thebes, 2: Lahun, 3: Hatnub, 4: Unknown), and original text (1: Shipwrecked Sailor, 2: Eloquent Peasant B1, 3: Eloquent Peasant R, 4: Sinuhe B, 5: Sinuhe R, 6: Papyrus Prisse, 7: Hymn to Senwosret III, 8: Lahun Temple Files, 9: Will of Wah, 10: Texte aus Hatnub, 11: Papyrus Ebers, 12: Rhind Papyrus, 13: Papyrus Westcar). 2. "Dataset Categorized": This is every data set image, as above, categorized in folders by their provenance, text, and facsimile maker (i.e. where the tags originate from). 3. "Dataset Whole": This is every data set image in one folder. This is what is used for the analyses done by the OCR program. 4. "Precalculated Data Set Stats": This is a collection of .csv files outputted by the "Data Set Prep.ipynb" code (code found on the aforementioned GitHub page). "pxls_16.csv", "pxls_20.csv", and "pxls_25.csv" are the pixel values for every sign in the data set, after they were resized to 16, 20, and 25 pixels, respectively. "datasetstats.csv" includes the aspect ratios and sign names for every sign in the data set. The two files beginning with "A1cut" are the same stats, but after every A1 sign had its tail manually cut off. 5. "Precalculated OCR Results": This is a collection of .csv files outputted by the "Image Identification.ipynb" code (also found on the GitHub page). The files are mostly the product of all of one sign from the data set being run through the OCR program and they are labeled with the name of the sign. These result in columns of signs and their similarity scores when compared to other signs. Some files, such as "randsamp_fullresults.csv", come from other analyses explained in their file names (that file, for instance, is a random sample from the data set).
k
Identity Document
koncile.ai
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Koncile.ai, Identity Document [Dataset]. https://www.koncile.ai/en
Explore at:
Dataset provided by
Koncile.ai
License
https://www.koncile.ai/en/termsandconditionshttps://www.koncile.ai/en/termsandconditions
Description
AI-powered OCR to extract all fields from your ID documents (PDF or image). Turn your documents into data via API or SDK. Reliable and customizable.
blip3-ocr-200m
huggingface.co
Updated Sep 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Salesforce (2024). blip3-ocr-200m [Dataset]. https://huggingface.co/datasets/Salesforce/blip3-ocr-200m
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 5, 2024
Dataset provided by
Salesforce Inchttp://salesforce.com/
Authors
Salesforce
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
BLIP3-OCR-200M Dataset

Overview

The BLIP3-OCR-200M dataset is designed to address the limitations of current Vision-Language Models (VLMs) in processing and interpreting text-rich images, such as documents and charts. Traditional image-text datasets often struggle to capture nuanced textual information, which is crucial for tasks requiring complex text comprehension and reasoning.

Key Features

OCR Integration: The dataset incorporates Optical Character… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/blip3-ocr-200m.
2999 Tamil Characters Processed and Classified
kaggle.com
Updated Apr 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joseph Mathew Chakaramakkil (2023). 2999 Tamil Characters Processed and Classified [Dataset]. https://www.kaggle.com/datasets/joch2722/3k-tamil-chars-processed-and-classified/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 28, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Joseph Mathew Chakaramakkil
Description
This dataset is a processed and classified (labelled) dataset for Tamil OCR derived from this dataset from this notebook. I re-uploaded this dataset for use with my copy of the notebook but also publish it here so that it might be useful to others. I suspect the original datasets in the notebook were unintentionally left private as the author provided the Google Drive link to the files in their public notebook. I have prepared this dataset for sharing here in the hopes it may be useful. Licensing information was not provided with the original dataset. Please direct licensing queries to either the original dataset publisher or me.

Sample Size: This dataset contains an average of 300±? samples for each of 11 characters, totalling 2999 samples total encoded as TIFF files. Some characters are more similar to each other than others (characters 0 & 1 and characters 9 & 10 are visually similar which confused models when I was training them)

Limited Scope: This dataset contains only 11 characters (e.g. அ–ஓ) (indexed 0–10) and not all possible characters in Tamil

Truncation: The last character class was removed from the original dataset for this dataset as it only contained 1 sample which was unsuitable for model training

File Naming Scheme: The file naming scheme (retained from the original) appears to be u[?]_[character_number]t[sample_number].tiff where character_number indexes the identity of the characters (providing labelling information) and sample_number indexes the samples of said characters. The significance of u[?] is unknown to me but I suspect it corresponds to the identities of the people that hand-wrote the samples

Folder Structure: The samples are organised into folders by the type of character. This fact may be used to generate labels for model creation as it is in the source notebook

Binarisation: The samples are binarised as in the source notebook where black(0) refers to no ink and white(1) refers to an inked pixel. The raw samples were black ink on white background

Resizing Distortion: The resizing process from the source notebook to create this dataset does squish and stretch characters that occupy more rectangular spaces to make them occupy a uniform square space for model training. This may be undesirable depending on your use-case

Lone Characters: The samples shown are for lone characters and not characters within words so intra-word joins in cursive styles are not present

Writing Quality: The quality of the handwriting and the deviation from the typographical versions of these characters varies in this dataset. At least one sample appears to show two subsequent writing attempts overlaid upon each other, which often confused the CNN model in my copy of the source notebook In a later version or fork of this dataset, such especially poor quality samples may be separately, more accurately, classified as illegible
h
OpenDoc-Pdf-Preview
huggingface.co
Updated Jun 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prithiv Sakthi (2025). OpenDoc-Pdf-Preview [Dataset]. https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview
Explore at:
Dataset updated
Jun 25, 2025
Authors
Prithiv Sakthi
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
OpenDoc-Pdf-Preview

OpenDoc-Pdf-Preview is a compact visual preview dataset containing 6,000 high-resolution document images extracted from PDFs. This dataset is designed for Image-to-Text tasks such as document OCR pretraining, layout understanding, and multimodal document analysis.

Dataset Summary

Modality: Image-to-Text Content Type: PDF-based document previews Number of Samples: 6,000 Language: English Format: Parquet Split: train only Size: 606 MB License: Apache… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview.

Facebook

Twitter

Click to copy link

Link copied

Cite

Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2

OCR Document Text Recognition Dataset

Photos of the documents and text - OCR dataset

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Sep 7, 2023

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Training Data

License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically

Description

OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F6986071a88d8a9829fee98d5b49d9ff8%2FMacBook%20Air%20-%201%20(1).png?generation=1691059158337136&alt=media" alt="">

Dataset structure

images - contains of original images of documents
boxes - includes bounding box labeling for the original images
annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

Data Format

Each image from images folder is accompanied by an XML-annotation in the annotations.xml file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.

Labels for the text:

"Text Title" - corresponds to titles, the box is red
"Text Paragraph" - corresponds to paragraphs of text, the box is blue
"Table" - corresponds to the table, the box is green
"Handwritten" - corresponds to handwritten text, the box is purple

Example of XML file structure

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F38e02db515561a30e29faca9f5b176b0%2Fcarbon.png?generation=1691058761924879&alt=media" alt="">

Text Detection in the Documents might be made in accordance with your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text

Clear search

Close search

Google apps

Main menu

OCR Document Text Recognition Dataset

OCR Text Detection in the Documents Object Detection dataset

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

Dataset structure

Data Format

Labels for the text:

Example of XML file structure

Text Detection in the Documents might be made in accordance with your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

ID's photo Dataset | 67 countries | 11 types of documents | Document...

OCR Telugu Image Dataset

Optical Character Recognition

A dataset of Manchu ancient book word images for OCR tasks, China,...

idl-wds

my-ocr-output

Arabic OCR Corpus v.2 (2,894 items from QNL Collection)

Dataset of invoices and receipts including annotation of relevant fields

Dataset of ICDAR 2019 Competition on Post-OCR Text Correction

pdfa-eng-wds

Synthetic dataset for multi-script text line recognition

Synthetic Printed Words and Test Protocols Data for Bangla OCR

india-medical-ocr-test

Smart Document Scanner OCR App Market Research Report 2033

Smart Document Scanner OCR App Market Outlook

Component Analysis

Data for Optical Character Recognition Applied to Hieratic: Sign...

Identity Document

blip3-ocr-200m

2999 Tamil Characters Processed and Classified

OpenDoc-Pdf-Preview

OCR Document Text Recognition DatasetSee More Versions

Photos of the documents and text - OCR dataset

OCR Text Detection in the Documents Object Detection dataset

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

Dataset structure

Data Format

Labels for the text:

Example of XML file structure

Text Detection in the Documents might be made in accordance with your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

OCR Document Text Recognition Dataset