55 datasets found
  1. OCR Document Text Recognition Dataset

    • kaggle.com
    Updated Sep 7, 2023
    + more versions
    Cite
    Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    OCR Text Detection in the Documents Object Detection dataset

    The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

    The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

    💴 For commercial usage: leave a request on TrainingData to discuss your requirements, learn the price, and buy the dataset.

    The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.


    Dataset structure

    • images - contains the original images of the documents
    • boxes - includes the bounding-box labeling for the original images
    • annotations.xml - contains the coordinates of the bounding boxes and the labels created for the original photos

    Data Format

    Each image in the images folder is accompanied by an XML annotation in the annotations.xml file giving the coordinates of the bounding boxes and the labels for text detection. For each point, the x and y coordinates are provided.

    Labels for the text:

    • "Text Title" - corresponds to titles, the box is red
    • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
    • "Table" - corresponds to the table, the box is green
    • "Handwritten" - corresponds to handwritten text, the box is purple

    Example of XML file structure

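    As a rough sketch of what such an annotations.xml might contain and how to read it, the snippet below assumes a CVAT-style export, which Kaggle object-detection datasets often use; the tag and attribute names here are assumptions, not the dataset's verified schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical CVAT-style snippet; the real annotations.xml may differ.
SAMPLE = """\
<annotations>
  <image id="0" name="document_01.png" width="1240" height="1754">
    <box label="Text Title" xtl="120.5" ytl="80.0" xbr="1100.2" ybr="160.7"/>
    <box label="Text Paragraph" xtl="100.0" ytl="200.0" xbr="1140.0" ybr="900.0"/>
    <box label="Handwritten" xtl="150.0" ytl="950.0" xbr="600.0" ybr="1100.0"/>
  </image>
</annotations>
"""

def parse_boxes(xml_text):
    """Yield (image name, label, (xtl, ytl, xbr, ybr)) for every box."""
    root = ET.fromstring(xml_text)
    for image in root.iter("image"):
        for box in image.iter("box"):
            coords = tuple(float(box.attrib[k]) for k in ("xtl", "ytl", "xbr", "ybr"))
            yield image.attrib["name"], box.attrib["label"], coords

boxes = list(parse_boxes(SAMPLE))
```

    Each tuple pairs a source image with one labeled text region, which is the usual input for training an object-detection model on this kind of data.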

    The Text Detection in the Documents dataset can be tailored in accordance with your requirements.

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text

  2. ID's photo Dataset | 67 countries | 11 types of documents | Document...

    • datarade.ai
    .jpg, .jpeg, .png
    Updated Jul 25, 2025
    Cite
    FileMarket (2025). ID's photo Dataset | 67 countries | 11 types of documents | Document Recognition | OCR Training | Computer Vision [Dataset]. https://datarade.ai/data-products/id-s-photo-dataset-67-countries-11-types-of-documents-d-filemarket
    Explore at:
    Available download formats: .jpg, .jpeg, .png
    Dataset updated
    Jul 25, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Bulgaria, Indonesia, Mexico, France, Sri Lanka, Cuba, Peru, Venezuela (Bolivarian Republic of), Egypt, Benin
    Description

    Total individuals: 1661. Total images: 3623. Images per user: 2.18.

    Top countries (67 countries in total):
    • Nigeria 44.6%
    • United States of America 7.2%
    • Bangladesh 7.1%
    • Ethiopia 6.7%
    • Indonesia 4.8%
    • India 4.8%
    • Kenya 2.4%
    • Iran 2.3%
    • Nepal 1.7%
    • Pakistan 1.4%

    Types of documents:
    • Identification Card (ID Card) 63.2%
    • Driver's License 6.4%
    • Student ID 4.9%
    • International passport 2.8%
    • Domestic passport 0.8%
    • Residence Permit 0.7%
    • Military ID 0.4%
    • Health Insurance Card 0.2%

    Data is organized in per‑user folders and includes rich metadata.

    Within a folder you may find: (a) multiple document categories for the same person, and/or (b) repeated captures of the same document against different backgrounds or lighting setups. The maximum volume per individual is 28 images.

    Metadata includes country of document, type of document, created date, last name, first name, day of birth, month of birth and year of birth.
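    As an illustration only, the per-image metadata fields listed above could be modeled as a small record type. This is a hypothetical sketch; the field names are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DocumentPhotoMeta:
    # Hypothetical field names mirroring the metadata description above.
    country: str          # country of document
    doc_type: str         # e.g. "ID Card", "Driver's License"
    created_date: str
    last_name: str
    first_name: str
    day_of_birth: int
    month_of_birth: int
    year_of_birth: int

meta = DocumentPhotoMeta(
    country="Nigeria", doc_type="ID Card", created_date="2025-07-25",
    last_name="Doe", first_name="Jane",
    day_of_birth=1, month_of_birth=2, year_of_birth=1990,
)
```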

    Every image was provided with explicit user consent. This ensures downstream use cases—such as training and evaluating document detection, classification, text extraction, and identity authentication models—are supported by legally sourced data.

  3. OCR Telugu Image Dataset

    • ieee-dataport.org
    Updated Dec 8, 2023
    Cite
    Kadavakollu Rao (2023). OCR Telugu Image Dataset [Dataset]. https://ieee-dataport.org/documents/ocr-telugu-image-dataset
    Explore at:
    Dataset updated
    Dec 8, 2023
    Authors
    Kadavakollu Rao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The choice of the dataset is the key for OCR systems. Unfortunately

  4. Optical Character Recognition

    • sdiinnovation-geoplatform.hub.arcgis.com
    Updated May 18, 2023
    Cite
    Esri (2023). Optical Character Recognition [Dataset]. https://sdiinnovation-geoplatform.hub.arcgis.com/content/8b56ed53e34b4304a5b8b826a7512ab0
    Explore at:
    Dataset updated
    May 18, 2023
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    Text labels are an integral part of cadastral maps and floor plans. Text is also prevalent in natural scenes around us in the form of road signs, billboards, house numbers and place names. Extracting this text can provide additional context and details about the places the text describes and the information it conveys. Digitization of documents and extracting text from them helps in retrieving and archiving important information.

    This deep learning model is based on the MMOCR model and uses optical character recognition (OCR) technology to detect text in images. The model was trained on a large dataset of different types and styles of text with diverse backgrounds and contexts, allowing for precise text extraction. It can be applied to various tasks such as automatically detecting and reading text from documents, sign boards, scanned maps, etc., thereby converting images containing text into actionable data.

    Using the model: Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.

    Fine-tuning the model: This model cannot be fine-tuned using ArcGIS tools.

    Input: High-resolution, 3-band street-level imagery/oriented imagery, scanned maps, or documents, with medium to large size text.

    Output: A feature layer with the recognized text and a bounding box around it.

    Model architecture: This model is based on the open-source MMOCR model by MMLab.

    Sample results: Here are a few results from the model.

  5. A dataset of Manchu ancient book word images for OCR tasks, China,...

    • scidb.cn
    Updated May 29, 2025
    Cite
    Sun Haipeng; Tao Wenhao; Bi Xiaojun (2025). A dataset of Manchu ancient book word images for OCR tasks, China, 1733–1867. [Dataset]. http://doi.org/10.57760/sciencedb.25676
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 29, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Sun Haipeng; Tao Wenhao; Bi Xiaojun
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    China
    Description

    This dataset consists of 24,280 high-resolution word images extracted from Manchu ancient books dating from 1733 to 1867, collected within the present-day territory of China. The images were sourced from the Series of Rare Ancient Books in Manchu and Chinese curated by the National Library of China. Each of the 2,428 unique Manchu words in the dataset is represented by exactly 10 distinct image samples, resulting in a balanced and well-structured dataset suitable for training and evaluating deep learning models in the task of Manchu OCR (optical character recognition).

    This dataset was constructed using a semi-automated workflow to address the challenges posed by manual segmentation of historical scripts, such as high annotation costs and time-consuming processing, and to preserve the visual details of each page. The image acquisition process involved high-precision scanning at 600 dpi. Word regions were first identified using computer vision algorithms, followed by manual verification and correction to ensure the accuracy and completeness of the extracted samples.

    All images are stored in standard .jpg format with consistent resolution and naming conventions. The dataset is divided into structured folders by word category, and accompanying metadata files provide annotations, including word labels, file paths, and page source references. The released version has no missing data entries, and the dataset has been quality-checked to exclude samples with severe degradation, such as illegible characters, torn pages, or significant shadowing.

    To our knowledge, this is the largest publicly available Manchu word image dataset to date. It offers a valuable resource for researchers in historical document analysis, Manchu linguistics, and machine learning-based OCR. The dataset can be used for model training and evaluation, benchmarking segmentation algorithms, and exploring multimodal representations of Manchu script.
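    A folder-per-word layout like the one described can be indexed for training with a short traversal. The folder and file naming below is hypothetical; consult the dataset's metadata files for the real layout.

```python
import collections
import os
import tempfile

def build_index(root):
    """Map each word-category folder (one folder per Manchu word) to its
    sorted list of image file paths."""
    index = collections.defaultdict(list)
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        if os.path.isdir(folder):
            index[label] = sorted(
                os.path.join(folder, name)
                for name in os.listdir(folder)
                if name.lower().endswith(".jpg")
            )
    return index

# Tiny synthetic fixture: 2 word categories x 3 samples each
# (the real dataset has 2,428 categories x 10 samples).
root = tempfile.mkdtemp()
for label in ("word_0001", "word_0002"):
    os.makedirs(os.path.join(root, label))
    for i in range(3):
        open(os.path.join(root, label, f"{label}_{i:02d}.jpg"), "wb").close()

index = build_index(root)
```

    Because every category holds exactly 10 samples in the released data, a balanced train/test split can be made per folder rather than over the whole pool.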

  6. idl-wds

    • huggingface.co
    Updated Mar 30, 2024
    Cite
    Pixel Parsing (2024). idl-wds [Dataset]. https://huggingface.co/datasets/pixparse/idl-wds
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Pixel Parsing
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for Industry Documents Library (IDL)

      Dataset Summary
    

    Industry Documents Library (IDL) is a document dataset filtered from UCSF documents library with 19 million pages kept as valid samples. Each document exists as a collection of a pdf, a tiff image with the same contents rendered, a json file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.
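    The per-document Textract JSON files can be explored with a few lines of JSON handling. The snippet below uses a minimal Textract-style payload (a "Blocks" list with "BlockType" and "Text" fields); real IDL annotation files contain many more fields, such as geometry and confidence scores.

```python
import json

# Minimal Textract-style payload; the real files are far more extensive.
SAMPLE = json.dumps({
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "INDUSTRY DOCUMENTS LIBRARY"},
        {"BlockType": "LINE", "Text": "Page 1 of 3"},
        {"BlockType": "WORD", "Text": "INDUSTRY"},
    ]
})

def extract_lines(payload):
    """Collect the text of LINE blocks from a Textract-style response."""
    return [b["Text"] for b in json.loads(payload).get("Blocks", [])
            if b.get("BlockType") == "LINE"]

lines = extract_lines(SAMPLE)
```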

  7. my-ocr-output

    • huggingface.co
    Updated Aug 14, 2025
    + more versions
    Cite
    phuong khach (2025). my-ocr-output [Dataset]. https://huggingface.co/datasets/phuongkhanh123/my-ocr-output
    Explore at:
    Dataset updated
    Aug 14, 2025
    Authors
    phuong khach
    Description

    Document OCR using Nanonets-OCR-s

    This dataset contains markdown-formatted OCR results from images in /content/input using Nanonets-OCR-s.

      Processing Details
    

    Source Dataset: /content/input
    Model: nanonets/Nanonets-OCR-s
    Number of Samples: 32
    Processing Time: 7.9 minutes
    Processing Date: 2025-08-14 04:32 UTC

      Configuration
    

    Image Column: image
    Output Column: markdown
    Dataset Split: train
    Batch Size: 32
    Max Model Length: 8,192 tokens
    Max Output Tokens: 4,096…

    See the full description on the dataset page: https://huggingface.co/datasets/phuongkhanh123/my-ocr-output.

  8. Arabic OCR Corpus v.2 (2,894 items from QNL Collection)

    • manara.qnl.qa
    csv
    Updated Nov 12, 2024
    Cite
    Qatar National Library (2024). Arabic OCR Corpus v.2 (2,894 items from QNL Collection) [Dataset]. http://doi.org/10.57945/manara.26984785.v2
    Explore at:
    Available download formats: csv
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Manara - Qatar Research Repository
    Authors
    Qatar National Library
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dataset contents

    This dataset is an OCR text corpus of 2,894 printed works (monographs and serials) from the collection of the Qatar National Library. The works are mostly in the Arabic language, but fragments of text in other languages can also be found. Besides the OCR text, basic descriptive metadata for each item is also provided.

    Release note for version 2 of the dataset

    The dataset of OCRed Arabic books has been fully updated to ensure consistency and quality. All items in the dataset have now been processed using the latest retrained data. Furthermore, every item has undergone a thorough visual quality assurance check conducted on a representative sample of pages. This update has resulted in a significant enhancement of word-level accuracy across the entire dataset, ensuring higher reliability and usability. The exact list of files changed between version 1 and version 2 of the dataset can be determined by comparing the SHA256 checksums provided with each dataset version (see below for details).

    Dataset structure

    The dataset consists of three files:

    • QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata for the 2,894 items from the Qatar National Library collection. Both files have the same content and are structured into the following columns:
      • CALL #(ITEM) - item call number in the QNL catalog
      • RECORD #(ITEM) - item record number in the QNL catalog (unique for each item)
      • Repository URL - URL to the digitized item content in the QNL repository
      • Catalog URL - URL to the complete item metadata record in the QNL catalog
      • AUTHOR - main author information for the item
      • ADD AUTHOR - additional author information for the item
      • PUB INFO - item publication info
      • TITLE - item title
      • DESCRIPTION - item description
      • VOLUME - item volume information (in the case of some serial publications)
    • QNL_ArabicOCR_Corpus-v2.zip contains:
      • 2,894 text files with the naming pattern [unique item record number]-[unique item QNL repository id].txt. The unique item record number should be used to match each file with its related metadata record. Each file contains the text extracted from a particular item using OCR software.
      • checksums.sha256 - SHA256 checksums for all 2,894 text files

  9. Dataset of invoices and receipts including annotation of relevant fields

    • zenodo.org
    zip
    Updated Apr 3, 2022
    Cite
    Francisco Cruz; Mauro Castelli (2022). Dataset of invoices and receipts including annotation of relevant fields [Dataset]. http://doi.org/10.5281/zenodo.6371710
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 3, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francisco Cruz; Mauro Castelli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset comprising 813 images of invoices and receipts of a private company in the Portuguese language. It also includes text files with the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference.

  10. Dataset of ICDAR 2019 Competition on Post-OCR Text Correction

    • live.european-language-grid.eu
    • zenodo.org
    • +1more
    txt
    Updated Sep 12, 2022
    Cite
    (2022). Dataset of ICDAR 2019 Competition on Post-OCR Text Correction [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7738
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 12, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Corpus for the ICDAR 2019 Competition on Post-OCR Text Correction (October 2019)
    Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
    http://l3i.univ-larochelle.fr/ICDAR2019PostOCR

    These are the supplementary materials for the ICDAR 2019 paper "ICDAR 2019 Competition on Post-OCR Text Correction". Please use the following citation:

    @inproceedings{rigaud2019pocr,
      title={ICDAR 2019 Competition on Post-OCR Text Correction},
      author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
      year={2019},
      booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}}

    Description: The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource.

    Repartition of the dataset:
    • ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.
    • ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset, used for the evaluation (with the Gold Standard made public after the competition).
    • ICDAR2019_Post_OCR_correction_full_22M: full dataset, made publicly available after the competition.

    Special case for the Finnish language: material from the National Library of Finland (Finnish dataset FI > FI1) is not allowed to be re-shared on other websites. Please follow these guidelines to get and format the data from the original website:
    1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en;
    2. Download "OCR Ground Truth Pages (Finnish Fraktur) [v1]" (4.8 GB) from the Digitalia (2015-17) package;
    3. Convert the Excel file "~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx" to Comma Separated Format (.csv) using the save-as function in a spreadsheet application (e.g. Excel, Calc) and copy it into "FI/FI1/HOWTO_get_data/input/";
    4. Go to "FI/FI1/HOWTO_get_data/" and run "script_1.py" to generate the full "FI1" dataset in "output/full/";
    5. Run "script_2.py" to split the "output/full/" dataset into "output/training/" and "output/evaluation/" subsets.

    At the end of the process, you should have a "training", an "evaluation" and a "full" folder with 1,579,528, 380,817 and 1,960,345 characters respectively.

    Licenses: free to use for non-commercial purposes, according to the sources, in detail:
    • BG1: IMPACT - National Library of Bulgaria: CC BY NC ND
    • CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA
    • DE1: Front pages of the Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)
    • DE2: IMPACT - German National Library: CC BY NC ND
    • DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • EN1: IMPACT - British Library: CC BY NC SA 3.0
    • ES1: IMPACT - National Library of Spain: CC BY NC SA
    • FI1: National Library of Finland: no re-sharing allowed; follow the section above to get the data (https://digi.kansalliskirjasto.fi/opendata)
    • FR1: HIMANIS Project: CC0 (https://www.himanis.org)
    • FR2: IMPACT - National Library of France: CC BY NC SA 3.0
    • FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)
    • NL1: IMPACT - National Library of the Netherlands: CC BY
    • PL1: IMPACT - National Library of Poland: CC BY
    • SL1: IMPACT - Slovak National Library: CC BY NC

    Text post-processing such as cleaning and alignment has been applied to the resources mentioned above, so the Gold Standard and the OCRs provided are not necessarily identical to the originals.

    Structure:
    • Content [./lang_type/sub_folder/#.txt]
      • "[OCR_toInput]" => raw OCRed text to be de-noised.
      • "[OCR_aligned]" => aligned OCRed text.
      • "[ GS_aligned]" => aligned Gold Standard text.

    The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using "@" symbols. "#" symbols correspond to the absence of GS, related either to alignment uncertainties or to unreadable characters in the source document. For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.

    The Error Rate and the quality of the alignment vary according to the nature and the state of degradation of the source documents. Periodicals (mostly historical newspapers), for example, have been reported to be especially challenging due to their complex layout and original fonts. In addition, it should be mentioned that the quality of the Gold Standard also varies, as the dataset aggregates resources from different projects that have their own annotation procedures, and it obviously contains some errors.
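    As an illustration of this alignment convention, a minimal error counter over one aligned pair might look like the sketch below. It is a rough approximation for exploring the data, not the competition's official evaluation code, and it simply treats disagreeing "@" padding as an error.

```python
def character_errors(ocr_aligned, gs_aligned):
    """Count character mismatches between aligned OCR and Gold Standard
    strings. Positions where the GS is '#' (no usable ground truth) are
    skipped; '@' padding that disagrees with the GS counts as an error."""
    assert len(ocr_aligned) == len(gs_aligned)
    errors = total = 0
    for o, g in zip(ocr_aligned, gs_aligned):
        if g == "#":
            continue
        total += 1
        if o != g:
            errors += 1
    return errors, total

# Toy aligned pair in the "@"/"#" convention described above.
ocr = "Tne q@ick"
gs  = "The qu#ck"
errors, total = character_errors(ocr, gs)
```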

    ICDAR 2019 competition: information related to the tasks, formats and evaluation metrics is detailed at https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation

    References:
    • IMPACT, European Commission's 7th Framework Program, grant agreement 215064
    • Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
    • https://digi.nationallibrary.fi , Wiipuri, 31.12.1904, Digital Collections of the National Library of Finland
    • EU Horizon 2020 research and innovation programme, grant agreement No 770299

    Contact:
    • christophe.rigaud(at)univ-lr.fr
    • antoine.doucet(at)univ-lr.fr
    • mickael.coustaty(at)univ-lr.fr
    • jean-philippe.moreux(at)bnf.fr
    L3i - University of La Rochelle, http://l3i.univ-larochelle.fr
    BnF - French National Library, http://www.bnf.fr

  11. pdfa-eng-wds

    • huggingface.co
    Updated Mar 30, 2024
    Cite
    Pixel Parsing (2024). pdfa-eng-wds [Dataset]. https://huggingface.co/datasets/pixparse/pdfa-eng-wds
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Pixel Parsing
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for PDF Association dataset (PDFA)

      Dataset Summary
    

    The PDFA dataset is a document dataset filtered from the SafeDocs corpus, aka CC-MAIN-2021-31-PDF-UNTRUNCATED. The original purpose of that corpus is comprehensive PDF document analysis. The purpose of this subset differs in that regard, as the focus has been on making the dataset machine-learning-ready for vision-language models.

    An example page of one pdf document, with added bounding boxes… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/pdfa-eng-wds.
    
  12. Synthetic dataset for multi-script text line recognition

    • zenodo.org
    application/gzip
    Updated Feb 9, 2025
    Cite
    Sven Najem-Meyer (2025). Synthetic dataset for multi-script text line recognition [Dataset]. http://doi.org/10.5281/zenodo.14840349
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Feb 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sven Najem-Meyer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optical Character Recognition (OCR) systems frequently encounter difficulties when processing rare or ancient scripts, especially when they occur in historical contexts involving multiple writing systems. These challenges often constrain researchers to fine-tune or train new OCR models tailored to their specific needs. To support these efforts, we introduce a synthetic dataset comprising 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Augmented with artificially degraded lines, the dataset supports strong results when used to train historical OCR models. This resource can be used both for training and testing purposes, and is particularly valuable for researchers working with ancient Greek and limited annotated data. The software used to generate this dataset is linked below on our Git. This is a sample; please contact us if you would like access to the whole dataset.
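    The "artificially degraded lines" mentioned above can be illustrated with a toy degradation pass. This is a generic sketch (salt-and-pepper noise on a binary line image), not the authors' actual augmentation pipeline, which lives in their linked repository.

```python
import random

def degrade(line_pixels, flip_prob=0.05, seed=0):
    """Flip each 0/1 pixel of a binary text-line image with probability
    flip_prob. A toy stand-in for the dataset's artificial degradation."""
    rng = random.Random(seed)
    return [[1 - p if rng.random() < flip_prob else p for p in row]
            for row in line_pixels]

clean = [[0] * 8 for _ in range(3)]  # a blank 8x3 "text line"
noisy = degrade(clean, flip_prob=0.5, seed=42)
```

    Training on a mix of clean and degraded renderings is a common way to make synthetic data transfer better to worn historical prints.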

  13. Synthetic Printed Words and Test Protocols Data for Bangla OCR

    • figshare.com
    zip
    Updated Jun 13, 2023
    Cite
    Koushik Roy; MD Sazzad Hossain; Pritom Saha; Shadman Rohan; Fuad Rahman; Imranul Ashrafi; Ifty Mohammad Rezwan; B M Mainul Hossain; Ahmedul Kabir; Nabeel Mohammed (2023). Synthetic Printed Words and Test Protocols Data for Bangla OCR [Dataset]. http://doi.org/10.6084/m9.figshare.20186825.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Koushik Roy; MD Sazzad Hossain; Pritom Saha; Shadman Rohan; Fuad Rahman; Imranul Ashrafi; Ifty Mohammad Rezwan; B M Mainul Hossain; Ahmedul Kabir; Nabeel Mohammed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic printed word images and test-protocol data repository for the paper "A Multifaceted Evaluation of Representation of Graphemes for Practically Effective Bangla OCR." In this paper, we utilized the popular Convolutional Recurrent Neural Network (CRNN) architecture and implemented our grapheme representation strategies to design the final labels of the model. Due to the absence of a large-scale Bangla word-level printed dataset, we created a synthetically generated Bangla corpus containing 2 million samples that are representative and sufficiently varied in terms of fonts, domain, and vocabulary size to train our Bangla OCR model. To test the various aspects of our model, we also created 6 test protocols. Finally, to establish the generalizability of our grapheme representation methods, we performed training and testing on external handwriting datasets. Updates: 10 June 2023: The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR).

  14. india-medical-ocr-test

    • huggingface.co
    Updated Aug 7, 2025
    Cite
    Daniel van Strien (2025). india-medical-ocr-test [Dataset]. https://huggingface.co/datasets/davanstrien/india-medical-ocr-test
    Explore at:
    Dataset updated
    Aug 7, 2025
    Authors
    Daniel van Strien
    Description

    Document OCR using NuMarkdown-8B-Thinking

    This dataset contains markdown-formatted OCR results from images in davanstrien/india-medical-test using NuMarkdown-8B-Thinking.

      Processing Details
    

    Source Dataset: davanstrien/india-medical-test
    Model: numind/NuMarkdown-8B-Thinking
    Number of Samples: 50
    Processing Time: 13.3 minutes
    Processing Date: 2025-08-07 08:04 UTC

      Configuration
    

    Image Column: image
    Output Column: markdown
    Dataset Split: train
    Batch Size: 16…

    See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/india-medical-ocr-test.

  15. Smart Document Scanner OCR App Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 29, 2025
    Cite
    Growth Market Reports (2025). Smart Document Scanner OCR App Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/smart-document-scanner-ocr-app-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Smart Document Scanner OCR App Market Outlook

    According to our latest research, the global Smart Document Scanner OCR App market size reached USD 3.85 billion in 2024, exhibiting robust growth driven by the rapid digitization of workflows and the increasing need for document automation across various sectors. The market is projected to grow at a CAGR of 13.7% from 2025 to 2033, with the market size forecasted to reach USD 11.89 billion by 2033. This significant expansion is primarily attributed to the widespread adoption of mobile devices, advancements in artificial intelligence and machine learning, and the growing demand for efficient document management solutions in both personal and professional environments.
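    As a quick sanity check on the figures above, the implied compound annual growth rate can be recomputed from the two endpoints using the standard CAGR formula; only the two dollar figures are taken from the report text.

```python
# Endpoints quoted in the report text (USD billion).
base_2024 = 3.85
target_2033 = 11.89
years = 9  # 2024 -> 2033

# Standard CAGR formula: (end / start) ** (1 / years) - 1
implied_cagr = (target_2033 / base_2024) ** (1 / years) - 1
# This comes out near 13.3%, close to the reported 13.7% CAGR; the small
# gap presumably reflects rounding and the 2025-2033 forecast window.
```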




    One of the primary growth factors fueling the Smart Document Scanner OCR App market is the accelerating pace of digital transformation across industries such as healthcare, finance, education, and government. Organizations are increasingly seeking ways to streamline their document handling processes, reduce manual data entry errors, and improve operational efficiency. The integration of Optical Character Recognition (OCR) technology into smart document scanning apps enables users to quickly convert paper documents into editable and searchable digital formats, significantly enhancing productivity. Furthermore, the proliferation of remote work and the need for secure, cloud-based document sharing have further heightened the demand for advanced OCR-enabled scanning solutions.




    Another significant driver is the continuous innovation in artificial intelligence and machine learning algorithms, which are making OCR technology more accurate, reliable, and versatile. Modern Smart Document Scanner OCR Apps can now recognize a wide range of fonts, languages, and complex layouts, including tables and handwritten notes, with remarkable precision. This technological evolution has broadened the application scope of these apps, allowing them to be used not only for basic document digitization but also for tasks such as invoice processing, identity verification, and compliance management. The incorporation of AI-powered features such as automatic document detection, real-time translation, and advanced data extraction is further propelling market growth.




    The increasing penetration of smartphones and mobile devices globally has also played a crucial role in the expansion of the Smart Document Scanner OCR App market. With the majority of the population now having access to high-resolution cameras and powerful processing capabilities on their mobile devices, scanning and digitizing documents has become more convenient than ever. This trend is particularly pronounced in emerging markets, where mobile-first solutions are often preferred over traditional desktop-based applications. Additionally, the growing emphasis on paperless offices and environmental sustainability is encouraging both individuals and enterprises to adopt digital document management practices, thereby boosting the market for OCR-enabled scanner apps.




    From a regional perspective, North America currently dominates the global Smart Document Scanner OCR App market, accounting for the largest share in 2024. This is largely due to the high adoption rate of advanced technologies, a mature IT infrastructure, and the presence of leading solution providers in the region. However, Asia Pacific is expected to witness the fastest growth over the forecast period, driven by rapid urbanization, increasing smartphone penetration, and rising investments in digital transformation initiatives across countries such as China, India, and Japan. Europe also presents significant growth opportunities, supported by stringent regulatory requirements for data management and a strong focus on innovation in document processing technologies.





    Component Analysis



    The Component segment of the Smart Document Scanner OCR App market is bifurcated into Software and Services. The Software sub-segment holds the lion’s share of the market, as the co

  16. Data for Optical Character Recognition Applied to Hieratic: Sign Identification and Broad Analysis

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Tabin, Julius A. (2023). Data for Optical Character Recognition Applied to Hieratic: Sign Identification and Broad Analysis [Dataset]. http://doi.org/10.7910/DVN/D8CWVZ
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Tabin, Julius A.
    Description

    This data consists of a number of .zip files containing everything needed to run the hieratic optical character recognition program presented at https://github.com/jtabin/PaPYrus. The files included are as follows:

    1. "Dataset By Sign": all 13,134 data set images, categorized in folders by their Gardiner sign. Each image is a black-and-white .png of a hieratic sign. The signs are labeled with unique identifiers encoding, in order, their placement in a text from the 1st (0001) to the 9999th (9999), the facsimile maker (1 for Möller, 2 for Poe, 3 for Tabin), the provenance (1: Thebes, 2: Lahun, 3: Hatnub, 4: Unknown), and the original text (1: Shipwrecked Sailor, 2: Eloquent Peasant B1, 3: Eloquent Peasant R, 4: Sinuhe B, 5: Sinuhe R, 6: Papyrus Prisse, 7: Hymn to Senwosret III, 8: Lahun Temple Files, 9: Will of Wah, 10: Texte aus Hatnub, 11: Papyrus Ebers, 12: Rhind Papyrus, 13: Papyrus Westcar).
    2. "Dataset Categorized": every data set image, as above, categorized in folders by provenance, text, and facsimile maker (i.e. where the tags originate from).
    3. "Dataset Whole": every data set image in one folder. This is what the analyses done by the OCR program use.
    4. "Precalculated Data Set Stats": a collection of .csv files output by the "Data Set Prep.ipynb" code (found on the aforementioned GitHub page). "pxls_16.csv", "pxls_20.csv", and "pxls_25.csv" hold the pixel values for every sign in the data set after resizing to 16, 20, and 25 pixels, respectively. "datasetstats.csv" includes the aspect ratios and sign names for every sign. The two files beginning with "A1cut" contain the same stats, computed after every A1 sign had its tail manually cut off.
    5. "Precalculated OCR Results": a collection of .csv files output by the "Image Identification.ipynb" code (also found on the GitHub page). Most files are the result of running every instance of one sign from the data set through the OCR program and are labeled with that sign's name; they contain columns of signs and their similarity scores when compared to other signs. Some files, such as "randsamp_fullresults.csv", come from other analyses explained in their file names (that file, for instance, is a random sample from the data set).
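
The identifier scheme above lends itself to programmatic decoding. The description lists the fields and their value ranges but not how they are delimited within a tag, so the sketch below is hypothetical: it assumes underscore-separated fields in the order described (placement, facsimile maker, provenance, text).

```python
# Hypothetical decoder for the sign-identifier tags described above.
# The field order comes from the dataset description; the underscore
# delimiter and all helper names are assumptions for illustration.
MAKERS = {1: "Möller", 2: "Poe", 3: "Tabin"}
PROVENANCES = {1: "Thebes", 2: "Lahun", 3: "Hatnub", 4: "Unknown"}
TEXTS = {1: "Shipwrecked Sailor", 13: "Papyrus Westcar"}  # abbreviated mapping

def decode_tag(tag):
    """Split an assumed 'placement_maker_provenance_text' tag into fields."""
    placement, maker, provenance, text = tag.split("_")
    return {
        "placement": int(placement),              # 1st (0001) to 9999th sign
        "maker": MAKERS[int(maker)],
        "provenance": PROVENANCES[int(provenance)],
        "text": TEXTS.get(int(text), int(text)),  # fall back to the raw number
    }

print(decode_tag("0042_1_1_1"))
```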

  17. Identity Document

    • koncile.ai
    Koncile.ai, Identity Document [Dataset]. https://www.koncile.ai/en
    Explore at:
    Dataset provided by
    Koncile.ai
    License

    https://www.koncile.ai/en/termsandconditions

    Description

    AI-powered OCR to extract all fields from your ID documents (PDF or image). Turn your documents into data via API or SDK. Reliable and customizable.

  18. blip3-ocr-200m

    • huggingface.co
    Updated Sep 5, 2024
    Salesforce (2024). blip3-ocr-200m [Dataset]. https://huggingface.co/datasets/Salesforce/blip3-ocr-200m
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Salesforce Inc (http://salesforce.com/)
    Authors
    Salesforce
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    BLIP3-OCR-200M Dataset

      Overview
    

    The BLIP3-OCR-200M dataset is designed to address the limitations of current Vision-Language Models (VLMs) in processing and interpreting text-rich images, such as documents and charts. Traditional image-text datasets often struggle to capture nuanced textual information, which is crucial for tasks requiring complex text comprehension and reasoning.

      Key Features
    

    OCR Integration: The dataset incorporates Optical Character… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/blip3-ocr-200m.

  19. 2999 Tamil Characters Processed and Classified

    • kaggle.com
    Updated Apr 28, 2023
    Joseph Mathew Chakaramakkil (2023). 2999 Tamil Characters Processed and Classified [Dataset]. https://www.kaggle.com/datasets/joch2722/3k-tamil-chars-processed-and-classified/discussion
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Joseph Mathew Chakaramakkil
    Description

    This dataset is a processed and classified (labelled) dataset for Tamil OCR, derived from this dataset using this notebook. I re-uploaded it for use with my copy of the notebook, but also publish it here in the hope it may be useful to others. I suspect the original datasets in the notebook were unintentionally left private, as the author provided a Google Drive link to the files in their public notebook. Licensing information was not provided with the original dataset; please direct licensing queries to either the original dataset publisher or me.

    • Sample Size: This dataset contains an average of 300±? samples for each of 11 characters, totalling 2999 samples encoded as TIFF files. Some characters are more similar to each other than others (characters 0 & 1 and characters 9 & 10 are visually similar, which confused models when I was training them)
    • Limited Scope: This dataset contains only 11 characters (e.g. அ–ஓ, indexed 0–10), not all possible characters in Tamil
    • Truncation: The last character class was removed from the original dataset because it contained only 1 sample, which was unsuitable for model training
    • File Naming Scheme: The file naming scheme (retained from the original) appears to be u[?]_[character_number]t[sample_number].tiff, where character_number indexes the identity of the character (providing labelling information) and sample_number indexes the samples of that character. The significance of u[?] is unknown to me, but I suspect it corresponds to the identity of the person who hand-wrote the samples
    • Folder Structure: The samples are organised into folders by character class. This may be used to generate labels for model creation, as it is in the source notebook
    • Binarisation: The samples are binarised as in the source notebook, where black (0) means no ink and white (1) means an inked pixel. The raw samples were black ink on a white background
    • Resizing Distortion: The resizing process from the source notebook squishes and stretches characters occupying more rectangular spaces into a uniform square for model training. This may be undesirable depending on your use case
    • Lone Characters: The samples are of lone characters, not characters within words, so intra-word joins in cursive styles are not present
    • Writing Quality: The quality of the handwriting, and its deviation from the typographical forms of these characters, varies across the dataset. At least one sample appears to show two subsequent writing attempts overlaid on each other, which often confused the CNN model in my copy of the source notebook. In a later version or fork of this dataset, such especially poor-quality samples may be separated out and more accurately classified as illegible
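
The file naming scheme described in the bullets above can be parsed mechanically. A minimal sketch, assuming the u[?]_[character_number]t[sample_number].tiff pattern holds for every file (the filename in the example is hypothetical):

```python
import re

# Parse the u[?]_[character_number]t[sample_number].tiff scheme described
# above. The u-prefix value is captured but not interpreted, since its
# meaning is unknown (possibly a writer ID).
FILENAME_RE = re.compile(r"u(\d+)_(\d+)t(\d+)\.tiff?$")

def parse_sample_name(name):
    """Return (u_value, character_label, sample_index), or None if unmatched."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())

print(parse_sample_name("u3_7t120.tiff"))  # hypothetical filename
```

The character_label component could then be used directly as the training label, mirroring the folder-based labelling the dataset already provides.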
  20. OpenDoc-Pdf-Preview

    • huggingface.co
    Updated Jun 25, 2025
    Prithiv Sakthi (2025). OpenDoc-Pdf-Preview [Dataset]. https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview
    Explore at:
    Dataset updated
    Jun 25, 2025
    Authors
    Prithiv Sakthi
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    OpenDoc-Pdf-Preview

    OpenDoc-Pdf-Preview is a compact visual preview dataset containing 6,000 high-resolution document images extracted from PDFs. This dataset is designed for Image-to-Text tasks such as document OCR pretraining, layout understanding, and multimodal document analysis.

      Dataset Summary
    

    Modality: Image-to-Text Content Type: PDF-based document previews Number of Samples: 6,000 Language: English Format: Parquet Split: train only Size: 606 MB License: Apache… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview.


OCR Document Text Recognition Dataset

Photos of the documents and text - OCR dataset


Description



Dataset structure

  • images - contains the original images of documents
  • boxes - includes bounding box labeling for the original images
  • annotations.xml - contains coordinates of the bounding boxes and labels created for the original photos

Data Format

Each image from the images folder is accompanied by an XML annotation in the annotations.xml file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.

Labels for the text:

  • "Text Title" - corresponds to titles, the box is red
  • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
  • "Table" - corresponds to the table, the box is green
  • "Handwritten" - corresponds to handwritten text, the box is purple

Example of XML file structure

(Image: screenshot of an example annotations.xml snippet, not reproduced here.)
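
An annotations.xml file of this kind can be read with Python's standard library. A hedged sketch, assuming a CVAT-style layout in which each image element holds box children with a label and corner coordinates; the element and attribute names (image/box, xtl/ytl/xbr/ybr) are assumptions, and the sample XML below is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented sample in an assumed CVAT-style layout; the real annotations.xml
# schema is not reproduced in this listing.
SAMPLE = """
<annotations>
  <image name="doc_001.png">
    <box label="Text Title" xtl="40" ytl="25" xbr="560" ybr="70"/>
    <box label="Handwritten" xtl="60" ytl="300" xbr="480" ybr="380"/>
  </image>
</annotations>
"""

def load_boxes(xml_text):
    """Yield (image_name, label, (xtl, ytl, xbr, ybr)) for every box."""
    root = ET.fromstring(xml_text)
    for image in root.iter("image"):
        for box in image.iter("box"):
            coords = tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr"))
            yield image.get("name"), box.get("label"), coords

for record in load_boxes(SAMPLE):
    print(record)
```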

The Text Detection in the Documents dataset can be tailored to your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
