100+ datasets found
  1. OCR Document Text Recognition Dataset

    • kaggle.com
    Updated Sep 7, 2023
    + more versions
    Cite
    Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    OCR Text Detection in the Documents Object Detection dataset

    The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

    The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

    💴 For Commercial Usage: To discuss your requirements, learn about pricing, and buy the dataset, leave a request on TrainingData.

    The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F6986071a88d8a9829fee98d5b49d9ff8%2FMacBook%20Air%20-%201%20(1).png?generation=1691059158337136&alt=media

    Dataset structure

    • images - contains the original images of documents
    • boxes - includes bounding box labeling for the original images
    • annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

    Data Format

    Each image in the images folder has a corresponding XML annotation in the annotations.xml file that gives the coordinates of the bounding boxes and the labels for text detection. For each point, the x and y coordinates are provided.

    Labels for the text:

    • "Text Title" - corresponds to titles, the box is red
    • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
    • "Table" - corresponds to the table, the box is green
    • "Handwritten" - corresponds to handwritten text, the box is purple

    Example of XML file structure

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F38e02db515561a30e29faca9f5b176b0%2Fcarbon.png?generation=1691058761924879&alt=media
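
    As a rough illustration of how the annotations might be read, the sketch below parses bounding boxes and maps each label to the box colour listed above. The listing shows the real XML structure only as an image, so the CVAT-style layout used here (`<image>` elements holding `<box>` children with `xtl`, `ytl`, `xbr`, `ybr` attributes) and the sample values are assumptions, not the dataset's confirmed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element and attribute names below are assumed
# (CVAT-style); check them against the real annotations.xml before use.
SAMPLE_XML = """<annotations>
  <image id="0" name="images/0.png" width="1000" height="1400">
    <box label="Text Title" xtl="100.0" ytl="50.0" xbr="900.0" ybr="120.0"/>
    <box label="Text Paragraph" xtl="100.0" ytl="150.0" xbr="900.0" ybr="600.0"/>
  </image>
</annotations>"""

# Box colours as stated in the dataset description.
LABEL_COLORS = {
    "Text Title": "red",
    "Text Paragraph": "blue",
    "Table": "green",
    "Handwritten": "purple",
}


def read_boxes(xml_text):
    """Return one dict per bounding box found in the annotation XML."""
    root = ET.fromstring(xml_text)
    boxes = []
    for image in root.iter("image"):
        for box in image.iter("box"):
            label = box.get("label")
            corners = tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr"))
            boxes.append({
                "image": image.get("name"),
                "label": label,
                "color": LABEL_COLORS.get(label, "unknown"),
                "xyxy": corners,  # (left, top, right, bottom) in pixels
            })
    return boxes


boxes = read_boxes(SAMPLE_XML)
for entry in boxes:
    print(entry["image"], entry["label"], entry["color"], entry["xyxy"])
```

    If the real file stores polygons or point lists instead of boxes, only the inner loop would need to change.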

    The Text Detection in the Documents dataset can be tailored to your requirements.

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text

  2. English Product Image OCR Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English Product Image OCR Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/english-product-image-ocr-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the English Product Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the English language.

    Dataset Content & Diversity:

    Containing a total of 2000 images, this English OCR dataset offers diverse distribution across different types of front images of Products. In this dataset, you'll find a variety of text that includes product names, taglines, logos, company names, addresses, product content, etc. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible English text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, to build a balanced OCR dataset. The collection features images in portrait and landscape modes.

    All these images were captured by native English speakers to ensure text quality and to avoid toxic content and PII text. The images were taken with recent iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes fields such as image orientation, country, language, and device information. Each image is renamed to match its metadata entry.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native English crowd community.

    If you require a custom product image OCR dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific project requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this product image OCR dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the English language. Your journey to enhanced language understanding and processing starts here.

  3. Ocr_datasets Dataset

    • universe.roboflow.com
    zip
    Updated Aug 2, 2025
    + more versions
    Cite
    myopensourcedatasets (2025). Ocr_datasets Dataset [Dataset]. https://universe.roboflow.com/myopensourcedatasets/ocr_datasets
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 2, 2025
    Dataset authored and provided by
    myopensourcedatasets
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Words Bounding Boxes
    Description

    OCR_Datasets

    ## Overview
    
    OCR_Datasets is a dataset for object detection tasks - it contains Words annotations for 498 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  4. Ocr Dataset

    • universe.roboflow.com
    zip
    Updated Nov 15, 2024
    Cite
    OCR (2024). Ocr Dataset [Dataset]. https://universe.roboflow.com/ocr-knse8/ocr-9vwoq
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 15, 2024
    Dataset authored and provided by
    OCR
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    TEXT Descriptions
    Description

    OCR

    ## Overview
    
    OCR is a dataset for vision language (multimodal) tasks - it contains TEXT annotations for 540 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
    
  5. Dataset of ICDAR 2019 Competition on Post-OCR Text Correction

    • live.european-language-grid.eu
    • zenodo.org
    • +1 more
    txt
    Updated Sep 12, 2022
    Cite
    (2022). Dataset of ICDAR 2019 Competition on Post-OCR Text Correction [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7738
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 12, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Corpus for the ICDAR2019 Competition on Post-OCR Text Correction (October 2019)
    Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
    http://l3i.univ-larochelle.fr/ICDAR2019PostOCR

    These are the supplementary materials for the ICDAR 2019 paper "ICDAR 2019 Competition on Post-OCR Text Correction". Please use the following citation:

    @inproceedings{rigaud2019pocr,
      title="ICDAR 2019 Competition on Post-OCR Text Correction",
      author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
      year={2019},
      booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
    }

    Description: The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource.

    Breakdown of the dataset
    - ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.
    - ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset, used for the evaluation (with Gold Standard made public after the competition).
    - ICDAR2019_Post_OCR_correction_full_22M: full dataset, made publicly available after the competition.

    Special case for Finnish language
    Material from the National Library of Finland (Finnish dataset FI > FI1) may not be re-shared on other websites. Please follow these guidelines to get and format the data from the original website.
    1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en
    2. Download OCR Ground Truth Pages (Finnish Fraktur) [v1](4.8GB) from the Digitalia (2015-17) package.
    3. Convert the Excel file "~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx" to comma-separated format (.csv) using the "save as" function in spreadsheet software (e.g. Excel, Calc) and copy it into "FI/FI1/HOWTO_get_data/input/".
    4. Go to "FI/FI1/HOWTO_get_data/" and run "script_1.py" to generate the full "FI1" dataset in "output/full/".
    5. Run "script_2.py" to split the "output/full/" dataset into "output/training/" and "output/evaluation/" subsets.
    At the end of the process, you should have "training", "evaluation" and "full" folders with 1579528, 380817 and 1960345 characters respectively.

    Licenses: free to use for non-commercial uses, according to sources in details
    - BG1: IMPACT - National Library of Bulgaria: CC BY NC ND
    - CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA
    - DE1: Front pages of Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)
    - DE2: IMPACT - German National Library: CC BY NC ND
    - DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - EN1: IMPACT - British Library: CC BY NC SA 3.0
    - ES1: IMPACT - National Library of Spain: CC BY NC SA
    - FI1: National Library of Finland: no re-sharing allowed, follow the above section to get the data. (https://digi.kansalliskirjasto.fi/opendata)
    - FR1: HIMANIS Project: CC0 (https://www.himanis.org)
    - FR2: IMPACT - National Library of France: CC BY NC SA 3.0
    - FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)
    - NL1: IMPACT - National library of the Netherlands: CC BY
    - PL1: IMPACT - National Library of Poland: CC BY
    - SL1: IMPACT - Slovak National Library: CC BY NC

    Text post-processing such as cleaning and alignment have been applied on the resources mentioned above, so that the Gold Standard and the OCRs provided are not necessarily identical to the originals.

    Structure
    - **Content** [./lang_type/sub_folder/#.txt]
      - "[OCR_toInput] " => Raw OCRed text to be de-noised.
      - "[OCR_aligned] " => Aligned OCRed text.
      - "[ GS_aligned] " => Aligned Gold Standard text.

    The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using "@" symbols. "#" symbols correspond to the absence of GS either related to alignment uncertainties or related to unreadable characters in the source document. For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.

    The Error Rate and the quality of the alignment vary according to the nature and the state of degradation of the source documents. Periodicals (mostly historical newspapers) for example, due to their complex layout and their original fonts have been reported to be especially challenging. In addition, it should be mentioned that the quality of Gold Standard also varies as the dataset aggregates resources from different projects that have their own annotation procedure, and obviously contains some errors.
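
    To make the alignment convention concrete, here is a minimal sketch that splits one corpus file into its three prefixed fields and counts mismatched character positions between the aligned OCR and Gold Standard strings, skipping positions where the GS is "#". The sample text is invented for illustration, and the position-wise mismatch count is a simplification, not the competition's official evaluation metric.

```python
PREFIXES = ("[OCR_toInput] ", "[OCR_aligned] ", "[ GS_aligned] ")

# Hypothetical file content; real files follow the same three-prefix layout.
SAMPLE = (
    "[OCR_toInput] Tbe quick brwn fox\n"
    "[OCR_aligned] Tbe quick br@wn fox\n"
    "[ GS_aligned] The quick brown fox\n"
)


def parse_record(text):
    """Split a corpus file into a dict keyed by field name."""
    fields = {}
    for line in text.splitlines():
        for prefix in PREFIXES:
            if line.startswith(prefix):
                # "[ GS_aligned] " becomes "GS_aligned", and so on.
                fields[prefix.strip("[] ")] = line[len(prefix):]
    return fields


def char_errors(ocr_aligned, gs_aligned):
    """Return (mismatches, positions compared), ignoring '#' in the GS."""
    if len(ocr_aligned) != len(gs_aligned):
        raise ValueError("aligned strings must have equal length")
    errors = 0
    compared = 0
    for ocr_char, gs_char in zip(ocr_aligned, gs_aligned):
        if gs_char == "#":  # no Gold Standard available at this position
            continue
        compared += 1
        if ocr_char != gs_char:  # includes '@' padding on either side
            errors += 1
    return errors, compared


fields = parse_record(SAMPLE)
errors, compared = char_errors(fields["OCR_aligned"], fields["GS_aligned"])
print(f"{errors} mismatches over {compared} compared positions")
```

    Because "@" pads both strings to the same length, a plain `zip` over the two aligned fields is enough; no edit-distance computation is needed for this rough count.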

    ICDAR2019 competition
    Information related to the tasks, formats and the evaluation metrics is detailed at: https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation

    References
    - IMPACT, European Commission's 7th Framework Program, grant agreement 215064
    - Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
    - https://digi.nationallibrary.fi , Wiipuri, 31.12.1904, Digital Collections of National Library of Finland
    - EU Horizon 2020 research and innovation programme grant agreement No 770299

    Contact
    - christophe.rigaud(at)univ-lr.fr
    - antoine.doucet(at)univ-lr.fr
    - mickael.coustaty(at)univ-lr.fr
    - jean-philippe.moreux(at)bnf.fr
    L3i - University of la Rochelle, http://l3i.univ-larochelle.fr
    BnF - French National Library, http://www.bnf.fr

  6. Standard Ocr Dataset 2 Dataset

    • universe.roboflow.com
    zip
    Updated Jul 3, 2025
    Cite
    flickrimages (2025). Standard Ocr Dataset 2 Dataset [Dataset]. https://universe.roboflow.com/flickrimages/standard-ocr-dataset-2/model/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    flickrimages
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Characters Bounding Boxes
    Description

    Standard Ocr Dataset 2

    ## Overview
    
    Standard Ocr Dataset 2 is a dataset for object detection tasks - it contains Characters annotations for 206 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  7. OCR-VQA

    • huggingface.co
    Updated Aug 8, 2023
    Cite
    howard-hou (2023). OCR-VQA [Dataset]. https://huggingface.co/datasets/howard-hou/OCR-VQA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 8, 2023
    Authors
    howard-hou
    Description

    Dataset Card for "OCR-VQA"

    More Information needed

  8. Ocr Dataset

    • universe.roboflow.com
    zip
    Updated Jun 22, 2024
    + more versions
    Cite
    Nabil (2024). Ocr Dataset [Dataset]. https://universe.roboflow.com/nabil-k0ulv/ocr-1nyva
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 22, 2024
    Dataset authored and provided by
    Nabil
    Variables measured
    Digis Bounding Boxes
    Description

    Ocr

    ## Overview
    
    Ocr is a dataset for object detection tasks - it contains Digis annotations for 237 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
  9. CaptionedSynthText

    • huggingface.co
    Updated Aug 23, 2024
    Cite
    Chris Wendler (2024). CaptionedSynthText [Dataset]. https://huggingface.co/datasets/wendlerc/CaptionedSynthText
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 23, 2024
    Authors
    Chris Wendler
    Description

    This dataset has been created by Stability AI and LAION. SynthText is a popular OCR dataset, where random texts are rendered into random locations in images based on depth maps. In this dataset, we additionally computed image captions using BLIP2.

    Caption: "a close up of a leopard's face with a blurry background"

  10. Tesseract OCR Training Dataset

    • gts.ai
    json
    Updated Sep 6, 2024
    Cite
    GTS (2024). Tesseract OCR Training Dataset [Dataset]. https://gts.ai/dataset-download/page/68/
    Explore at:
    Available download formats: json
    Dataset updated
    Sep 6, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Unlock the potential of Tesseract OCR with our meticulously hand-labeled training dataset. Designed for fine-tuning, this dataset includes comprehensive files and a custom Bash script to streamline your OCR improvements.

  11. 19th-Century Romanian Transitional Script

    • kaggle.com
    Updated May 21, 2024
    Cite
    Marius E. Penteliuc (2024). 19th-Century Romanian Transitional Script [Dataset]. https://www.kaggle.com/datasets/mariuspenteliuc/rts-ocr
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marius E. Penteliuc
    Description

    This dataset consists of 156 pages of Romanian texts written in the Romanian Transitional Script (RTS). RTS is a mix of Latin and Cyrillic characters that were used in the 19th century in the Romanian provinces to facilitate the transition from the Romanian Cyrillic Script to the modern Latin Script. The images cover the period between 1833 and 1864. The selected texts cover a diverse range of literary genres, including poems, novels, dramas, stories, newspapers, and religious texts.

    The dataset was obtained from the Central University Libraries (BCU) of Timișoara, Iași, and Cluj-Napoca through their free online platforms or by request. The scanned images are provided in JPEG and PNG formats, with dimensions ranging from approximately 300 by 900 pixels to 2000 by 3000 pixels. The file sizes vary between 70 KB and 10 MB.

    To ensure diversity, the dataset includes images with various fonts, styles, regions, publishers, and years. It covers all three main Romanian provinces' key publishing regions (Bucharest - B, Iasi - IS, Brasov - BV, Sibiu - SB, Blaj - BJ) as well as some located outside Romania that printed texts in RTS (Vienna - V, Budapest - BD, Paris - P). It comprises 4588 lines of text, totaling 31,132 words and 158,656 characters. Among these characters, there are 61,065 Cyrillic characters, 27,022 Latin characters, 53,844 overlapping characters (identical symbols), and 16,725 other characters (e.g., punctuation, digits). The images below summarize its content per publisher and decade. More statistics (including per publishing house and per character) are available in the code provided.

    Statistics of characters in the dataset per publisher and decade* https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15661653%2F13bd86216df169b5c4783813a4b5118f%2Fchar-count.png?generation=1687532923729343&alt=media

    Percentage of Latin vs. Cyrillic vs. other characters in the dataset* https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15661653%2F0cfad1574aa2823b798fcf2b515beff6%2Fchar-ratio.png?generation=1687532980067286&alt=media

    The dataset presents typical challenges found in old documents, such as wear and tear, blemishes, discolorations, library imprints, handwriting, ink smudges, and variations in text alignment. These factors may impact legibility, and some scanned lines of text may not be uniformly straight.

    This dataset provides a valuable resource for researchers and practitioners interested in historical document analysis, transliteration techniques, and studying the evolution of the Romanian language. It allows for the development and evaluation of OCR models and other language processing techniques in the context of the Romanian Transitional Script. The images provided are accompanied by ground truth texts (.gt.txt files) containing the correct text found in them, as well as .box files for the Tesseract 5 OCR engine.
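
    One way to work with the ground truth files is to pair every image with its transcription, as sketched below. It assumes that each .gt.txt file shares its image's base name (for example `line_001.png` and `line_001.gt.txt`); the folder layout and file names in the demo are hypothetical and should be checked against the downloaded dataset.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_lines(folder):
    """Return (image_path, ground_truth_text) for images with a .gt.txt sibling."""
    pairs = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Assumed naming convention: line_001.png -> line_001.gt.txt
        gt_file = image.with_suffix(".gt.txt")
        if gt_file.exists():
            pairs.append((image, gt_file.read_text(encoding="utf-8").strip()))
    return pairs


# Tiny demo on a temporary folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "line_001.png").write_bytes(b"")           # stand-in for a scan
    (root / "line_001.gt.txt").write_text("exemplu\n", encoding="utf-8")
    (root / "line_002.png").write_bytes(b"")           # no transcription
    pairs = pair_lines(root)

for image, text in pairs:
    print(image.name, "->", text)
```

    Images without a transcription are skipped rather than reported, which keeps the pairing usable as direct input to a training or evaluation loop.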

    Usage

    You may use the dataset freely as long as you mention this page or the project below.

    Acknowledgements

    This work was supported by a grant of the Romanian Ministry of Research, Innovation and Digitization, CCCDI - UEFISCDI, project number PN-III-P2-2.1-PED-2021-0693, within PNCDI III. Project website: ROTLA

    *Plots are based on the original dataset distribution

  12. Dataset of invoices and receipts including annotation of relevant fields

    • zenodo.org
    zip
    Updated Apr 3, 2022
    Cite
    Francisco Cruz; Mauro Castelli (2022). Dataset of invoices and receipts including annotation of relevant fields [Dataset]. http://doi.org/10.5281/zenodo.6371710
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 3, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francisco Cruz; Mauro Castelli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset comprising 813 images of invoices and receipts of a private company in the Portuguese language. It also includes text files with the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference.

  13. OCR Barcodes Detection.

    • gts.ai
    json
    Updated Nov 20, 2023
    Cite
    GTS (2023). OCR Barcodes Detection. [Dataset]. https://gts.ai/dataset-download/financial-data-set1-2/
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    OCR (Optical Character Recognition) barcode detection is a technology that enables the automatic recognition and extraction of barcode information from images or documents...

  14. Ocr Dataset

    • universe.roboflow.com
    zip
    Updated Jun 19, 2024
    + more versions
    Cite
    ocr (2024). Ocr Dataset [Dataset]. https://universe.roboflow.com/ocr-ipmuf/ocr-5e87s/dataset/5
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    ocr
    Variables measured
    Meter Number Bounding Boxes
    Description

    Ocr

    ## Overview
    
    Ocr is a dataset for object detection tasks - it contains Meter Number annotations for 3,786 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
  15. Gurmukhi dataset

    • data.mendeley.com
    Updated Sep 24, 2024
    + more versions
    Cite
    Atul Sharma (2024). Gurmukhi dataset [Dataset]. http://doi.org/10.17632/h65gdk4ptv.1
    Explore at:
    Dataset updated
    Sep 24, 2024
    Authors
    Atul Sharma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises a meticulously augmented collection of Gurmukhi handwritten characters, designed to enhance the performance of machine learning models in optical character recognition (OCR) and related tasks. It includes characters across 41 distinct classes, each augmented to reach a total of approximately 290 samples per class.

    Key Features:

    • Gurmukhi Script Focus: The dataset exclusively features handwritten characters from the Gurmukhi script, catering specifically to applications involving Punjabi language processing.
    • Diverse Augmentations: Images have been subjected to a range of transformations, including rotations, shifts, shears, zooms, and horizontal flips, promoting robustness to variations encountered in handwritten text.
    • Consistent Dimensions: All images are resized to a uniform 256x256 resolution, ensuring compatibility with most deep learning architectures.
    • Class-Specific Organization: Images are neatly organized into 41 folders, each representing a distinct Gurmukhi character, facilitating targeted training and evaluation.
    • Handwritten Data Collection: The original images used for augmentation were collected from 10 volunteers, introducing natural variability in writing styles and further enhancing the dataset's diversity.

    Potential Use Cases:

    • Gurmukhi OCR: Train and evaluate OCR models specifically for Gurmukhi script recognition.
    • Handwriting Recognition: Develop models capable of recognizing and transcribing handwritten Gurmukhi text.
    • Script Style Analysis: Explore the variations in handwriting styles within the Gurmukhi script.

  16. Fire and Smoke Dataset.

    • gts.ai
    json
    Updated Nov 21, 2023
    Cite
    GTS (2023). Fire and Smoke Dataset. [Dataset]. https://gts.ai/dataset-download/fire-and-smoke-dataset/
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A Fire and Smoke Dataset is a collection of images and data specifically curated for the development, training, and evaluation of machine learning models and computer vision algorithms designed to detect and classify fires and smoke in various environments.

  17. SROIE_2019_text_recognition

    • huggingface.co
    Updated Apr 21, 2025
    Cite
    priyank (2025). SROIE_2019_text_recognition [Dataset]. https://huggingface.co/datasets/priyank-m/SROIE_2019_text_recognition
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2025
    Authors
    priyank
    License

    https://choosealicense.com/licenses/undefined/

    Description

    We prepared this dataset using the Scanned Receipts OCR and Information Extraction (SROIE) dataset. The SROIE dataset contains 973 scanned receipts in English. Cropping the bounding boxes from each receipt to generate this text-recognition dataset resulted in 33626 images for the train set and 18704 images for the test set. The text annotations for all the images inside a split are stored in a metadata.jsonl file. usage: from dataset import load_dataset data =… See the full description on the dataset page: https://huggingface.co/datasets/priyank-m/SROIE_2019_text_recognition.
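
    For readers who want to inspect the annotations without any dataset library, the sketch below reads a metadata.jsonl file into a mapping from image file name to transcription. The field names `file_name` and `text` and the sample records are assumptions made for illustration; the description above does not state the actual keys, so verify them against a real split first.

```python
import json
import tempfile
from pathlib import Path


def read_metadata(path):
    """Map each image file name to its transcription from a JSON Lines file."""
    labels = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # "file_name" and "text" are assumed keys, not confirmed by the source.
            labels[record["file_name"]] = record["text"]
    return labels


# Demo with two hypothetical records written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "metadata.jsonl"
    meta.write_text(
        '{"file_name": "0001.jpg", "text": "TOTAL"}\n'
        '{"file_name": "0002.jpg", "text": "12.50"}\n',
        encoding="utf-8",
    )
    labels = read_metadata(meta)

print(labels)
```

    Reading the file line by line keeps memory use flat, which matters little for two records but helps with the tens of thousands of crops mentioned above.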

  18. New Data Ocr Dataset

    • universe.roboflow.com
    zip
    Updated Mar 24, 2025
    + more versions
    lovely GP (2025). New Data Ocr Dataset [Dataset]. https://universe.roboflow.com/lovely-gp/new-data-ocr/model/4
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 24, 2025
    Dataset authored and provided by
    lovely GP
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Inscriptions Bounding Boxes
    Description

    New Data OCR

    ## Overview
    
    New Data OCR is a dataset for object detection tasks - it contains Inscriptions annotations for 580 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  19. OCR-VQA-200K

    • huggingface.co
    Updated Sep 27, 2024
    Adam Cuellar (2024). OCR-VQA-200K [Dataset]. https://huggingface.co/datasets/atc96/OCR-VQA-200K
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 27, 2024
    Authors
    Adam Cuellar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    atc96/OCR-VQA-200K dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. Aida Calculus Math Handwriting Recognition Dataset

    • kaggle.com
    Updated Oct 5, 2020
    Aida by Pearson (2020). Aida Calculus Math Handwriting Recognition Dataset [Dataset]. https://www.kaggle.com/aidapearson/ocr-data/discussion
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 5, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aida by Pearson
    License

    https://cdla.io/sharing-1-0/

    Description

    Context

    The Aida Calculus Math Handwriting Recognition Dataset consists of 100,000 images in 10 batches. Each image contains a photo of a handwritten calculus math expression (specifically within the topic of limits) written with a dark utensil on plain paper. Each image is accompanied by ground truth math expression in LaTeX as well as bounding boxes and pixel-level masks per character. All images are synthetically generated.

    Motivation

    The complexity of handwriting recognition for math expressions can be decomposed into the following sources of variability:

    Image of Math = Math Expression x Math Characters x Location of Math Characters x Visual Qualities of the Math Characters (fonts, color) x Noise of Image (backgrounds, stray marks)

    It is the job of the recognition model to take the Image of Math as input and predict the Math Expression.
    Typical approaches to handwritten recognition tasks involve collecting and tagging of large amounts of data, on which many iterations of models are trained. The "one dataset, many models" paradigm has specific drawbacks within the context of product development. As product requirements evolve, such as the addition of a new mathematical character into the prediction space, a new data collection and tagging effort must be undertaken. The cycle of adapting the handwriting recognition capability to new requirements is long and does not support agile product development.

    Here, we take a different approach by iteratively building a complex, synthetically generated dataset towards specific requirements. The generation process gives the developer exact control over the distribution of math expressions, characters, location of characters, specific visual qualities of the math, image noise, and image augmentations. The developer controls every aspect of the data, down to each pixel. In many ways, the data synthesis runs in the opposite direction to the handwriting recognition model, creating visual complexity that the model must then untangle to uncover the ground truth math expression. Thus, we arrive at a "many datasets, one model" paradigm in which the data can be quickly regenerated and adapted on agile cycles as product requirements change.

    In addition to affording more control over the product development process, synthetic data allows for 100% correct pixel by pixel tagging that opens the door for new modeling possibilities. Every image is tagged with the ground truth LaTeX for the expressions, bounding boxes per math character, and exact pixel masks for each character.
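
Because every character carries both a bounding box and a pixel mask, one simple consistency check is to derive a box from a mask and compare it with the stored box. The sketch below is an illustration only: the nested-list mask representation and the `(xmin, ymin, xmax, ymax)` ordering are assumptions, not the dataset's documented storage format.

```python
from typing import List, Tuple

# Hypothetical in-memory form: rows of 0/1 pixels for a single character.
Mask = List[List[int]]


def mask_to_bbox(mask: Mask) -> Tuple[int, int, int, int]:
    """Return the inclusive (xmin, ymin, xmax, ymax) of the nonzero pixels."""
    # Column indices of every foreground pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    # Row indices of every row that has at least one foreground pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        raise ValueError("mask contains no foreground pixels")
    return min(xs), min(ys), max(xs), max(ys)


# Toy mask for illustration; not taken from the dataset.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(toy_mask)
```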

    Our goal in releasing this dataset is to provide the data science and machine learning community with resources for undertaking the challenging computer vision task of extracting math expressions from images. The data offers something to all levels, from beginners building simple character recognition models to experts who wish to predict pixel-by-pixel masks and decode the complex structure of math expressions.

    Content

    The images contain math expressions of limits, a topic typically encountered by students learning Calculus I in the United States. Features of the writing such as font, writing utensils (type, color, pressure, consistency), angle and distance of photo, and size of writing are all simulated. Background features include shadows, various plain paper types, bleed-through, other distortions, and noise typical of students taking photos of their math.

    The strategy in defining the populations from which images are synthesized is to be a superset of what we expect students to submit. Therefore, the math expressions are not in themselves pedagogical, but aim to encompass the potential variety of student submissions, both mathematically correct and incorrect. The image features and augmentations are similarly designed to cover the range of possible student handwriting qualities.

    Data consis...

Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
OCR Document Text Recognition Dataset

Photos of the documents and text - OCR dataset

Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Sep 7, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Training Data
License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically

Description

OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on TrainingData.

The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

Dataset structure

  • images - contains the original images of documents
  • boxes - includes bounding box labeling for the original images
  • annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

Data Format

Each image in the images folder is accompanied by an XML annotation in the annotations.xml file that gives the coordinates of the bounding boxes and the labels for text detection. For each point, the x and y coordinates are provided.

Labels for the text:

  • "Text Title" - corresponds to titles, the box is red
  • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
  • "Table" - corresponds to the table, the box is green
  • "Handwritten" - corresponds to handwritten text, the box is purple

Example of XML file structure

[Image: example of the XML file structure]
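
Since the XML example is only available as an image, the following is a hedged sketch of how such a file might be read. The element names (`image`, `polygon`), the attributes (`name`, `label`, `points`), and the `x,y;x,y` point encoding are all assumptions about the layout; the real annotations.xml may differ, so treat `SAMPLE_XML` as hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Shape = Tuple[str, List[Point]]

# Hypothetical layout: the real annotations.xml may use different
# element and attribute names.
SAMPLE_XML = """
<annotations>
  <image name="0.jpg">
    <polygon label="Text Title" points="10.0,20.0;110.0,20.0;110.0,45.0;10.0,45.0"/>
    <polygon label="Handwritten" points="15.5,60.0;90.0,60.0;90.0,80.0;15.5,80.0"/>
  </image>
</annotations>
"""


def parse_points(raw: str) -> List[Point]:
    """Turn 'x1,y1;x2,y2;...' into a list of (x, y) float pairs."""
    points: List[Point] = []
    for pair in raw.split(";"):
        x, y = pair.split(",")
        points.append((float(x), float(y)))
    return points


def read_annotations(xml_text: str) -> Dict[str, List[Shape]]:
    """Group (label, points) shapes by image name."""
    root = ET.fromstring(xml_text.strip())
    result: Dict[str, List[Shape]] = {}
    for image in root.iter("image"):
        shapes: List[Shape] = []
        for shape in image.iter("polygon"):
            shapes.append((shape.get("label"), parse_points(shape.get("points"))))
        result[image.get("name")] = shapes
    return result


boxes = read_annotations(SAMPLE_XML)
```

Each entry in `boxes` maps an image name to its labeled shapes, so filtering by one of the four labels listed above is a simple comparison on the first element of each tuple.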

Text detection annotation for documents can be produced in accordance with your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
