84 datasets found
  1. OCR Document Text Recognition Dataset

    • kaggle.com
    Updated Sep 7, 2023
    + more versions
    Cite
    Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    OCR Text Detection in the Documents Object Detection dataset

    The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

    The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

    💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on TrainingData.

    The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F6986071a88d8a9829fee98d5b49d9ff8%2FMacBook%20Air%20-%201%20(1).png?generation=1691059158337136&alt=media

    Dataset structure

    • images - contains the original images of documents
    • boxes - includes bounding box labeling for the original images
    • annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

    Data Format

    Each image in the images folder is accompanied by an XML annotation in the annotations.xml file that gives the coordinates of the bounding boxes and their labels for text detection. For each point, the x and y coordinates are provided.

    Labels for the text:

    • "Text Title" - corresponds to titles, the box is red
    • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
    • "Table" - corresponds to the table, the box is green
    • "Handwritten" - corresponds to handwritten text, the box is purple

    Example of XML file structure

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F38e02db515561a30e29faca9f5b176b0%2Fcarbon.png?generation=1691058761924879&alt=media
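The XML layout itself is only shown as an image on the dataset page, so the following is a minimal sketch under stated assumptions rather than the dataset's confirmed schema. It assumes a CVAT-style export in which each `image` element holds `box` elements with `label`, `xtl`, `ytl`, `xbr`, and `ybr` attributes; those element and attribute names, and the sample file path, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: the real annotations.xml may use different names.
SAMPLE_XML = """
<annotations>
  <image id="0" name="images/0.jpg" width="1000" height="1400">
    <box label="Text Title" xtl="100.0" ytl="50.0" xbr="900.0" ybr="120.0"/>
    <box label="Text Paragraph" xtl="100.0" ytl="150.0" xbr="900.0" ybr="600.0"/>
  </image>
</annotations>
"""


def load_boxes(xml_text):
    """Return {image name: [(label, xtl, ytl, xbr, ybr), ...]}."""
    root = ET.fromstring(xml_text.strip())
    boxes = defaultdict(list)
    for image in root.iter("image"):
        for box in image.iter("box"):
            # Assumed attribute names for the top-left and bottom-right corners.
            coords = tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr"))
            boxes[image.get("name")].append((box.get("label"),) + coords)
    return dict(boxes)


if __name__ == "__main__":
    for name, items in load_boxes(SAMPLE_XML).items():
        print(name, items)
```

If the real file uses polygons or differently named attributes, only the inner loop should need to change.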

    The Text Detection in the Documents dataset can be produced in accordance with your requirements.

    💴 Buy the Dataset: This is just a sample of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price, and buy the dataset.

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text

  2. English Monograph OCR Dataset (Preprocessed) 📄🔍

    • kaggle.com
    Updated Mar 21, 2025
    Cite
    Arjav 007 (2025). English Monograph OCR Dataset (Preprocessed) 📄🔍 [Dataset]. https://www.kaggle.com/datasets/arjav007/icdar-eng
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arjav 007
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a preprocessed version of the English Monograph subset from the ICDAR 2017 OCR Post-Correction competition. It contains OCR-generated text alongside its corresponding aligned ground truth, making it useful for OCR error detection and correction tasks.

    📌 About the Dataset

    The dataset consists of historical English texts that were processed using OCR technology. Due to OCR errors, the text contains misrecognized characters, missing words, and other inaccuracies. This dataset provides both raw OCR output and gold-standard corrected text.

    🚀 Use Cases

    This dataset is ideal for:
    - OCR Error Detection & Correction 📝
    - Training Character-Based Machine Translation Models 🔠
    - Natural Language Processing (NLP) on Historical Texts 📜

    📊 Dataset Statistics

    • Total Entries: 724
    • Character-Level OCR Error Rate: ~1.79%
    • Common OCR Errors Observed:
      • 1 → I
      • tbe → the
      • tho → the
      • aud → and
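The page does not say how its character-level error rate was computed. One common definition is the Levenshtein (edit) distance between the OCR output and the aligned ground truth, divided by the ground-truth length. The sketch below illustrates that definition; the function names and the example strings are illustrative and not taken from the dataset.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text, ground_truth):
    """Character error rate of ocr_text relative to ground_truth."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    # Three substitutions of the kind listed above: tbe, aud, tho.
    print(char_error_rate("tbe cat aud tho dog", "the cat and the dog"))
```

Averaging over a corpus is usually done by summing edit distances and dividing by the total ground-truth length, rather than averaging per-entry rates.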

    📜 Citation

    If you use this dataset, please cite the original ICDAR 2017 OCR Post-Correction paper:

    Chiron, G., Doucet, A., Coustaty, M., Moreux, J.P. (2017). ICDAR 2017 Competition on Post-OCR Text Correction.

  3. License Plate Characters - Detection OCR

    • kaggle.com
    Updated Feb 7, 2022
    Cite
    Francesco Pettini (2022). License Plate Characters - Detection OCR [Dataset]. https://www.kaggle.com/datasets/francescopettini/license-plate-characters-detection-ocr
    Dataset updated
    Feb 7, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Francesco Pettini
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    The images come from this Kaggle dataset. I cropped 209 license plates using the original bounding boxes and, using LabelImg, labelled every single character, creating a total of 2026 character bounding boxes. Every image comes with an .xml annotation file of the same name; the format used is PascalVOC.

    Inside count.txt you can find the total occurrences of each character.
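As a rough illustration of the annotation format, here is a minimal sketch that reads character boxes from a Pascal VOC file and tallies characters, similar in spirit to count.txt. The sample XML, its file name, and the function name are made up for the example; only the standard Pascal VOC element names (`object`, `name`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`) are relied on.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation for one cropped plate, in Pascal VOC layout.
SAMPLE_VOC = """<annotation>
  <filename>plate_001.jpg</filename>
  <object>
    <name>A</name>
    <bndbox><xmin>4</xmin><ymin>6</ymin><xmax>20</xmax><ymax>40</ymax></bndbox>
  </object>
  <object>
    <name>7</name>
    <bndbox><xmin>24</xmin><ymin>6</ymin><xmax>40</xmax><ymax>40</ymax></bndbox>
  </object>
  <object>
    <name>A</name>
    <bndbox><xmin>44</xmin><ymin>6</ymin><xmax>60</xmax><ymax>40</ymax></bndbox>
  </object>
</annotation>"""


def read_voc(xml_text):
    """Return [(character, (xmin, ymin, xmax, ymax)), ...] for one file."""
    root = ET.fromstring(xml_text)
    result = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        box = tuple(int(float(bndbox.findtext(tag)))
                    for tag in ("xmin", "ymin", "xmax", "ymax"))
        result.append((obj.findtext("name"), box))
    return result


if __name__ == "__main__":
    objects = read_voc(SAMPLE_VOC)
    # Tally how often each character occurs in this annotation.
    print(Counter(char for char, _ in objects))
```

Sorting the boxes by `xmin` would give the left-to-right plate string, assuming single-row plates.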

  4. OCR dataset

    • kaggle.com
    zip
    Updated May 28, 2023
    Cite
    Vaibhav (2023). OCR dataset [Dataset]. https://www.kaggle.com/datasets/quantumkaze/infrrd-dataset
    Available download formats: zip (464115421 bytes)
    Dataset updated
    May 28, 2023
    Authors
    Vaibhav
    Description

    Dataset

    This dataset was created by Vaibhav

    Contents

  5. 14,511 Images English Handwriting OCR Data

    • m.nexdata.ai
    • nexdata.ai
    Updated May 1, 2025
    Cite
    Nexdata (2025). 14,511 Images English Handwriting OCR Data [Dataset]. https://m.nexdata.ai/datasets/ocr/1215?source=Kaggle
    Dataset updated
    May 1, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Device, Accuracy, Data size, Data format, Data content, Photographic angle, Collecting environment, Population distribution, Nationality distribution
    Description

    14,511 Images English Handwriting OCR Data. The text carriers are A4 paper, lined paper, English paper, etc. The device is a cellphone, and the collection angle is eye level. The dataset content includes English compositions, poetry, prose, news, stories, etc. The data is annotated with line-level quadrilateral bounding boxes and transcriptions of the text. The dataset can be used for tasks such as English handwriting OCR.
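The listing does not describe the file format of the quadrilateral annotations. As a sketch only, assuming each text line is given as four (x, y) corner points listed in order around the shape, the snippet below shows two routine operations on such a box: enclosing it in an axis-aligned rectangle and computing its area with the shoelace formula. The coordinates are invented for the example.

```python
def quad_to_bbox(quad):
    """Smallest axis-aligned box (xmin, ymin, xmax, ymax) enclosing the quad."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)


def quad_area(quad):
    """Area via the shoelace formula; corners must be listed in order."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(quad, quad[1:] + quad[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


if __name__ == "__main__":
    # Hypothetical slightly skewed text line.
    line = [(10, 20), (110, 25), (108, 60), (8, 55)]
    print(quad_to_bbox(line), quad_area(line))
```

Many detectors expect axis-aligned boxes, so the first helper is a common lossy conversion; the area ratio between the two shapes gives a quick measure of how skewed a line is.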

  6. 14,980 Images PPT OCR Data of 8 Languages

    • m.nexdata.ai
    • nexdata.ai
    Updated May 6, 2025
    Cite
    Nexdata (2025). 14,980 Images PPT OCR Data of 8 Languages [Dataset]. https://m.nexdata.ai/datasets/ocr/979?source=Kaggle
    Dataset updated
    May 6, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Device, Accuracy, Data size, Data format, Data diversity, Language types, Data environment, Collecting angles, Annotation content
    Description

    14,980 Images PPT OCR Data of 8 Languages. This dataset covers 8 languages, multiple scenes, and a range of photographic angles, photographic distances, and lighting conditions. The data is annotated with line-level quadrilateral bounding boxes and transcriptions of the text. The dataset can be used for tasks such as multi-language OCR.

  7. 105,941 Images Natural Scenes OCR Data of 12 Languages

    • m.nexdata.ai
    • nexdata.ai
    Updated Apr 5, 2025
    Cite
    Nexdata (2025). 105,941 Images Natural Scenes OCR Data of 12 Languages [Dataset]. https://m.nexdata.ai/datasets/ocr/1064?source=Kaggle
    Dataset updated
    Apr 5, 2025
    Dataset provided by
    nexdata technology inc
    Nexdata
    Authors
    Nexdata
    Variables measured
    Device, Accuracy, Data size, Diversity, Image parameter, Annotation content, Collecting environment
    Description

    105,941 Images Natural Scenes OCR Data of 12 Languages. The data covers 12 languages (6 Asian languages, 6 European languages), multiple natural scenes, and multiple photographic angles. The data is annotated with line-level quadrilateral bounding boxes and transcriptions of the text. The data can be used for tasks such as multi-language OCR.

  8. Vietnamese Receipts MC_OCR 2021

    • kaggle.com
    zip
    Updated Apr 8, 2022
    Cite
    DoMixi1989 (2022). Vietnamese Receipts MC_OCR 2021 [Dataset]. https://www.kaggle.com/datasets/domixi1989/vietnamese-receipts-mc-ocr-2021
    Available download formats: zip (2271709772 bytes)
    Dataset updated
    Apr 8, 2022
    Authors
    DoMixi1989
    Description

    Dataset

    This dataset was created by DoMixi1989

    Contents

  9. OCR Telugu Image Dataset

    • ieee-dataport.org
    Updated Dec 8, 2023
    Cite
    Kadavakollu Rao (2023). OCR Telugu Image Dataset [Dataset]. https://ieee-dataport.org/documents/ocr-telugu-image-dataset
    Dataset updated
    Dec 8, 2023
    Authors
    Kadavakollu Rao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The choice of the dataset is the key for OCR systems. Unfortunately

  10. Fire Kaggle Annotation Dataset

    • universe.roboflow.com
    zip
    Updated Jan 2, 2024
    Cite
    ocr (2024). Fire Kaggle Annotation Dataset [Dataset]. https://universe.roboflow.com/ocr-knfae/fire-dataset-kaggle-annotation/model/1
    Available download formats: zip
    Dataset updated
    Jan 2, 2024
    Dataset authored and provided by
    ocr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Fire Smoke 4BLg Bounding Boxes
    Description

    Fire Dataset Kaggle Annotation

    ## Overview
    
    Fire Dataset Kaggle Annotation is a dataset for object detection tasks - it contains Fire Smoke 4BLg annotations for 358 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. Handwriting OCR Data of Japanese and Korean

    • kaggle.com
    Updated Oct 13, 2023
    + more versions
    Cite
    Frank Wong (2023). Handwriting OCR Data of Japanese and Korean [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/handwriting-ocr-data-of-japanese-and-korean/data
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Frank Wong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset was collected from 100 subjects: 50 Japanese, 49 Koreans, and 1 Afghan. The corpus differs from subject to subject. Data diversity covers multiple cellphone models and different corpora. The dataset can be used for tasks such as Japanese and Korean handwriting OCR. For more details, please visit: https://www.nexdata.ai/datasets/ocr/127?source=Kaggle

    Specifications

    • Data size: 100 people; the total number of handwriting pieces is 22,163, with at least 159 handwriting pieces for each subject
    • Nationality distribution: 50 Japanese, 49 Koreans and 1 Afghan
    • Gender distribution: males
    • Age distribution: young and middle-aged people are the majority
    • Data diversity: multiple cellphone models, different corpus
    • Device: cellphone
    • Data format: .json
    • Annotation content: text content, age, nationality, trace of handwriting
    • Accuracy: the annotation accuracy is not less than 95%

    Get the Dataset: This is just a sample of the data. To access more sample data or request the price, contact us at info@nexdata.ai

  12. PashtoOCR

    • huggingface.co
    Updated May 18, 2025
    Cite
    Zirak (2025). PashtoOCR [Dataset]. https://huggingface.co/datasets/zirak-ai/PashtoOCR
    Dataset updated
    May 18, 2025
    Dataset authored and provided by
    Zirak
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    PsOCR - Pashto OCR Dataset

    PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

    The dataset is also available at: https://www.kaggle.com/datasets/drijaz/PashtoOCR

    Introduction

    PsOCR is a large-scale synthetic dataset for Optical Character Recognition in low-resource Pashto… See the full description on the dataset page: https://huggingface.co/datasets/zirak-ai/PashtoOCR.

  13. DECIMER Image classifier dataset

    • data.niaid.nih.gov
    Updated Jul 9, 2022
    Cite
    M. Isabel agea (2022). DECIMER Image classifier dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6670745
    Dataset updated
    Jul 9, 2022
    Dataset authored and provided by
    M. Isabel agea
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Images dataset divided into train (10905114 images), validation (2115528 images) and test (544946 images) folders containing a balanced number of images for two classes (chemical structures and non-chemical structures).

    The chemical structures were generated by applying RanDepict to randomly picked compounds from the ChEMBL30 database and the COCONUT database.

    The non-chemical structures were generated using Python or they were retrieved from several public datasets:

    COCO dataset, MIT Places-205 dataset, Visual Genome dataset, Google Open labeled Images, MMU-OCR-21 (kaggle), HandWritten_Character (kaggle), CoronaHack -Chest X-Ray-dataset (kaggle), PANDAS Augmented Images (kaggle), Bacterial_Colony (kaggle), Ceylon Epigraphy Periods (kaggle), Chinese Calligraphy Styles by Calligraphers (kaggle), Graphs Dataset (kaggle), Function_Graphs Polynomial (kaggle), sketches (kaggle), Person Face Sketches (kaggle), Art Pictograms (kaggle), Russian handwritten letters (kaggle), Handwritten Russian Letters (kaggle), Covid-19 Misinformation Tweets Labeled Dataset (kaggle) and grapheme-imgs-224x224 (kaggle).

    This data was used to build a CNN classification model by fine-tuning an EfficientNetB0 base model. The model is available on GitHub.

  14. Scanned Czech Receipts Dataset

    • kaggle.com
    Updated Jul 2, 2025
    Cite
    David Jansa (2025). Scanned Czech Receipts Dataset [Dataset]. https://www.kaggle.com/datasets/davidjansa/scanned-czech-receipts-dataset
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    David Jansa
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains 175 flatbed-scanned Czech receipts, each labeled from 001 to 175. The dataset includes real-world variability, such as faded or dark receipts (marked with a "b" in the filename, e.g. 014b.jpg).

    File Descriptions

    The dataset is organized into three directories:

    scans/ Contains JPEG images of scanned receipts. Some images are dark or have lower contrast, simulating real-world scanning scenarios.

    ocr_target/ Contains .txt files with a line-by-line literal transcription of each receipt, suitable for OCR model evaluation.

    segment_target/ Contains .json files with structured information extracted from each receipt. Each JSON file captures key details, such as store name, purchase date, currency, and itemized product data (including discounts). Product data DO NOT include duplicates. (Maybe I will update the segment_target dataset in the future to include duplicated product names as well...)

    Each .json file in segment_target/ follows this schema:

      {
        "company": "tesco",
        "date": "26.07.2024",
        "currency": "czk",
        "products": {
          "madeta cottage 150 g": 29.9,
          "raj.cel.lou400g/240g": 39.9,
          "cc raj.cel.lou400g/2": -20,
          "cc madeta cottage 15": -40
        }
      }

    company: Name of the store or seller (e.g., "tesco") in lowercase.

    date: Date of purchase in DD.MM.YYYY format.

    currency: Transaction currency (e.g., "czk") in lowercase.

    products: Key-value pairs of product names (lowercase) and their prices. Discounts are represented as negative values.

    Warning: Some fields may contain null if the data could not be extracted reliably.
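A minimal sketch of reading one segment_target/ file under the schema above. The function name and the summary fields it returns are illustrative choices, not part of the dataset; the sample record is the one from the schema. Null fields are tolerated, as the warning suggests they can occur.

```python
import json
from datetime import date, datetime

# The example record from the schema above.
SAMPLE = """{
  "company": "tesco",
  "date": "26.07.2024",
  "currency": "czk",
  "products": {
    "madeta cottage 150 g": 29.9,
    "raj.cel.lou400g/240g": 39.9,
    "cc raj.cel.lou400g/2": -20,
    "cc madeta cottage 15": -40
  }
}"""


def summarize_receipt(raw_json):
    """Parse one receipt and split product prices from discounts."""
    data = json.loads(raw_json)
    products = data.get("products") or {}
    # Skip entries whose price is null; negative values are discounts.
    prices = [v for v in products.values() if v is not None]
    raw_date = data.get("date")
    return {
        "company": data.get("company"),
        "date": datetime.strptime(raw_date, "%d.%m.%Y").date() if raw_date else None,
        "currency": data.get("currency"),
        "items_total": round(sum(v for v in prices if v >= 0), 2),
        "discount_total": round(sum(v for v in prices if v < 0), 2),
    }


if __name__ == "__main__":
    print(summarize_receipt(SAMPLE))
```

Because duplicate product names are not represented in the files, the item total may understate the amount actually paid on a receipt.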

  15. OCR image data of Korean documents

    • kaggle.com
    Updated Jun 13, 2025
    Cite
    Appen Limited (2025). OCR image data of Korean documents [Dataset]. https://www.kaggle.com/datasets/appenlimited/ocr-image-data-of-korean-documents
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Appen Limited
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    For the complete dataset or more information, please email commercialproduct@appen.com

    This dataset product can be used in many AI pilot projects and can supplement production models alongside other data. It can improve model performance cost-effectively, and it is a good option when time and budget are limited. The Appen database team provides a large number of database products, such as ASR, TTS, video, text, and image data, and is constantly building new datasets to expand its resources. The team strives to deliver as quickly as possible to meet the needs of global customers.

    This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including text in various languages, special characters, and numbers. The required accuracy rate is over 99% (both position and content are correct). The images include the following categories:

    • RECEIPT
    • IDCARD
    • TRADE
    • TABLE
    • WHITEBOARD
    • NEWSPAPER
    • THESIS
    • CARD
    • NOTE
    • CONTRACT
    • BOOKCONTENT
    • HANDWRITING

    1. Data Specification

    • Usage cases: image label recognition training
    • Collecting device: mobile phone / camera
    • Collecting environment: multiple lighting environments

    Quantity per category for each "<Language> Document OCR Images" database:

    Category | Korean | Vietnamese | Spanish | French | Thai | Japanese | Indonesian | Tamil | Burmese
    RECEIPT | 1500 | 337 | 1500 | 300 | 1500 | 1586 | 1500 | 356 | 300
    IDCARD | 500 | 100 | 500 | 100 | 500 | 500 | 500 | 98 | 100
    TRADE | 1012 | 227 | 1000 | 200 | 1000 | 1000 | 1003 | 475 | 200
    TABLE | 512 | 100 | 500 | 100 | 537 | 552 | 500 | 532 | 117
    WHITEBOARD | 500 | 111 | 500 | 100 | 500 | 500 | 501 | 501 | 110
    NEWSPAPER | 500 | 100 | 500 | 100 | 500 | 500 | 502 | 500 | 108
    THESIS | 500 | 100 | 500 | 103 | 500 | 509 | 500 | 500 | 102
    CARD | 500 | 100 | 500 | 100 | 500 | 500 | 500 | 500 | 100
    NOTE | 499 | 100 | 500 | 100 | 500 | 500 | 500 | 501 | 120
    CONTRACT | 501 | 105 | 500 | 100 | 500 | 500 | 500 | 500 | 100
    BOOKCONTENT | 500 | 700 | 500 | 700 | 500 | 500 | 500 | 500 | 761
    TOTAL | 7024 | 2080 | 7000 | 2003 | 7037 | 7147 | 7006 | 4963 | 2118

    Handwritten datasets (category HANDWRITING):

    • English Handwritten Datasets: 2278
    • Chinese Handwritten Datasets: 11118
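The stated totals can be checked against the per-category counts with a few lines of arithmetic. The sketch below does this for two of the nine languages, using the Korean and Tamil figures listed above; the dictionary layout and function name are illustrative.

```python
# Per-category counts copied from the listing for two of the languages.
COUNTS = {
    "Korean": {
        "RECEIPT": 1500, "IDCARD": 500, "TRADE": 1012, "TABLE": 512,
        "WHITEBOARD": 500, "NEWSPAPER": 500, "THESIS": 500, "CARD": 500,
        "NOTE": 499, "CONTRACT": 501, "BOOKCONTENT": 500,
    },
    "Tamil": {
        "RECEIPT": 356, "IDCARD": 98, "TRADE": 475, "TABLE": 532,
        "WHITEBOARD": 501, "NEWSPAPER": 500, "THESIS": 500, "CARD": 500,
        "NOTE": 501, "CONTRACT": 500, "BOOKCONTENT": 500,
    },
}
STATED_TOTALS = {"Korean": 7024, "Tamil": 4963}


def check_totals(counts, stated):
    """Map each language to whether its categories sum to the stated total."""
    return {lang: sum(cats.values()) == stated[lang] for lang, cats in counts.items()}


if __name__ == "__main__":
    print(check_totals(COUNTS, STATED_TOTALS))
```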

    1. Information provided by database
    2. Data Format: .JPG
  16. student-enrollment

    • huggingface.co
    Updated Jul 20, 2025
    + more versions
    Cite
    Jina AI (2025). student-enrollment [Dataset]. https://huggingface.co/datasets/jinaai/student-enrollment
    Dataset updated
    Jul 20, 2025
    Dataset authored and provided by
    Jina AI
    Description

    Student Enrollment Document Retrieval

    This dataset is created from the original Kaggle Delaware Student Enrollment dataset. The charts are rendered and the queries are created using templates. The text_description column contains OCR text extracted from the images using EasyOCR. This particular dataset is a subsample of at most 1000 random rows from the full dataset, which can be found here.

    Disclaimer

    This dataset may contain publicly available images or text data. All… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/student-enrollment.

  17. Passport OCR

    • kaggle.com
    Updated May 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    PhuDucNguyen108 (2025). Passport OCR [Dataset]. https://www.kaggle.com/datasets/phuducnguyen108/passport-ocr
    Dataset updated
    May 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    PhuDucNguyen108
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by PhuDucNguyen108

    Released under MIT

    Contents

  18. image-ocr-data

    • kaggle.com
    Updated Mar 23, 2022
    Cite
    gechengze (2022). image-ocr-data [Dataset]. https://www.kaggle.com/datasets/gechengze/image-ocr-data/suggestions
    Dataset updated
    Mar 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    gechengze
    Description

    Dataset

    This dataset was created by gechengze

    Contents

  19. DATA ANALYSIS OF DATETIME-BASED OCR dataset

    • kaggle.com
    Updated Jul 11, 2025
    Cite
    Ivanna Lin (2025). DATA ANALYSIS OF DATETIME-BASED OCR dataset [Dataset]. https://www.kaggle.com/datasets/ivannalin/data-analysis-of-datetime-based-ocr-dataset
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ivanna Lin
    Description

    Timestamped Surveillance Video Datasets for OCR

    This repository hosts datasets used in the project: DATA ANALYSIS OF DATETIME BASED OCR. These datasets are derived from surveillance videos embedded with overlay text showing the date and time of recording in YYYY-MM-DD and HH:MM:SS formats, respectively. The datasets are intended for use in OCR (Optical Character Recognition) training and evaluation, particularly in timestamp recognition tasks.

    Dataset Overview

    Dataset | Date Captured | Time Span | Dimensions (px) | File Size Range | Duration
    1 | 25 October 2024 | 14:34:20 – 21:02:35 | 457 × 55 | 3–11 KB | ~7 hours
    2 | 19 October 2023 | 11:54:09 – 21:12:47 | 224 × 25 | 1–4 KB | ~9 hours
    3 | 10 January 2024 | 00:05:45 – 23:58:45 | 420 × 50 | 2–8 KB | ~24 hours

    Each image is a cropped region containing the timestamp overlay extracted from a video frame. The datasets include various degrees of corruption, camera motion, and resolution to reflect real-world surveillance conditions.

    Dataset Collection Pipeline

    Video Processing Details

    • Video format: MPEG-2 Transport Stream (.ts)
    • Frame sampling: Based on video frame rate (e.g., 25 fps)
    • Cropping: Region of interest (ROI) defined per dataset. Example (Dataset 1): [left=1438, top=15, right=1895, bottom=70]

    Each frame in the video is read at specified intervals, cropped using predefined coordinates, and saved in .jpg format to the corresponding dataset folder.
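The cropping and sampling step can be sketched without any video library. The snippet below assumes a frame is a list of pixel rows and that one frame is kept per second of video; the one-second interval and the function names are assumptions, since the page only says frames are read "at specified intervals". The ROI is the Dataset 1 example above.

```python
# ROI for Dataset 1 as given above: (left, top, right, bottom).
DATASET_1_ROI = (1438, 15, 1895, 70)


def roi_size(roi):
    """Width and height in pixels of a (left, top, right, bottom) region."""
    left, top, right, bottom = roi
    return right - left, bottom - top


def sampled_frame_indices(total_frames, fps, every_seconds=1.0):
    """Indices of frames to keep when sampling once per interval (assumed 1 s)."""
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


def crop(frame, roi):
    """Crop a frame stored as a list of pixel rows to the given region."""
    left, top, right, bottom = roi
    return [row[left:right] for row in frame[top:bottom]]


if __name__ == "__main__":
    print(roi_size(DATASET_1_ROI))          # size of the saved crop
    print(sampled_frame_indices(100, 25))   # frames kept from a 4-second clip
```

With a real decoder such as OpenCV, the same slicing applies to the frame array, rows first and then columns.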

    Ground Truth Generation

    OCR-based timestamp labelling was performed semi-automatically using PaddleOCR with the following setup:

    • Language model: English (lang='en')

    Cleaning & Validation

    • Timestamps are extracted using regex matching.
    • Each extracted timestamp is validated against the folder name's inferred date (e.g., 20240110 → 2024-01-10).
    • Low-confidence results (< 0.7) are flagged for manual inspection.
    • Metadata includes:

      • filename
      • timestamp
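The cleaning rules above can be sketched as follows. The regular expression, the function names, and the choice to flag a date mismatch for review are assumptions; the page only states that regex matching is used, that the date is checked against the folder name, and that confidence below 0.7 triggers manual inspection. The OCR text and confidence are taken as plain inputs here, so no PaddleOCR call is shown.

```python
import re
from datetime import datetime

# Assumed pattern for "YYYY-MM-DD HH:MM:SS" overlay text.
TIMESTAMP_RE = re.compile(r"(\d{4}-\d{2}-\d{2})\s*(\d{2}:\d{2}:\d{2})")
CONFIDENCE_THRESHOLD = 0.7  # from the rule above


def folder_date(folder_name):
    """Infer the date from a folder name such as '20240110'."""
    return datetime.strptime(folder_name, "%Y%m%d").date()


def validate(ocr_text, confidence, folder_name):
    """Return (timestamp or None, needs_manual_review)."""
    match = TIMESTAMP_RE.search(ocr_text)
    if match is None:
        return None, True
    try:
        stamp = datetime.strptime(f"{match.group(1)} {match.group(2)}",
                                  "%Y-%m-%d %H:%M:%S")
    except ValueError:  # e.g. month 13 misread by OCR
        return None, True
    date_ok = stamp.date() == folder_date(folder_name)
    return stamp, confidence < CONFIDENCE_THRESHOLD or not date_ok


if __name__ == "__main__":
    print(validate("2024-01-10 00:05:45", 0.95, "20240110"))
```

Each accepted result would then be written out as a (filename, timestamp) row, matching the metadata fields listed above.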

    Output Format

    Ground truth results are available in:

    • CSV

    Filtering Datasets

    It is observed that the training losses for both models are considerably higher than the validation losses, which is uncommon behaviour. Suspecting that data quirks may be at play, the datasets are re-evaluated.

    Because of the nature of time, datetime data may be biased: time components of a higher order persist through the dataset for longer than those of a lower order. To ensure that this bias is not amplified further by consecutive frames with the same timestamp, all the datasets are filtered to remove images with duplicate timestamp values, keeping only the first occurrence. They are then re-split into training and testing sets with the same train-test ratio, without similar data being present in both. This is done so that the model trains on diverse text samples rather than repeated strings, which gives a better evaluation of the models' exploration and generalisation capabilities.
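The filtering described above amounts to first-occurrence deduplication followed by a fresh split. In the sketch below, records are (filename, timestamp) pairs; the 0.2 test ratio, the seeded shuffle, and the function names are assumptions, since the page does not state the ratio or how samples are assigned to each side.

```python
import random


def dedupe_first(records):
    """Keep only the first (filename, timestamp) record for each timestamp."""
    seen = set()
    kept = []
    for filename, timestamp in records:
        if timestamp not in seen:
            seen.add(timestamp)
            kept.append((filename, timestamp))
    return kept


def train_test_split(records, test_ratio=0.2, seed=0):
    """Shuffle and split; test_ratio and seed are illustrative defaults."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    frames = [("a.jpg", "00:00:01"), ("b.jpg", "00:00:01"), ("c.jpg", "00:00:02")]
    train, test = train_test_split(dedupe_first(frames))
    print(len(train), len(test))
```

Deduplicating before splitting is what guarantees that no timestamp appears on both sides of the split.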

    Filtered Dataset Descriptions

    Datasets 4.5.1 & 4.5.2 (Unfiltered)

    Dataset | Train Size | Test Size
    1 | 50,854 | 12,713
    2 | 112,816 | 28,204
    3 | 5,780 | 1,446

    Datasets 4.5.3 & 4.5.4 (Filtered)

    Filtered Dataset | Train Size | Test Size
    1 | 25,505 | 6,377
    2 | 56,718 | 14,180
    3 | 3,213 | 804

    The "datasets 4.5.1 & 4.5.2" and "datasets 4.5.3 & 4.5.4" refer to the same datasets used in the experiments detailed in sections 4.5.1–2 and 4.5.3–4 of the project, respectively. The latter group of datasets has undergone a filtering process to remove duplicate timestamp instances.

    Related Project

    For more information, please refer to the main repository: 👉 IvannaLin/DATA-ANALYSIS-OF-DATETIME-BASED-OCR

    Citation

    If you use this dataset in your research, please cite the associated source or contact the corresponding author.

  20. WikiDT - Table QA and Visual(OCR)-based TableQA

    • kaggle.com
    Updated Jul 15, 2022
    Cite
    WikiDocument Dataset (2022). WikiDT - Table QA and Visual(OCR)-based TableQA [Dataset]. https://www.kaggle.com/datasets/wikidocumentdataset/questionanswering
    Dataset updated
    Jul 15, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    WikiDocument Dataset
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by WikiDocument Dataset

    Released under CC BY-SA 3.0

    Contents

Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2

OCR Document Text Recognition Dataset

Photos of the documents and text - OCR dataset

Dataset updated
Sep 7, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Training Data
License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically

Description

OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on TrainingData.

The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.


Dataset structure

  • images - contains the original images of documents
  • boxes - includes bounding box labeling for the original images
  • annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

Data Format

Each image in the images folder has a corresponding XML annotation in the annotations.xml file, which gives the coordinates of the bounding boxes and the labels for text detection. Each point is given by its x and y coordinates.

Labels for the text:

  • "Text Title" - corresponds to titles, the box is red
  • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
  • "Table" - corresponds to the table, the box is green
  • "Handwritten" - corresponds to handwritten text, the box is purple
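A minimal sketch of reading such annotations with Python's standard library follows. The XML layout used here (an `image` element with a `name` attribute, containing shapes with `label` and `points="x1,y1;x2,y2;..."` attributes) is an assumption made for illustration, as are the sample file name and coordinates; the real annotations.xml may be structured differently. Only the label names and box colours come from the description above.

```python
import xml.etree.ElementTree as ET

# Label-to-colour mapping as listed in the dataset description.
LABEL_COLOURS = {
    "Text Title": "red",
    "Text Paragraph": "blue",
    "Table": "green",
    "Handwritten": "purple",
}

# Hypothetical sample; the real file layout is not confirmed by the source.
SAMPLE_XML = """<annotations>
  <image name="images/0.png" width="1000" height="1400">
    <polygon label="Text Title" points="100.0,50.0;900.0,50.0;900.0,120.0;100.0,120.0"/>
    <polygon label="Handwritten" points="120.0,900.0;600.0,900.0;600.0,980.0;120.0,980.0"/>
  </image>
</annotations>"""

def parse_annotations(xml_text):
    """Return {image name: [(label, [(x, y), ...]), ...]}."""
    root = ET.fromstring(xml_text)
    result = {}
    for image in root.iter("image"):
        shapes = []
        for shape in image:
            points_attr = shape.get("points")
            if not points_attr:
                continue  # skip child elements that carry no coordinates
            points = [
                tuple(float(value) for value in pair.split(","))
                for pair in points_attr.split(";")
            ]
            shapes.append((shape.get("label"), points))
        result[image.get("name")] = shapes
    return result

parsed = parse_annotations(SAMPLE_XML)
for name, shapes in parsed.items():
    for label, points in shapes:
        print(name, label, LABEL_COLOURS.get(label, "unknown"), points[0])
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the element and attribute names adjusted to match the actual file.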

Example of XML file structure

[Image: example of the XML file structure] https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F38e02db515561a30e29faca9f5b176b0%2Fcarbon.png?generation=1691058761924879&alt=media

A Text Detection in the Documents dataset can be prepared in accordance with your requirements.

Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
