Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.
The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.
The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.
![Dataset example](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F6986071a88d8a9829fee98d5b49d9ff8%2FMacBook%20Air%20-%201%20(1).png?generation=1691059158337136&alt=media)
Each image from the `images` folder is accompanied by an XML annotation in the `annotations.xml` file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.
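As a rough illustration, the per-image boxes can be extracted with the standard library. The element and attribute names below (`image`, `box`, `xtl`, `ytl`, `xbr`, `ybr`) assume a CVAT-style `annotations.xml` and may differ from the dataset's actual schema:

```python
import xml.etree.ElementTree as ET

def load_boxes(xml_source):
    """Parse bounding boxes from a CVAT-style annotations.xml.

    Assumes each <image> element contains <box> children whose corner
    coordinates live in xtl/ytl/xbr/ybr attributes -- adjust to the
    actual schema of the file.
    """
    if isinstance(xml_source, str):
        root = ET.fromstring(xml_source)
    else:
        root = ET.parse(xml_source).getroot()
    boxes = {}
    for image in root.iter("image"):
        boxes[image.get("name")] = [
            {
                "label": box.get("label"),
                "xtl": float(box.get("xtl")),
                "ytl": float(box.get("ytl")),
                "xbr": float(box.get("xbr")),
                "ybr": float(box.get("ybr")),
            }
            for box in image.iter("box")
        ]
    return boxes
```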
![Annotation example](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F38e02db515561a30e29faca9f5b176b0%2Fcarbon.png?generation=1691058761924879&alt=media)
keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a preprocessed version of the English Monograph subset from the ICDAR 2017 OCR Post-Correction competition. It contains OCR-generated text alongside its corresponding aligned ground truth, making it useful for OCR error detection and correction tasks.
The dataset consists of historical English texts that were processed using OCR technology. Due to OCR errors, the text contains misrecognized characters, missing words, and other inaccuracies. This dataset provides both raw OCR output and gold-standard corrected text.
This dataset is ideal for:
- OCR Error Detection & Correction
- Training Character-Based Machine Translation Models
- Natural Language Processing (NLP) on Historical Texts
Typical OCR confusions include:
- 1 → I
- tbe → the
- tho → the
- aud → and
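As a minimal sketch, confusion pairs like these could drive a naive token-level post-correction pass. The table below is hypothetical and covers only whole-word confusions, not character-level ones such as 1 → I; real post-OCR correction systems use context-aware models:

```python
# Hypothetical confusion table built from pairs like those listed above.
CONFUSIONS = {"tbe": "the", "tho": "the", "aud": "and"}

def correct(text: str) -> str:
    # Replace whole tokens only; anything not in the table passes through.
    return " ".join(CONFUSIONS.get(token, token) for token in text.split())
```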
If you use this dataset, please cite the original ICDAR 2017 OCR Post-Correction paper:
Chiron, G., Doucet, A., Coustaty, M., Moreux, J.P. (2017). ICDAR 2017 Competition on Post-OCR Text Correction.
CC0 1.0 Universal (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
The images come from this Kaggle dataset. I cropped 209 license plates using the original bounding boxes, and using LabelImg I labelled all the single characters, creating a total of 2,026 character bounding boxes. Every image comes with an .xml annotation file of the same name; the format used is PascalVOC.
Inside count.txt you can find the total occurrences of each character.
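Assuming standard PascalVOC annotations (each `<object>` carrying a `<name>` with the character label), the per-character counts summarised in count.txt could be reproduced with a sketch like:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_characters(annotation_xmls):
    """Tally character labels across PascalVOC annotation documents.

    Each <object> in a PascalVOC file holds a <name> (here, a single
    character) and a <bndbox>; only the names are needed for counting.
    """
    counts = Counter()
    for xml_text in annotation_xmls:
        root = ET.fromstring(xml_text)
        for obj in root.iter("object"):
            counts[obj.findtext("name")] += 1
    return counts
```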
This dataset was created by Vaibhav
14,511 Images English Handwriting OCR Data. The text carriers are A4 paper, lined paper, English paper, etc. The capture device is a cellphone, and the collection angle is eye level. The dataset content includes English compositions, poetry, prose, news, stories, etc. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The dataset can be used for tasks such as English handwriting OCR.
14,980 Images PPT OCR Data of 8 Languages. This dataset covers 8 languages and multiple scenes, with different photographic angles, distances, and light conditions. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The dataset can be used for tasks such as multilingual OCR.
105,941 Images Natural Scenes OCR Data of 12 Languages. The data covers 12 languages (6 Asian, 6 European), multiple natural scenes, and multiple photographic angles. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The data can be used for tasks such as multilingual OCR.
This dataset was created by DoMixi1989
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The choice of dataset is key for OCR systems. Unfortunately…
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Fire Dataset Kaggle Annotation is a dataset for object detection tasks - it contains Fire Smoke 4BLg annotations for 358 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description: This dataset was collected from 100 subjects: 50 Japanese, 49 Koreans, and 1 Afghan. The corpus differs between subjects. The data diversity covers multiple cellphone models and different corpora. This dataset can be used for tasks such as Japanese and Korean handwriting OCR. For more details, please visit: https://www.nexdata.ai/datasets/ocr/127?source=Kaggle
Specifications
- Data size: 100 people; 22,163 handwriting pieces in total; at least 159 pieces per subject
- Nationality distribution: 50 Japanese, 49 Koreans, 1 Afghan
- Gender distribution: males
- Age distribution: mostly young and middle-aged
- Data diversity: multiple cellphone models, different corpora
- Device: cellphone
- Data format: .json
- Annotation content: text content, age, nationality, trace of handwriting
- Accuracy: annotation accuracy is not less than 95%
Get the Dataset This is just an example of the data. To access more sample data or request the price, contact us at info@nexdata.ai
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
PsOCR - Pashto OCR Dataset
Zirak.ai | HuggingFace | GitHub | Kaggle | Paper
PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language The dataset is also available at: https://www.kaggle.com/datasets/drijaz/PashtoOCR
Introduction
PsOCR is a large-scale synthetic dataset for Optical Character Recognition in low-resource Pashto… See the full description on the dataset page: https://huggingface.co/datasets/zirak-ai/PashtoOCR.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Images dataset divided into train (10,905,114 images), validation (2,115,528 images) and test (544,946 images) folders, each containing a balanced number of images for two classes (chemical structures and non-chemical structures).
The chemical structures were generated with RanDepict from randomly picked compounds in the ChEMBL30 and COCONUT databases.
The non-chemical structures were generated using Python or retrieved from several public datasets:
COCO dataset, MIT Places-205 dataset, Visual Genome dataset, Google Open labeled Images, MMU-OCR-21 (kaggle), HandWritten_Character (kaggle), CoronaHack -Chest X-Ray-dataset (kaggle), PANDAS Augmented Images (kaggle), Bacterial_Colony (kaggle), Ceylon Epigraphy Periods (kaggle), Chinese Calligraphy Styles by Calligraphers (kaggle), Graphs Dataset (kaggle), Function_Graphs Polynomial (kaggle), sketches (kaggle), Person Face Sketches (kaggle), Art Pictograms (kaggle), Russian handwritten letters (kaggle), Handwritten Russian Letters (kaggle), Covid-19 Misinformation Tweets Labeled Dataset (kaggle) and grapheme-imgs-224x224 (kaggle).
This data was used to build a CNN classification model by fine-tuning EfficientNetB0 as the base model. The model is available on GitHub.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains 175 flatbed-scanned Czech receipts, each labeled from 001 to 175. The dataset includes real-world variability, such as faded or dark receipts (marked with a "b" in the filename, e.g. 014b.jpg).
The dataset is organized into three directories:
scans/
Contains JPEG images of scanned receipts. Some images are dark or have lower contrast, simulating real-world scanning scenarios.
ocr_target/
Contains .txt files with a line-by-line literal transcription of each receipt, suitable for OCR model evaluation.
segment_target/
Contains .json files with structured information extracted from each receipt. Each JSON file captures key details, such as store name, purchase date, currency, and itemized product data (including discounts). Product data do not include duplicates. (Maybe I will update the segment_target dataset in the future to include duplicated product names as well.)
Each .json file in segment_target/ follows this schema:
{
"company": "tesco",
"date": "26.07.2024",
"currency": "czk",
"products": {
"madeta cottage 150 g": 29.9,
"raj.cel.lou400g/240g": 39.9,
"cc raj.cel.lou400g/2": -20,
"cc madeta cottage 15": -40
}
}
- `company`: Name of the store or seller (e.g., "tesco"), in lowercase.
- `date`: Date of purchase in DD.MM.YYYY format.
- `currency`: Transaction currency (e.g., "czk"), in lowercase.
- `products`: Key-value pairs of product names (lowercase) and their prices. Discounts are represented as negative values.
Warning: Some fields may contain null if the data could not be extracted reliably.
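As a small worked example, the net receipt total can be recomputed from this schema (discounts being negative simply subtract). The helper below is a sketch that also skips `null` prices per the warning above:

```python
import json

def receipt_total(json_text: str) -> float:
    """Sum product prices from a segment_target JSON string.

    Discounts are stored as negative values, so a plain sum yields the
    net total; null prices (unreliable extractions) are skipped.
    """
    data = json.loads(json_text)
    prices = data["products"].values()
    return round(sum(p for p in prices if p is not None), 2)
```

Applied to the example JSON above, the net total is 29.9 + 39.9 - 20 - 40 = 9.8 CZK.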
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
For the complete dataset or more information, please email commercialproduct@appen.com
The dataset product can be used in many AI pilot projects and can supplement production models with other data, improving model performance cost-effectively. A ready-made dataset is an excellent solution when time and budget are limited. The Appen database team can provide a large number of database products, such as ASR, TTS, video, text, and image data, and is constantly building new datasets to expand resources. The team always strives to deliver as soon as possible to meet the needs of global customers. This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including texts in various languages, special characters, and numbers. The accuracy requirement is over 99% (both position and content correct). The images include the following categories:
- RECEIPT
- IDCARD
- TRADE
- TABLE
- WHITEBOARD
- NEWSPAPER
- THESIS
- CARD
- NOTE
- CONTRACT
- BOOKCONTENT
- HANDWRITING
Quantities per database and category:

Database | RECEIPT | IDCARD | TRADE | TABLE | WHITEBOARD | NEWSPAPER | THESIS | CARD | NOTE | CONTRACT | BOOKCONTENT | TOTAL |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1500 | 500 | 1012 | 512 | 500 | 500 | 500 | 500 | 499 | 501 | 500 | 7,024 |
2 | 337 | 100 | 227 | 100 | 111 | 100 | 100 | 100 | 100 | 105 | 700 | 2,080 |
3 | 1500 | 500 | 1000 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 7,000 |
4 | 300 | 100 | 200 | 100 | 100 | 100 | 103 | 100 | 100 | 100 | 700 | 2,003 |
5 | 1500 | 500 | 1000 | 537 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 7,037 |
6 | 1586 | 500 | 1000 | 552 | 500 | 500 | 509 | 500 | 500 | 500 | 500 | 7,147 |
7 | 1500 | 500 | 1003 | 500 | 501 | 502 | 500 | 500 | 500 | 500 | 500 | 7,006 |
8 | 356 | 98 | 475 | 532 | 501 | 500 | 500 | 500 | 501 | 500 | 500 | 4,963 |
9 | 300 | 100 | 200 | 117 | 110 | 108 | 102 | 100 | 120 | 100 | 761 | 2,118 |

Database | Category | Quantity |
---|---|---|
English Handwritten Datasets | HANDWRITING | 2,278 |
Chinese Handwritten Datasets | HANDWRITING | 11,118 |
Student Enrollment Document Retrieval
This dataset is created from the original Kaggle Delaware Student Enrollment dataset. The charts are rendered and the queries created using templates. The text_description column contains OCR text extracted from the images using EasyOCR. This particular dataset is a subsample of at most 1,000 random rows from the full dataset, which can be found here.
Disclaimer
This dataset may contain publicly available images or text data. All… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/student-enrollment.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by PhuDucNguyen108
Released under MIT
This dataset was created by gechengze
This repository hosts datasets used in the project: DATA ANALYSIS OF DATETIME BASED OCR. These datasets are derived from surveillance videos embedded with overlay text showing the date and time of recording in `YYYY-MM-DD` and `HH:MM:SS` formats, respectively. The datasets are intended for use in OCR (Optical Character Recognition) training and evaluation, particularly in timestamp recognition tasks.
Dataset | Date Captured | Time Span | Dimensions (px) | File Size Range | Duration |
---|---|---|---|---|---|
1 | 25 October 2024 | 14:34:20 – 21:02:35 | 457 × 55 | 3–11 KB | ~7 hours |
2 | 19 October 2023 | 11:54:09 – 21:12:47 | 224 × 25 | 1–4 KB | ~9 hours |
3 | 10 January 2024 | 00:05:45 – 23:58:45 | 420 × 50 | 2–8 KB | ~24 hours |
Each image is a cropped region containing the timestamp overlay extracted from a video frame. The datasets include various degrees of corruption, camera motion, and resolution to reflect real-world surveillance conditions.
Source videos are in `.ts` format, and the timestamp overlay is cropped at [left=1438, top=15, right=1895, bottom=70]. Each frame in the video is read at specified intervals, cropped using these predefined coordinates, and saved in `.jpg` format to the corresponding dataset folder.
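The cropping step itself reduces to an array slice over the given coordinates; in practice the frames would come from the `.ts` video via something like OpenCV's `VideoCapture`, which this sketch leaves out:

```python
import numpy as np

# Crop region for the timestamp overlay, as given above.
LEFT, TOP, RIGHT, BOTTOM = 1438, 15, 1895, 70

def crop_timestamp(frame: np.ndarray) -> np.ndarray:
    """Extract the timestamp region from a full H x W x C video frame."""
    return frame[TOP:BOTTOM, LEFT:RIGHT]
```

On a 1920 × 1080 frame this yields a 457 × 55 crop, matching the dimensions of dataset 1.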
OCR-based timestamp labelling was performed semi-automatically using PaddleOCR with English recognition (`lang='en'`). Recognised digit strings were normalised into the target formats (e.g. 20240110 → 2024-01-10).
Metadata includes:
Ground truth results are available in:
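The digit-string normalisation (e.g. 20240110 → 2024-01-10) amounts to slicing a cleaned digit run into date and time fields. This is a sketch of one plausible implementation, not the project's actual code:

```python
import re

def normalise_timestamp(raw: str):
    """Normalise a raw OCR digit string into YYYY-MM-DD [HH:MM:SS].

    Assumes OCR yields 8 date digits, optionally followed by 6 time
    digits; returns None when the string does not match either shape.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 8:
        return f"{digits[:4]}-{digits[4:6]}-{digits[6:8]}"
    if len(digits) == 14:
        d, t = digits[:8], digits[8:]
        return (f"{d[:4]}-{d[4:6]}-{d[6:8]} "
                f"{t[:2]}:{t[2:4]}:{t[4:6]}")
    return None
```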
It is observed that the training losses for both models are considerably higher than the validation losses, which is uncommon behaviour. Suspecting that data quirks may be at play, the datasets were re-evaluated.
Because of the nature of time, datetime data may present bias: higher-order time components (such as the year or month) persist through the dataset far longer than lower-order ones (such as seconds). To confirm that this bias is not amplified further by consecutive frames sharing the same timestamp, all the datasets are filtered to remove images with duplicate timestamp values, keeping only the first occurrence. They are then reallocated into training and testing sets with the same train-test ratio, without similar data being present in both. This ensures the model trains on diverse text samples rather than repeated words, better evaluating the models' exploration and generalisation capabilities.
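The filtering described above can be sketched as a keep-first de-duplication over timestamps followed by the usual split. The contiguous 80/20 split below is a simplification of whatever allocation the project actually used:

```python
def dedup_and_split(samples, test_ratio=0.2):
    """Drop duplicate-timestamp frames (keep first occurrence), then split.

    `samples` is a list of (timestamp, image_path) pairs in frame order;
    the contiguous split here is a hypothetical simplification.
    """
    seen, unique = set(), []
    for ts, path in samples:
        if ts not in seen:
            seen.add(ts)
            unique.append((ts, path))
    cut = int(len(unique) * (1 - test_ratio))
    return unique[:cut], unique[cut:]
```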
Dataset | Train Size | Test Size |
---|---|---|
1 | 50,854 | 12,713 |
2 | 112,816 | 28,204 |
3 | 5,780 | 1,446 |
Filtered Dataset | Train Size | Test Size |
---|---|---|
1 | 25,505 | 6,377 |
2 | 56,718 | 14,180 |
3 | 3,213 | 804 |
The "datasets 4.5.1 & 4.5.2" and "datasets 4.5.3 & 4.5.4" refer to the same datasets used in the experiments detailed in sections 4.5.1–2 and 4.5.3–4 of the project, respectively. The latter group has undergone a filtering process to remove duplicate timestamp instances.
For more information, please refer to the main repository: IvannaLin/DATA-ANALYSIS-OF-DATETIME-BASED-OCR
If you use this dataset in your research, please cite the associated source or contact the corresponding author.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset was created by WikiDocument Dataset
Released under CC BY-SA 3.0