Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.
The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.
The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.
![Dataset example](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F6986071a88d8a9829fee98d5b49d9ff8%2FMacBook%20Air%20-%201%20(1).png?generation=1691059158337136&alt=media)
Each image from the `images` folder is accompanied by an XML annotation in the `annotations.xml` file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.
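As a rough illustration, the per-image boxes can be extracted with the standard library. The element and attribute names below (`image`, `box`, `xtl`, `ytl`, `xbr`, `ybr`) assume a CVAT-style `annotations.xml` and may differ from the dataset's actual schema:

```python
import xml.etree.ElementTree as ET

def load_boxes(xml_source):
    """Parse bounding boxes from a CVAT-style annotations.xml.

    Assumes each <image> element contains <box> children whose corner
    coordinates live in xtl/ytl/xbr/ybr attributes -- adjust to the
    actual schema of the file.
    """
    if isinstance(xml_source, str):
        root = ET.fromstring(xml_source)
    else:
        root = ET.parse(xml_source).getroot()
    boxes = {}
    for image in root.iter("image"):
        boxes[image.get("name")] = [
            {
                "label": box.get("label"),
                "xtl": float(box.get("xtl")),
                "ytl": float(box.get("ytl")),
                "xbr": float(box.get("xbr")),
                "ybr": float(box.get("ybr")),
            }
            for box in image.iter("box")
        ]
    return boxes
```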
![Annotation example](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F38e02db515561a30e29faca9f5b176b0%2Fcarbon.png?generation=1691058761924879&alt=media)
keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a preprocessed version of the English Monograph subset from the ICDAR 2017 OCR Post-Correction competition. It contains OCR-generated text alongside its corresponding aligned ground truth, making it useful for OCR error detection and correction tasks.
The dataset consists of historical English texts that were processed using OCR technology. Due to OCR errors, the text contains misrecognized characters, missing words, and other inaccuracies. This dataset provides both raw OCR output and gold-standard corrected text.
This dataset is ideal for:
- OCR Error Detection & Correction
- Training Character-Based Machine Translation Models
- Natural Language Processing (NLP) on Historical Texts
Typical OCR confusions include:
- 1 → I
- tbe → the
- tho → the
- aud → and
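As a minimal sketch, confusion pairs like these could drive a naive token-level post-correction pass. The table below is hypothetical and covers only whole-word confusions, not character-level ones such as 1 → I; real post-OCR correction systems use context-aware models:

```python
# Hypothetical confusion table built from pairs like those listed above.
CONFUSIONS = {"tbe": "the", "tho": "the", "aud": "and"}

def correct(text: str) -> str:
    # Replace whole tokens only; anything not in the table passes through.
    return " ".join(CONFUSIONS.get(token, token) for token in text.split())
```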
If you use this dataset, please cite the original ICDAR 2017 OCR Post-Correction paper:
Chiron, G., Doucet, A., Coustaty, M., Moreux, J.P. (2017). ICDAR 2017 Competition on Post-OCR Text Correction.
CC0 1.0 Universal (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
The images come from this Kaggle dataset. I cropped 209 license plates using the original bounding boxes, and using LabelImg I labelled all the single characters, creating a total of 2,026 character bounding boxes. Every image comes with an .xml annotation file of the same name; the format used is PascalVOC.
Inside count.txt you can find the total occurrences of each character.
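Assuming standard PascalVOC annotations (each `<object>` carrying a `<name>` with the character label), the per-character counts summarised in count.txt could be reproduced with a sketch like:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_characters(annotation_xmls):
    """Tally character labels across PascalVOC annotation documents.

    Each <object> in a PascalVOC file holds a <name> (here, a single
    character) and a <bndbox>; only the names are needed for counting.
    """
    counts = Counter()
    for xml_text in annotation_xmls:
        root = ET.fromstring(xml_text)
        for obj in root.iter("object"):
            counts[obj.findtext("name")] += 1
    return counts
```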
This dataset was created by Vaibhav
14,511 Images English Handwriting OCR Data. The text carriers are A4 paper, lined paper, English paper, etc. The capture device is a cellphone, and the collection angle is eye level. The dataset content includes English compositions, poetry, prose, news, stories, etc. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The dataset can be used for tasks such as English handwriting OCR.
14,980 Images PPT OCR Data of 8 Languages. This dataset covers 8 languages and multiple scenes, with different photographic angles, distances, and light conditions. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The dataset can be used for tasks such as multilingual OCR.
105,941 Images Natural Scenes OCR Data of 12 Languages. The data covers 12 languages (6 Asian, 6 European), multiple natural scenes, and multiple photographic angles. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The data can be used for tasks such as multilingual OCR.
This dataset was created by DoMixi1989
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The choice of dataset is key for OCR systems. Unfortunately…
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Fire Dataset Kaggle Annotation is a dataset for object detection tasks - it contains Fire Smoke 4BLg annotations for 358 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description: This dataset was collected from 100 subjects: 50 Japanese, 49 Koreans, and 1 Afghan. The corpus differs between subjects. The data diversity covers multiple cellphone models and different corpora. This dataset can be used for tasks such as Japanese and Korean handwriting OCR. For more details, please visit: https://www.nexdata.ai/datasets/ocr/127?source=Kaggle
Specifications
- Data size: 100 people; 22,163 handwriting pieces in total; at least 159 pieces per subject
- Nationality distribution: 50 Japanese, 49 Koreans, 1 Afghan
- Gender distribution: males
- Age distribution: mostly young and middle-aged
- Data diversity: multiple cellphone models, different corpora
- Device: cellphone
- Data format: .json
- Annotation content: text content, age, nationality, trace of handwriting
- Accuracy: annotation accuracy is not less than 95%
Get the Dataset This is just an example of the data. To access more sample data or request the price, contact us at info@nexdata.ai
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
PsOCR - Pashto OCR Dataset
Zirak.ai | HuggingFace | GitHub | Kaggle | Paper
PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language The dataset is also available at: https://www.kaggle.com/datasets/drijaz/PashtoOCR
Introduction
PsOCR is a large-scale synthetic dataset for Optical Character Recognition in low-resource Pashto… See the full description on the dataset page: https://huggingface.co/datasets/zirak-ai/PashtoOCR.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Images dataset divided into train (10,905,114 images), validation (2,115,528 images) and test (544,946 images) folders, each containing a balanced number of images for two classes (chemical structures and non-chemical structures).
The chemical structures were generated with RanDepict from randomly picked compounds in the ChEMBL30 and COCONUT databases.
The non-chemical structures were generated using Python or retrieved from several public datasets:
COCO dataset, MIT Places-205 dataset, Visual Genome dataset, Google Open labeled Images, MMU-OCR-21 (kaggle), HandWritten_Character (kaggle), CoronaHack -Chest X-Ray-dataset (kaggle), PANDAS Augmented Images (kaggle), Bacterial_Colony (kaggle), Ceylon Epigraphy Periods (kaggle), Chinese Calligraphy Styles by Calligraphers (kaggle), Graphs Dataset (kaggle), Function_Graphs Polynomial (kaggle), sketches (kaggle), Person Face Sketches (kaggle), Art Pictograms (kaggle), Russian handwritten letters (kaggle), Handwritten Russian Letters (kaggle), Covid-19 Misinformation Tweets Labeled Dataset (kaggle) and grapheme-imgs-224x224 (kaggle).
This data was used to build a CNN classification model by fine-tuning EfficientNetB0 as the base model. The model is available on GitHub.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains 175 flatbed-scanned Czech receipts, each labeled from 001 to 175. The dataset includes real-world variability, such as faded or dark receipts (marked with a "b" in the filename, e.g. 014b.jpg).
The dataset is organized into three directories:
scans/
Contains JPEG images of scanned receipts. Some images are dark or have lower contrast, simulating real-world scanning scenarios.
ocr_target/
Contains .txt files with a line-by-line literal transcription of each receipt, suitable for OCR model evaluation.
segment_target/
Contains .json files with structured information extracted from each receipt. Each JSON file captures key details, such as store name, purchase date, currency, and itemized product data (including discounts). Product data do not include duplicates. (Maybe I will update the segment_target dataset in the future to include duplicated product names as well.)
Each .json file in segment_target/ follows this schema:
{
"company": "tesco",
"date": "26.07.2024",
"currency": "czk",
"products": {
"madeta cottage 150 g": 29.9,
"raj.cel.lou400g/240g": 39.9,
"cc raj.cel.lou400g/2": -20,
"cc madeta cottage 15": -40
}
}
- `company`: Name of the store or seller (e.g., "tesco"), in lowercase.
- `date`: Date of purchase in DD.MM.YYYY format.
- `currency`: Transaction currency (e.g., "czk"), in lowercase.
- `products`: Key-value pairs of product names (lowercase) and their prices. Discounts are represented as negative values.
Warning: Some fields may contain null if the data could not be extracted reliably.
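As a small worked example, the net receipt total can be recomputed from this schema (discounts being negative simply subtract). The helper below is a sketch that also skips `null` prices per the warning above:

```python
import json

def receipt_total(json_text: str) -> float:
    """Sum product prices from a segment_target JSON string.

    Discounts are stored as negative values, so a plain sum yields the
    net total; null prices (unreliable extractions) are skipped.
    """
    data = json.loads(json_text)
    prices = data["products"].values()
    return round(sum(p for p in prices if p is not None), 2)
```

Applied to the example JSON above, the net total is 29.9 + 39.9 - 20 - 40 = 9.8 CZK.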
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
For the complete dataset or more information, please email commercialproduct@appen.com
The dataset product can be used in many AI pilot projects and can supplement production models with other data, improving model performance cost-effectively. A ready-made dataset is an excellent solution when time and budget are limited. The Appen database team can provide a large number of database products, such as ASR, TTS, video, text, and image data, and is constantly building new datasets to expand resources. The team always strives to deliver as soon as possible to meet the needs of global customers. This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including texts in various languages, special characters, and numbers. The accuracy requirement is over 99% (both position and content correct). The images include the following categories:
- RECEIPT
- IDCARD
- TRADE
- TABLE
- WHITEBOARD
- NEWSPAPER
- THESIS
- CARD
- NOTE
- CONTRACT
- BOOKCONTENT
- HANDWRITING
Quantities per database and category:

Database | RECEIPT | IDCARD | TRADE | TABLE | WHITEBOARD | NEWSPAPER | THESIS | CARD | NOTE | CONTRACT | BOOKCONTENT | TOTAL |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1500 | 500 | 1012 | 512 | 500 | 500 | 500 | 500 | 499 | 501 | 500 | 7,024 |
2 | 337 | 100 | 227 | 100 | 111 | 100 | 100 | 100 | 100 | 105 | 700 | 2,080 |
3 | 1500 | 500 | 1000 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 7,000 |
4 | 300 | 100 | 200 | 100 | 100 | 100 | 103 | 100 | 100 | 100 | 700 | 2,003 |
5 | 1500 | 500 | 1000 | 537 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 7,037 |
6 | 1586 | 500 | 1000 | 552 | 500 | 500 | 509 | 500 | 500 | 500 | 500 | 7,147 |
7 | 1500 | 500 | 1003 | 500 | 501 | 502 | 500 | 500 | 500 | 500 | 500 | 7,006 |
8 | 356 | 98 | 475 | 532 | 501 | 500 | 500 | 500 | 501 | 500 | 500 | 4,963 |
9 | 300 | 100 | 200 | 117 | 110 | 108 | 102 | 100 | 120 | 100 | 761 | 2,118 |

Database | Category | Quantity |
---|---|---|
English Handwritten Datasets | HANDWRITING | 2,278 |
Chinese Handwritten Datasets | HANDWRITING | 11,118 |
Student Enrollment Document Retrieval
This dataset is created from the original Kaggle Delaware Student Enrollment dataset. The charts are rendered and the queries created using templates. The text_description column contains OCR text extracted from the images using EasyOCR. This particular dataset is a subsample of at most 1,000 random rows from the full dataset, which can be found here.
Disclaimer
This dataset may contain publicly available images or text data. All… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/student-enrollment.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by PhuDucNguyen108
Released under MIT
This dataset was created by gechengze
This repository hosts datasets used in the project: DATA ANALYSIS OF DATETIME BASED OCR. These datasets are derived from surveillance videos embedded with overlay text showing the date and time of recording in `YYYY-MM-DD` and `HH:MM:SS` formats, respectively. The datasets are intended for use in OCR (Optical Character Recognition) training and evaluation, particularly in timestamp recognition tasks.
Dataset | Date Captured | Time Span | Dimensions (px) | File Size Range | Duration |
---|---|---|---|---|---|
1 | 25 October 2024 | 14:34:20 – 21:02:35 | 457 × 55 | 3–11 KB | ~7 hours |
2 | 19 October 2023 | 11:54:09 – 21:12:47 | 224 × 25 | 1–4 KB | ~9 hours |
3 | 10 January 2024 | 00:05:45 – 23:58:45 | 420 × 50 | 2–8 KB | ~24 hours |
Each image is a cropped region containing the timestamp overlay extracted from a video frame. The datasets include various degrees of corruption, camera motion, and resolution to reflect real-world surveillance conditions.
Source videos are in `.ts` format, and the timestamp overlay is cropped at [left=1438, top=15, right=1895, bottom=70]. Each frame in the video is read at specified intervals, cropped using these predefined coordinates, and saved in `.jpg` format to the corresponding dataset folder.
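The cropping step itself reduces to an array slice over the given coordinates; in practice the frames would come from the `.ts` video via something like OpenCV's `VideoCapture`, which this sketch leaves out:

```python
import numpy as np

# Crop region for the timestamp overlay, as given above.
LEFT, TOP, RIGHT, BOTTOM = 1438, 15, 1895, 70

def crop_timestamp(frame: np.ndarray) -> np.ndarray:
    """Extract the timestamp region from a full H x W x C video frame."""
    return frame[TOP:BOTTOM, LEFT:RIGHT]
```

On a 1920 × 1080 frame this yields a 457 × 55 crop, matching the dimensions of dataset 1.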
OCR-based timestamp labelling was performed semi-automatically using PaddleOCR with English recognition (`lang='en'`). Recognised digit strings were normalised into the target formats (e.g. 20240110 → 2024-01-10).
Metadata includes:
Ground truth results are available in:
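The digit-string normalisation (e.g. 20240110 → 2024-01-10) amounts to slicing a cleaned digit run into date and time fields. This is a sketch of one plausible implementation, not the project's actual code:

```python
import re

def normalise_timestamp(raw: str):
    """Normalise a raw OCR digit string into YYYY-MM-DD [HH:MM:SS].

    Assumes OCR yields 8 date digits, optionally followed by 6 time
    digits; returns None when the string does not match either shape.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 8:
        return f"{digits[:4]}-{digits[4:6]}-{digits[6:8]}"
    if len(digits) == 14:
        d, t = digits[:8], digits[8:]
        return (f"{d[:4]}-{d[4:6]}-{d[6:8]} "
                f"{t[:2]}:{t[2:4]}:{t[4:6]}")
    return None
```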
It is observed that the training losses for both models are considerably higher than the validation losses, which is uncommon behaviour. Suspecting that data quirks may be at play, the datasets were re-evaluated.
Because of the nature of time, datetime data may present bias: higher-order time components (such as the year or month) persist through the dataset far longer than lower-order ones (such as seconds). To confirm that this bias is not amplified further by consecutive frames sharing the same timestamp, all the datasets are filtered to remove images with duplicate timestamp values, keeping only the first occurrence. They are then reallocated into training and testing sets with the same train-test ratio, without similar data being present in both. This ensures the model trains on diverse text samples rather than repeated words, better evaluating the models' exploration and generalisation capabilities.
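The filtering described above can be sketched as a keep-first de-duplication over timestamps followed by the usual split. The contiguous 80/20 split below is a simplification of whatever allocation the project actually used:

```python
def dedup_and_split(samples, test_ratio=0.2):
    """Drop duplicate-timestamp frames (keep first occurrence), then split.

    `samples` is a list of (timestamp, image_path) pairs in frame order;
    the contiguous split here is a hypothetical simplification.
    """
    seen, unique = set(), []
    for ts, path in samples:
        if ts not in seen:
            seen.add(ts)
            unique.append((ts, path))
    cut = int(len(unique) * (1 - test_ratio))
    return unique[:cut], unique[cut:]
```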
Dataset | Train Size | Test Size |
---|---|---|
1 | 50,854 | 12,713 |
2 | 112,816 | 28,204 |
3 | 5,780 | 1,446 |
Filtered Dataset | Train Size | Test Size |
---|---|---|
1 | 25,505 | 6,377 |
2 | 56,718 | 14,180 |
3 | 3,213 | 804 |
The "datasets 4.5.1 & 4.5.2" and "datasets 4.5.3 & 4.5.4" refer to the same datasets used in the experiments detailed in sections 4.5.1–2 and 4.5.3–4 of the project, respectively. The latter group has undergone a filtering process to remove duplicate timestamp instances.
For more information, please refer to the main repository: IvannaLin/DATA-ANALYSIS-OF-DATETIME-BASED-OCR
If you use this dataset in your research, please cite the associated source or contact the corresponding author.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset was created by WikiDocument Dataset
Released under CC BY-SA 3.0