12 datasets found

TextOCR-Dataset
huggingface.co
Updated Sep 9, 2021
Yunus Serhat Bıçakçı (2021). TextOCR-Dataset [Dataset]. https://huggingface.co/datasets/yunusserhat/TextOCR-Dataset
Dataset updated
Sep 9, 2021
Authors
Yunus Serhat Bıçakçı
Description
TextOCR Dataset

Version 0.1

Training Set

Word Annotations: 714,770 (272MB)
Images: 21,778 (6.6GB)

Validation Set

Word Annotations: 107,802 (39MB)
Images: 3,124

Test Set

Metadata: 1MB
Images: 3,232 (926MB)
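
As a rough sanity check on the split sizes listed above, the sketch below totals the images and computes the average number of word annotations per image. The figures are copied from this listing; the dictionary layout and the function names are illustrative assumptions and not part of the dataset itself.

```python
# Split statistics copied from the listing above.
# The dictionary layout is an illustrative assumption, not a dataset API.
SPLITS = {
    "train": {"images": 21_778, "word_annotations": 714_770},
    "val": {"images": 3_124, "word_annotations": 107_802},
    # The listing gives no word-annotation count for the test set.
    "test": {"images": 3_232, "word_annotations": None},
}


def total_images(splits):
    """Sum the image counts over all splits."""
    return sum(split["images"] for split in splits.values())


def words_per_image(split):
    """Average word annotations per image, or None if the count is not listed."""
    if split["word_annotations"] is None:
        return None
    return split["word_annotations"] / split["images"]


if __name__ == "__main__":
    total = total_images(SPLITS)
    for name, split in SPLITS.items():
        share = split["images"] / total
        density = words_per_image(split)
        density_text = "n/a" if density is None else f"{density:.1f}"
        print(f"{name}: {share:.1%} of images, {density_text} words per image")
```

Both annotated splits come out at a little over 30 words per image, which gives a feel for how text-dense the scenes are.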

General Information

License: Data is available under the CC BY 4.0 license.
Important Note: Numbers in the papers should be reported on the v0.1 test set.

Images

Training and validation set images are sourced… See the full description on the dataset page: https://huggingface.co/datasets/yunusserhat/TextOCR-Dataset.
textocr-gpt4v
huggingface.co
Updated Apr 3, 2015
+ more versions
Jimmy Carter (2015). textocr-gpt4v [Dataset]. https://huggingface.co/datasets/jimmycarter/textocr-gpt4v
Dataset updated
Apr 3, 2015
Authors
Jimmy Carter
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for TextOCR-GPT4V

Dataset Summary

TextOCR-GPT4V is Meta's TextOCR dataset, captioned using GPT4V with an emphasis on text OCR. To get the images, you will need to agree to their terms of service.

Supported Tasks

The TextOCR-GPT4V dataset is intended for generating benchmarks that compare an MLLM to GPT4V.

Languages

The caption languages are in English, while various texts in images are in many languages such as Spanish, Japanese… See the full description on the dataset page: https://huggingface.co/datasets/jimmycarter/textocr-gpt4v.
TextOCR
opendatalab.com
zip
Updated Apr 20, 2023
+ more versions
Facebook AI Research (2023). TextOCR [Dataset]. https://opendatalab.com/OpenDataLab/TextOCR
Available download formats: zip (9222216147 bytes)
Dataset updated
Apr 20, 2023
Dataset provided by
Facebook AI Research
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
TextOCR requires models to perform text recognition on arbitrarily shaped scene text in natural images. TextOCR provides ~1M high-quality word annotations on TextVQA images, allowing end-to-end reasoning to be applied to downstream tasks such as visual question answering or image captioning.
Textocr Dataset
universe.roboflow.com
zip
Updated Apr 9, 2022
Aryan Patel (2022). Textocr Dataset [Dataset]. https://universe.roboflow.com/aryan-patel/textocr-7vvkv/dataset/1
Available download formats: zip
Dataset updated
Apr 9, 2022
Dataset authored and provided by
Aryan Patel
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Text Bounding Boxes
Description
TextOCR

Overview

TextOCR is a dataset for object detection tasks. It contains Text annotations for 659 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
TextOCR-TextExtractionfromImagesDataset
huggingface.co
Updated Mar 26, 2025
Antoluis Hendry (2025). TextOCR-TextExtractionfromImagesDataset [Dataset]. https://huggingface.co/datasets/antokun/TextOCR-TextExtractionfromImagesDataset
Dataset updated
Mar 26, 2025
Authors
Antoluis Hendry
License
https://choosealicense.com/licenses/cc0-1.0/
Description
antokun/TextOCR-TextExtractionfromImagesDataset dataset hosted on Hugging Face and contributed by the HF Datasets community
OCR Document Text Recognition Dataset
kaggle.com
Updated Sep 7, 2023
+ more versions
Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
Dataset updated
Sep 7, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Training Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

Dataset structure

images - contains the original images of documents

boxes - includes bounding box labeling for the original images

annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

Data Format

Each image in the images folder has a corresponding XML annotation in the annotations.xml file, giving the coordinates of the bounding boxes and their labels for text detection. For each point, the x and y coordinates are provided.

Labels for the text:

"Text Title" - corresponds to titles, the box is red

"Text Paragraph" - corresponds to paragraphs of text, the box is blue

"Table" - corresponds to the table, the box is green

"Handwritten" - corresponds to handwritten text, the box is purple
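
The format described above could be read with a short parser along the following lines. This is only a sketch: the listing does not show the actual schema, so the element and attribute names (`image`, `polygon`, `label`, `points`), the `x,y;x,y` point encoding, and the sample document are all hypothetical. Only the four labels and their box colours come from the description.

```python
import xml.etree.ElementTree as ET

# Label-to-colour mapping taken from the description above.
LABEL_COLORS = {
    "Text Title": "red",
    "Text Paragraph": "blue",
    "Table": "green",
    "Handwritten": "purple",
}

# Hypothetical sample: the element names, attribute names, and point
# encoding are assumptions, not the dataset's documented schema.
SAMPLE_XML = """<annotations>
  <image name="0.png">
    <polygon label="Text Title" points="10.0,20.0;200.0,20.0;200.0,60.0;10.0,60.0"/>
    <polygon label="Handwritten" points="15.5,300.0;180.0,300.0;180.0,340.0;15.5,340.0"/>
  </image>
</annotations>"""


def parse_points(points_attr):
    """Turn 'x1,y1;x2,y2;...' into a list of (x, y) float tuples."""
    return [
        tuple(float(value) for value in pair.split(","))
        for pair in points_attr.split(";")
    ]


def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def load_annotations(xml_text):
    """Collect one record per annotated shape in the XML text."""
    root = ET.fromstring(xml_text)
    records = []
    for image in root.iter("image"):
        for shape in image.iter("polygon"):
            label = shape.get("label")
            points = parse_points(shape.get("points"))
            records.append({
                "image": image.get("name"),
                "label": label,
                "color": LABEL_COLORS.get(label),
                "bbox": bounding_box(points),
            })
    return records


if __name__ == "__main__":
    for record in load_annotations(SAMPLE_XML):
        print(record)
```

If the real annotations.xml uses different tag or attribute names, only the lookups in `load_annotations` and the splitting in `parse_points` would need to change.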

Example of XML file structure

[Image: example of the XML file structure]

Text detection in documents can be carried out in accordance with your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
A Curated List of Image Deblurring Datasets
kaggle.com
Updated Mar 28, 2023
Jishnu Parayil Shibu (2023). A Curated List of Image Deblurring Datasets [Dataset]. https://www.kaggle.com/datasets/jishnuparayilshibu/a-curated-list-of-image-deblurring-datasets
Dataset updated
Mar 28, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Jishnu Parayil Shibu
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, out-of-focus objects, etc. making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.

Despite the rapid advancement in image deblurring, finding and pre-processing multiple datasets for training and testing has been both time-consuming and unnecessarily complicated for experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets such as face and text deblurring datasets.

To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable Python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets with just 2-3 lines of code.

The following datasets are currently provided:

- GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720 that are divided into 2,103 training images and 1,111 test images.
- HIDE: HIDE is a motion-blurred dataset that includes 2,025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
- RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1,900 camera JPEG outputs. The second is RealBlur-R, consisting of 1,900 RAW images. The RAW images are generated by using white balance, demosaicking, and denoising operations.
- CelebA: A face deblurring dataset created using the CelebA dataset which consists of 2,000,000 training images, 1,299 validation images, and 1,300 testing images. The blurred images were created using the blur kernels provided by Shent et al. 2018.
- Helen: A face deblurring dataset created using the Helen dataset which consists of 2,000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shent et al. 2018.
- Wider-Face: A face deblurring dataset created using the Wider-Face dataset which consists of 4,080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shent et al. 2018.
- TextOCR: A text deblurring dataset created using the TextOCR dataset which consists of 5,000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shent et al. 2018.
Chinese Characters Generator
kaggle.com
Updated Jul 14, 2017
Dylan (2017). Chinese Characters Generator [Dataset]. https://www.kaggle.com/dylanli/chinesecharacter/metadata
Dataset updated
Jul 14, 2017
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dylan
Description
About This Dataset

You can use this font file to generate Chinese characters. The generated images can be used to train a machine learning model to recognize text.

This dataset is still being updated.

Tell me if you have other font files or anything else related to this topic.
non-semantic-text-ocr-20k
huggingface.co
Updated Jun 28, 2025
+ more versions
XxZheng (2025). non-semantic-text-ocr-20k [Dataset]. https://huggingface.co/datasets/Fengx1nn/non-semantic-text-ocr-20k
Dataset updated
Jun 28, 2025
Authors
XxZheng
Description
Fengx1nn/non-semantic-text-ocr-20k dataset hosted on Hugging Face and contributed by the HF Datasets community
An Gaodhal Newspaper (1881-1898) Full-Text OCR Output Files
ultraviolet.library.nyu.edu
bin, csv, zip
Updated Apr 25, 2025
Deirdre Ní Chonghaile; Oksana Dereza; Nicholas Wolf (2025). An Gaodhal Newspaper (1881-1898) Full-Text OCR Output Files [Dataset]. http://doi.org/10.58153/5ya5n-mc504
Available download formats: zip, csv, bin
Unique identifier
https://doi.org/10.58153/5ya5n-mc504
Dataset updated
Apr 25, 2025
Dataset provided by
New York University
Authors
Deirdre Ní Chonghaile; Oksana Dereza; Nicholas Wolf
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Time period covered
Oct 1881 - Dec 1898
Description
Machine-readable full text of the Brooklyn-based bilingual (Irish and English) monthly newspaper An Gaodhal, the first serial dedicated to providing content to an Irish-language readership. Files consist of the results of the application of optical character recognition software to the text using newly created text-recognition models trained on the Irish-only and bilingual contents of An Gaodhal. Irish-language content in the newspaper was published using cló Gaelach font (a style based on handwritten manuscripts) and pre-standardized spelling. The full-text files are derived from a digitized print collection held by the James Hardiman Library at the University of Galway. Contents of the newspaper reflect the cultural interests of Irish speakers in New York, Ireland, and the wider diaspora; Irish American life; New York history; and the development of the Irish language during the Celtic Revival period. Coverage includes the years 1881-1898 when its founder, printer, and publisher Micheál Ó Lócháin (Michael J. Logan) spearheaded its production. This project was completed with support from the Robert D. L. Gardiner Foundation, the Irish Institute of New York, Glucksman Ireland House, and the University of Galway.
manga_text_ocr_test
kaggle.com
Updated Sep 28, 2022
Sumit (2022). manga_text_ocr_test [Dataset]. https://www.kaggle.com/datasets/sumityadav/manga-text-ocr-test
Dataset updated
Sep 28, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sumit
Description
Dataset

This dataset was created by Sumit

Contents
LLaVA-OneVision-Data
huggingface.co
Updated Aug 7, 2024
LMMs-Lab (2024). LLaVA-OneVision-Data [Dataset]. https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data
Dataset updated
Aug 7, 2024
Dataset authored and provided by
LMMs-Lab
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for LLaVA-OneVision

[2024-09-01]: Uploaded VisualWebInstruct (filtered), which is used in the OneVision stage.

Almost all subsets are uploaded in HF's required format; you can use the recommended interface to download them and follow our code below to convert them.

The ureader_kg and ureader_qa subsets are uploaded as processed JSON files and tar.gz archives of the image folders. You may directly download them from the following url.… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data.
TextOCR-Dataset (yunusserhat/TextOCR-Dataset): 18 scholarly articles cite this dataset.
