TextOCR Dataset
Version 0.1
Training Set
Word Annotations: 714,770 (272MB) Images: 21,778 (6.6GB)
Validation Set
Word Annotations: 107,802 (39MB) Images: 3,124
Test Set
Metadata: 1MB Images: 3,232 (926MB)
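As a quick sanity check on the figures above, the average number of word annotations per image can be computed directly from the listed counts. This is a minimal sketch: the `SPLITS` dictionary and the `words_per_image` helper are illustrative names, not part of the dataset's tooling. The test set is omitted because the listing gives no annotation count for it.

```python
# Counts copied from the listing above (TextOCR v0.1).
SPLITS = {
    "train": {"word_annotations": 714_770, "images": 21_778},
    "val": {"word_annotations": 107_802, "images": 3_124},
}


def words_per_image(split):
    """Average number of word annotations per image for the given split."""
    stats = SPLITS[split]
    return stats["word_annotations"] / stats["images"]


if __name__ == "__main__":
    for name in SPLITS:
        print(f"{name}: {words_per_image(name):.1f} word annotations per image")
```

Both splits come out at a little over 30 word annotations per image.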
General Information
License: Data is available under CC BY 4.0 license. Important Note: Numbers in the papers should be reported on the v0.1 test set.
Images
Training and validation set images are sourced… See the full description on the dataset page: https://huggingface.co/datasets/yunusserhat/TextOCR-Dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for TextOCR-GPT4V
Dataset Summary
TextOCR-GPT4V is Meta's TextOCR dataset, captioned using GPT4V with an emphasis on text OCR. To get the images, you will need to agree to their terms of service.
Supported Tasks
The TextOCR-GPT4V dataset is intended for generating benchmarks that compare an MLLM to GPT4V.
Languages
The captions are in English, while the text in the images appears in many languages such as Spanish, Japanese… See the full description on the dataset page: https://huggingface.co/datasets/jimmycarter/textocr-gpt4v.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TextOCR requires models to perform text recognition on arbitrarily shaped scene text in natural images. TextOCR provides ~1M high-quality word annotations on TextVQA images, allowing end-to-end reasoning on downstream tasks such as visual question answering or image captioning.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
TextOCR is a dataset for object detection tasks - it contains Text annotations for 659 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://choosealicense.com/licenses/cc0-1.0/
antokun/TextOCR-TextExtractionfromImagesDataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.
The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.
The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.
Each image in the images folder is accompanied by an XML annotation in the annotations.xml file, which gives the bounding-box coordinates and labels for text detection. For each point, the x and y coordinates are provided.
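Reading such a file might look like the sketch below. The listing does not show the actual schema of annotations.xml, so the tag and attribute names used here (`image`, `box`, `xtl`, `ytl`, `xbr`, `ybr`) are assumptions modeled on a common CVAT-style export, and `SAMPLE_XML` is made-up data. Adjust the names to match the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real annotations.xml may use different tags and attributes.
SAMPLE_XML = """<annotations>
  <image id="0" name="images/0.png" width="800" height="600">
    <box label="text" xtl="10.0" ytl="20.0" xbr="110.0" ybr="45.0"/>
    <box label="text" xtl="15.5" ytl="60.0" xbr="300.0" ybr="90.0"/>
  </image>
</annotations>"""


def read_boxes(xml_text):
    """Return {image name: [(label, xtl, ytl, xbr, ybr), ...]} from the XML text."""
    root = ET.fromstring(xml_text)
    result = {}
    for image in root.iter("image"):
        boxes = []
        for box in image.iter("box"):
            # Corner coordinates are stored as strings; convert them to floats.
            coords = tuple(float(box.get(key)) for key in ("xtl", "ytl", "xbr", "ybr"))
            boxes.append((box.get("label"),) + coords)
        result[image.get("name")] = boxes
    return result


if __name__ == "__main__":
    for name, boxes in read_boxes(SAMPLE_XML).items():
        print(name, boxes)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.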
keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
https://creativecommons.org/publicdomain/zero/1.0/
Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, out-of-focus objects, etc. making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.
Despite the rapid advancement in image deblurring, the process of finding and pre-processing a number of datasets for training and testing purposes has been both time exhaustive and unnecessarily complicated for both experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets such as face and text deblurring datasets.
To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets just with 2-3 lines of code.
Following is a list of the datasets that are currently provided:
- GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720 that are divided into 2,103 training images and 1,111 test images.
- HIDE: HIDE is a motion-blurred dataset that includes 2025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
- RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1900 camera JPEG outputs. The second is RealBlur-R, consisting of 1900 RAW images. The RAW images are generated by using white balance, demosaicking, and denoising operations.
- CelebA: A face deblurring dataset created using the CelebA dataset which consists of 2 000 000 training images, 1299 validation images, and 1300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Helen: A face deblurring dataset created using the Helen dataset which consists of 2 000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Wider-Face: A face deblurring dataset created using the Wider-Face dataset which consists of 4080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- TextOCR: A text deblurring dataset created using the TextOCR dataset which consists of 5000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
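Several of the datasets above are built by applying blur kernels to sharp images. The sketch below illustrates the general idea on a tiny grayscale image in pure Python. It is not the DBlur pipeline and does not use the published kernels: `box_kernel` is a hypothetical stand-in that produces a uniform kernel, and `blur` is an illustrative helper.

```python
def box_kernel(size):
    """Uniform size x size kernel whose weights sum to 1 (stand-in for a real blur kernel)."""
    weight = 1.0 / (size * size)
    return [[weight] * size for _ in range(size)]


def blur(image, kernel):
    """Filter a grayscale image (list of rows) with an odd-sized square kernel.

    Pixels outside the image are replaced by the nearest edge pixel. The kernel
    is applied without flipping, which matches convolution for symmetric kernels.
    """
    height, width = len(image), len(image[0])
    k = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp row index
                    xx = min(max(x + dx, 0), width - 1)   # clamp column index
                    total += image[yy][xx] * kernel[dy + k][dx + k]
            out[y][x] = total
    return out


if __name__ == "__main__":
    sharp = [[0.0] * 5 for _ in range(5)]
    sharp[2][2] = 9.0  # a single bright pixel
    for row in blur(sharp, box_kernel(3)):
        print(row)
```

The single bright pixel is spread over its 3×3 neighborhood, which is the smearing effect a deblurring model is trained to undo. Real motion-blur kernels are generally not uniform.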
You can use this font file to generate Chinese characters. The generated images can be used to train a machine learning model to recognize text.
Fengx1nn/non-semantic-text-ocr-20k dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Machine-readable full text of the Brooklyn-based bilingual (Irish and English) monthly newspaper An Gaodhal, the first serial dedicated to providing content to an Irish-language readership. Files consist of the results of the application of optical character recognition software to the text using newly created text-recognition models trained on the Irish-only and bilingual contents of An Gaodhal. Irish-language content in the newspaper was published using cló Gaelach font (a style based on handwritten manuscripts) and pre-standardized spelling. The full-text files are derived from a digitized print collection held by the James Hardiman Library at the University of Galway. Contents of the newspaper reflect the cultural interests of Irish speakers in New York, Ireland, and the wider diaspora; Irish American life; New York history; and the development of the Irish language during the Celtic Revival period. Coverage includes the years 1881-1898 when its founder, printer, and publisher Micheál Ó Lócháin (Michael J. Logan) spearheaded its production. This project was completed with support from the Robert D. L. Gardiner Foundation, the Irish Institute of New York, Glucksman Ireland House, and the University of Galway.
This dataset was created by Sumit
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for LLaVA-OneVision
[2024-09-01]: Uploaded VisualWebInstruct(filtered), it's used in OneVision Stage
Almost all subsets are uploaded in HF's required format; you can use the recommended interface to download them and follow our code below to convert them.
The ureader_kg and ureader_qa subsets are uploaded as processed JSON files and tar.gz archives of the image folders. You may download them directly from the following URL.… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data.