12 datasets found
  1. TextOCR-Dataset

    • huggingface.co
    Updated Sep 9, 2021
    Yunus Serhat Bıçakçı (2021). TextOCR-Dataset [Dataset]. https://huggingface.co/datasets/yunusserhat/TextOCR-Dataset
    Authors
    Yunus Serhat Bıçakçı
    Description

    TextOCR Dataset, Version 0.1

      Training Set

    Word Annotations: 714,770 (272MB); Images: 21,778 (6.6GB)

      Validation Set

    Word Annotations: 107,802 (39MB); Images: 3,124

      Test Set

    Metadata: 1MB; Images: 3,232 (926MB)

      General Information
    

    License: Data is available under CC BY 4.0 license. Important Note: Numbers in the papers should be reported on the v0.1 test set.

      Images
    

    Training and validation set images are sourced… See the full description on the dataset page: https://huggingface.co/datasets/yunusserhat/TextOCR-Dataset.
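
The split sizes listed for this dataset can be combined into a few summary figures. The following is a minimal sketch that uses only the counts quoted in the description; the `SPLITS` dictionary and the variable names are illustrative choices, not part of the dataset, and the test set is treated as unlabeled because the listing gives only its metadata size.

```python
# Counts copied from the dataset description (v0.1).
# The test split lists no word-annotation count, so it is recorded as None.
SPLITS = {
    "train": {"word_annotations": 714_770, "images": 21_778},
    "validation": {"word_annotations": 107_802, "images": 3_124},
    "test": {"word_annotations": None, "images": 3_232},
}

# Total number of images across all three splits.
total_images = sum(split["images"] for split in SPLITS.values())

# Total word annotations over the splits that report them.
labeled_words = sum(
    split["word_annotations"]
    for split in SPLITS.values()
    if split["word_annotations"] is not None
)

# Average number of annotated words per image, per labeled split.
words_per_image = {
    name: split["word_annotations"] / split["images"]
    for name, split in SPLITS.items()
    if split["word_annotations"] is not None
}

print(f"images: {total_images}")          # images: 28134
print(f"labeled words: {labeled_words}")  # labeled words: 822572
for name, ratio in words_per_image.items():
    print(f"{name}: {ratio:.1f} words per image")
```

The labeled total of 822,572 covers only the training and validation splits.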

  2. textocr-gpt4v

    • huggingface.co
    Updated Apr 3, 2015
    Jimmy Carter (2015). textocr-gpt4v [Dataset]. https://huggingface.co/datasets/jimmycarter/textocr-gpt4v
    Authors
    Jimmy Carter
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for TextOCR-GPT4V

      Dataset Summary
    

    TextOCR-GPT4V is Meta's TextOCR dataset captioned with an emphasis on text OCR using GPT4V. To get the images, you will need to agree to their terms of service.

      Supported Tasks
    

    The TextOCR-GPT4V dataset is intended for generating benchmarks that compare an MLLM against GPT4V.

      Languages
    

    The captions are in English, while the text in the images appears in many languages, such as Spanish, Japanese… See the full description on the dataset page: https://huggingface.co/datasets/jimmycarter/textocr-gpt4v.

  3. TextOCR

    • opendatalab.com
    zip
    Updated Apr 20, 2023
    Facebook AI Research (2023). TextOCR [Dataset]. https://opendatalab.com/OpenDataLab/TextOCR
    Available download formats: zip (9222216147 bytes)
    Dataset provided by
    Facebook AI Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TextOCR requires models to perform text recognition on arbitrarily shaped scene text in natural images. TextOCR provides ~1M high-quality word annotations on TextVQA images, enabling end-to-end reasoning on downstream tasks such as visual question answering or image captioning.

  4. Textocr Dataset

    • universe.roboflow.com
    zip
    Updated Apr 9, 2022
    Aryan Patel (2022). Textocr Dataset [Dataset]. https://universe.roboflow.com/aryan-patel/textocr-7vvkv/dataset/1
    Dataset authored and provided by
    Aryan Patel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    TextOCR

    ## Overview
    
    TextOCR is a dataset for object detection tasks; it contains text annotations for 659 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. TextOCR-TextExtractionfromImagesDataset

    • huggingface.co
    Updated Mar 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Antoluis Hendry (2025). TextOCR-TextExtractionfromImagesDataset [Dataset]. https://huggingface.co/datasets/antokun/TextOCR-TextExtractionfromImagesDataset
    Authors
    Antoluis Hendry
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    antokun/TextOCR-TextExtractionfromImagesDataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. OCR Document Text Recognition Dataset

    • kaggle.com
    Updated Sep 7, 2023
    Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    OCR Text Detection in the Documents Object Detection dataset

    The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

    The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

    💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

    The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

    Dataset structure

    • images - contains the original images of documents
    • boxes - includes bounding box labeling for the original images
    • annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

    Data Format

    Each image in the images folder has a corresponding entry in the annotations.xml file that gives the coordinates of the bounding boxes and the labels for text detection. For each point, the x and y coordinates are provided.

    Labels for the text:

    • "Text Title" - corresponds to titles, the box is red
    • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
    • "Table" - corresponds to the table, the box is green
    • "Handwritten" - corresponds to handwritten text, the box is purple
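
The annotation layout described above can be read with Python's standard library. The sketch below is only an illustration under stated assumptions: the listing does not show the actual schema, so the CVAT-style layout (an `<image>` element holding `<box>` elements with `label`, `xtl`, `ytl`, `xbr`, and `ybr` attributes), the sample data, and the `load_boxes` helper are all hypothetical and would need to be adapted to the real annotations.xml. Only the four label names and their box colors come from the description.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Label-to-color mapping taken from the dataset description.
LABEL_COLORS = {
    "Text Title": "red",
    "Text Paragraph": "blue",
    "Table": "green",
    "Handwritten": "purple",
}

# Hypothetical sample in a CVAT-like layout; the real file may be structured differently.
SAMPLE_XML = """
<annotations>
  <image id="0" name="images/0.png" width="800" height="600">
    <box label="Text Title" xtl="10" ytl="20" xbr="300" ybr="60"/>
    <box label="Text Paragraph" xtl="10" ytl="80" xbr="700" ybr="400"/>
    <box label="Handwritten" xtl="50" ytl="420" xbr="400" ybr="500"/>
  </image>
</annotations>
"""


def load_boxes(xml_text):
    """Return one record per bounding box found in the (assumed) XML layout."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for image in root.iter("image"):
        for box in image.iter("box"):
            label = box.get("label")
            records.append({
                "image": image.get("name"),
                "label": label,
                "color": LABEL_COLORS.get(label, "unknown"),
                # (left, top, right, bottom) in pixel coordinates
                "bbox": tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr")),
            })
    return records


boxes = load_boxes(SAMPLE_XML)
label_counts = Counter(record["label"] for record in boxes)
for record in boxes:
    print(record["image"], record["label"], record["color"], record["bbox"])
```

If the real file stores polygons or point lists rather than boxes, the attribute lookup inside `load_boxes` is the only part that should need to change.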

    [Image: example of XML file structure]

    A Text Detection in the Documents dataset can be prepared in accordance with your requirements.

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text

  7. A Curated List of Image Deblurring Datasets

    • kaggle.com
    Updated Mar 28, 2023
    Jishnu Parayil Shibu (2023). A Curated List of Image Deblurring Datasets [Dataset]. https://www.kaggle.com/datasets/jishnuparayilshibu/a-curated-list-of-image-deblurring-datasets
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jishnu Parayil Shibu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, out-of-focus objects, etc. making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.

    Despite the rapid advancement in image deblurring, finding and pre-processing datasets for training and testing has been both time-consuming and unnecessarily complicated for experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets such as face and text deblurring datasets.

    To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable Python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets with just 2-3 lines of code.

    The following datasets are currently provided:
    - GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720, divided into 2,103 training images and 1,111 test images.
    - HIDE: HIDE is a motion-blurred dataset that includes 2025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
    - RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1900 camera JPEG outputs. The second is RealBlur-R, consisting of 1900 RAW images. The RAW images are generated by using white balance, demosaicking, and denoising operations.
    - CelebA: A face deblurring dataset created using the CelebA dataset, which consists of 2 000 000 training images, 1299 validation images, and 1300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - Helen: A face deblurring dataset created using the Helen dataset, which consists of 2 000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - Wider-Face: A face deblurring dataset created using the Wider-Face dataset, which consists of 4080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - TextOCR: A text deblurring dataset created using the TextOCR dataset, which consists of 5000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.

  8. Chinese Characters Generator

    • kaggle.com
    Updated Jul 14, 2017
    Dylan (2017). Chinese Characters Generator [Dataset]. https://www.kaggle.com/dylanli/chinesecharacter/metadata
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dylan
    Description

    About This Dataset

    You can use these font files to generate images of Chinese characters. These images can be used to train a machine learning model to recognize text.

    The dataset is still being updated.

    Tell me if you have other font files or anything else related to this topic.

  9. non-semantic-text-ocr-20k

    • huggingface.co
    Updated Jun 28, 2025
    XxZheng (2025). non-semantic-text-ocr-20k [Dataset]. https://huggingface.co/datasets/Fengx1nn/non-semantic-text-ocr-20k
    Authors
    XxZheng
    Description

    Fengx1nn/non-semantic-text-ocr-20k dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. An Gaodhal Newspaper (1881-1898) Full-Text OCR Output Files

    • ultraviolet.library.nyu.edu
    bin, csv, zip
    Updated Apr 25, 2025
    Deirdre Ní Chonghaile; Oksana Dereza; Nicholas Wolf (2025). An Gaodhal Newspaper (1881-1898) Full-Text OCR Output Files [Dataset]. http://doi.org/10.58153/5ya5n-mc504
    Dataset provided by
    New York University
    Authors
    Deirdre Ní Chonghaile; Oksana Dereza; Nicholas Wolf
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Oct 1881 - Dec 1898
    Description

    Machine-readable full text of the Brooklyn-based bilingual (Irish and English) monthly newspaper An Gaodhal, the first serial dedicated to providing content to an Irish-language readership. Files consist of the results of the application of optical character recognition software to the text using newly created text-recognition models trained on the Irish-only and bilingual contents of An Gaodhal. Irish-language content in the newspaper was published using cló Gaelach font (a style based on handwritten manuscripts) and pre-standardized spelling. The full-text files are derived from a digitized print collection held by the James Hardiman Library at the University of Galway. Contents of the newspaper reflect the cultural interests of Irish speakers in New York, Ireland, and the wider diaspora; Irish American life; New York history; and the development of the Irish language during the Celtic Revival period. Coverage includes the years 1881-1898 when its founder, printer, and publisher Micheál Ó Lócháin (Michael J. Logan) spearheaded its production. This project was completed with support from the Robert D. L. Gardiner Foundation, the Irish Institute of New York, Glucksman Ireland House, and the University of Galway.

  11. manga_text_ocr_test

    • kaggle.com
    Updated Sep 28, 2022
    Sumit (2022). manga_text_ocr_test [Dataset]. https://www.kaggle.com/datasets/sumityadav/manga-text-ocr-test
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sumit
    Description

    Dataset

    This dataset was created by Sumit

  12. LLaVA-OneVision-Data

    • huggingface.co
    Updated Aug 7, 2024
    LMMs-Lab (2024). LLaVA-OneVision-Data [Dataset]. https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data
    Dataset authored and provided by
    LMMs-Lab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for LLaVA-OneVision

    [2024-09-01]: Uploaded VisualWebInstruct (filtered), which is used in the OneVision stage.

    Almost all subsets are uploaded in HF's required format; you can use the recommended interface to download them and follow our code below to convert them.

    The ureader_kg and ureader_qa subsets are uploaded as processed JSON files and tar.gz archives of the image folders. You may download them directly from the following URL.… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data.

TextOCR-Dataset

yunusserhat/TextOCR-Dataset

18 scholarly articles cite this dataset (per Google Scholar).