Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)
https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.
The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.
The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.
Each image from the images folder is accompanied by an XML annotation in the annotations.xml file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.
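For illustration, such an annotation file could be read with a few lines of Python. The tag and attribute names below (image, box, xtl/ytl/xbr/ybr, label) are assumptions modeled on common CVAT-style exports, not the confirmed schema of this dataset; adjust them to the actual annotations.xml layout.

```python
# A minimal sketch of reading bounding boxes from annotations.xml.
# Tag and attribute names are assumed (CVAT-style); adapt to the real schema.
import xml.etree.ElementTree as ET

root = ET.parse("annotations.xml").getroot()
for image in root.iter("image"):
    print(image.get("name"))
    for box in image.iter("box"):
        label = box.get("label")
        x1, y1 = float(box.get("xtl")), float(box.get("ytl"))
        x2, y2 = float(box.get("xbr")), float(box.get("ybr"))
        print(f"  {label}: ({x1}, {y1}) .. ({x2}, {y2})")
```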
keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
Total individuals: 1661
Total images: 3623
Images per user: 2.18
Top countries (67 countries in total):
- Nigeria 44.6%
- United States of America 7.2%
- Bangladesh 7.1%
- Ethiopia 6.7%
- Indonesia 4.8%
- India 4.8%
- Kenya 2.4%
- Iran 2.3%
- Nepal 1.7%
- Pakistan 1.4%
Types of documents:
- Identification Card (ID Card) 63.2%
- Driver's License 6.4%
- Student ID 4.9%
- International passport 2.8%
- Domestic passport 0.8%
- Residence Permit 0.7%
- Military ID 0.4%
- Health Insurance Card 0.2%
Data is organized in per‑user folders and includes rich metadata.
Within a folder you may find: (a) multiple document categories for the same person, and/or (b) repeated captures of the same document against different backgrounds or lighting setups. The maximum volume per individual is 28 images.
Metadata includes the country of the document, type of document, creation date, last name, first name, and day, month, and year of birth.
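As a rough illustration, the per-user layout could be inventoried as follows. Only the folder-per-individual structure is described above; the root path, image extensions, and everything else in this sketch are assumptions.

```python
# A minimal sketch of inventorying the per-user folder layout described
# above. The root path and file extensions are hypothetical.
from pathlib import Path

root = Path("dataset_root")  # hypothetical location of the dataset
counts = {}
for user_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    images = list(user_dir.glob("*.jpg")) + list(user_dir.glob("*.png"))
    counts[user_dir.name] = len(images)

print("individuals:", len(counts))
print("images:", sum(counts.values()))
print("max per individual:", max(counts.values()))  # should be <= 28
```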
Every image was provided with explicit user consent. This ensures downstream use cases—such as training and evaluating document detection, classification, text extraction, and identity authentication models—are supported by legally sourced data.
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The choice of dataset is key for OCR systems. Unfortunately…
Text labels are an integral part of cadastral maps and floor plans. Text is also prevalent in natural scenes around us in the form of road signs, billboards, house numbers, and place names. Extracting this text can provide additional context and details about the places the text describes and the information it conveys. Digitization of documents and extraction of their text helps in retrieving and archiving important information.
This deep learning model is based on the MMOCR model and uses optical character recognition (OCR) technology to detect text in images. The model was trained on a large dataset of different types and styles of text with diverse backgrounds and contexts, allowing for precise text extraction. It can be applied to various tasks such as automatically detecting and reading text from documents, sign boards, scanned maps, etc., thereby converting images containing text into actionable data.
Using the model: Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.
Fine-tuning the model: This model cannot be fine-tuned using ArcGIS tools.
Input: High-resolution, 3-band street-level imagery/oriented imagery, scanned maps, or documents, with medium to large size text.
Output: A feature layer with the recognized text and a bounding box around it.
Model architecture: This model is based on the open-source MMOCR model by MMLab.
Sample results: Here are a few results from the model.
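For reference, text detection and recognition with the underlying open-source MMOCR library looks roughly like the sketch below. The inferencer class and output keys follow the MMOCR 1.x documentation but may differ across versions, and the model names are examples; this illustrates the library, not the packaged ArcGIS tool.

```python
# A minimal sketch of OCR inference with the open-source MMOCR library
# (1.x API); model names are examples and depend on installed weights.
from mmocr.apis import MMOCRInferencer

# DBNet for text detection, SAR for text recognition.
ocr = MMOCRInferencer(det='DBNet', rec='SAR')

# Returns detected polygons plus recognized text and confidence scores.
result = ocr('scanned_map.png', out_dir='ocr_results/', save_vis=True)
pred = result['predictions'][0]
for text, score in zip(pred['rec_texts'], pred['rec_scores']):
    print(f'{text}\t{score:.2f}')
```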
CC0 1.0 Universal Public Domain Dedication
https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset consists of 24,280 high-resolution word images extracted from Manchu ancient books dating from 1733 to 1867, collected within the present-day territory of China. The images were sourced from the Series of Rare Ancient Books in Manchu and Chinese curated by the National Library of China. Each of the 2,428 unique Manchu words in the dataset is represented by exactly 10 distinct image samples, resulting in a balanced and well-structured dataset suitable for training and evaluating deep learning models in the task of Manchu OCR (optical character recognition).
This dataset was constructed using a semi-automated workflow to address the challenges posed by manual segmentation of historical scripts, such as high annotation costs and time-consuming processing, and to preserve the visual details of each page. The image acquisition process involved high-precision scanning at 600 dpi. Word regions were first identified using computer vision algorithms, followed by manual verification and correction to ensure the accuracy and completeness of the extracted samples.
All images are stored in standard .jpg format with consistent resolution and naming conventions. The dataset is divided into structured folders by word category, and accompanying metadata files provide annotations, including word labels, file paths, and page source references. The released version has no missing data entries, and the dataset has been quality-checked to exclude samples with severe degradation, such as illegible characters, torn pages, or significant shadowing.
To our knowledge, this is the largest publicly available Manchu word image dataset to date. It offers a valuable resource for researchers in historical document analysis, Manchu linguistics, and machine learning-based OCR. The dataset can be used for model training and evaluation, benchmarking segmentation algorithms, and exploring multimodal representations of Manchu script.
https://choosealicense.com/licenses/other/
Dataset Card for Industry Documents Library (IDL)
Dataset Summary
Industry Documents Library (IDL) is a document dataset filtered from the UCSF documents library, with 19 million pages kept as valid samples. Each document exists as a collection of a PDF, a TIFF image with the same contents rendered, a JSON file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.
Document OCR using Nanonets-OCR-s
This dataset contains markdown-formatted OCR results from images in /content/input using Nanonets-OCR-s.
Processing Details
Source Dataset: /content/input
Model: nanonets/Nanonets-OCR-s
Number of Samples: 32
Processing Time: 7.9 minutes
Processing Date: 2025-08-14 04:32 UTC
Configuration
Image Column: image
Output Column: markdown
Dataset Split: train
Batch Size: 32
Max Model Length: 8,192 tokens
Max Output Tokens: 4,096… See the full description on the dataset page: https://huggingface.co/datasets/phuongkhanh123/my-ocr-output.
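Assuming the configuration above (image column image, output column markdown, train split), the results can be loaded with the Hugging Face datasets library:

```python
# A minimal sketch of loading this OCR output dataset; column names
# follow the configuration listed above.
from datasets import load_dataset

ds = load_dataset("phuongkhanh123/my-ocr-output", split="train")
# Each row pairs the source image with its markdown-formatted OCR text.
print(ds[0]["markdown"])
```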
CC0 1.0 Universal Public Domain Dedication
https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset contents
This dataset is an OCR text corpus of 2,984 printed works (monographs and serials) from the collection of the Qatar National Library. The works are mostly in the Arabic language, but fragments of texts in other languages can also be found. Besides the OCR text, basic descriptive metadata for each item is also provided.
Release note for version 2 of the dataset
The dataset of OCRed Arabic books has been fully updated to ensure consistency and quality. All items in the dataset have now been processed using the latest retrained data. Furthermore, every item has undergone a thorough visual quality assurance check conducted using a representative sample of pages. This update has resulted in a significant enhancement of word-level accuracy across the entire dataset, ensuring higher reliability and usability. The exact list of files changed between version 1 and version 2 of the dataset can be determined by comparing the SHA256 checksums provided with each dataset version (see below for details).
Dataset structure
The dataset consists of three files:
- QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata of 2,894 items from the Qatar National Library collection. Both files have the same content and are structured into the following columns:
  - CALL #(ITEM): item call number in the QNL catalog
  - RECORD #(ITEM): item record number in the QNL catalog (unique for each item)
  - Repository URL: URL to the digitized item content in the QNL repository
  - Catalog URL: URL to the complete item metadata record in the QNL catalog
  - AUTHOR: main author information for the item
  - ADD AUTHOR: additional author information for the item
  - PUB INFO: item publication info
  - TITLE: item title
  - DESCRIPTION: item description
  - VOLUME: item volume information (in the case of some serial publications)
- QNL_ArabicOCR_Corpus-v2.zip contains:
  - 2,894 text files with the following naming pattern: [unique item record number]-[unique item QNL repository id].txt. The unique item record number should be used to match each file with a related metadata record. Each file contains text extracted from a particular item using OCR software.
  - checksums.sha256, which contains SHA256 checksums for all 2,894 text files.
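As an illustration, the checksums and the record-number naming pattern described above can be used to verify the corpus and index it against the metadata. File locations and the checksum-line format in this sketch are assumptions.

```python
# A minimal sketch of verifying the corpus against checksums.sha256 and
# matching each text file to its metadata row via the record number
# embedded in the file name ([record number]-[repository id].txt).
import csv
import hashlib
from pathlib import Path

corpus_dir = Path("QNL_ArabicOCR_Corpus-v2")  # assumed extraction path

# Verify SHA256 checksums (line format assumed: "<hexdigest>  <filename>").
for line in (corpus_dir / "checksums.sha256").read_text().splitlines():
    expected, name = line.split(maxsplit=1)
    digest = hashlib.sha256((corpus_dir / name).read_bytes()).hexdigest()
    assert digest == expected, f"checksum mismatch: {name}"

# Index metadata by the unique record number.
with open("QNL-ArabicContentDataset-Metadata.csv", newline="", encoding="utf-8") as f:
    metadata = {row["RECORD #(ITEM)"]: row for row in csv.DictReader(f)}

for txt in corpus_dir.glob("*.txt"):
    record_number = txt.stem.split("-")[0]
    item = metadata.get(record_number)
    if item:
        print(record_number, item["TITLE"])
```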
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset comprising 813 images of invoices and receipts of a private company in the Portuguese language. It also includes text files with the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference.
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corpus for the ICDAR2019 Competition on Post-OCR Text Correction (October 2019)
Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
http://l3i.univ-larochelle.fr/ICDAR2019PostOCR

These are the supplementary materials for the ICDAR 2019 paper "ICDAR 2019 Competition on Post-OCR Text Correction". Please use the following citation:

@inproceedings{rigaud2019pocr,
  title={ICDAR 2019 Competition on Post-OCR Text Correction},
  author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
  year={2019},
  booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
}
Description: The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and from external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus, and Wikisource.
Repartition of the dataset:
- ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.
- ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset, used for the evaluation (with the Gold Standard made public after the competition).
- ICDAR2019_Post_OCR_correction_full_22M: full dataset, made publicly available after the competition.
Special case for the Finnish language: material from the National Library of Finland (Finnish dataset FI > FI1) is not allowed to be re-shared on other websites. Please follow these guidelines to get and format the data from the original website:
1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en;
2. Download OCR Ground Truth Pages (Finnish Fraktur) [v1] (4.8 GB) from the Digitalia (2015-17) package;
3. Convert the Excel file "~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx" to Comma Separated Format (.csv) using the save-as function in a spreadsheet application (e.g. Excel, Calc) and copy it into "FI/FI1/HOWTO_get_data/input/" (see the scripted sketch after this list);
4. Go to "FI/FI1/HOWTO_get_data/" and run "script_1.py" to generate the full "FI1" dataset in "output/full/";
5. Run "script_2.py" to split the "output/full/" dataset into "output/training/" and "output/evaluation/" subsets.
At the end of the process, you should have a "training", "evaluation", and "full" folder with 1,579,528, 380,817, and 1,960,345 characters respectively.
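For step 3, the spreadsheet conversion can also be scripted. This sketch assumes pandas with openpyxl installed and uses the paths given in the guidelines.

```python
# A minimal sketch of step 3 above: converting the NLF ground-truth Excel
# metadata to CSV with pandas instead of a spreadsheet application.
import pandas as pd

df = pd.read_excel("~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx")
df.to_csv("FI/FI1/HOWTO_get_data/input/nlf_ocr_gt_tescomb5_2017.csv",
          index=False)
# After this, run script_1.py and script_2.py as described in steps 4-5.
```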
Licenses: free to use for non-commercial purposes, according to the source-specific details below:
- BG1: IMPACT - National Library of Bulgaria: CC BY NC ND
- CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA
- DE1: Front pages of the Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)
- DE2: IMPACT - German National Library: CC BY NC ND
- DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- EN1: IMPACT - British Library: CC BY NC SA 3.0
- ES1: IMPACT - National Library of Spain: CC BY NC SA
- FI1: National Library of Finland: no re-sharing allowed; follow the section above to get the data (https://digi.kansalliskirjasto.fi/opendata)
- FR1: HIMANIS Project: CC0 (https://www.himanis.org)
- FR2: IMPACT - National Library of France: CC BY NC SA 3.0
- FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)
- NL1: IMPACT - National Library of the Netherlands: CC BY
- PL1: IMPACT - National Library of Poland: CC BY
- SL1: IMPACT - Slovak National Library: CC BY NC
Text post-processing such as cleaning and alignment has been applied to the resources mentioned above, so the Gold Standard and the OCRs provided are not necessarily identical to the originals.
Structure
- **Content** [./lang_type/sub_folder/#.txt]
  - "[OCR_toInput]" => raw OCRed text to be de-noised.
  - "[OCR_aligned]" => aligned OCRed text.
  - "[ GS_aligned]" => aligned Gold Standard text.
The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using "@" symbols. "#" symbols correspond to the absence of GS, related either to alignment uncertainties or to unreadable characters in the source document. For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.
The Error Rate and the quality of the alignment vary according to the nature and the state of degradation of the source documents. Periodicals (mostly historical newspapers), for example, have been reported to be especially challenging due to their complex layouts and original fonts. In addition, it should be mentioned that the quality of the Gold Standard also varies, as the dataset aggregates resources from different projects, each with its own annotation procedure, and it obviously contains some errors.
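As an illustration, a simple per-position error rate over these aligned files can be computed as follows, skipping the "#" positions that carry no Gold Standard. Note this is a naive aligned-position count, not a full edit-distance CER.

```python
# A minimal sketch of computing a character error rate from one aligned
# file in the format described above: "@" pads the alignment and "#"
# marks positions with no usable Gold Standard, which we skip.
def char_error_rate(ocr_aligned: str, gs_aligned: str) -> float:
    assert len(ocr_aligned) == len(gs_aligned)
    compared = errors = 0
    for ocr_ch, gs_ch in zip(ocr_aligned, gs_aligned):
        if gs_ch == "#":        # no Gold Standard at this position
            continue
        compared += 1
        if ocr_ch != gs_ch:     # includes "@" (insertion/deletion) cases
            errors += 1
    return errors / compared if compared else 0.0

# Example with "@" marking an alignment gap:
print(char_error_rate("Th@ cat", "The cat"))  # 1 error over 7 chars
```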
ICDAR2019 competition
Information related to the tasks, formats, and evaluation metrics is detailed at: https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation
References
- IMPACT, European Commission's 7th Framework Program, grant agreement 215064
- Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
- https://digi.nationallibrary.fi, Wiipuri, 31.12.1904, Digital Collections of the National Library of Finland
- EU Horizon 2020 research and innovation programme, grant agreement No 770299
Contact
- christophe.rigaud(at)univ-lr.fr
- antoine.doucet(at)univ-lr.fr
- mickael.coustaty(at)univ-lr.fr
- jean-philippe.moreux(at)bnf.fr
L3i - University of La Rochelle, http://l3i.univ-larochelle.fr
BnF - French National Library, http://www.bnf.fr
https://choosealicense.com/licenses/other/
Dataset Card for PDF Association dataset (PDFA)
Dataset Summary
The PDFA dataset is a document dataset filtered from the SafeDocs corpus, aka CC-MAIN-2021-31-PDF-UNTRUNCATED. The original purpose of that corpus is comprehensive PDF document analysis; this subset differs in that the focus has been on making the dataset machine-learning-ready for vision-language models.
An example page of one pdf document, with added bounding boxes… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/pdfa-eng-wds.
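As a sketch, the shards can be streamed with the Hugging Face datasets library. The exact per-sample fields depend on the webdataset shard layout described on the dataset page, so treat them as assumptions and inspect the keys first.

```python
# A minimal sketch of streaming the PDFA webdataset shards; per-sample
# field names are not assumed here beyond what `keys()` reveals.
from datasets import load_dataset

ds = load_dataset("pixparse/pdfa-eng-wds", streaming=True, split="train")
sample = next(iter(ds))
print(sample.keys())  # inspect available fields (images, OCR annotations, ...)
```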
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Optical Character Recognition (OCR) systems frequently encounter difficulties when processing rare or ancient scripts, especially when they occur in historical contexts involving multiple writing systems. These challenges often constrain researchers to fine-tune or train new OCR models tailored to their specific needs. To support these efforts, we introduce a synthetic dataset comprising 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Augmented with artificially degraded lines, the dataset yields strong results when used to train historical OCR models. This resource can be used for both training and testing purposes, and is particularly valuable for researchers working with ancient Greek and limited annotated data. The software used to generate this dataset is linked below on our Git. This is a sample; please contact us if you would like access to the whole dataset.
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic printed word image data and test protocols: the data repository for the paper "A Multifaceted Evaluation of Representation of Graphemes for Practically Effective Bangla OCR." In this paper, we utilized the popular Convolutional Recurrent Neural Network (CRNN) architecture and implemented our grapheme representation strategies to design the final labels of the model. Due to the absence of a large-scale Bangla word-level printed dataset, we created a synthetically generated Bangla corpus containing 2 million samples that are representative and sufficiently varied in terms of fonts, domain, and vocabulary size to train our Bangla OCR model. To test the various aspects of our model, we also created 6 test protocols. Finally, to establish the generalizability of our grapheme representation methods, we performed training and testing on external handwriting datasets.
Updates:
- 10 June 2023: The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR).
Document OCR using NuMarkdown-8B-Thinking
This dataset contains markdown-formatted OCR results from images in davanstrien/india-medical-test using NuMarkdown-8B-Thinking.
Processing Details
Source Dataset: davanstrien/india-medical-test
Model: numind/NuMarkdown-8B-Thinking
Number of Samples: 50
Processing Time: 13.3 minutes
Processing Date: 2025-08-07 08:04 UTC
Configuration
Image Column: image
Output Column: markdown
Dataset Split: train
Batch Size: 16… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/india-medical-ocr-test.
According to our latest research, the global Smart Document Scanner OCR App market size reached USD 3.85 billion in 2024, exhibiting robust growth driven by the rapid digitization of workflows and the increasing need for document automation across various sectors. The market is projected to grow at a CAGR of 13.7% from 2025 to 2033, with the market size forecasted to reach USD 11.89 billion by 2033. This significant expansion is primarily attributed to the widespread adoption of mobile devices, advancements in artificial intelligence and machine learning, and the growing demand for efficient document management solutions in both personal and professional environments.
One of the primary growth factors fueling the Smart Document Scanner OCR App market is the accelerating pace of digital transformation across industries such as healthcare, finance, education, and government. Organizations are increasingly seeking ways to streamline their document handling processes, reduce manual data entry errors, and improve operational efficiency. The integration of Optical Character Recognition (OCR) technology into smart document scanning apps enables users to quickly convert paper documents into editable and searchable digital formats, significantly enhancing productivity. Furthermore, the proliferation of remote work and the need for secure, cloud-based document sharing have further heightened the demand for advanced OCR-enabled scanning solutions.
Another significant driver is the continuous innovation in artificial intelligence and machine learning algorithms, which are making OCR technology more accurate, reliable, and versatile. Modern Smart Document Scanner OCR Apps can now recognize a wide range of fonts, languages, and complex layouts, including tables and handwritten notes, with remarkable precision. This technological evolution has broadened the application scope of these apps, allowing them to be used not only for basic document digitization but also for tasks such as invoice processing, identity verification, and compliance management. The incorporation of AI-powered features such as automatic document detection, real-time translation, and advanced data extraction is further propelling market growth.
The increasing penetration of smartphones and mobile devices globally has also played a crucial role in the expansion of the Smart Document Scanner OCR App market. With the majority of the population now having access to high-resolution cameras and powerful processing capabilities on their mobile devices, scanning and digitizing documents has become more convenient than ever. This trend is particularly pronounced in emerging markets, where mobile-first solutions are often preferred over traditional desktop-based applications. Additionally, the growing emphasis on paperless offices and environmental sustainability is encouraging both individuals and enterprises to adopt digital document management practices, thereby boosting the market for OCR-enabled scanner apps.
From a regional perspective, North America currently dominates the global Smart Document Scanner OCR App market, accounting for the largest share in 2024. This is largely due to the high adoption rate of advanced technologies, a mature IT infrastructure, and the presence of leading solution providers in the region. However, Asia Pacific is expected to witness the fastest growth over the forecast period, driven by rapid urbanization, increasing smartphone penetration, and rising investments in digital transformation initiatives across countries such as China, India, and Japan. Europe also presents significant growth opportunities, supported by stringent regulatory requirements for data management and a strong focus on innovation in document processing technologies.
The Component segment of the Smart Document Scanner OCR App market is bifurcated into Software and Services. The Software sub-segment holds the lion’s share of the market, as the co
This data consists of a number of .zip files containing everything needed to run the hieratic optical character recognition program presented at https://github.com/jtabin/PaPYrus. The files included are as follows:
1. "Dataset By Sign": all 13,134 data set images, categorized in folders by their Gardiner sign. Each image is a black-and-white .png image of a hieratic sign. The signs are labeled with unique identifiers encoding, in order, their placement in a text from the 1st (0001) to the 9999th (9999); the facsimile maker (1 for Möller, 2 for Poe, 3 for Tabin); the provenance (1: Thebes, 2: Lahun, 3: Hatnub, 4: Unknown); and the original text (1: Shipwrecked Sailor, 2: Eloquent Peasant B1, 3: Eloquent Peasant R, 4: Sinuhe B, 5: Sinuhe R, 6: Papyrus Prisse, 7: Hymn to Senwosret III, 8: Lahun Temple Files, 9: Will of Wah, 10: Texte aus Hatnub, 11: Papyrus Ebers, 12: Rhind Papyrus, 13: Papyrus Westcar). A parsing sketch follows this list.
2. "Dataset Categorized": every data set image, as above, categorized in folders by provenance, text, and facsimile maker (i.e. where the tags originate from).
3. "Dataset Whole": every data set image in one folder. This is what is used for the analyses done by the OCR program.
4. "Precalculated Data Set Stats": a collection of .csv files outputted by the "Data Set Prep.ipynb" code (found on the aforementioned GitHub page). "pxls_16.csv", "pxls_20.csv", and "pxls_25.csv" are the pixel values for every sign in the data set after resizing to 16, 20, and 25 pixels, respectively. "datasetstats.csv" includes the aspect ratios and sign names for every sign in the data set. The two files beginning with "A1cut" are the same stats, but after every A1 sign had its tail manually cut off.
5. "Precalculated OCR Results": a collection of .csv files outputted by the "Image Identification.ipynb" code (also found on the GitHub page). The files are mostly the product of one sign from the data set being run through the OCR program, and they are labeled with the name of that sign. These result in columns of signs and their similarity scores when compared to other signs. Some files, such as "randsamp_fullresults.csv", come from other analyses explained in their file names (that file, for instance, is a random sample from the data set).
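For illustration, the identifier scheme in item 1 can be decoded programmatically. The fixed-width layout assumed below (4-digit placement, then single digits for facsimile maker and provenance, then the text number) is an interpretation of the description, not a confirmed spec.

```python
# A minimal sketch of decoding the unique sign identifiers described above.
# The exact filename layout is an assumption; adjust the slicing if the
# real files differ.
FACSIMILE = {"1": "Möller", "2": "Poe", "3": "Tabin"}
PROVENANCE = {"1": "Thebes", "2": "Lahun", "3": "Hatnub", "4": "Unknown"}
TEXTS = {
    "1": "Shipwrecked Sailor", "2": "Eloquent Peasant B1",
    "3": "Eloquent Peasant R", "4": "Sinuhe B", "5": "Sinuhe R",
    "6": "Papyrus Prisse", "7": "Hymn to Senwosret III",
    "8": "Lahun Temple Files", "9": "Will of Wah", "10": "Texte aus Hatnub",
    "11": "Papyrus Ebers", "12": "Rhind Papyrus", "13": "Papyrus Westcar",
}

def decode_identifier(tag: str) -> dict:
    """Split a sign identifier into its documented components."""
    return {
        "placement": int(tag[0:4]),     # 0001 .. 9999
        "facsimile": FACSIMILE[tag[4]],
        "provenance": PROVENANCE[tag[5]],
        "text": TEXTS[tag[6:]],         # 1 or 2 digits
    }

print(decode_identifier("0042113"))  # 42nd sign, Möller, Thebes, Papyrus Westcar
```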
https://www.koncile.ai/en/termsandconditionshttps://www.koncile.ai/en/termsandconditions
AI-powered OCR to extract all fields from your ID documents (PDF or image). Turn your documents into data via API or SDK. Reliable and customizable.
Apache License, v2.0
https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BLIP3-OCR-200M Dataset
Overview
The BLIP3-OCR-200M dataset is designed to address the limitations of current Vision-Language Models (VLMs) in processing and interpreting text-rich images, such as documents and charts. Traditional image-text datasets often struggle to capture nuanced textual information, which is crucial for tasks requiring complex text comprehension and reasoning.
Key Features
OCR Integration: The dataset incorporates Optical Character… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/blip3-ocr-200m.
This dataset is a processed and classified (labelled) dataset for Tamil OCR, derived from an existing dataset via its accompanying notebook. I re-uploaded the data for use with my copy of the notebook, but also publish it here so that it might be useful to others. I suspect the original datasets in the notebook were unintentionally left private, as the author provided a Google Drive link to the files in their public notebook. I have prepared this dataset for sharing here in the hope it may be useful. Licensing information was not provided with the original dataset; please direct licensing queries to either the original dataset publisher or me.
Filenames follow the pattern u[?]_[character_number]t[sample_number].tiff, where character_number indexes the identity of the characters (providing labelling information) and sample_number indexes the samples of said characters.
The significance of u[?] is unknown to me, but I suspect it corresponds to the identities of the people who hand-wrote the samples.
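A label-extraction sketch under these assumptions (numeric character and sample indices, an arbitrary token after "u") follows; adjust the regex if the real filenames differ.

```python
# A minimal sketch of recovering labels from the filename pattern above.
import re

PATTERN = re.compile(r"u(?P<writer>[^_]+)_(?P<character>\d+)t(?P<sample>\d+)\.tiff$")

def parse_name(filename: str) -> dict:
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    return {
        "writer": match["writer"],             # the unexplained u[?] token
        "character": int(match["character"]),  # class label for OCR training
        "sample": int(match["sample"]),
    }

print(parse_name("u3_017t02.tiff"))  # hypothetical example filename
```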
Apache License, v2.0
https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenDoc-Pdf-Preview
OpenDoc-Pdf-Preview is a compact visual preview dataset containing 6,000 high-resolution document images extracted from PDFs. This dataset is designed for Image-to-Text tasks such as document OCR pretraining, layout understanding, and multimodal document analysis.
Dataset Summary
Modality: Image-to-Text Content Type: PDF-based document previews Number of Samples: 6,000 Language: English Format: Parquet Split: train only Size: 606 MB License: Apache… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview.