20 datasets found
  1. ID's photo Dataset | 67 countries | 11 types of documents | Document...

    • datarade.ai
    .jpg, .jpeg, .png
    Updated Jul 25, 2025
    Cite
    FileMarket (2025). ID's photo Dataset | 67 countries | 11 types of documents | Document Recognition | OCR Training | Computer Vision [Dataset]. https://datarade.ai/data-products/id-s-photo-dataset-67-countries-11-types-of-documents-d-filemarket
    Explore at:
    Available download formats: .jpg, .jpeg, .png
    Dataset updated
    Jul 25, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Bulgaria, France, Egypt, Cuba, Sri Lanka, Venezuela (Bolivarian Republic of), Indonesia, Mexico, Peru, Benin
    Description

    Total individuals: 1661
    Total images: 3623
    Images per user: 2.18

    Top countries (67 countries in total):
    - Nigeria 44.6%
    - United States of America 7.2%
    - Bangladesh 7.1%
    - Ethiopia 6.7%
    - Indonesia 4.8%
    - India 4.8%
    - Kenya 2.4%
    - Iran 2.3%
    - Nepal 1.7%
    - Pakistan 1.4%

    Type of documents:
    - Identification Card (ID Card) 63.2%
    - Driver's License 6.4%
    - Student ID 4.9%
    - International passport 2.8%
    - Domestic passport 0.8%
    - Residence Permit 0.7%
    - Military ID 0.4%
    - Health Insurance Card 0.2%

    Data is organized in per‑user folders and includes rich metadata.

    Within a folder you may find: (a) multiple document categories for the same person, and/or (b) repeated captures of the same document against different backgrounds or lighting setups. The maximum volume per individual is 28 images.

    Metadata includes country of document, type of document, created date, last name, first name, day of birth, month of birth and year of birth.

    Every image was provided with explicit user consent. This ensures downstream use cases—such as training and evaluating document detection, classification, text extraction, and identity authentication models—are supported by legally sourced data.

  2. Dataset of invoices and receipts including annotation of relevant fields

    • zenodo.org
    zip
    Updated Apr 3, 2022
    Cite
    Francisco Cruz; Mauro Castelli (2022). Dataset of invoices and receipts including annotation of relevant fields [Dataset]. http://doi.org/10.5281/zenodo.6371710
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 3, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francisco Cruz; Mauro Castelli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset comprising 813 images of invoices and receipts of a private company in the Portuguese language. It also includes text files with the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference.

  3. IITBBS-OCR-Dataset

    • ieee-dataport.org
    Updated May 21, 2024
    Cite
    Niladri Puhan (2024). IITBBS-OCR-Dataset [Dataset]. https://ieee-dataport.org/documents/iitbbs-ocr-dataset
    Explore at:
    Dataset updated
    May 21, 2024
    Authors
    Niladri Puhan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    popularity and usefulness

  4. Dataset of ICDAR 2019 Competition on Post-OCR Text Correction

    • live.european-language-grid.eu
    • zenodo.org
    • +1 more
    txt
    Updated Sep 12, 2022
    Cite
    (2022). Dataset of ICDAR 2019 Competition on Post-OCR Text Correction [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7738
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 12, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Corpus for the ICDAR2019 Competition on Post-OCR Text Correction (October 2019)
    Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
    http://l3i.univ-larochelle.fr/ICDAR2019PostOCR

    These are the supplementary materials for the ICDAR 2019 paper "ICDAR 2019 Competition on Post-OCR Text Correction". Please use the following citation:

    @inproceedings{rigaud2019pocr,
      title="ICDAR 2019 Competition on Post-OCR Text Correction",
      author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
      year={2019},
      booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
    }

    Description: The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource.

    Distribution of the dataset
    - ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.
    - ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset, used for the evaluation (with the Gold Standard made public after the competition).
    - ICDAR2019_Post_OCR_correction_full_22M: full dataset, made publicly available after the competition.

    Special case for the Finnish language
    Material from the National Library of Finland (Finnish dataset FI > FI1) may not be re-shared on other websites. Please follow these guidelines to get and format the data from the original website.
    1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en;
    2. Download OCR Ground Truth Pages (Finnish Fraktur) [v1](4.8GB) from the Digitalia (2015-17) package;
    3. Convert the Excel file "~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx" to comma-separated format (.csv) using the "Save as" function of a spreadsheet program (e.g. Excel, Calc) and copy it into "FI/FI1/HOWTO_get_data/input/";
    4. Go to "FI/FI1/HOWTO_get_data/" and run "script_1.py" to generate the full "FI1" dataset in "output/full/";
    5. Run "script_2.py" to split the "output/full/" dataset into "output/training/" and "output/evaluation/" subsets.
    At the end of the process, you should have a "training", "evaluation" and "full" folder with 1579528, 380817 and 1960345 characters respectively.
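
    The three character counts quoted for the Finnish FI1 subset can be checked with simple arithmetic. This is a minimal sketch that uses only those three numbers; the variable names are illustrative, and the 80%/20% split is stated for the full corpus, so assuming it holds for each subset is an assumption.

```python
# Character counts quoted in the listing for the Finnish FI1 subset.
training, evaluation, full = 1_579_528, 380_817, 1_960_345

# The training and evaluation folders should add up to the full folder.
assert training + evaluation == full

# Share of characters that ends up in the training folder.
training_share = training / full
print(f"Training share of FI1: {training_share:.1%}")
```

    A share close to, but not exactly, 80% would be consistent with a split made per document rather than per character, although the listing does not say how the split was made.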

    Licenses: free to use for non-commercial purposes, as detailed per source below
    - BG1: IMPACT - National Library of Bulgaria: CC BY NC ND
    - CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA
    - DE1: Front pages of Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)
    - DE2: IMPACT - German National Library: CC BY NC ND
    - DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    - EN1: IMPACT - British Library: CC BY NC SA 3.0
    - ES1: IMPACT - National Library of Spain: CC BY NC SA
    - FI1: National Library of Finland: no re-sharing allowed, follow the above section to get the data. (https://digi.kansalliskirjasto.fi/opendata)
    - FR1: HIMANIS Project: CC0 (https://www.himanis.org)
    - FR2: IMPACT - National Library of France: CC BY NC SA 3.0
    - FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)
    - NL1: IMPACT - National library of the Netherlands: CC BY
    - PL1: IMPACT - National Library of Poland: CC BY
    - SL1: IMPACT - Slovak National Library: CC BY NC

    Text post-processing such as cleaning and alignment has been applied to the resources mentioned above, so the Gold Standard and the OCRs provided are not necessarily identical to the originals.

    Structure
    - Content [./lang_type/sub_folder/#.txt]
      - "[OCR_toInput] " => Raw OCRed text to be de-noised.
      - "[OCR_aligned] " => Aligned OCRed text.
      - "[ GS_aligned] " => Aligned Gold Standard text.

    The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using "@" symbols. "#" symbols correspond to the absence of GS, related either to alignment uncertainties or to unreadable characters in the source document. For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.

    The Error Rate and the quality of the alignment vary according to the nature and the state of degradation of the source documents. Periodicals (mostly historical newspapers), for example, have been reported to be especially challenging due to their complex layout and their original fonts. In addition, the quality of the Gold Standard also varies, as the dataset aggregates resources from different projects that have their own annotation procedures, and it inevitably contains some errors.
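
    The three-line layout described above can be read with a few lines of Python. This is a minimal sketch, not the competition's official tooling: the function names `parse_record` and `aligned_mismatches` are hypothetical, the sample text is invented for illustration, and the code assumes each field sits on its own line and starts with exactly the prefix quoted in the description.

```python
# Prefixes as quoted in the dataset description (note the space in "[ GS_aligned] ").
PREFIXES = {
    "ocr_input": "[OCR_toInput] ",
    "ocr_aligned": "[OCR_aligned] ",
    "gs_aligned": "[ GS_aligned] ",
}


def parse_record(text: str) -> dict:
    """Split one file's content into its raw OCR, aligned OCR and aligned GS fields."""
    record = {}
    for line in text.splitlines():
        for key, prefix in PREFIXES.items():
            if line.startswith(prefix):
                record[key] = line[len(prefix):]
    return record


def aligned_mismatches(ocr_aligned: str, gs_aligned: str) -> list:
    """Return (position, ocr_char, gs_char) where the aligned strings differ.

    Positions where the Gold Standard is "#" (no GS available) are skipped.
    "@" padding is kept, so an "@" facing a real character shows up as a mismatch.
    """
    return [
        (i, o, g)
        for i, (o, g) in enumerate(zip(ocr_aligned, gs_aligned))
        if g != "#" and o != g
    ]


# Invented sample, not taken from the corpus.
sample = (
    "[OCR_toInput] Tbe quick brwn fox\n"
    "[OCR_aligned] Tbe quick br@wn fox\n"
    "[ GS_aligned] The quick brown fox\n"
)
record = parse_record(sample)
errors = aligned_mismatches(record["ocr_aligned"], record["gs_aligned"])
print(errors)
```

    Because the alignment is character-level, the two aligned strings are expected to have equal length; a real loader would probably want to verify that before comparing them.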

    ICDAR2019 competition
    Information related to the tasks, formats and evaluation metrics is detailed at: https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation

    References
    - IMPACT, European Commission's 7th Framework Program, grant agreement 215064
    - Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
    - https://digi.nationallibrary.fi , Wiipuri, 31.12.1904, Digital Collections of National Library of Finland
    - EU Horizon 2020 research and innovation programme grant agreement No 770299

    Contact
    - christophe.rigaud(at)univ-lr.fr
    - antoine.doucet(at)univ-lr.fr
    - mickael.coustaty(at)univ-lr.fr
    - jean-philippe.moreux(at)bnf.fr

    L3i - University of la Rochelle, http://l3i.univ-larochelle.fr
    BnF - French National Library, http://www.bnf.fr

  5. MANUU: Handwritten Urdu OCR Dataset

    • ieee-dataport.org
    Updated Dec 15, 2024
    Cite
    Shaik Ahmed (2024). MANUU: Handwritten Urdu OCR Dataset [Dataset]. https://ieee-dataport.org/documents/manuu-handwritten-urdu-ocr-dataset
    Explore at:
    Dataset updated
    Dec 15, 2024
    Authors
    Shaik Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    digits

  6. Text Recognition Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 11, 2025
    + more versions
    Cite
    Archive Market Research (2025). Text Recognition Software Report [Dataset]. https://www.archivemarketresearch.com/reports/text-recognition-software-56023
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 11, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global text recognition software market is experiencing robust growth, driven by the increasing digitization of documents and the rising demand for automation in various industries. The market is estimated to be valued at $15 billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of 18% from 2025 to 2033. This significant expansion is fueled by several key factors. The proliferation of unstructured data in businesses necessitates efficient and accurate text extraction solutions, leading to increased adoption of text recognition software across sectors. Furthermore, advancements in artificial intelligence (AI) and machine learning (ML) are enhancing the accuracy and speed of text recognition, making the technology more accessible and appealing to a wider user base. The emergence of cloud-based solutions is further accelerating market growth, offering scalability, cost-effectiveness, and enhanced accessibility. Key market segments include free and paid software options, catering to diverse budgetary requirements, and applications spanning enterprise, municipal, university, and other sectors. Competition is fierce, with established players like Google, Amazon, and Adobe alongside specialized providers vying for market share. The market's growth trajectory is projected to remain strong throughout the forecast period, driven by continued technological advancements, rising adoption in emerging economies, and the growing need for efficient data processing across various industries. However, challenges remain, including data security and privacy concerns, the need for high-quality training data for AI/ML models, and the potential for inaccuracies in complex or poorly scanned documents. Nonetheless, the long-term outlook for the text recognition software market remains positive, indicating a substantial opportunity for growth and innovation. 
    The market segmentation provides various entry points for different vendors, encouraging specialized solutions tailored to particular needs. The geographic spread of adoption, encompassing regions like North America, Europe, and Asia-Pacific, signifies a global and rapidly expanding market with significant future potential.

  7. 19th-Century Romanian Transitional Script

    • kaggle.com
    Updated May 21, 2024
    Cite
    Marius E. Penteliuc (2024). 19th-Century Romanian Transitional Script [Dataset]. https://www.kaggle.com/datasets/mariuspenteliuc/rts-ocr
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marius E. Penteliuc
    Description

    This dataset consists of 156 pages of Romanian texts written in the Romanian Transitional Script (RTS). RTS is a mix of Latin and Cyrillic characters that were used in the 19th century in the Romanian provinces to facilitate the transition from the Romanian Cyrillic Script to the modern Latin Script. The images cover the period between 1833 and 1864. The selected texts cover a diverse range of literary genres, including poems, novels, dramas, stories, newspapers, and religious texts.

    The dataset was obtained from the Central University Libraries (BCU) of Timișoara, Iași, and Cluj-Napoca through their free online platforms or by request. The scanned images are provided in JPEG and PNG formats, with dimensions ranging from approximately 300 by 900 pixels to 2000 by 3000 pixels. The file sizes vary between 70 KB and 10 MB.

    To ensure diversity, the dataset includes images with various fonts, styles, regions, publishers, and years. It covers all three main Romanian provinces' key publishing regions (Bucharest - B, Iasi - IS, Brasov - BV, Sibiu - SB, Blaj - BJ) as well as some located outside Romania that printed texts in RTS (Vienna - V, Budapest - BD, Paris - P). It comprises 4588 lines of text, totaling 31,132 words and 158,656 characters. Among these characters, there are 61,065 Cyrillic characters, 27,022 Latin characters, 53,844 overlapping characters (identical symbols), and 16,725 other characters (e.g., punctuation, digits). The images below summarize its content per publisher and decade. More statistics (including per publishing house and per character) are available in the code provided.

    [Figure: Statistics of characters in the dataset per publisher and decade*] https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15661653%2F13bd86216df169b5c4783813a4b5118f%2Fchar-count.png?generation=1687532923729343&alt=media

    [Figure: Percentage of Latin vs. Cyrillic vs. other characters in the dataset*] https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15661653%2F0cfad1574aa2823b798fcf2b515beff6%2Fchar-ratio.png?generation=1687532980067286&alt=media

    The dataset presents typical challenges found in old documents, such as wear and tear, blemishes, discolorations, library imprints, handwriting, ink smudges, and variations in text alignment. These factors may impact legibility, and some scanned lines of text may not be uniformly straight.

    This dataset provides a valuable resource for researchers and practitioners interested in historical document analysis, transliteration techniques, and studying the evolution of the Romanian language. It allows for the development and evaluation of OCR models and other language processing techniques in the context of the Romanian Transitional Script. The images provided are accompanied by ground truth texts (.gt.txt files) containing the correct text found in them, as well as .box files for the Tesseract 5 OCR engine.
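
    Since each image comes with a .gt.txt ground-truth file, a common first step before training or evaluation is to pair the two and flag images that lack a transcription. The sketch below assumes that an image and its ground truth share a base name (for example `page.png` and `page.gt.txt`); that naming scheme, the function name `pair_ground_truth`, and the file names in the example are assumptions, not details confirmed by the listing.

```python
from pathlib import PurePath

# Image formats mentioned in the dataset description (JPEG and PNG).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_ground_truth(names):
    """Map each image file name to its .gt.txt file name, if one is present.

    Returns (pairs, missing): a dict of image -> ground truth, and a list of
    images for which no ground-truth file was found.
    """
    available = set(names)
    pairs, missing = {}, []
    for name in sorted(names):
        path = PurePath(name)
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip .gt.txt, .box and any other non-image files
        gt_name = str(path.with_suffix(".gt.txt"))
        if gt_name in available:
            pairs[name] = gt_name
        else:
            missing.append(name)
    return pairs, missing


# Hypothetical file names, for illustration only.
listing = ["B_1840_01.png", "B_1840_01.gt.txt", "B_1840_01.box", "IS_1852_07.jpg"]
pairs, missing = pair_ground_truth(listing)
print(pairs)
print(missing)
```

    Working on plain file names keeps the sketch independent of where the archive is unpacked; the same function can be fed the output of a directory listing.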

    Usage

    You may use the dataset freely as long as you mention this page or the project below.

    Acknowledgements

    This work was supported by a grant of the Romanian Ministry of Research, Innovation and Digitization, CCCDI - UEFISCDI, project number PN-III-P2-2.1-PED-2021-0693, within PNCDI III. Project website: ROTLA

    *Plots are based on the original dataset distribution

  8. SinOCR and SinFUND - Sinhala OCR and Form Understanding Datasets

    • ieee-dataport.org
    Updated Jun 20, 2024
    Cite
    Thanuja Ambegoda (2024). SinOCR and SinFUND - Sinhala OCR and Form Understanding Datasets [Dataset]. https://ieee-dataport.org/documents/sinocr-and-sinfund-sinhala-ocr-and-form-understanding-datasets
    Explore at:
    Dataset updated
    Jun 20, 2024
    Authors
    Thanuja Ambegoda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    includes 100

  9. Mohamed Dhouib, Ghassen Bettaieb, Aymen Shabou (2025). Dataset: DocParser:...

    • service.tib.eu
    Updated Jan 2, 2025
    Cite
    (2025). Mohamed Dhouib, Ghassen Bettaieb, Aymen Shabou (2025). Dataset: DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents. https://doi.org/10.57702/ax947n8j [Dataset]. https://service.tib.eu/ldmservice/dataset/docparser--end-to-end-ocr-free-information-extraction-from-visually-rich-documents
    Explore at:
    Dataset updated
    Jan 2, 2025
    Description

    Information Extraction from visually rich documents is a challenging task that has gained a lot of attention in recent years due to its importance in several document-control based applications and its widespread commercial value.

  10. synthdog-ko

    • huggingface.co
    Updated Dec 13, 2024
    + more versions
    Cite
    NAVER CLOVA INFORMATION EXTRACTION (2024). synthdog-ko [Dataset]. https://huggingface.co/datasets/naver-clova-ix/synthdog-ko
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Naver Corporation (http://www.navercorp.com/)
    Authors
    NAVER CLOVA INFORMATION EXTRACTION
    Description

    Donut 🍩 : OCR-Free Document Understanding Transformer (ECCV 2022) -- SynthDoG datasets

    For more information, please visit https://github.com/clovaai/donut

    The links to the SynthDoG-generated datasets are here:

    - synthdog-en: English, 0.5M.
    - synthdog-zh: Chinese, 0.5M.
    - synthdog-ja: Japanese, 0.5M.
    - synthdog-ko: Korean, 0.5M.

    To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.

    How to Cite

    If you find this work useful… See the full description on the dataset page: https://huggingface.co/datasets/naver-clova-ix/synthdog-ko.

  11. Intelligent Text Recognition Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Jan 21, 2025
    Cite
    Market Research Forecast (2025). Intelligent Text Recognition Report [Dataset]. https://www.marketresearchforecast.com/reports/intelligent-text-recognition-11806
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Jan 21, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Intelligent Text Recognition market size was valued at USD 4,351.1 million in 2025 and is projected to reach USD 21,056.6 million by 2033, exhibiting a CAGR of 24.1% during the forecast period (2025-2033). The market growth is attributed to the rising adoption of OCR technology to automate data capture and processing, increasing demand for document digitization in various industries, and growing popularity of mobile devices equipped with OCR capabilities. Key market drivers include increasing demand for efficient and error-free data entry, advancements in AI and machine learning technologies, growing use of OCR in automated document processing, and increasing adoption of OCR in cloud-based solutions. The market is segmented by type (print recognition, handwriting recognition), application (logistics industry, financial industry, medical industry, government and public services, others), and region (North America, Europe, Asia Pacific, Middle East & Africa, South America). North America is expected to dominate the market throughout the forecast period due to the presence of major technology companies and early adoption of advanced technologies. Asia Pacific is anticipated to witness the fastest growth during the forecast period due to the increasing demand for OCR solutions in emerging economies like China, India, and Japan.

  12. Smart Document Scanner OCR App Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 29, 2025
    Cite
    Growth Market Reports (2025). Smart Document Scanner OCR App Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/smart-document-scanner-ocr-app-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Smart Document Scanner OCR App Market Outlook

    According to our latest research, the global Smart Document Scanner OCR App market size reached USD 3.85 billion in 2024, exhibiting robust growth driven by the rapid digitization of workflows and the increasing need for document automation across various sectors. The market is projected to grow at a CAGR of 13.7% from 2025 to 2033, with the market size forecasted to reach USD 11.89 billion by 2033. This significant expansion is primarily attributed to the widespread adoption of mobile devices, advancements in artificial intelligence and machine learning, and the growing demand for efficient document management solutions in both personal and professional environments.
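
    The growth figures quoted above can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below is an illustration only: the function name `cagr` is hypothetical, and treating 2024 to 2033 as nine compounding periods is an assumption, since the report states its 13.7% rate for 2025 to 2033 and may use a different base year.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the listing: USD 3.85 billion (2024) and USD 11.89 billion (2033).
# Nine periods is an assumption about how the report counts years.
implied_rate = cagr(3.85, 11.89, 9)
print(f"Implied CAGR over 9 periods: {implied_rate:.2%}")
```

    A result that differs slightly from the quoted rate would not by itself indicate an error, because rounding and the choice of base year both shift the outcome.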

    One of the primary growth factors fueling the Smart Document Scanner OCR App market is the accelerating pace of digital transformation across industries such as healthcare, finance, education, and government. Organizations are increasingly seeking ways to streamline their document handling processes, reduce manual data entry errors, and improve operational efficiency. The integration of Optical Character Recognition (OCR) technology into smart document scanning apps enables users to quickly convert paper documents into editable and searchable digital formats, significantly enhancing productivity. Furthermore, the proliferation of remote work and the need for secure, cloud-based document sharing have further heightened the demand for advanced OCR-enabled scanning solutions.

    Another significant driver is the continuous innovation in artificial intelligence and machine learning algorithms, which are making OCR technology more accurate, reliable, and versatile. Modern Smart Document Scanner OCR Apps can now recognize a wide range of fonts, languages, and complex layouts, including tables and handwritten notes, with remarkable precision. This technological evolution has broadened the application scope of these apps, allowing them to be used not only for basic document digitization but also for tasks such as invoice processing, identity verification, and compliance management. The incorporation of AI-powered features such as automatic document detection, real-time translation, and advanced data extraction is further propelling market growth.

    The increasing penetration of smartphones and mobile devices globally has also played a crucial role in the expansion of the Smart Document Scanner OCR App market. With the majority of the population now having access to high-resolution cameras and powerful processing capabilities on their mobile devices, scanning and digitizing documents has become more convenient than ever. This trend is particularly pronounced in emerging markets, where mobile-first solutions are often preferred over traditional desktop-based applications. Additionally, the growing emphasis on paperless offices and environmental sustainability is encouraging both individuals and enterprises to adopt digital document management practices, thereby boosting the market for OCR-enabled scanner apps.

    From a regional perspective, North America currently dominates the global Smart Document Scanner OCR App market, accounting for the largest share in 2024. This is largely due to the high adoption rate of advanced technologies, a mature IT infrastructure, and the presence of leading solution providers in the region. However, Asia Pacific is expected to witness the fastest growth over the forecast period, driven by rapid urbanization, increasing smartphone penetration, and rising investments in digital transformation initiatives across countries such as China, India, and Japan. Europe also presents significant growth opportunities, supported by stringent regulatory requirements for data management and a strong focus on innovation in document processing technologies.

    Component Analysis

    The Component segment of the Smart Document Scanner OCR App market is bifurcated into Software and Services. The Software sub-segment holds the lion’s share of the market, as the co

  13. MMDocBench

    • huggingface.co
    Updated Sep 15, 2003
    Cite
    TAT@NExT (2003). MMDocBench [Dataset]. https://huggingface.co/datasets/next-tat/MMDocBench
    Explore at:
    Dataset updated
    Sep 15, 2003
    Dataset authored and provided by
    TAT@NExT
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

    MMDocBench is an open-sourced benchmark with various OCR-free document understanding tasks for evaluating fine-grained visual perception and reasoning abilities. For more details, please refer to the project page: https://MMDocBench.github.io/.

    Dataset Structure

    MMDocBench consists of 15 main tasks and 48 sub-tasks, involving 2,400 document images, 4,338 QA pairs… See the full description on the dataset page: https://huggingface.co/datasets/next-tat/MMDocBench.

  14. USA Driving Licence Image Dataset to train AI/ML Model

    • data.macgence.com
    mp3
    Updated Jun 1, 2024
    Cite
    Macgence (2024). USA Driving Licence Image Dataset to train AI/ML Model [Dataset]. https://data.macgence.com/dataset/usa-driving-licence-image-dataset-to-train-aiml-model
    Explore at:
    Available download formats: mp3
    Dataset updated
    Jun 1, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Comprehensive USA Driving Licence dataset by Macgence, designed to train AI/ML models. Enhance OCR, image recognition, and document automation projects effectively.

  15. Document Capture Software Market by End-user and Geography - Forecast and...

    • technavio.com
    pdf
    Updated Jul 27, 2021
    Cite
    Technavio (2021). Document Capture Software Market by End-user and Geography - Forecast and Analysis 2021-2025 [Dataset]. https://www.technavio.com/report/document-capture-software-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2020 - 2024
    Description

    The document capture software market share is expected to increase by USD 3.57 billion from 2019 to 2024, and the market’s growth momentum will accelerate at a CAGR of 11.57%.

    This document capture software market research report provides valuable insights on the post COVID-19 impact on the market, which will help companies evaluate their business approaches. Furthermore, this report extensively covers the document capture software market segmentations by end-user (BFSI, healthcare, government, legal, and others) and geography (North America, Europe, APAC, and South America). The document capture software market report also offers information on several market vendors, including ABBYY Solutions Ltd., Adobe Inc., Canon Inc., Dell Technologies Inc., Kofax Inc., Oracle Corp., Parashift AG, Rossum Ltd., Seiko Epson Corp., and Xerox Corp. among others.

    What will the Document Capture Software Market Size be During the Forecast Period?

    Document Capture Software Market: Key Drivers, Trends, and Challenges

    Based on our research output, there has been a positive impact on market growth during and after the COVID-19 pandemic. The growing use of big data analytics is notably driving the document capture software market growth, although factors such as risks of data theft and cyber attacks may impede market growth. Our research analysts have studied the historical data and deduced the key market drivers and the COVID-19 pandemic impact on the document capture software industry. The holistic analysis of the drivers will help in deducing end goals and refining marketing strategies to gain a competitive edge.

    Key Document Capture Software Market Driver

    The growing use of big data analytics is one of the key factors driving the growth of the global document capture software market. Big data analytics is the process of examining large and diverse data sets to decipher data patterns, data correlations, customer preferences, and market trends that can help organizations make informed business decisions. The adoption of big data analytics among enterprises is increasing as it offers many business benefits, including improving customer service and operational efficiency, devising effective marketing strategies, identifying new revenue opportunities, and gaining competitive advantages over rivals. While transaction, structured, unstructured, and semi-structured data that is generated by a diverse set of enterprise applications can be directly integrated with the analytics solution, a huge amount of valuable information lies in the physical document repositories of enterprises. This data can be harnessed and analyzed using analytical software after digitization. Data capture software enables organizations to utilize the information stored in physical documents with the help of digital transformation. Such factors will increase the market focus during the forecast period.

    Key Document Capture Software Market Trend

    The use of mobile-based data capture software will fuel the growth of the global document capture software market. The increasing use of mobile devices is encouraging vendors to integrate advanced technologies with mobile-based document capture software to increase customer usability. Examples of such modified offerings include advanced capture, which enables capture from multiple sources and identifies valuable data. Furthermore, the integration of mobile devices and cloud solutions will allow users to capture content anytime, anywhere, and from all types of sources. Moreover, to exploit the opportunity, all the leading document capture software vendors are providing mobile-based document capture software/applications. For instance, ABBYY has come up with ABBYY Mobile OCR Engine, allowing developers to integrate optical character recognition (OCR) into their mobile apps and small-footprint applications. With this mobile OCR software development kit (SDK), apps can extract text from photographed images, transforming smartphones and tablets into efficient mobile data capture devices. Such factors will increase the market focus during the forecast period.

    Key Document Capture Software Market Challenge

    The risks of data theft and cyberattacks are a major challenge for the growth of the global document capture software market. Digitized data and online documents contain confidential company data. These documents are available online, either in the cloud or in an on-premise database. Thus, the data is always at risk from cyberattacks. Cybersecurity and privacy concerns pose a challenge for the adoption of document management systems, which include document capture software. The mismanagement of digital content captured using document capture software increases the vulnerability to cyberattacks. This can lead to a reduction in

  16. AI-Enabled Document Redaction Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 4, 2025
    Growth Market Reports (2025). AI-Enabled Document Redaction Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/ai-enabled-document-redaction-market
    Available download formats: pptx, csv, pdf
    Dataset updated
    Aug 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI-Enabled Document Redaction Market Outlook

    According to our latest research, the global AI-Enabled Document Redaction market size was valued at USD 1.28 billion in 2024 and is expected to reach USD 6.74 billion by 2033, expanding at a robust CAGR of 20.1% during the forecast period. This impressive growth trajectory is primarily driven by the increasing adoption of artificial intelligence for secure data management, escalating privacy regulations, and a rising need for automation in sensitive document processing across industries.

    One of the principal growth drivers for the AI-Enabled Document Redaction market is the proliferation of stringent data privacy regulations worldwide. Laws such as the General Data Protection Regulation (GDPR) in Europe, the California Consumer Privacy Act (CCPA) in the United States, and similar frameworks in other regions have made it imperative for organizations to ensure that personally identifiable information (PII) and sensitive data are adequately protected. Non-compliance can result in substantial fines and reputational damage, making automated redaction solutions powered by AI an essential investment. These solutions not only accelerate the redaction process but also significantly reduce the risk of human error, ensuring higher accuracy and compliance in data handling.

    Another significant factor fueling market expansion is the exponential growth in digital documentation and the increasing volume of unstructured data generated across various sectors. Enterprises, especially in legal, healthcare, BFSI, and government domains, manage vast repositories of documents containing confidential information. Manual redaction is time-consuming, labor-intensive, and prone to oversight. AI-enabled document redaction leverages machine learning and natural language processing to automate the identification and removal of sensitive content, thereby enhancing operational efficiency and reducing costs. The integration of these AI-driven solutions into existing workflows has become a key differentiator for organizations aiming to streamline document management and ensure robust data privacy.

    The rapid advancements in AI technologies, including deep learning, optical character recognition (OCR), and contextual analysis, have further propelled the capabilities of document redaction platforms. Modern AI-enabled redaction tools can process complex document formats, recognize various data types, and adapt to evolving regulatory requirements. This technological evolution has expanded the applicability of these solutions beyond traditional sectors, attracting interest from emerging industries such as IT and telecom, retail, and e-commerce. As AI algorithms continue to mature, the market is expected to witness increased adoption across small and medium enterprises (SMEs) in addition to large corporations, democratizing access to advanced data protection tools.

    Regionally, North America currently dominates the AI-Enabled Document Redaction market, accounting for the largest revenue share in 2024. This leadership is attributed to early technology adoption, a mature regulatory environment, and the presence of leading solution providers. However, Asia Pacific is anticipated to exhibit the fastest growth over the forecast period, driven by rapid digital transformation, growing awareness about data privacy, and increasing investments in AI infrastructure. Europe remains a key market, bolstered by robust privacy laws and a high concentration of global enterprises. The Middle East & Africa and Latin America are also showing promising potential, with governments and organizations gradually embracing AI-powered document management solutions to enhance data security and compliance.

    Component Analysis

    The Component segment of the AI-Enabled Document Redaction market is primarily divided into Software and Services. Software solutions constitute the core of this market, offering advanced AI capabilities such as natural langua

  17. Bank check (Template)

    • koncile.ai
    Koncile, Bank check (Template) [Dataset]. https://www.koncile.ai/en/extraction-ocr/bank-check
    Dataset authored and provided by
    Koncile
    License

    https://www.koncile.ai/en/termsandconditions

    Variables measured
    Name, Bank name, Firstname, Payment date, Place of issue, Total amount paid
    Description

    AI OCR to extract data from bank checks. Fast, accurate, and integrable via API/SDK to automate the processing of banking documents.

  18. Czech-PD

    • huggingface.co
    Updated Nov 6, 2024
    PleIAs (2024). Czech-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Czech-PD
    Dataset updated
    Nov 6, 2024
    Dataset authored and provided by
    PleIAs
    Description

    🇨🇿 Czech Public Domain 🇨🇿

    Czech-Public Domain or Czech-PD is a large collection aiming to aggregate all Czech monographs and periodicals in the public domain. As of March 2024, it is the biggest Czech open corpus.

    Dataset summary

    The collection contains 1585 individual titles making up 259,435,959 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file has the… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Czech-PD.

  19. Viet-OCR-VQA

    • huggingface.co
    Updated Jul 13, 2024
    Fifth Civil Defender - 5CD (2024). Viet-OCR-VQA [Dataset]. https://huggingface.co/datasets/5CD-AI/Viet-OCR-VQA
    Dataset updated
    Jul 13, 2024
    Dataset authored and provided by
    Fifth Civil Defender - 5CD
    Area covered
    Việt Nam
    Description

    Dataset Overview

    The dataset comprises over 137,000 images potentially containing Vietnamese 🇻🇳 textual content. It was curated using the Gemini 1.5 Flash model, currently the leading Google model on the WildVision Arena Leaderboard for Visual Question Answering (VQA). Each image is accompanied by a detailed description and 5 self-generated questions and answers related to the textual content within the image. In total, there are more than 822,679 individual questions, encompassing… See the full description on the dataset page: https://huggingface.co/datasets/5CD-AI/Viet-OCR-VQA.

  20. Deliberations of the bodies of the city of Nantes and Nantes Métropole |...

    • gimi9.com
    Updated Jan 11, 2024
    (2024). Deliberations of the bodies of the city of Nantes and Nantes Métropole | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_https-data-nantesmetropole-fr-explore-dataset-244400404_deliberations-instances-metropole-nantes-/
    Dataset updated
    Jan 11, 2024
    Area covered
    Nantes Métropole, Nantes
    Description

    Deliberations of the Municipal Council of the City of Nantes, the Metropolitan Council, the Metropolitan Bureau of Nantes Métropole and the Communal Centre for Social Action (CCAS) of the City of Nantes.

    This dataset aggregates the information obtained from the deliberations of the various bodies of Nantes Métropole and the City of Nantes. A description of each body, as well as all agendas and reports, is available on the community's institutional website on the dedicated pages for the City Council, the Metropolitan Council, the Metropolitan Bureau and the CCAS.

    The deliberation data in this dataset are extracted from the files transmitted by the community to the Prefecture for the control of legality through the FAST – Acts service. Deliberations are part of the common core of local data, i.e. a set of data that communities agree to publish as a matter of priority, following a shared way of organising information. As a result, the file is modeled to correspond to the standard schema defined under the umbrella of the Open Data France association.

    Note on the textual content of the deliberations, included to facilitate search: currently, the deliberations of the community bodies are validated on paper and signed by hand. The final versions published on the community's website are scans of these documents. Because they are scanned images, their content is only visually accessible and is not indexed by search engines. To facilitate search in this database, a free optical character recognition engine (Tesseract 4) is used, which is based on artificial intelligence (LSTM-type neural network, see the Tesseract documentation). The extracted content has a very high level of reliability, but occasional errors may remain. For any purpose other than search, always refer to the PDF documents, which alone are authentic.
