18 datasets found
  1. Noisy OCR Dataset (NOD)

    • zenodo.org
    bin
    Updated Jul 6, 2021
    Cite
    Thomas Hegghammer; Thomas Hegghammer (2021). Noisy OCR Dataset (NOD) [Dataset]. http://doi.org/10.5281/zenodo.5068735
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas Hegghammer; Thomas Hegghammer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).

    Source images

    The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.

    Artificial noise application

    The dataset was created as follows:
    - First, a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
    - Then six ideal types of image noise ("blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains") were applied to both the colour and the greyscale version of the images, creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
    - Lastly, all available combinations of two noise filters were applied to the colour and greyscale images, for an additional 30 versions.

    This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.

    The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See this link for how to unzip .tar.lzma files.
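    As a minimal sketch (the archive file name below is a placeholder for whichever .tar.lzma file was downloaded), the archive can be unpacked with Python's standard tarfile module, whose LZMA support auto-detects .tar.lzma files:

    import tarfile

    # Extract an LZMA-compressed tar archive with the standard library.
    # "nod_english.tar.lzma" is a placeholder; use the file actually downloaded.
    # Note: the extracted data is ~193 GiB, so check available disk space first.
    with tarfile.open("nod_english.tar.lzma", mode="r:xz") as archive:
        archive.extractall(path="NOD")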

    References:

    Barcha, Pedro. 2017. “Old Books Dataset.” GitHub repository. https://github.com/PedroBarcha/old-books-dataset.

    Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. “Yarmouk Arabic OCR Dataset.” In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.

    Hegghammer, Thomas. 2021. “OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment.” SocArXiv. https://osf.io/preprints/socarxiv/6zfvs

  2. Tesseract OCR of IIT-CDIP Dataset

    • zenodo.org
    application/gzip
    Updated May 13, 2022
    Cite
    Brian Davis; Brian Davis (2022). Tesseract OCR of IIT-CDIP Dataset [Dataset]. http://doi.org/10.5281/zenodo.6540454
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brian Davis; Brian Davis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are Tesseract-generated transcriptions (no images) of most of the IIT-CDIP dataset. To download the images of the IIT-CDIP dataset, go to https://data.nist.gov/od/id/mds2-2531

    The directory structure of this dataset is the same as that of the IIT-CDIP dataset (although everything is in one tar, with "a.a", "a.b", ... directories), so it can be combined with the image IIT-CDIP dataset using rsync or a similar tool. This dataset contains an "X.layout.json" for each "X.png" in the IIT-CDIP dataset (sections 'a', 'w', 'x', 'y', and 'z' are not included).

    The JSON files contain block/paragraph, line and word bounding boxes, with transcriptions for the words, following the Tesseract format. The line and word annotations are taken directly from Tesseract. The block and paragraph output of Tesseract was discarded; instead, the images were run through both the Publaynet and PrimaNet models available on LayoutParser (https://layout-parser.github.io/), and the combined output of these models became the block/paragraph annotations (we kept the Tesseract output format, but each block has one paragraph of exactly the same shape).

    Important: There is also a "rotation" value in the json (0, 90, 180, or 270), indicating that the json may correspond to a version of the IIT-CDIP image rotated by the given amount (documents were rotated to an upright position where possible to get better OCR results).
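    A minimal sketch of how such an annotation might be consumed (the file names are placeholders, and the "rotation" key is the only field name taken from the description above; whether the stored angle is clockwise or counter-clockwise should be verified on a few pages):

    import json
    from PIL import Image

    def load_aligned(image_path, layout_json_path):
        """Open an IIT-CDIP page image and rotate it so that it matches the
        coordinate frame of its Tesseract layout annotation."""
        with open(layout_json_path, encoding="utf-8") as f:
            layout = json.load(f)
        img = Image.open(image_path)
        rotation = layout.get("rotation", 0)  # 0, 90, 180 or 270
        if rotation:
            # Assumption: rotating the original image by the stored angle
            # reproduces the upright version the OCR was run on.
            img = img.rotate(-rotation, expand=True)
        return img, layout

    # Placeholder paths, for illustration only.
    page, annotation = load_aligned("a.a/example.png", "a.a/example.layout.json")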

    These are the annotations used to pre-train Dessurt (https://arxiv.org/abs/2203.16618).

    These annotations will be worse than those that would be obtained using a commercial OCR system (like those used to pre-train LayoutLMv2/v3).

    The code used to produce these annotations is available here: https://github.com/herobd/ocr

  3. Scrambled text: training Language Models to correct OCR errors using synthetic data

    • b2find.eudat.eu
    Updated Oct 27, 2024
    + more versions
    Cite
    (2024). Scrambled text: training Language Models to correct OCR errors using synthetic data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1ea0205e-de3a-54e7-a918-fde36ad3156f
    Explore at:
    Dataset updated
    Oct 27, 2024
    Description

    This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT-4o. These articles are available both as a csv with the prompt parameters as columns and as individual text files.

    The files in the repository are as follows:
    • ncse_hf_dataset: a Hugging Face dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.
    • synth_gt.zip: a zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed length of tokens, for a total of 2 million tokens. The observation lengths are 200, 100, 50, 25 and 10.
    • synthetic_articles.zip: a zip file containing the csv of all the synthetic articles and the prompts used to generate them.
    • synthetic_articles_text.zip: a zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.

    The data in this repo is used by the code repositories associated with the project:
    https://github.com/JonnoB/scrambledtext_analysis
    https://github.com/JonnoB/training_lms_with_synthetic_data
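    A minimal sketch of loading the two main pieces (assuming the archives have been unpacked locally and that the Hugging Face dataset was saved with save_to_disk; the parquet file name below is a guess based on the description):

    import pandas as pd
    from datasets import load_from_disk

    # Test set: 91 NCSE articles with the original OCR and ground-truth transcription.
    ncse = load_from_disk("ncse_hf_dataset")
    print(ncse)

    # Synthetic training data: parquet files of fixed-length token observations
    # (lengths 200, 100, 50, 25 and 10); the exact file name inside synth_gt.zip may differ.
    train_200 = pd.read_parquet("synth_gt/synth_gt_200.parquet")
    print(train_200.shape)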

  4. chinese_text_recognition

    • huggingface.co
    Cite
    priyank, chinese_text_recognition [Dataset]. https://huggingface.co/datasets/priyank-m/chinese_text_recognition
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    priyank
    License

    https://choosealicense.com/licenses/undefined/

    Description
  5. NomNaOCR

    • kaggle.com
    Updated Oct 14, 2022
    Cite
    Quan Dang (2022). NomNaOCR [Dataset]. https://www.kaggle.com/quandang/nomnaocr/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Quan Dang
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    A Dataset for Optical Character Recognition on old Vietnamese handwritten text

    GitHub: https://github.com/ds4v/NomNaOCR

    Paper: https://ieeexplore.ieee.org/document/10013842

    Demo: https://youtu.be/o5xpfwalEWw

    Here, we introduce the NomNaOCR dataset for the old Vietnamese HĂĄn-NĂŽm script, based on 3 tremendous and valuable historical works of Vietnam:
    • LỄc VĂąn TiĂȘn by Nguyễn ĐÏnh Chiểu.
    • Tale of Kiều or Truyện Kiều (versions 1866, 1871, and 1872) by Nguyễn Du.
    • A full set of 5 parts of History of Greater Vietnam or ĐáșĄi Việt Sá»­ KĂœ ToĂ n Thư (ĐVSKTT), composed by many historians from the Tráș§n to the Háș­u LĂȘ dynasty of Vietnam.

    The dataset comprises 2,953 handwritten Pages (2,956 minus 3 ignored Pages) collected from the Vietnamese NĂŽm Preservation Foundation, which were analyzed and semi-annotated with bounding boxes to generate an additional 38,318 Patches (38,319 minus 1 ignored Patch) containing text, along with the corresponding HĂĄn-NĂŽm strings in digital form. This currently makes NomNaOCR the biggest dataset for the HĂĄn-NĂŽm script in Vietnam, serving 2 main problems in Optical Character Recognition on the HĂĄn-NĂŽm script:
    • Text Detection: detect the image regions that contain text. The input is an image (or a Page), and the output is a bounding box surrounding the text area found.
    • Text Recognition: after detecting boxes or image regions containing text, each of these regions is cropped from the original image, forming small parts called Patches. The input is now a Patch, and the output is the text in that Patch (a cropping sketch follows below).
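    For illustration, the hand-off between the two tasks can be sketched as a simple crop of a detected region from a Page image (a minimal sketch; file names and coordinates are invented, and the dataset's own annotations may describe quadrilaterals rather than axis-aligned boxes):

    from PIL import Image

    def crop_patch(page_path, box):
        """Crop one detected text region (a Patch) from a Page image.
        `box` is (left, top, right, bottom) in pixels, a simplification of the
        detector output."""
        return Image.open(page_path).crop(box)

    # Invented example: crop a vertical text column and save it as recognition input.
    patch = crop_patch("page_001.jpg", (120, 80, 220, 640))
    patch.save("patch_001.png")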

    A difference here is that our implementations were all done at the sequence level, which not only saves annotation cost but also helps retain the context of the sequence, instead of operating on each individual character as in most previous works.

    OCR pipeline illustration: https://github.com/ds4v/NomNaOCR/raw/main/Assets/ocr_pipeline1.jpg

    Note: There are characters that Kaggle cannot display. Use the NomNaTong font to read the HĂĄn-NĂŽm content properly.

  6. NCSE v2.0: A Dataset of OCR-Processed 19th Century English Newspapers

    • b2find.eudat.eu
    Updated Jul 24, 2025
    + more versions
    Cite
    (2025). NCSE v2.0: A Dataset of OCR-Processed 19th Century English Newspapers - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/bd615bf4-a43a-5ed0-bb26-59e8977c3ff8
    Explore at:
    Dataset updated
    Jul 24, 2025
    Description

    NCSE v2.0 Dataset Repository

    This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models".

    Dataset Overview

    The NCSE v2.0 is a digitized collection of six 19th-century English periodicals containing:
    • 82,690 pages
    • 1.4 million entries
    • 321 million words
    • 1.9 billion characters

    The dataset includes:
    • 1.1 million text entries
    • 198,000 titles
    • 17,000 figure descriptions
    • 16,000 tables

    Repository Contents

    NCSE v2.0 Dataset
    • NCSE_v2.zip: a folder containing a parquet file for each of the periodicals as well as a readme file.

    Bounding Box Dataset
    A zip file called bounding_box.zip containing:
    • post_process: a folder of the processed periodical bounding box data
    • post_process_fill: a folder of the processed periodical bounding box data WITH column filling
    • bbox_readme.txt: a readme file and data description for the bounding boxes

    Test Sets
    • cropped_images.zip: 378 images cropped from the NCSE test set pages, all 2-bit png files
    • ground_truth: 358 text files corresponding to the text from the cropped_images folder

    Classification Training Data
    The files below are used for training the classification models. They contain 12,000 observations, 2,000 from each periodical. The labels were classified using mistral-large-2411. This data is used to train the ModernBERT classifier described in the paper. The topics are taken from the International Press Telecommunications Council (IPTC) subject codes.
    • silver_IPTC_class.parquet: IPTC topic classification silver set
    • silver_text_type.parquet: text-type classification silver set

    Classified Data
    The zip file classification_data.zip contains all rows classified using the ModernBERT classifier described in the paper:
    • IPTC_type_classified.zip: contains one parquet file per periodical
    • text_type_classified.zip: contains one parquet file per periodical
    • classification_readme.md: description of the data

    Classification Mappings
    Data for mapping the classification codes to human-readable names:
    • class_mappings.zip: contains a json for each classification type (IPTC_class_mapping.json, text_type_class_mapping.json)

    Original Images
    The original page images can be found at the King's College London repositories:
    • Monthly Repository
    • Northern Star
    • Leader
    • English Woman's Journal
    • Tomahawk
    • Publishers' Circular
    Or via the project central archive.

    Citation
    If you use this dataset, please cite it; no citation data is currently available.

    Related Code
    All original code related to this project, including the creation of the datasets and their analysis, can be found at https://github.com/JonnoB/ereading_the_unreadable

    Contact
    For questions about the dataset, please create an issue in this repository.

    Usage Rights
    In keeping with the original NCSE dataset, all data is made available under a Creative Commons Attribution 4.0 International License (CC BY).
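    A minimal sketch of combining the classified parquet files with the human-readable class mappings (file names follow the listing above, but the per-periodical file name and the column names, such as "class_code", are guesses to be checked against classification_readme.md):

    import json
    import pandas as pd

    # One periodical's classified rows (the file name inside IPTC_type_classified.zip is a guess).
    df = pd.read_parquet("IPTC_type_classified/northern_star.parquet")

    # Map classification codes to human-readable names.
    with open("class_mappings/IPTC_class_mapping.json", encoding="utf-8") as f:
        mapping = json.load(f)

    df["class_label"] = df["class_code"].astype(str).map(mapping)  # column name is a guess
    print(df["class_label"].value_counts().head())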

  7. Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model

    • ckan.publishing.service.gov.uk
    • metadata.bgs.ac.uk
    • +1more
    Updated Aug 19, 2025
    Cite
    ckan.publishing.service.gov.uk (2025). Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model [Dataset]. https://ckan.publishing.service.gov.uk/dataset/labelled-data-for-fine-tuning-a-geological-named-entity-recognition-and-entity-relation-extract
    Explore at:
    Dataset updated
    Aug 19, 2025
    Dataset provided by
    CKAN (https://ckan.org/)
    Description

    This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated to enable the dataset to be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to automatically create the labels, thereby creating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format of the doccano open-source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in pdf image form; these documents had to undergo OCR, which resulted in lower quality text and lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT), Knowledge Asset Grant Fund Project 10083604.
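    A minimal sketch of reading the labels (doccano's JSONL(Relation) export is one JSON object per line; the key names assumed here, "text", "entities" and "relations", should be checked against the actual files):

    import json

    def read_jsonl(path):
        """Yield one annotated sentence per line of a doccano JSONL(Relation) export."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    # "geology_labels.jsonl" is a placeholder file name.
    for record in read_jsonl("geology_labels.jsonl"):
        entities = record.get("entities", [])    # spans labelled rock formation, age, rock type, ...
        relations = record.get("relations", [])  # e.g. overlies, observedIn between entity ids
        print(len(entities), len(relations), record.get("text", "")[:60])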

  8. MMDocBench

    • huggingface.co
    Updated Sep 15, 2003
    Cite
    TAT@NExT (2003). MMDocBench [Dataset]. https://huggingface.co/datasets/next-tat/MMDocBench
    Explore at:
    Dataset updated
    Sep 15, 2003
    Dataset authored and provided by
    TAT@NExT
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

    MMDocBench is an open-sourced benchmark with various OCR-free document understanding tasks for evaluating fine-grained visual perception and reasoning abilities. For more details, please refer to the project page: https://MMDocBench.github.io/.

      Dataset Structure
    

    MMDocBench consists of 15 main tasks and 48 sub-tasks, involving 2,400 document images, 4,338 QA pairs
 See the full description on the dataset page: https://huggingface.co/datasets/next-tat/MMDocBench.

  9. POPP Datasets: Datasets for handwriting recognition from French population census

    • zenodo.org
    zip
    Updated Jul 17, 2025
    + more versions
    Cite
    Thomas CONSTUM; Nicolas KEMPF; Thierry PAQUET; Pierrick TRANOUEZ; Clément CHATELAIN; Sandra BREE; François MERVEILLE; Thomas CONSTUM; Nicolas KEMPF; Thierry PAQUET; Pierrick TRANOUEZ; Clément CHATELAIN; Sandra BREE; François MERVEILLE (2025). POPP Datasets : Datasets for handwriting recognition from French population census [Dataset]. http://doi.org/10.5281/zenodo.6581158
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas CONSTUM; Nicolas KEMPF; Thierry PAQUET; Pierrick TRANOUEZ; Clément CHATELAIN; Sandra BREE; François MERVEILLE; Thomas CONSTUM; Nicolas KEMPF; Thierry PAQUET; Pierrick TRANOUEZ; Clément CHATELAIN; Sandra BREE; François MERVEILLE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    POPP datasets

    This repository contains 3 datasets created within the POPP project (Project for the OCRisation of the Paris Population Census) for the task of handwritten text recognition. These datasets were published in "Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census" at DAS 2022.

    The 3 datasets are called “Generic dataset”, “Belleville”, and “ChaussĂ©e d’Antin” and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, thus each page in these datasets corresponds to 30 lines.

    The structure of each dataset is the following:

    • double-pages : images of the double pages
    • pages:
      • images: images of the pages
      • xml: METS and ALTO files of each page containing the coordinates of the bounding boxes of each line
    • lines: contains the labels in the file labels.json and the line images split into the folders train, valid and test.

    The double pages were scanned at a resolution of 200 dpi and saved as PNG images with 256 gray levels. The line and page images are shared in the TIFF format, also with 256 gray levels.

    Since the lines are extracted from table rows, we defined 4 special characters to describe the structure of the text:

    • € : indicates an empty cell
    • / : indicates the separation into columns
    • ? : indicates that the content of the cell following this symbol is written above the regular baseline
    • ! : indicates that the content of the cell following this symbol is written below the regular baseline

    We provide a script, format_dataset.py, to define which special characters you want to use in the ground truth.
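    For illustration, a line transcription using these markers can be split back into cells roughly as follows (a minimal sketch; format_dataset.py is the authoritative tool, and the handling of the '?' and '!' baseline markers is simplified here):

    def parse_popp_line(transcription):
        """Split a POPP line transcription into cells: '/' separates columns,
        '€' marks an empty cell, and a leading '?' or '!' flags text written
        above or below the regular baseline."""
        cells = []
        for raw in transcription.split("/"):
            raw = raw.strip()
            position = "normal"
            if raw.startswith("?"):
                position, raw = "above", raw[1:].strip()
            elif raw.startswith("!"):
                position, raw = "below", raw[1:].strip()
            cells.append({"text": "" if raw == "€" else raw, "position": position})
        return cells

    # Invented example line, for illustration only.
    print(parse_popp_line("Dupont/Jean/1897/€/?employé"))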

    The splits for the Generic dataset and Belleville were made at the double-page level so that each writer only appears in one subset among train, validation and test. The following table summarizes the splits and the number of writers for each dataset:

    Dataset            train (# of lines)   validation (# of lines)   test (# of lines)   # of writers
    Generic            3840 (128 pages)     480 (16 pages)            480 (16 pages)      80
    Belleville         1140 (38 pages)      150 (5 pages)             180 (6 pages)       1
    ChaussĂ©e d’Antin   625                  78                        77                  10

    Generic dataset (or POPP dataset)

    • This dataset is made of 4,800 annotated lines extracted from 80 double pages of the 1926 Paris census.
    • There is one double page for each of the 80 districts of Paris.
    • There is one writer per double page, so the dataset contains 80 different writers.

    Belleville dataset

    This dataset is a mono-writer dataset made of 1470 lines (49 pages) from the Belleville district census of 1926.

    ChaussĂ©e d’Antin dataset

    This dataset is a multi-writer dataset made of 780 lines (26 pages) from the ChaussĂ©e d’Antin district census of 1926 and written by 10 different writers.

    Error reporting

    It is possible that errors persist in the ground truth, so any suggestions for correction are welcome. To do so, please make a merge request on the Github repository and include the correction in both the labels.json file and in the XML file concerned.

    Citation Request

    If you publish material based on this database, we request that you include a reference to the paper: T. Constum, N. Kempf, T. Paquet, P. Tranouez, C. Chatelain, S. Brée, and F. Merveille, "Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census", Document Analysis Systems (DAS), pp. 143-157, La Rochelle, 2022.

  10. synthdog-ko

    • huggingface.co
    Updated Dec 13, 2024
    + more versions
    Cite
    NAVER CLOVA INFORMATION EXTRACTION (2024). synthdog-ko [Dataset]. https://huggingface.co/datasets/naver-clova-ix/synthdog-ko
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Naver Corporation (http://www.navercorp.com/)
    Authors
    NAVER CLOVA INFORMATION EXTRACTION
    Description

    Donut đŸ© : OCR-Free Document Understanding Transformer (ECCV 2022) -- SynthDoG datasets

    For more information, please visit https://github.com/clovaai/donut

    The links to the SynthDoG-generated datasets are here:

    • synthdog-en: English, 0.5M
    • synthdog-zh: Chinese, 0.5M
    • synthdog-ja: Japanese, 0.5M
    • synthdog-ko: Korean, 0.5M

    To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.

      How to Cite
    

    If you find this work useful
 See the full description on the dataset page: https://huggingface.co/datasets/naver-clova-ix/synthdog-ko.

  11. Text from pdfs found on data.gouv.fr

    • gimi9.com
    + more versions
    Cite
    Text from pdfs found on data.gouv.fr [Dataset]. https://gimi9.com/dataset/eu_5ec45f516a58eec727e79af7/
    Explore at:
    Area covered
    France
    Description

    Text extracted from pdfs found on data.gouv.fr

    This dataset contains text extracted from 6,602 files that have the 'pdf' extension in the resource catalog of data.gouv.fr. The dataset contains only the pdfs of 20 MB or less which are still available at the URL indicated. The extraction was done with PDFBox via its Python wrapper python-PDFBox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if, after converting to text with PDFBox, the file size is less than 20 bytes, the file is considered to be an image. In this case, OCRisation is carried out with Tesseract via its Python wrapper pyocr. The result is 'txt' files from the 'pdfs', sorted by organisation (the organisation that published the resource). There are 175 organisations in this dataset, so 175 folders. The name of each file corresponds to the string '{id-du-dataset}--{id-de-la-resource}.txt'.

    Input

    Catalogue of data.gouv.fr resources.

    Output

    Text files of each 'pdf' resource found in the catalogue that was successfully converted and satisfied the above constraints. The tree is as follows:

    .
      ACTION_Nogent-sur-Marne
        53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
        53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
        53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
        53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
        53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
        ...
      Aeroport_La_Rochelle-Ile_de_Re
      Agency_de_services_and_payment_ASP
      Agency_du_Numerique
      ...

    Distribution of texts (as of 20 May 2020)

    The top 10 organisations with the largest number of documents are:

    [('Les_Lilas', 1294), ('Ville_de_Pirae', 1099), ('Region_Hauts-de-France', 592), ('Ressourcerie_datalocale', 297), ('NA', 268), ('CORBION', 244), ('Education_Nationale', 189), ('Incubator_of_Services_Numeriques', 157), ('Ministere_des_Solidarites_and_de_la_Sante', 148), ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

    A 2D preview of the texts (HashFeatures + TruncatedSVD + t-SNE) is shown as a plot on the original dataset page.

    Code

    The Python scripts used to do this extraction are available here.

    Remarks

    Due to the quality of the original pdfs (low-resolution scans, non-aligned pdfs, ...) and the performance of the pdf-to-txt transformation methods, the results can be very noisy.
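    A minimal sketch of the image-detection heuristic and OCR fallback described above (using pdf2image and pytesseract as stand-ins for the python-PDFBox and pyocr tooling that the original pipeline used; the 20-byte threshold comes from the description):

    import os
    import pytesseract                         # OCR fallback (stand-in for pyocr + Tesseract)
    from pdf2image import convert_from_path    # renders pdf pages to images for OCR

    def ensure_text(pdf_path, txt_path, threshold_bytes=20):
        """If the extracted .txt file is (nearly) empty, treat the pdf as a scanned
        image and OCR it page by page with Tesseract's French model."""
        if os.path.exists(txt_path) and os.path.getsize(txt_path) >= threshold_bytes:
            return  # text extraction already produced usable content
        pages = convert_from_path(pdf_path, dpi=300)
        text = "\n".join(pytesseract.image_to_string(page, lang="fra") for page in pages)
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(text)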

  12. Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0

    • b2find.eudat.eu
    Updated Aug 3, 2025
    + more versions
    Cite
    (2025). Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/0406e495-3d78-5cb2-9e5b-32b9dbba1e82
    Explore at:
    Dataset updated
    Aug 3, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th, 19th, and the beginning of the 20th century (1771-1914). The periodical issues were retrieved from Slovenia's national library's digital library service (https://dlib.si) in the form of OCR-ed PDF and TXT files. Before the documents were linguistically annotated (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts, cf. http://hdl.handle.net/11356/1907.

    The documents in the collection are enriched with the following metadata obtained from dLib:
    • Document ID (URN)
    • Periodical name
    • Document (periodical issue) title
    • Volume number (if available)
    • Issue number (if available)
    • Year of publication
    • Date of publication (of varying granularity, based on original metadata available)
    • Source (URL of the original digitised document available at dlib.si)
    • Image (see below)
    • Quality (see below)
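    A minimal sketch of the linguistic annotation step with CLASSLA-Stanza (the calls follow the library's documented usage; the input file name is a placeholder, and the cSMTiser OCR post-correction would be applied before this step):

    import classla

    classla.download("sl")  # one-time download of the Slovenian models
    nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,ner")

    with open("periodical_issue.txt", encoding="utf-8") as f:  # placeholder file name
        doc = nlp(f.read())

    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.lemma, word.upos)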

  13. Carniolan Provincial Assembly corpus Kranjska 1.0

    • b2find.eudat.eu
    Updated Feb 14, 2024
    + more versions
    Cite
    (2024). Carniolan Provincial Assembly corpus Kranjska 1.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1ca836f6-1157-5c30-b66e-68ae7b915119
    Explore at:
    Dataset updated
    Feb 14, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The corpus contains meeting proceedings of the Carniolan Provincial Assembly from 1861 to 1913 (Obravnave deĆŸelnega zbora kranjskega / Bericht ĂŒber die Verhandlungen des krainischen Landtages). The corpus comprises 694 sessions (15,353 pages, approximately 10 million words). The source data (scanned and OCR-processed pdf documents) originally come from The Digital Library of Slovenia dLib.si (http://www.dlib.si) and the History of Slovenia - SIstory (https://www.sistory.si) portals. The documents are bilingual, in Slovenian and German, depending on the speaker. German was first typeset in the Gothic script and later in Latin script. The documents were automatically processed and the following data extracted: titles, agenda, attendance, start and end of the session, speakers, and comments. Language was detected at the sentence level; roughly 58% of the sentences are in Slovenian and 42% in German. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using Trankit (https://github.com/nlp-uoregon/trankit) for Slovenian and German, while Lingua (https://github.com/pemistahl/lingua-py) was used for language detection. The documents are in the Parla-CLARIN (https://github.com/clarin-eric/parla-clarin) compliant TEI XML format, with each session in one file.
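    A minimal sketch of reading the utterances from one session file (Parla-CLARIN is TEI-based, so speeches are typically encoded as <u> elements carrying a speaker reference; the attribute usage and the file name below are assumptions to be checked against the corpus):

    import xml.etree.ElementTree as ET

    TEI = "{http://www.tei-c.org/ns/1.0}"

    def utterances(session_file):
        """Yield (speaker_reference, text) pairs from one Parla-CLARIN TEI session."""
        root = ET.parse(session_file).getroot()
        for u in root.iter(TEI + "u"):
            who = u.get("who", "")  # speaker reference (assumed attribute)
            yield who, " ".join(" ".join(u.itertext()).split())

    for who, text in utterances("Kranjska-session-1895-01.xml"):  # placeholder file name
        print(who, text[:80])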

  14. blip3-ocr-200m

    • huggingface.co
    Updated Sep 5, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Salesforce (2024). blip3-ocr-200m [Dataset]. https://huggingface.co/datasets/Salesforce/blip3-ocr-200m
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Salesforce Inc (http://salesforce.com/)
    Authors
    Salesforce
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    BLIP3-OCR-200M Dataset

      Overview
    

    The BLIP3-OCR-200M dataset is designed to address the limitations of current Vision-Language Models (VLMs) in processing and interpreting text-rich images, such as documents and charts. Traditional image-text datasets often struggle to capture nuanced textual information, which is crucial for tasks requiring complex text comprehension and reasoning.

      Key Features
    

    OCR Integration: The dataset incorporates Optical Character
 See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/blip3-ocr-200m.

  15. Patrologia Graeca (OCRized and analyzed texts)

    • zenodo.org
    zip
    Updated Jul 1, 2025
    Cite
    Jean-Marie Auwers; Chahan Vidal-GorÚne; Chahan Vidal-GorÚne; Bastien Kindt; Véronique Somers; Jean-Marie Auwers; Bastien Kindt; Véronique Somers (2025). Patrologia Graeca (OCRized and analyzed texts) [Dataset]. http://doi.org/10.5281/zenodo.15780625
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jean-Marie Auwers; Chahan Vidal-GorÚne; Chahan Vidal-GorÚne; Bastien Kindt; Véronique Somers; Jean-Marie Auwers; Bastien Kindt; Véronique Somers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CGPG project (Calfa GREgORI Patrologia Graeca), led by Jean-Marie Auwers (UCLouvain), aims to OCRize the remaining non-digital versions of the Patrologia Graeca volumes. The project relies on the expertise of GREgORI and Calfa.

    The project is sponsored by the ASBL Byzantion, the Fondation Sedes Sapientiae, the Institut Religions, Spiritualités, Cultures, Sociétés (RSCS, UCLouvain) and the Centre d'études orientales (CIOL, UCLouvain), and by a generous donor who wishes to remain anonymous. Other sponsors have recently expressed their willingness to support the project.

    Webpage of the project

    This repository contains the Sketch Engine XML files, with linguistic markup.

    Raw data are available on Github : https://github.com/calfa-co/Patrologia-Graeca

    For optimal use in Sketch Engine, configure the corpus (Manage Corpus / Configure / Expert settings) as below:

    DOCSTRUCTURE "doc"
    ENCODING "UTF-8"
    INFO ""
    LANGUAGE "Ancient Greek"
    NAME "CGPG_20250629"
    PATH "/corpora/ca/user_data/sso_1392/manatee/cgpg_20250629"
    VERTICAL "| ca_getvertical '/corpora/ca/user_data/sso_1392/registry/cgpg_20250629' 'docx'"
    ATTRIBUTE "word" {
    MAPTO "lemma"
    }
    ATTRIBUTE "intuitive_form" {
    }
    ATTRIBUTE "lemma" {
    }
    ATTRIBUTE "intuitive_lemma" {
    }
    ATTRIBUTE "pos" {
    }
    ATTRIBUTE "headword" {
    }
    STRUCTURE "w" {
    DEFAULTLOCALE "C"
    ENCODING "UTF-8"
    LANGUAGE ""
    NESTED ""
    ATTRIBUTE "id" {
    DYNLIB ""
    DYNTYPE "index"
    ENCODING "UTF-8"
    LOCALE "C"
    MULTISEP ","
    MULTIVALUE "n"
    TYPE "MD_MI"
    }
    }
    STRUCTURE "doc" {
    DEFAULTLOCALE "C"
    ENCODING "UTF-8"
    LANGUAGE ""
    NESTED ""
    ATTRIBUTE "id" {
    DYNLIB ""
    DYNTYPE "index"
    ENCODING "UTF-8"
    LOCALE "C"
    MULTISEP ","
    MULTIVALUE "n"
    TYPE "MD_MI"
    }
    }
    STRUCTURE "docx" {
    DEFAULTLOCALE "C"
    ENCODING "UTF-8"
    LANGUAGE ""
    NESTED ""
    ATTRIBUTE "id" {
    DYNLIB ""
    DYNTYPE "index"
    ENCODING "UTF-8"
    LABEL "File ID"
    LOCALE "C"
    MULTISEP ","
    MULTIVALUE "n"
    TYPE "MD_MI"
    UNIQUE "1"
    }
    ATTRIBUTE "filename" {
    DYNLIB ""
    DYNTYPE "index"
    ENCODING "UTF-8"
    LABEL "File name"
    LOCALE "C"
    MULTISEP ","
    MULTIVALUE "n"
    TYPE "MD_MI"
    }
    }

    Bibliography

    1. KINDT B., AUWERS J.-M., La Fondation Sedes Sapientiae soutient le projet de valorisation numérique de la Patrologie grecque, dans Bulletin de la Fondation Sedes Sapientiae, 45 (janvier 2024), p. 19-21 (https://cdn.uclouvain.be/groups/cms-editors-teco/angelique/fondation-sedes-sapientiae/UCL-TECO-Sedes Sapientiae-Bulletin 2024-WEB.pdf).
    2. KINDT B., VIDAL-GORÈNE C., DELLE DONNE S., Analyse automatique du grec ancien par rĂ©seau de neurones. Évaluation sur le corpus De Thessalonica Capta, dans BABELAO, 10-11 (2022), p. 525-550 (https://ojs.uclouvain.be/index.php/babelao/article/view/65073).
    3. KINDT B., VIDAL-GORÈNE C., From manuscript to tagged corpora. An automated process for Ancient Armenian or other under resourced languages of the Christian East, in Armeniaca. International Journal of Armenian Studies, 1 (2022), p. 73-96 (https://edizionicafoscari.unive.it/en/edizioni4/riviste/armeniaca/2022/1/from-manuscript-to-tagged-corpora/).
    4. VIDAL-GORÈNE C., CAFIERO F., KINDT B., Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac, 2025, published online on the HAL Science ouverte portal (https://hal.science/hal-05119485).
    5. VIDAL-GORÈNE C., La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées, Programming Historian en français, 5 (2023) (https://doi.org/10.46430/phfr0023).
    6. VIDAL-GORÈNE C., Reconhecimento automĂĄtico de manuscritos para o teste de idiomas nĂŁo latinos, O Programming Historian em portugĂȘs, 5 (2024) (https://doi.org/10.46430/phpt0046).
  16. A biodiversity dataset graph: Biodiversity Heritage Library (BHL)

    • zenodo.org
    application/gzip, bin
    Updated Jan 24, 2020
    Cite
    Jorrit H. Poelen; Jorrit H. Poelen (2020). A biodiversity dataset graph: Biodiversity Heritage Library (BHL) [Dataset]. http://doi.org/10.5281/zenodo.3251134
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jorrit H. Poelen; Jorrit H. Poelen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A biodiversity dataset graph: Biodiversity Heritage Library

    Biodiversity datasets, or descriptions of biodiversity datasets, are increasingly available through open digital data infrastructures such as the Biodiversity Heritage Library (BHL, https://biodiversitylibrary.org). "The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community." - https://biodiversitylibrary.org , June 2019.

    However, little is known about how these networks, and the data accessed through them, change over time. This dataset provides snapshots of all OCR item texts (i.e., individual items) available through BHL as tracked by Preston (https://github.com/bio-guoda/preston, https://doi.org/10.5281/zenodo.1410543) over the period May-June 2019.

    This snapshot contains about 120GB of uncompressed OCR texts across 227k OCR BHL items. Also, a snapshot of the BHL item catalog at https://www.biodiversitylibrary.org/data/item.txt is included.

    The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. The two index files and the two provenance files have also been included individually in this dataset publication. Index files provide a way to link provenance files in time, establishing a versioning mechanism. Provenance files describe how, when and where the BHL OCR text items were retrieved. For more information, please visit https://preston.guoda.bio or https://doi.org/10.5281/zenodo.1410543.

    To retrieve and verify the downloaded BHL biodiversity dataset graph, first concatenate all the downloaded preston-*.tar.gz files (e.g., cat preston-*.tar.gz > preston.tar.gz). Then, extract the archives into a "data" folder. After that, verify the index of the archive by reproducing the following result:

    $ java -jar preston.jar history
    <0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion>

    To check the integrity of the extracted archive, confirm that the command "preston verify" produces lines as shown below, with each line including "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.

    $ java -jar preston.jar verify
    hash://sha256/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca file:/home/preston/preston-bhl/data/e0/c1/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca OK CONTENT_PRESENT_VALID_HASH 49458087
    hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 file:/home/preston/preston-bhl/data/1a/57/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 OK CONTENT_PRESENT_VALID_HASH 25745
    hash://sha256/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c file:/home/preston/preston-bhl/data/85/ef/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c OK CONTENT_PRESENT_VALID_HASH 519892

    Note that a copy of the java program "preston", preston.jar, is included in this publication. The program runs on a Java 8+ virtual machine using "java -jar preston.jar", or "preston" for short.

    Files in this data publication:

    README - this file

    preston-[00-ff].tar.gz - preston archives containing BHL OCR item texts, their provenance and a provenance index.

    9e8c86243df39dd4fe82a3f814710eccf73aa9291d050415408e346fa2b09e70 - preston index file
    2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a - preston index file

    89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a - preston provenance file
    41b19aa9456fc709de1d09d7a59c87253bc1f86b68289024b7320cef78b3e3a4 - preston provenance file

    This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.

  17. BiblioPage Dataset

    • zenodo.org
    zip
    Updated Jun 18, 2025
    Cite
    Jan KohĂșt; Jan KohĂșt; Michal HradiĆĄ; Michal HradiĆĄ (2025). BiblioPage Dataset [Dataset]. http://doi.org/10.5281/zenodo.15683417
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan KohĂșt; Jan KohĂșt; Michal HradiĆĄ; Michal HradiĆĄ
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BiblioPage Dataset

    BiblioPage is a dataset of scanned title pages annotated with structured bibliographic metadata and bounding boxes. It supports research in document understanding, bibliographic metadata extraction, and OCR alignment.

    📄 Reference: BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction

    Structure

    The ZIP archive contains:

    images/
    ├── train/     # Development set images (.jpg)
    └── test/     # Test set images (.jpg)
    
    labels/
    ├── train/     # Metadata only (.json)
    └── test/
    
    labels.with_geometry/
    ├── train/     # Metadata + bounding boxes (.json)
    └── test/
    

    Files are named as:
    library_id.document_uuid.page_uuid.extension
    Example:
    mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be.json

    Metadata Format

    Each label contains up to 16 bibliographic attributes. The following attributes may contain multiple values: author, illustrator, translator, editor, publisher. All others are single-value only.

    labels/ example:

    {
     "task_id": "238776",
     "library_id": "mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be",
     "title": "TĚLOCVIK pro ĆĄkoly obecnĂ© a měƥƄanskĂ©.",
     "placeTerm": "PRAZE.",
     "dateIssued": "1895.",
     "publisher": ["„Nov. kalendáƙe učitelskĂ©ho.“"],
     "author": ["V. BEƠƀÁK."],
     "illustrator": ["K. SUCHÝ."],
     "editor": ["FR. PITRÁK", "A. HOLUB."]
    }
    

    labels.with_geometry/ example:

    {
     "task_id": "238776",
     "library_id": "mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be",
     "title": ["TĚLOCVIK pro ĆĄkoly obecnĂ© a měƥƄanskĂ©.", [74, 447, 1111, 322]],
     "placeTerm": ["PRAZE.", [550, 1982, 227, 50]],
     "dateIssued": ["1895.", [580, 2111, 89, 40]],
     "publisher": [["„Nov. kalendáƙe učitelskĂ©ho.“", [560, 2051, 491, 46]]],
     "author": [["V. BEƠƀÁK.", [445, 970, 375, 61]]],
     "illustrator": [["K. SUCHÝ.", [461, 1314, 331, 57]]],
     "editor": [
      ["FR. PITRÁK", [242, 1140, 371, 59]],
      ["A. HOLUB.", [689, 1149, 324, 49]]
     ]
    }
    

    Bounding boxes use pixel coordinates: [x_left, y_top, width, height].
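    For example (a minimal sketch assuming the archive has been unpacked as in the structure above), the boxes from a labels.with_geometry file can be drawn onto the corresponding page image:

    import json
    from PIL import Image, ImageDraw

    def draw_boxes(image_path, label_path, out_path):
        """Draw every bounding box of a labels.with_geometry JSON onto the page image.
        Boxes are [x_left, y_top, width, height] in pixels."""
        with open(label_path, encoding="utf-8") as f:
            label = json.load(f)
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        for key, value in label.items():
            if key in ("task_id", "library_id"):
                continue
            # Single-value attributes hold one [text, box] pair; multi-value
            # attributes (author, publisher, ...) hold a list of such pairs.
            pairs = [value] if isinstance(value[0], str) else value
            for _text, (x, y, w, h) in pairs:
                draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
        img.save(out_path)

    # File names follow the naming scheme described above.
    draw_boxes(
        "images/train/mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be.jpg",
        "labels.with_geometry/train/mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be.json",
        "preview.png",
    )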

    Dataset Summary

    • 2,118 scanned title pages from 14 Czech libraries

    • Time span: 1485–21st century

    • Development and test split, test set fully manually verified

    License

    Released for research and non-commercial use only.

    Citation

    @article{kohut2024bibliopage,
     title={BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction},
     author={KohĂșt, Jan and Dočekal, Martin and HradiĆĄ, Michal and VaĆĄko, Marek},
     journal={arXiv preprint arXiv:2503.19658},
     year={2024}
    }
    

    Contact

    📧 ikohut@fit.vutbr.cz
    🔗 https://github.com/DCGM/biblio-dataset

    Note on Source Access

    Title pages can also be accessed via the original digital library using:

    https://www.digitalniknihovna.cz/mzk/view/uuid:{doc_id}?page=uuid:{page_id}
    

    For example:
    https://www.digitalniknihovna.cz/mzk/view/uuid:e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f?page=uuid:59e59f06-c2ce-4c10-aa9d-33de3b8b41be

    ⚠ Note: Resolution may differ from dataset images. Always use the provided files for analysis. Use source links only for additional context or browsing.

  18. Deliberations of the bodies of the city of Nantes and Nantes Métropole

    • gimi9.com
    Updated Jan 11, 2024
    Cite
    (2024). Deliberations of the bodies of the city of Nantes and Nantes Métropole | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_https-data-nantesmetropole-fr-explore-dataset-244400404_deliberations-instances-metropole-nantes-/
    Explore at:
    Dataset updated
    Jan 11, 2024
    Area covered
    Nantes Métropole, Nantes
    Description

    Deliberations of the Municipal Council of the City of Nantes, the Metropolitan Council, the Metropolitan Bureau of Nantes MĂ©tropole, and the Communal Centre for Social Action (CCAS) of the City of Nantes.

    This dataset aggregates the information obtained from the deliberations of the various bodies of Nantes MĂ©tropole and the City. A description of each body, as well as all the agendas and reports, is available on the community's institutional website on the pages dedicated to the City Council, the Metropolitan Council, the Metropolitan Bureau and the CCAS.

    The data of the open deliberations in this dataset are extracted from the files transmitted by the community to the Prefecture for the control of legality through the FAST – Acts service. Deliberations are part of the common core of local data, i.e. a set of data that communities agree to publish as a matter of priority, following a shared way of organising information. As a result, the file is modelled to correspond to the standard schema defined under the umbrella of the Open Data France association.

    Specification of the textual content of the deliberations, included to facilitate search: currently, the deliberations of the community bodies are validated on paper and signed by hand. The final versions published on the community's website are scans of these documents. For scanned images, the content is only visually accessible and is not indexed by search engines. To facilitate search in this dataset, a free optical character recognition engine (Tesseract 4) is used, which is based on artificial intelligence (an LSTM-type neural network; see the Tesseract documentation). The recognised content has a very high level of reliability, but occasional errors may remain. For purposes other than search, always refer to the pdf documents, which alone are authentic.
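    A minimal sketch of the kind of OCR step described above (pdf2image and pytesseract stand in for whatever tooling the collectivity actually uses; "fra" selects Tesseract's French model, and the pdf name is a placeholder):

    import pytesseract
    from pdf2image import convert_from_path

    def ocr_deliberation(pdf_path):
        """OCR a scanned deliberation page by page with Tesseract's French model.
        The recognised text is good enough for keyword search, but the signed pdf
        remains the only authentic version."""
        pages = convert_from_path(pdf_path, dpi=300)
        return "\n".join(pytesseract.image_to_string(page, lang="fra") for page in pages)

    text = ocr_deliberation("deliberation_conseil_municipal.pdf")  # placeholder file name
    print("budget" in text.lower())  # crude full-text search over the OCR output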
