Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).
Source images
The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.
Artificial noise application
The dataset was created as follows:
- First a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
- Then, six ideal types of image noise --- "blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains" --- were applied to both the colour and the greyscale version of the images, creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
- Lastly, all available combinations of *two* noise filters were applied to the colour and greyscale images, for an additional 30 versions.
This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.
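For reference, the version counts above can be reproduced with a few lines of Python (a sanity check of the arithmetic, not the original R noise-generation code):

from itertools import combinations

base = ["colour", "greyscale"]
noise = ["blur", "weak ink", "salt and pepper", "watermark", "scribbles", "ink stains"]

no_noise = [(b,) for b in base]                                             # 2 versions
one_layer = [(b, n) for b in base for n in noise]                           # 6 x 2 = 12 versions
two_layers = [(b, *pair) for b in base for pair in combinations(noise, 2)]  # 15 x 2 = 30 versions

versions = no_noise + one_layer + two_layers
print(len(versions))        # 44 image versions
print(322 * len(versions))  # 14,168 English documents
print(100 * len(versions))  # 4,400 Arabic documents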
The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See this link for how to unzip .tar.lzma files.
References:
Barcha, Pedro. 2017. "Old Books Dataset." GitHub repository. https://github.com/PedroBarcha/old-books-dataset.
Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. "Yarmouk Arabic OCR Dataset." In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150-54. IEEE.
Hegghammer, Thomas. 2021. "OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment." SocArXiv. https://osf.io/preprints/socarxiv/6zfvs
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are Tesseract-generated transcriptions (no images) of (most of) the IIT-CDIP dataset. To download the images of the IIT-CDIP dataset, go to https://data.nist.gov/od/id/mds2-2531
The directory structure of this dataset is the same as that of the IIT-CDIP dataset (although everything is in one tar, with "a.a", "a.b", ... directories), so it can be combined with the image IIT-CDIP dataset using rsync or a similar tool. This dataset contains an "X.layout.json" for each "X.png" in the IIT-CDIP dataset (it does not include sections 'a', 'w', 'x', 'y', and 'z').
The JSONs contain block/paragraph, line and word bounding boxes, with transcriptions for the words following the Tesseract format. The line and word annotations are taken directly from Tesseract; the block and paragraph output of Tesseract was discarded. The images were then run through both the PubLayNet and PrimaNet models available in LayoutParser (https://layout-parser.github.io/). The combined output of these models became the block/paragraph annotations (we kept the Tesseract output format, but each block has one paragraph of exactly the same shape).
Important: There is also a "rotation" value in the JSON (0, 90, 180, or 270), indicating that the JSON may correspond to a version of the IIT-CDIP image rotated by the given amount (we attempted to rotate documents to an upright position to get better OCR results).
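A minimal sketch of how one might pair a page image with its layout JSON and account for the rotation; the "rotation" field is described above, while the file paths and the rotation sign convention are assumptions to verify against the data:

import json
from pathlib import Path
from PIL import Image

def load_page(image_path, json_path):
    """Open an IIT-CDIP page image together with its X.layout.json."""
    with open(json_path) as f:
        layout = json.load(f)
    img = Image.open(image_path)
    # The JSON may describe a rotated copy of the page (0, 90, 180 or 270 degrees),
    # so rotate the image to match before using the bounding boxes.
    rotation = layout.get("rotation", 0)
    if rotation:
        img = img.rotate(-rotation, expand=True)  # direction of rotation is an assumption
    return img, layout

# hypothetical paths following the "a.a", "a.b", ... directory layout
img, layout = load_page(Path("a.a/a/a/doc.png"), Path("a.a/a/a/doc.layout.json"))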
These are the annotations used to pre-train Dessurt (https://arxiv.org/abs/2203.16618).
These annotations will be worse than those that would be obtained using a commercial OCR system (like those used to pre-train LayoutLMv2/v3).
The code used to produce these annotations is available here: https://github.com/herobd/ocr
This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT-4o. These articles are available both as a CSV with the prompt parameters as columns and as individual text files.
The files in the repository are as follows:
- ncse_hf_dataset: a Hugging Face dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with the original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.
- synth_gt.zip: a zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed token length (200, 100, 50, 25 or 10), for a total of 2 million tokens.
- synthetic_articles.zip: a zip file containing the CSV of all the synthetic articles and the prompts used to generate them.
- synthetic_articles_text.zip: a zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article CSV.
The data in this repo is used by the code repositories associated with the project:
- https://github.com/JonnoB/scrambledtext_analysis
- https://github.com/JonnoB/training_lms_with_synthetic_data
Source of data: https://github.com/FudanVI/benchmarking-chinese-text-recognition
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
GitHub: https://github.com/ds4v/NomNaOCR
Here, we introduce the NomNaOCR dataset for the old Vietnamese Hán-Nôm script, based on 3 tremendous and valuable historical works of Vietnam:
- Lục Vân Tiên by Nguyễn Đình Chiểu.
- Tale of Kiều or Truyện Kiều (versions 1866, 1871, and 1872) by Nguyễn Du.
- A full set of 5 parts of History of Greater Vietnam or Đại Việt Sử Ký Toàn Thư (ĐVSKTT), composed by many historians from the Trần to the Hậu Lê dynasty of Vietnam.
The dataset comprises 2,953 handwritten Pages (2,956 minus 3 ignored Pages) collected from the Vietnamese Nôm Preservation Foundation, which were analyzed and semi-annotated with bounding boxes to generate an additional 38,318 Patches (38,319 minus 1 ignored Patch) containing text, along with the Hán-Nôm strings in digital form. This currently makes NomNaOCR the biggest dataset for the Hán-Nôm script in Vietnam, serving 2 main problems in Optical Character Recognition on the Hán-Nôm script:
- Text Detection: detect the image regions that contain text. The input is an image (or a Page), and the output is a bounding box surrounding the text area found.
- Text Recognition: after detecting boxes or image regions containing text, each of these regions is cropped from the original image, forming small parts called Patches. The input is now a Patch, and the output is the text in that Patch.
A difference here is that our implementations were all done at the sequence level, which not only saves the cost of annotation but also helps us retain the context in the sequence, instead of working on each individual character as in most previous works.
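For illustration only, the Page-to-Patch step amounts to cropping each detected region before recognition; the file name and box values below are made up, and the box format (left, top, right, bottom) is an assumption, not the NomNaOCR annotation format:

from PIL import Image

page = Image.open("page.jpg")                        # a Page image (hypothetical file)
boxes = [(120, 40, 300, 900), (320, 40, 500, 900)]   # detector output (hypothetical boxes)
patches = [page.crop(box) for box in boxes]          # each crop becomes a Patch
for i, patch in enumerate(patches):
    patch.save(f"patch_{i}.png")                     # Patches are then fed to the recognizer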
OCR pipeline overview: https://github.com/ds4v/NomNaOCR/raw/main/Assets/ocr_pipeline1.jpg
**Note**: There are characters that Kaggle cannot display, so use the NomNaTong font to read the Hán-Nôm content properly.
NCSE v2.0 Dataset Repository
This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models".
Dataset Overview
The NCSE v2.0 is a digitized collection of six 19th-century English periodicals containing:
- 82,690 pages
- 1.4 million entries
- 321 million words
- 1.9 billion characters
The dataset includes:
- 1.1 million text entries
- 198,000 titles
- 17,000 figure descriptions
- 16,000 tables
Repository Contents
NCSE v2.0 Dataset
- NCSE_v2.zip: a folder containing a parquet file for each of the periodicals as well as a readme file.
Bounding Box Dataset
A zip file called bounding_box.zip containing:
- post_process: a folder of the processed periodical bounding box data
- post_process_fill: a folder of the processed periodical bounding box data WITH column filling
- bbox_readme.txt: a readme file and data description for the bounding boxes
Test Sets
- cropped_images.zip: 378 images cropped from the NCSE test set pages, all 2-bit png files
- ground_truth: 358 text files corresponding to the text from the cropped_images folder
Classification Training Data
The files below are used for training the classification models. They contain 12,000 observations, 2,000 from each periodical. The labels were classified using mistral-large-2411. This data is used to train the ModernBERT classifier described in the paper. The topics are taken from the International Press Telecommunications Council (IPTC) subject codes.
- silver_IPTC_class.parquet: IPTC topic classification silver set
- silver_text_type.parquet: text-type classification silver set
Classified Data
The zip file classification_data.zip contains all rows classified using the ModernBERT classifier described in the paper.
- IPTC_type_classified.zip: contains one parquet file per periodical
- text_type_classified.zip: contains one parquet file per periodical
- classification_readme.md: description of the data
Classification Mappings
Data for mapping the classification codes to human-readable names.
- class_mappings.zip: contains a json for each classification type (IPTC_class_mapping.json and text_type_class_mapping.json)
Original Images
The original page images can be found at the King's College London repositories for the Monthly Repository, Northern Star, Leader, English Woman's Journal, Tomahawk and Publishers' Circular, or via the project central archive.
Citation
If you use this dataset, please cite it; no citation data is currently available.
Related Code
All original code related to this project, including the creation of the datasets and their analysis, can be found at https://github.com/JonnoB/ereading_the_unreadable
Contact
For questions about the dataset, please create an issue in this repository.
Usage Rights
In keeping with the original NCSE dataset, all data is made available under a Creative Commons Attribution 4.0 International License (CC BY).
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488).
The data can be used to fine-tune a pre-trained large language model using transfer learning, creating a model that can be used in inference mode to generate the labels automatically, thereby producing structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, the export format of the doccano open-source text annotation software (https://doccano.github.io/doccano/) used to create the labels.
The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These documents had to undergo OCR, which resulted in lower-quality text and lower-quality training data. The majority of the labelled data is from the higher-quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold-standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
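A hedged sketch of reading a JSONL(Relation) export; the field names (entities, relations, start_offset/end_offset, from_id/to_id, type) follow doccano's documented relation format and should be checked against the actual files, and the file name in the example call is hypothetical:

import json

def load_annotations(path):
    """Yield (text, entities, relations) triples from a doccano JSONL(Relation) file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            entities = {e["id"]: e for e in doc.get("entities", [])}
            relations = [
                (entities[r["from_id"]]["label"], r["type"], entities[r["to_id"]]["label"])
                for r in doc.get("relations", [])
            ]
            yield doc["text"], list(entities.values()), relations

# e.g. spans labelled as rock formations, geological ages, rock types, physical
# properties or locations, linked by relations such as overlies or observedIn
for text, entities, relations in load_annotations("bgs_memoirs.jsonl"):
    print(len(entities), len(relations))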
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
MMDocBench is an open-sourced benchmark with various OCR-free document understanding tasks for evaluating fine-grained visual perception and reasoning abilities. For more details, please refer to the project page: https://MMDocBench.github.io/.
Dataset Structure
MMDocBench consists of 15 main tasks and 48 sub-tasks, involving 2,400 document images, 4,338 QA pairs… See the full description on the dataset page: https://huggingface.co/datasets/next-tat/MMDocBench.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
POPP datasets
This repository contains 3 datasets created within the POPP project (Project for the OCRisation of the Paris Population Census) for the task of handwriting text recognition. These datasets were published in "Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census" at DAS 2022.
The 3 datasets are called "Generic dataset", "Belleville", and "Chaussée d'Antin", and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, so each page in these datasets corresponds to 30 lines.
The structure of each dataset is the following:
- labels.json
- the line images, split into the folders train, valid and test
The double pages were scanned at a resolution of 200 dpi and saved as PNG images with 256 gray levels. The line and page images are shared in the TIFF format, also with 256 gray levels. Since the lines are extracted from table rows, we defined 4 special characters to describe the structure of the text. We provide a script format_dataset.py to define which special character you want to use in the ground truth.
The splits for the Generic dataset and Belleville have been made at the double-page level, so that each writer appears in only one subset among train, validation and test. The following table summarizes the splits and the number of writers for each dataset:
| Dataset | train (# of lines) | validation (# of lines) | test (# of lines) | # of writers |
|---|---|---|---|---|
| Generic | 3840 (128 pages) | 480 (16 pages) | 480 (16 pages) | 80 |
| Belleville | 1140 (38 pages) | 150 (5 pages) | 180 (6 pages) | 1 |
| Chaussée d'Antin | 625 | 78 | 77 | 10 |
Generic dataset (or POPP dataset)
Belleville dataset
This dataset is a mono-writer dataset made of 1470 lines (49 pages) from the Belleville district census of 1926.
Chaussée d'Antin dataset
This dataset is a multi-writer dataset made of 780 lines (26 pages) from the Chaussée d'Antin district census of 1926, written by 10 different writers.
Error reporting
Errors may persist in the ground truth, so any suggestions for correction are welcome. To do so, please make a merge request on the GitHub repository and include the correction both in the labels.json file and in the XML file concerned.
Citation Request
If you publish material based on this database, we request that you include a reference to the paper: T. Constum, N. Kempf, T. Paquet, P. Tranouez, C. Chatelain, S. Brée, and F. Merveille, "Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census", Document Analysis Systems (DAS), pp. 143-157, La Rochelle, 2022.
Donut 🍩: OCR-Free Document Understanding Transformer (ECCV 2022) -- SynthDoG datasets
For more information, please visit https://github.com/clovaai/donut
The links to the SynthDoG-generated datasets are here:
- synthdog-en: English, 0.5M
- synthdog-zh: Chinese, 0.5M
- synthdog-ja: Japanese, 0.5M
- synthdog-ko: Korean, 0.5M
To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.
How to Cite
If you find this work useful… See the full description on the dataset page: https://huggingface.co/datasets/naver-clova-ix/synthdog-ko.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th and 19th centuries and the beginning of the 20th century (1771-1914). The periodical issues were retrieved from the digital library service of Slovenia's national library (https://dlib.si) in the form of OCR-ed PDF and TXT files. Before linguistically annotating the documents (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts, cf. http://hdl.handle.net/11356/1907. The documents in the collection are enriched with the following metadata obtained from dLib:
- Document ID (URN)
- Periodical name
- Document (periodical issue) title
- Volume number (if available)
- Issue number (if available)
- Year of publication
- Date of publication (of varying granularity, based on original metadata available)
- Source (URL of the original digitised document available at dlib.si)
- Image (see below)
- Quality (see below)
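As a rough illustration of the annotation layer (not the project's exact processing scripts), CLASSLA-Stanza can be run on Slovenian text roughly as follows; the processor names follow the CLASSLA documentation, and the sample sentence is made up:

import classla

classla.download("sl")  # fetch the standard Slovenian models (one-time)
nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,ner")

doc = nlp("Ljubljanske novice so izhajale v 18. stoletju.")
for sentence in doc.sentences:
    for word in sentence.words:
        # each token carries the lemma and part-of-speech tag added to the corpus
        print(word.text, word.lemma, word.upos)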
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The corpus contains meeting proceedings of the Carniolan Provincial Assembly from 1861 to 1913 (Obravnave deželnega zbora kranjskega / Bericht über die Verhandlungen des krainischen Landtages). The corpus comprises 694 sessions (15,353 pages, approximately 10 million words). The source data (scanned and OCR-processed PDF documents) originally come from the Digital Library of Slovenia dLib.si (http://www.dlib.si) and History of Slovenia - SIstory (https://www.sistory.si) portals. The documents are bilingual, in Slovenian and German, depending on the speaker. German was first typeset in the Gothic script and later in the Latin script. The documents were automatically processed and the following data extracted: titles, agenda, attendance, start and end of the session, speakers, and comments. Language was detected at the sentence level; roughly 58% of sentences are in Slovenian and 42% in German. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using Trankit (https://github.com/nlp-uoregon/trankit) for Slovenian and German, while Lingua (https://github.com/pemistahl/lingua-py) was used for language detection. The documents are in the Parla-CLARIN (https://github.com/clarin-eric/parla-clarin) compliant TEI XML format, with each session in one file.
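A minimal sketch for pulling speeches out of one session file, assuming the standard Parla-CLARIN/TEI encoding of utterances as <u who="..."> elements (element names and the file name should be checked against the corpus itself):

import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def speeches(session_file):
    """Yield (speaker, text) pairs from a Parla-CLARIN TEI session file."""
    root = ET.parse(session_file).getroot()
    for u in root.iter(TEI + "u"):
        speaker = u.get("who", "unknown")
        text = " ".join("".join(u.itertext()).split())  # collapse annotated tokens to plain text
        yield speaker, text

for speaker, text in speeches("session-1861-01.xml"):  # hypothetical file name
    print(speaker, text[:80])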
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BLIP3-OCR-200M Dataset
Overview
The BLIP3-OCR-200M dataset is designed to address the limitations of current Vision-Language Models (VLMs) in processing and interpreting text-rich images, such as documents and charts. Traditional image-text datasets often struggle to capture nuanced textual information, which is crucial for tasks requiring complex text comprehension and reasoning.
Key Features
OCR Integration: The dataset incorporates Optical Character… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/blip3-ocr-200m.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CGPG project (Calfa GREgORI Patrologia Graeca), led by Jean-Marie Auwers (UCLouvain), aims to OCRize the remaining non-digital versions of the Patrologia Graeca volumes. The project relies on the expertise of GREgORI and Calfa.
The project is sponsored by the ASBL *Byzantion*, the Fondation *Sedes Sapientiae*, the Institut *Religions, Spiritualités, Cultures, Sociétés* (RSCS, UCLouvain) and the Centre d'études orientales (CIOL, UCLouvain) and by a generous donor who wishes to remain anonymous. Other sponsors have recently expressed their willingness to support the project.
This repository contains the Sketch Engine XML files, with linguistic markup.
Raw data are available on GitHub: https://github.com/calfa-co/Patrologia-Graeca
For optimal use in Sketch Engine, configure the corpus (Manage Corpus / Configure / Expert settings) as below:
DOCSTRUCTURE "doc"
ENCODING "UTF-8"
INFO ""
LANGUAGE "Ancient Greek"
NAME "CGPG_20250629"
PATH "/corpora/ca/user_data/sso_1392/manatee/cgpg_20250629"
VERTICAL "| ca_getvertical '/corpora/ca/user_data/sso_1392/registry/cgpg_20250629' 'docx'"
ATTRIBUTE "word" {
MAPTO "lemma"
}
ATTRIBUTE "intuitive_form" {
}
ATTRIBUTE "lemma" {
}
ATTRIBUTE "intuitive_lemma" {
}
ATTRIBUTE "pos" {
}
ATTRIBUTE "headword" {
}
STRUCTURE "w" {
DEFAULTLOCALE "C"
ENCODING "UTF-8"
LANGUAGE ""
NESTED ""
ATTRIBUTE "id" {
DYNLIB ""
DYNTYPE "index"
ENCODING "UTF-8"
LOCALE "C"
MULTISEP ","
MULTIVALUE "n"
TYPE "MD_MI"
}
}
STRUCTURE "doc" {
DEFAULTLOCALE "C"
ENCODING "UTF-8"
LANGUAGE ""
NESTED ""
ATTRIBUTE "id" {
DYNLIB ""
DYNTYPE "index"
ENCODING "UTF-8"
LOCALE "C"
MULTISEP ","
MULTIVALUE "n"
TYPE "MD_MI"
}
}
STRUCTURE "docx" {
DEFAULTLOCALE "C"
ENCODING "UTF-8"
LANGUAGE ""
NESTED ""
ATTRIBUTE "id" {
DYNLIB ""
DYNTYPE "index"
ENCODING "UTF-8"
LABEL "File ID"
LOCALE "C"
MULTISEP ","
MULTIVALUE "n"
TYPE "MD_MI"
UNIQUE "1"
}
ATTRIBUTE "filename" {
DYNLIB ""
DYNTYPE "index"
ENCODING "UTF-8"
LABEL "File name"
LOCALE "C"
MULTISEP ","
MULTIVALUE "n"
TYPE "MD_MI"
}
}
Bibliography
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A biodiversity dataset graph: Biodiversity Heritage Library
Biodiversity datasets, or descriptions of biodiversity datasets, are increasingly available through open digital data infrastructures such as the Biodiversity Heritage Library (BHL, https://biodiversitylibrary.org). "The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community." - https://biodiversitylibrary.org , June 2019.
However, little is known about how these networks, and the data accessed through them, change over time. This dataset provides snapshots of all OCR item texts (e.g., individual items) available through BHL, as tracked by Preston (https://github.com/bio-guoda/preston, https://doi.org/10.5281/zenodo.1410543) over the period May to June 2019.
This snapshot contains about 120GB of uncompressed OCR texts across 227k OCR BHL items. Also, a snapshot of the BHL item catalog at https://www.biodiversitylibrary.org/data/item.txt is included.
The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. Only two index files and two provenance files are included, and these have also been included individually in this dataset publication. Index files provide a way to link provenance files in time, establishing a versioning mechanism. Provenance files describe how, when and where the BHL OCR text items were retrieved. For more information, please visit https://preston.guoda.bio or https://doi.org/10.5281/zenodo.1410543.
To retrieve and verify the downloaded BHL biodiversity dataset graph, first concatenate all the downloaded preston-*.tar.gz files (e.g., cat preston-*.tar.gz > preston.tar.gz). Then, extract the archives into a "data" folder. After that, verify the index of the archive by reproducing the following result:
$ java -jar preston.jar history
<0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion>
To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" looks like the lines shown below, with each line including "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.
$ java -jar preston.jar verify
hash://sha256/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca file:/home/preston/preston-bhl/data/e0/c1/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca OK CONTENT_PRESENT_VALID_HASH 49458087
hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 file:/home/preston/preston-bhl/data/1a/57/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 OK CONTENT_PRESENT_VALID_HASH 25745
hash://sha256/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c file:/home/preston/preston-bhl/data/85/ef/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c OK CONTENT_PRESENT_VALID_HASH 519892
Note that a copy of the java program "preston", preston.jar, is included in this publication. The program runs on java 8+ virtual machine using "java -jar preston.jar", or in short "preston".
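As an additional cross-check outside the preston tool, the content-addressed layout shown in the sample output above (data/xx/yy/<sha256>) can be re-verified with a short script; the directory layout is taken from that output, everything else is a sketch:

import hashlib
from pathlib import Path

def verify(data_dir="data"):
    """Recompute sha256 for each extracted file and compare it with its file name."""
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
                digest.update(chunk)
        hexdigest = digest.hexdigest()
        status = "OK" if hexdigest == path.name else "HASH_MISMATCH"
        print(f"hash://sha256/{hexdigest} {path} {status} {path.stat().st_size}")

verify()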
Files in this data publication:
README - this file
preston-[00-ff].tar.gz - preston archives containing BHL OCR item texts, their provenance and a provenance index.
9e8c86243df39dd4fe82a3f814710eccf73aa9291d050415408e346fa2b09e70 - preston index file
2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a - preston index file
89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a - preston provenance file
41b19aa9456fc709de1d09d7a59c87253bc1f86b68289024b7320cef78b3e3a4 - preston provenance file
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BiblioPage is a dataset of scanned title pages annotated with structured bibliographic metadata and bounding boxes. It supports research in document understanding, bibliographic metadata extraction, and OCR alignment.
Reference: BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction
The ZIP archive contains:
images/
├── train/                # Development set images (.jpg)
└── test/                 # Test set images (.jpg)
labels/
├── train/                # Metadata only (.json)
└── test/
labels.with_geometry/
├── train/                # Metadata + bounding boxes (.json)
└── test/
Files are named as: library_id.document_uuid.page_uuid.extension
Example: mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be.json
Each label contains up to 16 bibliographic attributes. The following attributes may contain multiple values: author, illustrator, translator, editor, publisher. All others are single-value only.
labels/ example:
{
  "task_id": "238776",
  "library_id": "mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be",
  "title": "TĚLOCVIK pro školy obecné a měšťanské.",
  "placeTerm": "PRAZE.",
  "dateIssued": "1895.",
  "publisher": ["„Nov. kalendáře učitelského.“"],
  "author": ["V. BEŠŤÁK."],
  "illustrator": ["K. SUCHÝ."],
  "editor": ["FR. PITRÁK", "A. HOLUB."]
}
labels.with_geometry/ example:
{
  "task_id": "238776",
  "library_id": "mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be",
  "title": ["TĚLOCVIK pro školy obecné a měšťanské.", [74, 447, 1111, 322]],
  "placeTerm": ["PRAZE.", [550, 1982, 227, 50]],
  "dateIssued": ["1895.", [580, 2111, 89, 40]],
  "publisher": [["„Nov. kalendáře učitelského.“", [560, 2051, 491, 46]]],
  "author": [["V. BEŠŤÁK.", [445, 970, 375, 61]]],
  "illustrator": [["K. SUCHÝ.", [461, 1314, 331, 57]]],
  "editor": [
    ["FR. PITRÁK", [242, 1140, 371, 59]],
    ["A. HOLUB.", [689, 1149, 324, 49]]
  ]
}
Bounding boxes use pixel coordinates: [x_left, y_top, width, height].
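A short sketch of how the labels.with_geometry records above can be loaded into a uniform structure (the path in the example call is hypothetical):

import json

MULTI_VALUE = {"author", "illustrator", "translator", "editor", "publisher"}

def read_label(path):
    """Return (library_id, fields) where each field maps to a list of {text, bbox} dicts."""
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    fields = {}
    for key, value in record.items():
        if key in ("task_id", "library_id"):
            continue
        pairs = value if key in MULTI_VALUE else [value]  # single-value fields hold one [text, bbox] pair
        fields[key] = [{"text": text, "bbox": bbox} for text, bbox in pairs]
    return record["library_id"], fields

library_id, fields = read_label("labels.with_geometry/test/example.json")  # hypothetical path
for key, items in fields.items():
    for item in items:
        x, y, w, h = item["bbox"]  # [x_left, y_top, width, height] in pixels
        print(key, item["text"], (x, y, w, h))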
2,118 scanned title pages from 14 Czech libraries
Time span: 1485 to the 21st century
Development and test split, test set fully manually verified
Released for research and non-commercial use only.
@article{kohut2024bibliopage,
title={BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction},
author={KohĂșt, Jan and DoÄekal, Martin and HradiĆĄ, Michal and VaĆĄko, Marek},
journal={arXiv preprint arXiv:2503.19658},
year={2024}
}
Contact: ikohut@fit.vutbr.cz
Repository: https://github.com/DCGM/biblio-dataset
Title pages can also be accessed via the original digital library using:
https://www.digitalniknihovna.cz/mzk/view/uuid:{doc_id}?page=uuid:{page_id}
Note: Resolution may differ from the dataset images. Always use the provided files for analysis. Use the source links only for additional context or browsing.
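Putting the file naming scheme and the URL template together, a file stem can be mapped back to its digital-library page roughly as follows (illustrative only, assuming the library prefix in the file name matches the path segment in the URL; prefer the provided images, as noted above):

def source_url(file_stem):
    """Build the digital-library URL from a library_id.document_uuid.page_uuid file stem."""
    library, doc_id, page_id = file_stem.split(".", 2)
    return f"https://www.digitalniknihovna.cz/{library}/view/uuid:{doc_id}?page=uuid:{page_id}"

print(source_url("mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be"))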
Deliberations of the Municipal Council of the City of Nantes, the Metropolitan Council, the Metropolitan Bureau of Nantes Métropole and the Communal Centre for Social Action (CCAS) of the City of Nantes.
This dataset aggregates the information obtained from the deliberations of the various bodies of the Collectivité Nantes Métropole and the City. A description of each body, as well as all the agendas and reports, is available on the Community's institutional website on the pages dedicated to the City Council, the Metropolitan Council, the Metropolitan Bureau and the CCAS. The data of the open deliberations in this dataset are extracted from the files transmitted by the community to the Prefecture for the control of legality through the FAST – Acts service. Deliberations are part of the common core of local data, i.e. a set of data that communities agree to publish as a matter of priority, following a shared way of organising information. As a result, the file is modeled to correspond to the standard schema defined under the umbrella of the Open Data France association.
Specification of the textual content of the deliberations, included to facilitate search: currently, the deliberations of the community bodies are validated on paper and signed by hand. The final versions published on the community's website are scans of these documents. In the case of scanned images, their content is only accessible visually and is not indexed by search engines. To facilitate search in this dataset, a free optical character recognition engine (Tesseract 4) is used, which is based on artificial intelligence (an LSTM-type neural network; see the Tesseract documentation). The content has a very high level of reliability, but occasional errors may remain. For functions other than search, always refer to the PDF documents, which alone are authentic.