Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).
Source images
The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.
Artificial noise application
The dataset was created as follows:
- First a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
- Then, six ideal types of image noise --- "blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains" --- were applied to both the colour and the greyscale version of the images, creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
- Lastly, all available combinations of *two* noise filters were applied to the colour and greyscale images, for an additional 30 versions.
This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.
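For reference, the version counts above can be reproduced with a few lines of Python (a sanity check of the arithmetic, not the original R noise-generation code):

from itertools import combinations

base = ["colour", "greyscale"]
noise = ["blur", "weak ink", "salt and pepper", "watermark", "scribbles", "ink stains"]

no_noise = [(b,) for b in base]                                             # 2 versions
one_layer = [(b, n) for b in base for n in noise]                           # 6 x 2 = 12 versions
two_layers = [(b, *pair) for b in base for pair in combinations(noise, 2)]  # 15 x 2 = 30 versions

versions = no_noise + one_layer + two_layers
print(len(versions))        # 44 image versions
print(322 * len(versions))  # 14,168 English documents
print(100 * len(versions))  # 4,400 Arabic documents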
The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See this link for how to unzip .tar.lzma files.
References:
Barcha, Pedro. 2017. "Old Books Dataset." GitHub repository. https://github.com/PedroBarcha/old-books-dataset.
Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. "Yarmouk Arabic OCR Dataset." In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150-54. IEEE.
Hegghammer, Thomas. 2021. "OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment." SocArXiv. https://osf.io/preprints/socarxiv/6zfvs
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are Tesseract-generated transcriptions (no images) of (most of) the IIT-CDIP dataset. To download the images of the IIT-CDIP dataset, go to https://data.nist.gov/od/id/mds2-2531
The directory structure of this dataset is the same as that of the IIT-CDIP dataset (although everything is in one tar, with "a.a", "a.b", ... directories), so it can be combined with the image IIT-CDIP dataset using rsync or a similar tool. This dataset contains an "X.layout.json" for each "X.png" in the IIT-CDIP dataset (it does not include sections 'a', 'w', 'x', 'y', and 'z').
The JSONs contain block/paragraph, line and word bounding boxes, with transcriptions for the words following the Tesseract format. The line and word annotations are taken directly from Tesseract; the block and paragraph output of Tesseract was discarded. The images were then run through both the PubLayNet and PrimaNet models available in LayoutParser (https://layout-parser.github.io/). The combined output of these models became the block/paragraph annotations (we kept the Tesseract output format, but each block has one paragraph of exactly the same shape).
Important: There is also a "rotation" value in the JSON (0, 90, 180, or 270), indicating that the JSON may correspond to a version of the IIT-CDIP image rotated by the given amount (we attempted to rotate documents to an upright position to get better OCR results).
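A minimal sketch of how one might pair a page image with its layout JSON and account for the rotation; the "rotation" field is described above, while the file paths and the rotation sign convention are assumptions to verify against the data:

import json
from pathlib import Path
from PIL import Image

def load_page(image_path, json_path):
    """Open an IIT-CDIP page image together with its X.layout.json."""
    with open(json_path) as f:
        layout = json.load(f)
    img = Image.open(image_path)
    # The JSON may describe a rotated copy of the page (0, 90, 180 or 270 degrees),
    # so rotate the image to match before using the bounding boxes.
    rotation = layout.get("rotation", 0)
    if rotation:
        img = img.rotate(-rotation, expand=True)  # direction of rotation is an assumption
    return img, layout

# hypothetical paths following the "a.a", "a.b", ... directory layout
img, layout = load_page(Path("a.a/a/a/doc.png"), Path("a.a/a/a/doc.layout.json"))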
These are the annotations used to pre-train Dessurt (https://arxiv.org/abs/2203.16618).
These annotations will be worse than those that would be obtained using a commercial OCR system (like those used to pre-train LayoutLMv2/v3).
The code used to produce these annotations is available here: https://github.com/herobd/ocr
This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT-4o. These articles are available both as a CSV with the prompt parameters as columns and as individual text files.
The files in the repository are as follows:
- ncse_hf_dataset: a Hugging Face dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with the original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.
- synth_gt.zip: a zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed token length (200, 100, 50, 25 or 10), for a total of 2 million tokens.
- synthetic_articles.zip: a zip file containing the CSV of all the synthetic articles and the prompts used to generate them.
- synthetic_articles_text.zip: a zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article CSV.
The data in this repo is used by the code repositories associated with the project:
- https://github.com/JonnoB/scrambledtext_analysis
- https://github.com/JonnoB/training_lms_with_synthetic_data
Source of data: https://github.com/FudanVI/benchmarking-chinese-text-recognition
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
GitHub: https://github.com/ds4v/NomNaOCR
Here, we introduce the NomNaOCR dataset for the old Vietnamese Hán-Nôm script, based on 3 tremendous and valuable historical works of Vietnam:
- Lục Vân Tiên by Nguyễn Đình Chiểu.
- Tale of Kiều or Truyện Kiều (versions 1866, 1871, and 1872) by Nguyễn Du.
- A full set of 5 parts of History of Greater Vietnam or Đại Việt Sử Ký Toàn Thư (ĐVSKTT), composed by many historians from the Trần to the Hậu Lê dynasty of Vietnam.
The dataset comprises 2,953 handwritten Pages (2,956 minus 3 ignored Pages) collected from the Vietnamese Nôm Preservation Foundation, which were analyzed and semi-annotated with bounding boxes to generate an additional 38,318 Patches (38,319 minus 1 ignored Patch) containing text, along with the Hán-Nôm strings in digital form. This currently makes NomNaOCR the biggest dataset for the Hán-Nôm script in Vietnam, serving 2 main problems in Optical Character Recognition on the Hán-Nôm script:
- Text Detection: detect the image regions that contain text. The input is an image (or a Page), and the output is a bounding box surrounding the text area found.
- Text Recognition: after detecting boxes or image regions containing text, each of these regions is cropped from the original image, forming small parts called Patches. The input is now a Patch, and the output is the text in that Patch.
A difference here is that our implementations were all done at the sequence level, which not only saves the cost of annotation but also helps us retain the context in the sequence, instead of working on each individual character as in most previous works.
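For illustration only, the Page-to-Patch step amounts to cropping each detected region before recognition; the file name and box values below are made up, and the box format (left, top, right, bottom) is an assumption, not the NomNaOCR annotation format:

from PIL import Image

page = Image.open("page.jpg")                        # a Page image (hypothetical file)
boxes = [(120, 40, 300, 900), (320, 40, 500, 900)]   # detector output (hypothetical boxes)
patches = [page.crop(box) for box in boxes]          # each crop becomes a Patch
for i, patch in enumerate(patches):
    patch.save(f"patch_{i}.png")                     # Patches are then fed to the recognizer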
OCR pipeline overview: https://github.com/ds4v/NomNaOCR/raw/main/Assets/ocr_pipeline1.jpg
**Note**: There are characters that Kaggle cannot display, so use the NomNaTong font to read the Hán-Nôm content properly.
NCSE v2.0 Dataset Repository
This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models".
Dataset Overview
The NCSE v2.0 is a digitized collection of six 19th-century English periodicals containing:
- 82,690 pages
- 1.4 million entries
- 321 million words
- 1.9 billion characters
The dataset includes:
- 1.1 million text entries
- 198,000 titles
- 17,000 figure descriptions
- 16,000 tables
Repository Contents
NCSE v2.0 Dataset
- NCSE_v2.zip: a folder containing a parquet file for each of the periodicals as well as a readme file.
Bounding Box Dataset
A zip file called bounding_box.zip containing:
- post_process: a folder of the processed periodical bounding box data
- post_process_fill: a folder of the processed periodical bounding box data WITH column filling
- bbox_readme.txt: a readme file and data description for the bounding boxes
Test Sets
- cropped_images.zip: 378 images cropped from the NCSE test set pages, all 2-bit png files
- ground_truth: 358 text files corresponding to the text from the cropped_images folder
Classification Training Data
The files below are used for training the classification models. They contain 12,000 observations, 2,000 from each periodical. The labels were classified using mistral-large-2411. This data is used to train the ModernBERT classifier described in the paper. The topics are taken from the International Press Telecommunications Council (IPTC) subject codes.
- silver_IPTC_class.parquet: IPTC topic classification silver set
- silver_text_type.parquet: text-type classification silver set
Classified Data
The zip file classification_data.zip contains all rows classified using the ModernBERT classifier described in the paper.
- IPTC_type_classified.zip: contains one parquet file per periodical
- text_type_classified.zip: contains one parquet file per periodical
- classification_readme.md: description of the data
Classification Mappings
Data for mapping the classification codes to human-readable names.
- class_mappings.zip: contains a json for each classification type (IPTC_class_mapping.json and text_type_class_mapping.json)
Original Images
The original page images can be found at the King's College London repositories for the Monthly Repository, Northern Star, Leader, English Woman's Journal, Tomahawk and Publishers' Circular, or via the project central archive.
Citation
If you use this dataset, please cite it; no citation data is currently available.
Related Code
All original code related to this project, including the creation of the datasets and their analysis, can be found at https://github.com/JonnoB/ereading_the_unreadable
Contact
For questions about the dataset, please create an issue in this repository.
Usage Rights
In keeping with the original NCSE dataset, all data is made available under a Creative Commons Attribution 4.0 International License (CC BY).
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488).
The data can be used to fine-tune a pre-trained large language model using transfer learning, creating a model that can be used in inference mode to generate the labels automatically, thereby producing structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, the export format of the doccano open-source text annotation software (https://doccano.github.io/doccano/) used to create the labels.
The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These documents had to undergo OCR, which resulted in lower-quality text and lower-quality training data. The majority of the labelled data is from the higher-quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold-standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
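A hedged sketch of reading a JSONL(Relation) export; the field names (entities, relations, start_offset/end_offset, from_id/to_id, type) follow doccano's documented relation format and should be checked against the actual files, and the file name in the example call is hypothetical:

import json

def load_annotations(path):
    """Yield (text, entities, relations) triples from a doccano JSONL(Relation) file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            entities = {e["id"]: e for e in doc.get("entities", [])}
            relations = [
                (entities[r["from_id"]]["label"], r["type"], entities[r["to_id"]]["label"])
                for r in doc.get("relations", [])
            ]
            yield doc["text"], list(entities.values()), relations

# e.g. spans labelled as rock formations, geological ages, rock types, physical
# properties or locations, linked by relations such as overlies or observedIn
for text, entities, relations in load_annotations("bgs_memoirs.jsonl"):
    print(len(entities), len(relations))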
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
MMDocBench is an open-sourced benchmark with various OCR-free document understanding tasks for evaluating fine-grained visual perception and reasoning abilities. For more details, please refer to the project page: https://MMDocBench.github.io/.
Dataset Structure
MMDocBench consists of 15 main tasks and 48 sub-tasks, involving 2,400 document images, 4,338 QA pairs… See the full description on the dataset page: https://huggingface.co/datasets/next-tat/MMDocBench.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
POPP datasets
This repository contains 3 datasets created within the POPP project (Project for the OCRisation of the Paris Population Census) for the task of handwriting text recognition. These datasets were published in "Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census" at DAS 2022.
The 3 datasets are called "Generic dataset", "Belleville", and "Chaussée d'Antin", and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, so each page in these datasets corresponds to 30 lines.
The structure of each dataset is the following:
- labels.json
- the line images, split into the folders train, valid and test
The double pages were scanned at a resolution of 200 dpi and saved as PNG images with 256 gray levels. The line and page images are shared in the TIFF format, also with 256 gray levels. Since the lines are extracted from table rows, we defined 4 special characters to describe the structure of the text. We provide a script format_dataset.py to define which special character you want to use in the ground truth.
The splits for the Generic dataset and Belleville have been made at the double-page level, so that each writer appears in only one subset among train, validation and test. The following table summarizes the splits and the number of writers for each dataset:
| Dataset | train (# of lines) | validation (# of lines) | test (# of lines) | # of writers |
|---|---|---|---|---|
| Generic | 3840 (128 pages) | 480 (16 pages) | 480 (16 pages) | 80 |
| Belleville | 1140 (38 pages) | 150 (5 pages) | 180 (6 pages) | 1 |
| Chaussée d'Antin | 625 | 78 | 77 | 10 |
Generic dataset (or POPP dataset)
Belleville dataset
This dataset is a mono-writer dataset made of 1470 lines (49 pages) from the Belleville district census of 1926.
Chaussée d'Antin dataset
This dataset is a multi-writer dataset made of 780 lines (26 pages) from the Chaussée d'Antin district census of 1926, written by 10 different writers.
Error reporting
Errors may persist in the ground truth, so any suggestions for correction are welcome. To do so, please make a merge request on the GitHub repository and include the correction both in the labels.json file and in the XML file concerned.
Citation Request
If you publish material based on this database, we request that you include a reference to the paper: T. Constum, N. Kempf, T. Paquet, P. Tranouez, C. Chatelain, S. Brée, and F. Merveille, "Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census", Document Analysis Systems (DAS), pp. 143-157, La Rochelle, 2022.
Donut 🍩: OCR-Free Document Understanding Transformer (ECCV 2022) -- SynthDoG datasets
For more information, please visit https://github.com/clovaai/donut
The links to the SynthDoG-generated datasets are here:
- synthdog-en: English, 0.5M
- synthdog-zh: Chinese, 0.5M
- synthdog-ja: Japanese, 0.5M
- synthdog-ko: Korean, 0.5M
To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.
How to Cite
If you find this work useful… See the full description on the dataset page: https://huggingface.co/datasets/naver-clova-ix/synthdog-ko.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th and 19th centuries and the beginning of the 20th century (1771-1914). The periodical issues were retrieved from the digital library service of Slovenia's national library (https://dlib.si) in the form of OCR-ed PDF and TXT files. Before linguistically annotating the documents (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts, cf. http://hdl.handle.net/11356/1907. The documents in the collection are enriched with the following metadata obtained from dLib:
- Document ID (URN)
- Periodical name
- Document (periodical issue) title
- Volume number (if available)
- Issue number (if available)
- Year of publication
- Date of publication (of varying granularity, based on original metadata available)
- Source (URL of the original digitised document available at dlib.si)
- Image (see below)
- Quality (see below)
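As a rough illustration of the annotation layer (not the project's exact processing scripts), CLASSLA-Stanza can be run on Slovenian text roughly as follows; the processor names follow the CLASSLA documentation, and the sample sentence is made up:

import classla

classla.download("sl")  # fetch the standard Slovenian models (one-time)
nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,ner")

doc = nlp("Ljubljanske novice so izhajale v 18. stoletju.")
for sentence in doc.sentences:
    for word in sentence.words:
        # each token carries the lemma and part-of-speech tag added to the corpus
        print(word.text, word.lemma, word.upos)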
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The corpus contains meeting proceedings of the Carniolan Provincial Assembly from 1861 to 1913 (Obravnave deželnega zbora kranjskega / Bericht über die Verhandlungen des krainischen Landtages). The corpus comprises 694 sessions (15,353 pages, approximately 10 million words). The source data (scanned and OCR-processed PDF documents) originally come from the Digital Library of Slovenia dLib.si (http://www.dlib.si) and History of Slovenia - SIstory (https://www.sistory.si) portals. The documents are bilingual, in Slovenian and German, depending on the speaker. German was first typeset in the Gothic script and later in the Latin script. The documents were automatically processed and the following data extracted: titles, agenda, attendance, start and end of the session, speakers, and comments. Language was detected at the sentence level; roughly 58% of sentences are in Slovenian and 42% in German. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using Trankit (https://github.com/nlp-uoregon/trankit) for Slovenian and German, while Lingua (https://github.com/pemistahl/lingua-py) was used for language detection. The documents are in the Parla-CLARIN (https://github.com/clarin-eric/parla-clarin) compliant TEI XML format, with each session in one file.
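A minimal sketch for pulling speeches out of one session file, assuming the standard Parla-CLARIN/TEI encoding of utterances as <u who="..."> elements (element names and the file name should be checked against the corpus itself):

import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def speeches(session_file):
    """Yield (speaker, text) pairs from a Parla-CLARIN TEI session file."""
    root = ET.parse(session_file).getroot()
    for u in root.iter(TEI + "u"):
        speaker = u.get("who", "unknown")
        text = " ".join("".join(u.itertext()).split())  # collapse annotated tokens to plain text
        yield speaker, text

for speaker, text in speeches("session-1861-01.xml"):  # hypothetical file name
    print(speaker, text[:80])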
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BLIP3-OCR-200M Dataset
Overview
The BLIP3-OCR-200M dataset is designed to address the limitations of current Vision-Language Models (VLMs) in processing and interpreting text-rich images, such as documents and charts. Traditional image-text datasets often struggle to capture nuanced textual information, which is crucial for tasks requiring complex text comprehension and reasoning.
Key Features
OCR Integration: The dataset incorporates Optical Character… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/blip3-ocr-200m.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CGPG project (Calfa GREgORI Patrologia Graeca), led by Jean-Marie Auwers (UCLouvain), aims to OCRize the remaining non-digital versions of the Patrologia Graeca volumes. The project relies on the expertise of GREgORI and Calfa.
The project is sponsored by the ASBL *Byzantion*, the Fondation *Sedes Sapientiae*, the Institut *Religions, Spiritualités, Cultures, Sociétés* (RSCS, UCLouvain) and the Centre d'études orientales (CIOL, UCLouvain) and by a generous donor who wishes to remain anonymous. Other sponsors have recently expressed their willingness to support the project.
This repository contains the Sketch Engine XML files, with linguistic markup.
Raw data are available on GitHub: https://github.com/calfa-co/Patrologia-Graeca
For optimal use in Sketch Engine, configure the corpus (Manage Corpus / Configure / Expert settings) as below:
DOCSTRUCTURE "doc"
ENCODING "UTF-8"
INFO ""
LANGUAGE "Ancient Greek"
NAME "CGPG_20250629"
PATH "/corpora/ca/user_data/sso_1392/manatee/cgpg_20250629"
VERTICAL "| ca_getvertical '/corpora/ca/user_data/sso_1392/registry/cgpg_20250629' 'docx'"
ATTRIBUTE "word" {
MAPTO "lemma"
}
ATTRIBUTE "intuitive_form" {
}
ATTRIBUTE "lemma" {
}
ATTRIBUTE "intuitive_lemma" {
}
ATTRIBUTE "pos" {
}
ATTRIBUTE "headword" {
}
STRUCTURE "w" {
DEFAULTLOCALE "C"
ENCODING "UTF-8"
LANGUAGE ""
NESTED ""
ATTRIBUTE "id" {
DYNLIB ""
DYNTYPE "index"
ENCODING "UTF-8"
LOCALE "C"
MULTISEP ","
MULTIVALUE "n"
TYPE "MD_MI"
}
}
STRUCTURE "doc" {
DEFAULTLOCALE "C"
ENCODING "UTF-8"
LANGUAGE ""
NESTED ""
ATTRIBUTE "id" {
DYNLIB ""
DYNTYPE "index"
ENCODING "UTF-8"
LOCALE "C"
MULTISEP ","
MULTIVALUE "n"
TYPE "MD_MI"
}
}
STRUCTURE "docx" {
DEFAULTLOCALE "C"
ENCODING "UTF-8"
LANGUAGE ""
NESTED ""
ATTRIBUTE "id" {
DYNLIB ""
DYNTYPE "index"
ENCODING "UTF-8"
LABEL "File ID"
LOCALE "C"
MULTISEP ","
MULTIVALUE "n"
TYPE "MD_MI"
UNIQUE "1"
}
ATTRIBUTE "filename" {
DYNLIB ""
DYNTYPE "index"
ENCODING "UTF-8"
LABEL "File name"
LOCALE "C"
MULTISEP ","
MULTIVALUE "n"
TYPE "MD_MI"
}
}
Bibliography
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A biodiversity dataset graph: Biodiversity Heritage Library
Biodiversity datasets, or descriptions of biodiversity datasets, are increasingly available through open digital data infrastructures such as the Biodiversity Heritage Library (BHL, https://biodiversitylibrary.org). "The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community." - https://biodiversitylibrary.org , June 2019.
However, little is known about how these networks, and the data accessed through them, change over time. This dataset provides snapshots of all OCR item texts (e.g., individual items) available through BHL, as tracked by Preston (https://github.com/bio-guoda/preston, https://doi.org/10.5281/zenodo.1410543) over the period May to June 2019.
This snapshot contains about 120GB of uncompressed OCR texts across 227k OCR BHL items. Also, a snapshot of the BHL item catalog at https://www.biodiversitylibrary.org/data/item.txt is included.
The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. Only two index files and two provenance files are included, and these have also been included individually in this dataset publication. Index files provide a way to link provenance files in time, establishing a versioning mechanism. Provenance files describe how, when and where the BHL OCR text items were retrieved. For more information, please visit https://preston.guoda.bio or https://doi.org/10.5281/zenodo.1410543.
To retrieve and verify the downloaded BHL biodiversity dataset graph, first concatenate all the downloaded preston-*.tar.gz files (e.g., cat preston-*.tar.gz > preston.tar.gz). Then, extract the archives into a "data" folder. After that, verify the index of the archive by reproducing the following result:
$ java -jar preston.jar history
<0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion>
To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" looks like the lines shown below, with each line including "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.
$ java -jar preston.jar verify
hash://sha256/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca file:/home/preston/preston-bhl/data/e0/c1/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca OK CONTENT_PRESENT_VALID_HASH 49458087
hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 file:/home/preston/preston-bhl/data/1a/57/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 OK CONTENT_PRESENT_VALID_HASH 25745
hash://sha256/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c file:/home/preston/preston-bhl/data/85/ef/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c OK CONTENT_PRESENT_VALID_HASH 519892
Note that a copy of the java program "preston", preston.jar, is included in this publication. The program runs on java 8+ virtual machine using "java -jar preston.jar", or in short "preston".
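As an additional cross-check outside the preston tool, the content-addressed layout shown in the sample output above (data/xx/yy/<sha256>) can be re-verified with a short script; the directory layout is taken from that output, everything else is a sketch:

import hashlib
from pathlib import Path

def verify(data_dir="data"):
    """Recompute sha256 for each extracted file and compare it with its file name."""
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
                digest.update(chunk)
        hexdigest = digest.hexdigest()
        status = "OK" if hexdigest == path.name else "HASH_MISMATCH"
        print(f"hash://sha256/{hexdigest} {path} {status} {path.stat().st_size}")

verify()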
Files in this data publication:
README - this file
preston-[00-ff].tar.gz - preston archives containing BHL OCR item texts, their provenance and a provenance index.
9e8c86243df39dd4fe82a3f814710eccf73aa9291d050415408e346fa2b09e70 - preston index file
2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a - preston index file
89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a - preston provenance file
41b19aa9456fc709de1d09d7a59c87253bc1f86b68289024b7320cef78b3e3a4 - preston provenance file
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BiblioPage is a dataset of scanned title pages annotated with structured bibliographic metadata and bounding boxes. It supports research in document understanding, bibliographic metadata extraction, and OCR alignment.
Reference: BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction
The ZIP archive contains:
images/
├── train/                # Development set images (.jpg)
└── test/                 # Test set images (.jpg)
labels/
├── train/                # Metadata only (.json)
└── test/
labels.with_geometry/
├── train/                # Metadata + bounding boxes (.json)
└── test/
Files are named as: library_id.document_uuid.page_uuid.extension
Example: mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be.json
Each label contains up to 16 bibliographic attributes. The following attributes may contain multiple values: author, illustrator, translator, editor, publisher. All others are single-value only.
labels/ example:
{
  "task_id": "238776",
  "library_id": "mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be",
  "title": "TĚLOCVIK pro školy obecné a měšťanské.",
  "placeTerm": "PRAZE.",
  "dateIssued": "1895.",
  "publisher": ["„Nov. kalendáře učitelského.“"],
  "author": ["V. BEŠŤÁK."],
  "illustrator": ["K. SUCHÝ."],
  "editor": ["FR. PITRÁK", "A. HOLUB."]
}
labels.with_geometry/ example:
{
  "task_id": "238776",
  "library_id": "mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be",
  "title": ["TĚLOCVIK pro školy obecné a měšťanské.", [74, 447, 1111, 322]],
  "placeTerm": ["PRAZE.", [550, 1982, 227, 50]],
  "dateIssued": ["1895.", [580, 2111, 89, 40]],
  "publisher": [["„Nov. kalendáře učitelského.“", [560, 2051, 491, 46]]],
  "author": [["V. BEŠŤÁK.", [445, 970, 375, 61]]],
  "illustrator": [["K. SUCHÝ.", [461, 1314, 331, 57]]],
  "editor": [
    ["FR. PITRÁK", [242, 1140, 371, 59]],
    ["A. HOLUB.", [689, 1149, 324, 49]]
  ]
}
Bounding boxes use pixel coordinates: [x_left, y_top, width, height].
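A short sketch of how the labels.with_geometry records above can be loaded into a uniform structure (the path in the example call is hypothetical):

import json

MULTI_VALUE = {"author", "illustrator", "translator", "editor", "publisher"}

def read_label(path):
    """Return (library_id, fields) where each field maps to a list of {text, bbox} dicts."""
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    fields = {}
    for key, value in record.items():
        if key in ("task_id", "library_id"):
            continue
        pairs = value if key in MULTI_VALUE else [value]  # single-value fields hold one [text, bbox] pair
        fields[key] = [{"text": text, "bbox": bbox} for text, bbox in pairs]
    return record["library_id"], fields

library_id, fields = read_label("labels.with_geometry/test/example.json")  # hypothetical path
for key, items in fields.items():
    for item in items:
        x, y, w, h = item["bbox"]  # [x_left, y_top, width, height] in pixels
        print(key, item["text"], (x, y, w, h))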
2,118 scanned title pages from 14 Czech libraries
Time span: 1485 to the 21st century
Development and test split, test set fully manually verified
Released for research and non-commercial use only.
@article{kohut2024bibliopage,
title={BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction},
author={KohĂșt, Jan and DoÄekal, Martin and HradiĆĄ, Michal and VaĆĄko, Marek},
journal={arXiv preprint arXiv:2503.19658},
year={2024}
}
Contact: ikohut@fit.vutbr.cz
Repository: https://github.com/DCGM/biblio-dataset
Title pages can also be accessed via the original digital library using:
https://www.digitalniknihovna.cz/mzk/view/uuid:{doc_id}?page=uuid:{page_id}
Note: Resolution may differ from the dataset images. Always use the provided files for analysis. Use the source links only for additional context or browsing.
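Putting the file naming scheme and the URL template together, a file stem can be mapped back to its digital-library page roughly as follows (illustrative only, assuming the library prefix in the file name matches the path segment in the URL; prefer the provided images, as noted above):

def source_url(file_stem):
    """Build the digital-library URL from a library_id.document_uuid.page_uuid file stem."""
    library, doc_id, page_id = file_stem.split(".", 2)
    return f"https://www.digitalniknihovna.cz/{library}/view/uuid:{doc_id}?page=uuid:{page_id}"

print(source_url("mzk.e85a4ad0-e261-11ed-9d56-5ef3fc9bb22f.59e59f06-c2ce-4c10-aa9d-33de3b8b41be"))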
Deliberations of the Municipal Council of the City of Nantes, the Metropolitan Council, the Metropolitan Bureau of Nantes Métropole and the Communal Centre for Social Action (CCAS) of the City of Nantes.
This dataset aggregates the information obtained from the deliberations of the various bodies of the Collectivité Nantes Métropole and the City. A description of each body, as well as all the agendas and reports, is available on the Community's institutional website on the pages dedicated to the City Council, the Metropolitan Council, the Metropolitan Bureau and the CCAS. The data of the open deliberations in this dataset are extracted from the files transmitted by the community to the Prefecture for the control of legality through the FAST – Acts service. Deliberations are part of the common core of local data, i.e. a set of data that communities agree to publish as a matter of priority, following a shared way of organising information. As a result, the file is modeled to correspond to the standard schema defined under the umbrella of the Open Data France association.
Specification of the textual content of the deliberations, included to facilitate search: currently, the deliberations of the community bodies are validated on paper and signed by hand. The final versions published on the community's website are scans of these documents. In the case of scanned images, their content is only accessible visually and is not indexed by search engines. To facilitate search in this dataset, a free optical character recognition engine (Tesseract 4) is used, which is based on artificial intelligence (an LSTM-type neural network; see the Tesseract documentation). The content has a very high level of reliability, but occasional errors may remain. For functions other than search, always refer to the PDF documents, which alone are authentic.