Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861.
The purpose of this dataset is to evaluate the performance of Optical Character Recognition (OCR) and Named Entity Recognition (NER) on 19th century French documents.
This dataset is divided into two parts:
For the labeled dataset, we provide:
For the unlabeled dataset, we provide:
How to cite this dataset
Please cite this dataset as:
N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.
@dataset{abadie_dataset_22,
author = {Abadie, Nathalie and
Bacciochi, St{\'e}phane and
Carlinet, Edwin and
Chazalon, Joseph and
Cristofoli, Pascal and
Dum{\'e}nieu, Bertrand and
Perret, Julien},
title = {{A} {D}ataset of {F}rench {T}rade {D}irectories from the 19th {C}entury ({FTD})},
month = mar,
year = 2022,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.6394464},
url = {https://doi.org/10.5281/zenodo.6394464}
}
You may also be interested in our paper presented at DAS 2022 (15th IAPR International Workshop on Document Analysis Systems), which compares the performance of OCR and NER systems on this dataset:
N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu, A Benchmark of Named Entity Recognition Approaches in Historical Documents — Application to 19th Century French Directories, May 2022, La Rochelle, France, Springer.
@inproceedings{abadie_das_22,
author = {Abadie, Nathalie and
Carlinet, Edwin and
Chazalon, Joseph and
Dum{\'e}nieu, Bertrand},
title = {{A} {B}enchmark of {N}amed {E}ntity {R}ecognition {A}pproaches in {H}istorical {D}ocuments — {A}pplication to 19th {C}entury {F}rench {D}irectories},
month = may,
year = 2022,
publisher = {Springer},
address = {La Rochelle, France}
}
Copyright and License
The images were extracted from the original source https://gallica.bnf.fr, owned by the Bibliothèque nationale de France (French national library).
Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.
Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.
Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.
The original contents were significantly transformed before being included in this dataset.
All derived content is licensed under the permissive Creative Commons Attribution 4.0 International license.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The collection "Fiction littéraire de Gallica" includes 19,240 public domain documents from the digital platform of the French National Library that were originally classified as novels or, more broadly, as literary fiction in prose. It consists of 372 tables of data in TSV format, one for each year of publication from 1600 to 1996 (all the missing years are in the 17th and 20th centuries). Each table is structured at the page level of each novel (5,723,986 pages in all) and contains the complete text along with some metadata. It can be opened in Excel or, preferably, with modern data analysis environments in R or Python (tidyverse, pandas…).
This corpus can be used for large-scale quantitative analyses in computational humanities. The OCR text is presented in a raw format without any correction or enrichment in order to be directly processed for text mining purposes.
The extraction is based on a historical categorization of the novels: the Y2 or Ybis classification. This classification, invented in 1730, is the only one that has been continuously applied to the BNF collections now available in the public domain (mainly before 1950). Consequently, the dataset is based on a definition of "novel" that is generally contemporary with the publication.
A French data paper (in PDF and HTML) presents the construction process of the Y2 category and describes the structuring of the corpus. It also gives several examples of possible uses for computational humanities projects.
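As a minimal sketch of working with the per-year TSV tables described above, the snippet below loads one table into pandas. The filename pattern and the column names are assumptions for illustration; only the TSV format and page-level structure are stated in the description.

```python
# Minimal sketch: read one per-year page-level TSV table into a DataFrame.
# The filename pattern ("fiction_gallica_<year>.tsv") is a hypothetical example.
import pandas as pd

def load_year(path_or_buffer):
    """Read one per-year TSV of page-level records, keeping all values as text."""
    # quoting=3 is csv.QUOTE_NONE: raw OCR text may contain stray quote characters.
    return pd.read_csv(path_or_buffer, sep="\t", quoting=3, dtype=str)

# Hypothetical usage: concatenate a range of years for a corpus-level analysis.
# frames = [load_year(f"fiction_gallica_{y}.tsv") for y in range(1800, 1810)]
# corpus = pd.concat(frames, ignore_index=True)
```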
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861.
The purpose of this dataset is to evaluate the performance of nested Named Entity Recognition approaches on 19th century French documents, on both clean and noisy texts (the noise being introduced by the OCR engine).
Source dataset
This dataset has been built from the following source dataset:
N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.
Our experiments // Paper
Details about our experiments on nested NER approaches are given in our paper (the pre-print version is available here).
Tual, S., Abadie, N., Chazalon, J., Duménieu, B., & Carlinet, E. (2023). A Benchmark of Nested NER Approaches in Historical Structured Documents. Proceedings of the 17th International Conference on Document Analysis and Recognition, San José, California, USA. 2023. Springer. https://hal.science/hal-03994759v2
Our code is available on GitHub.
Dataset overview
The following list describes the keys of the JSON file which contains the complete materials of our experiments.
- id: Entry unique ID within a given page
- box: Bounding box of the entry in the scanned directory page
- book: Source directory of the entry (*see more information below*)
- page: Page ID within a given directory
- valid_box: Whether the bounding box of the entry is valid (*all bounding boxes are valid here*)
- text_ocr_ref: OCR-extracted and manually corrected text of the entry
- nested_ner_xml_ref: text_ocr_ref with nested NER entities
- text_ocr_pero: OCR-extracted text of the entry with the PERO-OCR engine (the best engine according to the experiments of Abadie et al.)
- has_valid_ner_xml_pero: Whether the mapping between the nested NER entities annotated by hand on the reference text and the PERO OCR text is correct (in our experiments, we only use entries with a True value)
- nested_ner_xml_pero: Annotated noisy entries produced with PERO OCR
- text_ocr_tess: OCR-extracted text of the entry with the Tesseract engine (*not used in our experiments*)
- has_valid_ner_xml_tess: Whether the mapping between the nested NER entities annotated by hand on the reference text and the Tesseract text is correct (*not used in our experiments*)
- nested_ner_xml_tess: Annotated noisy entries produced with Tesseract (*not used in our experiments*)
Nested entities are annotated using XML tags. Our hierarchy of entities is a *Part Of* hierarchy with two levels, meaning that each bottom-level entity is contained in a top-level entity.
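As a minimal sketch of how such entries might be consumed, the snippet below keeps only entries whose PERO annotation mapping is marked valid and walks the nested XML tags. The entity tag names used in the example (PER, ACT) are illustrative only, not a statement of the dataset's actual tag set.

```python
# Sketch: filter usable entries and enumerate nested entities.
# Follows the JSON keys listed above; entity tag names are hypothetical.
import json
import xml.etree.ElementTree as ET

def iter_entities(nested_xml):
    """Yield (tag, text, depth) for every entity tag in an annotated entry."""
    # Wrap the snippet in a dummy root: annotated entries have no single root tag.
    root = ET.fromstring(f"<entry>{nested_xml}</entry>")
    def walk(elem, depth):
        for child in elem:
            yield (child.tag, "".join(child.itertext()), depth)
            yield from walk(child, depth + 1)
    yield from walk(root, 0)

def usable_entries(path):
    """Keep only entries whose PERO entity mapping is marked as valid."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return [e for e in entries if e.get("has_valid_ner_xml_pero")]
```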
Source documents // Copyright and licence
This section has been copied from the original dataset description.
The images were extracted from the original source https://gallica.bnf.fr, owned by the *Bibliothèque nationale de France* (French national library).
Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.
Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.
Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.
The original contents were significantly transformed before being included in this dataset.
All derived content is licensed under the permissive *Creative Commons Attribution 4.0 International* license.
Links to the original contents are given below:
🇫🇷 French Public Domain Books 🇫🇷
French-Public Domain-Books, or French-PD-Books, is a large collection aiming to aggregate all the French monographs in the public domain. The collection was originally compiled by Pierre-Carl Langlais, on the basis of a large corpus curated by Benoît de Courson and Benjamin Azoulay for Gallicagram, and in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project giving access to word and ngram search on very large cultural… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/French-PD-Books.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A corpus containing all digitized French novels from the beginning of print (the first entry is from 1473) to the 18th century.
French novels of the period have been identified using the Y2 shelfmark of the French National Library catalog, which has served to classify past and present collections of novels in France from 1730 to 1996. The combined use of digitized sources from Gallica, Google Books, Archive.org and other digital libraries made it possible to attain a high representativeness: 78% of the novels of the 1450-1600 period and 68% of the novels of the 1600-1700 period have been retrieved.
The corpus is part of a planned collection of French Fiction (1050-1920) that will also integrate Geste (a medieval corpus curated by Jean-Baptiste Camps) and Fictions littéraires de Gallica (a 1600-1950 corpus extracted from Gallica with Pierre-Carl Langlais, with a strong focus on the 19th century). While it aims to bridge the two pre-existing parts of the collection, it is also a more ambitious experiment in the systematic collection of existing digital sources.
The project remains very much a work in progress at this stage. Occasional errors in the metadata and in the identification of unique works are still possible. Besides, the identification of multi-volume works remains challenging in digital sources beyond Gallica.
The repository includes the following files:
The metadata of available and unavailable files for all novels identified in the 16th century (corpus_roman_metadata_16.tsv) and the 17th century (corpus_roman_metadata_17.tsv). All the editions have been tentatively assigned to a unique work (work_id) based on the title, the author and additional metadata. This dataset includes both information on a specific digitized volume (volume_file, volume_title, volume_date, volume_edition_id) and on the earliest edition of the work recorded by the French National Library (first_edition, first_edition_titre, first_edition_date), as well as the identification of the author (prenom_auteur, nom_auteur) and the complete list of all available editions (list_edition_bnf). When digitized files are not available for a given work, the information on the volume is replaced with a missing data mark (NA). An edition-based dataset was initially contemplated, but it turned out to be much harder than expected: the French National Catalog does not record all the editions and print runs of the period, and it would have been necessary to check and create unique edition IDs for numerous Google Books volumes.
The complete text of the novels when available (corpus_roman_16_text.tsv and corpus_roman_17_text.tsv). The use of contemporary OCR software on early modern texts has long yielded poor results, as words, typographies, and even letters were markedly different from the corpora these tools were trained on. Consequently, numerous volumes from Gallica simply have no OCR, as the results were below the quality requirements of the digital library. New historical OCR models will make it possible to create a reliable OCR of the entire corpus. The dataset includes all the text at the page level whenever there is some text on the page. Page numbering is based on the absolute numbering of the file, not on the original numbering of the edition.
A classified dataset of 159 novels from the 17th century in four major genres of the period: the chivalric novel, the love novel, the historical novel and the comic novel. The classification is based on an exceptional source of 1731, the catalog of novels by Nicolas Lenglet du Fresnoy (published as the second volume of De l'usage des romans). The classified dataset includes both the text (as in corpus_roman_17_text.tsv) at the page level and the lemmatization realized with a syntactic model trained on 17th century French (https://github.com/e-ditiones/LEM17).
A classification model created with the classified dataset. This "Fresnoy" model has a high accuracy (93%), which can be partly attributed to overfitting (as there is a limited number of novels per genre). The model can be reused with Tidysupervise, a small R extension to create supervised text models.
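The per-work grouping described in the metadata files (several volume rows sharing a work_id, with NA marking missing digitizations) can be sketched with pandas. Only column names listed in the description above are used; the aggregation itself is an illustrative example, not part of the dataset.

```python
# Sketch: flag, for each unique work, whether at least one digitized volume exists,
# using the documented columns work_id and volume_file (NA = no digitized file).
import pandas as pd

def works_with_text(metadata):
    """Return one row per work_id with a boolean 'digitized' flag."""
    return (metadata.assign(digitized=metadata["volume_file"].notna())
                    .groupby("work_id", as_index=False)["digitized"].any())

# Hypothetical usage:
# meta = pd.read_csv("corpus_roman_metadata_17.tsv", sep="\t")
# print(works_with_text(meta)["digitized"].mean())  # share of works with a digitized volume
```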
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for Logical-layout analysis on French Historical Newspapers
This is a dataset for training and testing logical-layout analysis and recognition systems on French historical documents. The original data is part of the "Fond régional: Franche Comté" collection, which is curated by Gallica, the digital portal of the Bibliothèque nationale de France (BnF).
Description
This dataset is divided into a train and a test set. The train and test sets have been designed to cover as much as possible of the various layouts that exist in the "Fond régional: Franche Comté" collection. To do so, we have divided the documents into three layout types:
1c: documents where the text is displayed in one column, as in books;
2c: documents where the text is displayed in two columns;
3c+: documents where there are at least three columns of text, as in newspapers.
Each of these folders contains subfolders whose names start with the letters 'cb'. These are the identifiers of a newspaper collection such as "Le Petit Semeur". An XML file describing the collection is contained in each of these folders, but it is not relevant to the logical-layout analysis task. They also contain subfolders whose names start with the letters 'bpt', which contain the following files:
XXX.xml: the original XML file as gathered from Gallica.
truelabels_block: a CSV file giving the true label for each TextBlock tag. Each line contains the page, the block_id, the first and last lines of text of the block, and its label.
truelabels_line: a CSV file giving the true label for each TextLine tag. Each line contains the page, the line_id, the text of the line, and its label.
XXX_docbook.xml: the document after having been processed by a logical-layout recognition system.
The original XML gathers multiple pieces of information about the document, in particular metadata (described using the Dublin Core schema), the page numbering, and the OCR output, which is described with the XML ALTO format. As such, the files already provide the physical layout analysis and the reading order of the documents.
The XML ALTO format provides the text content and the physical layout of documents in the following manner. The OCR output for the whole document is available in a PrintSpace tag. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:
Id: the tag's identifier
Height, Width: the text height and width
Vpos: the vertical position of the text on the page. The higher the value, the lower the text is on the page
Hpos: the horizontal position of the text on the page. The higher the value, the further on the right the text is on the page
Language: the language of the text (only for TextBlock tags).
The blocks of text are labelled as Text, Title, Header or Other. The lines of text are labelled as Text, Firstline (to indicate the first line of a paragraph), Title, Header or Other. These labels are used in the truelabels_line.csv, truelabels_block.csv and XXX_docbook.xml files.
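The ALTO structure described above can be walked with the standard library. The sketch below extracts the text of each TextLine; note two assumptions beyond the description: real Gallica ALTO files carry an XML namespace (stripped here), and each word's text sits in the String tag's CONTENT attribute, per the ALTO specification.

```python
# Sketch: yield (line_id, text) for every TextLine in an ALTO document,
# joining the CONTENT attributes of its String children with spaces.
import xml.etree.ElementTree as ET

def local(tag):
    """Strip a namespace prefix like '{http://...}TextLine' down to 'TextLine'."""
    return tag.rsplit("}", 1)[-1]

def text_lines(alto_root):
    """Iterate over all TextLine tags, whatever TextBlock/ComposedBlock nesting holds them."""
    for elem in alto_root.iter():
        if local(elem.tag) == "TextLine":
            words = [e.get("CONTENT", "") for e in elem if local(e.tag) == "String"]
            yield elem.get("ID"), " ".join(words)

# Hypothetical usage:
# root = ET.parse("XXX.xml").getroot()
# lines = dict(text_lines(root))
```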
You can access the original scan of every document on the Gallica website by appending the id of the document (e.g. bpt6k76208717) to the following URL: https://gallica.bnf.fr/ark:/12148/
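The URL rule above can be captured in a one-line helper; this is simply string concatenation of the documented base URL and a document id.

```python
# Sketch: build the Gallica URL for a document id, per the rule stated above.
def gallica_url(doc_id):
    """Return the Gallica ark URL for a document id such as 'bpt6k76208717'."""
    return f"https://gallica.bnf.fr/ark:/12148/{doc_id}"
```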
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This repository contains the dataset of the article "Towards a general open dataset and models for late medieval Castilian writing (HTR/OCR)" submitted to the Journal of Data Mining and Digital Humanities (JDMDH). I refer to the paper (https://doi.org/10.5281/zenodo.7387376) for the description of the corpus and the models.
The dataset is in version V2: it contains both the allographetic and the graphematic transcriptions (files *.normalized.xml), as well as the models.
Caveat: only the allographetic transcriptions and models are described in the data paper mentioned above. The graphematic transcriptions were produced using a Chocomuffin conversion table (see corpus/conversion_table.csv) to reduce each allograph to its corresponding grapheme. The abbreviations are not expanded.
Please cite the following paper if you use this dataset or the models:
@article{gille_levenson_2023_towards,
  author       = {Gille Levenson, Matthias},
  date         = {2023},
  journaltitle = {Journal of Data Mining and Digital Humanities},
  doi          = {10.46298/jdmdh.10416},
  editor       = {Pinche, Ariane and Stokes, Peter},
  issuetitle   = {Special Issue: Historical documents and automatic text recognition},
  title        = {Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR)}
}
GILLE LEVENSON, Matthias, « Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR) », Journal of Data Mining and Digital Humanities (2023): Special Issue: Historical documents and automatic text recognition, eds. Ariane PINCHE and Peter STOKES, DOI: 10.46298/jdmdh.10416.
The image of the manuscript M (Esc_M) has not yet been uploaded, pending permission from the library that keeps the manuscript.
All images are kept in a directory named after the place where the manuscript is kept, and the sigla of the witness for the in-domain dataset.
The global licence for the dataset (except for the images) is CC BY-NC-SA. All manuscript reproductions are published with the authorization of the libraries.
©Biblioteca General Histórica de Salamanca:
- Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2709 (L)
- Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2097 (J)
- Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2673
- Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2011
- Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2654
- Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2086
©Museo Lázaro Galdiano, Madrid:
- Inv. 15304, Fundación Lázaro Galdiano (A)
©Universidad de Valladolid:
- Ms. 251, Biblioteca Santa Cruz (S)
©Real Biblioteca del Escorial:
- Ms. K.I.5, Biblioteca del Real Monasterio del Escorial (Q)
- Ms. h.I.8, Biblioteca del Real Monasterio del Escorial (M): to be published
- Ms. Z-I-12, Ms. Z-III-9, Ms. X-III-4, Ms. h-III-9, Ms. b-IV-15, Ms. b-II-11, Ms. a-II-17, Ms. T-III-5
©Rosenbach Foundation:
- Ms. 482/2 (U)
© Gallica.bnf.fr:
- Espagnol 12, Espagnol 36, Espagnol 218
© Bodleian Library:
- Ms. Span. d. 1, Ms. Span. d. 2/1
© Biblioteca Real, Madrid:
- Ms. II/215 (G)
© Biblioteca Nacional de España:
- Mss/4183, Inc/901 (Z)
© Biblioteca Universitaria, Sevilla:
- Ms. 332/131 (R)
Edit: add result files
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset, in the corpus.zip file, contains OCR texts for 16th century Italian books available from the BNF (folder gallica) and archive.org (folder internetarchive). The source URL is given in the first line of each text file. The pages within each document may come in a different order than in the source; however, the order of lines within each page is preserved. The archive is protected with a password, which will be made public immediately after the publication of the related research paper under the same title.
Samples of three documents along with images of their initial pages are provided at the current time.