8 datasets found
  1. A Dataset of French Trade Directories from the 19th Century (FTD)

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    bin, zip
    Updated Feb 16, 2023
    Cite
    Nathalie Abadie; Stéphane Baciocchi; Edwin Carlinet; Joseph Chazalon; Pascal Cristofoli; Bertrand Duménieu; Julien Perret (2023). A Dataset of French Trade Directories from the 19th Century (FTD) [Dataset]. http://doi.org/10.5281/zenodo.6394464
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nathalie Abadie; Stéphane Baciocchi; Edwin Carlinet; Joseph Chazalon; Pascal Cristofoli; Bertrand Duménieu; Julien Perret
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861.

    The purpose of this dataset is to evaluate the performance of Optical Character Recognition (OCR) and Named Entity Recognition (NER) on 19th century French documents.


    This dataset is divided into two parts:

    1. A labeled dataset, which contains 8,765 manually corrected entries from 78 pages (18 different directories) and is designed for supervised training.
    2. An unlabeled dataset, which contains 1,058,196 raw entries from 6,887 pages (13 different directories) and is designed for self-supervised pre-training.

    For the labeled dataset, we provide:

    • Original pages and cropped images
    • Human-corrected positions, transcriptions and entity tagging for each entry
    • OCR prediction from 3 systems (Tesseract v4, PERO OCR v2020 and Kraken)
    • Projected NER reference from clean text to OCR predictions, making it suitable to evaluate the performance of NER systems on real, noisy OCR predictions

    For the unlabeled dataset, we provide:

    • Automatically detected positions for each entry (noisy)
    • OCR predictions for each entry (PERO OCR engine)

    How to cite this dataset
    Please cite this dataset as:

    N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.

    @dataset{abadie_dataset_22,
    author = {Abadie, Nathalie and
    Bacciochi, St{\'e}phane and
    Carlinet, Edwin and
    Chazalon, Joseph and
    Cristofoli, Pascal and
    Dum{\'e}nieu, Bertrand and
    Perret, Julien},
    title = {{A} {D}ataset of {F}rench {T}rade {D}irectories from the 19th {C}entury ({FTD})},
    month = mar,
    year = 2022,
    publisher = {Zenodo},
    version = {v1.0.0},
    doi = {10.5281/zenodo.6394464},
    url = {https://doi.org/10.5281/zenodo.6394464}
    }


    You may also be interested in our paper presented at DAS 2022 (15th IAPR International Workshop on Document Analysis Systems), which compares the performance of OCR and NER systems on this dataset:

    N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu, A Benchmark of Named Entity Recognition Approaches in Historical Documents — Application to 19th Century French Directories, May 2022, La Rochelle, France, Springer.

    @inproceedings{abadie_das_22,
    author = {Abadie, Nathalie and
    Carlinet, Edwin and
    Chazalon, Joseph and
    Dum{\'e}nieu, Bertrand},
    title = {{A} {B}enchmark of {N}amed {E}ntity {R}ecognition {A}pproaches in {H}istorical {D}ocuments — {A}pplication to 19th {C}entury {F}rench {D}irectories},
    month = may,
    year = 2022,
    publisher = {Springer},
    place = {La Rochelle, France}
    }


    Copyright and License
    The images were extracted from the original source https://gallica.bnf.fr, owned by the Bibliothèque nationale de France (French national library).
    Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.
    Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.
    Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.

    The original contents were significantly transformed before being included in this dataset.
    All derived content is licensed under the permissive Creative Commons Attribution 4.0 International license.

  2. Fictions littéraires de Gallica / Literary fictions of Gallica

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    + more versions
    Cite
    Langlais, Pierre-Carl (2024). Fictions littéraires de Gallica / Literary fictions of Gallica [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4660197
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset authored and provided by
    Langlais, Pierre-Carl
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The collection "Fictions littéraires de Gallica" includes 19,240 public domain documents from the digital platform of the French National Library that were originally classified as novels or, more broadly, as literary fiction in prose. It consists of 372 tables of data in TSV format, one for each year of publication from 1600 to 1996 (all the missing years are in the 17th and 20th centuries). Each table is structured at the page level (5,723,986 pages in all) and contains the complete text of each novel along with some metadata. The tables can be opened in Excel or, preferably, with the data analysis environments of R or Python (tidyverse, pandas…).
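Since the per-year tables are plain TSV files, loading one in Python is straightforward. A minimal sketch follows, using an in-memory sample because the actual column schema of the tables is not listed here; the column names below are invented placeholders.

```python
import io

import pandas as pd

# Invented sample mimicking one per-year TSV; the real tables are structured
# at the page level, but these column names are placeholders, not the
# dataset's documented schema.
sample_tsv = (
    "document\tpage\ttext\n"
    "ark_a\t1\tPremière page du roman\n"
    "ark_a\t2\tDeuxième page\n"
)

df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Each row is one page; grouping by document recovers per-novel page counts.
pages_per_doc = df.groupby("document")["page"].count()
```

With a real table, the same `pd.read_csv(path, sep="\t")` call would apply directly.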

    This corpus can be used for large-scale quantitative analyses in computational humanities. The OCR text is presented in a raw format without any correction or enrichment in order to be directly processed for text mining purposes.

    The extraction is based on a historical categorization of the novels: the Y2 or Ybis classification. This classification, invented in 1730, is the only one that has been continuously applied to the BNF collections now available in the public domain (mainly before 1950). Consequently, the dataset is based on a definition of "novel" that is generally contemporary with the publication.

    A French data paper (in PDF and HTML) presents the construction process of the Y2 category and describes the structuring of the corpus. It also gives several examples of possible uses for computational humanities projects.

  3. A Dataset of French Trade Directories from the 19th Century for Nested NER...

    • zenodo.org
    bin, json
    Updated Jul 20, 2023
    Cite
    Solenn Tual; Nathalie Abadie; Joseph Chazalon; Bertrand Duménieu; Edwin Carlinet (2023). A Dataset of French Trade Directories from the 19th Century for Nested NER task [Dataset]. http://doi.org/10.5281/zenodo.8167628
    Explore at:
    Available download formats: json, bin
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Solenn Tual; Nathalie Abadie; Joseph Chazalon; Bertrand Duménieu; Edwin Carlinet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France, French
    Description

    This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861.

    The purpose of this dataset is to evaluate the performance of Nested Named Entity Recognition approaches on 19th century French documents, regarding both clean and noisy texts (due to the OCR engine).

    Source dataset

    This dataset has been built from the following source dataset:

    N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.

    Our experiments // Paper

    Details about our experiments on nested NER approaches are given in our paper (a pre-print version is available via the HAL link below).

    Tual, S., Abadie, N., Chazalon, J., Duménieu, B., & Carlinet, E. (2023). A Benchmark of Nested NER Approaches in Historical Structured Documents. Proceedings of the 17th International Conference on Document Analysis and Recognition, San José, California, USA. 2023. Springer. https://hal.science/hal-03994759v2

    Our code is available on GitHub.

    Dataset overview

    The following list describes the keys of the JSON file, which contains the complete materials of our experiments.

    - id: Entry unique ID in a given page

    - box: Bounding box of the entry in the scanned directory page

    - book: Source directory of the entry (*see more information below*)

    - page: Page ID in a given directory

    - valid_box: Is the bounding box of the entry valid? (*all bounding boxes are valid here*)

    - text_ocr_ref: OCR-extracted and manually corrected text of the entry

    - nested_ner_xml_ref: text_ocr_ref with nested NER entities

    - text_ocr_pero: OCR-extracted text of the entry with the PERO OCR engine (best engine according to the experiments of Abadie et al.)

    - has_valid_ner_xml_pero: Is the mapping of the hand-annotated nested NER entities from the reference text onto the PERO OCR text correct? (in our experiments, we only use entries with a True value)

    - nested_ner_xml_pero: Annotated noisy entries produced with PERO OCR

    - text_ocr_tess: OCR-extracted text of the entry with the Tesseract engine (*not used in our experiments*)

    - has_valid_ner_xml_tess: Is the mapping of the hand-annotated nested NER entities from the reference text onto the Tesseract text correct? (not used in our experiments)

    - nested_ner_xml_tess: Annotated noisy entries produced with Tesseract (not used in our experiments)

    Nested entities are annotated using XML tags. The entities form a two-level *Part-Of* hierarchy: bottom-level entities are contained in a top-level entity.
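As a sketch of how these keys might be consumed, the snippet below filters entries on has_valid_ner_xml_pero and walks the nested XML annotation; the entry values and the tag names (PER, ACT) are invented for illustration and are not necessarily the dataset's actual tag set.

```python
import xml.etree.ElementTree as ET

# Invented sample entries using the keys described above. A root <entry>
# element wraps the annotated text so it parses as a single XML document;
# the PER/ACT tag names are illustrative only.
entries = [
    {
        "id": 1,
        "has_valid_ner_xml_pero": True,
        "nested_ner_xml_pero": "<entry><PER>Dupont, <ACT>notaire</ACT></PER></entry>",
    },
    {"id": 2, "has_valid_ner_xml_pero": False, "nested_ner_xml_pero": ""},
]

# As in the authors' experiments, keep only entries whose reference
# annotation maps cleanly onto the PERO OCR text.
usable = [e for e in entries if e["has_valid_ner_xml_pero"]]

root = ET.fromstring(usable[0]["nested_ner_xml_pero"])
top_level = [el.tag for el in root]          # top-level entities
nested = [g.tag for el in root for g in el]  # entities nested inside them
```

The two-level Part-Of hierarchy shows up directly in the tree: ACT appears as a child of PER.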

    Source documents // Copyright and licence

    This section has been copied from the original dataset description.

    The images were extracted from the original source https://gallica.bnf.fr, owned by the *Bibliothèque nationale de France* (French national library).

    Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.

    Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.

    Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.

    The original contents were significantly transformed before being included in this dataset.

    All derived content is licensed under the permissive *Creative Commons Attribution 4.0 International* license.


  4. French-PD-Books

    • huggingface.co
    Cite
    PleIAs, French-PD-Books [Dataset]. https://huggingface.co/datasets/PleIAs/French-PD-Books
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    PleIAs
    Area covered
    French
    Description

    🇫🇷 French Public Domain Books 🇫🇷

    French-Public Domain-Books, or French-PD-Books, is a large collection aiming to aggregate all French monographs in the public domain. The collection was originally compiled by Pierre-Carl Langlais, on the basis of a large corpus curated by Benoît de Courson and Benjamin Azoulay for Gallicagram, and in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project giving access to word and ngram search on very large cultural… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/French-PD-Books.

  5. French Fiction of the 16-18th century

    • data.niaid.nih.gov
    Updated Dec 10, 2021
    Cite
    Pierre-Carl Langlais (2021). French Fiction of the 16-18th century [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5770865
    Explore at:
    Dataset updated
    Dec 10, 2021
    Dataset authored and provided by
    Pierre-Carl Langlais
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    A corpus containing all digitized French novels from the beginning of print (the first entry is from 1473) to the 18th century.

    French novels of the period have been identified using the Y2 classification mark (cote) of the French National Library catalog, which has served to classify past and present collections of novels in France from 1730 to 1996. The combined use of digitized sources from Gallica, Google Books, Archive.org and other digital libraries made it possible to attain a high representativeness: 78% of the novels from 1450-1600 and 68% of the novels from 1600-1700 have been retrieved.

    The corpus is part of a planned collection of French Fiction (1050-1920) that will also integrate Geste (a medieval corpus curated by Jean-Baptiste Camps) and Fictions littéraires de Gallica (a 1600-1950 corpus extracted from Gallica with Pierre-Carl Langlais, with a strong focus on the 19th century). While it aims to bridge the two pre-existing parts of the collection, it is also a more ambitious experiment of systematic collection of existing digital sources.

    The project remains very much a work in progress at this stage. Occasional errors in the metadata and in the identification of unique works are still possible. Besides, the identification of multi-volume works remains challenging in digital sources beyond Gallica.

    The repository includes the following files:

    The metadata of available and unavailable files for all novels identified in the 16th century (corpus_roman_metadata_16.tsv) and the 17th century (corpus_roman_metadata_17.tsv). All the editions have been tentatively assigned to a unique work (work_id) based on the title, the author and additional metadata. This dataset includes both information on a specific digitized volume (volume_file, volume_title, volume_date, volume_edition_id) and on the earliest edition of the work recorded by the French national library (first_edition, first_edition_titre, first_edition_date), as well as the identification of the author (prenom_auteur, nom_auteur) and the complete list of all available editions (list_edition_bnf). When digitized files are not available for a given work, the information on the volume is replaced with a missing-data mark (NA). An edition-based dataset was initially contemplated, but it turned out to be much harder than expected: the French National Catalog does not record all the available editions and print runs of the period, and it would have been necessary to check and create unique edition IDs for numerous Google Books volumes.

    The complete text of the novels when available (corpus_roman_16_text.tsv and corpus_roman_17_text.tsv). The use of contemporary OCR software on early modern texts has long yielded poor results, as words, typographies and even letters were markedly different from the corpora these software tools were trained on. Consequently, numerous volumes from Gallica simply have no OCR, as the results were below the quality requirement of the digital library. New historical OCR models will make it possible to create a reliable OCR of the entire corpus. The dataset includes all the text at the page level whenever there is some text on the page. Page numbering is based on the absolute numbering of the file, not on the original numbering of the edition.

    A classified dataset of 159 novels from the 17th century in four major genres of the period: chivalric novel, love novel, historical novel and comic novel. The classification is based on an exceptional source of 1731, the catalog of novels of Nicolas Lenglet du Fresnoy (published as the second volume of De l'usage des romans). The classified dataset includes both the text (as in corpus_roman_17_text.tsv) at the page level and the lemmatization realized with a syntactic model trained on 17th century French (https://github.com/e-ditiones/LEM17).

    A classification model created with the classified dataset. This "Fresnoy" model has a high accuracy (93%), which can be partly attributed to overfitting (as there is a limited number of novels per genre). The model can be reused with Tidysupervise, a small R extension to create supervised text models.
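The NA convention for missing digitized volumes can be handled directly when loading the metadata tables. A sketch follows; the rows are invented and only three of the documented columns (work_id, volume_file, nom_auteur) are shown.

```python
import io

import pandas as pd

# Invented rows mimicking corpus_roman_metadata_16.tsv; only a few of the
# documented columns are shown. "NA" marks works with no digitized volume.
sample = (
    "work_id\tvolume_file\tnom_auteur\n"
    "w1\tvol_001.txt\tRabelais\n"
    "w2\tNA\tCrenne\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t", na_values=["NA"])

# Works whose volume columns carry the NA mark have no available digitization.
digitized = df[df["volume_file"].notna()]
```

Declaring `na_values=["NA"]` explicitly documents the file's convention, although pandas also treats "NA" as missing by default.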

  6. Dataset for Logical-layout analysis on historical newspapers

    • data.niaid.nih.gov
    Updated Dec 3, 2021
    Cite
    Atanassova Iana (2021). Dataset for Logigal-layout analysis on historical newspapers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5560765
    Explore at:
    Dataset updated
    Dec 3, 2021
    Dataset provided by
    Gutehrlé Nicolas
    Atanassova Iana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for Logical-layout analysis on French Historical Newspapers

    This is a dataset for training and testing logical-layout analysis and recognition system on French historical documents. The original data is part of the "Fond régional: Franche Comté", which is curated by Gallica, the digital portal of the Bibliothèque nationale de France (BnF).

    Description

    This dataset is divided into a train and a test set. The train and test sets have been designed to cover as much as possible the various layouts that exist in the "Fond régional: Franche Comté" dataset. To do so, we have divided the documents into three layout types:

    1c: documents where the text is displayed in one column, as in books;

    2c: documents where the text is displayed into two columns;

    3c+: documents where there are at least 3 columns of text, as in newspapers.

    Each of these folders contains subfolders starting with the letters 'cb'. These are the identifiers of newspaper collections such as « Le Petit Semeur ». An XML file describing the collection is contained in each of these folders, but it is not relevant to the logical-layout analysis task. They also contain subfolders starting with the letters 'bpt', which contain the following files:

    XXX.xml: the original XML file as gathered from Gallica.

    truelabels_block: a CSV file giving the true label for each TextBlock tag. Each line contains the page, the block_id, the first and last lines of text of the block, and its label.

    truelabels_line: a CSV file giving the true label for each TextLine tag. Each line contains the page, the line_id, the text of the line, and its label.

    XXX_docbook.xml: the document after processing by a logical-layout recognition system.

    The original XML gathers various pieces of information about the document, especially metadata (described using the Dublin Core schema), the page numbering and the OCR, which is described in the XML ALTO format. As such, the files already provide the physical-layout analysis and the reading order of the documents.

    The XML ALTO format provides the text content and physical layout of documents in the following manner. The OCR output for the whole document is available in a PrintSpace tag. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:

    Id: the tag's identifier

    Height, Width: the text height and width

    Vpos: the vertical position of the text on the page. The higher the value, the lower the text is on the page

    Hpos: the horizontal position of the text on the page. The higher the value, the further on the right the text is on the page

    Language: the language of the text (only for TextBlock tags).

    The blocks of text are labelled either as Text, Title, Header or Other. The lines of text are labelled either as Text, Firstline (to indicate the first line of a paragraph), Title, Header or Other. These labels are used in the truelabels_line, truelabels_block and XXX_docbook.xml files.
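The ALTO structure described above can be walked with a standard XML parser. A minimal sketch over a hand-written fragment follows; the namespace that real Gallica ALTO files declare is omitted here for brevity, and attribute names are written in upper case as in the ALTO standard.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO-like fragment (namespace omitted). With real files, the
# tag lookups below would need to include the declared ALTO namespace.
alto = """
<alto>
  <PrintSpace>
    <TextBlock ID="b1" HPOS="50" VPOS="100">
      <TextLine ID="l1" HPOS="50" VPOS="100">
        <String CONTENT="Le"/><SP/><String CONTENT="Petit"/><SP/><String CONTENT="Semeur"/>
      </TextLine>
    </TextBlock>
  </PrintSpace>
</alto>
"""

root = ET.fromstring(alto)

# Rebuild each line's text by joining the CONTENT of its String children,
# keyed by the TextLine identifier.
lines = {
    tl.get("ID"): " ".join(s.get("CONTENT") for s in tl.findall("String"))
    for tl in root.iter("TextLine")
}
```

The same traversal, with VPOS/HPOS comparisons, is what a logical-layout system would use to reason about reading order.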

    You can access the original scan of every document on the Gallica website: append the id of the document (e.g. bpt6k76208717) to the following URL: https://gallica.bnf.fr/ark:/12148/

  7. Towards a general open dataset and model for late medieval Castilian text...

    • data.niaid.nih.gov
    Updated Oct 16, 2023
    + more versions
    Cite
    Matthias Gille Levenson (2023). Towards a general open dataset and model for late medieval Castilian text recognition (HTR/OCR). Datasets and scripts [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7386489
    Explore at:
    Dataset updated
    Oct 16, 2023
    Dataset authored and provided by
    Matthias Gille Levenson
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This repository contains the dataset of the article "Towards a general open dataset and models for late medieval Castilian writing (HTR/OCR)" submitted to the Journal of Data Mining and Digital Humanities (JDMDH). I refer to the paper (https://doi.org/10.5281/zenodo.7387376) for the description of the corpus and the models.

    The dataset is in version V2: it contains both the allographetic and graphematic transcriptions (files *.normalized.xml) and models. Caveat: only the allographetic transcriptions and models are described in the data paper mentioned above. The graphematic transcriptions are produced using a Chocomuffin conversion table (see corpus/conversion_table.csv) to reduce each allograph to its corresponding grapheme. The abbreviations are not expanded.

    Please cite the following paper if you use this dataset or the models:

    @article{gille_levenson_2023_towards,
    author = {Gille Levenson, Matthias},
    date = {2023},
    journaltitle = {Journal of Data Mining and Digital Humanities},
    doi = {10.46298/jdmdh.10416},
    editor = {Pinche, Ariane and Stokes, Peter},
    issuetitle = {Special Issue: Historical documents and automatic text recognition},
    title = {Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR)}
    }

    GILLE LEVENSON, Matthias, "Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR)", Journal of Data Mining and Digital Humanities (2023): Special Issue: Historical documents and automatic text recognition, eds. Ariane PINCHE and Peter STOKES, DOI: 10.46298/jdmdh.10416.

    The image of the manuscript M (Esc_M) has not yet been uploaded, pending permission from the library that keeps the manuscript. All images are kept in a directory named after the place where the manuscript is kept, plus the sigla of the witness for the in-domain dataset.

    The global licence for the dataset (except for images) is CC-BY-NC-SA. All manuscript reproductions are published with the authorization of the libraries:

    © Biblioteca General Histórica de Salamanca
    - Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2709 (L)
    - Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2097 (J)
    - Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2673
    - Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2011
    - Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2654
    - Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2086

    © Museo Lázaro Galdiano, Madrid
    - Inv. 15304, Fundación Lázaro Galdiano (A)

    © Universidad de Valladolid
    - Ms. 251, Biblioteca Santa Cruz (S)

    © Real Biblioteca del Escorial
    - Ms. K.I.5, Biblioteca del Real Monasterio del Escorial (Q)
    - Ms. h.I.8, Biblioteca del Real Monasterio del Escorial (M): to be published
    - Ms. Z-I-12, Ms. Z-III-9, Ms. X-III-4, Ms. h-III-9, Ms. b-IV-15, Ms. b-II-11, Ms. a-II-17, Ms. T-III-5

    © Rosenbach Foundation
    - Ms. 482/2 (U)

    © Gallica.bnf.fr
    - Espagnol 12, Espagnol 36, Espagnol 218

    © Bodleian Library
    - Ms. Span. d. 1, Ms. Span. d. 2/1

    © Biblioteca Real, Madrid
    - Ms. II/215 (G)

    © Biblioteca Nacional de España
    - Mss/4183, Inc/901 (Z)

    © Biblioteca Universitaria, Sevilla
    - Ms. 332/131 (R)

    Edit: add result files

  8. Methodology of diachronic analysis of old prints and its validation by...

    • zenodo.org
    jpeg, txt, zip
    Updated Dec 16, 2024
    + more versions
    Cite
    Mariusz Kamola; Joanna Dimke-Kamola (2024). Methodology of diachronic analysis of old prints and its validation by tracing the changing meaning of key concepts in the intellectual debate of 16th century Italy [Dataset]. http://doi.org/10.5281/zenodo.14502233
    Explore at:
    Available download formats: jpeg, txt, zip
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mariusz Kamola; Joanna Dimke-Kamola
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Italy
    Description

    This dataset, in the corpus.zip file, contains OCR texts of 16th century Italian books available from the BNF (folder gallica) and archive.org (folder internetarchive). The source URL is given on the first line of each text file. The pages within each document may come in a different order than in the source; however, the order of lines within each page is preserved. The archive is protected with a password, which will be made public immediately after the publication of the related research paper of the same title.

    Samples of three documents along with images of their initial pages are provided at the current time.
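Once the password is released and corpus.zip extracted, the stated convention (source URL on the first line of each text file) suggests recovering provenance as sketched below; the snippet writes a throwaway file rather than touching the real archive, and the URL it contains is invented.

```python
import os
import tempfile

# Stand-in for one extracted text file from the gallica/ or internetarchive/
# folders; the URL is invented for illustration.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "doc1.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("https://gallica.bnf.fr/ark:/12148/example\nFirst line of OCR text...\n")

    # Per the dataset description, the first line of each file is its source URL.
    with open(path, encoding="utf-8") as f:
        source_url = f.readline().strip()
```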

