72 datasets found
  1. Data from: ICDAR 2021 Competition on Historical Map Segmentation — Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin
    Updated May 30, 2021
    Cite
    Joseph Chazalon; Edwin Carlinet; Yizi Chen; Julien Perret; Bertrand Duménieu; Clément Mallet; Thierry Géraud (2021). ICDAR 2021 Competition on Historical Map Segmentation — Dataset [Dataset]. http://doi.org/10.5281/zenodo.4817662
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    May 30, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joseph Chazalon; Edwin Carlinet; Yizi Chen; Julien Perret; Bertrand Duménieu; Clément Mallet; Thierry Géraud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ICDAR 2021 Competition on Historical Map Segmentation — Dataset

    This is the dataset of the ICDAR 2021 Competition on Historical Map Segmentation (“MapSeg”).
    This competition ran from November 2020 to April 2021.
    Evaluation tools are freely available but distributed separately.

    Official competition website: https://icdar21-mapseg.github.io/

    The competition report can be cited as:

    Joseph Chazalon, Edwin Carlinet, Yizi Chen, Julien Perret, Bertrand Duménieu, Clément Mallet, Thierry Géraud, Vincent Nguyen, Nam Nguyen, Josef Baloun, Ladislav Lenc, and Pavel Král, "ICDAR 2021 Competition on Historical Map Segmentation", in Proceedings of the 16th International Conference on Document Analysis and Recognition (ICDAR'21), September 5-10, 2021, Lausanne, Switzerland.

    BibTeX entry:

    @InProceedings{chazalon.21.icdar.mapseg,
     author  = {Joseph Chazalon and Edwin Carlinet and Yizi Chen and Julien Perret and Bertrand Duménieu and Clément Mallet and Thierry Géraud and Vincent Nguyen and Nam Nguyen and Josef Baloun and Ladislav Lenc and Pavel Král},
     title   = {ICDAR 2021 Competition on Historical Map Segmentation},
     booktitle = {Proceedings of the 16th International Conference on Document Analysis and Recognition (ICDAR'21)},
     year   = {2021},
     address  = {Lausanne, Switzerland},
    }

    We thank the City of Paris for granting us permission to use and reproduce the atlases used in this work.

    The images of this dataset are extracted from a series of 9 atlases of the City of Paris produced between 1894 and 1937 by the Map Service (“Service du plan”) of the City of Paris, France, for the purpose of urban management and planning. For each year, a set of approximately 20 sheets forms a tiled view of the city, drawn at 1/5000 scale using trigonometric triangulation.

    Sample citation of original documents:

    Atlas municipal des vingt arrondissements de Paris. 1894, 1895, 1898, 1905, 1909, 1912, 1925, 1929, and 1937. Bibliothèque de l’Hôtel de Ville. City of Paris. France.

    Motivation

    This competition aims at encouraging research on the digitization of historical maps. To be usable in historical studies, the information contained in such images needs to be extracted. The general pipeline involves multiple stages; we list some essential ones here:

    • segment map content: locate the area of the image which contains map content;
    • extract map objects from different layers: detect objects like roads, buildings, building blocks, rivers, etc. to create geometric data;
    • georeference the map: by detecting objects at known geographic coordinates, compute the transformation that turns geometric objects into geographic ones (which can be overlaid on current maps); a sketch follows this list.
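    To make the georeferencing step concrete, here is a minimal sketch (not part of the competition tools) that fits an affine pixel-to-geographic transform by least squares; every coordinate below is a made-up value for illustration only.

    # Minimal sketch: estimate an affine georeferencing transform from
    # ground-control-point correspondences. All coordinates are invented;
    # real ones would come from, e.g., graticule-line intersections at
    # known geographic positions.
    import numpy as np

    # (pixel_x, pixel_y) -> (geo_x, geo_y) correspondences
    pixels = np.array([[120, 80], [950, 85], [130, 1010], [940, 1000]], dtype=float)
    geo = np.array([[600100.0, 128900.0], [600600.0, 128890.0],
                    [600110.0, 128400.0], [600590.0, 128410.0]])

    # Solve geo = [px, py, 1] @ A in the least-squares sense.
    design = np.hstack([pixels, np.ones((len(pixels), 1))])
    transform, *_ = np.linalg.lstsq(design, geo, rcond=None)

    def to_geo(px, py):
        """Map a pixel position to an (approximate) geographic position."""
        return np.array([px, py, 1.0]) @ transform

    print(to_geo(500.0, 500.0))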

    Task overview

    • Task 1: Detection of building blocks
    • Task 2: Segmentation of map content within map sheets
    • Task 3: Localization of graticule lines intersections

    Please refer to the enclosed README.md file or to the official website for the description of tasks and file formats.

    Evaluation metrics and tools

    Evaluation metrics are described in the competition report and tools are available at https://github.com/icdar21-mapseg/icdar21-mapseg-eval and should also be archived using Zenodo.

  2. NIST TREC Document Database: Disk 4 - NIST Special Database 22

    • data.nist.gov
    • data.wu.ac.at
    Updated Jan 1, 1996
    Cite
    Ellen Voorhees (1996). NIST TREC Document Database: Disk 4 - NIST Special Database 22 [Dataset]. https://data.nist.gov/pdr/lps/FF429BC178638B3EE0431A570681E858210
    Explore at:
    Dataset updated
    Jan 1, 1996
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Ellen Voorhees
    License

    https://www.nist.gov/open/license

    Description

    A collection of full-text documents from various sources including the Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994). These documents are part of the document set for several TREC information retrieval test collections.

  3. NIST TREC Document Database: Disk 5 - NIST Special Database 23

    • data.nist.gov
    Updated Jan 1, 1997
    Cite
    Ellen Voorhees (1997). NIST TREC Document Database: Disk 5 - NIST Special Database 23 [Dataset]. https://data.nist.gov/pdr/lps/FF429BC178648B3EE0431A570681E858211
    Explore at:
    Dataset updated
    Jan 1, 1997
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Ellen Voorhees
    License

    https://www.nist.gov/open/license

    Description

    A collection of full-text English documents from various sources including the Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990). These documents make up part of the document set for several TREC information retrieval test collections.

  4. Dataset Versatile Layout Understanding via Conjugate Graph (ICDAR 2019)

    • zenodo.org
    • data.europa.eu
    zip
    Updated May 16, 2020
    Cite
    Hervé Déjean (2020). Dataset Versatile Layout Understanding via Conjugate Graph (ICDAR 2019) [Dataset]. http://doi.org/10.5281/zenodo.3828954
    Explore at:
    Available download formats: zip
    Dataset updated
    May 16, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hervé Déjean
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contains the datasets with tables used in the following paper:

    Versatile Layout Understanding via Conjugate Graph. Animesh Prasad, Hervé Déjean and Jean-Luc Meunier, International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019.

  5. CEP-7K

    • huggingface.co
    Cite
    Kexin Technology, CEP-7K [Dataset]. https://huggingface.co/datasets/Kexin-Technology/CEP-7K
    Explore at:
    Dataset authored and provided by
    Kexin Technology
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CEP-7K

      Dataset Details
    

    • Dataset Full Name: Chinese College Entrance Exam Papers
    • Dataset Size: 7K
    • Language: Chinese
    • License: MIT

      Dataset Description
    

    CEP-7K (Chinese College Entrance Exam Papers-7K) is the competition dataset for ICDAR 2025 Competition on Understanding Chinese College Entrance Exam Papers from The 19th International Conference on Document Analysis and Recognition (ICDAR2025). This dataset consists of 7,000 question-answer pairs derived from… See the full description on the dataset page: https://huggingface.co/datasets/Kexin-Technology/CEP-7K.

  6. Data from: Sparse Machine Learning Methods for Understanding Large Text...

    • data.nasa.gov
    • gimi9.com
    • +3more
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Sparse Machine Learning Methods for Understanding Large Text Corpora [Dataset]. https://data.nasa.gov/dataset/sparse-machine-learning-methods-for-understanding-large-text-corpora
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Sparse machine learning has recently emerged as a powerful tool for obtaining models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers of aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, "Sparse Machine Learning Methods for Understanding Large Text Corpora," Proceedings of the Conference on Intelligent Data Understanding, 2011.

  7. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. cited below. We excluded the DeGruyter dataset from testing and used it as our validation dataset instead.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a tesseract script to run text extraction from detected text rows. This is inside our code archive (code.tar) as text_recognition_multipro.py.
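    The released script is the reference implementation; purely as an illustration of the idea, a minimal sketch using the pytesseract wrapper (an assumption; the authors' script may drive the tesseract binary differently) could look like this:

    # Minimal sketch: run Tesseract on cropped text rows. This is NOT
    # text_recognition_multipro.py; it only illustrates the approach.
    import pytesseract
    from PIL import Image

    def recognize_rows(image_path, row_boxes):
        """row_boxes: (left, top, right, bottom) pixel boxes from a
        text detector such as EAST."""
        page = Image.open(image_path)
        texts = []
        for box in row_boxes:
            crop = page.crop(box)
            # --psm 7: treat the crop as a single text line.
            texts.append(pytesseract.image_to_string(crop, config="--psm 7").strip())
        return texts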

    We used a Java tool provided by Falk Böschen and adapted it to our file structure. We included this as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  8. ICDAR2015 competition on smartphone document capture and OCR (SmartDoc) -...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Jean-Christophe Burie; Joseph Chazalon; Mickael Coustaty; Sebastien Eskenazi; Muhammad Muzzamil Luqman; Maroua Mehri; Nibal Nayef; Jean-Marc OGIER; Sophea Prum; Marçal Rusinol (2020). ICDAR2015 competition on smartphone document capture and OCR (SmartDoc) - Challenge 2 [Dataset]. http://doi.org/10.5281/zenodo.2572929
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jean-Christophe Burie; Joseph Chazalon; Mickael Coustaty; Sebastien Eskenazi; Muhammad Muzzamil Luqman; Maroua Mehri; Nibal Nayef; Jean-Marc OGIER; Sophea Prum; Marçal Rusinol
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ICDAR2015 competition on smartphone document capture and OCR (SmartDoc)

    Challenge 2: MOBILE OCR COMPETITION

    The goal of the competition is to extract the textual content from document images captured by mobile phones. The images are taken under varying conditions to provide challenging input. The dataset was prepared for the ICDAR2015-SmartDoc competition. For more details about the dataset, please visit the competition's website:

    https://sites.google.com/site/icdar15smartdoc/home

    http://smartdoc.univ-lr.fr

    You may also refer to the following paper for more details on the ICDAR2015-SmartDoc competition:

    Jean-Christophe Burie, Joseph Chazalon, Mickaël Coustaty, Sébastien Eskenazi, Muhammad Muzzamil Luqman, Maroua Mehri, Nibal Nayef, Jean-Marc OGIER, Sophea Prum and Marçal Rusinol: “ICDAR2015 Competition on Smartphone Document Capture and OCR (SmartDoc)”, In 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.

    If you use this dataset, please send us a short email at

  9. Find it again! Dataset

    • kaggle.com
    zip
    Updated Mar 28, 2025
    Cite
    Nikita2998 (2025). Find it again! Dataset [Dataset]. https://www.kaggle.com/datasets/nikita2998/find-it-again-dataset/code
    Explore at:
    Available download formats: zip (673906266 bytes)
    Dataset updated
    Mar 28, 2025
    Authors
    Nikita2998
    Description

    The Find it again! dataset contains 988 scanned images of receipts and their transcriptions, originating from the scanned receipts OCR and information extraction (SROIE) dataset. Among these images, 163 have undergone realistic fraudulent modifications. The dataset includes ground truth information for distinguishing between forged and authentic receipts. It also provides annotations on the fraudulent modifications, including details about the entities that have been modified and the location of the forgeries.

    Data Collection and Forgery

    Find it again! aims to address the limitations of existing forgery detection datasets by providing a collection of labeled and annotated documents suitable for both image-based and content-based forgery detection approaches. This novel dataset contains diverse receipts, encompassing different layouts, fonts, styles and document characteristics encountered in real-world scenarios that have undergone pseudo-realistic manual forgeries.

    SROIE Dataset Annotation

    One characteristic of the scanned receipts of this dataset is that some have been modified, either digitally or manually, with different types of annotations. These annotations are not considered as forgeries. Even though the documents have been modified, they are still authentic, as they have not undergone any forgery, and the modification doesn't compromise the meaning of the receipts.

    Provided Annotations

    The annotations in the Find it again! dataset provide detailed information about the forged and authentic receipts, including:

    • Location of fraudulent modifications on the receipts (bounding boxes)
    • Type of the entities that have been modified in the forgeries
    • Ground truth labels indicating whether a receipt is authentic or forged
    • Annotations on whether an authentic SROIE receipt contains a manual or digital alteration

    Reference

    Beatriz Martínez Tornés, Théo Taburet, Emanuela Boros, Kais Rouis, Petra Gomez-Krämer, Nicolas Sidere, Antoine Doucet and Vincent Poulain d'Andecy. Receipt Dataset for Document Forgery Detection. In Proceedings of The 17th International Conference on Document Analysis and Recognition, August 21-26, 2023 — San José, California, USA.

  10. KIE-HVQA

    • huggingface.co
    Updated Aug 23, 2025
    Cite
    bytedance-research (2025). KIE-HVQA [Dataset]. https://huggingface.co/datasets/bytedance-research/KIE-HVQA
    Explore at:
    Dataset updated
    Aug 23, 2025
    Dataset provided by
    ByteDance (https://www.bytedance.com/)
    Authors
    bytedance-research
    Description

    KIE-HVQA

      Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models
    

    Data for the paper Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

      What's New
    

    [2025/09/19] The paper has been accepted by NeurIPS 2025 Main Conference.

      Introduction
    

    Recent advancements in multimodal large language models have significantly improved document understanding by integrating textual and visual… See the full description on the dataset page: https://huggingface.co/datasets/bytedance-research/KIE-HVQA.

  11. NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set

    • data.nist.gov
    • catalog.data.gov
    Updated Jan 1, 1996
    Cite
    Ellen M. Voorhees (1996). NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set [Dataset]. http://doi.org/10.18434/t47g6m
    Explore at:
    Dataset updated
    Jan 1, 1996
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Ellen M. Voorhees
    License

    https://www.nist.gov/open/license

    Description

    A collection of full-text documents from various sources including the Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), the Federal Register (1994), the Foreign Broadcast Information Service (1996), and the Los Angeles Times (1989, 1990). These documents are the document set for several TREC information retrieval test collections. (Data contains document text only.)

  12. HIV AND AIDS RESEARCH AND BEST PRACTICES DISSEMINATION CONFERENCE - Dataset...

    • dms.hiv.health.gov.mw
    Updated Sep 19, 2022
    Cite
    (2022). HIV AND AIDS RESEARCH AND BEST PRACTICES DISSEMINATION CONFERENCE - Dataset - The Document Management System [Dataset]. https://dms.hiv.health.gov.mw/dataset/hiv-and-aids-research-and-best-practices-dissemination-conference
    Explore at:
    Dataset updated
    Sep 19, 2022
    Description

    HIV and AIDS Research Papers and Best practices

  13. TREC 1999 Adhoc Dataset

    • data.nist.gov
    • gimi9.com
    • +1more
    Updated Oct 31, 2024
    Cite
    Ian Soboroff (2024). TREC 1999 Adhoc Dataset [Dataset]. http://doi.org/10.18434/mds2-3620
    Explore at:
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Ian Soboroff
    License

    https://www.nist.gov/open/license

    Description

    The ad hoc retrieval task investigates the performance of systems that search a static set of documents using new questions (called topics in TREC). This task is similar to how a researcher might use a library - the collection is known but the questions likely to be asked are not known. NIST provides the participants approximately 2 gigabytes worth of documents and a set of 50 natural language topic statements. The participants produce a set of queries from the topic statements and run those queries against the documents. The output from this run is the official test result for the ad hoc task. Participants return the best 1000 documents retrieved for each topic to NIST for evaluation. The dataset comprises the documents, the topics, and the annotations of relevant documents.
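    For context, TREC runs are conventionally exchanged as six-column text files (topic, the literal string "Q0", document ID, rank, score, run tag). A minimal sketch of writing such a file, with invented document IDs:

    # Minimal sketch: write ranked retrieval results in the conventional
    # TREC run format. Document IDs below are invented for illustration.
    def write_trec_run(results, run_tag, path):
        """results: dict mapping topic id -> list of (doc_id, score),
        sorted by decreasing score, at most 1000 entries per topic."""
        with open(path, "w") as out:
            for topic, ranked in results.items():
                for rank, (doc_id, score) in enumerate(ranked, start=1):
                    out.write(f"{topic} Q0 {doc_id} {rank} {score:.4f} {run_tag}\n")

    write_trec_run({401: [("FT911-3032", 12.7), ("LA053190-0123", 11.9)]},
                   "myrun", "myrun.txt")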

  14. RVLCDIP

    • kaggle.com
    zip
    Updated Jul 25, 2023
    + more versions
    Cite
    abdellatif sassioui (2023). RVLCDIP [Dataset]. https://www.kaggle.com/datasets/abdellatifsassioui/rvlcdip
    Explore at:
    Available download formats: zip (38800183304 bytes)
    Dataset updated
    Jul 25, 2023
    Authors
    abdellatif sassioui
    Description

    RVL-CDIP Dataset

    The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.

    For questions and comments please contact Adam Harley (aharley@scs.ryerson.ca).

    CHANGELOG

    05/JUN/2015 First version of the dataset

    DETAILS

    The label files list the images and their categories in the following format:

    path/to/the/image.tif category

    where the categories are numbered 0 to 15, in the following order:

    0 letter
    1 form
    2 email
    3 handwritten
    4 advertisement
    5 scientific report
    6 scientific publication
    7 specification
    8 file folder
    9 news article
    10 budget
    11 invoice
    12 presentation
    13 questionnaire
    14 resume
    15 memo
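    A minimal sketch of parsing such a label file into (path, category name) pairs, assuming exactly the layout described above:

    # Minimal sketch: read an RVL-CDIP label file ("path/to/image.tif N").
    CATEGORIES = [
        "letter", "form", "email", "handwritten", "advertisement",
        "scientific report", "scientific publication", "specification",
        "file folder", "news article", "budget", "invoice",
        "presentation", "questionnaire", "resume", "memo",
    ]

    def read_labels(label_file):
        samples = []
        with open(label_file) as f:
            for line in f:
                path, label = line.rsplit(maxsplit=1)
                samples.append((path, CATEGORIES[int(label)]))
        return samples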

    CITATION

    If you use this dataset, please cite:

    A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015

    Bibtex format:

    @inproceedings{harley2015icdar,
     title     = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
     author    = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
     booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
     year      = {2015}
    }

    FURTHER INFORMATION

    This dataset is a subset of the IIT-CDIP Test Collection 1.0 [1]. The file structure of this dataset is the same as in the IIT collection, so it is possible to refer to that dataset for OCR and additional metadata. The IIT-CDIP dataset is itself a subset of the Legacy Tobacco Document Library [2].

    [1] D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, "Building a test collection for complex document information processing," in Proc. 29th Annual Int. ACM SIGIR Conference (SIGIR 2006), pp. 665-666, 2006.
    [2] The Legacy Tobacco Document Library (LTDL), University of California, San Francisco, 2007. http://legacy.library.ucsf.edu/.

    More information about this dataset can be obtained at the following URL: http://scs.ryerson.ca/~aharley/rvl-cdip/

  15. Dataset of ICDAR 2019 Competition on Post-OCR Text Correction

    • live.european-language-grid.eu
    • zenodo.org
    • +1more
    txt
    Updated Sep 12, 2022
    Cite
    (2022). Dataset of ICDAR 2019 Competition on Post-OCR Text Correction [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7738
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 12, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Corpus for the ICDAR2019 Competition on Post-OCR Text Correction (October 2019)
    Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
    http://l3i.univ-larochelle.fr/ICDAR2019PostOCR

    These are the supplementary materials for the ICDAR 2019 paper "ICDAR 2019 Competition on Post-OCR Text Correction". Please use the following citation:

    @inproceedings{rigaud2019pocr,
     title     = {ICDAR 2019 Competition on Post-OCR Text Correction},
     author    = {Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
     year      = {2019},
     booktitle = {Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
    }

    Description: The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and from external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource.

    Repartition of the dataset:

    • ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.
    • ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset, used for the evaluation (with the Gold Standard made public after the competition).
    • ICDAR2019_Post_OCR_correction_full_22M: full dataset, made publicly available after the competition.

    Special case for the Finnish language: material from the National Library of Finland (Finnish dataset FI > FI1) may not be re-shared on other websites. Please follow these guidelines to get and format the data from the original website:

    1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en;
    2. Download "OCR Ground Truth Pages (Finnish Fraktur) [v1]" (4.8 GB) from the Digitalia (2015-17) package;
    3. Convert the Excel file "~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx" to comma-separated format (.csv) using the save-as function of a spreadsheet application (e.g. Excel, Calc) and copy it into "FI/FI1/HOWTO_get_data/input/";
    4. Go to "FI/FI1/HOWTO_get_data/" and run "script_1.py" to generate the full "FI1" dataset in "output/full/";
    5. Run "script_2.py" to split the "output/full/" dataset into "output/training/" and "output/evaluation/" subsets.

    At the end of the process, you should have a "training", "evaluation" and "full" folder with 1,579,528, 380,817 and 1,960,345 characters respectively.
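    As an aside, the spreadsheet conversion in step 3 can also be scripted; a minimal sketch assuming pandas and openpyxl are installed (the exported CSV dialect may differ slightly from a spreadsheet application's):

    # Scripted alternative to the manual Excel-to-CSV conversion (step 3).
    # Requires: pip install pandas openpyxl
    import pandas as pd

    xlsx = "metadata/nlf_ocr_gt_tescomb5_2017.xlsx"
    csv = "FI/FI1/HOWTO_get_data/input/nlf_ocr_gt_tescomb5_2017.csv"
    pd.read_excel(xlsx).to_csv(csv, index=False)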

    Licenses: free to use for non-commercial uses, according to the sources detailed below.

    • BG1: IMPACT - National Library of Bulgaria: CC BY NC ND
    • CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA
    • DE1: Front pages of Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)
    • DE2: IMPACT - German National Library: CC BY NC ND
    • DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
    • EN1: IMPACT - British Library: CC BY NC SA 3.0
    • ES1: IMPACT - National Library of Spain: CC BY NC SA
    • FI1: National Library of Finland: no re-sharing allowed; follow the section above to get the data. (https://digi.kansalliskirjasto.fi/opendata)
    • FR1: HIMANIS Project: CC0 (https://www.himanis.org)
    • FR2: IMPACT - National Library of France: CC BY NC SA 3.0
    • FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)
    • NL1: IMPACT - National Library of the Netherlands: CC BY
    • PL1: IMPACT - National Library of Poland: CC BY
    • SL1: IMPACT - Slovak National Library: CC BY NC

    Text post-processing such as cleaning and alignment has been applied to the resources mentioned above, so the Gold Standard and the OCRs provided are not necessarily identical to the originals.

    Structure:

    Content [./lang_type/sub_folder/#.txt]
    • "[OCR_toInput] " => raw OCRed text to be de-noised.
    • "[OCR_aligned] " => aligned OCRed text.
    • "[ GS_aligned] " => aligned Gold Standard text.

    The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using "@" symbols. "#" symbols correspond to the absence of GS, either due to alignment uncertainties or to unreadable characters in the source document. For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.

    The error rate and the quality of the alignment vary according to the nature and state of degradation of the source documents. Periodicals (mostly historical newspapers), for example, have been reported to be especially challenging due to their complex layouts and original fonts. It should also be mentioned that the quality of the Gold Standard varies, as the dataset aggregates resources from different projects that each have their own annotation procedure, and it obviously contains some errors.
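    To make the format concrete, here is a minimal sketch that reads one aligned file and computes a naive character error rate; it assumes exactly the three bracketed prefixes described above, one per line:

    # Minimal sketch: naive character error rate from one aligned file.
    def aligned_error_rate(path):
        lines = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                tag, _, text = line.partition("] ")
                lines[tag.strip("[ ")] = text.rstrip("\n")
        ocr, gs = lines["OCR_aligned"], lines["GS_aligned"]
        # "#" marks positions with no usable Gold Standard; skip them.
        pairs = [(o, g) for o, g in zip(ocr, gs) if g != "#"]
        errors = sum(o != g for o, g in pairs)
        return errors / len(pairs)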

    ICDAR2019 competition: information related to the tasks, formats and evaluation metrics is detailed at https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation

    References:

    • IMPACT, European Commission's 7th Framework Program, grant agreement 215064
    • Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
    • https://digi.nationallibrary.fi, Wiipuri, 31.12.1904, Digital Collections of the National Library of Finland
    • EU Horizon 2020 research and innovation programme, grant agreement No 770299

    Contact:

    • christophe.rigaud(at)univ-lr.fr
    • antoine.doucet(at)univ-lr.fr
    • mickael.coustaty(at)univ-lr.fr
    • jean-philippe.moreux(at)bnf.fr

    L3i - University of La Rochelle, http://l3i.univ-larochelle.fr
    BnF - French National Library, http://www.bnf.fr

  16. Data from: ICDAR 2013

    • datasets.activeloop.ai
    • opendatalab.com
    deeplake
    Cite
    Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, ICDAR 2013 [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/icdar-2013-dataset/
    Explore at:
    Available download formats: deeplake
    Authors
    Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura.
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The ICDAR 2013 dataset contains 462 images of text in natural scenes and is a popular benchmark for text detection and recognition research. It was introduced at the International Conference on Document Analysis and Recognition (ICDAR) in 2013.

  17. Text Mining of Conference Abstracts: Documentation of a Working Process in...

    • live.european-language-grid.eu
    csv
    Updated Apr 11, 2022
    Cite
    (2022). Text Mining of Conference Abstracts: Documentation of a Working Process in the Digital Humanities. A workshop report [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18325
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 11, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies a study on the mining of conference abstracts. The accompanying article describes the digital workflow of automatically extracting various entities from a corpus of conference abstracts and performing network analysis on the results in order to find relationships. The aim is to identify the spread and diversity of tools, the representation of institutions and federations, and the thematic and personal networks of speakers at a German-speaking DH conference. Particular emphasis is placed on the documentation of the entire process, which can be interpreted as an initiative to develop interdisciplinary standards in this field.

    Included in the dataset are

    • a training corpus (Trainingskorpuserweitert.csv)
    • a configuration file (Dariah6.prop)
    • a model file produced with the Stanford toolkit (dariah-6ner-model2.ser.gz); see the sketch after this list
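    A minimal sketch of loading the model file with NLTK's Stanford tagger wrapper, assuming the .ser.gz file is a Stanford NER CRF model (as its name suggests) and that stanford-ner.jar and a Java runtime are available locally:

    # Minimal sketch: tag tokens with the distributed CRF model via NLTK.
    # stanford-ner.jar is NOT part of this dataset; paths are assumptions.
    from nltk.tag import StanfordNERTagger

    tagger = StanfordNERTagger(
        "dariah-6ner-model2.ser.gz",  # model file included in this dataset
        "stanford-ner.jar",           # from the Stanford NER distribution
    )
    tokens = "Die Tagung fand in Bern statt".split()
    print(tagger.tag(tokens))  # [(token, entity_label), ...]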

  18. Data from: MDIW-13: New Database and Benchmark for Script Identification

    • data-staging.niaid.nih.gov
    Updated Jul 17, 2024
    Cite
    Miguel A. Ferrer; Abhijit Das; Moises Diaz; Aythami Morales; Cristina Carmona-Duarte; Umapada Pal (2024). MDIW-13: New Database and Benchmark for Script Identification [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6343657
    Explore at:
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Griffith University, Gold Coast, Queensland, Australia
    Information Sciences Institute, University of Southern California, USA
    Universidad Autonoma de Madrid, Spain
    Instituto Universitario para el Desarrollo Tecnológico y la Innovación en Comunicaciones, Universidad de Las Palmas de Gran Canaria, Campus de Tafira, Las Palmas de Gran Canaria, Spain
    Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
    Authors
    Miguel A. Ferrer; Abhijit Das; Moises Diaz; Aythami Morales; Cristina Carmona-Duarte; Umapada Pal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Script identification is a necessary step in some applications involving document analysis in a multi-script and multi-language environment. This paper provides a new database for benchmarking script identification algorithms, which contains both printed and handwritten documents collected from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents scanned from local newspapers and handwritten letters and notes from different native writers. Further, these documents are segmented into lines and words, comprising a total of 13,979 and 86,655 lines and words, respectively, in the dataset. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods. The benchmark includes results at the document, line, and word levels with printed and handwritten documents. Results of script identification independent of the document/line/word level and independent of the printed/handwritten letters are also given.

    https://www.dropbox.com/s/vtmy0l4gjxun0oe/Multiscript_SIW_Database_Feb25_acceptedPaper.zip?dl=0

    Please cite our work if you find the database useful:

    M. A. Ferrer, A. Das, M. Diaz, A. Morales, C. Carmona-Duarte, U. Pal (2022), "MDIW-13: New Database and Benchmark for Script Identification", Multimedia Tools and Applications, Pages 1-14. Accepted

    A. Das, M. A. Ferrer, A. Morales, M. Diaz, U. Pal, et al. "SIW 2021: ICDAR Competition on Script Identification in the Wild". 16th International Conference on Document Analysis and Recognition (ICDAR 2021). Lecture Notes in Computer Science, vol 12824. Springer. Sep. 5-10, 2021, Lausanne, Switzerland, pp. 738-753. doi: 10.1007/978-3-030-86337-1_49

  19. ICDAR'15 SMARTPHONE DOCUMENT CAPTURE AND OCR COMPETITION (SmartDoc) -...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
    Joseph Chazalon; Marçal Rusiñol (2020). ICDAR'15 SMARTPHONE DOCUMENT CAPTURE AND OCR COMPETITION (SmartDoc) - Challenge 1 (original version) [Dataset]. http://doi.org/10.5281/zenodo.1230218
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joseph Chazalon; Marçal Rusiñol
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CHALLENGE 1: SMARTPHONE DOCUMENT CAPTURE COMPETITION

    Smartphones are replacing personal scanners. They are portable, connected, powerful and affordable, and they are on their way to becoming the new entry point in business processing applications like document archival, ID scanning and check digitization, to name a few. In order to keep our workflows streamlined, we need to make these new capture devices as reliable as batch scanners.

    We believe an efficient capture process should be able to:

    1. detect and segment the relevant document object during the preview phase;
    2. assess the quality of the capture conditions and help the user improve them;
    3. optionally, trigger the capture at the perfect moment;
    4. and produce a high-quality, controlled output based on the high resolution captured image.

    This competition is focused on the first step of this process: efficiently detecting and segmenting document regions, as illustrated by the following video showing the ideal output for the preview phase of an acquisition session: Click here to watch the video. The video shows the ideal document object detection (i.e. the ground truth, as a red frame).

    For this challenge, the input consists of a set of video clips, each containing a document from a predefined set, and the output should be an XML file containing, for each frame of the video, the coordinates of the quadrilateral in which the document can be found. Click here for detailed information about the dataset.
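    The precise XML schema is given in the dataset documentation linked above; purely for illustration, the following sketch serializes per-frame quadrilaterals with hypothetical element and attribute names.

    # Minimal sketch of writing per-frame quadrilaterals to XML.
    # Element/attribute names here are HYPOTHETICAL; use the official
    # schema from the competition documentation for submissions.
    import xml.etree.ElementTree as ET

    def quads_to_xml(quads, path):
        """quads: one list of four (x, y) corner tuples per frame."""
        root = ET.Element("segmentation")
        for index, quad in enumerate(quads, start=1):
            frame = ET.SubElement(root, "frame", index=str(index))
            for x, y in quad:
                ET.SubElement(frame, "point", x=str(x), y=str(y))
        ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)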

    Licence for the dataset of challenge 1 (page outline detection in preview frames):

    This work is licensed under a Creative Commons Attribution 4.0 International License <http://creativecommons.org/licenses/by/4.0/>. Author attribution should be given by citing the following conference paper: Jean-Christophe Burie, Joseph Chazalon, Mickaël Coustaty, Sébastien Eskenazi, Muhammad Muzzamil Luqman, Maroua Mehri, Nibal Nayef, Jean-Marc OGIER, Sophea Prum and Marçal Rusinol: “ICDAR2015 Competition on Smartphone Document Capture and OCR (SmartDoc)”, In 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.

    If you use this dataset, please send us a short email at

  20. Stamp Verification (StaVer) Dataset

    • kaggle.com
    zip
    Updated Apr 11, 2018
    Cite
    Rachael Tatman (2018). Stamp Verification (StaVer) Dataset [Dataset]. https://www.kaggle.com/datasets/rtatman/stamp-verification-staver-dataset/code
    Explore at:
    Available download formats: zip (1994977452 bytes)
    Dataset updated
    Apr 11, 2018
    Authors
    Rachael Tatman
    Description

    Context:

    An automatic system for stamp segmentation and further verification is needed, especially in environments like insurance companies where a huge volume of documents is processed daily. However, detecting a general stamp is not a trivial task, as stamps can have different shapes and colors and, moreover, can be imprinted with variable quality and rotation. This dataset was collected to help researchers build such a system.

    Content:

    This dataset contains 400 scanned document images. The documents are automatically generated invoices that were printed, stamped and scanned at 200 dpi resolution. They include color logos and color text, which makes the evaluation results more realistic. There are stamps of many different shapes and colors, including black ones; sometimes the stamps overlap with signatures or text. Some documents contain multiple stamps, others none at all. The ground truth consists of binary images with masks of the stamp strokes, which allows for accurate pixel-wise evaluation. This dataset contains the following folders, each with 400 items (one for each image):

    • scans: scans of the stamped genuine documents
    • ground-truth-maps: maps defining the region of the stamp(s)
    • ground-truth-pixel: pixel-level ground truth
    • info: contains text files with the info for each file. Each info file contains the following information:
      • signature [0|1]: signature present [0] or not [1]
      • textOverlap [0|1]: stamps overlap with printed text [1]
      • numStamps [0|...|n]: number of stamps on the page
      • bwStamp[1|...|n]: stamp[1|...|n] is black [1] or colored [0]

    In addition, there is a .pdf file with all the images in one file. The complete dataset (including scans with higher resolution) can be found here.
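    A minimal sketch of reading one of the info files, assuming a simple "key value" line layout (the exact syntax in the released files may differ):

    # Minimal sketch: parse a StaVer info file into a dict, e.g.
    # {"signature": 0, "textOverlap": 1, "numStamps": 2, "bwStamp1": 1}.
    def parse_info(path):
        info = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    key, value = parts
                    info[key] = int(value)
        return info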

    Acknowledgements:

    This dataset was collected by Barbora Micenková and Joost van Beusekom. If you use this dataset in your work, please cite the following paper:

    Micenková, B., & van Beusekom, J. (2011, September). Stamp detection in color document images. In 2011 International Conference on Document Analysis and Recognition (ICDAR) (pp. 1125-1129). IEEE.

    Inspiration:

    • Can you segment just the stamps from the background text?
    • Can you use OCR techniques to identify the stamped text?