35 datasets found
  1. ocr-pdf-degraded

    • huggingface.co
    Cite
    racine.ai, ocr-pdf-degraded [Dataset]. https://huggingface.co/datasets/racineai/ocr-pdf-degraded
    Explore at:
    Dataset provided by
    racine.ai
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    OCR-PDF-Degraded Dataset

      Overview
    

    This dataset contains synthetically degraded document images paired with their ground truth OCR text. It addresses a critical gap in OCR model training by providing realistic document degradations that simulate real-world conditions encountered in production environments.

      Purpose
    

    Most OCR models are trained on relatively clean, perfectly scanned documents. However, in real-world applications, especially in the military/defense… See the full description on the dataset page: https://huggingface.co/datasets/racineai/ocr-pdf-degraded.

  2. Noisy OCR Dataset (NOD)

    • zenodo.org
    bin
    Updated Jul 6, 2021
    Cite
    Thomas Hegghammer (2021). Noisy OCR Dataset (NOD) [Dataset]. http://doi.org/10.5281/zenodo.5068735
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas Hegghammer
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).

    Source images

    The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.

    Artificial noise application

    The dataset was created as follows:
    - First a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
    - Then six ideal types of image noise --- "blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains" --- were applied both to the colour version and the binary version of the images, thus creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
    - Lastly, all available combinations of *two* noise filters were applied to the colour and binary images, for an additional 30 versions.

    This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.

    The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See the dataset page for how to unpack the .tar.lzma files.
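
    The noise generation itself was done in R (the scripts ship with the repository). As a rough illustration of what one layer of degradation looks like in code, the following Python sketch approximates two of the ideal noise types, "blur" and "salt and pepper", using Pillow and NumPy; file names and parameter values are illustrative and are not those used for NOD.

    # Illustrative approximation of two NOD noise types; not the original R code.
    import numpy as np
    from PIL import Image, ImageFilter

    def add_blur(img, radius=2.0):
        # Gaussian blur as a stand-in for the "blur" filter.
        return img.filter(ImageFilter.GaussianBlur(radius))

    def add_salt_and_pepper(img, amount=0.02):
        # Flip a random fraction of pixels to black ("pepper") or white ("salt").
        arr = np.array(img.convert("L"))
        mask = np.random.rand(*arr.shape)
        arr[mask < amount / 2] = 0
        arr[mask > 1 - amount / 2] = 255
        return Image.fromarray(arr)

    page = Image.open("page_0001.png")  # hypothetical input scan
    add_salt_and_pepper(add_blur(page)).save("page_0001_blur_saltpepper.png")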

    References:

    Barcha, Pedro. 2017. "Old Books Dataset." GitHub repository. https://github.com/PedroBarcha/old-books-dataset.

    Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. "Yarmouk Arabic OCR Dataset." In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150-54. IEEE.

    Hegghammer, Thomas. 2021. "OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment." SocArXiv. https://osf.io/preprints/socarxiv/6zfvs

  3. pdfa-eng-wds

    • huggingface.co
    Updated Mar 30, 2024
    Cite
    Pixel Parsing (2024). pdfa-eng-wds [Dataset]. https://huggingface.co/datasets/pixparse/pdfa-eng-wds
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Pixel Parsing
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for PDF Association dataset (PDFA)

      Dataset Summary
    

    The PDFA dataset is a document dataset filtered from the SafeDocs corpus, also known as CC-MAIN-2021-31-PDF-UNTRUNCATED. The original corpus was intended for comprehensive PDF document analysis; this subset instead focuses on making the data machine-learning-ready for vision-language models.

    An example page of one pdf document, with added bounding boxes… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/pdfa-eng-wds.
    
  4. OMR-scanned-documents

    • huggingface.co
    Updated Nov 2, 2023
    Cite
    sananse (2023). OMR-scanned-documents [Dataset]. https://huggingface.co/datasets/saurabh1896/OMR-scanned-documents
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 2, 2023
    Authors
    sananse
    Description

    A medical forms dataset containing scanned documents is a valuable resource for healthcare professionals, researchers, and institutions seeking to streamline and improve their administrative and patient care processes. This dataset comprises digitized versions of various medical forms, such as patient intake forms, consent forms, health assessment questionnaires, and more, which have been scanned for electronic storage and easy access. These scanned medical forms preserve the layout and… See the full description on the dataset page: https://huggingface.co/datasets/saurabh1896/OMR-scanned-documents.

  5. Invoices for Document AI

    • kaggle.com
    Updated Aug 11, 2022
    Cite
    Holt Skinner (2022). Invoices for Document AI [Dataset]. https://www.kaggle.com/datasets/holtskinner/invoices-document-ai
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 11, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Holt Skinner
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Invoices in TIFF Format processed through Document AI Invoice Parser in Document.json format.

    Source of TIFF Files: https://www.kaggle.com/datasets/manishthem/text-extraction-for-ocr

    Document.json Structure

    {
      "mimeType": string,
      "text": string,
      "pages": [
        {
          "pageNumber": integer,
          "image": {
            "content": string,
            "mimeType": string,
            "width": integer,
            "height": integer
          },
          "dimension": {
            "width": number,
            "height": number,
            "unit": string
          },
          "layout": {
            "textAnchor": {
              "textSegments": [
                {
                  "startIndex": string,
                  "endIndex": string
                }
              ],
            },
            "boundingPoly": {
              "vertices": [
                {
                  "x": integer,
                  "y": integer
                }
              ],
              "normalizedVertices": [
                {
                  "x": number,
                  "y": number
                }
              ]
            },
            "orientation": enum
          },
          "detectedLanguages": [
            {
              "languageCode": string,
              "confidence": number
            }
          ],
          "blocks": [
            {
              "layout": {}
            }
          ],
          "paragraphs": [
            {
              "layout": {}
            }
          ],
          "lines": [
            {
              "layout": {}
            }
          ],
          "tokens": [
            {
              "layout": {}
            }
          ]
        }
      ],
      "entities": [
        {
          "textAnchor": {},
          "type": string,
          "mentionText": string,
          "mentionId": string,
          "confidence": number,
          "pageAnchor": {
            "pageRefs": [
              {
                "page": string,
                "layoutType": enum,
                "layoutId": string,
                "boundingPoly": {},
                "confidence": number
              }
            ]
          },
          "id": string,
          "normalizedValue": {
            "text": string,
            "moneyValue": {},
            "dateValue": {},
            "datetimeValue": {},
            "addressValue": {},
            "booleanValue": boolean,
            "integerValue": integer,
            "floatValue": number
          },
          "properties": [
            {}
          ]
        }
      ]
    }
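
    For orientation, here is a minimal sketch of reading one such Document.json file and listing the extracted entities. The field names follow the structure above, while the file name and the "total_amount" entity type are only examples.

    import json

    # File name is illustrative; field names follow the Document.json structure above.
    with open("invoice_0001.document.json", encoding="utf-8") as f:
        doc = json.load(f)

    # Each entity carries a type, the raw mention text and a confidence score.
    for entity in doc.get("entities", []):
        print(f"{entity.get('type', ''):<25} "
              f"{entity.get('mentionText', ''):<40} "
              f"{entity.get('confidence', 0.0):.2f}")

    # Normalised values (dates, money amounts, ...) live under "normalizedValue".
    totals = [e["normalizedValue"].get("text")
              for e in doc.get("entities", [])
              if e.get("type") == "total_amount" and "normalizedValue" in e]
    print("total_amount candidates:", totals)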
    
  6. Post-OCR correction training dataset sPeriodika-postOCR - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 13, 2025
    + more versions
    Cite
    (2025). Post-OCR correction training dataset sPeriodika-postOCR - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/b9062910-c8b6-55ca-9c2e-f0957867a1bb
    Explore at:
    Dataset updated
    Aug 13, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    The post-OCR correction dataset consists of paragraphs of text, at least 100 characters in length, extracted from documents randomly sampled from the sPeriodika dataset (http://hdl.handle.net/11356/1881) of Slovenian historical periodicals. From each document five paragraphs were randomly sampled; paragraphs longer than 500 characters were trimmed to that length. The correction was performed by one human annotator with access to the scan of the original document. Of the original collection of 450 paragraphs, 41 were discarded because they contained non-running text or very poor-quality OCR. The metadata in the CSV dataset are the following:
    - URN of the document
    - link to the original PDF in dLib
    - name of the periodical
    - publisher of the periodical
    - publication date
    - original text
    - corrected text
    - line offset (zero-indexed)
    - character length of the paragraph (trimmed to max. 500 characters)
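
    A minimal pandas sketch for inspecting such a file is shown below; the file name and column headers ("original text", "corrected text") are assumptions based on the field list above and may differ in the published CSV.

    import difflib
    import pandas as pd

    # File name and column headers are assumptions taken from the field list above.
    df = pd.read_csv("speriodika_postocr.csv")

    def char_similarity(row):
        # Character-level similarity between the raw OCR text and the human correction.
        return difflib.SequenceMatcher(None, str(row["original text"]),
                                       str(row["corrected text"])).ratio()

    df["similarity"] = df.apply(char_similarity, axis=1)
    print(df["similarity"].describe())  # rough picture of how noisy the OCR was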

  7. idl-wds

    • huggingface.co
    Updated Mar 30, 2024
    Cite
    Pixel Parsing (2024). idl-wds [Dataset]. https://huggingface.co/datasets/pixparse/idl-wds
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Pixel Parsing
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for Industry Documents Library (IDL)

      Dataset Summary
    

    Industry Documents Library (IDL) is a document dataset filtered from the UCSF documents library, with 19 million pages kept as valid samples. Each document exists as a collection of a PDF, a TIFF image rendering the same contents, a JSON file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.

  8. A dataset for temporal analysis of files related to the JFK case

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 24, 2020
    Cite
    Markus Luczak-Roesch (2020). A dataset for temporal analysis of files related to the JFK case [Dataset]. http://doi.org/10.5281/zenodo.1042154
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Markus Luczak-Roesch
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.

    The code to derive the dataset is given as follows:

    ### BEGIN R DATA PROCESSING SCRIPT

    library(tesseract)
    library(pdftools)

    # All source PDF files.
    pdfs <- list.files("[path to your output directory containing all PDF files]")

    # The meta file containing all metadata for the PDF files (e.g. publication date).
    meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv",
                      header = TRUE, sep = ",")

    meta$Doc.Date <- as.character(meta$Doc.Date)

    # Drop records with an empty or invalid ("/0000") publication date.
    meta.clean <- meta[-which(meta$Doc.Date == "" | grepl("/0000", meta$Doc.Date)), ]

    # Normalise the remaining dates: replace "00" day/month components and expand
    # two-digit years to the full %m/%d/%Y format.
    for (i in 1:nrow(meta.clean)) {
      meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])
      if (nchar(meta.clean$Doc.Date[i]) < 10) {
        meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i], format = "%d/%m/%y"), "%m/%d/%Y")
      }
    }

    meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date, format = "%m/%d/%Y")
    meta.clean <- meta.clean[order(meta.clean$Doc.Date), ]

    # OCR each PDF page by page and collect the text together with the publication date.
    docs <- data.frame(content = character(0), dpub = character(0), stringsAsFactors = FALSE)
    for (i in 1:nrow(meta.clean)) {
      pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i])))

      # One temporary image file per page (point this at a local tmp directory).
      tmp_files <- c()
      for (k in 1:pdf_prop$pages) {
        tmp_files <- c(tmp_files, paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/", k))
      }

      img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i])),
                                        format = "tiff", pages = NULL, dpi = 700, filenames = tmp_files)

      # Run Tesseract (English) on every page image and concatenate the results.
      txt <- ""
      for (j in 1:length(img_file)) {
        extract <- ocr(img_file[j], engine = tesseract("eng"))
        txt <- paste(txt, extract, collapse = " ")
      }

      # Strip punctuation, collapse whitespace, lower-case, and store the text with its publication date.
      docs <- rbind(docs,
                    data.frame(content = iconv(tolower(gsub("\\s+", " ", gsub("[[:punct:]]|[ ]", " ", txt))), to = "UTF-8"),
                               dpub = format(meta.clean$Doc.Date[i], "%Y/%m/%d"),
                               stringsAsFactors = FALSE),
                    stringsAsFactors = FALSE)
    }

    write.table(docs, "[path to your output directory]/documents.csv", row.names = FALSE)

    ### END R DATA PROCESSING SCRIPT

  9. Mdptesting Dataset

    • universe.roboflow.com
    zip
    Updated Mar 8, 2022
    Cite
    new-workspace-c0qwl (2022). Mdptesting Dataset [Dataset]. https://universe.roboflow.com/new-workspace-c0qwl/mdptesting/model/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 8, 2022
    Dataset authored and provided by
    new-workspace-c0qwl
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Symbol Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Quality Control in Manufacturing: Use the model to automate the process of detecting and classifying symbols on products or parts in a manufacturing line, ensuring products are correctly labeled and meet quality standards.

    2. Packaging and Inventory Management: This model can be utilized in warehouses for efficient inventory management by identifying the symbols on package labels, thus automating the process of package sorting and tracking.

    3. Optical Character Recognition (OCR): The model can be used in an OCR system to identify and classify different types of symbols found in scanned documents, pdf files, or photos of documents, aiding in data extraction and digitization efforts.

    4. Augmented Reality (AR) Apps: The model could be employed in AR applications to identify real-world symbols in the user's environment, allowing the app to interact with them or provide additional context-based information.

    5. Automation in Retail: The model can be used in self-checkout systems in stores to identify and classify product symbols, streamlining the checkout process by automatically identifying purchased items based on their symbols.

  10. OpenDoc-Pdf-Preview

    • huggingface.co
    Updated Jun 25, 2025
    Cite
    Prithiv Sakthi (2025). OpenDoc-Pdf-Preview [Dataset]. https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview
    Explore at:
    Dataset updated
    Jun 25, 2025
    Authors
    Prithiv Sakthi
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    OpenDoc-Pdf-Preview

    OpenDoc-Pdf-Preview is a compact visual preview dataset containing 6,000 high-resolution document images extracted from PDFs. This dataset is designed for Image-to-Text tasks such as document OCR pretraining, layout understanding, and multimodal document analysis.

      Dataset Summary
    

    Modality: Image-to-Text
    Content Type: PDF-based document previews
    Number of Samples: 6,000
    Language: English
    Format: Parquet
    Split: train only
    Size: 606 MB
    License: Apache… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview.

  11. MultiFinBen-SpanishOCR

    • huggingface.co
    Updated May 16, 2025
    Cite
    The Fin AI (2025). MultiFinBen-SpanishOCR [Dataset]. https://huggingface.co/datasets/TheFinAI/MultiFinBen-SpanishOCR
    Explore at:
    Dataset updated
    May 16, 2025
    Dataset authored and provided by
    The Fin AI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for SpanishOCR Dataset

      Dataset Summary
    

    The SpanishOCR dataset contains images derived from regulatory documents issued by the Peruvian government in PDF format. It is used for benchmarking and evaluating the ability of large language models to convert unstructured documents, such as PDFs and images, into a machine-readable format, particularly in the finance domain, where the conversion task is more complex and valuable.

      Supported Tasks
    

    Task: Image-to-Text… See the full description on the dataset page: https://huggingface.co/datasets/TheFinAI/MultiFinBen-SpanishOCR.

  12. Kiswahili-Tz-Hansard

    • explore.openaire.eu
    Updated Jun 13, 2022
    Cite
    Brian Muhia (2022). Kiswahili-Tz-Hansard [Dataset]. http://doi.org/10.5281/zenodo.6643278
    Explore at:
    Dataset updated
    Jun 13, 2022
    Authors
    Brian Muhia
    Description

    This is a dataset of publicly available Tanzania Hansard documents in Kiswahili. It contains 2735 PNG images of pages from PDF documents, together with text files containing transcripts produced by the OCR tool tesseract-ocr. The images were obtained by converting the PDF files to images with ImageMagick. The intended use is to study how improvements in language/word-sequence modelling can improve OCR in a low-resource setting, and to serve as a record of the accuracy of pre-existing OCR tools that use language models before any other methods are applied.
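
    A minimal sketch of reproducing one transcript with the same tool chain might look as follows, assuming pytesseract as the Python front end to tesseract-ocr and the Swahili language pack ("swa") installed; the file names are illustrative.

    from PIL import Image
    import pytesseract

    # OCR one scanned Hansard page with the Swahili traineddata ("swa").
    page = Image.open("hansard_page_0001.png")  # hypothetical page image
    text = pytesseract.image_to_string(page, lang="swa")

    with open("hansard_page_0001.txt", "w", encoding="utf-8") as f:
        f.write(text)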

  13. Text from pdfs found on data.gouv.fr

    • gimi9.com
    + more versions
    Cite
    Text from pdfs found on data.gouv.fr [Dataset]. https://gimi9.com/dataset/eu_5ec45f516a58eec727e79af7/
    Explore at:
    Area covered
    France
    Description

    Text extracted from pdfs found on data.gouv.fr.

    This dataset contains text extracted from 6602 files that have the 'pdf' extension in the resource catalogue of data.gouv.fr. It covers only the PDFs of 20 MB or less that are still available at the indicated URL. The extraction was done with PDFBox via its Python wrapper python-PDFBox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if the output of the PDFBox conversion to text is smaller than 20 bytes, the file is considered to be an image, and OCR is carried out instead with Tesseract via its Python wrapper pyocr. The result is one 'txt' file per 'pdf', sorted by organisation (the organisation that published the resource). There are 175 organisations in this dataset, hence 175 folders. The name of each file corresponds to the string '{id-du-dataset}--{id-de-la-resource}.txt'.

    Input: the catalogue of data.gouv.fr resources.

    Output: text files for each 'pdf' resource found in the catalogue that was successfully converted and satisfied the above constraints. The tree is as follows:

    .
    ACTION_Nogent-sur-Marne
      53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
      53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
      53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
      53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
      53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
      ...
    Aeroport_La_Rochelle-Ile_de_Re
    Agency_de_services_and_payment_ASP
    Agency_du_Numerique
    ...

    Distribution of texts (as of 20 May 2020): the ten organisations with the largest number of documents are:

    [('Les_Lilas', 1294), ('Ville_de_Pirae', 1099), ('Region_Hauts-de-France', 592), ('Ressourcerie_datalocale', 297), ('NA', 268), ('CORBION', 244), ('Education_Nationale', 189), ('Incubator_of_Services_Numeriques', 157), ('Ministere_des_Solidarites_and_de_la_Sante', 148), ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

    A 2D preview of the texts (HashFeatures + TruncatedSVD + t-SNE) is shown on the original dataset page ("Plot t-SNE of DGF texts").

    Code: the Python scripts used for this extraction are linked from the original dataset page.

    Remarks: due to the quality of the original PDFs (low-resolution scans, misaligned pages, ...) and the limits of the pdf-to-txt conversion methods, the results can be very noisy.
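
    A rough sketch of the size heuristic described above is given below. It calls PDFBox through its standalone jar and falls back to OCR via pdf2image and pytesseract; the original pipeline used python-PDFBox and pyocr instead, so treat the tool choices, paths and the "fra" language code as assumptions.

    import subprocess
    from pathlib import Path

    import pytesseract
    from pdf2image import convert_from_path

    PDFBOX_JAR = "pdfbox-app.jar"  # assumed location of the PDFBox standalone jar

    def extract_text(pdf_path, txt_path):
        # Plain-text extraction with PDFBox's ExtractText command.
        subprocess.run(["java", "-jar", PDFBOX_JAR, "ExtractText",
                        str(pdf_path), str(txt_path)], check=True)
        # Heuristic from the description: a result under 20 bytes means the PDF
        # is an image (scan, map, ...), so fall back to OCR.
        if txt_path.stat().st_size < 20:
            pages = convert_from_path(str(pdf_path), dpi=300)
            text = "\n".join(pytesseract.image_to_string(p, lang="fra") for p in pages)
            txt_path.write_text(text, encoding="utf-8")

    extract_text(Path("resource.pdf"), Path("resource.txt"))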

  14. Barnes Ice Cap South Dome Trilateration Net Survey Data 1970-1984, Version 1...

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). Barnes Ice Cap South Dome Trilateration Net Survey Data 1970-1984, Version 1 - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/barnes-ice-cap-south-dome-trilateration-net-survey-data-1970-1984-version-1-7057f
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Area covered
    Barnes Ice Cap
    Description

    The Barnes Ice Cap data set contains survey measurements of a network of 43 stakes along a 10 km flow line on the northeast flank of the south dome of the Barnes Ice Cap. The measurements are of mass balance, surface velocity, and surface elevation, and were taken from 1970 to 1984. The data set came from a hard copy computer printout containing raw data as well as processed quantities. This printout was scanned and digitized into a PDF file, which was then put through Optical Character Recognition (OCR) software and saved as another PDF file. The resultant PDF file is human readable and all values are correct when viewed in an Adobe PDF reader. However, if you copy the contents and paste them into another application, there may be errors in the values because the OCR process did not recognize all characters correctly. If you copy the data values into another application for analysis, double-check them against what is in the PDF file. The data are available via FTP.

  15. Nordic Exceptionalism Datasprint Datasets

    • zenodo.org
    Updated Aug 28, 2023
    Cite
    Lars Kjær (2023). Nordic Exceptionalism Datasprint Datasets [Dataset]. http://doi.org/10.5281/zenodo.7913055
    Explore at:
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lars Kjær
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Nordic countries
    Description

    The Nordic Exceptionalism Datasprint Datasets consist of two files with text data prepared for the Nordic Exceptionalism Datasprint held on 11 May 2023 at the University of Copenhagen, South Campus.

    The text data is partly OCR text from scanned PDF files that has been proofread so it is close to perfect, and partly text data from Gutenberg.org.

    The first file, "nordic corpus files incl abbyy finerader files.zip", holds a biography file as well as multiple folders, each containing an ABBYY FineReader file, a txt file, a Word file, and a searchable PDF file if the folder holds a text that has been OCR-processed with ABBYY FineReader.

    The second file, "nordic corpus txt files prepared for voyant tools.zip", holds a biography file and multiple txt files prepared for use in Voyant Tools. These files are named beginning with the year of publication.

  16. Data from: Making the Case for Process Analytics: A Use Case in Court Proceedings

    • data.mendeley.com
    Updated Aug 15, 2025
    Cite
    Milda Aleknonyte-Resch (2025). Making the Case for Process Analytics: A Use Case in Court Proceedings [Dataset]. http://doi.org/10.17632/3mcvbrhr7c.2
    Explore at:
    Dataset updated
    Aug 15, 2025
    Authors
    Milda Aleknonyte-Resch
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Data was extracted in PDF format with personal information redacted to ensure privacy. The raw dataset consisted of 260 cases from three chambers within a single German social law court. The data originates from a single judge, who typically oversees five to six chambers, meaning that this dataset represents only a subset of the judge's total caseload.

    Optical Character Recognition (OCR) was used to extract the document text, which was organized into an event log according to the tabular structure of the documents. A single timestamp is recorded for each activity, commonly indicating only the date of occurrence rather than a precise time. This limits the granularity of time-based analyses and the accuracy of calculated activity durations. As the analysis focuses on the overall durations of cases, which typically range from multiple months to years, the impact of the timestamp imprecision was negligible in this use case.

    After extraction, the event log was further processed in consultation with domain experts to ensure anonymity, remove noise, and raise it to an abstraction level appropriate for analysis. All remaining personal identifiers, such as expert witness names, were removed from the log, and timestamps were systematically perturbed to further enhance data privacy. Originally, the event log contained 22,664 recorded events and 290 unique activities. Activities that occurred fewer than 30 times were excluded to focus on frequently observed procedural steps. The domain experts also reviewed the list of unique activity labels, based on which similar activities were merged and terminology was standardized across cases; this refinement reduced the number of unique activities to 59. Finally, duplicate events were removed. These steps collectively reduced the dataset to 19,947 events. The final anonymized and processed dataset includes 260 cases, 19,947 events from three chambers, and 59 unique activities.
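
    A minimal pandas sketch of the rare-activity filtering and de-duplication steps described above, with hypothetical column names ("case_id", "activity", "timestamp") that may not match the published log:

    import pandas as pd

    # Column names are hypothetical; adjust them to the published event log.
    log = pd.read_csv("court_event_log.csv", parse_dates=["timestamp"])

    # Drop activities observed fewer than 30 times, as in the preprocessing above.
    counts = log["activity"].value_counts()
    log = log[log["activity"].isin(counts[counts >= 30].index)]

    # Remove exact duplicate events.
    log = log.drop_duplicates(subset=["case_id", "activity", "timestamp"])

    print(f"{log['case_id'].nunique()} cases, {len(log)} events, "
          f"{log['activity'].nunique()} unique activities")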

  17. Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 3, 2025
    + more versions
    Cite
    (2025). Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/0406e495-3d78-5cb2-9e5b-32b9dbba1e82
    Explore at:
    Dataset updated
    Aug 3, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th, 19th, and beginning of the 20th century (1771-1914). The periodical issues were retrieved from the Slovenian national library's digital library service (https://dlib.si) in the form of OCR-ed PDF and TXT files. Before the documents were linguistically annotated (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts, cf. http://hdl.handle.net/11356/1907. The documents in the collection are enriched with the following metadata obtained from dLib:
    - Document ID (URN)
    - Periodical name
    - Document (periodical issue) title
    - Volume number (if available)
    - Issue number (if available)
    - Year of publication
    - Date of publication (of varying granularity, based on original metadata available)
    - Source (URL of the original digitised document available at dlib.si)
    - Image (see below)
    - Quality (see below)
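
    For orientation, here is a minimal sketch of the kind of annotation applied, using the CLASSLA-Stanza Python package's documented stanza-style interface; this illustrates the tool, not the exact pipeline or configuration used to build sPeriodika.

    import classla

    # Download the standard Slovenian models once, then build a pipeline with the
    # processors named above: tokenisation, PoS tagging, lemmatisation and NER.
    classla.download("sl")
    nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,ner")

    doc = nlp("Ljubljanske novice so izšle leta 1797.")
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.lemma, word.upos)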

  18. Barnes Ice Cap South Dome Trilateration Net Survey Data 1970-1984, Version 1...

    • catalog.data.gov
    • datasets.ai
    • +5more
    Updated Jul 11, 2025
    + more versions
    Cite
    NSIDC (2025). Barnes Ice Cap South Dome Trilateration Net Survey Data 1970-1984, Version 1 [Dataset]. https://catalog.data.gov/dataset/barnes-ice-cap-south-dome-trilateration-net-survey-data-1970-1984-version-1-3a72c
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    NSIDC
    Area covered
    Barnes Ice Cap
    Description

    The Barnes Ice Cap data set contains survey measurements of a network of 43 stakes along a 10 km flow line on the northeast flank of the south dome of the Barnes Ice Cap. The measurements are of mass balance, surface velocity, and surface elevation, and were taken from 1970 to 1984. The data set came from a hard copy computer printout containing raw data as well as processed quantities. This printout was scanned and digitized into a PDF file, which was then put through Optical Character Recognition (OCR) software and saved as another PDF file. The resultant PDF file is human readable and all values are correct when viewed in an Adobe PDF reader. However, if you copy the contents and paste them into another application, there may be errors in the values because the OCR process did not recognize all characters correctly. If you copy the data values into another application for analysis, double-check them against what is in the PDF file. The data are available via FTP.

  19. Labelled data for fine tuning a geological Named Entity Recognition and...

    • ckan.publishing.service.gov.uk
    • metadata.bgs.ac.uk
    • +1more
    Updated Aug 19, 2025
    Cite
    ckan.publishing.service.gov.uk (2025). Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model [Dataset]. https://ckan.publishing.service.gov.uk/dataset/labelled-data-for-fine-tuning-a-geological-named-entity-recognition-and-entity-relation-extract
    Explore at:
    Dataset updated
    Aug 19, 2025
    Dataset provided by
    CKAN (https://ckan.org/)
    Description

    This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were an extract from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488).

    The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be run in inference mode to produce the labels automatically, thereby creating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format of the doccano open-source text annotation software (https://doccano.github.io/doccano/) used to create the labels.

    The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These documents had to undergo OCR, which resulted in lower-quality text and therefore lower-quality training data; the majority of the labelled data comes from the higher-quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so it should not be treated as a gold-standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts were supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
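
    A minimal sketch of reading such a file is shown below, assuming doccano's usual JSONL(Relation) export layout with "text", "entities" and "relations" keys per line; the file name is illustrative and the exact field names should be checked against the actual export.

    import json

    # Field names follow doccano's JSONL(Relation) export convention (assumption).
    with open("geology_ner_re.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            text = record["text"]
            entities = {e["id"]: e for e in record.get("entities", [])}
            for rel in record.get("relations", []):
                head, tail = entities[rel["from_id"]], entities[rel["to_id"]]
                print(text[head["start_offset"]:head["end_offset"]],
                      f'--{rel["type"]}-->',
                      text[tail["start_offset"]:tail["end_offset"]])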

  20. olmOCR-bench

    • huggingface.co
    Updated Jul 23, 2025
    Cite
    Ai2 (2025). olmOCR-bench [Dataset]. https://huggingface.co/datasets/allenai/olmOCR-bench
    Explore at:
    Dataset updated
    Jul 23, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    olmOCR-bench

    olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information. Quick links:

    📃 Paper 🛠️ Code 🎮 Demo

      Table 1. Distribution of Test Classes by Document Source
    

    Document Source Text Present Text… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-bench.
