Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
OCR-PDF-Degraded Dataset
Overview
This dataset contains synthetically degraded document images paired with their ground truth OCR text. It addresses a critical gap in OCR model training by providing realistic document degradations that simulate real-world conditions encountered in production environments.
Purpose
Most OCR models are trained on relatively clean, perfectly scanned documents. However, in real-world applications, especially in the military/defense… See the full description on the dataset page: https://huggingface.co/datasets/racineai/ocr-pdf-degraded.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).
Source images
The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.
Artificial noise application
The dataset was created as follows:
- First a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
- Then six ideal types of image noise --- "blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains" --- were applied both to the colour version and the binary version of the images, thus creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
- Lastly, all available combinations of *two* noise filters were applied to the colour and binary images, for an additional 30 versions.
This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.
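The 44-version count follows directly from the combinatorics above: 2 base versions with no added noise, 6 noise filters applied to each of the 2 base versions (12), and the 15 unordered pairs of filters applied to each base version (30). The short R sketch below reproduces these counts; it is illustrative only and is not the repository's actual noise-generation code.

noise_types <- c("blur", "weak ink", "salt and pepper", "watermark", "scribbles", "ink stains")
n_base <- 2                                        # the two versions with no added noise

n_single <- length(noise_types) * n_base           # 6 filters x 2 base versions = 12
n_double <- ncol(combn(noise_types, 2)) * n_base   # 15 filter pairs x 2 base versions = 30

n_versions <- n_base + n_single + n_double         # 44 versions per seed page
322 * n_versions                                   # 14168 English documents (322 seed pages)
100 * n_versions                                   # 4400 Arabic documents (100 seed pages)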
The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See this link for how to unzip .tar.lzma files.
References:
Barcha, Pedro. 2017. "Old Books Dataset." GitHub repository. https://github.com/PedroBarcha/old-books-dataset.
Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. "Yarmouk Arabic OCR Dataset." In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.
Hegghammer, Thomas. 2021. "OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment." SocArXiv. https://osf.io/preprints/socarxiv/6zfvs.
https://choosealicense.com/licenses/other/
Dataset Card for PDF Association dataset (PDFA)
Dataset Summary
The PDFA dataset is a document dataset filtered from the SafeDocs corpus, also known as CC-MAIN-2021-31-PDF-UNTRUNCATED. The original corpus was built for comprehensive analysis of PDF documents; this subset instead focuses on making the data machine-learning-ready for vision-language models.
An example page of one PDF document, with added bounding boxes… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/pdfa-eng-wds.
A medical forms dataset containing scanned documents is a valuable resource for healthcare professionals, researchers, and institutions seeking to streamline and improve their administrative and patient care processes. This dataset comprises digitized versions of various medical forms, such as patient intake forms, consent forms, health assessment questionnaires, and more, which have been scanned for electronic storage and easy access. These scanned medical forms preserve the layout and… See the full description on the dataset page: https://huggingface.co/datasets/saurabh1896/OMR-scanned-documents.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Invoices in TIFF format processed through the Document AI Invoice Parser, with output in Document.json format.
Source of TIFF Files: https://www.kaggle.com/datasets/manishthem/text-extraction-for-ocr
Document.json Structure
{
  "mimeType": string,
  "text": string,
  "pages": [
    {
      "pageNumber": integer,
      "image": {
        "content": string,
        "mimeType": string,
        "width": integer,
        "height": integer
      },
      "dimension": {
        "width": number,
        "height": number,
        "unit": string
      },
      "layout": {
        "textAnchor": {
          "textSegments": [
            {
              "startIndex": string,
              "endIndex": string
            }
          ]
        },
        "boundingPoly": {
          "vertices": [
            {
              "x": integer,
              "y": integer
            }
          ],
          "normalizedVertices": [
            {
              "x": number,
              "y": number
            }
          ]
        },
        "orientation": enum
      },
      "detectedLanguages": [
        {
          "languageCode": string,
          "confidence": number
        }
      ],
      "blocks": [
        {
          "layout": {}
        }
      ],
      "paragraphs": [
        {
          "layout": {}
        }
      ],
      "lines": [
        {
          "layout": {}
        }
      ],
      "tokens": [
        {
          "layout": {}
        }
      ]
    }
  ],
  "entities": [
    {
      "textAnchor": {},
      "type": string,
      "mentionText": string,
      "mentionId": string,
      "confidence": number,
      "pageAnchor": {
        "pageRefs": [
          {
            "page": string,
            "layoutType": enum,
            "layoutId": string,
            "boundingPoly": {},
            "confidence": number
          }
        ]
      },
      "id": string,
      "normalizedValue": {
        "text": string,
        "moneyValue": {},
        "dateValue": {},
        "datetimeValue": {},
        "addressValue": {},
        "booleanValue": boolean,
        "integerValue": integer,
        "floatValue": number
      },
      "properties": [
        {}
      ]
    }
  ]
}
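As a minimal sketch (not an official loader for this dataset), the following R snippet shows how one might read a single Document.json file with the jsonlite package and pull out the full OCR text and the parsed entities. Field names follow the structure above; the file path is a placeholder.

library(jsonlite)

# Parse one Document.json file (path is a placeholder)
doc <- fromJSON("[path to a Document.json file]", simplifyVector = FALSE)

# Full OCR text of the invoice
full_text <- doc$text

# Entity type, mention text, and parser confidence for each extracted field
entities <- data.frame(
  type        = sapply(doc$entities, function(e) e$type),
  mentionText = sapply(doc$entities, function(e) e$mentionText),
  confidence  = sapply(doc$entities, function(e) e$confidence),
  stringsAsFactors = FALSE
)
head(entities)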
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
The post-OCR correction dataset consists of paragraphs of text, at least 100 characters in length, extracted from documents randomly sampled from the sPeriodika dataset (http://hdl.handle.net/11356/1881) of Slovenian historical periodicals. From each document, five paragraphs were randomly sampled. If the paragraph was longer than 500 characters, it was trimmed to that length. The correction was performed by one human annotator having access to the scan of the original document. Out of the original collection of 450 paragraphs, 41 were discarded due to non-running text or very bad quality of the OCR. The metadata in the CSV dataset are the following:
- URN of the document
- link to the original PDF in dLib
- name of the periodical
- publisher of the periodical
- publication date
- original text
- corrected text
- line offset (zero-indexed)
- character length of the paragraph (trimmed to max. 500 characters)
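A minimal sketch of how the CSV could be inspected in R, assuming hypothetical column names original_text and corrected_text (check the released header for the actual names); it uses base R's adist() to estimate how heavily each paragraph was corrected.

# Column names original_text and corrected_text are assumptions; verify against the CSV header.
paras <- read.csv("[path to the post-OCR correction CSV]", stringsAsFactors = FALSE)

# Character-level edit distance between the OCR output and the human correction
paras$edit_distance <- mapply(adist, paras$original_text, paras$corrected_text)

# Rough character error rate, normalised by paragraph length
paras$cer <- paras$edit_distance / nchar(paras$original_text)
summary(paras$cer)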
https://choosealicense.com/licenses/other/
Dataset Card for Industry Documents Library (IDL)
Dataset Summary
Industry Documents Library (IDL) is a document dataset filtered from UCSF documents library with 19 million pages kept as valid samples. Each document exists as a collection of a pdf, a tiff image with the same contents rendered, a json file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.
The code to derive the dataset is given as follows:
### BEGIN R DATA PROCESSING SCRIPT
library(tesseract)
library(pdftools)

pdfs <- list.files("[path to your output directory containing all PDF files]")

# Metadata file containing all metadata for the PDF files (e.g. publication date)
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv", header = TRUE, sep = ',')
meta$Doc.Date <- as.character(meta$Doc.Date)

# Keep only records with a usable publication date
meta.clean <- meta[-which(meta$Doc.Date == "" | grepl("/0000", meta$Doc.Date)), ]
for (i in 1:nrow(meta.clean)) {
  meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])
  if (nchar(meta.clean$Doc.Date[i]) < 10) {
    meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i], format = "%d/%m/%y"), "%m/%d/%Y")
  }
}
meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date, format = "%m/%d/%Y")
meta.clean <- meta.clean[order(meta.clean$Doc.Date), ]

docs <- data.frame(content = character(0), dpub = character(0), stringsAsFactors = FALSE)

for (i in 1:nrow(meta.clean)) {
  pdf_path <- paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i]))
  pdf_prop <- pdftools::pdf_info(pdf_path)

  # One temporary image file per page
  tmp_files <- c()
  for (k in 1:pdf_prop$pages) {
    tmp_files <- c(tmp_files, paste0("[path to a temporary directory]/", k))
  }

  # Render the PDF pages to TIFF and OCR them page by page
  img_file <- pdftools::pdf_convert(pdf_path, format = 'tiff', pages = NULL, dpi = 700, filenames = tmp_files)
  txt <- ""
  for (j in 1:length(img_file)) {
    extract <- ocr(img_file[j], engine = tesseract("eng"))
    txt <- paste(txt, extract, collapse = " ")
  }

  # Strip punctuation and line breaks, collapse whitespace, lower-case, and convert to UTF-8
  docs <- rbind(docs,
                data.frame(content = iconv(tolower(gsub("\\s+", " ", gsub("[[:punct:]]|[\n]", " ", txt))), to = "UTF-8"),
                           dpub = format(meta.clean$Doc.Date[i], "%Y/%m/%d"),
                           stringsAsFactors = FALSE),
                stringsAsFactors = FALSE)
}

write.table(docs, "[path to your output directory]/documents.csv", row.names = FALSE)
### END R DATA PROCESSING SCRIPT
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Here are a few use cases for this project:
Quality Control in Manufacturing: Use the model to automate the process of detecting and classifying symbols on products or parts in a manufacturing line, ensuring products are correctly labeled and meet quality standards.
Packaging and Inventory Management: This model can be utilized in warehouses for efficient inventory management by identifying the symbols on package labels, thus automating the process of package sorting and tracking.
Optical Character Recognition (OCR): The model can be used in an OCR system to identify and classify different types of symbols found in scanned documents, pdf files, or photos of documents, aiding in data extraction and digitization efforts.
Augmented Reality (AR) Apps: The model could be employed in AR applications to identify real-world symbols in the user's environment, allowing the app to interact with them or provide additional context-based information.
Automation in Retail: The model can be used in self-checkout systems in stores to identify and classify product symbols, streamlining the checkout process by automatically identifying purchased items based on their symbols.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
OpenDoc-Pdf-Preview
OpenDoc-Pdf-Preview is a compact visual preview dataset containing 6,000 high-resolution document images extracted from PDFs. This dataset is designed for Image-to-Text tasks such as document OCR pretraining, layout understanding, and multimodal document analysis.
Dataset Summary
Modality: Image-to-Text
Content Type: PDF-based document previews
Number of Samples: 6,000
Language: English
Format: Parquet
Split: train only
Size: 606 MB
License: Apache… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview.
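A minimal sketch for loading the train split locally with the arrow R package, assuming the Parquet file has already been downloaded from the dataset page (the exact file name inside the repository is not stated here, so the path below is a placeholder).

library(arrow)

preview <- read_parquet("[path to the downloaded train Parquet file]")
nrow(preview)    # expected: 6000 rows, one per document image
names(preview)   # inspect the available columns (image data plus any accompanying fields)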
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Card for SpanishOCR Dataset
Dataset Summary
The SpanishOCR dataset contains images derived from regulatory documents of the Peruvian government in PDF format. This dataset is used for benchmarking and evaluating large language models' ability to convert unstructured documents, such as PDFs and images, into machine-readable format, particularly in the finance domain, where the conversion task is more complex and valuable.
Supported Tasks
Task: Image-to-Text… See the full description on the dataset page: https://huggingface.co/datasets/TheFinAI/MultiFinBen-SpanishOCR.
This is a dataset of publicly available Tanzania Hansard documents, in Kiswahili. It contains 2735 PNG images of pages from PDF documents, and text files containing transcripts obtained from the OCR tool tesseract-ocr. The images were obtained by rasterising the PDF files with ImageMagick. The intended use is to study how improvements to language/word-sequence modeling can improve OCR in a low-resource setting, and to record the accuracy of pre-existing OCR tools that use language models before any other methods are applied.
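The dataset itself was produced with the ImageMagick command-line tools and tesseract-ocr; the sketch below shows a roughly equivalent pipeline using the R magick and tesseract packages, with placeholder paths and assuming the Kiswahili ("swa") traineddata is installed.

library(magick)
library(tesseract)

# Render the first page of a Hansard PDF to a PNG image
page <- image_read_pdf("[path to a Hansard PDF]", pages = 1, density = 300)
image_write(page, path = "page-1.png", format = "png")

# OCR the page with the Kiswahili language model
# (install it first with tesseract_download("swa") if needed)
text <- ocr("page-1.png", engine = tesseract("swa"))
cat(text)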
The Barnes Ice Cap data set contains survey measurements of a network of 43 stakes along a 10 km flow line on the northeast flank of the south dome of the Barnes Ice Cap. The measurements are of mass balance, surface velocity, and surface elevation. They were taken over a period of time from 1970 to 1984. The data set came from a hard copy computer printout containing raw data as well as processed quantities. This printout was scanned and digitized into a PDF file. This PDF file was put through Optical Character Recognition (OCR) software and saved as another PDF file. The resultant PDF file is human readable and all values are correct when viewed in an Adobe PDF reader. However, if you copy the contents and paste them into another application there may be errors in the values as the OCR process did not accurately compute all characters correctly. If you copy the data values into another application for analysis, double check the values against what is in the PDF file. The data are available via FTP.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The Nordic Exceptionalism Datasprint Datasets consist of two files with text data prepared for the Nordic Exceptionalism Datasprint held on 11 May 2023 at the University of Copenhagen, South Campus.
The text data is partly OCR text from scanned PDF files that has been checked through so that it is close to perfect, and partly text data from Gutenberg.org.
The first file, "nordic corpus files incl abbyy finerader files.zip", holds a biography file as well as multiple folders, each containing an ABBYY FineReader file, a txt file, a Word file, and a searchable PDF file if the folder holds a text that has been OCR-processed with ABBYY FineReader.
The second file, "nordic corpus txt files prepared for voyant tools.zip", holds a biography file and multiple txt files prepared for use in Voyant Tools. These files are named beginning with the year of publication.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Data was extracted in PDF format with personal information redacted to ensure privacy. The raw dataset consisted of 260 cases from three chambers within a single German social law court. The data originates from a single judge, who typically oversees five to six chambers, meaning that this dataset represents only a subset of the judge’s total caseload. Optical Character Recognition (OCR) was used to extract the document text, which was organized into an event log according to the tabular structure of the documents. In the dataset, a single timestamp is recorded for each activity, commonly indicating only the date of occurrence rather than a precise timestamp. This limits the granularity of time-based analyses and the accuracy of calculated activity durations. As the analysis focuses on the overall durations of cases, which typically range from multiple months to years, the impact of the timestamp imprecisions was negligible in our use case. After extraction, the event log was further processed in consultation with domain experts to ensure anonymity, remove noise, and raise it to an abstraction level appropriate for analysis. All remaining personal identifiers, such as expert witness names, were removed from the log to ensure anonymity. Additionally, timestamps were systematically perturbed to further enhance data privacy. Originally, the event log contained 22,664 recorded events and 290 unique activities. Activities that were extremely rare (i.e., occurring fewer than 30 times) were excluded to focus on frequently observed procedural steps. Furthermore, the domain experts reviewed the list of unique activity labels, based on which similar activities were merged, and terminology was standardized across cases. The refinement of the activity labels reduced the number of unique activities to 59. Finally, duplicate events were removed. These steps collectively reduced the dataset to 19,947 events. The final anonymized and processed dataset includes 260 cases, 19,947 events from three chambers and 59 unique activities.
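A minimal sketch of the rare-activity filtering and deduplication steps described above, assuming an event log table with hypothetical column names case_id, activity, and timestamp.

# Column names case_id, activity, and timestamp are assumptions about the log's layout.
event_log <- read.csv("[path to the extracted event log]", stringsAsFactors = FALSE)

# Exclude activities observed fewer than 30 times across the whole log
activity_counts <- table(event_log$activity)
event_log <- event_log[event_log$activity %in% names(activity_counts)[activity_counts >= 30], ]

# Remove duplicate events (same case, activity, and date)
event_log <- event_log[!duplicated(event_log[, c("case_id", "activity", "timestamp")]), ]

nrow(event_log)                     # remaining events
length(unique(event_log$activity))  # remaining unique activities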
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th and 19th centuries and the beginning of the 20th century (1771-1914). The periodical issues were retrieved from Slovenia's national library's digital library service (https://dlib.si) in the form of OCR-ed PDF and TXT files. Before the documents were linguistically annotated (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts, cf. http://hdl.handle.net/11356/1907. The documents in the collection are enriched with the following metadata obtained from dLib:
- Document ID (URN)
- Periodical name
- Document (periodical issue) title
- Volume number (if available)
- Issue number (if available)
- Year of publication
- Date of publication (of varying granularity, based on the original metadata available)
- Source (URL of the original digitised document available at dlib.si)
- Image (see below)
- Quality (see below)
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports, and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties, and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were an extract from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be run in inference mode to automatically create the labels, thereby producing structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format of the doccano open-source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form; these documents had to undergo OCR, which resulted in lower-quality text and lower-quality training data. The majority of the labelled data is from the higher-quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold-standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
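A minimal sketch for reading a JSONL(Relation) export in R with jsonlite; the field names (text, entities, relations) follow doccano's relation export format and should be verified against the released files.

library(jsonlite)

# Each line of the JSONL(Relation) file is one annotated sentence
lines <- readLines("[path to the JSONL(Relation) file]")
records <- lapply(lines, fromJSON, simplifyVector = FALSE)

# Inspect the first annotated sentence
first <- records[[1]]
first$text               # the sentence text
length(first$entities)   # labelled spans (rock formations, geological ages, rock types, ...)
length(first$relations)  # labelled relations between spans (e.g. overlies, observedIn)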
https://choosealicense.com/licenses/odc-by/
olmOCR-bench
olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information. Quick links:
📃 Paper 🛠️ Code 🎮 Demo
Table 1 (Distribution of Test Classes by Document Source) and further details are available in the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-bench.