Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
OCR-PDF-Degraded Dataset
Overview
This dataset contains synthetically degraded document images paired with their ground truth OCR text. It addresses a critical gap in OCR model training by providing realistic document degradations that simulate real-world conditions encountered in production environments.
Purpose
Most OCR models are trained on relatively clean, perfectly scanned documents. However, in real-world applications, especially in the military/defense… See the full description on the dataset page: https://huggingface.co/datasets/racineai/ocr-pdf-degraded.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).
Source images
The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.
Artificial noise application
The dataset was created as follows:
- First a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
- Then six ideal types of image noise --- "blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains" --- were applied both to the colour version and the binary version of the images, thus creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
- Lastly, all available combinations of *two* noise filters were applied to the colour and binary images, for an additional 30 versions.
This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.
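The 44-version count follows directly from the combinatorics above: 2 base versions with no added noise, 6 noise filters applied to each of the 2 base versions (12), and the 15 unordered pairs of filters applied to each base version (30). The short R sketch below reproduces these counts; it is illustrative only and is not the repository's actual noise-generation code.

noise_types <- c("blur", "weak ink", "salt and pepper", "watermark", "scribbles", "ink stains")
n_base <- 2                                        # the two versions with no added noise

n_single <- length(noise_types) * n_base           # 6 filters x 2 base versions = 12
n_double <- ncol(combn(noise_types, 2)) * n_base   # 15 filter pairs x 2 base versions = 30

n_versions <- n_base + n_single + n_double         # 44 versions per seed page
322 * n_versions                                   # 14168 English documents (322 seed pages)
100 * n_versions                                   # 4400 Arabic documents (100 seed pages)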
The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See this link for how to unzip .tar.lzma files.
References:
Barcha, Pedro. 2017. "Old Books Dataset." GitHub repository. https://github.com/PedroBarcha/old-books-dataset.
Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. "Yarmouk Arabic OCR Dataset." In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.
Hegghammer, Thomas. 2021. "OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment." SocArXiv. https://osf.io/preprints/socarxiv/6zfvs.
https://choosealicense.com/licenses/other/
Dataset Card for PDF Association dataset (PDFA)
Dataset Summary
The PDFA dataset is a document dataset filtered from the SafeDocs corpus, also known as CC-MAIN-2021-31-PDF-UNTRUNCATED. The original corpus was built for comprehensive analysis of PDF documents; this subset instead focuses on making the data machine-learning-ready for vision-language models.
An example page of one PDF document, with added bounding boxes… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/pdfa-eng-wds.
A medical forms dataset containing scanned documents is a valuable resource for healthcare professionals, researchers, and institutions seeking to streamline and improve their administrative and patient care processes. This dataset comprises digitized versions of various medical forms, such as patient intake forms, consent forms, health assessment questionnaires, and more, which have been scanned for electronic storage and easy access. These scanned medical forms preserve the layout and… See the full description on the dataset page: https://huggingface.co/datasets/saurabh1896/OMR-scanned-documents.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Invoices in TIFF format processed through the Document AI Invoice Parser, with output in Document.json format.
Source of TIFF Files: https://www.kaggle.com/datasets/manishthem/text-extraction-for-ocr
Document.json Structure
{
  "mimeType": string,
  "text": string,
  "pages": [
    {
      "pageNumber": integer,
      "image": {
        "content": string,
        "mimeType": string,
        "width": integer,
        "height": integer
      },
      "dimension": {
        "width": number,
        "height": number,
        "unit": string
      },
      "layout": {
        "textAnchor": {
          "textSegments": [
            {
              "startIndex": string,
              "endIndex": string
            }
          ]
        },
        "boundingPoly": {
          "vertices": [
            {
              "x": integer,
              "y": integer
            }
          ],
          "normalizedVertices": [
            {
              "x": number,
              "y": number
            }
          ]
        },
        "orientation": enum
      },
      "detectedLanguages": [
        {
          "languageCode": string,
          "confidence": number
        }
      ],
      "blocks": [
        {
          "layout": {}
        }
      ],
      "paragraphs": [
        {
          "layout": {}
        }
      ],
      "lines": [
        {
          "layout": {}
        }
      ],
      "tokens": [
        {
          "layout": {}
        }
      ]
    }
  ],
  "entities": [
    {
      "textAnchor": {},
      "type": string,
      "mentionText": string,
      "mentionId": string,
      "confidence": number,
      "pageAnchor": {
        "pageRefs": [
          {
            "page": string,
            "layoutType": enum,
            "layoutId": string,
            "boundingPoly": {},
            "confidence": number
          }
        ]
      },
      "id": string,
      "normalizedValue": {
        "text": string,
        "moneyValue": {},
        "dateValue": {},
        "datetimeValue": {},
        "addressValue": {},
        "booleanValue": boolean,
        "integerValue": integer,
        "floatValue": number
      },
      "properties": [
        {}
      ]
    }
  ]
}
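As a minimal sketch (not an official loader for this dataset), the following R snippet shows how one might read a single Document.json file with the jsonlite package and pull out the full OCR text and the parsed entities. Field names follow the structure above; the file path is a placeholder.

library(jsonlite)

# Parse one Document.json file (path is a placeholder)
doc <- fromJSON("[path to a Document.json file]", simplifyVector = FALSE)

# Full OCR text of the invoice
full_text <- doc$text

# Entity type, mention text, and parser confidence for each extracted field
entities <- data.frame(
  type        = sapply(doc$entities, function(e) e$type),
  mentionText = sapply(doc$entities, function(e) e$mentionText),
  confidence  = sapply(doc$entities, function(e) e$confidence),
  stringsAsFactors = FALSE
)
head(entities)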
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
The post-OCR correction dataset consists of paragraphs of text, at least 100 characters in length, extracted from documents randomly sampled from the sPeriodika dataset (http://hdl.handle.net/11356/1881) of Slovenian historical periodicals. From each document, five paragraphs were randomly sampled. If the paragraph was longer than 500 characters, it was trimmed to that length. The correction was performed by one human annotator having access to the scan of the original document. Out of the original collection of 450 paragraphs, 41 were discarded due to non-running text or very bad quality of the OCR. The metadata in the CSV dataset are the following:
- URN of the document
- link to the original PDF in dLib
- name of the periodical
- publisher of the periodical
- publication date
- original text
- corrected text
- line offset (zero-indexed)
- character length of the paragraph (trimmed to max. 500 characters)
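A minimal sketch of how the CSV could be inspected in R, assuming hypothetical column names original_text and corrected_text (check the released header for the actual names); it uses base R's adist() to estimate how heavily each paragraph was corrected.

# Column names original_text and corrected_text are assumptions; verify against the CSV header.
paras <- read.csv("[path to the post-OCR correction CSV]", stringsAsFactors = FALSE)

# Character-level edit distance between the OCR output and the human correction
paras$edit_distance <- mapply(adist, paras$original_text, paras$corrected_text)

# Rough character error rate, normalised by paragraph length
paras$cer <- paras$edit_distance / nchar(paras$original_text)
summary(paras$cer)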
https://choosealicense.com/licenses/other/
Dataset Card for Industry Documents Library (IDL)
Dataset Summary
Industry Documents Library (IDL) is a document dataset filtered from UCSF documents library with 19 million pages kept as valid samples. Each document exists as a collection of a pdf, a tiff image with the same contents rendered, a json file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.
The code to derive the dataset is given as follows:
### BEGIN R DATA PROCESSING SCRIPT
library(tesseract)
library(pdftools)

pdfs <- list.files("[path to your output directory containing all PDF files]")

# Metadata file containing all metadata for the PDF files (e.g. publication date)
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv", header = TRUE, sep = ',')
meta$Doc.Date <- as.character(meta$Doc.Date)

# Keep only records with a usable publication date
meta.clean <- meta[-which(meta$Doc.Date == "" | grepl("/0000", meta$Doc.Date)), ]
for (i in 1:nrow(meta.clean)) {
  meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])
  if (nchar(meta.clean$Doc.Date[i]) < 10) {
    meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i], format = "%d/%m/%y"), "%m/%d/%Y")
  }
}
meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date, format = "%m/%d/%Y")
meta.clean <- meta.clean[order(meta.clean$Doc.Date), ]

docs <- data.frame(content = character(0), dpub = character(0), stringsAsFactors = FALSE)

for (i in 1:nrow(meta.clean)) {
  pdf_path <- paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i]))
  pdf_prop <- pdftools::pdf_info(pdf_path)

  # One temporary image file per page
  tmp_files <- c()
  for (k in 1:pdf_prop$pages) {
    tmp_files <- c(tmp_files, paste0("[path to a temporary directory]/", k))
  }

  # Render the PDF pages to TIFF and OCR them page by page
  img_file <- pdftools::pdf_convert(pdf_path, format = 'tiff', pages = NULL, dpi = 700, filenames = tmp_files)
  txt <- ""
  for (j in 1:length(img_file)) {
    extract <- ocr(img_file[j], engine = tesseract("eng"))
    txt <- paste(txt, extract, collapse = " ")
  }

  # Strip punctuation and line breaks, collapse whitespace, lower-case, and convert to UTF-8
  docs <- rbind(docs,
                data.frame(content = iconv(tolower(gsub("\\s+", " ", gsub("[[:punct:]]|[\n]", " ", txt))), to = "UTF-8"),
                           dpub = format(meta.clean$Doc.Date[i], "%Y/%m/%d"),
                           stringsAsFactors = FALSE),
                stringsAsFactors = FALSE)
}

write.table(docs, "[path to your output directory]/documents.csv", row.names = FALSE)
### END R DATA PROCESSING SCRIPT
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Here are a few use cases for this project:
Quality Control in Manufacturing: Use the model to automate the process of detecting and classifying symbols on products or parts in a manufacturing line, ensuring products are correctly labeled and meet quality standards.
Packaging and Inventory Management: This model can be utilized in warehouses for efficient inventory management by identifying the symbols on package labels, thus automating the process of package sorting and tracking.
Optical Character Recognition (OCR): The model can be used in an OCR system to identify and classify different types of symbols found in scanned documents, pdf files, or photos of documents, aiding in data extraction and digitization efforts.
Augmented Reality (AR) Apps: The model could be employed in AR applications to identify real-world symbols in the user's environment, allowing the app to interact with them or provide additional context-based information.
Automation in Retail: The model can be used in self-checkout systems in stores to identify and classify product symbols, streamlining the checkout process by automatically identifying purchased items based on their symbols.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
OpenDoc-Pdf-Preview
OpenDoc-Pdf-Preview is a compact visual preview dataset containing 6,000 high-resolution document images extracted from PDFs. This dataset is designed for Image-to-Text tasks such as document OCR pretraining, layout understanding, and multimodal document analysis.
Dataset Summary
Modality: Image-to-Text
Content Type: PDF-based document previews
Number of Samples: 6,000
Language: English
Format: Parquet
Split: train only
Size: 606 MB
License: Apache… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/OpenDoc-Pdf-Preview.
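A minimal sketch for loading the train split locally with the arrow R package, assuming the Parquet file has already been downloaded from the dataset page (the exact file name inside the repository is not stated here, so the path below is a placeholder).

library(arrow)

preview <- read_parquet("[path to the downloaded train Parquet file]")
nrow(preview)    # expected: 6000 rows, one per document image
names(preview)   # inspect the available columns (image data plus any accompanying fields)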
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Card for SpanishOCR Dataset
Dataset Summary
The SpanishOCR dataset contains images derived from regulatory documents of the Peruvian government in PDF format. This dataset is used for benchmarking and evaluating large language models' ability to convert unstructured documents, such as PDFs and images, into machine-readable format, particularly in the finance domain, where the conversion task is more complex and valuable.
Supported Tasks
Task: Image-to-Text… See the full description on the dataset page: https://huggingface.co/datasets/TheFinAI/MultiFinBen-SpanishOCR.
This is a dataset of publicly available Tanzania Hansard documents, in Kiswahili. It contains 2735 PNG images of pages from PDF documents, and text files containing transcripts obtained from the OCR tool tesseract-ocr. The images were obtained by rasterising the PDF files with ImageMagick. The intended use is to study how improvements to language/word-sequence modeling can improve OCR in a low-resource setting, and to record the accuracy of pre-existing OCR tools that use language models before any other methods are applied.
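The dataset itself was produced with the ImageMagick command-line tools and tesseract-ocr; the sketch below shows a roughly equivalent pipeline using the R magick and tesseract packages, with placeholder paths and assuming the Kiswahili ("swa") traineddata is installed.

library(magick)
library(tesseract)

# Render the first page of a Hansard PDF to a PNG image
page <- image_read_pdf("[path to a Hansard PDF]", pages = 1, density = 300)
image_write(page, path = "page-1.png", format = "png")

# OCR the page with the Kiswahili language model
# (install it first with tesseract_download("swa") if needed)
text <- ocr("page-1.png", engine = tesseract("swa"))
cat(text)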
The Barnes Ice Cap data set contains survey measurements of a network of 43 stakes along a 10 km flow line on the northeast flank of the south dome of the Barnes Ice Cap. The measurements are of mass balance, surface velocity, and surface elevation. They were taken over a period of time from 1970 to 1984. The data set came from a hard copy computer printout containing raw data as well as processed quantities. This printout was scanned and digitized into a PDF file. This PDF file was put through Optical Character Recognition (OCR) software and saved as another PDF file. The resultant PDF file is human readable and all values are correct when viewed in an Adobe PDF reader. However, if you copy the contents and paste them into another application there may be errors in the values as the OCR process did not accurately compute all characters correctly. If you copy the data values into another application for analysis, double check the values against what is in the PDF file. The data are available via FTP.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The Nordic Exceptionalism Datasprint Datasets consist of two files with text data prepared for the Nordic Exceptionalism Datasprint held on 11 May 2023 at the University of Copenhagen, South Campus.
The text data is partly OCR text from scanned PDF files that has been checked through so that it is close to perfect, and partly text data from Gutenberg.org.
The first file, "nordic corpus files incl abbyy finerader files.zip", holds a biography file as well as multiple folders, each containing an ABBYY FineReader file, a txt file, a Word file, and a searchable PDF file if the folder holds a text that has been OCR-processed with ABBYY FineReader.
The second file, "nordic corpus txt files prepared for voyant tools.zip", holds a biography file and multiple txt files prepared for use in Voyant Tools. These files are named beginning with the year of publication.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Data was extracted in PDF format with personal information redacted to ensure privacy. The raw dataset consisted of 260 cases from three chambers within a single German social law court. The data originates from a single judge, who typically oversees five to six chambers, meaning that this dataset represents only a subset of the judge’s total caseload. Optical Character Recognition (OCR) was used to extract the document text, which was organized into an event log according to the tabular structure of the documents. In the dataset, a single timestamp is recorded for each activity, commonly indicating only the date of occurrence rather than a precise timestamp. This limits the granularity of time-based analyses and the accuracy of calculated activity durations. As the analysis focuses on the overall durations of cases, which typically range from multiple months to years, the impact of the timestamp imprecisions was negligible in our use case. After extraction, the event log was further processed in consultation with domain experts to ensure anonymity, remove noise, and raise it to an abstraction level appropriate for analysis. All remaining personal identifiers, such as expert witness names, were removed from the log to ensure anonymity. Additionally, timestamps were systematically perturbed to further enhance data privacy. Originally, the event log contained 22,664 recorded events and 290 unique activities. Activities that were extremely rare (i.e., occurring fewer than 30 times) were excluded to focus on frequently observed procedural steps. Furthermore, the domain experts reviewed the list of unique activity labels, based on which similar activities were merged, and terminology was standardized across cases. The refinement of the activity labels reduced the number of unique activities to 59. Finally, duplicate events were removed. These steps collectively reduced the dataset to 19,947 events. The final anonymized and processed dataset includes 260 cases, 19,947 events from three chambers and 59 unique activities.
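A minimal sketch of the rare-activity filtering and deduplication steps described above, assuming an event log table with hypothetical column names case_id, activity, and timestamp.

# Column names case_id, activity, and timestamp are assumptions about the log's layout.
event_log <- read.csv("[path to the extracted event log]", stringsAsFactors = FALSE)

# Exclude activities observed fewer than 30 times across the whole log
activity_counts <- table(event_log$activity)
event_log <- event_log[event_log$activity %in% names(activity_counts)[activity_counts >= 30], ]

# Remove duplicate events (same case, activity, and date)
event_log <- event_log[!duplicated(event_log[, c("case_id", "activity", "timestamp")]), ]

nrow(event_log)                     # remaining events
length(unique(event_log$activity))  # remaining unique activities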
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th and 19th centuries and the beginning of the 20th century (1771-1914). The periodical issues were retrieved from Slovenia's national library's digital library service (https://dlib.si) in the form of OCR-ed PDF and TXT files. Before the documents were linguistically annotated (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts, cf. http://hdl.handle.net/11356/1907. The documents in the collection are enriched with the following metadata obtained from dLib:
- Document ID (URN)
- Periodical name
- Document (periodical issue) title
- Volume number (if available)
- Issue number (if available)
- Year of publication
- Date of publication (of varying granularity, based on the original metadata available)
- Source (URL of the original digitised document available at dlib.si)
- Image (see below)
- Quality (see below)
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports, and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties, and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were an extract from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be run in inference mode to automatically create the labels, thereby producing structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format of the doccano open-source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form; these documents had to undergo OCR, which resulted in lower-quality text and lower-quality training data. The majority of the labelled data is from the higher-quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold-standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
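A minimal sketch for reading a JSONL(Relation) export in R with jsonlite; the field names (text, entities, relations) follow doccano's relation export format and should be verified against the released files.

library(jsonlite)

# Each line of the JSONL(Relation) file is one annotated sentence
lines <- readLines("[path to the JSONL(Relation) file]")
records <- lapply(lines, fromJSON, simplifyVector = FALSE)

# Inspect the first annotated sentence
first <- records[[1]]
first$text               # the sentence text
length(first$entities)   # labelled spans (rock formations, geological ages, rock types, ...)
length(first$relations)  # labelled relations between spans (e.g. overlies, observedIn)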
https://choosealicense.com/licenses/odc-by/
olmOCR-bench
olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information. Quick links:
📃 Paper 🛠️ Code 🎮 Demo
Table 1 (Distribution of Test Classes by Document Source) and further details are available in the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-bench.