Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset comprising 813 images of invoices and receipts of a private company in the Portuguese language. It also includes text files with the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference.
Dataset Card for "invoices-and-receipts_ocr_v1"
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Invoice Ocr is a dataset for object detection tasks - it contains Document annotations for 499 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
A dataset of one lakh (100,000) OCR images covering several industries, such as hotels, cab rentals, and bars. Every invoice is a high-quality image captured with a smartphone. The invoices cover businesses in the USA and India.
There are three types of invoice images (well lit, low light, and shadow). The invoices were photographed indoors and outdoors against different backgrounds.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Invoices in TIFF format, processed through the Document AI Invoice Parser, with results stored in Document.json format.
Source of TIFF Files: https://www.kaggle.com/datasets/manishthem/text-extraction-for-ocr
Document.json Structure
{
  "mimeType": string,
  "text": string,
  "pages": [
    {
      "pageNumber": integer,
      "image": {
        "content": string,
        "mimeType": string,
        "width": integer,
        "height": integer
      },
      "dimension": {
        "width": number,
        "height": number,
        "unit": string
      },
      "layout": {
        "textAnchor": {
          "textSegments": [
            {
              "startIndex": string,
              "endIndex": string
            }
          ]
        },
        "boundingPoly": {
          "vertices": [
            {
              "x": integer,
              "y": integer
            }
          ],
          "normalizedVertices": [
            {
              "x": number,
              "y": number
            }
          ]
        },
        "orientation": enum
      },
      "detectedLanguages": [
        {
          "languageCode": string,
          "confidence": number
        }
      ],
      "blocks": [
        {
          "layout": {}
        }
      ],
      "paragraphs": [
        {
          "layout": {}
        }
      ],
      "lines": [
        {
          "layout": {}
        }
      ],
      "tokens": [
        {
          "layout": {}
        }
      ]
    }
  ],
  "entities": [
    {
      "textAnchor": {},
      "type": string,
      "mentionText": string,
      "mentionId": string,
      "confidence": number,
      "pageAnchor": {
        "pageRefs": [
          {
            "page": string,
            "layoutType": enum,
            "layoutId": string,
            "boundingPoly": {},
            "confidence": number
          }
        ]
      },
      "id": string,
      "normalizedValue": {
        "text": string,
        "moneyValue": {},
        "dateValue": {},
        "datetimeValue": {},
        "addressValue": {},
        "booleanValue": boolean,
        "integerValue": integer,
        "floatValue": number
      },
      "properties": [
        {}
      ]
    }
  ]
}
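The structure above can be read with a few lines of Python. The following is a minimal sketch, not the dataset author's code: the helper names (`load_document`, `anchor_text`, `list_entities`, `to_pixels`) and the sample values are made up for illustration, and a missing `startIndex` is treated as 0, which is an assumption about how the JSON was serialized.

```python
import json


def load_document(path):
    """Load one Document.json file into a plain dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def anchor_text(document, anchor):
    """Join the slices of document['text'] referenced by a textAnchor."""
    full_text = document.get("text", "")
    parts = []
    for segment in anchor.get("textSegments", []):
        # Indices are stored as strings in the structure above.
        # Assumption: an absent startIndex means the segment starts at 0.
        start = int(segment.get("startIndex", 0))
        end = int(segment.get("endIndex", 0))
        parts.append(full_text[start:end])
    return "".join(parts)


def list_entities(document):
    """Return (type, text, confidence) for every entity in the document."""
    rows = []
    for entity in document.get("entities", []):
        # Prefer mentionText; fall back to resolving the textAnchor.
        text = entity.get("mentionText") or anchor_text(
            document, entity.get("textAnchor", {})
        )
        rows.append((entity.get("type", ""), text, float(entity.get("confidence", 0.0))))
    return rows


def to_pixels(normalized_vertices, width, height):
    """Scale normalizedVertices (0..1) to pixel coordinates of a page image."""
    return [
        (round(v.get("x", 0.0) * width), round(v.get("y", 0.0) * height))
        for v in normalized_vertices
    ]


# Hypothetical sample that follows the structure above (values are invented).
sample = {
    "text": "Invoice 42\nTotal 10.00\n",
    "entities": [
        {
            "type": "invoice_id",
            "textAnchor": {"textSegments": [{"startIndex": "8", "endIndex": "10"}]},
            "confidence": 0.9,
        },
        {"type": "total_amount", "mentionText": "10.00", "confidence": 0.8},
    ],
}

for entity_type, text, confidence in list_entities(sample):
    print(entity_type, repr(text), confidence)
```

For a real file, `list_entities(load_document("Document.json"))` would apply the same logic; the file name here is only a placeholder.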
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Invoice Ocr is a dataset for object detection tasks - it contains Cars annotations for 1,000 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Digital Bookkeeping Systems: Developers can integrate the "invoice" model into bookkeeping software, where it scans, reads, and categorizes information from physical or digital invoices. It automates the accounting process by sorting invoices into specific classes like vendor_info, total_price, etc.
Expense Management Applications: Companies can use this model to simplify expense tracking, where employees just need to upload the invoice image, and the model will extract required details like vendor_info, total_price and more.
OCR (Optical Character Recognition) Systems: The "invoice" model can significantly enhance OCR systems, allowing for context-aware recognition of specific text elements within images of documents, such as an invoice's details, bank info, and customer info.
Automatic Auditing Systems: The model can be utilized by auditing firms to automate the auditing process. It would help to compare details of scanned invoices with saved financial records, spotting any disparities instantly.
Vendor Management Systems: Large companies dealing with multiple vendors can use the "invoice" model in their vendor management systems. The model would automatically extract and categorize information about the vendor and the services rendered from the invoices.
Invoice OCR Dataset
This dataset contains annotated invoice images and their corresponding OCR-extracted text in structured JSON format. The data was originally sourced from an open-source invoice dataset and processed using the GPT-4o mini model to extract relevant fields such as invoice number, date, total amount, vendor, and line items.
Dataset Details
Dataset Description
This dataset is designed to support training and evaluation of document understanding… See the full description on the dataset page: https://huggingface.co/datasets/GokulRajaR/invoice-ocr-json.
This dataset was created by DoMixi1989
https://data.macgence.com/terms-and-conditions
Unlock the potential of our invoice and receipt image dataset. Perfect for AI training, OCR development, and advancing data extraction technologies.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
For the complete dataset or more information, please email commercialproduct@appen.com.
This dataset product can be used in many AI pilot projects and can supplement production models alongside other data. It can improve model performance cost-effectively, and it is a good option when time and budget are limited. The Appen database team can provide a large number of database products, such as ASR, TTS, video, text, and image data. The team is also constantly building new datasets to expand its resources and strives to deliver as quickly as possible to meet the needs of global customers.
This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including text in various languages, special characters, and numbers. The required accuracy rate is over 99% (both position and content must be correct).
The images include the following categories:
- RECEIPT
- IDCARD
- TRADE
- TABLE
- WHITEBOARD
- NEWSPAPER
- THESIS
- CARD
- NOTE
- CONTRACT
- BOOKCONTENT
- HANDWRITING
Quantity by category, one row per database:

| RECEIPT | IDCARD | TRADE | TABLE | WHITEBOARD | NEWSPAPER | THESIS | CARD | NOTE | CONTRACT | BOOKCONTENT | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1,500 | 500 | 1,012 | 512 | 500 | 500 | 500 | 500 | 499 | 501 | 500 | 7,024 |
| 337 | 100 | 227 | 100 | 111 | 100 | 100 | 100 | 100 | 105 | 700 | 2,080 |
| 1,500 | 500 | 1,000 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 7,000 |
| 300 | 100 | 200 | 100 | 100 | 100 | 103 | 100 | 100 | 100 | 700 | 2,003 |
| 1,500 | 500 | 1,000 | 537 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 7,037 |
| 1,586 | 500 | 1,000 | 552 | 500 | 500 | 509 | 500 | 500 | 500 | 500 | 7,147 |
| 1,500 | 500 | 1,003 | 500 | 501 | 502 | 500 | 500 | 500 | 500 | 500 | 7,006 |
| 356 | 98 | 475 | 532 | 501 | 500 | 500 | 500 | 501 | 500 | 500 | 4,963 |
| 300 | 100 | 200 | 117 | 110 | 108 | 102 | 100 | 120 | 100 | 761 | 2,118 |

| Database Name | Category | Quantity |
|---|---|---|
| English Handwritten Datasets | HANDWRITING | 2,278 |
| Chinese Handwritten Datasets | HANDWRITING | 11,118 |
amaye15/invoices-google-ocr dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.
The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.
The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.
Each image in the images folder is accompanied by an XML annotation in the annotations.xml file, indicating the coordinates of the bounding boxes and the labels for text detection. For each point, the x and y coordinates are provided.
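As a rough illustration of reading such annotations, the sketch below parses a small XML string with Python's standard library. The exact layout of annotations.xml is not given here, so the element and attribute names (`image`, `polygon`, `box`, `points`, `xtl`, `ytl`, `xbr`, `ybr`, `label`) are assumptions modeled on a CVAT-style export, and `SAMPLE_XML` and `parse_annotations` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumed CVAT-style layout; the real annotations.xml may differ.
SAMPLE_XML = """<annotations>
  <image id="0" name="images/0.jpg" width="800" height="600">
    <polygon label="text" points="10.0,20.0;110.0,20.0;110.0,50.0;10.0,50.0"/>
    <box label="text" xtl="5" ytl="60" xbr="95" ybr="80"/>
  </image>
</annotations>"""


def parse_annotations(xml_text):
    """Map each image name to a list of (label, [(x, y), ...]) shapes."""
    root = ET.fromstring(xml_text)
    result = {}
    for image in root.iter("image"):
        shapes = []
        for shape in image:
            if shape.tag == "polygon":
                # Points are assumed to be stored as "x1,y1;x2,y2;...".
                points = [
                    tuple(float(c) for c in pair.split(","))
                    for pair in shape.get("points", "").split(";")
                    if pair
                ]
            elif shape.tag == "box":
                xtl, ytl, xbr, ybr = (
                    float(shape.get(key)) for key in ("xtl", "ytl", "xbr", "ybr")
                )
                # Expand the box into its four corner points, clockwise.
                points = [(xtl, ytl), (xbr, ytl), (xbr, ybr), (xtl, ybr)]
            else:
                continue  # Skip shape types this sketch does not handle.
            shapes.append((shape.get("label"), points))
        result[image.get("name")] = shapes
    return result


parsed = parse_annotations(SAMPLE_XML)
for name, shapes in parsed.items():
    print(name, len(shapes), "shapes")
```

If the real file uses different tag or attribute names, only the two branches inside the loop need to change.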
keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ReceiptSense: Beyond Traditional OCR - A Dataset for Receipt Understanding
🔥 News
[2024] ReceiptSense dataset is now publicly available!
[2024] Paper accepted and published
📖 Abstract
Multilingual OCR and information extraction from receipts remains challenging, particularly for complex scripts like Arabic. We introduce ReceiptSense, a comprehensive dataset designed for Arabic-English receipt understanding comprising:
20,000 annotated receipts… See the full description on the dataset page: https://huggingface.co/datasets/abdoelsayed/CORU.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Automating Accounting Processes: The OCR-trained model can be utilized in automating the input of invoice details into accounting software. This allows for quicker, more accurate data entry, reducing human error.
Invoice Management Systems: The OCR-trained model can serve as part of an advanced invoice management system that manages and organizes invoices from multiple vendors. This can simplify invoice tracking and payment processes.
Compliance and Audit: This model can be used to verify the accuracy of invoice information for compliance and audit purposes. It can identify key details like date, total amount, vendor name, etc., which can then be compared against recorded transactions.
Paperless Office Transition: Businesses seeking to transition to a paperless environment can utilize this model to digitize their existing paper invoices. This helps in efficient document management and promotes environmental sustainability.
Data Extraction for Analytics: The model can also be employed to extract data from invoices for data analysis. This could help in building predictive models, analyzing spending patterns, and optimizing vendor selection.
https://creativecommons.org/publicdomain/zero/1.0/
A typical NER system will identify various entities in the text, but not every name comes with proper context. The data set provides structured data in XML format and requires its users to extract various entities.
The data set consists of XML files and images. The XML files contain the data extracted from the invoice images; the names of the text and XML files are kept the same for clarity. Users of the dataset should extract entities such as invoice no, invoice data, company name (invoice from company1 to company2/person), telephone number of the company, address, etc.
Challenges: Invoice data contains tabular data, which is challenging to deal with, so users need to design a methodology for extracting information from tables. In addition, certain numbers in the XML are erroneous, for example '0' replaced by 'O'.
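One conservative way to handle the '0'/'O' confusion is to correct only tokens that already look numeric. The sketch below is an illustrative heuristic, not part of the dataset: the names `fix_token` and `fix_text` are hypothetical, and the rule (a token must contain at least one digit and become purely numeric after substitution) is an assumption chosen to avoid damaging ordinary words.

```python
import re

# Only the 'O' -> '0' confusion is mentioned in the text; extend with care.
CONFUSIONS = str.maketrans({"O": "0"})

# A token counts as numeric if it has only digits and common separators.
NUMERIC = re.compile(r"[0-9.,/\-]+")


def fix_token(token):
    """Replace 'O' with '0' only when the token is otherwise numeric."""
    candidate = token.translate(CONFUSIONS)
    has_digit = any(ch.isdigit() for ch in token)
    if has_digit and NUMERIC.fullmatch(candidate):
        return candidate
    return token  # Leave words and mixed identifiers untouched.


def fix_text(text):
    """Apply fix_token to every space-separated token in a string."""
    return " ".join(fix_token(token) for token in text.split(" "))


print(fix_text("Total 1O5.OO paid to ACME CO"))
```

Mixed identifiers such as `INV-2O19` are deliberately left unchanged, because the heuristic cannot tell whether the letter is an error.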
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Photographing fiscal receipts has become increasingly common with the rise of online storage and accounting services. However, capturing images in uncontrolled environments often leads to distortions that can compromise Optical Character Recognition (OCR) techniques, rendering the output text unreadable. To address this problem, we propose an open-source expert filtering approach based on low-level features to identify and discard low-quality invoice images, select high-quality images, and flag images that require preparation prior to being processed for OCR. The dataset used in this work is an extension of the Express Expense SRD dataset, which consists of 200 hand-photographed images of restaurant receipts. The free version of the original dataset has no OCR task labels. Since this information is needed to calculate the accuracy of the OCR and to analyze the effects of the proposed approach, we created a new version of the existing dataset with manual annotations for the receipts and also for the four corners of the documents.
More information can be found at the following link: https://github.com/MaVILab-UFV/Filtering-Preparation-for-OCR_SIBGRAPI-2024
If you use this data, please cite our paper as follows:
Auad, Manoela; Alves, Sarah; Kakizaki, Gabriel; Reis, Julio C. S.; Silva, Michel. A Filtering and Image Preparation Approach to Enhance OCR for Fiscal Receipts. In 37th Conference on Graphics, Patterns and Images (SIBGRAPI), 2024.
497 images of English invoice data. The images were collected against a solid-color background, and personal information is desensitized. The set includes various types of invoices and can be used for tasks such as bill recognition and text recognition.
The bbox column is [x, y, width, height]; ymean is the y position of the mean of the box; line is the line number calculated using ymean.
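The relationship between these three columns can be sketched as follows. How the dataset actually computes ymean and line is not stated, so this is an assumption: ymean is taken as the vertical centre of the box (treating y as the top edge), and a new line starts whenever the gap between consecutive ymean values exceeds a hypothetical `tolerance` parameter.

```python
def y_mean(bbox):
    """Vertical centre of a box given as [x, y, width, height]."""
    x, y, width, height = bbox
    return y + height / 2


def assign_lines(bboxes, tolerance=10.0):
    """Assign a line number to each box by grouping nearby ymean values.

    Boxes are visited from top to bottom; a new line begins when the ymean
    jumps by more than `tolerance` pixels (an assumed threshold).
    """
    order = sorted(range(len(bboxes)), key=lambda i: y_mean(bboxes[i]))
    lines = [0] * len(bboxes)
    current_line = 0
    previous = None
    for index in order:
        centre = y_mean(bboxes[index])
        if previous is not None and centre - previous > tolerance:
            current_line += 1
        lines[index] = current_line
        previous = centre
    return lines


# Hypothetical boxes: two words on one line, one word on the next line.
boxes = [[10, 100, 50, 20], [70, 102, 40, 20], [10, 140, 60, 20]]
print([y_mean(b) for b in boxes])
print(assign_lines(boxes))
```

A fixed pixel tolerance is the simplest choice; scaling it by the median box height would likely be more robust across image resolutions.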
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Einvoice OCR is a dataset for classification tasks - it contains Invoice annotations for 500 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).