31 datasets found

Dataset of invoices and receipts including annotation of relevant fields
zenodo.org
zip
Updated Apr 3, 2022
Cite
Francisco Cruz; Mauro Castelli (2022). Dataset of invoices and receipts including annotation of relevant fields [Dataset]. http://doi.org/10.5281/zenodo.6371710
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.6371710
Dataset updated
Apr 3, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Francisco Cruz; Mauro Castelli
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a dataset comprising 813 images of invoices and receipts of a private company in the Portuguese language. It also includes text files with the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference.
ocr-invoice-data
huggingface.co
Updated Oct 28, 2023
+ more versions
Cite
Philipp Schmid (2023). ocr-invoice-data [Dataset]. https://huggingface.co/datasets/philschmid/ocr-invoice-data
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 28, 2023
Authors
Philipp Schmid
Description
Dataset Card for "invoices-and-receipts_ocr_v1"

More Information needed
Invoice Ocr Dataset
universe.roboflow.com
zip
Updated Jul 8, 2024
+ more versions
Cite
Ifbind (2024). Invoice Ocr Dataset [Dataset]. https://universe.roboflow.com/ifbind-eno47/invoice-ocr-yee00/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Jul 8, 2024
Dataset authored and provided by
Ifbind
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Document Bounding Boxes
Description
Invoice Ocr

## Overview
Invoice Ocr is a dataset for object detection tasks - it contains Document annotations for 499 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Knuckle Head OCR Invoice Images Dataset - available for several industries...
datarade.ai
.csv, .xls
Updated Jan 31, 2025
+ more versions
Cite
Knuckle Head (2025). Knuckle Head OCR Invoice Images Dataset - available for several industries in USA & India [Dataset]. https://datarade.ai/data-providers/knuckle-head/data-products/ocr-invoice-dataset-available-for-several-industry-knuckle-head
Explore at:
Available download formats: .csv, .xls
Dataset updated
Jan 31, 2025
Dataset authored and provided by
Knuckle Head
Area covered
United States of America, India
Description
A dataset of one lakh (100,000) OCR invoice images from several industries, such as hotels, cab rentals, and bars. Every invoice is a high-quality image captured with a smartphone. The invoices cover businesses in the USA and India.

There are three types of invoices (Well Light, Low Light, and Shadow). Invoices were photographed indoors and outdoors against different backgrounds.

Invoices for Document AI

kaggle.com

Updated Aug 11, 2022

Cite

Holt Skinner (2022). Invoices for Document AI [Dataset]. https://www.kaggle.com/datasets/holtskinner/invoices-document-ai

Explore at:

Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Aug 11, 2022

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Holt Skinner

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Invoices in TIFF format, processed through the Document AI Invoice Parser and stored in Document.json format.

Source of TIFF Files: https://www.kaggle.com/datasets/manishthem/text-extraction-for-ocr

Document.json Structure

{
  "mimeType": string,
  "text": string,
  "pages": [
    {
      "pageNumber": integer,
      "image": {
        "content": string,
        "mimeType": string,
        "width": integer,
        "height": integer
      },
      "dimension": {
        "width": number,
        "height": number,
        "unit": string
      },
      "layout": {
        "textAnchor": {
          "textSegments": [
            {
              "startIndex": string,
              "endIndex": string
            }
          ]
        },
        "boundingPoly": {
          "vertices": [
            {
              "x": integer,
              "y": integer
            }
          ],
          "normalizedVertices": [
            {
              "x": number,
              "y": number
            }
          ]
        },
        "orientation": enum
      },
      "detectedLanguages": [
        {
          "languageCode": string,
          "confidence": number
        }
      ],
      "blocks": [
        {
          "layout": {}
        }
      ],
      "paragraphs": [
        {
          "layout": {}
        }
      ],
      "lines": [
        {
          "layout": {}
        }
      ],
      "tokens": [
        {
          "layout": {}
        }
      ]
    }
  ],
  "entities": [
    {
      "textAnchor": {},
      "type": string,
      "mentionText": string,
      "mentionId": string,
      "confidence": number,
      "pageAnchor": {
        "pageRefs": [
          {
            "page": string,
            "layoutType": enum,
            "layoutId": string,
            "boundingPoly": {},
            "confidence": number
          }
        ]
      },
      "id": string,
      "normalizedValue": {
        "text": string,
        "moneyValue": {},
        "dateValue": {},
        "datetimeValue": {},
        "addressValue": {},
        "booleanValue": boolean,
        "integerValue": integer,
        "floatValue": number
      },
      "properties": [
        {}
      ]
    }
  ]
}
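
As a rough illustration of how the structure above might be consumed, the following sketch reads entity rows and resolves a `textAnchor` back to a substring of the top-level `text` field. The field names come from the structure shown above; the sample document, the entity type `invoice_id`, and the helper names `anchor_text` and `list_entities` are hypothetical, and treating a missing `startIndex` as 0 is an assumption.

```python
def anchor_text(document: dict, text_anchor: dict) -> str:
    """Join the slices of document["text"] referenced by a textAnchor."""
    full_text = document.get("text", "")
    parts = []
    for segment in text_anchor.get("textSegments", []):
        # Indices are listed as strings in the structure above, so convert them.
        # Assumption: a missing startIndex means the segment starts at 0.
        start = int(segment.get("startIndex", 0))
        end = int(segment.get("endIndex", 0))
        parts.append(full_text[start:end])
    return "".join(parts)


def list_entities(document: dict) -> list:
    """Return (type, mentionText, confidence) for every entity."""
    rows = []
    for entity in document.get("entities", []):
        rows.append((
            entity.get("type", ""),
            entity.get("mentionText", ""),
            float(entity.get("confidence", 0.0)),
        ))
    return rows


# Hypothetical, hand-made document that follows the structure above.
sample = {
    "mimeType": "image/tiff",
    "text": "Invoice 42\nTotal 9.50",
    "pages": [],
    "entities": [
        {
            "type": "invoice_id",  # illustrative type name
            "mentionText": "42",
            "confidence": 0.98,
            "textAnchor": {
                "textSegments": [{"startIndex": "8", "endIndex": "10"}]
            },
        }
    ],
}

for row in list_entities(sample):
    print(row)
print(anchor_text(sample, sample["entities"][0]["textAnchor"]))
```

For a real file, `json.load` on the Document.json file would yield the `document` dictionary these helpers expect.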

Invoice Ocr Dataset
universe.roboflow.com
zip
Updated Aug 23, 2021
Cite
rui ke (2021). Invoice Ocr Dataset [Dataset]. https://universe.roboflow.com/rui-ke/invoice-ocr
Explore at:
Available download formats: zip
Dataset updated
Aug 23, 2021
Dataset authored and provided by
rui ke
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Variables measured
Cars Bounding Boxes
Description
Invoice Ocr

## Overview
Invoice Ocr is a dataset for object detection tasks - it contains Cars annotations for 1,000 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
Invoice Dataset
universe.roboflow.com
zip
Updated Jun 17, 2023
Cite
lker Galip (2023). Invoice Dataset [Dataset]. https://universe.roboflow.com/lker-galip/invoice-oevrd/dataset/18
Explore at:
Available download formats: zip
Dataset updated
Jun 17, 2023
Dataset authored and provided by
lker Galip
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Invoice Bounding Boxes
Description
Here are a few use cases for this project:

Digital Bookkeeping Systems: Developers can integrate the "invoice" model into bookkeeping software, where it scans, reads, and categorizes information from physical or digital invoices. It automates the accounting process by sorting invoices into specific classes like vendor_info, total_price, etc.

Expense Management Applications: Companies can use this model to simplify expense tracking, where employees just need to upload the invoice image, and the model will extract required details like vendor_info, total_price and more.

OCR (Optical Character Recognition) Systems: The "invoice" model can significantly enhance OCR systems, allowing for context-aware recognition of specific text elements within images of documents, such as an invoice's details, bank info, and customer info.

Automatic Auditing Systems: The model can be utilized by auditing firms to automate the auditing process. It would help to compare details of scanned invoices with saved financial records, spotting any disparities instantly.

Vendor Management Systems: Large companies dealing with multiple vendors can use the "invoice" model in their vendor management systems. The model would automatically extract and categorize information about the vendor and the services rendered from the invoices.
invoice-ocr-json
huggingface.co
Cite
Gokul Raja R, invoice-ocr-json [Dataset]. https://huggingface.co/datasets/GokulRajaR/invoice-ocr-json
Explore at:
Authors
Gokul Raja R
Description
Invoice OCR Dataset

This dataset contains annotated invoice images and their corresponding OCR-extracted text in structured JSON format. The data was originally sourced from an open-source invoice dataset and processed using the GPT-4o mini model to extract relevant fields such as invoice number, date, total amount, vendor, and line items.

Dataset Details

Dataset Description

This dataset is designed to support training and evaluation of document understanding… See the full description on the dataset page: https://huggingface.co/datasets/GokulRajaR/invoice-ocr-json.
Vietnamese Receipts MC_OCR 2021
kaggle.com
zip
Updated Apr 8, 2022
Cite
DoMixi1989 (2022). Vietnamese Receipts MC_OCR 2021 [Dataset]. https://www.kaggle.com/datasets/domixi1989/vietnamese-receipts-mc-ocr-2021
Explore at:
Available download formats: zip (2271709772 bytes)
Dataset updated
Apr 8, 2022
Authors
DoMixi1989
Description
Dataset

This dataset was created by DoMixi1989

Contents
Invoice and Recipt Image Dataset
data.macgence.com
mp3
Updated Jun 16, 2024
Cite
Macgence (2024). Invoice and Recipt Image Dataset [Dataset]. https://data.macgence.com/dataset/invoice-and-recipt-image-dataset
Explore at:
Available download formats: mp3
Dataset updated
Jun 16, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Unlock the potential of our invoice and receipt image dataset. Perfect for AI training, OCR development, and advancing data extraction technologies.
OCR image data of Korean documents
kaggle.com
Updated Jun 13, 2025
Cite
Appen Limited (2025). OCR image data of Korean documents [Dataset]. https://www.kaggle.com/datasets/appenlimited/ocr-image-data-of-korean-documents
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 13, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Appen Limited
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
For the complete dataset or more information, please email commercialproduct@appen.com

The dataset product can be used in many AI pilot projects and can supplement production models with additional data. It can improve model performance cost-effectively, and it is an excellent solution when time and budget are limited. The Appen database team can provide a large number of database products, such as ASR, TTS, video, text, and image data, and is constantly building new datasets to expand its resources. The team always strives to deliver as soon as possible to meet the needs of global customers.

This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including text in various languages, special characters, and numbers. The required accuracy rate is over 99% (both position and content must be correct).

The images include the following categories:
- RECEIPT
- IDCARD
- TRADE
- TABLE
- WHITEBOARD
- NEWSPAPER
- THESIS
- CARD
- NOTE
- CONTRACT
- BOOKCONTENT
- HANDWRITING

Data Specification
Usage cases: Image label recognition training
Collecting device: Mobile phone / Camera
Collecting environment: Multiple lighting environments

Database Name, Category, Quantity

Each column is one database, named "<Language> Document OCR Images".

Category     Korean  Vietnamese  Spanish  French  Thai  Japanese  Indonesian  Tamil  Burmese
RECEIPT        1500         337     1500     300  1500      1586        1500    356      300
IDCARD          500         100      500     100   500       500         500     98      100
TRADE          1012         227     1000     200  1000      1000        1003    475      200
TABLE           512         100      500     100   537       552         500    532      117
WHITEBOARD      500         111      500     100   500       500         501    501      110
NEWSPAPER       500         100      500     100   500       500         502    500      108
THESIS          500         100      500     103   500       509         500    500      102
CARD            500         100      500     100   500       500         500    500      100
NOTE            499         100      500     100   500       500         500    501      120
CONTRACT        501         105      500     100   500       500         500    500      100
BOOKCONTENT     500         700      500     700   500       500         500    500      761
TOTAL          7024        2080     7000    2003  7037      7147        7006   4963     2118

English Handwritten Datasets: HANDWRITING 2278
Chinese Handwritten Datasets: HANDWRITING 11118
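
The per-category quantities listed above can be checked against the stated totals with a few lines of arithmetic. The sketch below only re-types the numbers from the description; the variable names are illustrative, and the category order is RECEIPT, IDCARD, TRADE, TABLE, WHITEBOARD, NEWSPAPER, THESIS, CARD, NOTE, CONTRACT, BOOKCONTENT.

```python
# (per-category quantities, stated TOTAL) for each document OCR database
quantities = {
    "Korean":     ([1500, 500, 1012, 512, 500, 500, 500, 500, 499, 501, 500], 7024),
    "Vietnamese": ([337, 100, 227, 100, 111, 100, 100, 100, 100, 105, 700], 2080),
    "Spanish":    ([1500, 500, 1000, 500, 500, 500, 500, 500, 500, 500, 500], 7000),
    "French":     ([300, 100, 200, 100, 100, 100, 103, 100, 100, 100, 700], 2003),
    "Thai":       ([1500, 500, 1000, 537, 500, 500, 500, 500, 500, 500, 500], 7037),
    "Japanese":   ([1586, 500, 1000, 552, 500, 500, 509, 500, 500, 500, 500], 7147),
    "Indonesian": ([1500, 500, 1003, 500, 501, 502, 500, 500, 500, 500, 500], 7006),
    "Tamil":      ([356, 98, 475, 532, 501, 500, 500, 500, 501, 500, 500], 4963),
    "Burmese":    ([300, 100, 200, 117, 110, 108, 102, 100, 120, 100, 761], 2118),
}

# Languages whose category counts do not add up to the stated total.
mismatches = {
    language: (sum(counts), stated)
    for language, (counts, stated) in quantities.items()
    if sum(counts) != stated
}

# Number of document images across the nine language databases
# (the two handwriting datasets are not included).
document_images = sum(stated for _counts, stated in quantities.values())

print("mismatches:", mismatches)
print("document images:", document_images)
```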

Information provided by database

Data Format: .JPG
invoices-google-ocr
huggingface.co
Updated Apr 23, 2024
+ more versions
Cite
Andrew Mayes (2024). invoices-google-ocr [Dataset]. https://huggingface.co/datasets/amaye15/invoices-google-ocr
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 23, 2024
Authors
Andrew Mayes
Description
amaye15/invoices-google-ocr dataset hosted on Hugging Face and contributed by the HF Datasets community
OCR Document Text Recognition Dataset
kaggle.com
Updated Sep 7, 2023
+ more versions
Cite
Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 7, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Training Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on TrainingData.

The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

Dataset structure

images - contains the original images of documents

boxes - includes bounding box labeling for the original images

annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

Data Format

Each image in the images folder is accompanied by an XML annotation in the annotations.xml file, indicating the coordinates of the bounding boxes and the labels for text detection. For each point, the x and y coordinates are provided.

Labels for the text:

"Text Title" - corresponds to titles, the box is red

"Text Paragraph" - corresponds to paragraphs of text, the box is blue

"Table" - corresponds to the table, the box is green

"Handwritten" - corresponds to handwritten text, the box is purple

Example of XML file structure

[Image: example of the XML file structure]
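
Since the XML example is only available as an image, here is a minimal sketch of how such a file might be read. The element and attribute names (`image`, `name`, `label`, `points` written as `x,y` pairs separated by `;`) are assumptions about the layout of annotations.xml, not something the description confirms; only the four labels and their box colors are taken from the text above. The function name `parse_annotations` and the sample file content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Label-to-color mapping as stated in the description.
LABEL_COLORS = {
    "Text Title": "red",
    "Text Paragraph": "blue",
    "Table": "green",
    "Handwritten": "purple",
}


def parse_annotations(xml_text: str) -> dict:
    """Map each image name to its list of labeled shapes.

    Assumed layout (hypothetical): <image name="..."> elements whose children
    carry a `label` attribute and a `points` attribute such as "x1,y1;x2,y2".
    """
    root = ET.fromstring(xml_text)
    result = {}
    for image in root.iter("image"):
        shapes = []
        for shape in image:
            label = shape.get("label")
            points_attr = shape.get("points")
            if label is None or points_attr is None:
                continue  # skip children that do not look like shapes
            points = [
                tuple(float(value) for value in pair.split(","))
                for pair in points_attr.split(";")
            ]
            shapes.append({
                "label": label,
                "color": LABEL_COLORS.get(label, "unknown"),
                "points": points,
            })
        result[image.get("name")] = shapes
    return result


# Hypothetical sample in the assumed layout.
sample_xml = """
<annotations>
  <image id="0" name="doc_0.png" width="800" height="600">
    <polygon label="Text Title" points="10.0,20.0;200.0,20.0;200.0,60.0;10.0,60.0"/>
    <polygon label="Table" points="10.0,100.0;500.0,100.0;500.0,300.0;10.0,300.0"/>
  </image>
</annotations>
"""

parsed = parse_annotations(sample_xml)
for shape in parsed["doc_0.png"]:
    print(shape["label"], shape["color"], shape["points"][0])
```

If the real file uses different element or attribute names, only the lookups inside `parse_annotations` would need to change.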

The Text Detection in the Documents dataset can be tailored to your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
CORU
huggingface.co
Updated Jun 19, 2025
Cite
Abdelrahman Abdallah (2025). CORU [Dataset]. https://huggingface.co/datasets/abdoelsayed/CORU
Explore at:
Dataset updated
Jun 19, 2025
Authors
Abdelrahman Abdallah
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
ReceiptSense: Beyond Traditional OCR - A Dataset for Receipt Understanding

🔥 News

[2024] ReceiptSense dataset is now publicly available!
[2024] Paper accepted and published

📖 Abstract

Multilingual OCR and information extraction from receipts remains challenging, particularly for complex scripts like Arabic. We introduce ReceiptSense, a comprehensive dataset designed for Arabic-English receipt understanding comprising:

20,000 annotated receipts… See the full description on the dataset page: https://huggingface.co/datasets/abdoelsayed/CORU.
Ocr Trained Dataset
universe.roboflow.com
zip
Updated Mar 19, 2023
Cite
Xerosum (2023). Ocr Trained Dataset [Dataset]. https://universe.roboflow.com/xerosum/ocr-trained/dataset/2
Explore at:
Available download formats: zip
Dataset updated
Mar 19, 2023
Dataset authored and provided by
Xerosum
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Invoices Info Bounding Boxes
Description
Here are a few use cases for this project:

Automating Accounting Processes: The OCR-trained model can be utilized in automating the input of invoice details into accounting software. This allows for quicker, more accurate data entry, reducing human error.

Invoice Management Systems: The OCR-trained model can serve as part of an advanced invoice management system that manages and organizes invoices from multiple vendors. This can simplify invoice tracking and payment processes.

Compliance and Audit: This model can be used to verify the accuracy of invoice information for compliance and audit purposes. It can identify key details like date, total amount, vendor name, etc., which can then be compared against recorded transactions.

Paperless Office Transition: Businesses seeking to transition to a paperless environment can utilize this model to digitize their existing paper invoices. This helps in efficient document management and promotes environmental sustainability.

Data Extraction for Analytics: The model can also be employed to extract data from invoices for data analysis. This could help in building predictive models, analyzing spending patterns, and optimizing vendor selection.
Text extraction for OCR
kaggle.com
Updated Mar 20, 2021
Cite
manishthemanu (2021). Text extraction for OCR [Dataset]. https://www.kaggle.com/datasets/manishthem/text-extraction-for-ocr/code
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 20, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
manishthemanu
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

A typical NER model identifies various entities in text, but not every name comes with proper context. This data set provides structured data in XML format and requires its users to extract various entities.

Content

The data set consists of XML files and images. The XML files contain the data extracted from the invoice images; the text and XML files share the same name for clarity. Users of the dataset should extract entities such as invoice no, invoice data, company name (invoice from company1 to company2/person), telephone number of the company, address, etc.

Inspiration

Challenges: Invoice data contains tabular data, which is challenging to deal with. Design a methodology to extract information from tabular data. For obvious reasons, certain numbers in the XML are erroneous, e.g. '0' replaced by 'O'.
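
One possible way to handle the '0' versus 'O' confusion mentioned above is a small post-processing pass over the extracted text. The sketch below is an illustrative heuristic, not part of the dataset: it rewrites the letter O only inside tokens that consist of digits, O/o, and common numeric separators and that contain at least one real digit. The function name `repair_letter_o` and the sample string are hypothetical.

```python
import re

# Digits and letter O/o, optionally joined by common numeric separators.
# The lookarounds keep a match from starting or ending inside a word.
_NUMERIC_LIKE = re.compile(
    r"(?<![A-Za-z0-9])[0-9Oo]+(?:[.,/-][0-9Oo]+)*(?![A-Za-z0-9])"
)


def repair_letter_o(text: str) -> str:
    """Replace O/o with 0 inside number-like tokens only."""
    def _fix(match):
        token = match.group(0)
        # Leave tokens without any real digit alone (e.g. a lone "O").
        if not any(ch.isdigit() for ch in token):
            return token
        return token.replace("O", "0").replace("o", "0")

    return _NUMERIC_LIKE.sub(_fix, text)


sample = "Invoice No 1O45, total 2OO.5O on 12/O3/2021 for Otto"
repaired = repair_letter_o(sample)
print(repaired)
```

The heuristic is deliberately conservative: words such as "on" or "Otto" are left untouched, at the cost of missing identifiers that mix letters and digits.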
A labeled dataset of hand-captured images of restaurant receipts
zenodo.org
zip
Updated Sep 3, 2024
Cite
Manoela Auad; Sarah Alves; Gabriel Kakizaki; Julio Reis; Michel Silva (2024). A labeled dataset of hand-captured images of restaurant receipts [Dataset]. http://doi.org/10.5281/zenodo.13633335
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.13633335
Dataset updated
Sep 3, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Manoela Auad; Sarah Alves; Gabriel Kakizaki; Julio Reis; Michel Silva
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Photographing fiscal receipts has become increasingly common with the rise of online storage and accounting services. However, capturing images in uncontrolled environments often leads to distortions that can compromise Optical Character Recognition (OCR) techniques, rendering the output text unreadable. To address this problem, we propose an open-source expert filtering approach based on low-level features to identify and discard low-quality invoice images, select high-quality images, and flag images that require preparation prior to being processed for OCR. The dataset used in this work is an extension of the Express Expense SRD dataset, which consists of 200 hand-photographed images of restaurant receipts. The free version of the original dataset has no OCR task labels. Since this information is needed to calculate the accuracy of the OCR and to analyze the effects of the proposed approach, we created a new version of the existing dataset with manual annotations for the receipts and also for the four corners of the documents.

More information can be found at the following link: https://github.com/MaVILab-UFV/Filtering-Preparation-for-OCR_SIBGRAPI-2024

If you use this data, please cite our paper as follows

Auad, Manoela; Alves, Sarah; Kakizaki, Gabriel; Reis, Julio C. S.; Silva, Michel. A Filtering and Image Preparation Approach to Enhance OCR for Fiscal Receipts. In 37th Conference on Graphics, Patterns and Images (SIBGRAPI), 2024.
497 Images – English Invoice Data
nexdata.ai
Updated Feb 2, 2024
Cite
Nexdata (2024). 497 Images – English Invoice Data [Dataset]. https://www.nexdata.ai/datasets/ocr/1392
Explore at:
Dataset updated
Feb 2, 2024
Dataset authored and provided by
Nexdata
Variables measured
Device, Data size, Data format, Data diversity, Annotation format, Collecting environment
Description
497 images of English invoice data. The images were collected against a solid-color background, and personal information has been desensitized. The set includes various types of invoices and can be used for tasks such as bill recognition and text recognition.
invoices-donut-data-v1-with-ocr
huggingface.co
Updated Mar 23, 2019
Cite
Marco Pansa (2019). invoices-donut-data-v1-with-ocr [Dataset]. https://huggingface.co/datasets/MJPansa/invoices-donut-data-v1-with-ocr
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 23, 2019
Authors
Marco Pansa
Description
bbox column is [x, y, width, height]
ymean is y position of the mean of the box
line is the line number calculated using ymean
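
To make the three columns concrete, the sketch below derives `ymean` and `line` from a list of `[x, y, width, height]` boxes. Two points are assumptions, since the description does not spell them out: `ymean` is taken to be the vertical center `y + height / 2`, and boxes are grouped into the same line when consecutive `ymean` values differ by no more than a tolerance. The function name `assign_lines`, the `tolerance` parameter, and the sample boxes are hypothetical.

```python
def assign_lines(bboxes, tolerance=10.0):
    """Return (ymean, line) for each [x, y, width, height] box, in input order.

    Assumptions: ymean is the vertical center of the box, and a new line
    starts when the gap to the previous ymean exceeds `tolerance`.
    """
    ymeans = [y + height / 2.0 for (_x, y, _width, height) in bboxes]

    # Visit boxes from top to bottom while remembering their original index.
    order = sorted(range(len(bboxes)), key=lambda index: ymeans[index])

    lines = [0] * len(bboxes)
    current_line = 0
    previous_ymean = None
    for index in order:
        if previous_ymean is not None and ymeans[index] - previous_ymean > tolerance:
            current_line += 1
        lines[index] = current_line
        previous_ymean = ymeans[index]

    return list(zip(ymeans, lines))


# Hypothetical boxes: two words on one line, one word on the next.
boxes = [[10, 100, 50, 20], [70, 102, 40, 20], [10, 140, 80, 20]]
rows = assign_lines(boxes)
for box, (ymean, line) in zip(boxes, rows):
    print(box, ymean, line)
```

A fixed tolerance is the simplest choice; scaling it by the median box height would be a reasonable alternative for documents with mixed font sizes.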
Einvoice Ocr Dataset
universe.roboflow.com
zip
Updated Jan 29, 2024
Cite
RPAGUI (2024). Einvoice Ocr Dataset [Dataset]. https://universe.roboflow.com/rpagui/einvoice-ocr
Explore at:
Available download formats: zip
Dataset updated
Jan 29, 2024
Dataset authored and provided by
RPAGUI
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Invoice
Description
Einvoice OCR

## Overview
Einvoice OCR is a dataset for classification tasks - it contains Invoice annotations for 500 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).

