31 datasets found
  1. Dataset of invoices and receipts including annotation of relevant fields

    • zenodo.org
    zip
    Updated Apr 3, 2022
    Cite
    Francisco Cruz; Mauro Castelli (2022). Dataset of invoices and receipts including annotation of relevant fields [Dataset]. http://doi.org/10.5281/zenodo.6371710
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 3, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francisco Cruz; Mauro Castelli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset comprising 813 images of invoices and receipts of a private company in the Portuguese language. It also includes text files with the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference.

  2. ocr-invoice-data

    • huggingface.co
    Updated Oct 28, 2023
    + more versions
    Cite
    Philipp Schmid (2023). ocr-invoice-data [Dataset]. https://huggingface.co/datasets/philschmid/ocr-invoice-data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 28, 2023
    Authors
    Philipp Schmid
    Description

    Dataset Card for "invoices-and-receipts_ocr_v1"

    More Information needed

  3. Invoice Ocr Dataset

    • universe.roboflow.com
    zip
    Updated Jul 8, 2024
    + more versions
    Cite
    Ifbind (2024). Invoice Ocr Dataset [Dataset]. https://universe.roboflow.com/ifbind-eno47/invoice-ocr-yee00/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 8, 2024
    Dataset authored and provided by
    Ifbind
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Document Bounding Boxes
    Description

    Invoice Ocr

    ## Overview
    
    Invoice Ocr is a dataset for object detection tasks - it contains Document annotations for 499 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  4. Knuckle Head OCR Invoice Images Dataset - available for several industries...

    • datarade.ai
    .csv, .xls
    Updated Jan 31, 2025
    + more versions
    Cite
    Knuckle Head (2025). Knuckle Head OCR Invoice Images Dataset - available for several industries in USA & India [Dataset]. https://datarade.ai/data-providers/knuckle-head/data-products/ocr-invoice-dataset-available-for-several-industry-knuckle-head
    Explore at:
    Available download formats: .csv, .xls
    Dataset updated
    Jan 31, 2025
    Dataset authored and provided by
    Knuckle Head
    Area covered
    United States of America, India
    Description

    A dataset of one lakh (100,000) OCR invoice images from several industries, such as hotels, cab rentals, and bars. Every invoice is a high-quality image taken with a smartphone. The invoices cover businesses in the USA and India.

    There are three types of invoice images (well lit, low light, and shadow). The invoices were photographed indoors and outdoors against different backgrounds.

  5. Invoices for Document AI

    • kaggle.com
    Updated Aug 11, 2022
    Cite
    Holt Skinner (2022). Invoices for Document AI [Dataset]. https://www.kaggle.com/datasets/holtskinner/invoices-document-ai
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 11, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Holt Skinner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Invoices in TIFF Format processed through Document AI Invoice Parser in Document.json format.

    Source of TIFF Files: https://www.kaggle.com/datasets/manishthem/text-extraction-for-ocr

    Document.json Structure

    {
      "mimeType": string,
      "text": string,
      "pages": [
        {
          "pageNumber": integer,
          "image": {
            "content": string,
            "mimeType": string,
            "width": integer,
            "height": integer
          },
          "dimension": {
            "width": number,
            "height": number,
            "unit": string
          },
          "layout": {
            "textAnchor": {
              "textSegments": [
                {
                  "startIndex": string,
                  "endIndex": string
                }
              ]
            },
            "boundingPoly": {
              "vertices": [
                {
                  "x": integer,
                  "y": integer
                }
              ],
              "normalizedVertices": [
                {
                  "x": number,
                  "y": number
                }
              ]
            },
            "orientation": enum
          },
          "detectedLanguages": [
            {
              "languageCode": string,
              "confidence": number
            }
          ],
          "blocks": [
            {
              "layout": {}
            }
          ],
          "paragraphs": [
            {
              "layout": {}
            }
          ],
          "lines": [
            {
              "layout": {}
            }
          ],
          "tokens": [
            {
              "layout": {}
            }
          ]
        }
      ],
      "entities": [
        {
          "textAnchor": {},
          "type": string,
          "mentionText": string,
          "mentionId": string,
          "confidence": number,
          "pageAnchor": {
            "pageRefs": [
              {
                "page": string,
                "layoutType": enum,
                "layoutId": string,
                "boundingPoly": {},
                "confidence": number
              }
            ]
          },
          "id": string,
          "normalizedValue": {
            "text": string,
            "moneyValue": {},
            "dateValue": {},
            "datetimeValue": {},
            "addressValue": {},
            "booleanValue": boolean,
            "integerValue": integer,
            "floatValue": number
          },
          "properties": [
            {}
          ]
        }
      ]
    }
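
As a rough illustration of how the structure above might be consumed, the following Python sketch reads the top-level `entities` array and resolves a `textAnchor` against the document-level `text` field. The function names (`load_document`, `extract_entities`, `anchored_text`) and the `min_confidence` parameter are hypothetical helpers introduced here, not part of the dataset or of Document AI. The sketch assumes `startIndex` and `endIndex` are string-encoded integers, as the schema indicates, and that a missing `startIndex` means 0.

```python
import json
from typing import Any


def load_document(path: str) -> dict[str, Any]:
    """Load one Document.json file from disk (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def extract_entities(document: dict[str, Any], min_confidence: float = 0.0) -> list[dict[str, Any]]:
    """Collect type, mention text, normalized text and confidence for each entity."""
    results = []
    for entity in document.get("entities", []):
        confidence = float(entity.get("confidence", 0.0))
        if confidence < min_confidence:
            continue  # skip low-confidence entities
        results.append({
            "type": entity.get("type"),
            "text": entity.get("mentionText"),
            "normalized": entity.get("normalizedValue", {}).get("text"),
            "confidence": confidence,
        })
    return results


def anchored_text(document: dict[str, Any], text_anchor: dict[str, Any]) -> str:
    """Resolve a textAnchor into the matching substring of the document text."""
    full_text = document.get("text", "")
    parts = []
    for segment in text_anchor.get("textSegments", []):
        start = int(segment.get("startIndex", 0))  # assumed 0 when omitted
        end = int(segment.get("endIndex", 0))
        parts.append(full_text[start:end])
    return "".join(parts)
```

A caller would typically run `extract_entities(load_document("some_invoice.json"), min_confidence=0.5)`, where the file name is a placeholder.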
    
  6. Invoice Ocr Dataset

    • universe.roboflow.com
    zip
    Updated Aug 23, 2021
    Cite
    rui ke (2021). Invoice Ocr Dataset [Dataset]. https://universe.roboflow.com/rui-ke/invoice-ocr
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 23, 2021
    Dataset authored and provided by
    rui ke
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Cars Bounding Boxes
    Description

    Invoice Ocr

    ## Overview
    
    Invoice Ocr is a dataset for object detection tasks - it contains Cars annotations for 1,000 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
    
  7. Invoice Dataset

    • universe.roboflow.com
    zip
    Updated Jun 17, 2023
    Cite
    lker Galip (2023). Invoice Dataset [Dataset]. https://universe.roboflow.com/lker-galip/invoice-oevrd/dataset/18
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 17, 2023
    Dataset authored and provided by
    lker Galip
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Invoice Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Digital Bookkeeping Systems: Developers can integrate the "invoice" model into bookkeeping software, where it scans, reads, and categorizes information from physical or digital invoices. It automates the accounting process by sorting invoices into specific classes like vendor_info, total_price, etc.

    2. Expense Management Applications: Companies can use this model to simplify expense tracking, where employees just need to upload the invoice image, and the model will extract required details like vendor_info, total_price and more.

    3. OCR (Optical Character Recognition) Systems: The "invoice" model can significantly enhance OCR systems, allowing for context-aware recognition of specific text elements within images of documents, such as an invoice's details, bank info, and customer info.

    4. Automatic Auditing Systems: The model can be utilized by auditing firms to automate the auditing process. It would help to compare details of scanned invoices with saved financial records, spotting any disparities instantly.

    5. Vendor Management Systems: Large companies dealing with multiple vendors can use the "invoice" model in their vendor management systems. The model would automatically extract and categorize information about the vendor and the services rendered from the invoices.

  8. invoice-ocr-json

    • huggingface.co
    Cite
    Gokul Raja R, invoice-ocr-json [Dataset]. https://huggingface.co/datasets/GokulRajaR/invoice-ocr-json
    Explore at:
    Authors
    Gokul Raja R
    Description

    Invoice OCR Dataset

    This dataset contains annotated invoice images and their corresponding OCR-extracted text in structured JSON format. The data was originally sourced from an open-source invoice dataset and processed using the GPT-4o mini model to extract relevant fields such as invoice number, date, total amount, vendor, and line items.

    Dataset Details

    Dataset Description

    This dataset is designed to support training and evaluation of document understanding… See the full description on the dataset page: https://huggingface.co/datasets/GokulRajaR/invoice-ocr-json.

  9. Vietnamese Receipts MC_OCR 2021

    • kaggle.com
    zip
    Updated Apr 8, 2022
    Cite
    DoMixi1989 (2022). Vietnamese Receipts MC_OCR 2021 [Dataset]. https://www.kaggle.com/datasets/domixi1989/vietnamese-receipts-mc-ocr-2021
    Explore at:
    Available download formats: zip (2271709772 bytes)
    Dataset updated
    Apr 8, 2022
    Authors
    DoMixi1989
    Description

    Dataset

    This dataset was created by DoMixi1989

    Contents

  10. Invoice and Recipt Image Dataset

    • data.macgence.com
    mp3
    Updated Jun 16, 2024
    Cite
    Macgence (2024). Invoice and Recipt Image Dataset [Dataset]. https://data.macgence.com/dataset/invoice-and-recipt-image-dataset
    Explore at:
    Available download formats: mp3
    Dataset updated
    Jun 16, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Unlock the potential of our invoice and receipt image dataset. Perfect for AI training, OCR development, and advancing data extraction technologies.

  11. OCR image data of Korean documents

    • kaggle.com
    Updated Jun 13, 2025
    Cite
    Appen Limited (2025). OCR image data of Korean documents [Dataset]. https://www.kaggle.com/datasets/appenlimited/ocr-image-data-of-korean-documents
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Appen Limited
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    For the complete dataset or more information, please email commercialproduct@appen.com

    The dataset product can be used in many AI pilot projects and can supplement production models with other data. It can improve model performance and be cost-effective. A dataset is an excellent solution when time and budget are limited. The Appen database team can provide a large number of database products, such as ASR, TTS, video, text, and image. At the same time, we are constantly building new datasets to expand resources. The database team always strives to deliver as soon as possible to meet the needs of global customers.

    This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including text in various languages, special characters, and numbers. The accuracy rate requirement is over 99% (both position and content are correct).

    The images include the following categories: RECEIPT, IDCARD, TRADE, TABLE, WHITEBOARD, NEWSPAPER, THESIS, CARD, NOTE, CONTRACT, BOOKCONTENT, HANDWRITING

    Data Specification

    • Usage cases: image label recognition training
    • Collecting device: mobile phone / camera
    • Collecting environment: multiple lighting environments

    Document OCR Images: quantity by database (language) and category

    Database    RECEIPT  IDCARD  TRADE  TABLE  WHITEBOARD  NEWSPAPER  THESIS  CARD  NOTE  CONTRACT  BOOKCONTENT  TOTAL
    Korean      1500     500     1012   512    500         500        500     500   499   501       500          7024
    Vietnamese  337      100     227    100    111         100        100     100   100   105       700          2080
    Spanish     1500     500     1000   500    500         500        500     500   500   500       500          7000
    French      300      100     200    100    100         100        103     100   100   100       700          2003
    Thai        1500     500     1000   537    500         500        500     500   500   500       500          7037
    Japanese    1586     500     1000   552    500         500        509     500   500   500       500          7147
    Indonesian  1500     500     1003   500    501         502        500     500   500   500       500          7006
    Tamil       356      98      475    532    501         500        500     500   501   500       500          4963
    Burmese     300      100     200    117    110         108        102     100   120   100       761          2118

    Handwritten datasets (HANDWRITING category)

    English Handwritten Datasets  2278
    Chinese Handwritten Datasets  11118

    1. Information provided by database
    2. Data Format: JPG
  12. invoices-google-ocr

    • huggingface.co
    Updated Apr 23, 2024
    + more versions
    Cite
    Andrew Mayes (2024). invoices-google-ocr [Dataset]. https://huggingface.co/datasets/amaye15/invoices-google-ocr
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 23, 2024
    Authors
    Andrew Mayes
    Description

    amaye15/invoices-google-ocr dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. OCR Document Text Recognition Dataset

    • kaggle.com
    Updated Sep 7, 2023
    + more versions
    Cite
    Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    OCR Text Detection in the Documents Object Detection dataset

    The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

    The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

    💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on TrainingData.

    The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.


    Dataset structure

    • images - contains the original images of documents
    • boxes - includes bounding box labeling for the original images
    • annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

    Data Format

    Each image from images folder is accompanied by an XML-annotation in the annotations.xml file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.

    Labels for the text:

    • "Text Title" - corresponds to titles, the box is red
    • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
    • "Table" - corresponds to the table, the box is green
    • "Handwritten" - corresponds to handwritten text, the box is purple

    Example of XML file structure

    [Image: example of the annotations.xml file structure]

    Text detection in documents can be carried out in accordance with your requirements.

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text

  14. CORU

    • huggingface.co
    Updated Jun 19, 2025
    Cite
    Abdelrahman Abdallah (2025). CORU [Dataset]. https://huggingface.co/datasets/abdoelsayed/CORU
    Explore at:
    Dataset updated
    Jun 19, 2025
    Authors
    Abdelrahman Abdallah
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ReceiptSense: Beyond Traditional OCR - A Dataset for Receipt Understanding

    🔥 News

    [2024] ReceiptSense dataset is now publicly available!
    [2024] Paper accepted and published

    📖 Abstract

    Multilingual OCR and information extraction from receipts remains challenging, particularly for complex scripts like Arabic. We introduce ReceiptSense, a comprehensive dataset designed for Arabic-English receipt understanding comprising:

    20,000 annotated receipts… See the full description on the dataset page: https://huggingface.co/datasets/abdoelsayed/CORU.

  15. Ocr Trained Dataset

    • universe.roboflow.com
    zip
    Updated Mar 19, 2023
    Cite
    Xerosum (2023). Ocr Trained Dataset [Dataset]. https://universe.roboflow.com/xerosum/ocr-trained/dataset/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 19, 2023
    Dataset authored and provided by
    Xerosum
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Invoices Info Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Automating Accounting Processes: The OCR-trained model can be utilized in automating the input of invoice details into accounting software. This allows for quicker, more accurate data entry, reducing human error.

    2. Invoice Management Systems: OCR-trained can serve as a part of an advanced invoice management system that manages and organizes invoices from multiple vendors. This can simplify invoice tracking and payment processes.

    3. Compliance and Audit: This model can be used to verify the accuracy of invoice information for compliance and audit purposes. It can identify key details like date, total amount, vendor name, etc., which can then be compared against recorded transactions.

    4. Paperless Office Transition: Businesses seeking to transition to a paperless environment can utilize this model to digitize their existing paper invoices. This helps in efficient document management and promotes environmental sustainability.

    5. Data Extraction for Analytics: The model can also be employed to extract data from invoices for data analysis. This could help in building predictive models, analyzing spending patterns, and optimizing vendor selection.

  16. Text extraction for OCR

    • kaggle.com
    Updated Mar 20, 2021
    Cite
    manishthemanu (2021). Text extraction for OCR [Dataset]. https://www.kaggle.com/datasets/manishthem/text-extraction-for-ocr/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 20, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    manishthemanu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Typical NER will identify various entities in the text, but not every name comes with proper context. The dataset provides structured data in XML format and requires its users to extract various entities.

    Content

    The dataset consists of XML files and images. The XML files contain the data extracted from the invoice images; the names of the text and XML files are kept the same for clarity. Users of the dataset should extract entities such as invoice no, invoice data, company name (invoice from company1 to company2/person), telephone number of the company, address, etc.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Challenges: Invoice data contains tabular data, which is challenging to deal with. Design a methodology to extract information from tabular data. For obvious reasons, certain numbers in the XML are erroneous, e.g., '0' replaced by 'O'.
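
The '0' versus 'O' confusion mentioned above can be partly cleaned up after extraction. The sketch below is one possible heuristic, not something the dataset provides: the helper name `fix_numeric_token` is hypothetical, and the rule (only rewrite a token if it already contains a digit and becomes purely numeric after substitution) is an assumption chosen to avoid damaging ordinary words.

```python
import re

# Map the letter O (upper and lower case) to the digit zero.
_LETTER_O_TO_ZERO = str.maketrans({"O": "0", "o": "0"})

# Characters allowed in a "numeric-looking" token: digits plus common separators.
_NUMERIC_PATTERN = re.compile(r"[\d.,/\-]+")


def fix_numeric_token(token: str) -> str:
    """Replace letter O with zero only in tokens that are otherwise numeric."""
    candidate = token.translate(_LETTER_O_TO_ZERO)
    has_real_digit = any(ch.isdigit() for ch in token)
    if has_real_digit and _NUMERIC_PATTERN.fullmatch(candidate):
        return candidate
    return token  # leave words such as "Total" or "OK" untouched


def fix_numeric_text(text: str) -> str:
    """Apply the token-level fix to every whitespace-separated token."""
    return " ".join(fix_numeric_token(tok) for tok in text.split())
```

Tokens with no genuine digit are left alone, so a lone "O" or a word like "OK" is never rewritten; the trade-off is that a value misread entirely as letters will not be repaired.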

  17. A labeled dataset of hand-captured images of restaurant receipts

    • zenodo.org
    zip
    Updated Sep 3, 2024
    Cite
    Manoela Auad; Sarah Alves; Gabriel Kakizaki; Julio Reis; Michel Silva (2024). A labeled dataset of hand-captured images of restaurant receipts [Dataset]. http://doi.org/10.5281/zenodo.13633335
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Manoela Auad; Sarah Alves; Gabriel Kakizaki; Julio Reis; Michel Silva
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Photographing fiscal receipts has become increasingly common with the rise of online storage and accounting services. However, capturing images in uncontrolled environments often leads to distortions that can compromise Optical Character Recognition (OCR) techniques, rendering the output text unreadable. To address this problem, we propose an open-source expert filtering approach based on low-level features to identify and discard low-quality invoice images, select high-quality images, and flag images that require preparation prior to being processed for OCR. The dataset used in this work is an extension of the Express Expense SRD dataset, which consists of 200 hand-photographed images of restaurant receipts. The free version of the original dataset has no OCR task labels. Since this information is needed to calculate the accuracy of the OCR and to analyze the effects of the proposed approach, we created a new version of the existing dataset with manual annotations for the receipts and also for the four corners of the documents.

    More information can be found at the following link: https://github.com/MaVILab-UFV/Filtering-Preparation-for-OCR_SIBGRAPI-2024

    If you use this data, please cite our paper as follows

    Auad, Manoela; Alves, Sarah; Kakizaki, Gabriel; Reis, Julio C. S.; Silva, Michel. A Filtering and Image Preparation Approach to Enhance OCR for Fiscal Receipts. In 37th Conference on Graphics, Patterns and Images (SIBGRAPI), 2024.

  18. 497 Images – English Invoice Data

    • nexdata.ai
    Updated Feb 2, 2024
    Cite
    Nexdata (2024). 497 Images – English Invoice Data [Dataset]. https://www.nexdata.ai/datasets/ocr/1392
    Explore at:
    Dataset updated
    Feb 2, 2024
    Dataset authored and provided by
    Nexdata
    Variables measured
    Device, Data size, Data format, Data diversity, Annotation format, Collecting environment
    Description

    497 Images – English Invoice Data. The collection background is a solid color, and personal information is desensitized. The data includes various types of invoices and can be used for tasks such as bill recognition and text recognition.

  19. invoices-donut-data-v1-with-ocr

    • huggingface.co
    Updated Mar 23, 2019
    Cite
    Marco Pansa (2019). invoices-donut-data-v1-with-ocr [Dataset]. https://huggingface.co/datasets/MJPansa/invoices-donut-data-v1-with-ocr
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 23, 2019
    Authors
    Marco Pansa
    Description

    • bbox column is [x, y, width, height]
    • ymean is the y position of the mean of the box
    • line is the line number calculated using ymean
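
To make the relationship between these columns concrete, here is a minimal Python sketch of how `ymean` and `line` could be derived from `bbox`. The dataset card does not state the exact rule, so several points are assumptions: that `y` is the top edge of the box (so the vertical mean is `y + height / 2`), that lines are numbered from 0 going down the page, and that a new line starts when the vertical gap exceeds a `tolerance` value. The function names are hypothetical.

```python
from typing import Sequence


def y_mean(bbox: Sequence[float]) -> float:
    """Vertical centre of a [x, y, width, height] box, assuming y is the top edge."""
    _x, y, _width, height = bbox
    return y + height / 2.0


def assign_lines(bboxes: Sequence[Sequence[float]], tolerance: float = 10.0) -> list[int]:
    """Give each box a line number by grouping boxes with similar ymean values.

    Boxes are visited from top to bottom; a new line starts whenever the
    ymean jumps by more than `tolerance` (an assumed threshold, in the same
    units as the bbox coordinates).
    """
    order = sorted(range(len(bboxes)), key=lambda i: y_mean(bboxes[i]))
    lines = [0] * len(bboxes)
    current_line = 0
    previous = None
    for index in order:
        centre = y_mean(bboxes[index])
        if previous is not None and centre - previous > tolerance:
            current_line += 1
        lines[index] = current_line
        previous = centre
    return lines
```

The returned list is aligned with the input order, so it can be stored next to the original `bbox` values. A suitable tolerance depends on image resolution and font size.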

  20. Einvoice Ocr Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    RPAGUI (2024). Einvoice Ocr Dataset [Dataset]. https://universe.roboflow.com/rpagui/einvoice-ocr
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 29, 2024
    Dataset authored and provided by
    RPAGUI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Invoice
    Description

    Einvoice OCR

    ## Overview
    
    Einvoice OCR is a dataset for classification tasks - it contains Invoice annotations for 500 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    