License: http://opendatacommons.org/licenses/dbcl/1.0/
The invoice dataset provided is a mock dataset generated using the Python Faker library. It has been designed to mimic the format of data collected from an online store. The dataset contains various fields, including first name, last name, email, product ID, quantity, amount, invoice date, address, city, and stock code. All of the data in the dataset is randomly generated and does not represent actual individuals or products. The dataset can be used for various purposes, including testing algorithms or models related to invoice management, e-commerce, or customer behavior analysis. The data in this dataset can be used to identify trends, patterns, or anomalies in online shopping behavior, which can help businesses to optimize their online sales strategies.
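As a rough illustration of the kind of rows such a mock dataset holds, the sketch below builds a few records with the fields listed above. It uses only Python's standard `random` module as a stand-in for Faker, and every value format here (the name pools, the `P####` product IDs, the `SC#####` stock codes, the date range, the column names) is a made-up assumption rather than the dataset's actual schema.

```python
import random
from datetime import date, timedelta

# Hypothetical value pools; the real dataset draws its values from Faker.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
CITIES = ["Springfield", "Riverton", "Lakeside"]
STREETS = ["Main St", "Oak Ave", "Pine Rd"]

# Assumed column names, one per field mentioned in the description.
FIELDS = [
    "first_name", "last_name", "email", "product_id", "qty",
    "amount", "invoice_date", "address", "city", "stock_code",
]


def make_invoice_row(rng):
    """Build one mock invoice record from the given random generator."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    qty = rng.randint(1, 10)
    unit_price = round(rng.uniform(1.0, 100.0), 2)
    invoice_day = date(2023, 1, 1) + timedelta(days=rng.randint(0, 364))
    return {
        "first_name": first,
        "last_name": last,
        "email": f"{first}.{last}@example.com".lower(),
        "product_id": f"P{rng.randint(0, 9999):04d}",
        "qty": qty,
        "amount": round(qty * unit_price, 2),
        "invoice_date": invoice_day.isoformat(),
        "address": f"{rng.randint(1, 999)} {rng.choice(STREETS)}",
        "city": rng.choice(CITIES),
        "stock_code": f"SC{rng.randint(0, 99999):05d}",
    }


rng = random.Random(42)  # fixed seed so the sketch is reproducible
rows = [make_invoice_row(rng) for _ in range(5)]
```

Rows shaped like this can be written to CSV with `csv.DictWriter` and fed to whatever invoice-handling code is under test.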
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Invoices Sample Dataset
This is a sample dataset generated on app.parsee.ai for invoices. The goal was to evaluate different LLMs on this RAG task using the Parsee evaluation tools. A full study can be found here: https://github.com/parsee-ai/parsee-datasets/blob/main/datasets/invoices/parsee-loader/README.md (parsee-core version used: 0.1.3.11). This dataset was created on the basis of 15 sample invoices (PDF files). All PDF files are publicly accessible on parsee.ai, to access them… See the full description on the dataset page: https://huggingface.co/datasets/parsee-ai/invoices-example.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic invoices are a product of the information age, and their use in today's market keeps growing. After examining real electronic invoices from across the globe, we settled on a representative placement of the information. Every detail has been generated programmatically using Python. Billing information is kept minimal to avoid or lower the chance of fraud detection. Product data was collected by scraping popular online marketplaces, and the products were grouped into categories to imitate the purchasing behavior of a persona. The datasets are intended for reuse as input to machine learning fraud detection algorithms or data extraction mechanisms. The datasets present 1000 samples each of auto-generated invoices containing:
- valid information.
- valid information with a colored IBAN background. The RGB color of the background varies between (255,255,240) and (255,255,254).
- valid information with modified spacing between IBAN characters. The charspace coefficient varies between 0.001 and 1.
The two ends of each modifier range span from detectable to non-detectable by the human eye. Nomenclature: invoice_
License: http://opendatacommons.org/licenses/dbcl/1.0/
High-Quality Invoice Images for OCR is a curated dataset containing professionally scanned and digitally captured invoice documents. It is designed for training, fine-tuning, and evaluating OCR models, machine learning pipelines, and data extraction systems.
This dataset focuses on clean, structured invoices to simulate real-world scenarios in financial document automation.
📄 Variety of invoice templates from multiple industries (e.g., retail, manufacturing, services)
🖋️ Different currencies, tax formats, and layouts
📸 High-resolution scanned and photographed invoices
🏷️ Optional field annotations (e.g., invoice number, date, total amount, vendor name) for supervised training
Training and fine-tuning OCR and Document AI models
Machine learning for structured and semi-structured data extraction
Intelligent Document Processing (IDP) and Robotic Process Automation (RPA)
Benchmarking table detection, key-value extraction, and layout analysis models
✅ High-quality images optimized for OCR and data extraction tasks
✅ Real-world invoice variations to improve model robustness
✅ Ideal for machine learning workflows in finance, ERP, and accounting systems
✅ Supports rapid prototyping for invoice understanding models
Researchers working on OCR and document understanding
Developers building invoice processing systems
Machine learning engineers fine-tuning models for data extraction
Startups and enterprises automating financial workflows
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Invoices (Sparrow)
This dataset contains 500 invoice documents, annotated and processed to be ready for fine-tuning the Donut ML model. Annotation and data preparation were done by the Katana ML team. Sparrow is an open-source data extraction solution by Katana ML. Original dataset info: Kozłowski, Marek; Weichbroth, Paweł (2021), “Samples of electronic invoices”, Mendeley Data, V2, doi: 10.17632/tnj49gpmtz.2
This dataset was created by Prashanth Sheri
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.
PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.
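The document types are invoices, inventory reports, purchase orders, and shipping orders.
Here are a few example entries from the CSV file:
This dataset can be used for text classification, information extraction, and document clustering.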
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is related to financial transactions or invoices and includes information about the invoiced parties, services, and financial details. Depending on your specific analysis or use case.
Each sample received by NSIL is assigned a laboratory number, and the sample custodian initiates a case file. The case file contains all relevant paperwork for that sample, including the sample submission sheet, laboratory raw data worksheets, the final results report and any other relevant documentation. The sample custodian enters the client information into the NSIL Sample tracking system (Sample receipt database) and generates appropriate client and sample receipt information. The laboratory analysts perform the appropriate analyses and record the results and whether the results are compliant or non-compliant with the assigned acceptance levels. The analysts also record the charges and the analytical and quality assurance units that were used to complete all analyses. The database is used to track samples analyzed by NSIL from sample receipt to reporting of results. It tracks the number of samples, the number of analytical units, the types of samples, the purpose of sampling and the analytical costs.
Abstract: Numerous business workflows involve printed forms, such as invoices or receipts, which are often manually digitalized to persistently search or store the data. As hardware scanners are costly and inflexible, smartphones are increasingly used for digitalization. Here, processing algorithms need to deal with prevailing environmental factors, such as shadows or crumples. Current state-of-the-art approaches learn supervised image dewarping models based on pairs of raw images and rectification meshes. The available results show promising predictive accuracies for dewarping, but generated errors still lead to sub-optimal information retrieval. In this paper, we explore the potential of improving dewarping models using additional, structured information in the form of invoice templates. We provide two core contributions: (1) a novel dataset, referred to as Inv3D, comprising synthetic and real-world high-resolution invoice images with structural templates, rectification meshes, and a multiplicity of per-pixel supervision signals and (2) a novel image dewarping algorithm, which extends the state-of-the-art approach GeoTr to leverage structural templates using attention. Our extensive evaluation includes an implementation of DewarpNet and shows that exploiting structured templates can improve the performance for image dewarping. We report superior performance for the proposed algorithm on our new benchmark for all metrics, including an improved local distortion of 26.1 %. We made our new dataset and all code publicly available at https://felixhertlein.github.io/inv3d.
TechnicalRemarks: Each sample contains the following files:
"flat_document.png" (2200x1700x3, uint8, 0-255), showcasing a document in perfect condition.
"flat_information_delta.png" displays all texts which represent invoice data (2200x1700x3, uint8, 0-255).
"flat_template.png" is an empty invoice template (2200x1700x3, uint8, 0-255).
"flat_text_mask.png" visually presents all texts shown in the given document (2200x1700x3, uint8, 0-255).
"warped_angle.png" shows warping-induced x- and y-axis angle (1600x1600x2, float32, -Pi to Pi).
"warped_albedo.png" is an albedo map (1600x1600x3, uint8, 0-255).
"warped_BM.npz" stores the backward mapping, i.e. the relative pixel shift from the warped to the normalized image for each pixel (1600x1600x2, float32, 0-1).
"warped_curvature.npz" has pixel-wise curvature of the warped document (1600x1600x1, float32, 0-inf).
"warped_depth.npz" holds per-pixel depth between camera and document (1600x1600x3, float32, 0-inf).
"warped_document.png" displays the warped document (1600x1600x3, uint8, 0-255).
"warped_normal.npz" contains warped document normals (1600x1600x3, float32, -inf to inf).
"warped_recon.png" features a chess-textured warped document (1600x1600x3, uint8, 0-255).
"warped_text_mask.npz" is a boolean text pixel mask (1600x1600x1, bool8, True/False).
"warped_UV.npz" stores warped texture coordinates (1600x1600x3, float32, 0-1).
"warped_WC.npz" includes document coordinates in the 3D space (1600x1600x3, float32, -inf to inf).
For more details see https://github.com/FelixHertlein/inv3d-generator. Released under CC BY-NC-SA 4.0. Excluded files are listed in 'restricted-license-files.txt' (located in record with DOI 10.35097/1730, "Inv3D: a high-resolution 3D invoice dataset for template-driven Single-Image Document Unwarping - Metadata"). These are for academic use only.
This dataset contains 7000 invoice images and their corresponding JSON files. There are 7 types of invoices in this dataset, each with 1000 examples. The data content in the invoices has been generated using Python Faker. If you do not want to download the dataset as parquet (the default download format) and prefer the original format (a folder containing the 2 subfolders, image and json), use the code below: from huggingface_hub import snapshot_download… See the full description on the dataset page: https://huggingface.co/datasets/Ananthu01/7000_invoice_images_with_json.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 10000 jpg images and 3x10000 json annotation files. The images are generated from 50 different templates. For each template, 200 images were generated. We provide annotations in three formats: our own original format, the COCO format and a format compatible with HuggingFace Transformers.
In terms of objects, the dataset contains 24 different classes. The classes vary considerably in their numbers of occurrences and thus, the dataset is somewhat imbalanced.
The annotations contain bounding box coordinates, bounding box text and object classes.
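For readers unfamiliar with the COCO layout, the following sketch shows how one annotation with a bounding box, its text and its class might be expressed. COCO stores boxes as `[x, y, width, height]` measured from the top-left corner. The corner-style input, the helper names, and the extra `text` key are assumptions made for illustration, since the description does not spell out the dataset's own format or how the text is stored.

```python
def corners_to_coco_bbox(x_min, y_min, x_max, y_max):
    """Convert corner coordinates to COCO's [x, y, width, height] layout."""
    return [x_min, y_min, x_max - x_min, y_max - y_min]


def make_coco_annotation(annotation_id, image_id, category_id, corners, text):
    """Build one COCO-style annotation dict for a text box."""
    bbox = corners_to_coco_bbox(*corners)
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": bbox,
        "area": bbox[2] * bbox[3],
        "iscrowd": 0,
        "text": text,  # non-standard key, assumed here to carry the box text
    }


# Hypothetical example: a single "invoice_number" box on image 1.
coco = {
    "images": [{"id": 1, "file_name": "example.jpg", "width": 1000, "height": 1400}],
    "categories": [{"id": 1, "name": "invoice_number"}],
    "annotations": [make_coco_annotation(1, 1, 1, (100, 50, 300, 90), "INV-0001")],
}
```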
We propose two methods for training and evaluating models. The models were trained until convergence, i.e. until the model reached optimal performance on the validation split and started overfitting. The model version used for evaluation is the one with the best validation performance.
First Evaluation strategy:
For each template, the generated images are randomly split into 3 subsets: training, validation and testing.
In this scenario, the model trains on all templates and is thus tested on new images rather than new layouts.
Second Evaluation strategy:
The real templates are randomly split into a training set, and a common set of templates for validation and testing. All the variants created from the training templates are used as training dataset. The same is done to form the validation and testing datasets. The validation and testing sets are made up of the same templates but of different images.
This approach tests the models' performance on different unseen templates/layouts, rather than the same templates with different content.
We provide the data splits we used for every evaluation scenario. We also provide the background colors we used as augmentation for each template.
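The difference between the two evaluation strategies can be sketched in a few lines of Python. This is only an illustration under assumed settings: the 80/10/10 and 80/20 fractions, the file naming scheme and the function names are hypothetical, and the splits actually used are the ones shipped with the dataset.

```python
import random


def split_within_templates(images_by_template, fractions=(0.8, 0.1, 0.1), seed=0):
    """Strategy 1: every template contributes images to train, val and test."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for template in sorted(images_by_template):
        shuffled = list(images_by_template[template])
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * fractions[0])
        n_val = int(len(shuffled) * fractions[1])
        splits["train"] += shuffled[:n_train]
        splits["val"] += shuffled[n_train:n_train + n_val]
        splits["test"] += shuffled[n_train + n_val:]
    return splits


def split_across_templates(images_by_template, train_fraction=0.8, seed=0):
    """Strategy 2: held-out templates feed val and test, never train."""
    rng = random.Random(seed)
    templates = sorted(images_by_template)
    rng.shuffle(templates)
    n_train = int(len(templates) * train_fraction)
    splits = {"train": [], "val": [], "test": []}
    for template in templates[:n_train]:
        splits["train"] += images_by_template[template]
    for template in templates[n_train:]:
        # Val and test share templates but receive different images.
        shuffled = list(images_by_template[template])
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        splits["val"] += shuffled[:half]
        splits["test"] += shuffled[half:]
    return splits


# 50 templates with 200 generated images each, as in the description.
images_by_template = {
    f"template_{t:02d}": [f"template_{t:02d}_img_{i:03d}.jpg" for i in range(200)]
    for t in range(50)
}
within = split_within_templates(images_by_template)
across = split_across_templates(images_by_template)
```

In the first split every layout seen at test time was also seen during training; in the second, the test layouts are new to the model, which is the harder and usually more realistic setting.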
HitHorizons Invoice Data API gives access to aggregated company data on 88M+ companies from 18 countries.
Available countries:
France, United Kingdom, Germany, Poland, Czech Republic, Hungary, Slovakia, Latvia, Estonia, Austria
Parameters:
Id, Company Name, Company Alternative Name, Street Address, Street Number, Location, District, Region, Postal Code, City, Country, Status, Incorporation Date, Dissolution Date, National ID, Tax ID, Vat ID, Parent ID, Idents, Inactive, Company Type, Company Type Normalized
Parameters may vary depending on the country.
Contains information about invoices submitted to HPD by private contractors under an OMO. This is part of the HPD Charge Data collection of data tables.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The Grocery Store Receipts Dataset is a collection of photos captured from various grocery store receipts. This dataset is specifically designed for tasks related to Optical Character Recognition (OCR) and is useful for retail.
Each image in the dataset is accompanied by bounding box annotations, indicating the precise locations of specific text segments on the receipts. The text segments are categorized into four classes: item, store, date_time and total.
Each image in the images folder is accompanied by an XML annotation in the annotations.xml file, indicating the coordinates of the bounding boxes and the detected text. For each point, the x and y coordinates are provided.
keywords: receipts reading, retail dataset, consumer goods dataset, grocery store dataset, supermarket dataset, deep learning, retail store management, pre-labeled dataset, annotations, text detection, text recognition, optical character recognition, document text recognition, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
This dataset provides a cumulative record of payment and invoice details for County suppliers, vendors and other payees. Data is from December 1, 2016 to present. Payment data prior to December 1, 2016 is archived in the Cook County Check Register here: https://datacatalog.cookcountyil.gov/Finance-Administration/Cook-County-Check-Register/gywr-fjeh
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
According to our latest research, the global invoice processing software market size was valued at USD 4.2 billion in 2024, with a robust compound annual growth rate (CAGR) of 12.8% anticipated through 2033. By 2033, the market is forecasted to reach USD 12.4 billion, driven by rapid digital transformation initiatives, increasing demand for automation in finance departments, and a growing emphasis on operational efficiency. The adoption of cloud-based solutions and the integration of artificial intelligence into invoice management platforms are among the key factors fueling this consistent growth trajectory, as organizations across various sectors seek to streamline their accounts payable processes and reduce manual intervention.
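The headline figures can be sanity-checked with the standard compound-growth formula, end = start × (1 + rate)^years. The short sketch below assumes the 12.8% CAGR applies to each of the nine annual steps from 2024 to 2033; the function name is illustrative.

```python
def project_value(start_value, annual_rate, years):
    """Compound `start_value` at `annual_rate` for the given number of years."""
    return start_value * (1.0 + annual_rate) ** years


# USD 4.2 billion in 2024, growing at 12.8% per year until 2033.
projected_2033 = project_value(4.2, 0.128, 2033 - 2024)
print(f"Projected 2033 market size: USD {projected_2033:.1f} billion")
```

Under that assumption the projection comes out at roughly USD 12.4 billion, which agrees with the forecast quoted above.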
One of the primary growth drivers for the invoice processing software market is the accelerating shift towards digital transformation across enterprises globally. As organizations increasingly seek to automate their financial workflows, invoice processing software is emerging as a critical tool for enhancing accuracy, reducing errors, and minimizing the time required for invoice approvals and payments. The integration of advanced technologies such as artificial intelligence (AI), machine learning (ML), and robotic process automation (RPA) into invoice processing solutions is enabling businesses to extract data from invoices more efficiently, detect anomalies, and ensure compliance with regulatory requirements. This not only streamlines the overall process but also provides actionable insights for better financial decision-making, further propelling market growth.
Another significant factor contributing to the expansion of the invoice processing software market is the rising adoption of cloud-based solutions. Cloud deployment offers several advantages, including scalability, cost-effectiveness, and remote accessibility, making it an attractive option for organizations of all sizes. The COVID-19 pandemic further accelerated the migration to cloud-based platforms, as businesses prioritized remote work capabilities and digital collaboration tools. As a result, vendors are increasingly focusing on developing cloud-native invoice processing solutions with enhanced security features and seamless integration capabilities. This trend is particularly pronounced among small and medium enterprises (SMEs), which often lack the resources to maintain complex on-premises infrastructure but require robust solutions to manage their invoicing workflows efficiently.
The invoice processing software market is also benefiting from the growing need for compliance and risk management in the face of evolving regulatory landscapes. With stricter regulations around financial reporting, tax compliance, and data privacy, organizations are under pressure to implement systems that ensure transparency and auditability in their accounts payable processes. Invoice processing software provides automated audit trails, reduces the risk of fraud, and helps organizations adhere to local and international compliance standards. This is especially crucial for industries such as banking, financial services, and insurance (BFSI), healthcare, and government, where regulatory scrutiny is particularly high. Consequently, the demand for robust invoice processing solutions is expected to remain strong across these verticals.
From a regional perspective, North America currently dominates the invoice processing software market, accounting for the largest share in 2024. This is attributed to the presence of major technology vendors, high adoption rates of automation solutions, and a mature digital infrastructure in the region. Europe follows closely, driven by stringent regulatory requirements and a strong focus on process optimization within enterprises. The Asia Pacific region is expected to exhibit the fastest growth over the forecast period, fueled by the rapid digitalization of businesses, increasing investments in cloud technology, and the proliferation of SMEs. Latin America and the Middle East & Africa are also witnessing steady adoption, supported by ongoing efforts to modernize financial operations and improve business efficiency.
License: https://cdla.io/sharing-1-0/
The subtitle highlights the importance of effectively using Google Cloud Platform (GCP) billing data to gain actionable insights into cloud spending. It emphasizes the need for strategic cost management, offering guidance on how to analyze billing data, optimize resource usage, and implement best practices to minimize costs while maximizing the value derived from cloud services. It is geared towards businesses and technical teams looking to maintain financial control and improve their cloud operations.
This dataset contains GCP cloud billing cost data. For an updated version, leave a comment or get in touch.
Terms and conditions: https://www.koncile.ai/en/termsandconditions
AI-powered software to extract fields from PDF or image invoices. Reliable and available via API to turn documents into actionable data.