Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The Grocery Store Receipts Dataset is a collection of photos captured from various grocery store receipts. This dataset is specifically designed for tasks related to Optical Character Recognition (OCR) and is useful for retail.
Each image in the dataset is accompanied by bounding box annotations, indicating the precise locations of specific text segments on the receipts. The text segments are categorized into four classes: item, store, date_time and total.
Each image in the images folder is accompanied by an XML annotation in the annotations.xml file, indicating the coordinates of the bounding boxes and the detected text. For each point, the x and y coordinates are provided.
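As a rough sketch of how such annotations might be read, the snippet below parses a small inline XML sample. The schema is an assumption: it supposes a CVAT-style layout with `image` and `box` elements, corner attributes named `xtl`, `ytl`, `xbr`, `ybr`, and the recognized text stored in a child `attribute` element. The real annotations.xml may name or nest these differently, so the tag and attribute names should be checked against the file before reuse.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real annotations.xml may use other tag or attribute names.
SAMPLE = """<annotations>
  <image id="0" name="0.jpg" width="600" height="800">
    <box label="store" xtl="10.0" ytl="5.0" xbr="200.0" ybr="40.0">
      <attribute name="text">FRESH MART</attribute>
    </box>
    <box label="total" xtl="10.0" ytl="700.0" xbr="200.0" ybr="740.0">
      <attribute name="text">12.50</attribute>
    </box>
  </image>
</annotations>"""


def read_boxes(xml_text):
    """Return one dict per annotated box: image name, class label, corners, text."""
    rows = []
    root = ET.fromstring(xml_text)
    for image in root.iter("image"):
        for box in image.iter("box"):
            # The recognized text is assumed to live in <attribute name="text">.
            text_node = box.find("attribute[@name='text']")
            rows.append({
                "image": image.get("name"),
                "label": box.get("label"),  # item, store, date_time or total
                "bbox": tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr")),
                "text": text_node.text if text_node is not None else None,
            })
    return rows


boxes = read_boxes(SAMPLE)
for b in boxes:
    print(b["image"], b["label"], b["bbox"], b["text"])
```

To read the actual file, the same function could be given the contents of annotations.xml once the element names have been confirmed.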
keywords: receipts reading, retail dataset, consumer goods dataset, grocery store dataset, supermarket dataset, deep learning, retail store management, pre-labeled dataset, annotations, text detection, text recognition, optical character recognition, document text recognition, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text, object detection
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This file contains governmental receipts for 1962 through the current budget year, as well as four years of projections. It can be used to reproduce many of the totals published in the Budget and examine unpublished details below the levels of aggregation published in the Budget.
The available metrics are those contained in the emailed booking invoice, such as hotel name, room type, dates, check-in and check-out times, price paid, and duration of stay. The history goes back up to 5 years.
We also have cancellation emails.
Any hotel vendor can be requested too. We will conduct a search in our database to see if it justifies a parser build to extract the data.
Each sample that is received by NSIL is assigned a laboratory number and a case file is initiated by the sample custodian. The case file will contain all relevant paperwork for that sample including the sample submission sheet, laboratory raw data worksheets, the final results report and any other relevant documentation. The sample custodian enters the client information into the NSIL Sample tracking system (Sample receipt database) and generates appropriate client and sample receipt information. The laboratory analysts perform the appropriate analyses and record the results and whether the results are compliant or non-compliant with the assigned acceptance levels. The analysts also record the charges and the analytical and quality assurance units that were used to complete all analyses. The database is used to track samples analyzed by NSIL from sample receipt to reporting of results. It tracks numbers of samples, number of analytical units, types of samples, purpose for sampling and analytical costs.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hong Kong Govt Consolidated Acc: Year to Date: Repayment of Bonds and Notes data was reported at 0.000 HKD mn in May 2018. This stayed constant from the previous number of 0.000 HKD mn for Apr 2018. Hong Kong Govt Consolidated Acc: Year to Date: Repayment of Bonds and Notes data is updated monthly, averaging 0.000 HKD mn from Jul 2014 (Median) to May 2018, with 47 observations. The data reached an all-time high of 9,687.800 HKD mn in Mar 2015 and a record low of 0.000 HKD mn in May 2018. Hong Kong Govt Consolidated Acc: Year to Date: Repayment of Bonds and Notes data remains in active status in CEIC and is reported by The Treasury. The data is categorized under Global Database’s Hong Kong – Table HK.F002: Government Consolidated Account: Receipts and Payments.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Business process event data modeled as labeled property graphs
Data Format
-----------
The dataset comprises one labeled property graph in two different file formats.
#1) Neo4j .dump format
A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command, see also the neo4j documentation: https://neo4j.com/docs/
/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=
The .dump was created with Neo4j v3.5.
#2) .graphml format
A .zip file containing a .graphml file of the entire graph
Data Schema
-----------
The graph is a labeled property graph over business process event data. Each graph uses the following concepts
:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"
:Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")
:Log nodes - describes a collection of events that were recorded together, most graphs only contain one log node
:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed, :Class nodes group events into sets of identical observations
:CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities
:DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.
:HAS relationship - from a :Log to an :Event node, describes which events had been recorded in which event log
:OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph
:REL relationship - placeholder for any structural relationship between two :Entity nodes
The concepts are further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552
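To make the :CORR and :DF concepts more concrete, here is a small sketch in plain Python that derives directly-follows pairs per entity from a handful of made-up events. It is not the authors' import procedure. The event tuples, the entity identifiers, and the `derive_df` helper are all hypothetical and only illustrate the rule that both events of a :DF relationship must be correlated to the same entity.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (event id, Activity, timestamp, IDs of correlated entities).
events = [
    ("e1", "Create Purchase Order Item", datetime(2018, 1, 2, 9, 0), ["PO1", "PO1_1"]),
    ("e2", "Record Goods Receipt", datetime(2018, 1, 5, 14, 0), ["PO1_1"]),
    ("e3", "Record Invoice Receipt", datetime(2018, 1, 9, 10, 0), ["PO1_1"]),
    ("e4", "Create Purchase Order Item", datetime(2018, 1, 3, 8, 0), ["PO1", "PO1_2"]),
]


def derive_df(events):
    """Return (source event, target event, entity) triples, one per :DF edge."""
    # :CORR - group events by each entity they are correlated to.
    by_entity = defaultdict(list)
    for event_id, _activity, timestamp, entity_ids in events:
        for entity_id in entity_ids:
            by_entity[entity_id].append((timestamp, event_id))

    # :DF - within one entity, link each event to the next one in time order.
    df_edges = []
    for entity_id, correlated in by_entity.items():
        correlated.sort()
        for (_, source), (_, target) in zip(correlated, correlated[1:]):
            df_edges.append((source, target, entity_id))
    return sorted(df_edges)


df_edges = derive_df(events)
for source, target, entity_id in df_edges:
    print(f"{source} -[:DF {{entity: {entity_id}}}]-> {target}")
```

Because an event can be correlated to several entities, the same event may appear in several directly-follows chains, one per entity. In the sample, `e1` is followed by `e2` for the item and by `e4` for the purchase order.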
Data Contents
-------------
neo4j-bpic19-2021-02-17 (.dump|.graphml.zip)
An integrated graph describing the raw event data of the entire BPI Challenge 2019 dataset.
van Dongen, B.F. (Boudewijn) (2019): BPI Challenge 2019. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:d06aff4b-79f0-45e6-8ec8-e19730c248f1
This data originated from a large multinational company operating from The Netherlands in the area of coatings and paints and we ask participants to investigate the purchase order handling process for some of its 60 subsidiaries. In particular, the process owner has compliance questions. In the data, each purchase order (or purchase document) contains one or more line items. For each line item, there are roughly four types of flows in the data:
(1) 3-way matching, invoice after goods receipt: For these items, the value of the goods receipt message should be matched against the value of an invoice receipt message and the value put during creation of the item (indicated by both the GR-based IV flag and the Goods Receipt flag set to true).
(2) 3-way matching, invoice before goods receipt: These purchase items require a goods receipt message but do not require GR-based invoicing (indicated by the GR-based IV flag set to false and the Goods Receipt flag set to true). For such purchase items, invoices can be entered before the goods are received, but they are blocked until goods are received. This unblocking can be done by a user, or by a batch process at regular intervals. Invoices should only be cleared if goods are received and the value matches with the invoice and the value at creation of the item.
(3) 2-way matching (no goods receipt needed): For these items, the value of the invoice should match the value at creation (in full or partially until PO value is consumed), but there is no separate goods receipt message required (indicated by both the GR-based IV flag and the Goods Receipt flag set to false).
(4) Consignment: For these items, there are no invoices on PO level as this is handled fully in a separate process. Here the Goods Receipt flag is set to true but the GR-based IV flag is set to false, and the item type (consignment) also tells us that no invoice is expected against this item.
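The four flow types can be read as a small decision rule over the two flags and the item type. The sketch below encodes that rule as described in the text. The function name, the returned labels, and the literal `"Consignment"` item type value are assumptions for illustration, and the one flag combination the description does not cover is rejected rather than guessed.

```python
def classify_flow(gr_based_iv: bool, goods_receipt: bool, item_type: str = "Standard") -> str:
    """Map the two flags and the item type to one of the four described flows."""
    if gr_based_iv and goods_receipt:
        return "3-way matching, invoice after goods receipt"
    if goods_receipt and not gr_based_iv:
        # Same flag combination for flows (2) and (4); the item type decides.
        if item_type.strip().lower() == "consignment":
            return "Consignment"
        return "3-way matching, invoice before goods receipt"
    if not gr_based_iv and not goods_receipt:
        return "2-way matching"
    # GR-based IV set without Goods Receipt is not described in the text.
    raise ValueError("flag combination not covered by the four described flows")


print(classify_flow(True, True))
print(classify_flow(False, True))
print(classify_flow(False, True, item_type="Consignment"))
print(classify_flow(False, False))
```

Raising on the uncovered combination keeps unexpected rows visible instead of silently assigning them to a category.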
Unfortunately, the complexity of the data goes further than just this division in four categories. For each purchase item, there can be many goods receipt messages and corresponding invoices which are subsequently paid. Consider for example the process of paying rent. There is a Purchase Document with one item for paying rent, but a total of 12 goods receipt messages with (cleared) invoices with a value equal to 1/12 of the total amount. For logistical services, there may even be hundreds of goods receipt messages for one line item. Overall, for each line item, the amounts of the line item, the goods receipt messages (if applicable) and the invoices have to match for the process to be compliant. Of course, the log is anonymized, but some semantics are left in the data, for example: The resources are split between batch users and normal users indicated by their name. The batch users are automated processes executed by different systems. The normal users refer to human actors in the process. The monetary values of each event are anonymized from the original data using a linear translation respecting 0, i.e. addition of multiple invoices for a single item should still lead to the original item worth (although there may be small rounding errors for numerical reasons). Company, vendor, system and document names and IDs are anonymized in a consistent way throughout the log. The company has the key, so any result can be translated by them to business insights about real customers and real purchase documents.
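The matching requirement and the effect of the anonymization can be illustrated with the rent example. The sketch below assumes that "linear translation respecting 0" means multiplying every amount by one constant factor, which is an interpretation rather than something the description states. The factor value, the amounts, and the helper names are made up. A relative tolerance is used because of the small rounding errors mentioned above.

```python
import math


def amounts_match(item_value, invoice_values, receipt_values=None, rel_tol=1e-6):
    """True when invoices (and goods receipts, if applicable) sum to the item value."""
    ok = math.isclose(sum(invoice_values), item_value, rel_tol=rel_tol)
    if receipt_values is not None:
        ok = ok and math.isclose(sum(receipt_values), item_value, rel_tol=rel_tol)
    return ok


def scale(values, factor):
    """Hypothetical anonymization: multiply every amount by the same constant."""
    return [value * factor for value in values]


rent_item = 12000.0               # made-up yearly rent for one line item
monthly = [rent_item / 12] * 12   # 12 goods receipts and 12 invoices of 1/12 each
factor = 0.37                     # made-up anonymization factor

original_ok = amounts_match(rent_item, monthly, monthly)
scaled_ok = amounts_match(rent_item * factor, scale(monthly, factor), scale(monthly, factor))
missing_invoice_ok = amounts_match(rent_item, monthly[:11], monthly)

print(original_ok, scaled_ok, missing_invoice_ok)
```

Since scaling by a constant distributes over addition, a line item that matches before anonymization still matches afterwards, up to floating-point rounding.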
The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e. there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system. For each purchase item (or case) the following attributes are recorded:
- concept:name: A combination of the purchase document id and the item id
- Purchasing Document: The purchasing document ID
- Item: The item ID
- Item Type: The type of the item
- GR-Based Inv. Verif.: Flag indicating if GR-based invoicing is required (see above)
- Goods Receipt: Flag indicating if 3-way matching is required (see above)
- Source: The source system of this item
- Doc. Category name: The name of the category of the purchasing document
- Company: The subsidiary of the company from where the purchase originated
- Spend classification text: A text explaining the class of purchase item
- Spend area text: A text explaining the area for the purchase item
- Sub spend area text: Another text explaining the area for the purchase item
- Vendor: The vendor to which the purchase document was sent
- Name: The name of the vendor
- Document Type: The document type
- Item Category: The category as explained above (3-way with GR-based invoicing, 3-way without, 2-way, consignment)
The data contains the following entities and their events
- PO - Purchase Order documents handled at a large multinational company operating from The Netherlands
- POItem - an item in a Purchase Order document describing a specific item to be purchased
- Resource - the user or worker handling the document or a specific item
- Vendor - the external organization from which an item is to be purchased
Data Size
---------
BPIC19, nodes: 1926651, relationships: 15082099
Abstract copyright UK Data Service and data collection copyright owner.
The European State Finance Database (ESFD) is an international collaborative research project for the collection of data in European fiscal history. There are no strict geographical or chronological boundaries to the collection, although data for this collection comprise the period between c.1200 and c.1815. The purpose of the ESFD was to establish a significant database of European financial and fiscal records. The data are drawn from the main extant sources of a number of European countries, as the evidence and the state of scholarship permit. The aim was to collect the data made available by scholars, whether drawing upon their published or unpublished archival research, or from other published material.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains electricity bills related to energy consumption in Spanish households. The contents of bills are automatically generated following some statistics from official bodies. The main purpose of the dataset is for training machine learning algorithms, especially for designing new methods for extracting information from invoices. There are 86 different labels, which are related to several topics, such as the customer and marketer, the contract, energy consumption, or billing.
The total number of invoices is 75,000. The files are organized in two directories: a training directory, with six subdirectories, each containing 5,000 invoices in PDF format and the corresponding labels in JSON files; and a test directory, with nine subdirectories, each containing 5,000 invoices in PDF format.
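A loader for the training directory might pair each PDF with its labels along the following lines. This is only a sketch: it assumes that every invoice has a JSON label file with the same base name in the same subdirectory, which the description does not confirm, and the directory names, file names, and label key in the demo tree are placeholders.

```python
import json
import tempfile
from pathlib import Path


def pair_invoices(training_dir):
    """Return (pdf_path, labels) pairs; labels is None when no JSON file is found."""
    pairs = []
    for pdf_path in sorted(Path(training_dir).glob("*/*.pdf")):
        # Assumption: labels sit next to the PDF under the same base name.
        json_path = pdf_path.with_suffix(".json")
        labels = None
        if json_path.exists():
            labels = json.loads(json_path.read_text(encoding="utf-8"))
        pairs.append((pdf_path, labels))
    return pairs


# Build a tiny stand-in tree so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    subdir = Path(tmp) / "training" / "subdir_a"   # placeholder directory names
    subdir.mkdir(parents=True)
    (subdir / "invoice_0.pdf").write_bytes(b"%PDF-1.4\n")
    (subdir / "invoice_0.json").write_text(
        json.dumps({"customer_name": "placeholder"}), encoding="utf-8"
    )
    (subdir / "invoice_1.pdf").write_bytes(b"%PDF-1.4\n")

    summary = [(path.name, labels) for path, labels in pair_invoices(Path(tmp) / "training")]

print(summary)
```

Returning `None` for a missing label file lets the same function be pointed at the test directory, which contains PDFs only.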
There are two main zip files that contain the test and training sets (test.zip and training.zip). In addition, we have included separate files with a subset of the directories in each set, so it can be downloaded by parts. There is also a reduced version of the dataset with 100 invoices per directory, which is interesting for users who want to preview the content of the dataset before downloading it.
IDSEM is an acronym for "an Invoices Database for the Spanish Electricity Market". More information can be found at https://idsem.ulpgc.es/ and in the following article:
[1] Javier Sánchez, Agustín Salgado, Alejandro García, and Nelson Monzón, "IDSEM, an invoices database of the Spanish electricity market", Sci. Data, (2022).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Intelligent Invoice Management System
Project Description:
The Intelligent Invoice Management System is an advanced AI-powered platform designed to revolutionize traditional invoice processing. By automating the extraction, validation, and management of invoice data, this system addresses the inefficiencies, inaccuracies, and high costs associated with manual methods. It enables businesses to streamline operations, reduce human error, and expedite payment cycles.
Problem Statement:
Manual invoice processing involves labor-intensive tasks such as data entry, verification, and reconciliation. These processes are time-consuming, prone to errors, and can result in financial losses and delays. The diversity of invoice formats from various vendors adds complexity, making automation a critical need for efficiency and scalability.
Proposed Solution:
The Intelligent Invoice Management System automates the end-to-end process of invoice handling using AI and machine learning techniques. Core functionalities include:
1. Invoice Generation: Automatically generate PDF invoices in at least four formats, populated with synthetic data.
2. Data Development: Leverage a dataset containing fields such as receipt numbers, company details, sales tax information, and itemized tables to create realistic invoice samples.
3. AI-Powered Labeling: Use Tesseract OCR to extract labeled data from invoice images, and train YOLO for label recognition, ensuring precise identification of fields.
4. Database Integration: Store extracted information in a structured database for seamless retrieval and analysis.
5. Web-Based Information System: Provide a user-friendly platform to upload invoices and retrieve key metrics, such as:
- Total sales within a specified duration.
- Total sales tax paid during a given timeframe.
- Detailed invoice information in tabular form for specific date ranges.
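The three metrics listed above could be served by simple queries once the extracted fields are stored. The following is a minimal sketch using an in-memory SQLite database. The table name, column names, and sample rows are hypothetical and not taken from the project, which does not specify its schema or database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE invoices (
        receipt_number TEXT PRIMARY KEY,
        company        TEXT,
        invoice_date   TEXT,  -- ISO 8601 dates compare correctly as strings
        total_sales    REAL,
        sales_tax      REAL
    )"""
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?, ?)",
    [  # made-up rows standing in for extracted invoice data
        ("R-001", "Acme", "2024-01-05", 100.0, 8.0),
        ("R-002", "Acme", "2024-01-20", 250.0, 20.0),
        ("R-003", "Globex", "2024-02-03", 75.0, 6.0),
    ],
)


def totals_between(conn, start, end):
    """Total sales and total sales tax for invoices dated within [start, end]."""
    return conn.execute(
        "SELECT COALESCE(SUM(total_sales), 0), COALESCE(SUM(sales_tax), 0) "
        "FROM invoices WHERE invoice_date BETWEEN ? AND ?",
        (start, end),
    ).fetchone()


def invoices_between(conn, start, end):
    """Detailed invoice rows for the date range, ordered by date."""
    return conn.execute(
        "SELECT receipt_number, company, invoice_date, total_sales, sales_tax "
        "FROM invoices WHERE invoice_date BETWEEN ? AND ? ORDER BY invoice_date",
        (start, end),
    ).fetchall()


january_totals = totals_between(conn, "2024-01-01", "2024-01-31")
january_rows = invoices_between(conn, "2024-01-01", "2024-01-31")
print(january_totals)
for row in january_rows:
    print(row)
```

Parameterized queries are used so that dates supplied through a web form are never concatenated into the SQL text.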
Key Features and Deliverables:
1. Invoice Generation:
- Generate 20,000 invoices using an automated script.
- Include dummy logos, company details, and itemized tables for four items per invoice.
Label Definition and Format:
OCR and AI Training:
Database Management:
Web-Based Interface:
Expected Outcomes:
- Reduction in manual effort and operational costs.
- Improved accuracy in invoice processing and financial reporting.
- Enhanced scalability and adaptability for diverse invoice formats.
- Faster turnaround time for invoice-related tasks.
By automating critical aspects of invoice management, this system delivers a robust and intelligent solution to meet the evolving needs of businesses.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hong Kong Govt Consolidated Acc: CE: Interest & Expenses on Bonds & Notes data was reported at 76.669 HKD mn in 2017. This records a decrease from the previous number of 77.301 HKD mn for 2016. Hong Kong Govt Consolidated Acc: CE: Interest & Expenses on Bonds & Notes data is updated yearly, averaging 574.844 HKD mn from Mar 2005 (Median) to 2017, with 13 observations. The data reached an all-time high of 850.524 HKD mn in 2006 and a record low of 76.669 HKD mn in 2017. Hong Kong Govt Consolidated Acc: CE: Interest & Expenses on Bonds & Notes data remains in active status in CEIC and is reported by The Treasury. The data is categorized under Global Database’s Hong Kong – Table HK.F003: Government Consolidated Account: Receipts and Payments: Annual.
Abstract copyright UK Data Service and data collection copyright owner.
The European State Finance Database (ESFD) is an international collaborative research project for the collection of data in European fiscal history. There are no strict geographical or chronological boundaries to the collection, although data for this collection comprise the period between c.1200 and c.1815. The purpose of the ESFD was to establish a significant database of European financial and fiscal records. The data are drawn from the main extant sources of a number of European countries, as the evidence and the state of scholarship permit. The aim was to collect the data made available by scholars, whether drawing upon their published or unpublished archival research, or from other published material.
Comptes rendus de l'administration des finances du royaume de France, (London, 1789). For a discussion of this source in English, consult Bonney, R.J., 'Jean-Roland Malet: historian of the finances of the French monarchy', French History, 5 (1991), 180-233.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Jordan Tourism Receipts: Arab data was reported at 65.446 JOD mn in Jun 2018. This records an increase from the previous number of 59.879 JOD mn for May 2018. Jordan Tourism Receipts: Arab data is updated monthly, averaging 49.691 JOD mn from Jan 2002 (Median) to Jun 2018, with 198 observations. The data reached an all-time high of 97.444 JOD mn in Jul 2012 and a record low of 13.600 JOD mn in Apr 2003. Jordan Tourism Receipts: Arab data remains in active status in CEIC and is reported by Central Bank of Jordan. The data is categorized under Global Database’s Jordan – Table JO.Q012: Tourism Receipts and Expenditures.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hong Kong Govt Consolidated Acc: OR: IR: IT: SD: Contract Notes data was reported at 23,567.300 HKD mn in 2017. This records a decrease from the previous number of 33,410.000 HKD mn for 2016. Hong Kong Govt Consolidated Acc: OR: IR: IT: SD: Contract Notes data is updated yearly, averaging 6,948.700 HKD mn from Mar 1991 (Median) to 2017, with 27 observations. The data reached an all-time high of 35,447.000 HKD mn in 2008 and a record low of 2,145.500 HKD mn in 1991. Hong Kong Govt Consolidated Acc: OR: IR: IT: SD: Contract Notes data remains in active status in CEIC and is reported by Inland Revenue Department. The data is categorized under Global Database’s Hong Kong – Table HK.F003: Government Consolidated Account: Receipts and Payments: Annual.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Iraq IQ: BOP: Current Account: Personal Transfers: Receipts data was reported at 941.300 USD mn in 2016. This records a decrease from the previous number of 954.100 USD mn for 2015. Iraq IQ: BOP: Current Account: Personal Transfers: Receipts data is updated yearly, averaging 250.100 USD mn from Dec 2005 (Median) to 2016, with 12 observations. The data reached an all-time high of 954.100 USD mn in 2015 and a record low of 2.500 USD mn in 2007. Iraq IQ: BOP: Current Account: Personal Transfers: Receipts data remains in active status in CEIC and is reported by World Bank. The data is categorized under Global Database’s Iraq – Table IQ.World Bank.WDI: Balance of Payments: Current Account. Personal transfers consist of all current transfers in cash or in kind made or received by resident households to or from nonresident households. Personal transfers thus include all current transfers between resident and nonresident individuals. Data are in current U.S. dollars. Source: International Monetary Fund, Balance of Payments Statistics Yearbook and data files. Aggregation: Sum. Note: Data are based on the sixth edition of the IMF's Balance of Payments Manual (BPM6) and are only available from 2005 onwards.