Transform Unstructured Financial Docs into Actionable Insights Harness proprietary AI models to extract, validate, and standardize financial data from any document format, including scanned images, handwritten notes, and multi-language PDFs. Unlike basic OCR tools, our solution handles complex layouts, merged cells, poor-quality PDFs, and low-quality scans with industry-leading precision.
Key Features Universal Format Support: Extract data from scanned PDFs, images (JPEG/PNG), Excel and Word files, and handwritten documents.
AI-Driven OCR & LLM Standardization:
Convert unstructured text into standardized fields (e.g., "Net Profit" → ISO 20022-compliant tags).
Resolve inconsistencies (e.g., "$1M" vs. "1,000,000 USD") using context-aware LLMs (a normalization sketch follows this feature list).
100+ Language Coverage: Process financial docs in Arabic, Bulgarian, and more with automated translation.
Up to 99% Accuracy: Triple-validation via AI cross-checks, rule-based audits, and human-in-the-loop reviews.
Prebuilt Templates: Auto-detect formats for common documents (e.g., IFRS-compliant P&L statements, IRS tax forms).
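For illustration, here is a minimal, runnable sketch of the kind of monetary normalization the LLM layer performs. It uses a simple rule-based fallback; the symbol table, multipliers, and function name are assumptions, not the product's proprietary model.

```python
import re

# Illustrative only: symbol and multiplier tables are assumptions.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}

def normalize_amount(raw: str) -> tuple[float, str]:
    """Convert strings like '$1M' or '1,000,000 USD' to (value, ISO currency code)."""
    text = raw.strip()
    currency = "USD"
    for symbol, code in SYMBOLS.items():          # currency symbol, e.g. "$"
        if symbol in text:
            currency, text = code, text.replace(symbol, "")
    match = re.search(r"\b([A-Z]{3})\b", text)    # ISO code written out, e.g. "USD"
    if match:
        currency, text = match.group(1), text.replace(match.group(1), "")
    text = text.replace(",", "").strip().lower()
    multiplier = 1
    if text and text[-1] in MULTIPLIERS:          # shorthand suffix, e.g. "1M"
        multiplier, text = MULTIPLIERS[text[-1]], text[:-1]
    return float(text) * multiplier, currency

assert normalize_amount("$1M") == (1_000_000.0, "USD")
assert normalize_amount("1,000,000 USD") == (1_000_000.0, "USD")
```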
Data Sourcing & Output Supported Documents: Balance sheets, invoices, tax filings, bank statements, receipts, and more. Export Formats: Excel, CSV, JSON, API, PostgreSQL, or direct integration with tools like QuickBooks and SAP.
Use Cases 1. Credit Risk Analysis: Automate financial health assessments for loan approvals and vendor analysis.
Audit Compliance: Streamline data aggregation for GAAP/IFRS audits.
Due Diligence: Verify company legitimacy for mergers, investments, acquisitions, or partnerships.
Compliance: Streamline KYC/AML workflows with automated financial checks.
Invoice Processing: Extract vendor payment terms, due dates, and amounts.
Technical Edge 1. AI Architecture: Leverages a proprietary algorithm that combines vision transformers and OCR pipelines for layout detection, LLMs for context analysis, and rule-based validation (a pipeline sketch follows this list).
Security: SOC 2 compliance and on-premise storage options.
Throughput: Process up to 10,000 pages per hour, with sub-60-second extractions.
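As a rough illustration of the stage ordering described above, here is a minimal, runnable Python sketch with placeholder stages. The function names (detect_layout, run_ocr, standardize, validate) and the confidence threshold are hypothetical, not part of the product's API.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    confidence: float

def detect_layout(doc: bytes) -> list[bytes]:
    # Stand-in for the vision-transformer layout detector: one region per doc.
    return [doc]

def run_ocr(region: bytes) -> str:
    # Stand-in for the OCR engine.
    return region.decode("utf-8")

def standardize(texts: list[str]) -> list[Field]:
    # Stand-in for the LLM that maps raw text onto standardized field names.
    return [Field("NetProfit", t.split(":")[-1].strip(), 0.95) for t in texts]

def validate(field: Field) -> bool:
    # Stand-in for rule-based audits (threshold is an assumption).
    return bool(field.value) and field.confidence >= 0.9

def extract_document(doc: bytes) -> list[Field]:
    regions = detect_layout(doc)
    texts = [run_ocr(r) for r in regions]
    return [f for f in standardize(texts) if validate(f)]

print(extract_document(b"Net profit: 1.2M USD"))
```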
Pricing & Trials Pay-as-you-go (min 1,000 docs/month).
Enterprise: Custom pricing for volume discounts, SLA guarantees, and white-glove onboarding.
Free Trial Available
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
• The dataset contains 19,404,395 public comments and replies from 70,000 news videos published by 20 renowned Arabic news YouTube channels; each channel contributes 3,500 video news segments. Each video has 10 properties: video URL, ID, title, likes, views, date of publishing, hashtags, description, number of comments, and comment details (comment time, comment text, comment likes, and reply count), providing a comprehensive corpus for analysis.
• The data is organized in a standardized Excel format, making it easy to access and analyze; each file includes 10 columns and 3,500 records.
• The final curated datasets are saved in 20 primary folders, one per news channel. Each folder contains two files: a data file in Arabic and a file translated into English. Each file contains the raw data listed above (video URL, ID, title, likes, views, date of publishing, hashtags, description, number of comments, and comment details).
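A minimal loading sketch, assuming pandas and one Excel file per channel; the file name and column names below are placeholders and should be taken from the actual channel folders.

```python
import pandas as pd

# "channel_01_arabic.xlsx" and the column names used below are placeholders.
videos = pd.read_excel("channel_01_arabic.xlsx")

print(videos.shape)            # expected: 3,500 records x 10 columns per channel file
print(list(videos.columns))    # video URL, ID, title, likes, views, publish date, ...

# Example: the ten most-commented videos in this channel (column name assumed)
top10 = videos.sort_values("number of comments", ascending=False).head(10)
print(top10)
```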
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All published datasets from the ice core paleoclimatology (ICP) group at the Byrd Polar and Climate Research Center (BPCRC) are archived in the NOAA-NCEI Paleoclimatology Database (https://www.ncei.noaa.gov/access/paleo-search/?dataTypeId=7). However, the formatting of these datasets is not consistent across the archival files, making it difficult to download and aggregate multiple datasets for research purposes. This repository is intended to provide a simple, consistently formatted archive of Excel files containing the published data for more than 16 ice core records collected by the BPCRC-ICP group since the 1980s.
The file "2023-ByrdICP-datasets.xlsx " contains a column for each ice core location and a list of the sheet names within the corresponding Excel file for that ice core location.
This is a collection of all GPS- and computer-generated geospatial data specific to the Alpine Treeline Warming Experiment (ATWE), located on Niwot Ridge, Colorado, USA. The experiment ran between 2008 and 2016 and consisted of three sites spread across an elevation gradient. Geospatial data for all three experimental sites and cone/seed collection locations are included in this package.
Geospatial files include cone collection, experimental site, seed trap, and other GPS location/terrain data. File types include ESRI shapefiles, ESRI grid files (Arc/Info binary grids), TIFFs (.tif), and keyhole markup language (.kml) files. Trimble-imported data include plain text files (.txt), Trimble COR files, and Trimble SSF (Standard Storage Format) files. Microsoft Excel (.xlsx) and comma-separated values (.csv) files corresponding to the attribute tables of many files within this package are also included. A complete list of files can be found in the "Data File Organization" section of the included Data User's Guide.
Maps are also included in this data package for reference and use. These maps are separated into two categories: 2021 maps, and legacy maps made in 2010. Each 2021 map has one copy in portable network graphics (.png) format and another in .pdf format. All legacy maps are in .pdf format. The .png image files can be opened with any compatible program, such as Preview (macOS) or Photos (Windows).
All GIS files were imported into geopackages (.gpkg) using QGIS and double-checked for compatibility and data/attribute integrity using ESRI ArcGIS Pro. Note that files packaged within geopackages will open in ArcGIS Pro with "main." preceding each file name and with an extra column named "geom" defining geometry type in the attribute table. The contents of each geospatial file remain intact unless otherwise stated in "niwot_geospatial_data_list_07012021.pdf/.xlsx"; this list of files is included in this archive as both an .xlsx and a .pdf.
Because the geopackage is an open-source format, the files within each .gpkg (TIFF, shapefiles, ESRI grid or "Arc/Info Binary") can be read using QGIS, ArcGIS Pro, and other geospatial software. Text and .csv files can be read using TextEdit, Notepad, or any simple text-editing software; .csv files can also be opened using Microsoft Excel and R. .kml files can be opened using Google Maps or Google Earth, and Trimble files are most compatible with Trimble's GPS Pathfinder Office software. .xlsx files can be opened using Microsoft Excel. PDFs can be opened using Adobe Acrobat Reader and other compatible programs.
A selection of original shapefiles within this archive were generated using ArcMap with associated FGDC-standardized metadata (XML file format). We are including these original files because they contain metadata only accessible using ESRI programs at this time, and so that the relationship between shapefiles and XML files is maintained. Individual XML files can be opened (without a GIS-specific program) using TextEdit or Notepad. Since ESRI's compatibility with FGDC metadata has changed since the generation of these files, many shapefiles will require upgrading to be compatible with ESRI's latest versions of geospatial software. These details are also noted in the "niwot_geospatial_data_list_07012021" file.
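A minimal sketch for inspecting one of the geopackages with open-source tools, assuming fiona and geopandas are installed; the geopackage and layer names below are placeholders, and the actual names are listed in the "niwot_geospatial_data_list_07012021" file.

```python
import fiona
import geopandas as gpd

# "atwe_sites.gpkg" and "experimental_sites" are placeholder names.
print(fiona.listlayers("atwe_sites.gpkg"))        # enumerate layers in a geopackage

sites = gpd.read_file("atwe_sites.gpkg", layer="experimental_sites")
print(sites.crs)       # coordinate reference system of the layer
print(sites.head())    # attribute table with the geometry column
```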
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This dataset was used by the NCI's Quantitative Imaging Network (QIN) PET-CT Subgroup for their project titled: Multi-center Comparison of Radiomic Features from Different Software Packages on Digital Reference Objects and Patient Datasets. The purpose of this project was to assess the agreement among radiomic features when computed by several groups by using different software packages under very tightly controlled conditions, which included common image data sets and standardized feature definitions.
The image datasets (and Volumes of Interest – VOIs) provided here are the same ones used in that project and reported in the publication listed below (ISSN 2379-1381 https://doi.org/10.18383/j.tom.2019.00031). In addition, we have provided detailed information about the software packages used (Table 1 in that publication) as well as the individual feature value results for each image dataset and each software package that was used to create the summary tables (Tables 2, 3 and 4) in that publication.
For that project, nine common quantitative imaging features were selected for comparison, including features that describe morphology, intensity, shape, and texture. These features are described in detail in the Image Biomarker Standardisation Initiative reference manual (IBSI, https://arxiv.org/abs/1612.07003) and publication (Zwanenburg A, Vallières M, et al. The Image Biomarker Standardization Initiative: Standardized Quantitative Radiomics for High-Throughput Image-based Phenotyping. Radiology. 2020 May;295(2):328-338. doi: https://doi.org/10.1148/radiol.2020191145).
There are three datasets provided: two image datasets and one dataset consisting of four Excel spreadsheets containing feature values.
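To make the radiomic features mentioned above concrete, here is a minimal sketch computing two first-order intensity features inside a VOI mask, in the spirit of the IBSI definitions. It is illustrative only and is not one of the software packages compared in the publication.

```python
import numpy as np

def first_order_features(image: np.ndarray, mask: np.ndarray) -> dict:
    """Simple first-order (intensity) features over the voxels inside the VOI mask."""
    voxels = image[mask > 0].astype(float)
    mean = voxels.mean()
    var = voxels.var()
    skew = ((voxels - mean) ** 3).mean() / (var ** 1.5 + 1e-12)
    return {"mean_intensity": mean, "variance": var, "skewness": skew}

# Toy 3D "image" with a cubic VOI in the center, just to exercise the function.
rng = np.random.default_rng(0)
image = rng.normal(100.0, 10.0, size=(8, 8, 8))
mask = np.zeros(image.shape, dtype=int)
mask[3:6, 3:6, 3:6] = 1
print(first_order_features(image, mask))
```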
NaiveBayes_R.xlsx: This Excel file shows how probabilities of observed features are calculated given recidivism (P(x_ij|R)) in the training data. Each cell is embedded with an Excel function that renders the appropriate figures.
• P(Xi|R): probabilities of feature attributes among recidivated offenders.
• NIJ_Recoded: re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
• Recidivated_Train: re-coded features of recidivated offenders.
• Tabs from [Gender] through [Condition_Other]: probabilities of feature attributes given recidivism; we use these conditional probabilities to replace the raw values of each feature in the P(Xi|R) tab.
NaiveBayes_NR.xlsx: This Excel file shows how probabilities of observed features are calculated given non-recidivism (P(x_ij|N)) in the training data. Each cell is embedded with an Excel function that renders the appropriate figures.
• P(Xi|N): probabilities of feature attributes among non-recidivated offenders.
• NIJ_Recoded: re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
• NonRecidivated_Train: re-coded features of non-recidivated offenders.
• Tabs from [Gender] through [Condition_Other]: probabilities of feature attributes given non-recidivism; we use these conditional probabilities to replace the raw values of each feature in the P(Xi|N) tab.
Training_LnTransformed.xlsx: Figures in each cell are log-transformed ratios of the probabilities in NaiveBayes_R.xlsx (P(Xi|R)) to the probabilities in NaiveBayes_NR.xlsx (P(Xi|N)).
TestData.xlsx: This Excel file includes the following tabs based on the test data: P(Xi|R), P(Xi|N), NIJ_Recoded, and Test_LnTransformed (log-transformed P(Xi|R)/P(Xi|N)).
Training_LnTransformed.dta: Training_LnTransformed.xlsx converted to a Stata dataset using the Stat/Transfer 13 software package.
StataLog.smcl: Results of the logistic regression analysis. Both the estimated intercept and the coefficient estimates in this Stata log correspond to the raw weights and standardized weights in Figure 1.
Brier Score_Re-Check.xlsx: This Excel file recalculates the Brier scores of the Relaxed Naïve Bayes Classifier in Table 3, showing evidence that the results displayed in Table 3 are correct.
Full list: NaiveBayes_R.xlsx; NaiveBayes_NR.xlsx; Training_LnTransformed.xlsx; TestData.xlsx; Training_LnTransformed.dta; StataLog.smcl; Brier Score_Re-Check.xlsx; Data for Weka (Training Set): Bayes_2022_NoID; Data for Weka (Test Set): BayesTest_2022_NoID; Weka output for machine learning models (conventional naïve Bayes, AdaBoost, Multilayer Perceptron, Logistic Regression, and Random Forest).
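A minimal sketch of the ln(P(x|R)/P(x|N)) transform underlying the workbooks above, estimated from labeled training rows. Laplace smoothing is added here as an assumption, and the toy feature values are not the NIJ coding schema; the original files embed this calculation in Excel formulas rather than code.

```python
import math

def log_ratio_tables(rows: list[dict], labels: list[int]) -> dict:
    """labels: 1 = recidivated, 0 = non-recidivated. Returns {feature: {value: ln ratio}}."""
    n_r = sum(labels)
    n_n = len(labels) - n_r
    tables = {}
    for feature in rows[0]:
        counts_r, counts_n = {}, {}
        for row, y in zip(rows, labels):
            bucket = counts_r if y == 1 else counts_n
            bucket[row[feature]] = bucket.get(row[feature], 0) + 1
        values = set(counts_r) | set(counts_n)
        tables[feature] = {
            # Laplace smoothing (assumption) keeps unseen attribute values finite.
            v: math.log(((counts_r.get(v, 0) + 1) / (n_r + len(values)))
                        / ((counts_n.get(v, 0) + 1) / (n_n + len(values))))
            for v in values
        }
    return tables

rows = [{"gender": "M"}, {"gender": "F"}, {"gender": "M"}, {"gender": "M"}]
labels = [1, 0, 1, 0]
print(log_ratio_tables(rows, labels))  # each raw value maps to its ln ratio
```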
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the publication titled "A database of non-aqueous proton conducting materials," which compiles experimental data on non-aqueous proton conductors from 48 peer-reviewed papers. The dataset encompasses 74 distinct compounds, yielding a total of 3152 data points that cover a broad temperature range from −70°C to 260°C.
Contents of the Dataset:
Chemical Structures: Molecules are encoded using SMILES (Simplified Molecular-Input Line-Entry System) for easy parsing and compatibility with cheminformatics tools.
Experimental Data: The dataset includes proton conductivity and proton diffusion coefficients. Parameters are reported for both doped and undoped systems, with doping levels explicitly quantified.
File Formats:
Raw Data: An Excel spreadsheet (.xlsx) with two sheets (“Compounds” and “Parameters”) containing original data as extracted from the papers.
Cleaned Data: Two tab-separated values (.tsv) files, containing conductivity and diffusion coefficients, which have been standardized for easy integration into machine learning models.
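A minimal loading sketch for the cleaned files, assuming pandas; the file and column names below are placeholders, and the actual names in the repository may differ.

```python
import pandas as pd

# Placeholder file names for the two cleaned .tsv files described above.
conductivity = pd.read_csv("conductivity_clean.tsv", sep="\t")
diffusion = pd.read_csv("diffusion_clean.tsv", sep="\t")

print(conductivity.head())

# Example filter: undoped systems measured above 100 degC (column names assumed).
hot_undoped = conductivity[
    (conductivity["doping_level"] == 0) & (conductivity["temperature_C"] > 100)
]
print(len(hot_undoped))
```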
http://open.alberta.ca/licence
Municipal Financial and Statistical Data includes the information submitted annually by all Alberta municipalities via the financial information return (FIR) and statistical information return (SIR). Information is available in both Excel and zipped formats. The FIR is a standardized summary of the information contained in the annual audited financial statements of each municipality, including assets, liabilities, revenue, expenses, long-term debt, and property taxes. The municipal data is converted to Excel for analysis purposes. The SIR provides basic municipal statistics including population, assessment, and tax rate information.