The Open Table-and-Text Question Answering (OTT-QA) dataset contains open-domain questions that require retrieving tables and text from the web to answer. It is re-annotated from the earlier HybridQA dataset. The dataset was collected by the UCSB NLP group and is released under the MIT license.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
A collection of tables gathered from the Open Research Knowledge Graph (ORKG) infrastructure, with a set of questions about these tables.
TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset, aiming to stimulate progress in QA research over more complex and realistic tabular and textual data, especially data requiring numerical reasoning.
The unique features of TAT-QA include:
- The context given is hybrid, comprising a semi-structured table and at least two relevant paragraphs that describe, analyze, or complement the table.
- The questions are generated by humans with rich financial knowledge; most are practical.
- The answer forms are diverse, including single span, multiple spans, and free-form.
- Answering the questions usually requires various numerical reasoning capabilities, including addition (+), subtraction (-), multiplication (x), division (/), counting, comparison, sorting, and their compositions.
- In addition to the ground-truth answers, the corresponding derivations and scale are also provided where applicable (an illustrative sketch follows the statistics below).
In total, TAT-QA contains 16,552 questions associated with 2,757 hybrid contexts from real-world financial reports.
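To make the derivation and scale fields concrete, here is a minimal sketch of checking a numerical answer against its derivation; the record layout is illustrative only and is not the official TAT-QA schema.

```python
# Illustrative TAT-QA-style record; field names and values are examples,
# not taken from the released files.
example = {
    "question": "What was the change in revenue from 2018 to 2019?",
    "answer": 120.0,
    "derivation": "520 - 400",   # subtraction over two table cells
    "scale": "million",
}

# Evaluate the arithmetic derivation and compare it with the stated answer.
derived = eval(example["derivation"], {"__builtins__": {}})  # simple arithmetic only
assert abs(derived - example["answer"]) < 1e-6
print(f'{example["answer"]} {example["scale"]}')  # -> 120.0 million
```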
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains resources, namely TempTabQA, developed for the paper: Gupta, V., Kandoi, P., Vora, M., Zhang, S., He, Y., Reinanda, R., Srikumar, V., TempTabQA: Temporal Question Answering for Semi-Structured Tables. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023.
TempTabQA is a dataset comprising 11,454 question-answer pairs extracted from Wikipedia Infobox tables and annotated by human annotators. We provide two test sets instead of one: the Head set, covering popular, frequent domains, and the Tail set, covering rarer domains.
The annotation files follow the structure below:
Maindata
Carefully read the `LICENCE` file for non-academic usage.
Note: wherever required, consider 2022 as the build date for the dataset.
CDLA Permissive 1.0: https://cdla.dev/permissive-1-0/
AIT-QA is a dataset for Table Question Answering (Table-QA) that is specific to the airline industry. It consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings of major airline companies for the fiscal years 2017-2019. It also contains annotations on the nature of the questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms. Unlike typical Table QA datasets, the tables in AIT-QA have more complex layouts.
hirundo-io/500-telecomm-personnel-table-qa dataset hosted on Hugging Face and contributed by the HF Datasets community
SPIQA Dataset Card
Dataset Details
Dataset Name: SPIQA (Scientific Paper Image Question Answering)
Paper: SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Github: SPIQA eval and metrics code repo
Dataset Summary: SPIQA is a large-scale and challenging QA dataset focused on figures, tables, and text paragraphs from scientific research papers in various computer science domains. The figures cover a wide variety of plots, charts, schematic diagrams, result visualizations, etc. The dataset is the result of a meticulous curation process, leveraging the breadth of expertise and the ability of multimodal large language models (MLLMs) to understand figures. We employ both automatic and manual curation to ensure the highest level of quality and reliability. SPIQA consists of more than 270K questions divided into training, validation, and three different evaluation splits. The purpose of the dataset is to evaluate the ability of large multimodal models to comprehend complex figures and tables together with the textual paragraphs of scientific papers.
This Data Card describes the structure of the SPIQA dataset, divided into training, validation, and three different evaluation splits. The test-B and test-C splits are filtered from the QASA and QASPER datasets and contain human-written QAs. We collect all scientific papers published at top computer science conferences between 2018 and 2023 from arXiv.
If you have any comments or questions, reach out to Shraman Pramanick or Subhashini Venugopalan.
Supported Tasks:
- Direct QA with figures and tables
- Direct QA with full paper
- CoT QA (retrieval of helpful figures and tables, then answering)
Language: English
Release Date: SPIQA is released in June 2024.
Data Splits
The statistics of the different splits of SPIQA are shown below.
| Split | Papers | Questions | Schematics | Plots & Charts | Visualizations | Other figures | Tables |
|---|---|---|---|---|---|---|---|
| Train | 25,459 | 262,524 | 44,008 | 70,041 | 27,297 | 6,450 | 114,728 |
| Val | 200 | 2,085 | 360 | 582 | 173 | 55 | 915 |
| test-A | 118 | 666 | 154 | 301 | 131 | 95 | 434 |
| test-B | 65 | 228 | 147 | 156 | 133 | 17 | 341 |
| test-C | 314 | 493 | 415 | 404 | 26 | 66 | 1,332 |
Dataset Structure
The contents of this dataset card are structured as follows:
```
SPIQA
├── SPIQA_train_val_test-A_extracted_paragraphs.zip
│   └── Extracted textual paragraphs from the papers in SPIQA train, val and test-A splits
├── SPIQA_train_val_test-A_raw_tex.zip
│   └── The raw tex files from the papers in SPIQA train, val and test-A splits. These files are not required to reproduce our results; we open-source them for future research.
├── train_val
│   ├── SPIQA_train_val_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA train, val splits
│   ├── SPIQA_train.json
│   │   └── SPIQA train metadata
│   └── SPIQA_val.json
│       └── SPIQA val metadata
├── test-A
│   ├── SPIQA_testA_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA test-A split
│   ├── SPIQA_testA_Images_224px.zip
│   │   └── 224px figures and tables from the papers in SPIQA test-A split
│   └── SPIQA_testA.json
│       └── SPIQA test-A metadata
├── test-B
│   ├── SPIQA_testB_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA test-B split
│   ├── SPIQA_testB_Images_224px.zip
│   │   └── 224px figures and tables from the papers in SPIQA test-B split
│   └── SPIQA_testB.json
│       └── SPIQA test-B metadata
└── test-C
    ├── SPIQA_testC_Images.zip
    │   └── Full resolution figures and tables from the papers in SPIQA test-C split
    ├── SPIQA_testC_Images_224px.zip
    │   └── 224px figures and tables from the papers in SPIQA test-C split
    └── SPIQA_testC.json
        └── SPIQA test-C metadata
```
The testA_data_viewer.json file is only for viewing a portion of the data on HuggingFace viewer to get a quick sense of the metadata.
Metadata Structure
The metadata for every split is provided as a dictionary whose keys are the arXiv IDs of the papers. The primary contents of each dictionary item are:
- arXiv ID
- Semantic Scholar ID (for test-B)
- Figures and tables
  - Name of the png file
  - Caption
  - Content type (figure or table)
  - Figure type (schematic, plot, photo (visualization), others)
- QAs
  - Question, answer and rationale
  - Reference figures and tables
  - Textual evidence (for test-B and test-C)
- Abstract and full paper text (for test-B and test-C; the full paper text for the other splits is provided as a zip)
Dataset Use and Starter Snippets
Downloading the Dataset Locally
We recommend users download the metadata and images to their local machine.
Download the whole dataset (all splits):
```python
from huggingface_hub import snapshot_download

# Set local_dir to the desired local directory path.
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir=".")
```
Download a specific file:
```python
from huggingface_hub import hf_hub_download

# Set local_dir to the desired local directory path.
hf_hub_download(repo_id="google/spiqa", filename="test-A/SPIQA_testA.json", repo_type="dataset", local_dir=".")
```
Questions and Answers from a Specific Paper in test-A:
```python
import json

testA_metadata = json.load(open("test-A/SPIQA_testA.json", "r"))
paper_id = "1702.03584v3"
print(testA_metadata[paper_id]["qa"])
```
Questions and Answers from a Specific Paper in test-B:
```python
import json

testB_metadata = json.load(open("test-B/SPIQA_testB.json", "r"))
paper_id = "1707.07012"
print(testB_metadata[paper_id]["question"])     # Questions
print(testB_metadata[paper_id]["composition"])  # Answers
```
Questions and Answers from a Specific Paper in test-C:
```python
import json

testC_metadata = json.load(open("test-C/SPIQA_testC.json", "r"))
paper_id = "1808.08780"
print(testC_metadata[paper_id]["question"])  # Questions
print(testC_metadata[paper_id]["answer"])    # Answers
```
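As a further starter, the sketch below tallies how many figures and tables appear in the test-A metadata; the nested field names (`all_figures`, `content_type`) are assumptions for illustration only, so inspect one record of `SPIQA_testA.json` for the actual keys.

```python
import json
from collections import Counter

# Count figure/table entries across all papers in the test-A split.
# NOTE: 'all_figures' and 'content_type' are illustrative key names.
testA_metadata = json.load(open("test-A/SPIQA_testA.json", "r"))

counts = Counter()
for paper in testA_metadata.values():
    for fig in paper.get("all_figures", {}).values():
        counts[fig.get("content_type", "unknown")] += 1

print(counts)  # e.g. Counter({'figure': ..., 'table': ...})
```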
Annotation Overview
Questions and answers for the SPIQA train, validation, and test-A sets were machine-generated. Additionally, the SPIQA test-A set was manually filtered and curated. Questions in the SPIQA test-B set are collected from the QASA dataset, while those in the SPIQA test-C set are from the QASPER dataset. Answering the questions in all splits requires a holistic understanding of figures and tables together with related text from the scientific papers.
Personal and Sensitive Information
We are not aware of any personal or sensitive information in the dataset.
Licensing Information
CC BY 4.0
Citation Information
```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={NeurIPS},
  year={2024}
}
```
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
HCT-QA: Human-Centric Tables Question Answering
HCT-QA is a benchmark dataset designed to evaluate large language models (LLMs) on question answering over complex, human-centric tables (HCTs). These tables often appear in documents such as research papers, reports, and webpages and present significant challenges for traditional table QA due to their non-standard layouts and compositional structure. The dataset includes:
2,188 real-world tables with 9,835 human-annotated QA pairs 4… See the full description on the dataset page: https://huggingface.co/datasets/qcri-ai/HCTQA.
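For a quick first look at the data, the standard Hugging Face `datasets` loader should work; the configuration and split names are not documented here and are assumptions in the sketch below.

```python
from datasets import load_dataset

# Load HCT-QA; check the dataset page for the actual config/split names.
ds = load_dataset("qcri-ai/HCTQA")
print(ds)

first_split = next(iter(ds))
print(ds[first_split][0])  # first example of the first available split
```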
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Excel file contains a quantitative analysis table for 523 manuscripts of the Epistle of Jude. The data were compiled using the open-cbgm library's compare_witnesses function and a script that I wrote to automate comparing every witness with every other witness and exporting the data to a single file. This script and a pdf guide can be found at https://github.com/dopeyduck/qa-table-starter. The original dataset contained 562 witnesses. This table was generated with all witnesses extant in at least 85% of the variation units established during the collation process.
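As a rough illustration of that kind of automation, the sketch below shells out to the compare_witnesses executable for each witness and collects the output into one CSV; the command-line arguments, database name, and output handling are assumptions, not the actual script from the linked repository.

```python
import csv
import subprocess

# Hypothetical witness IDs; the real script derives them from the collation data.
witnesses = ["P72", "01", "02", "03"]

with open("jude_comparisons.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["primary_witness", "comparison_output"])
    for wit in witnesses:
        # compare_witnesses prints agreement statistics for one witness
        # against the others; the arguments here are illustrative only.
        result = subprocess.run(
            ["./compare_witnesses", "jude.db", wit],
            capture_output=True, text=True, check=True,
        )
        writer.writerow([wit, result.stdout.strip()])
```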
The Korean tabular dataset is a collection of 1.4M tables with corresponding descriptions for unsupervised pre-training of language models. The Korean table question answering corpus consists of 70k pairs of questions and answers created by crowd-sourced workers.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GRI-QA
GRI-QA is a benchmark for Table Question Answering (QA) over environmental data extracted from corporate sustainability reports, following the Global Reporting Initiative (GRI) standards. It contains 4,000+ questions across 204 tables from English-language reports of European companies, covering extractive, comparative, quantitative, multi-step, and multi-table reasoning.
Tasks
- Table QA on real-world corporate sustainability data
- Question types: extra (extractive)…

See the full description on the dataset page: https://huggingface.co/datasets/lucacontalbo/GRI-QA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains validated hydrochemical groundwater bore sample data with appended hydrodynamic and hydrostratigraphic data for the Galilee subregion. The dataset consists of a spreadsheet which includes field sampling data (e.g. water temperature, pH, EC, DO, alkalinity, water level and aquifer formations), laboratory sampling data (e.g. major ions, trace metals and isotopes) and desktop interpreted data (e.g. aquifer formations).
Data has been filtered based on its hydrochemical validity following a 'Charged Balance Equation' (CBE) test, where only those samples which balanced to within +/- 10% were retained.
The spreadsheet was compiled from data sourced from the following locations:
Columns:
A - Row ID: Manually added as an index
B - RN: QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'REGISTRATION', Field 'RN'
C - Old bore ID: QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'REGISTRATION', Field 'ORIG_NAME_NO'
D - Dec lat: QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60 ) - Table 'REGISTRATION', Field 'GIS_LAT'
E - Dec long: QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'REGISTRATION', Field 'GIS_LNG'
F - Water level measure data: QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'WATER_LEVELS', Field 'RDATE'
G - Most Recent SWL: QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'WATER_LEVELS', Field 'MEASURMENT'
H - Uncorrected RWL: 1 second DEM value (Geoscience Australia, 1 second SRTM Digital Elevation Model (DEM) - GUID: 9a9284b6-eb45-4a13-97d0-91bf25f1187b) minus 'Most Recent SWL' (column G) value
I - Formation from bore log: Multiple sources (and GIS interpretation using location and screen depth) including:
- QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 20140312 (GUID: 68bbd3fb-6e2a-4088-a3dd-55cc44c2f0d6) - Table 'QLD GW Licences - Original - Final - v3.xls', Tab 'All Data', Field - 'WaterCodeSourcesList'
- QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Tables include: 'STRATIGRAPHY', 'AQUIFER' and 'MONITORING BORES'
- RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original) (GUID: bfe35a54-7a71-45ec-b20e-f3bfcc8999ef) - Table 'Append B to F_ Dec2012.xlsx', Tab 'Table E-1', Field 'Identified aquifer'
- Carmichael Coal Mine and Rail Project Environmental Impact Statement (GUID: 2a595f74-aae6-4d83-9cd7-1459247d751a)
J - Formation Name Source: Name of source material used to decide aquifer formation for each sample/bore
K - Location Group: Broad location category from GIS location
L - Chem Formation Group: Standardised aquifer formation name to be used during analysis
M - Sample date: QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'WATER ANALYSIS', Field 'RDATE'
N to P - Interpretation of screen and bore depths from QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'CASING', Fields 'MATERIAL', 'TOP', 'BOTTOM'
Q to AC - QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'WATER ANALYSIS'
AD to AK - QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'FIELD WATER QUALITY'
AL to AW - Cation and anion calculation from Q to AC data
AX to AZ - Charged Balance Equations (CBE) for QA and QC of data. Validated (passed) data has a CBE within +/-10%.
BA to CN - QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111 (GUID: 827e70b3-29d5-4e89-8d59-a93bc499de60) - Table 'WATER ANALYSIS'
Only data which passed the CBE validation test (columns AX to AZ) were retained within the spreadsheet.
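For reference, the charge balance check described above is typically computed from the major-ion concentrations converted to milliequivalents per litre; the sketch below shows the usual formula with purely illustrative ion values and is not taken from the spreadsheet itself.

```python
# Typical charge balance error (CBE) check on major-ion data, in meq/L.
# The +/-10% threshold mirrors the description above; ion values are
# illustrative only.
cations_meq_l = {"Ca": 2.50, "Mg": 1.80, "Na": 4.10, "K": 0.15}
anions_meq_l = {"HCO3": 4.90, "Cl": 3.20, "SO4": 0.60}

sum_cations = sum(cations_meq_l.values())
sum_anions = sum(anions_meq_l.values())

cbe_percent = 100 * (sum_cations - sum_anions) / (sum_cations + sum_anions)
passed = abs(cbe_percent) <= 10.0

print(f"CBE = {cbe_percent:.2f}% -> {'retained' if passed else 'rejected'}")
```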
Bioregional Assessment Programme (2014) QLD Hydrochemistry QA QC GAL v02. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/e3fb6c9b-e224-4d2e-ad11-4bcba882b0af.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From QLD DNRM Hydrochemistry with QA/QC
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original)
Derived From Geoscience Australia, 1 second SRTM Digital Elevation Model (DEM)
Derived From Carmichael Coal Mine and Rail Project Environmental Impact Statement
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
DOI retrieved: 1987
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset presents key indicators related to water production and consumption in the State of Qatar. It includes figures on total water production, maximum daily production, and number of registered customers including tanker services. The data tracks both absolute values and percentage changes, supporting analysis of water supply trends, infrastructure demand, and service expansion.
Link to the ScienceBase Item Summary page for the item described by this metadata record. Service Protocol: Link to the ScienceBase Item Summary page for the item described by this metadata record. Application Profile: Web Browser. Link Function: information
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Abundance is characterized by VA = very abundant; C = common; F = few; R = rare; B = barren; EB = essentially barren. For preservation: P = poor; M = moderate; G = good; E = etched; O = overgrown. Lowercase letters indicate material considered to be reworked.
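When handling such coded abundance and preservation columns programmatically, a small lookup table is enough to decode them; only the code meanings below come from the legend above, everything else is an illustrative assumption.

```python
# Decode abundance and preservation codes from the legend above.
# A lowercase code is flagged as reworked material.
ABUNDANCE = {"VA": "very abundant", "C": "common", "F": "few",
             "R": "rare", "B": "barren", "EB": "essentially barren"}
PRESERVATION = {"P": "poor", "M": "moderate", "G": "good",
                "E": "etched", "O": "overgrown"}

def decode(code: str, table: dict) -> str:
    reworked = code != code.upper()
    label = table.get(code.upper(), "unknown")
    return f"{label} (reworked)" if reworked else label

print(decode("VA", ABUNDANCE))    # very abundant
print(decode("g", PRESERVATION))  # good (reworked)
```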
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Qatar QA: PPP Conversion Factor: Private Consumption data was reported at 2.823 QAR/Intl $ in 2016. This records an increase from the previous number of 2.779 QAR/Intl $ for 2015. Qatar QA: PPP Conversion Factor: Private Consumption data is updated yearly, averaging 2.111 QAR/Intl $ from Dec 1990 (Median) to 2016, with 27 observations. The data reached an all-time high of 2.938 QAR/Intl $ in 2008 and a record low of 1.992 QAR/Intl $ in 1994. Qatar QA: PPP Conversion Factor: Private Consumption data remains in active status in CEIC and is reported by the World Bank. The data is categorized under Global Database's Qatar – Table QA.World Bank: Gross Domestic Product: Purchasing Power Parity. The purchasing power parity conversion factor is the number of units of a country's currency required to buy the same amounts of goods and services in the domestic market as a U.S. dollar would buy in the United States. This conversion factor is for private consumption (i.e., household final consumption expenditure). For most economies PPP figures are extrapolated from the 2011 International Comparison Program (ICP) benchmark estimates or imputed using a statistical model based on the 2011 ICP. For 47 high- and upper middle-income economies conversion factors are provided by Eurostat and the Organisation for Economic Co-operation and Development (OECD). Source: World Bank, International Comparison Program database.
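As a quick worked example of how the conversion factor is used (only the 2.823 QAR/Intl $ rate comes from the series above; the consumption amount is illustrative):

```python
# Convert a nominal household-consumption amount in QAR to international
# dollars using the 2016 PPP conversion factor quoted above.
ppp_qar_per_intl_dollar_2016 = 2.823

consumption_qar = 10_000.0  # illustrative amount in Qatari riyal
consumption_intl = consumption_qar / ppp_qar_per_intl_dollar_2016

print(f"{consumption_qar:,.0f} QAR is about {consumption_intl:,.0f} international dollars")  # ~3,542
```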
A preliminary dataset of related tables and a corresponding set of natural language questions.
https://www.archivemarketresearch.com/privacy-policy
The global table-top tensile tester market is experiencing robust growth, driven by increasing demand across diverse sectors such as textiles, paper, and packaging. The market size in 2025 is estimated at $150 million, with a projected compound annual growth rate (CAGR) of 7% from 2025 to 2033. This growth is fueled by several factors, including the rising need for quality control in manufacturing, the miniaturization of testing equipment for space-constrained labs, and the growing adoption of automated testing procedures. The increasing preference for user-friendly, compact, and cost-effective testing solutions is further boosting the market. Single-column testing machines currently dominate the market; however, dual-column machines are expected to see faster growth due to their enhanced capacity and precision. Geographically, North America and Europe currently lead in market share, but significant growth potential lies in the Asia-Pacific region, driven by rapid industrialization and expanding manufacturing sectors in countries like China and India.

The market faces certain restraints, primarily the high initial investment cost of advanced testing equipment and the availability of cheaper, albeit less accurate, alternative testing methods. However, the long-term benefits of enhanced quality control and reduced production losses outweigh these initial costs, driving continued market expansion. Technological advancements, such as the integration of advanced software and digital data analysis, are expected to further enhance the capabilities and appeal of table-top tensile testers. This trend towards smart testing solutions will continue to shape the market landscape in the coming years, fostering innovation and creating new opportunities for market players. Competition is expected to remain fierce among established players such as Shimadzu, Instron, and ZwickRoell, while new entrants focus on niche applications and cost-effective solutions.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
ComTQA Dataset
1. Introduction
This dataset is a visual table question answering benchmark. The images are collected from FinTabNet and PubTables-1M. It includes a total of 9,070 QA pairs over 1,591 images. The specific distribution of the data is shown in the following table.
|  | PubTables-1M | FinTabNet | Total |
|---|---|---|---|
| Images | 932 | 659 | 1,591 |
| QA pairs | 6,232 | 2,838 | 9,070 |
2. How to use it
First, please download FinTabNet and PubTables-1M from their original… See the full description on the dataset page: https://huggingface.co/datasets/ByteDance/ComTQA.
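As a rough starting point, the QA annotations can typically be pulled with the Hugging Face `datasets` library while the table images are fetched separately from the original FinTabNet and PubTables-1M releases, as the card instructs; the config/split handling and the `image_name` field below are assumptions.

```python
from datasets import load_dataset

# Load the ComTQA annotations; images must be downloaded separately from
# the original FinTabNet and PubTables-1M releases, as described above.
comtqa = load_dataset("ByteDance/ComTQA")
split = next(iter(comtqa))
example = comtqa[split][0]
print(example)

# Join with a locally downloaded table image (field name is illustrative):
# image_path = os.path.join("pubtables1m_images", example["image_name"])
```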