Visual Question Answering (VQA) is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. The first version of the dataset was released in October 2015. VQA v2.0 was released in April 2017.
Visual Question Answering (VQA) v2.0 is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. It is the second version of the VQA dataset.
- 265,016 images (COCO and abstract scenes)
- At least 3 questions per image (5.4 questions on average)
- 10 ground-truth answers per question
- 3 plausible (but likely incorrect) answers per question
- Automatic evaluation metric (sketched below)
The first version of the dataset was released in October 2015.
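The automatic evaluation metric above is the standard VQA accuracy, which credits a predicted answer in proportion to how many of the 10 annotators gave it. A minimal Python sketch follows; the official script additionally averages over 9-annotator subsets and normalizes answer strings, which this simplified version omits.

def vqa_accuracy(predicted, human_answers):
    # Count exact (case-insensitive) matches against the 10 human answers.
    matches = sum(1 for a in human_answers if a.strip().lower() == predicted.strip().lower())
    # An answer given by 3 or more annotators counts as fully correct.
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "red", so "red" scores 1.0.
print(vqa_accuracy("red", ["red"] * 4 + ["maroon"] * 6))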
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
JA-VG-VQA-500
Dataset Description
JA-VG-VQA-500 is a 500-sample subset of the Japanese Visual Genome VQA dataset. This dataset was used in the evaluation of EvoVLM-JP-v1-7B. Please refer to our report and blog for more details. We are grateful to the developers for making the dataset available under the Creative Commons Attribution 4.0 License.
Visual Genome
Japanese Visual Genome VQA dataset
Usage
Use the code below to get started with the dataset. from datasets… See the full description on the dataset page: https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500.
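A minimal loading sketch, assuming the standard Hugging Face datasets API; the canonical snippet is on the dataset page linked above.

from datasets import load_dataset

# Loading without a split returns a DatasetDict; printing it shows the
# available splits and column names (the exact schema is on the dataset page).
dataset = load_dataset("SakanaAI/JA-VG-VQA-500")
print(dataset)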
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
JA-Multi-Image-VQA
Dataset Description
JA-Multi-Image-VQA is a dataset for evaluating question-answering capabilities over multiple image inputs. We carefully collected a diverse set of 39 images with 55 questions in total. Some images depict Japanese culture and objects found in Japan. The Japanese questions and answers were created manually.
Usage
from datasets import load_dataset
dataset = load_dataset("SakanaAI/JA-Multi-Image-VQA", split="test")
… See the full description on the dataset page: https://huggingface.co/datasets/SakanaAI/JA-Multi-Image-VQA.
No license specified: https://academictorrents.com/nolicensespecified
A BitTorrent file for downloading the dataset titled 'VQA: Visual Question Answering Dataset'
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
WorldCuisines is a massive-scale visual question answering (VQA) benchmark for multilingual and multicultural understanding through global cuisines. The dataset contains text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark as of 17 October 2024.… See the full description on the dataset page: https://huggingface.co/datasets/worldcuisines/vqa-v1.1.
VQA-RAD consists of 3,515 question–answer pairs on 315 radiology images.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
man-made architecture and natural scenery.
Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer.
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for VQA-RAD
Dataset Description
VQA-RAD is a dataset of question-answer pairs on radiology images. The dataset is intended to be used for training and testing Medical Visual Question Answering (VQA) systems. The dataset includes both open-ended questions and binary "yes/no" questions. The dataset is built from MedPix, which is a free open-access online database of medical images. The question-answer pairs were manually generated by a team of clinicians.… See the full description on the dataset page: https://huggingface.co/datasets/flaviagiammarino/vqa-rad.
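A hedged loading sketch using the Hugging Face datasets library; the split name and the "answer" field name are assumptions, so check the dataset page for the actual schema.

from datasets import load_dataset

ds = load_dataset("flaviagiammarino/vqa-rad", split="train")  # split name assumed

# Separate the binary yes/no questions from the open-ended ones described above,
# assuming answers are stored as plain strings under an "answer" field.
closed = ds.filter(lambda ex: str(ex["answer"]).strip().lower() in {"yes", "no"})
open_ended = ds.filter(lambda ex: str(ex["answer"]).strip().lower() not in {"yes", "no"})
print(len(closed), len(open_ended))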
ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The PitVQA dataset comprises 25 videos of endoscopic pituitary surgeries from the National Hospital of Neurology and Neurosurgery in London, United Kingdom, similar to the dataset used in the MICCAI PitVis challenge. All patients provided informed consent, and the study was registered with the local governance committee. The surgeries were recorded using a high-definition endoscope (Karl Storz Endoscopy) at a resolution of 720p and stored as MP4 files. All videos were annotated for surgical phases, steps, instruments present and operation notes, guided by a standardised annotation framework derived from a preceding international consensus study on pituitary surgery workflow. Annotation was performed collaboratively by two neurosurgical residents with operative pituitary experience and checked by an attending neurosurgeon.

We extracted image frames from each video at 1 fps and removed any frames that were blurred or occluded, yielding a total of 109,173 frames; the shortest and longest videos contributed 2,443 and 7,179 frames, respectively. We acquired frame-wise question-answer pairs for all categories of the annotation. Overall, there are 884,242 question-answer pairs over the 109,173 frames, around 8 pairs per frame. There are 59 annotation classes overall, including 4 phases, 15 steps, 18 instruments, 3 variations of instruments present in a frame, 5 positions of the instruments, and 14 operation notes. Question length ranges from a minimum of 7 words to a maximum of 12 words. A detailed description of the original videos is available from the MICCAI PitVis challenge, and the videos can be downloaded directly from the UCL HDR portal.
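As an illustration of the 1 fps sampling described above (not the authors' actual pipeline), the sketch below extracts one frame per second from an MP4 video with OpenCV and skips frames that a simple variance-of-Laplacian check flags as blurred; the paths and threshold are hypothetical.

import cv2

def extract_frames_1fps(video_path, out_dir, blur_threshold=100.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = int(round(fps))  # keep one frame per second of video
    kept, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Low Laplacian variance is a common heuristic for blur; the
            # threshold here is illustrative, not the value used for PitVQA.
            if cv2.Laplacian(gray, cv2.CV_64F).var() >= blur_threshold:
                cv2.imwrite(f"{out_dir}/frame_{kept:06d}.png", frame)
                kept += 1
        idx += 1
    cap.release()
    return kept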
The VQA 2.0 dataset is used for the visual question answering task. It consists of three splits: a train set with 83k images and 444k questions, a validation set with 41k images and 214k questions, and a test set with 81k images and 448k questions.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Le Trong Hieu
Released under Apache 2.0
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We present VQA-MHUG, a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA), collected using a high-speed eye tracker. To the best of our knowledge, this is the first resource containing multimodal human gaze data over a textual question and the corresponding image, and as such it allows researchers to jointly study human and machine attention. Our corpus encompasses task-specific gaze on a subset of the benchmark dataset VQAv2 val2. We use our dataset to analyse the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modulated Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorised Bilinear Pooling Network (MFB). While prior work has focused on studying the image modality, our analyses show, for the first time, that for all models a higher correlation with human attention on text is a significant predictor of VQA performance. This finding points to potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into architectures for vision-and-language tasks, including but potentially also beyond VQA.
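As an illustrative sketch of the kind of human-machine attention comparison described (not the paper's exact analysis), one can correlate a model's per-token text attention with human gaze density over the same question tokens; the numbers below are hypothetical.

import numpy as np
from scipy.stats import spearmanr

human_gaze = np.array([0.05, 0.10, 0.40, 0.30, 0.15])  # hypothetical per-token gaze density
model_attn = np.array([0.08, 0.12, 0.35, 0.25, 0.20])  # hypothetical per-token model attention

rho, p_value = spearmanr(human_gaze, model_attn)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")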
https://www.archivemarketresearch.com/privacy-policy
The Visual Question Answering (VQA) technology market is experiencing robust growth, driven by increasing demand for advanced image analysis and AI-powered solutions across diverse industries. The market, estimated at $2 billion in 2025, is projected to expand at a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant growth is fueled by several key factors. The proliferation of big data and the advancements in deep learning algorithms are enabling more accurate and efficient VQA systems. Furthermore, the rising adoption of VQA in sectors such as healthcare (for medical image analysis), retail (for enhanced customer experience), and autonomous vehicles (for scene understanding) is significantly boosting market expansion. The increasing availability of powerful cloud computing resources further facilitates the development and deployment of complex VQA models. While challenges such as data bias and the need for robust annotation techniques remain, the overall market outlook for VQA technology is extremely positive.

Segmentation analysis reveals strong growth across various application areas. The software industry currently leads in VQA adoption, followed by the computer and electronics industries. Within the technology itself, image classification and image identification are the dominant segments, indicating a strong focus on practical applications. Geographically, North America and Europe currently hold the largest market shares, but the Asia-Pacific region is expected to witness substantial growth in the coming years, driven by increasing investments in AI and technological advancements in countries like China and India.

Key players like Toshiba Corporation, Amazon Science, and Cognex are actively contributing to market growth through continuous innovation and strategic partnerships. The competitive landscape is dynamic, with both established tech giants and emerging startups vying for market share. The long-term outlook suggests that VQA technology will continue to be a critical component of various emerging technologies and will play a pivotal role in shaping the future of artificial intelligence.
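For reference, the figures above imply roughly a sixfold expansion by 2033; a quick back-of-the-envelope check, assuming the stated $2 billion 2025 base and a constant 25% CAGR:

base_2025 = 2.0                      # USD billions (stated 2025 estimate)
cagr = 0.25                          # stated compound annual growth rate
years = 2033 - 2025
projected_2033 = base_2025 * (1 + cagr) ** years
print(f"Implied 2033 market size: ${projected_2033:.1f}B")  # about $11.9B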
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The Kvasir-VQA dataset is an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations. This dataset is designed to facilitate advanced machine learning tasks in gastrointestinal (GI) diagnostics, including image captioning, Visual Question Answering (VQA), and text-based generation of synthetic medical images.
Homepage: https://datasets.simula.no/kvasir-vqa
Usage
You can use the Kvasir-VQA dataset directly from… See the full description on the dataset page: https://huggingface.co/datasets/SimulaMet-HOST/Kvasir-VQA.
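A minimal loading sketch, assuming the standard Hugging Face datasets API; the actual splits and columns are listed on the dataset page linked above.

from datasets import load_dataset

# Loading without a split returns a DatasetDict whose repr lists the
# available splits and fields for Kvasir-VQA.
dataset = load_dataset("SimulaMet-HOST/Kvasir-VQA")
print(dataset)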
VQA-E is a dataset for Visual Question Answering with Explanation, where models are required to generate an explanation along with the predicted answer. The VQA-E dataset is automatically derived from the VQA v2 dataset by synthesizing a textual explanation for each image-question-answer triple.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by MD Zeeshan Hassan
Released under Apache 2.0
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
PathVQA is sourced from He, Xuehai, et al. "PathVQA: 30000+ Questions for Medical Visual Question Answering." arXiv preprint arXiv:2003.10286 (2020). VQA-RAD is sourced from Lau, Jason J., et al. "A dataset of clinically generated visual questions and answers about radiology images." Scientific Data 5.1 (2018): 1-10.