Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We challenge you with a visual question answering task! Given an image and a textual question, draw a bounding box around the object that correctly answers that question.
Our dataset consists of images paired with textual questions. One entry (instance) in the dataset is a question-image pair labeled with the ground-truth coordinates of a bounding box containing the visual answer to the given question.
The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, MS COCO. All data labeling was performed on the Toloka crowdsourcing platform, https://toloka.ai/. The entire dataset can be downloaded at https://doi.org/10.5281/zenodo.7057740.
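Since each instance pairs a question with a single ground-truth box, predictions for this task are typically scored by intersection over union (IoU) between the predicted and ground-truth rectangles. The sketch below is a minimal, generic implementation; the (left, top, right, bottom) field ordering is an assumption and should be checked against the dataset's actual schema.

# Minimal IoU sketch for axis-aligned boxes given as (left, top, right, bottom)
# in pixel coordinates; the field ordering is an assumption, not the dataset's
# confirmed column layout.
def iou(box_a: tuple, box_b: tuple) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction that partially overlaps a hypothetical ground-truth box.
print(iou((120, 85, 340, 290), (150, 100, 360, 300)))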
Please cite the challenge results or dataset description as follows.
@inproceedings{TolokaWSDMCup2023,
author = {Ustalov, Dmitry and Pavlichenko, Nikita and Koshelev, Sergey and Likhobaba, Daniil and Smirnova, Alisa},
title = {{Toloka Visual Question Answering Benchmark}},
year = {2023},
eprint = {2309.16511},
eprinttype = {arxiv},
eprintclass = {cs.CV},
language = {english},
}
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The first significant Visual Question Answering (VQA) dataset was the DAtaset for QUestion Answering on Real-world images (DAQUAR). It contains 6794 training and 5674 test question-answer pairs, based on images from the NYU-Depth V2 Dataset. That means about 9 pairs per image on average.
This dataset is a processed version of the full DAQUAR dataset in which the questions have been normalized (for easier consumption by tokenizers) and the image IDs, questions, and answers are stored in tabular (CSV) format that can be loaded and used as-is for training VQA models.
This dataset contains the processed DAQUAR Dataset (full), along with some of the raw files from the original dataset.
Processed data:
- data.csv: The processed dataset after normalizing all the questions and converting the {question, answer, image_id} data into a tabular format for easier consumption.
- data_train.csv: The records from data.csv corresponding to images listed in train_images_list.txt.
- data_eval.csv: The records from data.csv corresponding to images listed in test_images_list.txt.
- answer_space.txt: A list of all possible answers extracted from all_qa_pairs.txt, which allows the VQA task to be modelled as a multi-class classification problem (see the loading sketch after the file lists).
Raw files:
- all_qa_pairs.txt
- train_images_list.txt
- test_images_list.txt
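As a rough illustration of how these files fit together, the sketch below loads the processed CSVs with pandas and maps answers onto class indices using answer_space.txt. The column names (question, answer, image_id) follow the description above, but the exact schema and any multi-answer handling should be verified against the actual files.

# Hedged sketch: load the processed DAQUAR CSVs and map answers to class ids
# so VQA can be framed as multi-class classification.
import pandas as pd

train_df = pd.read_csv("data_train.csv")
eval_df = pd.read_csv("data_eval.csv")

with open("answer_space.txt") as f:
    answer_space = [line.strip() for line in f if line.strip()]
answer_to_id = {answer: idx for idx, answer in enumerate(answer_space)}

# Some DAQUAR answers are comma-separated lists; keeping only the first answer
# is a common simplification (our assumption, not part of the original spec).
train_df["label"] = (
    train_df["answer"].str.split(",").str[0].str.strip().map(answer_to_id)
)
print(train_df[["image_id", "question", "label"]].head())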
Malinowski, Mateusz, and Mario Fritz. "A multi-world approach to question answering about real-world scenes based on uncertain input." Advances in neural information processing systems 27 (2014): 1682-1690.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The PitVQA dataset comprises 25 videos of endoscopic pituitary surgeries from the National Hospital of Neurology and Neurosurgery in London, United Kingdom, similar to the dataset used in the MICCAI PitVis challenge. All patients provided informed consent, and the study was registered with the local governance committee. The surgeries were recorded using a high-definition endoscope (Karl Storz Endoscopy) at a resolution of 720p and stored as MP4 files. All videos were annotated for surgical phases, steps, instruments present, and operation notes, guided by a standardised annotation framework derived from a preceding international consensus study on pituitary surgery workflow. Annotation was performed collaboratively by two neurosurgical residents with operative pituitary experience and checked by an attending neurosurgeon.
We extracted image frames from each video at 1 fps and removed any frames that were blurred or occluded. Ultimately, we obtained a total of 109,173 frames, with the shortest and longest videos yielding 2,443 and 7,179 frames, respectively. We acquired frame-wise question-answer pairs for all categories of the annotation. Overall, there are 884,242 question-answer pairs from 109,173 frames, around 8 pairs per frame. There are 59 classes overall, including 4 phases, 15 steps, 18 instruments, 3 variations of instruments present in a frame, 5 positions of the instruments, and 14 operation notes. The questions range in length from 7 to 12 words. A detailed description of the original videos can be found at the MICCAI PitVis challenge, and the videos can be downloaded directly from the UCL HDR portal.
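The 1 fps frame sampling described above is straightforward to reproduce. The sketch below is a minimal, hypothetical version using OpenCV; the variance-of-Laplacian blur filter and its threshold are our own assumptions, not the dataset authors' exact exclusion criteria.

# Hedged sketch of 1 fps frame extraction from an MP4 recording; paths and the
# blur threshold are illustrative only.
import cv2

def extract_frames(video_path: str, out_dir: str, blur_threshold: float = 100.0) -> int:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = int(round(fps))  # keep roughly one frame per second
    kept, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # Variance of the Laplacian is a simple sharpness heuristic used here
            # to drop obviously blurred frames (an assumption on our part).
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if cv2.Laplacian(gray, cv2.CV_64F).var() >= blur_threshold:
                cv2.imwrite(f"{out_dir}/frame_{kept:06d}.png", frame)
                kept += 1
        index += 1
    cap.release()
    return kept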
Visual Text Question Answering (VTQA) is a new challenge, accompanied by a corresponding dataset of 23,781 questions based on 10,124 image-text pairs.
Unknown license, https://choosealicense.com/licenses/unknown/
EVJVQA - Multilingual Visual Question Answering
Abstract
Visual Question Answering (VQA) is a challenging task at the intersection of natural language processing (NLP) and computer vision (CV), attracting significant attention from researchers. English is a resource-rich language with many datasets and models for visual question answering, whereas resources and models for visual question answering in other languages remain underdeveloped. In addition, there is no… See the full description on the dataset page: https://huggingface.co/datasets/dinhanhx/evjvqa.
No license specified, https://academictorrents.com/nolicensespecified
We propose an artificial intelligence challenge to design algorithms that assist people who are blind to overcome their daily visual challenges. For this purpose, we introduce the VizWiz dataset, which originates from a natural visual question answering setting where blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. Our proposed challenge addresses the following two tasks for this dataset: (1) predict the answer to a visual question and (2) predict whether a visual question cannot be answered. Ultimately, we hope this work will educate more people about the technological needs of blind people while providing an exciting new opportunity for researchers to develop assistive technologies that eliminate accessibility barriers for blind people.
VizWiz v1.0 dataset download:
- 20,000 training image/question pairs
- 200,000 training answer/answer confidence pairs
- 3,173 validation image/question pairs
- 31,730 validation answer/answer confidence pairs
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TextVQA requires models to read and reason about text in an image in order to answer questions about that text. To perform well on this task, models first need to detect and read text in the images and then reason about it to answer the question. Current state-of-the-art models fail on TextVQA because they lack these text reading and reasoning capabilities. See the examples in the image to compare ground-truth answers with corresponding predictions by a state-of-the-art model. Challenge link: https://eval.ai/web/challenges/challenge-page/874/
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for SEA-VQA
SEA-VQA is a dataset designed to evaluate the performance of Visual Question Answering (VQA) models on culturally specific content from Southeast Asia (SEA). This dataset aims to highlight the challenges and gaps in existing VQA models when confronted with culturally rich content.
Dataset Details
Dataset Description
SEA-VQA is a specialized VQA dataset that includes images from eight Southeast Asian countries, curated from the UNESCO… See the full description on the dataset page: https://huggingface.co/datasets/wit543/sea-vqa.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The task of Difference Visual Question Answering involves answering questions about the difference between a pair of main and reference images. This mirrors radiologists' diagnostic practice of comparing the current image with a reference image before concluding the report. We've assembled a new dataset, called Medical-Diff-VQA, for this purpose. Unlike previous medical VQA datasets, ours is the first designed specifically for the Difference Visual Question Answering task, with questions crafted to suit the Assessment-Diagnosis-Intervention-Evaluation treatment procedure employed by medical professionals. The Medical-Diff-VQA dataset, a derivative of the MIMIC-CXR dataset, consists of questions in seven categories: abnormality (145,421), location (84,193), type (27,478), level (67,296), view (56,265), presence (155,726), and difference (164,324). The 'difference' questions are specifically for comparing two images. In total, the Medical-Diff-VQA dataset contains 700,703 question-answer pairs derived from 164,324 pairs of main and reference images.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: I uploaded this dataset for ease of use; the original can be found here.
Updated version as of January 10, 2023: replaced “unsuitable” or “unsuitable image” in the answers with “unanswerable”.
We introduce the visual question answering (VQA) dataset, which we call VizWiz-VQA. It originates from a natural visual question answering setting where blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. Our proposed challenge addresses the following two tasks for this dataset: predict the answer to a visual question and predict whether a visual question cannot be answered.
https://www.datainsightsmarket.com/privacy-policy
The Visual Question Answering (VQA) technology market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and computer vision across various sectors. While precise market figures are unavailable, considering a typical CAGR for emerging AI technologies of 20-25%, and a plausible 2025 market size of $500 million (this is an estimation based on the growth rates of similar AI sub-segments), we can project a significant expansion. Key drivers include the escalating demand for automated data analysis, particularly in healthcare (medical image analysis), retail (customer service chatbots), and autonomous vehicles (object recognition and scene understanding). Advances in deep learning algorithms and the availability of large-scale image and text datasets further fuel market growth.
Trends suggest a shift towards more sophisticated VQA systems capable of handling complex questions and nuanced visual contexts. This includes integrating natural language processing (NLP) capabilities for better question understanding and context awareness. Restraints include challenges in handling ambiguous questions, addressing biases in training data, and ensuring the robustness and reliability of VQA systems in real-world applications. The development of explainable AI (XAI) techniques to improve transparency and trust in VQA outputs is also crucial.
The market segmentation likely includes solutions categorized by deployment (cloud-based vs. on-premise), application (healthcare, retail, automotive, etc.), and technology (deep learning, convolutional neural networks, etc.). Companies like Toshiba, Amazon, Cognex, and others are actively involved in developing and deploying VQA technologies, contributing to a competitive landscape. Regional distribution will likely see a strong presence in North America and Europe initially, followed by growth in Asia-Pacific and other regions as adoption expands. Continued innovation in areas such as multimodal learning, transfer learning, and few-shot learning will be vital for future growth and the improvement of VQA systems' overall accuracy and efficiency. The forecast period (2025-2033) points towards sustained market expansion as businesses increasingly leverage VQA for improved decision-making, enhanced efficiency, and new avenues of innovation.
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
OpenViVQA: Open-domain Vietnamese Visual Question Answering
The OpenViVQA dataset contains 11,000+ images with 37,000+ question-answer pairs, introducing text-based open-ended visual question answering in Vietnamese. The dataset is publicly available to the research community through the VLSP 2023 - ViVRC shared task challenge. You can access the dataset and submit your results for evaluation on the private test set via the Codalab evaluation system. Link to the OpenViVQA… See the full description on the dataset page: https://huggingface.co/datasets/uitnlp/OpenViVQA-dataset.
The beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview of results for MNIST Dialog, CLEVR-Dialog and CLEVR VQA.
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark
Paper: https://www.arxiv.org/abs/2507.13405
Repository: https://github.com/corevqa/COREVQA
Demo: https://colab.research.google.com/drive/1SpuTta5tSzktiCo9xN4CtE9P1pmYV0ax
CrowdHuman Dataset Homepage: https://www.crowdhuman.org/
Abstract
Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model's ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5,608 pairs of images and synthetically generated true/false statements, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%–69.95%). This significant performance gap reveals key limitations in VLMs' ability to reason over certain types of image–question pairs in crowded scenes.
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TyDiQA is the gold passage version of the Typologically Diverse Question Answering (TyDi QA) dataset, a benchmark for information-seeking question answering; this gold passage version covers nine languages. It is a simplified version of the primary task, which uses only the gold passage as context and excludes unanswerable questions. It is thus similar to XQuAD and MLQA, while being more challenging, as the questions were written without seeing the answers, leading to 3× and 2× less lexical overlap compared to XQuAD and MLQA respectively.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
VisualOverload
[📚 Paper] [💻 Code] [🌐 Project Page] [🏆 Leaderboard] [🎯 Online Evaluator]
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question–answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple… See the full description on the dataset page: https://huggingface.co/datasets/paulgavrikov/visualoverload.
Question-answering has become a popular task, with many practical applications (e.g. dialogue systems). It's appealingly easy to interpret and quantitatively evaluate, and with a simple setup, it's relatively easy to generate the very large datasets that work well for deep learning. Advancements in visual QA (with images and natural language questions) with large-scale datasets have pushed this field forward rapidly, and with this challenge dataset we hope to extend this progress to video.
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as Stanford Question Answering Dataset (SQuAD), as the answers for a question may not be directly obtained by span prediction and the context is very long. TriviaQA dataset consists of both human-verified and machine-generated QA subsets.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview of primitive operations categorised by their symbolic or subsymbolic implementation.