Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. The TextVQA dataset contains 45,336 questions over 28,408 images from the OpenImages dataset.
Large-scale Multi-modality Models Evaluation Suite
Accelerating the development of large-scale multi-modality models (LMMs) with lmms-eval
🏠 Homepage | 📚 Documentation | 🤗 Huggingface Datasets
This Dataset
This is a formatted version of TextVQA. It is used in our lmms-eval pipeline to allow for one-click evaluations of large multi-modality models.
@inproceedings{singh2019towards, title={Towards vqa models that can read}, author={Singh, Amanpreet and… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/textvqa.
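To inspect this formatted copy outside the lmms-eval pipeline, a minimal loading sketch with the Hugging Face `datasets` library might look like the following; the split name and column names are assumptions to verify against the dataset page.

```python
# Minimal sketch: load the formatted TextVQA copy with Hugging Face `datasets`.
# The split name "validation" and the exact column layout are assumptions;
# check the dataset page for the splits actually published.
from datasets import load_dataset

ds = load_dataset("lmms-lab/textvqa", split="validation")

sample = ds[0]
# Inspect which TextVQA-style fields (question, image, reference answers) exist.
print(sample.keys())
```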
redactable-llm/TextVQA dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TextVQA requires models to read and reason about text in an image to answer questions based on it. To perform well on this task, models need to first detect and read the text in the images, then reason about it to answer the question. Current state-of-the-art models fail to answer TextVQA questions because they do not have text reading and reasoning capabilities. See the examples in the image to compare ground-truth answers with the corresponding predictions of a state-of-the-art model. Challenge link: https://eval.ai/web/challenges/challenge-page/874/
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LIME-DATA/textvqa dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Overview
This dataset was created from 42,678 Vietnamese 🇻🇳 images with the latest GPT-4o. It offers superior quality compared to other existing datasets, with:
Highly detailed descriptions, from the overall composition of the image to descriptions of each object, including its location, quantity, and so on. Descriptions of text cover not only recognition but also the font style, color, position, and size of the text. Answers are very long and detailed, including… See the full description on the dataset page: https://huggingface.co/datasets/5CD-AI/Viet-ShareGPT-4o-Text-VQA.
nielsr/textvqa-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Existing visual question answering methods typically concentrate only on visual targets in images and ignore the key textual content they contain, limiting the depth and accuracy of image comprehension. Motivated by this, we address the task of text-based visual question answering, tackle the performance bottleneck caused by the over-fitting risk of existing self-attention-based models, and propose a scene-text visual question answering method, INT2-VQA, that fuses knowledge representations based on inter-modality and intra-modality collaboration. Specifically, we model the complementary prior knowledge of locational collaboration between visual and textual targets across modalities, and of contextual semantic collaboration among textual word targets within a modality. On this basis, a universal knowledge-reinforced attention module is designed to encode both relations in a unified representation. Extensive ablation studies, comparison experiments, and visual analyses demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TextOCR requires models to perform text-recognition on arbitrary shaped scene-text present on natural images. TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning.
Dataset Card for "spotlight-textvqa-enrichment"
More Information needed
TextVQA
Overview
TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions.
Statistics
28,408 images from OpenImages
45,336 questions
453,360 ground truth answers
Code and Papers
TextVQA and LoRRA at https://github.com/facebookresearch/pythia. Iterative Answer Prediction with… See the full description on the dataset page: https://huggingface.co/datasets/d-delaurier/redactable-text-vqa.
PresentLogic/textvqa-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TextVQA validation set with ground-truth bounding boxes
The dataset used in the paper MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs for studying MLLMs' attention patterns. The dataset is sourced from TextVQA and manually annotated with ground-truth bounding boxes. Only questions with a single area of interest in the image are considered, so 4,370 out of 5,000 samples are kept.
Citation
If you find our paper and code useful… See the full description on the dataset page: https://huggingface.co/datasets/jrzhang/TextVQA_GT_bbox.
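For experimenting with the annotated subset, a hedged loading sketch is shown below; the split name and column names (in particular the bounding-box field) are assumptions to check against the dataset page.

```python
# Minimal sketch: load the bounding-box-annotated TextVQA subset.
# The split name "validation" and the field layout are assumptions; consult
# the dataset page for the actual schema.
from datasets import load_dataset

ds = load_dataset("jrzhang/TextVQA_GT_bbox", split="validation")
print(len(ds))       # should be on the order of the 4,370 kept samples
print(ds[0].keys())  # inspect which columns (image, question, answers, bbox) exist
```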
MrZilinXiao/MMEB-eval-TextVQA-beir-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Mobile Capture VQA
What is it?
This benchmark is an early effort to evaluate existing VLMs on mobile capture data. It contains:
122 unique images
871 question/answer pairs
This dataset is a collection of "mobile capture" images, i.e. images taken with a cellphone. Most existing benchmarks rely on document scans/PDFs (DocVQA, ChartQA) or scene text recognition (TextVQA) but overlook the unique challenges that mobile capture poses:
poor… See the full description on the dataset page: https://huggingface.co/datasets/arnaudstiegler/mobile_capture_vqa.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Resources
Paper: VoQA: Visual-only Question Answering
Evaluation code and task brief introduction: LuyangJ/VoQA (GitHub)
Dataset folders: train (VoQA Training Dataset, 3.35M samples), test (VoQA Benchmark, 134k samples)
VoQA Benchmark
Sub-tasks
The VoQA evaluation dataset (VoQA Benchmark) includes the following tasks:
GQA, POPE, ScienceQA, TextVQA, VQAv2
Each task contains two data formats:
VoQA Task: watermark-rendered images
Traditional VQA Task: original… See the full description on the dataset page: https://huggingface.co/datasets/AJN-AI/VoQA.
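Because the repository is organized into train/test folders and several sub-tasks, listing the available configurations first avoids guessing names. A minimal exploration sketch follows; the existence of named configurations and the split names are assumptions to verify on the dataset page.

```python
# Minimal sketch: discover which configurations the VoQA repository exposes
# before loading anything. Config and split names are assumptions.
from datasets import get_dataset_config_names

configs = get_dataset_config_names("AJN-AI/VoQA")
print(configs)

# Once a configuration name is known, it can be loaded with
# datasets.load_dataset("AJN-AI/VoQA", <config>, split=...); verify the
# available splits on the dataset page first.
```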
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
nanoLLaVA - Sub 1B Vision-Language Model
Description
nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.
Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
Vision Encoder: google/siglip-so400m-patch14-384
Benchmark scores:
VQA v2: 70.84
TextVQA: 46.71
ScienceQA: 58.97
POPE: 84.1
MMMU (Test): 28.6
MMMU (Eval): 30.4
GQA: 54.79
MM-VET: 23.9
Training Data
Training Data will be released later as I am still writing a… See the full description on the dataset page: https://huggingface.co/datasets/taiseimatsuoka/test-public.
WeThink-Multimodal-Reasoning-120K
Image Type
Image data can be accessed from https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k
Image counts by source dataset, grouped by image type:
General Images: COCO (25,344), SAM-1B (18,091), Visual Genome (4,441), GQA (3,251), PISC (835), LLaVA (134)
Text-Intensive Images: TextVQA (25,483), ShareTextVQA (538), DocVQA (4,709), OCR-VQA (5,142), ChartQA (21,781)
Scientific & Technical: GeoQA+ (4,813), ScienceQA (4,990), AI2D (1,812), CLEVR-Math (677)… See the full description on the dataset page: https://huggingface.co/datasets/WeThink/WeThink-Multimodal-Reasoning-120K.