Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CulturalGround: Grounding Multilingual Multimodal LLMs With Cultural Knowledge
🌍 🇩🇪 🇫🇷 🇬🇧 🇪🇸 🇮🇹 🇵🇱 🇷🇺 🇨🇿 🇯🇵 🇺🇦 🇧🇷 🇮🇳 🇨🇳 🇳🇴 🇵🇹 🇮🇩 🇮🇱 🇹🇷 🇬🇷 🇷🇴 🇮🇷 🇹🇼 🇲🇽 🇮🇪 🇰🇷 🇧🇬 🇹🇭 🇳🇱 🇪🇬 🇵🇰 🇳🇬 🇮🇩 🇻🇳 🇲🇾 🇸🇦 🇮🇩 🇧🇩 🇸🇬 🇱🇰 🇰🇪 🇲🇳 🇪🇹 🇹🇿 🇷🇼 🏠 Homepage | 🤖 CulturalPangea-7B | 📊 CulturalGround | 💻 Github | 📄 Arxiv
We introduce CulturalGround, a large-scale cultural VQA dataset and a pipeline for creating cultural… See the full description on the dataset page: https://huggingface.co/datasets/neulab/CulturalGround.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models
Dataset Summary
TaiwanVQA is a visual question answering (VQA) benchmark designed to evaluate the capability of vision-language models (VLMs) in recognizing and reasoning about culturally specific content related to Taiwan. This dataset contains 2,736 images captured by our team, paired with 5,472 manually designed questions that cover diverse topics from daily life in Taiwan… See the full description on the dataset page: https://huggingface.co/datasets/hhhuang/TaiwanVQA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Visual Question Answering (VQA) is a Vision-to-Text (V2T) task that integrates visual
features of images with natural language questions to generate meaningful responses.
Most existing research has focused on English, leaving a significant gap for other
languages, including Amharic. Tourism, a major global industry, relies heavily on
interactions where visitors seek information about natural, historical, cultural, and
religious sites. Ethiopia is a remarkable tourist destination, home to unique sites such as
the Rock-hewn churches of Lalibela and the Castles of Gondar, as well as natural
phenomena like Simien National Park and Lake Tana. Most visitors are local, creating an
urgent need for a VQA model that can deliver accurate, culturally relevant information in
Amharic. Unfortunately, no such model currently exists to assist tourists at these heritage
sites. This research addresses this gap by developing an Amharic Visual Question
Answering model specifically tailored for Ethiopian tourism. A new Amharic VQA
dataset was created using 2,200 diverse images from Ethiopian tourist sites paired with
6,600 questions in Amharic, covering natural landmarks, historical sites, and religious
celebrations. Our dataset was collected from various sources, including the UNESCO
website, the Amhara Tourism office, and online platforms such as Facebook, Free pixel,
and Instagram. Each image is complemented by three corresponding questions
formulated by three individual experts and answered by ten candidates. The questions,
answers, and images are linked through annotations and fed into the model. We used
ResNet-50 for feature extraction and Bidirectional Gated Recurrent Unit (BiGRU) with
attention mechanisms, achieving a testing accuracy of 54.98%, demonstrating the model's
effectiveness in answering questions about Ethiopian heritage. Future work will
extend this research by incorporating external knowledge to generate answers and
descriptions beyond the image content, and by adding custom object detection.
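The model described above fuses ResNet-50 image features with a BiGRU question encoding through an attention mechanism. The abstract does not specify the exact attention formulation, so the following is a minimal dot-product-attention sketch in plain Python; the function name, vector dimensions, and toy inputs are illustrative assumptions, not details from the paper:

```python
import math

def attend(question_vec, region_feats):
    """Fuse a question encoding with image region features via
    dot-product attention (a simplified sketch; the paper's BiGRU
    attention formulation may differ).

    question_vec: length-d vector, e.g. the final BiGRU hidden state.
    region_feats: list of length-d vectors, e.g. ResNet-50 region features.
    Returns (attention weights, attention-pooled image vector).
    """
    # Score each image region against the question representation.
    scores = [sum(q * r for q, r in zip(question_vec, feat))
              for feat in region_feats]
    # Softmax over region scores (shifted by the max for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of region features -> one fused vector that a
    # downstream answer classifier would consume.
    d = len(question_vec)
    fused = [sum(w * feat[i] for w, feat in zip(weights, region_feats))
             for i in range(d)]
    return weights, fused

# Toy example: the question vector aligns with the first region,
# so that region receives the larger attention weight.
weights, fused = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In a full VQA model the fused vector would be passed to a classifier over the answer vocabulary; the sketch only shows the fusion step that the abstract attributes to the attention mechanism.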
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture
FoodieQA is a benchmark comprising multi-image VQA, single-image VQA, and text QA questions about regional Chinese food, built upon 389 unique food images covering 350 unique food entries. The food images were collected from individual volunteers rather than from the web to ensure evaluation fairness, and the benchmark is specifically designed to evaluate VLMs' fine-grained understanding of Chinese food culture.… See the full description on the dataset page: https://huggingface.co/datasets/lyan62/FoodieQA.
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the French Language Visual Question Answer Dataset. The dataset includes 5,000 diverse images and a total of 35,000+ associated question-answer pairs. This meticulously curated dataset advances AI models for multimodal data understanding and supports the development of French-language visual question answering (VQA) models.
This image question-answer training dataset comprises over 5,000 high-resolution images across diverse categories and scenes. Each image is carefully selected to represent a wide array of contexts, objects, and environments, ensuring comprehensive coverage for training robust VQA models.
The dataset includes more than 35,000 French-language question-answer pairs, around 7-10 per image. The pairs are thoughtfully crafted to cover various levels of complexity and question types, and are designed to test and improve a model's ability to understand and respond to visual inputs in natural language.
ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
ChitroJera is the first large-scale Visual Question Answering (VQA) dataset for Bangla, designed to capture regionally relevant cultural, linguistic, and visual contexts. It enables research on multimodal learning in low-resource languages and encourages the development of AI systems tailored to South Asian contexts.
📑 Paper
If you use ChitroJera in your research, please cite:… See the full description on the dataset page: https://huggingface.co/datasets/pltops/chitroJera.
Dataset Overview
This dataset was created from 56,989 Vietnamese 🇻🇳 localization images. The dataset includes quintessentially Vietnamese images such as scenic landscapes, historical sites, culinary specialties, festivals, cultural aspects from various regions, familiar rural scenes, and everyday life in urban areas, among others. Each image has been analyzed and annotated using advanced Visual Question Answering (VQA) techniques to produce a comprehensive dataset. There is a… See the full description on the dataset page: https://huggingface.co/datasets/5CD-AI/Viet-Localization-VQA.