MMBench is a multi-modality benchmark. It methodically develops a comprehensive evaluation pipeline composed of two elements. The first is a meticulously curated dataset that surpasses existing similar benchmarks in the number and variety of evaluation questions and abilities. The second is a novel CircularEval strategy that uses ChatGPT to convert free-form predictions into the pre-defined choices, enabling a more robust evaluation of model predictions.
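CircularEval is commonly described as asking each multiple-choice question once per circular shift of its answer options and counting the question correct only if the model answers correctly under every shift. The sketch below illustrates that idea under those assumptions; it is a minimal illustration, not the official MMBench implementation. `ask_model` is a hypothetical callable, and the single-letter matching stands in for the ChatGPT-based mapping of free-form output to a choice described above.

```python
from typing import Callable, Sequence

def circular_eval(
    question: str,
    options: Sequence[str],           # e.g. ["cat", "dog", "bird", "fish"]
    correct_index: int,               # index of the right answer in `options`
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer letter out
) -> bool:
    """Count the question correct only if every rotation is answered correctly."""
    n = len(options)
    letters = "ABCD"[:n]  # sketch assumes at most four options
    for shift in range(n):
        # Rotate the options so the correct letter changes on each pass.
        rotated = [options[(i + shift) % n] for i in range(n)]
        prompt = question + "\n" + "\n".join(
            f"{letters[i]}. {opt}" for i, opt in enumerate(rotated)
        )
        # The real pipeline maps free-form output to a choice via ChatGPT;
        # here we assume the model already replies with a single letter.
        predicted = ask_model(prompt).strip().upper()[:1]
        correct_letter = letters[(correct_index - shift) % n]
        if predicted != correct_letter:
            return False  # one failed rotation fails the whole question
    return True
```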
MIT License (https://opensource.org/licenses/MIT)
MM-SpuBench Datacard
Basic Information
Title: The Multimodal Spurious Benchmark (MM-SpuBench)
Description: MM-SpuBench is a comprehensive benchmark designed to evaluate the robustness of multimodal large language models (MLLMs) to spurious biases. It systematically assesses how well these models distinguish between core and spurious features, providing a detailed framework for understanding and quantifying spurious biases.
Data Structure:
├── data/images
│   ├── 000000.jpg
│   ├── 000001.jpg
│   …
See the full description on the dataset page: https://huggingface.co/datasets/mmbench/MM-SpuBench.
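For local inspection, the card's dataset ID can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the exact configuration and split names are not shown above, so treat them as assumptions.

```python
from datasets import load_dataset

# Dataset ID taken from the URL above; config/split names are assumptions.
ds = load_dataset("mmbench/MM-SpuBench")
print(ds)                  # inspect which splits are actually available
# print(ds["train"][0])    # first example, if a "train" split exists
```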
Dataset Card for "MMBench_dev"
Dataset Summary
Recent years have seen a surge of vision-language (VL) models, such as MiniGPT-4 and LLaVA, that show promising performance on previously challenging tasks. However, effectively evaluating these models has become a primary obstacle to further progress in large VL models. Traditional benchmarks such as VQAv2 and COCO Caption are widely used to… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/MMBench_dev.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
🖥️ MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
Introduction
We are happy to release MMBench-GUI, a hierarchical, multi-platform benchmark framework and toolbox for evaluating GUI agents. MMBench-GUI comprises four evaluation levels: GUI Content Understanding, GUI Element Grounding, GUI Task Automation, and GUI Task Collaboration. We also propose the Efficiency–Quality Area (EQA) metric for agent navigation, integrating… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/MMBench-GUI.
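The EQA definition is cut off above. Purely as an illustration of what an area-style metric that integrates task quality over an efficiency axis could look like, here is a hypothetical sketch; the function name, inputs, and normalization are all assumptions, not the official MMBench-GUI formulation.

```python
import numpy as np

def efficiency_quality_area(step_budgets, quality):
    """Hypothetical area-style score: integrate task quality over the
    step-budget axis and normalize, so agents that reach high quality
    with fewer steps score higher. NOT the official MMBench-GUI EQA,
    whose definition is truncated in the card above.
    """
    x = np.asarray(step_budgets, dtype=float)  # allowed steps per task
    y = np.asarray(quality, dtype=float)       # success rate at each budget
    # Trapezoidal rule, written out to avoid NumPy-version differences.
    area = float(np.sum(0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])))
    return area / (x[-1] - x[0])               # normalize to [0, 1]

# Example: quality climbs quickly at small budgets -> larger area.
print(efficiency_quality_area([5, 10, 20, 40], [0.4, 0.7, 0.8, 0.85]))
```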
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
YaxinLuo/mmbench dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0, CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
Based on Kuaishou short-video data, we constructed six datasets to evaluate the performance of Vision-Language Models (VLMs) such as Kwai Keye-VL-8B, Qwen2.5-VL, and InternVL.
Tasks
| Task | Description |
| --- | --- |
| CPV | Predicting product attributes in e-commerce. |
| Hot_Videos_Aggregation | Determining whether multiple videos belong to the same topic. |
| Collection_Order | Determining the logical order among multiple videos on the same topic. |

… See the full description on the dataset page: https://huggingface.co/datasets/Kwai-Keye/KC-MMbench.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
VLMEval/GMAI-MMBench dataset hosted on Hugging Face and contributed by the HF Datasets community
lscpku/MMBench-Video dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License (https://opensource.org/licenses/MIT)
This is a subset of the video understanding benchmark MMBench-Video.
ko-vlm/K-MMBench dataset hosted on Hugging Face and contributed by the HF Datasets community