Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for "LiveMathBench"
Homepage: https://open-compass.github.io/GPassK/ Repository: https://github.com/open-compass/GPassK Paper: Are Your LLMs Capable of Stable Reasoning?
Introduction
LiveMathBench is a mathematical dataset built around challenging, recently released question sets from various mathematical competitions, aiming to avoid the data-contamination issues that affect existing LLMs and public math benchmarks.
Leaderboard
The Latest… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/LiveMathBench.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Homepage: https://mmbench-video.github.io/ Repository: https://huggingface.co/datasets/opencompass/MMBench-Video Paper: MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding.
Introduction
MMBench-Video is a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates approximately 600 web videos… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/MMBench-Video.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
CodeCompass: A Benchmark for Code Generation
Paper: Rethinking Verification for LLM Code Generation: From Generation to Testing
Description
CodeCompass is a rigorous benchmark designed to evaluate the code generation capabilities of Large Language Models (LLMs). It comprises a comprehensive collection of programming problems sourced from competitive platforms, offering a standardized framework for assessing algorithmic reasoning, problem-solving, and code synthesis in a… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/CodeCompass.
CriticBench: Evaluating Large Language Model as Critic
This repository is the official implementation of CriticBench, a comprehensive benchmark for evaluating the critique ability of LLMs.
Introduction
CriticBench: Evaluating Large Language Model as Critic
Tian Lan1*, Wenwei Zhang2*, Chen Xu1, Heyan Huang1, Dahua Lin2, Kai Chen2†, Xian-ling Mao1† († Corresponding Author, * Equal Contribution) 1 Beijing Institute of Technology, 2 Shanghai AI Laboratory
[Dataset on HF]… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/CriticBench.
Compass Academic Predictions
This dataset stores most of the reusable evaluation results of OpenCompass, currently including model predictions on different datasets.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
Dataset Card for VerifierBench
Dataset Description
VerifierBench is a comprehensive benchmark for evaluating the verification capabilities of Large Language Models (LLMs). It covers multiple domains spanning math, knowledge, science, and diverse reasoning tasks, and handles various answer types, including multi-subproblems, formulas, and sequence answers, while effectively… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/VerifierBench.
MIT License https://opensource.org/licenses/MIT
AIME 2025 Dataset
Dataset Description
This dataset contains problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.
MIT License https://opensource.org/licenses/MIT
Dataset Description
Dataset Summary
The NeedleBench dataset is a part of the OpenCompass project, designed to evaluate the capabilities of large language models (LLMs) in processing and understanding long documents. It includes a series of test scenarios that assess models' abilities in long text information extraction and reasoning. The dataset is structured to support tasks such as single-needle retrieval, multi-needle retrieval, multi-needle reasoning, and ancestral… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/NeedleBench.
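The single-needle retrieval setting described above can be sketched as inserting a target fact (the "needle") at a chosen relative depth inside a long distractor context; this is a minimal illustration only, and NeedleBench's actual templates and task formats differ.

```python
def insert_needle(haystack_sentences, needle, depth=0.5):
    """Insert a 'needle' sentence at a relative depth (0.0 = start,
    1.0 = end) within a list of distractor sentences.

    A minimal sketch of the single-needle retrieval construction;
    the function name and interface are illustrative assumptions.
    """
    pos = int(len(haystack_sentences) * depth)
    return haystack_sentences[:pos] + [needle] + haystack_sentences[pos:]


# Build a toy long context with the needle placed halfway through.
haystack = [f"Filler sentence number {i}." for i in range(10)]
context = insert_needle(haystack, "The secret code is 7421.", depth=0.5)
```

Multi-needle variants repeat this insertion at several depths, and the reasoning variants require combining the retrieved facts rather than merely locating one.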
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
opencompass/MMBench dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
CodeForce-SAGA: A Self-Correction-Augmented Code Generation Dataset
CodeForce-SAGA is a large-scale, high-quality training dataset designed to enhance the code generation and problem-solving capabilities of Large Language Models (LLMs). All problems and solutions are sourced from the competitive programming platform Codeforces. This dataset is built upon the SAGA (Strategic Adversarial & Constraint-differential Generative workflow) framework, a novel human-LLM collaborative… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/CodeForce_SAGA.
MIT License https://opensource.org/licenses/MIT
MMMLU-Lite
Introduction
A lite version of the MMMLU dataset, a community version of MMMLU maintained by OpenCompass. Because the original dataset is large (about 200k questions), we created a lite version that is easier to use. We sample 25 examples from each subject in each language of the original dataset with a fixed seed to ensure reproducibility, yielding 19,950 examples in the lite version, which is about 10% of… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/mmmlu_lite.
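The seeded per-subject sampling described above can be sketched as follows; the `language` and `subject` field names are illustrative assumptions, not the dataset's documented schema.

```python
import random


def sample_lite_split(examples, per_subject=25, seed=42):
    """Sample a fixed number of examples per (language, subject) group
    with a fixed seed, mirroring the lite-subset construction.

    A minimal sketch under assumed field names; the real script may
    use the Hugging Face `datasets` API instead of plain lists.
    """
    # Group examples by (language, subject).
    groups = {}
    for ex in examples:
        groups.setdefault((ex["language"], ex["subject"]), []).append(ex)

    # Sort group keys so iteration order (and thus output) is deterministic.
    rng = random.Random(seed)
    lite = []
    for key in sorted(groups):
        pool = groups[key]
        lite.extend(rng.sample(pool, min(per_subject, len(pool))))
    return lite
```

Using a single seeded `random.Random` instance and a sorted iteration order makes the subset fully reproducible across runs.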
Dataset Card for Dataset Name
This dataset provides the data for the REST benchmark. The data are identical to the original data of the corresponding benchmarks; REST combines multiple questions into one prompt by modifying the corresponding data-loading method in OpenCompass.
Data preparation
REST constructs the multi-problem version when the datasets are loaded, implemented in the StressDataset class, so data preparation is identical to the official OpenCompass practice.… See the full description on the dataset page: https://huggingface.co/datasets/anonymous0523/REST.
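The multi-problem construction described above can be sketched as batching several single questions into one numbered prompt; the prompt template and function name here are illustrative assumptions, not the actual StressDataset implementation.

```python
def build_stress_prompts(problems, k=3, header="Solve each of the following problems."):
    """Combine every k single problems into one multi-problem prompt.

    A minimal sketch of REST-style prompt construction; the real
    StressDataset class builds prompts inside the OpenCompass
    data-loading pipeline with its own template.
    """
    prompts = []
    for i in range(0, len(problems), k):
        chunk = problems[i:i + k]
        numbered = "\n".join(
            f"Problem {j + 1}: {p}" for j, p in enumerate(chunk)
        )
        prompts.append(f"{header}\n{numbered}")
    return prompts
```

Because the underlying data are unchanged, stress levels can be varied simply by adjusting `k` at load time.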
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages
This repository provides the dataset for evaluating Southeast Asian (SEA) large language models.
Project Website: sailorllm.github.io Codebase: https://github.com/sail-sg/sailcompass
Acknowledgment
Thanks to the contributors of OpenCompass.
Citing this work
If you use this repository or sailor models, please cite @misc{sailcompass, title={SailCompass: Towards Reproducible… See the full description on the dataset page: https://huggingface.co/datasets/sail/Sailcompass_data.
MIT License https://opensource.org/licenses/MIT
📘 Dataset Description
StaticEmbodiedBench is a dataset for evaluating vision-language models on embodied intelligence tasks, as featured in the OpenCompass leaderboard. It covers three key capabilities:
Macro Planning: Decomposing a complex task into a sequence of simpler subtasks. Micro Perception: Performing concrete simple tasks such as spatial understanding and fine-grained perception. Stage-wise Reasoning: Deciding the next action based on the agent’s current state and… See the full description on the dataset page: https://huggingface.co/datasets/xiaojiahao/StaticEmbodiedBench.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
CMB: A Comprehensive Medical Benchmark in Chinese
🌐 Github • 🌐 Website • 🤗 HuggingFace
🌈 Update
[2024.02.21] The answers to the CMB-Exam test have been updated, and some errors caused by omissions in version management have been fixed.
[2024.01.08] To facilitate testing, we disclose the answers to the CMB-Exam test.
[2023.09.22] CMB is included in OpenCompass.
[2023.08.21] Paper released.
[2023.08.01] 🎉🎉🎉 CMB is published! 🎉🎉🎉
🌐… See the full description on the dataset page: https://huggingface.co/datasets/fzkuji/CMB.
MIT License https://opensource.org/licenses/MIT
MMPR
[📂 GitHub] [🆕 Blog] [📜 Paper] [📖 Documents] 2025/04/11: We release a new version of MMPR (i.e., MMPR-v1.2), which greatly enhances the overall performance of InternVL3. 2024/12/20: We release a new version of MMPR (i.e., MMPR-v1.1). Based on this dataset, InternVL2.5 outperforms its counterparts without MPO by an average of 2 points across all scales on the OpenCompass leaderboard.
Introduction
MMPR is a large-scale and high-quality multimodal reasoning… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/MMPR.