MIT License (https://opensource.org/licenses/MIT)
AIME 2025 Dataset
Dataset Description
This dataset contains problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.
Compass Academic Predictions
This dataset stores most of the reusable evaluation results of OpenCompass, currently including model predictions on different datasets.
[Dataset on HF] [Project Page] [Subjective LeaderBoard] [Objective LeaderBoard]
CriticBench is a novel benchmark designed to comprehensively and reliably evaluate the critique abilities of Large Language Models (LLMs). These critique abilities are crucial for scalable oversight and self-improvement of LLMs. While many recent studies explore how LLMs can judge and refine flaws in their generated outputs, the measurement of critique abilities remains under-explored.
Here are the key aspects of CriticBench:
Purpose: To assess LLMs' critique abilities across four dimensions:
Feedback: How well an LLM provides constructive feedback.
Comparison: The ability to compare and contrast different responses.
Refinement: How effectively an LLM can refine flawed or suboptimal outputs.
Meta-feedback: The LLM's ability to reflect on its own performance.
Tasks: CriticBench encompasses nine diverse tasks, each evaluating LLMs' critique abilities at varying levels of quality granularity.
Evaluation: The benchmark evaluates both open-source and closed-source LLMs, revealing intriguing relationships between critique abilities, response qualities, and model scales.
Resources: Datasets, resources, and an evaluation toolkit for CriticBench will be publicly released.
In summary, CriticBench aims to provide a comprehensive framework for assessing LLMs' critique and self-improvement capabilities, contributing to the advancement of large-scale language models in various applications.
MIT License (https://opensource.org/licenses/MIT)
Dataset Description
Dataset Summary
The NeedleBench dataset is a part of the OpenCompass project, designed to evaluate the capabilities of large language models (LLMs) in processing and understanding long documents. It includes a series of test scenarios that assess models' abilities in long-text information extraction and reasoning. The dataset is structured to support tasks such as single-needle retrieval, multi-needle retrieval, multi-needle reasoning, and ancestral… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/NeedleBench.
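To make the task format concrete, here is a minimal sketch of how a single-needle retrieval case can be constructed: a short "needle" fact is inserted at a chosen relative depth inside long filler text and paired with a retrieval question. The function name, field names, and prompt template below are illustrative assumptions, not NeedleBench's actual construction code.

```python
# Minimal sketch of a single-needle retrieval test case, in the spirit of
# NeedleBench-style long-context evaluations. Field names and the prompt
# template are assumptions, not the dataset's actual schema.

def build_single_needle_case(haystack: str, needle: str, depth: float, question: str) -> dict:
    """Insert `needle` into `haystack` at a relative depth in [0, 1]
    and return a prompt/reference pair for retrieval evaluation."""
    position = int(len(haystack) * depth)
    context = haystack[:position] + " " + needle + " " + haystack[position:]
    prompt = (
        "Read the following document and answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return {"prompt": prompt, "reference": needle, "depth": depth}


if __name__ == "__main__":
    filler = "The committee discussed routine matters. " * 200  # stand-in long document
    case = build_single_needle_case(
        haystack=filler,
        needle="The secret launch code is 7421.",
        depth=0.5,
        question="What is the secret launch code?",
    )
    print(case["prompt"][:300])
```

Multi-needle variants follow the same idea, inserting several facts at different depths and asking questions that require retrieving or reasoning over all of them.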
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The opencompass/MMBench dataset is hosted on Hugging Face and contributed by the HF Datasets community.
MIT License (https://opensource.org/licenses/MIT)
MMMLU-Lite
Introduction
A lite version of the MMMLU dataset, which is a community version of the MMMLU dataset by OpenCompass. Because the original dataset is large (about 200k questions), we created this lite version to make it easier to use. We sample 25 examples from each language subject in the original dataset with a fixed seed to ensure reproducibility, giving 19,950 examples in the lite version, which is about 10% of… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/mmmlu_lite.
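For illustration, the following sketch shows fixed-seed, per-subject subsampling in the spirit of the description above. The record fields (language, subject), the seed value, and the grouping logic are assumptions, not the maintainers' actual script.

```python
import random
from collections import defaultdict

# Illustrative sketch of fixed-seed, per-group subsampling (25 examples per
# language/subject pair). Field names and the seed are assumptions.

def sample_lite(records: list[dict], per_group: int = 25, seed: int = 42) -> list[dict]:
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for rec in records:
        groups[(rec["language"], rec["subject"])].append(rec)

    rng = random.Random(seed)   # fixed seed => reproducible subset
    lite = []
    for key in sorted(groups):  # stable iteration order across runs
        pool = groups[key]
        lite.extend(rng.sample(pool, min(per_group, len(pool))))
    return lite
```

With 25 examples drawn per language-subject group, the resulting subset size depends only on the number of groups, which is why the lite split has a fixed, reproducible count.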
Dataset Card for REST
This dataset provides the data for the REST benchmark; it is identical to the original data of the corresponding benchmarks. REST combines multiple questions into one prompt by modifying the corresponding data-loading method in OpenCompass.
Data preparation
REST constructs the multi-problem version when loading the datasets, implemented in the StressDataset class, so data preparation is identical to the official practice of OpenCompass. … See the full description on the dataset page: https://huggingface.co/datasets/anonymous0523/REST.
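As a rough illustration of the idea, the sketch below groups several single-question items into one multi-problem prompt at load time. It is not the actual StressDataset implementation; the prompt template, field names, and chunk size are assumptions.

```python
# Illustrative sketch of combining several single-question items into one
# multi-problem prompt, in the spirit of REST's stress-style loading.
# Not the actual StressDataset code; template, fields, and k are assumptions.

def combine_questions(items: list[dict], k: int = 4) -> list[dict]:
    """Group `items` (each with a "question" and "answer") into prompts
    that each contain up to `k` numbered questions."""
    combined = []
    for start in range(0, len(items), k):
        chunk = items[start:start + k]
        prompt = "Answer each of the following questions in order.\n\n"
        prompt += "\n".join(
            f"Question {i + 1}: {item['question']}" for i, item in enumerate(chunk)
        )
        combined.append({
            "prompt": prompt,
            "references": [item["answer"] for item in chunk],
        })
    return combined
```

Because the combination happens in the loading step, the underlying benchmark files stay unchanged and only the prompts presented to the model differ.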
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages
This repository provides the datasets for evaluating Southeast Asian (SEA) large language models.
Project Website: sailorllm.github.io
Codebase: https://github.com/sail-sg/sailcompass
Acknowledgment
Thanks to the contributors of OpenCompass.
Citing this work
If you use this repository or the Sailor models, please cite: @misc{sailcompass, title={SailCompass: Towards Reproducible… See the full description on the dataset page: https://huggingface.co/datasets/sail/Sailcompass_data.
MIT License (https://opensource.org/licenses/MIT)
MMPR
[GitHub] [Blog] [Paper] [Documents]
2025/04/11: We release a new version of MMPR (i.e., MMPR-v1.2), which greatly enhances the overall performance of InternVL3.
2024/12/20: We release a new version of MMPR (i.e., MMPR-v1.1). Based on this dataset, InternVL2.5 outperforms its counterparts without MPO by an average of 2 points across all scales on the OpenCompass leaderboard.
Introduction
MMPR is a large-scale and high-quality multimodal reasoning… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/MMPR.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
CMB: A Comprehensive Medical Benchmark in Chinese
GitHub • Website • HuggingFace
Update
[2024.02.21] The answers to the CMB-Exam test have been updated, and some errors caused by omissions in version management have been fixed.
[2024.01.08] To facilitate testing, we disclose the answers to the CMB-Exam test.
[2023.09.22] CMB is included in OpenCompass.
[2023.08.21] Paper released.
[2023.08.01] CMB is published!
… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/CMB.