10 datasets found
  1. AIME2025

    • huggingface.co
    Updated Feb 8, 2025
    Cite
    OpenCompass (2025). AIME2025 [Dataset]. https://huggingface.co/datasets/opencompass/AIME2025
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 8, 2025
    Dataset authored and provided by
    OpenCompass
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AIME 2025 Dataset

      Dataset Description
    

    This dataset contains problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.

  2. compass_academic_predictions

    • huggingface.co
    Updated Apr 10, 2025
    Cite
    OpenCompass (2025). compass_academic_predictions [Dataset]. https://huggingface.co/datasets/opencompass/compass_academic_predictions
    Dataset updated
    Apr 10, 2025
    Dataset authored and provided by
    OpenCompass
    Description

    Compass Academic Predictions

    This dataset stores most of the reusable evaluation results of OpenCompass, currently including model predictions on different datasets.

  3. open-compass/CriticBench Dataset

    • paperswithcode.com
    Updated Feb 24, 2024
    Cite
    Tian Lan; Wenwei Zhang; Chen Xu; Heyan Huang; Dahua Lin; Kai Chen; Xian-Ling Mao (2024). open-compass/CriticBench Dataset [Dataset]. https://paperswithcode.com/dataset/open-compass-criticbench
    Dataset updated
    Feb 24, 2024
    Authors
    Tian Lan; Wenwei Zhang; Chen Xu; Heyan Huang; Dahua Lin; Kai Chen; Xian-Ling Mao
    Description

    [Dataset on HF] [Project Page] [Subjective LeaderBoard] [Objective LeaderBoard]

    CriticBench is a novel benchmark designed to comprehensively and reliably evaluate the critique abilities of Large Language Models (LLMs). These critique abilities are crucial for scalable oversight and self-improvement of LLMs. While many recent studies explore how LLMs can judge and refine flaws in their generated outputs, the measurement of critique abilities remains under-explored.

    Here are the key aspects of CriticBench:

    Purpose: To assess LLMs' critique abilities across four dimensions:

    Feedback: How well an LLM provides constructive feedback.
    Comparison: The ability to compare and contrast different responses.
    Refinement: How effectively an LLM can refine flawed or suboptimal outputs.
    Meta-feedback: The LLM's ability to reflect on its own performance.

    Tasks: CriticBench encompasses nine diverse tasks, each evaluating LLMs' critique abilities at varying levels of quality granularity.

    Evaluation: The benchmark evaluates both open-source and closed-source LLMs, revealing intriguing relationships between critique abilities, response qualities, and model scales.

    Resources: Datasets, resources, and an evaluation toolkit for CriticBench will be publicly released.

    In summary, CriticBench aims to provide a comprehensive framework for assessing LLMs' critique and self-improvement capabilities, contributing to the advancement of large-scale language models in various applications.

  4. NeedleBench

    • huggingface.co
    Updated Aug 2, 2024
    Cite
    OpenCompass (2024). NeedleBench [Dataset]. https://huggingface.co/datasets/opencompass/NeedleBench
    Dataset updated
    Aug 2, 2024
    Dataset authored and provided by
    OpenCompass
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

      Dataset Summary
    

    The NeedleBench dataset is a part of the OpenCompass project, designed to evaluate the capabilities of large language models (LLMs) in processing and understanding long documents. It includes a series of test scenarios that assess models' abilities in long text information extraction and reasoning. The dataset is structured to support tasks such as single-needle retrieval, multi-needle retrieval, multi-needle reasoning, and ancestral… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/NeedleBench.

  5. MMBench

    • huggingface.co
    Updated Oct 17, 2023
    + more versions
    Cite
    OpenCompass (2023). MMBench [Dataset]. https://huggingface.co/datasets/opencompass/MMBench
    Dataset updated
    Oct 17, 2023
    Dataset authored and provided by
    OpenCompass
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    opencompass/MMBench dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. mmmlu_lite

    • huggingface.co
    Updated Nov 1, 2024
    Cite
    OpenCompass (2024). mmmlu_lite [Dataset]. https://huggingface.co/datasets/opencompass/mmmlu_lite
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 1, 2024
    Dataset authored and provided by
    OpenCompass
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MMMLU-Lite

      Introduction
    

    A lite version of the MMMLU dataset, which is a community version of the MMMLU dataset by OpenCompass. Due to the large size of the original dataset (about 200k questions), we have created a lite version of the dataset to make it easier to use. We sample 25 examples from each language subject in the original dataset with a fixed seed to ensure reproducibility. The lite version of the dataset has 19950 examples, which is about 10% of… See the full description on the dataset page: https://huggingface.co/datasets/opencompass/mmmlu_lite.
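
    The sampling idea in the mmmlu_lite description (a fixed number of examples per language subject, drawn with a fixed seed) can be sketched as follows. This is only an illustration under assumptions, not the script OpenCompass used: the function name sample_per_group, the field names language and subject, the toy language codes and subjects, and the seed value 42 are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_group(rows, key, k, seed=0):
    """Return up to k rows from each group, drawn with a fixed seed.

    Hypothetical helper for illustration; not part of OpenCompass.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    rng = random.Random(seed)  # private RNG, so global random state is untouched
    sampled = []
    # Visit groups in sorted order so the draws do not depend on input order.
    for name in sorted(groups):
        members = groups[name]
        sampled.extend(rng.sample(members, min(k, len(members))))
    return sampled


# Toy stand-in for the full dataset: 2 languages x 3 subjects x 40 questions.
rows = [
    {"language": lang, "subject": subj, "question_id": i}
    for lang in ("lang_a", "lang_b")
    for subj in ("anatomy", "astronomy", "virology")
    for i in range(40)
]

lite = sample_per_group(
    rows, key=lambda r: (r["language"], r["subject"]), k=25, seed=42
)
print(len(lite))  # 2 languages * 3 subjects * 25 samples = 150
```

    Calling the helper again with the same seed returns the same subset, which is the reproducibility property the description refers to.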

  7. REST

    • huggingface.co
    Updated May 29, 2025
    Cite
    anonymous0523 (2025). REST [Dataset]. https://huggingface.co/datasets/anonymous0523/REST
    Dataset updated
    May 29, 2025
    Authors
    anonymous0523
    Description

    Dataset Card for Dataset Name

    This dataset provides the data for the REST benchmark. The data are identical to the original data of the corresponding benchmarks. REST combines multiple questions into one prompt by modifying the corresponding data-loading method in OpenCompass.

      Data preparation
    

    REST constructs the multi-problem version when loading the datasets, as implemented in the StressDataset class, so data preparation is identical to the official OpenCompass practice.… See the full description on the dataset page: https://huggingface.co/datasets/anonymous0523/REST.

  8. Sailcompass_data

    • huggingface.co
    Updated Jun 13, 2024
    Cite
    Sea AI Lab (2024). Sailcompass_data [Dataset]. https://huggingface.co/datasets/sail/Sailcompass_data
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 13, 2024
    Dataset authored and provided by
    Sea AI Lab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages

    This repository provides the dataset for evaluating Southeast Asian (SEA) large language models.

    Project Website: sailorllm.github.io
    Codebase: https://github.com/sail-sg/sailcompass

      Acknowledgment
    

    Thanks to the contributors of OpenCompass.

      Citing this work
    

    If you use this repository or sailor models, please cite @misc{sailcompass, title={SailCompass: Towards Reproducible… See the full description on the dataset page: https://huggingface.co/datasets/sail/Sailcompass_data.

  9. MMPR

    • huggingface.co
    Updated Nov 14, 2024
    Cite
    OpenGVLab (2024). MMPR [Dataset]. https://huggingface.co/datasets/OpenGVLab/MMPR
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 14, 2024
    Dataset authored and provided by
    OpenGVLab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MMPR

    [📂 GitHub] [🆕 Blog] [📜 Paper] [📖 Documents]

    2025/04/11: We release a new version of MMPR (i.e., MMPR-v1.2), which greatly enhances the overall performance of InternVL3.
    2024/12/20: We release a new version of MMPR (i.e., MMPR-v1.1). Based on this dataset, InternVL2.5 outperforms its counterparts without MPO by an average of 2 points across all scales on the OpenCompass leaderboard.

      Introduction
    

    MMPR is a large-scale and high-quality multimodal reasoning… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/MMPR.

  10. CMB

    • huggingface.co
    • opendatalab.com
    Updated Aug 15, 2017
    + more versions
    Cite
    FreedomAI (2017). CMB [Dataset]. https://huggingface.co/datasets/FreedomIntelligence/CMB
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 15, 2017
    Dataset authored and provided by
    FreedomAI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    CMB: A Comprehensive Medical Benchmark in Chinese

    Github • Website • 🤗 HuggingFace

      🌈 Update

    [2024.02.21] The answers to the CMB-Exam test have been updated, and some errors caused by omissions in version management have been fixed.
    [2024.01.08] To facilitate testing, we disclose the answers to the CMB-Exam test.
    [2023.09.22] CMB is included in OpenCompass.
    [2023.08.21] Paper released.
    [2023.08.01] 🎉🎉🎉 CMB is published! 🎉🎉🎉

    … See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/CMB.
    

AIME2025

opencompass/AIME2025

28 scholarly articles cite this dataset.
