MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MMLU-Pro Dataset
The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset, tailored to more rigorously benchmark large language models' capabilities. It contains 12K complex questions across various disciplines. | GitHub | 🏆 Leaderboard | 📖 Paper |
🚀 What's New
[2025.04.06] We corrected 15 answers in the medical domain based on the recommendations of medical professionals, thanks to Dr. Robert (Bob) Hoyt and the subspecialists… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.
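A minimal sketch of the rough shape of an MMLU-Pro record and how an answer letter maps onto the (up to 10) options. The field names follow the dataset card; treat them, and the example record, as illustrative assumptions rather than a spec.

```python
# Sketch of an MMLU-Pro-style record: a question, up to 10 options,
# and the correct answer as both a letter and a 0-based index.
# Field names are assumed from the dataset card, not guaranteed.
from string import ascii_uppercase

record = {
    "question": "Which gas makes up most of Earth's atmosphere?",
    "options": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon",
                "Helium", "Neon", "Methane", "Hydrogen", "Ozone", "Krypton"],
    "answer": "B",          # letter of the correct option
    "answer_index": 1,      # 0-based index into options
    "category": "chemistry",
}

def letter_to_index(letter: str) -> int:
    """Map an answer letter ('A'..'J') to a 0-based option index."""
    return ascii_uppercase.index(letter)

# The letter and index forms of the answer should agree.
assert letter_to_index(record["answer"]) == record["answer_index"]
print(letter_to_index("J"))  # the tenth option -> 9
```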
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A clone published to ensure reproducibility of evaluation scores and to host the SB Intuitions corrected version. Source: TIGER-Lab/MMLU-Pro on Hugging Face
MMLU-Pro
The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset, tailored to more rigorously benchmark large language models' capabilities. It contains 12K complex questions across various disciplines.
Licensing Information
MIT
Citation Information
@misc{wang2024mmlupro, title={MMLU-Pro: A More Robust and Challenging Multi-Task… See the full description on the dataset page: https://huggingface.co/datasets/sbintuitions/MMLU-Pro.
jaypyon/MMLU-Pro dataset hosted on Hugging Face and contributed by the HF Datasets community
Comparison of benchmark scores, independently conducted by Artificial Analysis, by model
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MMLU-Pro-ita Dataset Introduction
This is an Italian translation of MMLU-Pro, a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines.
1. What's new about MMLU-Pro
Compared to the original MMLU, there are three major differences:
The original MMLU dataset contains only 4 options per question; MMLU-Pro increases this to 10… See the full description on the dataset page: https://huggingface.co/datasets/efederici/MMLU-Pro-ita.
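One reason widening 4 options to 10 matters is the random-guess baseline: a model guessing blindly drops from 25% to 10% expected accuracy, so score inflation from lucky guessing shrinks accordingly. A quick illustration:

```python
# Random-guess accuracy on a multiple-choice benchmark is simply
# 1 / (number of options), assuming one correct option per question.
def random_guess_accuracy(num_options: int) -> float:
    return 1.0 / num_options

print(f"MMLU (4 options):      {random_guess_accuracy(4):.0%}")   # 25%
print(f"MMLU-Pro (10 options): {random_guess_accuracy(10):.0%}")  # 10%
```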
Artificial intelligence models continue to push the boundaries of language understanding and generation, with DeepSeek-R1 leading the pack in 2025 with an impressive ** percent accuracy rate on the AI MMLU benchmark. This achievement highlights the rapid progress in AI capabilities, as all major models now demonstrate success rates exceeding ** percent, indicating a significant leap in machine comprehension across various domains.

Multilingual capabilities
The AI landscape is not just about general language understanding. In 2024, the Artificial Analysis multilingual index ranked AI models by their ability to handle multiple languages, with o1 leading at ** percent. Testing covers Spanish, Bengali, German, Japanese, English, Chinese, Swahili, and French.

Challenging exams
This multilingual proficiency is further tested by Humanity's Last Exam (HLE), an exceptionally tough evaluation consisting of ***** challenging questions across numerous subjects. On this rigorous test, o1 again emerged as the top performer with an *** percent score, followed by Gemini *** Flash at *** percent, showcasing the current limits of AI in tackling highly complex, multidisciplinary problems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We investigated DeepSeek R1's ability to diagnose 162 medical scenarios from the MMLU-Pro question-and-answer dataset
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
leafspark/MMLU-Pro-Results dataset hosted on Hugging Face and contributed by the HF Datasets community
MMLU-Pro-NoMath
MMLU-Pro-NoMath and MMLU-Pro-NoMath-Sml are subsets of MMLU-Pro with questions requiring multi-step calculation removed (43% of the original test set). We used claude-3.5-sonnet as the classifier. Questions were capped to an upper length limit to make logprobs evals faster and less likely to OOM. It's fast: 20 mins for NoMath and 7 mins for NoMath-Sml to evaluate gemma-2-9b using the Eleuther harness.
Contents
Why do this? NoMath Subset Details What… See the full description on the dataset page: https://huggingface.co/datasets/sam-paech/mmlu-pro-nomath-sml.
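The NoMath-style filtering described above can be sketched as two simple predicates: drop questions the classifier flagged as requiring multi-step calculation, and cap question length. The `needs_math` labels and the character cap below are hypothetical stand-ins for the claude-3.5-sonnet classifications and the actual limit mentioned in the card.

```python
# Hedged sketch of subset construction: keep only questions that are
# (a) not flagged as multi-step calculation and (b) under a length cap.
MAX_CHARS = 2000  # illustrative cap, not the dataset's actual limit

questions = [
    {"question": "Compute the integral of x^2 ...", "needs_math": True},
    {"question": "Which organ produces insulin?", "needs_math": False},
    {"question": "Q" * 5000, "needs_math": False},  # over the cap
]

nomath = [q for q in questions
          if not q["needs_math"] and len(q["question"]) <= MAX_CHARS]
print(len(nomath))  # 1 question survives both filters
```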
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
bengaliAI/MMLU-PRO dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
An up-to-date leaderboard of large language model (LLM) performance on the MMLU-Pro benchmark, including each model's score, releasing organization, release date, and other data.
MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
dododododo/MMLU-Pro-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MMLU-Pro json
This is a reupload of MMLU-Pro in JSON format. Please refer to the original dataset for details.
Comparison of the Artificial Analysis Intelligence Index, which incorporates 7 evaluations (MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500), by model
guanning-ai/mmlu-pro dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU Pro with education levels
MMLU Pro dataset with education levels
Dataset Details
Dataset Description
A popular human-like complexity metric is the education level appropriate for a question. To obtain it for the MMLU Pro dataset, we ask a large LLM (Mistral 123B) to act as a judge and return its estimate. Next, we query the large LLM again to estimate the quality of the previous assessment from 1 to 10, following the practice introduced… See the full description on the dataset page: https://huggingface.co/datasets/LabARSS/MMLU-Pro-education-level.
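The two-pass judge flow described above can be sketched as follows. `ask_judge` stands in for a call to the large LLM (Mistral 123B in the card); here it is stubbed with canned replies for illustration, and the prompt wording is hypothetical, not the authors' actual prompts.

```python
# Hedged sketch: pass 1 asks the judge for an education level,
# pass 2 asks it to rate its own assessment on a 1-10 scale.
def ask_judge(prompt: str) -> str:
    # Stub replacing a real LLM call; returns canned answers.
    canned = {"level": "graduate", "quality": "8"}
    return canned["quality" if "1 to 10" in prompt else "level"]

question = "Which gas makes up most of Earth's atmosphere?"
level = ask_judge(f"What education level suits this question? {question}")
quality = int(ask_judge(
    f"Rate the previous assessment from 1 to 10. Level given: {level}"))
assert 1 <= quality <= 10
print(level, quality)  # graduate 8
```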
Comparison of the Artificial Analysis Intelligence Index, which incorporates 7 evaluations (MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500), by model
dvilasuero/mmlu-pro-prep-eval-Llama-3.1-8B-Instruct-thinking dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU Pro with reasoning scores
MMLU Pro dataset with reasoning scores
Dataset Details
Dataset Description
As discovered in "When an LLM is apprehensive about its answers -- and when its uncertainty is justified", the amount of reasoning required to answer a question (a.k.a. the reasoning score) is a better metric for estimating model uncertainty than the more human-like education level. Following the footsteps outlined in that paper, we ask a… See the full description on the dataset page: https://huggingface.co/datasets/LabARSS/MMLU-Pro-reasoning-score.