94 datasets found
  1. MMLU-Pro

    • huggingface.co
    • paperswithcode.com
    Updated May 8, 2024
    + more versions
    Cite
    TIGER-Lab (2024). MMLU-Pro [Dataset]. http://doi.org/10.57967/hf/2439
    Dataset updated
    May 8, 2024
    Dataset authored and provided by
    TIGER-Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    MMLU-Pro Dataset

    MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines.

      🚀 What's New
    

    [2025.04.06] We corrected 15 answers in medical domain based on the recommendations of medical professionals, thanks to Dr. Robert (Bob) Hoyt and the subspecialists… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.

  2. MMLU-Pro

    • huggingface.co
    Updated Oct 18, 2024
    Cite
    SB Intuitions (2024). MMLU-Pro [Dataset]. https://huggingface.co/datasets/sbintuitions/MMLU-Pro
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    Authors
    SB Intuitions
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    A public clone for ensuring the reproducibility of evaluation scores and for releasing the SB Intuitions revised version. Source: TIGER-Lab/MMLU-Pro on Hugging Face

      MMLU-Pro
    

    MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines.

      Licensing Information
    

    MIT

      Citation Information
    

    @misc{wang2024mmlupro, title={MMLU-Pro: A More Robust and Challenging Multi-Task… See the full description on the dataset page: https://huggingface.co/datasets/sbintuitions/MMLU-Pro.

  3. MMLU-Pro

    • huggingface.co
    Updated Jan 20, 2025
    + more versions
    Cite
    Jaeyong Park (2025). MMLU-Pro [Dataset]. https://huggingface.co/datasets/jaypyon/MMLU-Pro
    Dataset updated
    Jan 20, 2025
    Authors
    Jaeyong Park
    Description

    jaypyon/MMLU-Pro dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. Mmlu Pro by Model

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Mmlu Pro by Model [Dataset]. https://artificialanalysis.ai/evaluations/mmlu-pro
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of MMLU-Pro scores by model, from evaluations independently conducted by Artificial Analysis

  5. MMLU-Pro-ita

    • huggingface.co
    Updated May 16, 2024
    Cite
    Edoardo Federici (2024). MMLU-Pro-ita [Dataset]. https://huggingface.co/datasets/efederici/MMLU-Pro-ita
    Dataset updated
    May 16, 2024
    Authors
    Edoardo Federici
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    MMLU-Pro-ita Dataset Introduction

    This is an Italian translation of MMLU-Pro, a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines.

      1. What's new about MMLU-Pro
    

    Compared to the original MMLU, there are three major differences:

    The original MMLU dataset only contains 4 options, MMLU-Pro increases it to 10… See the full description on the dataset page: https://huggingface.co/datasets/efederici/MMLU-Pro-ita.
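
One consequence of moving from 4 to 10 options is a lower random-guess baseline. The snippet below is a minimal sketch of that arithmetic, not part of the dataset or its tooling. The function name `chance_accuracy` is hypothetical, and it assumes one correct answer per question and that every question offers the full number of options, which may not hold for every MMLU-Pro item.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a single-answer question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Assumption: 4 options per question for MMLU, 10 for MMLU-Pro.
mmlu_baseline = chance_accuracy(4)
mmlu_pro_baseline = chance_accuracy(10)

print(f"MMLU random-guess baseline:     {mmlu_baseline:.0%}")
print(f"MMLU-Pro random-guess baseline: {mmlu_pro_baseline:.0%}")
```

Under these assumptions a score near 25% on MMLU and a score near 10% on MMLU-Pro both correspond to guessing, so raw accuracies on the two benchmarks are not directly comparable.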

  6. Leading AI model performance on MMLU-Pro 2025

    • statista.com
    Updated Jun 10, 2025
    Cite
    Statista (2025). Leading AI model performance on MMLU-Pro 2025 [Dataset]. https://www.statista.com/statistics/1611886/mmlu-pro-accuracy/
    Dataset updated
    Jun 10, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2024
    Area covered
    Worldwide
    Description

    Artificial intelligence models continue to push the boundaries of language understanding and generation, with DeepSeek-R1 leading the pack in 2025 with an impressive ** percent accuracy rate on the AI MMLU benchmark. This achievement highlights the rapid progress in AI capabilities, as all major programs now demonstrate success ratios exceeding ** percent, indicating a significant leap in machine comprehension across various domains.

    Multilingual capabilities

    The AI landscape is not just about general language understanding. In 2024, the Artificial Analysis Multilingual Index ranked AI models based on their ability to handle multiple languages, with o1 leading at ** percent. Testing includes Spanish, Bengali, German, Japanese, English, Chinese, Swahili and French.

    Challenging exams

    This multilingual proficiency is further tested by Humanity's Last Exam (HLE), an exceptionally tough evaluation consisting of ***** challenging questions across numerous subjects. On this rigorous test, o1 again emerged as the top performer with an *** percent score, followed by Gemini *** Flash at *** percent, showcasing the current limits of AI in tackling highly complex, multidisciplinary problems.

  7. MMLU-Pro Benchmark Questions

    • figshare.com
    xlsx
    Updated Apr 8, 2025
    Cite
    Robert Hoyt; Dacre Knight; Maria Bajwa; Maruf Haider (2025). MMLU-Pro Benchmark Questions [Dataset]. http://doi.org/10.6084/m9.figshare.28751756.v1
    Available download formats: xlsx
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    figshare
    Authors
    Robert Hoyt; Dacre Knight; Maria Bajwa; Maruf Haider
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We investigated DeepSeek R1's ability to diagnose 162 medical scenarios that are part of the MMLU-Pro question-and-answer dataset.

  8. MMLU-Pro-Results

    • huggingface.co
    Updated Sep 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    leafspark (2024). MMLU-Pro-Results [Dataset]. https://huggingface.co/datasets/leafspark/MMLU-Pro-Results
    Dataset updated
    Sep 21, 2024
    Authors
    leafspark
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    leafspark/MMLU-Pro-Results dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. mmlu-pro-nomath-sml

    • huggingface.co
    Updated Jul 9, 2024
    + more versions
    Cite
    Sam Paech (2024). mmlu-pro-nomath-sml [Dataset]. https://huggingface.co/datasets/sam-paech/mmlu-pro-nomath-sml
    Dataset updated
    Jul 9, 2024
    Authors
    Sam Paech
    Description

    MMLU-Pro-NoMath

    MMLU-Pro-NoMath and MMLU-Pro-NoMath-Sml are subsets of MMLU-Pro with questions requiring multi-step calculation removed (43% of the original test set). We used claude-3.5-sonnet as the classifier. Questions were capped to an upper length limit to make logprobs evals faster and less likely to OOM. It's fast! 20 mins for NoMath and 7 mins for NoMath-Sml to evaluate gemma-2-9b using Eleuther harness.

      Contents
    

    Why do this? NoMath Subset Details What… See the full description on the dataset page: https://huggingface.co/datasets/sam-paech/mmlu-pro-nomath-sml.

  10. MMLU-PRO

    • huggingface.co
    Updated Jun 1, 2025
    + more versions
    Cite
    Bengali.AI (2025). MMLU-PRO [Dataset]. https://huggingface.co/datasets/bengaliAI/MMLU-PRO
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    Bengali.AI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    bengaliAI/MMLU-PRO dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. MMLU Pro Large Model Evaluation Benchmark Leaderboard (MMLU Pro 大模型评测基准排行榜)

    • datalearner.com
    Updated Feb 2, 2025
    Cite
    数据学习 (DataLearner) (2025). MMLU Pro 大模型评测基准排行榜 [Dataset]. https://www.datalearner.com/ai-models/llm-benchmark-tests/16
    Dataset updated
    Feb 2, 2025
    Dataset authored and provided by
    数据学习 (DataLearner)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

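    An up-to-date performance leaderboard of large language models (LLMs) based on the MMLU Pro benchmark, including each model's score, publishing organization, release date, and other data.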

  12. MML Dataset

    • paperswithcode.com
    Updated Jan 10, 2025
    + more versions
    Cite
    Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt (2025). MML Dataset [Dataset]. https://paperswithcode.com/dataset/mmlu
    Dataset updated
    Jan 10, 2025
    Authors
    Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt
    Description

    MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.

  13. MMLU-Pro-sample

    • huggingface.co
    Updated Sep 13, 2024
    Cite
    du (2024). MMLU-Pro-sample [Dataset]. https://huggingface.co/datasets/dododododo/MMLU-Pro-sample
    Dataset updated
    Sep 13, 2024
    Authors
    du
    Description

    dododododo/MMLU-Pro-sample dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. MMLU-Pro-json

    • huggingface.co
    Updated Jun 13, 2025
    Cite
    Pedro Cuenca (2025). MMLU-Pro-json [Dataset]. https://huggingface.co/datasets/pcuenq/MMLU-Pro-json
    Dataset updated
    Jun 13, 2025
    Authors
    Pedro Cuenca
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    MMLU-Pro json

    This is a reupload of MMLU-Pro in json format. Please, refer to the original dataset for details.

  15. Intelligence Index by Gemini Endpoint

    • artificialanalysis.ai
    Updated Jun 15, 2025
    + more versions
    Cite
    Artificial Analysis (2025). Intelligence Index by Gemini Endpoint [Dataset]. https://artificialanalysis.ai/models/gemini-2-5-pro
    Dataset updated
    Jun 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Artificial Analysis Intelligence Index by model. The index incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.

  16. mmlu-pro

    • huggingface.co
    Updated Jul 9, 2025
    Cite
    InterestingWorks (2025). mmlu-pro [Dataset]. https://huggingface.co/datasets/guanning-ai/mmlu-pro
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    InterestingWorks
    Description

    guanning-ai/mmlu-pro dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. MMLU-Pro-education-level

    • huggingface.co
    Updated May 29, 2025
    Cite
    LARSS (2025). MMLU-Pro-education-level [Dataset]. https://huggingface.co/datasets/LabARSS/MMLU-Pro-education-level
    Dataset updated
    May 29, 2025
    Dataset authored and provided by
    LARSS
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for MMLU Pro with education levels

    MMLU Pro dataset with education levels

      Dataset Details

      Dataset Description

    A popular human-like complexity metric is an education level that is appropriate for a question. To get it for MMLU Pro dataset, we ask a large LLM (Mistral 123B) to act as a judge and return its estimate. Next, we query the large LLM again to estimate the quality of the previous assessment from 1 to 10 following the practice introduced… See the full description on the dataset page: https://huggingface.co/datasets/LabARSS/MMLU-Pro-education-level.

  18. Intelligence Index by o1-preview Endpoint

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Intelligence Index by o1-preview Endpoint [Dataset]. https://artificialanalysis.ai/models/o1-preview
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Artificial Analysis Intelligence Index by model. The index incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.

  19. mmlu-pro-prep-eval-Llama-3.1-8B-Instruct-thinking

    • huggingface.co
    Updated Oct 16, 2024
    + more versions
    Cite
    Daniel Vila (2024). mmlu-pro-prep-eval-Llama-3.1-8B-Instruct-thinking [Dataset]. https://huggingface.co/datasets/dvilasuero/mmlu-pro-prep-eval-Llama-3.1-8B-Instruct-thinking
    Dataset updated
    Oct 16, 2024
    Authors
    Daniel Vila
    Description

    dvilasuero/mmlu-pro-prep-eval-Llama-3.1-8B-Instruct-thinking dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. MMLU-Pro-reasoning-score

    • huggingface.co
    Updated May 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    LARSS (2025). MMLU-Pro-reasoning-score [Dataset]. https://huggingface.co/datasets/LabARSS/MMLU-Pro-reasoning-score
    Explore at:
    Dataset updated
    May 29, 2025
    Dataset authored and provided by
    LARSS
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for MMLU Pro with reasoning scores

    MMLU Pro dataset with reasoning scores

      Dataset Details

      Dataset Description

    As discovered in "When an LLM is apprehensive about its answers -- and when its uncertainty is justified", the amount of reasoning required to answer a question (a.k.a. reasoning score) is a better metric for estimating model uncertainty than the more human-like level of education. Following the footsteps outlined in that paper, we ask a… See the full description on the dataset page: https://huggingface.co/datasets/LabARSS/MMLU-Pro-reasoning-score.
