100+ datasets found
  1. Data from: MERA

    • huggingface.co
    Updated Mar 17, 2025
    Cite
    MERA (2025). MERA [Dataset]. https://huggingface.co/datasets/MERA-evaluation/MERA
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2025
    Authors
    MERA
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    MERA (Multimodal Evaluation for Russian-language Architectures)

      Summary
    

    MERA (Multimodal Evaluation for Russian-language Architectures) is a new open independent benchmark for the evaluation of SOTA models for the Russian language. The MERA benchmark unites industry and academic partners in one place to research the capabilities of fundamental models, draw attention to AI-related issues, foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/MERA.

  2. evaluation

    • huggingface.co
    Updated May 4, 2023
    + more versions
    Cite
    BigCode (2023). evaluation [Dataset]. https://huggingface.co/datasets/bigcode/evaluation
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 4, 2023
    Dataset authored and provided by
    BigCode
    Description

    bigcode/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. BeaverTails-Evaluation

    • huggingface.co
    Updated May 27, 2025
    Cite
    PKU-Alignment (2025). BeaverTails-Evaluation [Dataset]. https://huggingface.co/datasets/PKU-Alignment/BeaverTails-Evaluation
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    PKU-Alignment
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Dataset Card for BeaverTails-Evaluation

    BeaverTails is an AI safety-focused collection comprising a series of datasets. This repository contains test prompts specifically designed for evaluating language model safety. It is important to note that although each prompt can be connected to multiple categories, only one category is labeled for each prompt. The 14 harm categories are defined as follows:

    Animal Abuse: This involves any form of cruelty or harm inflicted on animals… See the full description on the dataset page: https://huggingface.co/datasets/PKU-Alignment/BeaverTails-Evaluation.
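
    To inspect the label distribution, the prompts can be pulled with the Hugging Face datasets library. A minimal sketch, assuming a "test" split with "prompt" and "category" fields (the schema is not confirmed by this listing):

    # Sketch: tally BeaverTails-Evaluation prompts per harm category.
    # The "test" split and "prompt"/"category" field names are assumptions;
    # check the dataset page for the actual schema.
    from collections import Counter
    from datasets import load_dataset

    ds = load_dataset("PKU-Alignment/BeaverTails-Evaluation", split="test")
    counts = Counter(example["category"] for example in ds)
    for category, n in counts.most_common():
        print(f"{category}: {n} prompts")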

  4. model-written-evals

    • huggingface.co
    Updated Jan 12, 2023
    Cite
    Anthropic (2023). model-written-evals [Dataset]. https://huggingface.co/datasets/Anthropic/model-written-evals
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2023
    Dataset authored and provided by
    Anthropic (https://anthropic.com/)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Model-Written Evaluation Datasets

    This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to:

    Those who are interested in understanding the quality and properties of model-generated data
    Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/model-written-evals.
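
    As a hedged loading sketch: the repository's files can be read directly with a data_files glob. The sycophancy/ path and the JSONL field names below follow the paper's released layout, but are assumptions not confirmed by this listing:

    # Sketch: load one model-written eval (sycophancy) straight from the repo.
    # The data_files glob and the "question"/"answer_matching_behavior" fields
    # are assumed from the released layout; verify on the dataset page.
    from datasets import load_dataset

    ds = load_dataset(
        "Anthropic/model-written-evals",
        data_files="sycophancy/*.jsonl",
        split="train",
    )
    print(ds[0]["question"])
    print("matching:", ds[0]["answer_matching_behavior"])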

  5. Evaluation

    • huggingface.co
    Updated Sep 29, 2024
    Cite
    Lunlin Yang (2024). Evaluation [Dataset]. https://huggingface.co/datasets/ygl1020/Evaluation
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 29, 2024
    Authors
    Lunlin Yang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    ygl1020/Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. do-not-answer

    • huggingface.co
    Updated Sep 13, 2023
    + more versions
    Cite
    LibrAI (2023). do-not-answer [Dataset]. https://huggingface.co/datasets/LibrAI/do-not-answer
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    LibrAI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

      Overview
    

    Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a 600M-parameter fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 evaluation.

      Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
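
    A minimal loading sketch, assuming a "train" split with "risk_area" and "question" columns (column names are not confirmed by this listing):

    # Sketch: load the Do-Not-Answer prompts and count them per risk area.
    # The "train" split and "risk_area"/"question" columns are assumptions;
    # confirm against the dataset page before use.
    from collections import Counter
    from datasets import load_dataset

    ds = load_dataset("LibrAI/do-not-answer", split="train")
    print(Counter(ds["risk_area"]))
    print(ds[0]["question"])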
    
  7. glue-ci

    • huggingface.co
    Updated Sep 25, 2022
    Cite
    evaluate (2022). glue-ci [Dataset]. https://huggingface.co/datasets/evaluate/glue-ci
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 25, 2022
    Dataset authored and provided by
    evaluate
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
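
    For context, GLUE tasks are conventionally scored with the companion evaluate package; a minimal sketch computing the MRPC metric (accuracy and F1) on toy predictions:

    # Sketch: score toy predictions with the GLUE/MRPC metric from the
    # evaluate package; compute() returns accuracy and F1 for this task.
    import evaluate

    metric = evaluate.load("glue", "mrpc")
    print(metric.compute(predictions=[0, 1, 1], references=[0, 1, 0]))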

  8. autoeval-eval-cnn_dailymail-3.0.0-52cdb7-47832145225

    • huggingface.co
    Updated May 15, 2015
    + more versions
    Cite
    Evaluation on the Hub (2015). autoeval-eval-cnn_dailymail-3.0.0-52cdb7-47832145225 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-cnn_dailymail-3.0.0-52cdb7-47832145225
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 15, 2015
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    Task: Summarization
    Model: kworts/BARTxiv
    Dataset: cnn_dailymail
    Config: 3.0.0
    Split: test

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @Raj P Sini for evaluating this model.

  9. evaluate-dependents

    • huggingface.co
    Updated Jul 19, 2023
    + more versions
    Cite
    Hugging Face OSS Metrics (2023). evaluate-dependents [Dataset]. https://huggingface.co/datasets/open-source-metrics/evaluate-dependents
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face OSS Metrics
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    evaluate metrics

    This dataset contains metrics about the huggingface/evaluate package.
    Number of repositories in the dataset: 106
    Number of packages in the dataset: 3

      Package dependents
    

    This contains the data available in the used-by tab on GitHub.

      Package & Repository star count
    

    This section shows the package and repository star count, individually.


    There is 1 package that has more than 1000 stars. There are 2 repositories… See the full description on the dataset page: https://huggingface.co/datasets/open-source-metrics/evaluate-dependents.

  10. code_evaluation_prompts

    • huggingface.co
    Updated May 3, 2023
    Cite
    Hugging Face H4 (2023). code_evaluation_prompts [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    Description

    Dataset Card for H4 Code Evaluation Prompts

    These are a filtered set of prompts for evaluating code instruction models, covering a variety of languages and task types. Currently, we used ChatGPT (GPT-3.5-turbo) to generate these, so we encourage using them only for qualitative evaluation and not for training your models. The generation of this data is similar to something like CodeAlpaca, which you can download here, but we intend to make these tasks both a) more challenging… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts.

  11. details_huggingface_nebius_google_gemma-3-27b-it_private

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    Morgan Funtowicz (2025). details_huggingface_nebius_google_gemma-3-27b-it_private [Dataset]. https://huggingface.co/datasets/mfuntowicz/details_huggingface_nebius_google_gemma-3-27b-it_private
    Dataset updated
    Jun 17, 2025
    Authors
    Morgan Funtowicz
    Description

    Dataset Card for Evaluation run of huggingface/nebius/google/gemma-3-27b-it

    Dataset automatically created during the evaluation run of model huggingface/nebius/google/gemma-3-27b-it. The dataset is composed of 1 configuration, corresponding to the evaluated task, and has been created from 1 run(s). Each run can be found as a specific split in the configuration, the split being named with the timestamp of the run. The "train" split always points to the… See the full description on the dataset page: https://huggingface.co/datasets/mfuntowicz/details_huggingface_nebius_google_gemma-3-27b-it_private.

  12. openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAI (https://openai.com/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they were not included in the training sets of code generation models.

      Supported Tasks and Leaderboards

      Languages

    The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
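
    A minimal loading sketch; the field names below (prompt, canonical_solution, test, entry_point) follow the published schema, so treat them as assumptions if the card has changed:

    # Sketch: inspect one HumanEval problem. The dataset ships a single
    # "test" split; each row holds the prompt (signature plus docstring),
    # a canonical solution, unit tests, and the entry-point function name.
    from datasets import load_dataset

    problems = load_dataset("openai/openai_humaneval", split="test")
    p = problems[0]
    print(p["task_id"], p["entry_point"])
    print(p["prompt"])              # function signature and docstring
    print(p["canonical_solution"])  # reference body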

  13. ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset

    • huggingface.co
    Updated Nov 12, 2024
    Cite
    DeepNLP (2024). ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset [Dataset]. https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2024
    Authors
    DeepNLP
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    ChatGPT Gemini Claude Perplexity Human Evaluation Multi Aspect Review Dataset

      Introduction
    

    Human evaluations and reviews with scalar scores of AI service responses are very useful for LLM fine-tuning, human preference alignment, few-shot learning, bad case shooting, etc., but extremely difficult to collect. This dataset is collected from the DeepNLP AI Service User Review panel (http://www.deepnlp.org/store), which is an open review website for users to give reviews and upload… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset.

  14. media

    • huggingface.co
    Updated Jun 27, 2022
    + more versions
    Cite
    media [Dataset]. https://huggingface.co/datasets/evaluate/media
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 27, 2022
    Dataset authored and provided by
    evaluate
    Description

    evaluate/media dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. is-dlt-models-evaluation

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Exponential Science (2025). is-dlt-models-evaluation [Dataset]. https://huggingface.co/datasets/ExponentialScience/is-dlt-models-evaluation
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    Exponential Science
    Description

    ExponentialScience/is-dlt-models-evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. Llama-3.1-405B-evals

    • huggingface.co
    Updated Jul 23, 2024
    + more versions
    Cite
    Meta Llama (2024). Llama-3.1-405B-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.1-405B-evals
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Meta (http://meta.com/)
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.1/

    Description

    Dataset Card for Llama-3.1-405B Evaluation Result Details

    This dataset contains the Meta evaluation result details for Llama-3.1-405B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-405B-evals.
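
    Since the listing truncates the exact subset naming (task name plus upload timestamp), a hedged sketch that discovers the config names at runtime instead of hard-coding one (access to this repo is gated, so a Hugging Face token may be required):

    # Sketch: list the available configs, then load the first one.
    # Config names are not given in this listing, so discover them
    # at runtime rather than guessing the task/timestamp string.
    from datasets import get_dataset_config_names, load_dataset

    configs = get_dataset_config_names("meta-llama/Llama-3.1-405B-evals")
    print(configs[:5])
    ds = load_dataset("meta-llama/Llama-3.1-405B-evals", name=configs[0])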

  17. neural_code_search

    • huggingface.co
    Updated Sep 3, 2023
    Cite
    AI at Meta (2023). neural_code_search [Dataset]. https://huggingface.co/datasets/facebook/neural_code_search
    Dataset updated
    Sep 3, 2023
    Dataset authored and provided by
    AI at Meta
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Neural-Code-Search-Evaluation-Dataset presents an evaluation dataset consisting of natural language query and code snippet pairs and a search corpus consisting of code snippets collected from the most popular Android repositories on GitHub.

  18. Llama-3.1-8B-evals

    • huggingface.co
    Updated Jul 23, 2024
    Cite
    Meta Llama (2024). Llama-3.1-8B-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Meta (http://meta.com/)
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.1/

    Description

    Dataset Card for Llama-3.1-8B Evaluation Result Details

    This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.

  19. Text-To-Image-Test-Prompts

    • huggingface.co
    Updated Jul 10, 2025
    Cite
    Daniel Rosehill (2025). Text-To-Image-Test-Prompts [Dataset]. https://huggingface.co/datasets/danielrosehill/Text-To-Image-Test-Prompts
    Dataset updated
    Jul 10, 2025
    Authors
    Daniel Rosehill
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Text To Image Test Prompt Library

    A comprehensive collection of evaluation prompts for testing text-to-image AI models across diverse parameters and use cases.

      Overview
    

    This repository contains a structured set of test prompts designed to evaluate the capabilities of text-to-image generation models. Rather than focusing on formal evaluation metrics, these prompts are intended for end users who want to test how well a model might perform for their specific use cases.… See the full description on the dataset page: https://huggingface.co/datasets/danielrosehill/Text-To-Image-Test-Prompts.

  20. ceval-exam

    • huggingface.co
    • opendatalab.com
    Updated Jan 8, 2022
    Cite
    ceval (2022). ceval-exam [Dataset]. https://huggingface.co/datasets/ceval/ceval-exam
    Dataset updated
    Jan 8, 2022
    Dataset authored and provided by
    ceval
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended for hyperparameter tuning, and the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
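
    A minimal sketch of the per-subject layout described above; "computer_network" is one example subject name (the full list of 52 disciplines is on the dataset page):

    # Sketch: load one C-Eval subject. Each subject config exposes the
    # dev (few-shot exemplars with explanations), val (tuning), and
    # test splits described above.
    from datasets import load_dataset

    subject = load_dataset("ceval/ceval-exam", name="computer_network")
    print(subject)            # DatasetDict with dev/val/test
    print(subject["dev"][0])  # exemplar for few-shot prompting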
