MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MERA (Multimodal Evaluation for Russian-language Architectures)
Summary
MERA (Multimodal Evaluation for Russian-language Architectures) is a new open, independent benchmark for evaluating SOTA models for the Russian language. The MERA benchmark unites industry and academic partners to research the capabilities of foundation models, draw attention to AI-related issues, and foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/MERA.
bigcode/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for BeaverTails-Evaluation
BeaverTails is an AI safety-focused collection comprising a series of datasets. This repository contains test prompts specifically designed for evaluating language model safety. It is important to note that although each prompt can be connected to multiple categories, only one category is labeled for each prompt. The 14 harm categories are defined as follows:
Animal Abuse: This involves any form of cruelty or harm inflicted on animals… See the full description on the dataset page: https://huggingface.co/datasets/PKU-Alignment/BeaverTails-Evaluation.
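To make the one-label-per-prompt structure concrete, here is a minimal loading sketch that tallies prompts per harm category. The split and column names ("test", "prompt", "category") are assumptions drawn from the card, not verified against the schema:

```python
from collections import Counter

from datasets import load_dataset

# Split and column names are assumptions; inspect ds.column_names first.
ds = load_dataset("PKU-Alignment/BeaverTails-Evaluation", split="test")

# Each prompt carries exactly one labeled category, even if it could
# plausibly fall under several of the 14 harm categories.
counts = Counter(example["category"] for example in ds)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```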
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model-Written Evaluation Datasets
This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to:
- Those who are interested in understanding the quality and properties of model-generated data
- Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/model-written-evals.
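Because the repository hosts raw .jsonl files grouped by behavior (persona, sycophancy, advanced AI risks), one way to load a single eval is to point load_dataset's generic JSON loader at a file URL. The path below ("persona/agreeableness.jsonl") is an assumption about the layout and may differ:

```python
from datasets import load_dataset

# Hypothetical file path; browse the repo to find the actual .jsonl files.
url = (
    "https://huggingface.co/datasets/Anthropic/model-written-evals"
    "/resolve/main/persona/agreeableness.jsonl"
)
ds = load_dataset("json", data_files=url, split="train")
print(ds[0])  # typically a question plus matching / not-matching answers
```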
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ygl1020/Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Overview
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, in which a 600M-parameter fine-tuned BERT-like evaluator achieves results comparable to human annotators and GPT-4.
Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
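A minimal loading sketch, assuming a single default split; the column layout is whatever the card defines (the paper's taxonomy suggests fields like the question text and its risk area), so inspect it rather than relying on this listing:

```python
from datasets import load_dataset

# Split name is an assumption; check the dataset viewer if this fails.
ds = load_dataset("LibrAI/do-not-answer", split="train")
print(len(ds), "prompts that responsible models should refuse")
print(ds.column_names)
print(ds[0])
```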
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
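Since every GLUE task ships as a configuration of a single Hugging Face dataset, loading one is a two-argument call. A minimal sketch using the SST-2 task (any other config name, e.g. "mrpc" or "qnli", works the same way):

```python
from datasets import load_dataset
import evaluate

# Each GLUE task is a named configuration of the "glue" dataset.
sst2 = load_dataset("glue", "sst2")
print(sst2)              # train / validation / test splits
print(sst2["train"][0])  # {"sentence": ..., "label": ..., "idx": ...}

# The evaluate package mirrors the task names for metrics.
metric = evaluate.load("glue", "sst2")
print(metric.compute(predictions=[0, 1], references=[0, 1]))  # {'accuracy': 1.0}
```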
Dataset Card for AutoTrain Evaluator
This repository contains model predictions generated by AutoTrain for the following task and dataset:
Task: Summarization
Model: kworts/BARTxiv
Dataset: cnn_dailymail
Config: 3.0.0
Split: test
To run new evaluation jobs, visit Hugging Face's automatic model evaluator.
Contributions
Thanks to @Raj P Sini for evaluating this model.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
evaluate metrics
This dataset contains metrics about the huggingface/evaluate package.
Number of repositories in the dataset: 106
Number of packages in the dataset: 3
Package dependents
This contains the data available in the "Used by" tab on GitHub.
Package & Repository star count
This section shows the star counts for the package and for its repository, individually.
[Charts: Package star count; Repository star count]
There is 1 package with more than 1000 stars. There are 2 repositories… See the full description on the dataset page: https://huggingface.co/datasets/open-source-metrics/evaluate-dependents.
Dataset Card for H4 Code Evaluation Prompts
These are a filtered set of prompts for evaluating code instruction models, covering a variety of languages and task types. Currently, we used ChatGPT (GPT-3.5-turbo) to generate them, so we encourage using them only for qualitative evaluation and not for training your models. The generation of this data is similar to that of CodeAlpaca, which you can download here, but we intend to make these tasks both a) more challenging… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts.
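For the qualitative, not-for-training use the card recommends, a quick way to eyeball the prompts is to sample a few rows. The split and column names below are assumptions:

```python
from datasets import load_dataset

# Split name is an assumption; inspect the repo if "train" is not present.
ds = load_dataset("HuggingFaceH4/code_evaluation_prompts", split="train")
print(ds.column_names)
for example in ds.select(range(3)):
    print(example)
```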
Dataset Card for Evaluation run of huggingface/nebius/google/gemma-3-27b-it
Dataset automatically created during the evaluation run of model huggingface/nebius/google/gemma-3-27b-it. The dataset is composed of 1 configuration, corresponding to the evaluated task. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the… See the full description on the dataset page: https://huggingface.co/datasets/mfuntowicz/details_huggingface_nebius_google_gemma-3-27b-it_private.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure they would not appear in the training sets of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
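HumanEval's schema is documented: each problem provides "task_id", "prompt", "canonical_solution", "test", and "entry_point". A common pattern is to score completions with the code_eval (pass@k) metric from the evaluate package; the sketch below substitutes the canonical solution for a model completion, so it should report pass@1 = 1.0:

```python
import os

import evaluate
from datasets import load_dataset

# code_eval executes untrusted generated code, so it must be opted into.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

problems = load_dataset("openai/openai_humaneval", split="test")
problem = problems[0]

code_eval = evaluate.load("code_eval")
pass_at_k, _ = code_eval.compute(
    # The test field defines check(); call it on the problem's entry point.
    references=[problem["test"] + f"\ncheck({problem['entry_point']})"],
    # One list of candidate completions per problem; here, the reference body.
    predictions=[[problem["prompt"] + problem["canonical_solution"]]],
    k=[1],
)
print(pass_at_k)  # {'pass@1': 1.0}
```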
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ChatGPT Gemini Claude Perplexity Human Evaluation Multi Aspect Review Dataset
Introduction
Human evaluations and reviews with scalar scores of AI service responses are very useful for LLM fine-tuning, human preference alignment, few-shot learning, bad-case troubleshooting, etc., but extremely difficult to collect. This dataset is collected from the DeepNLP AI Service User Review panel (http://www.deepnlp.org/store), an open review website where users give reviews and upload… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset.
evaluate/media dataset hosted on Hugging Face and contributed by the HF Datasets community
ExponentialScience/is-dlt-models-evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Llama 3.1 Community License: https://choosealicense.com/licenses/llama3.1/
Dataset Card for Llama-3.1-405B Evaluation Result Details
This dataset contains the Meta evaluation result details for Llama-3.1-405B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-405B-evals.
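Since the subsets are named with the task name plus an upload timestamp, the config names are easiest to discover programmatically. A minimal sketch (the repository is gated, so an authenticated Hugging Face token may be required):

```python
from datasets import get_dataset_config_names

# One config per evaluation task; names embed the upload timestamp.
configs = get_dataset_config_names("meta-llama/Llama-3.1-405B-evals")
for name in configs:
    print(name)
```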
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Neural-Code-Search-Evaluation-Dataset presents an evaluation dataset consisting of natural language query and code snippet pairs and a search corpus consisting of code snippets collected from the most popular Android repositories on GitHub.
Llama 3.1 Community License: https://choosealicense.com/licenses/llama3.1/
Dataset Card for Llama-3.1-8B Evaluation Result Details
This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.
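Once a config name is known, one task's details load like any other subset. The config string and the "latest" split below are assumptions about the naming scheme described above, not verified values:

```python
from datasets import load_dataset

# Hypothetical config name; list the real ones with get_dataset_config_names.
config = "Llama-3.1-8B-evals__mmlu__details"
details = load_dataset("meta-llama/Llama-3.1-8B-evals", name=config, split="latest")
print(details[0])
```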
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Text To Image Test Prompt Library
A comprehensive collection of evaluation prompts for testing text-to-image AI models across diverse parameters and use cases.
Overview
This repository contains a structured set of test prompts designed to evaluate the capabilities of text-to-image generation models. Rather than focusing on formal evaluation metrics, these prompts are intended for end users who want to test how well a model might perform for their specific use cases.… See the full description on the dataset page: https://huggingface.co/datasets/danielrosehill/Text-To-Image-Test-Prompts.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended for hyperparameter tuning, and the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
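A minimal loading sketch, assuming each of the 52 subjects is exposed as its own configuration; the subject name below is an assumption, and per the card only the dev split carries explanations for few-shot prompting:

```python
from datasets import load_dataset

# Hypothetical subject config; list real ones with get_dataset_config_names.
ceval = load_dataset("ceval/ceval-exam", name="computer_network")
print(ceval)            # dev (five exemplars), val, test splits per subject
print(ceval["dev"][0])  # question, options, answer, explanation (assumed fields)
```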