Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model-Written Evaluation Datasets
This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to:
- Those who are interested in understanding the quality and properties of model-generated data
- Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/model-written-evals.
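A hedged exploration sketch (not from the card): assuming the repository stores the evaluations as JSON/JSONL files grouped by behavior, which the excerpt does not state, the files can be listed with huggingface_hub and a single file loaded with the generic json builder.

```python
from huggingface_hub import hf_hub_url, list_repo_files
from datasets import load_dataset

repo_id = "Anthropic/model-written-evals"
# List the data files in the dataset repo and keep only JSON/JSONL files.
files = [f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith((".jsonl", ".json"))]
print(files[:5])  # see which evaluation files are available

# Load the first file found; the choice of file here is illustrative, not prescribed.
url = hf_hub_url(repo_id=repo_id, filename=files[0], repo_type="dataset")
evals = load_dataset("json", data_files=url, split="train")
print(evals[0])
```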
Other license: https://choosealicense.com/licenses/other/
Dataset Card for GLUE
Dataset Summary
GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
Supported Tasks and Leaderboards
The leaderboard for the GLUE benchmark can be found on the GLUE website (https://gluebenchmark.com/). The benchmark comprises the following tasks:
ax
A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
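A small loading sketch (not part of the card) for the "ax" diagnostic set named above, using 🤗 datasets; "ax" ships only a test split because it is evaluation-only.

```python
from datasets import load_dataset

# Load the GLUE diagnostic ("ax") configuration; it only has a "test" split.
ax = load_dataset("nyu-mll/glue", "ax", split="test")
print(ax)     # row count and column names
print(ax[0])  # a single premise/hypothesis pair
```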
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
Dataset Summary
We provide here the data accompanying the paper: Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations.
Dataset Structure
Data Instances
We release the set of queries, as well as the autorater & human evaluation judgements collected for our experiments.
Data overview
List of queries:
Data Structure
The list… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ContextEval.
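A hedged loading sketch for inspecting the released queries and judgements; configuration and split names are discovered at runtime rather than assumed, since the excerpt does not list them.

```python
from datasets import get_dataset_config_names, load_dataset

# Discover whatever configurations the repo exposes (names are not taken from the card).
configs = get_dataset_config_names("allenai/ContextEval")
print(configs)

ctx = load_dataset("allenai/ContextEval", configs[0])
print(ctx)            # available splits and their sizes
split = next(iter(ctx))
print(ctx[split][0])  # one query / judgement record
```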
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MERA (Multimodal Evaluation for Russian-language Architectures)
Summary
MERA (Multimodal Evaluation for Russian-language Architectures) is a new open independent benchmark for the evaluation of SOTA models for the Russian language. The MERA benchmark unites industry and academic partners in one place to research the capabilities of fundamental models, draw attention to AI-related issues, foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/MERA.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ygl1020/Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
bigcode/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for BeaverTails-Evaluation
BeaverTails is an AI safety-focused collection comprising a series of datasets. This repository contains test prompts specifically designed for evaluating language model safety. It is important to note that although each prompt can be connected to multiple categories, only one category is labeled for each prompt. The 14 harm categories are defined as follows:
Animal Abuse: This involves any form of cruelty or harm inflicted on animals… See the full description on the dataset page: https://huggingface.co/datasets/PKU-Alignment/BeaverTails-Evaluation.
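A hedged loading sketch: the column names are not listed in the excerpt, so the snippet prints the dataset features instead of assuming them.

```python
from datasets import load_dataset

beaver = load_dataset("PKU-Alignment/BeaverTails-Evaluation")
print(beaver)                  # available splits and sizes
split = next(iter(beaver))     # take whichever split the repo exposes
print(beaver[split].features)  # actual column names, incl. the harm-category label
print(beaver[split][0])        # one prompt with its single labeled category
```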
Unknown license: https://choosealicense.com/licenses/unknown/
SuperTweetEval
Dataset Card for "super_tweeteval"
Dataset Summary
This is the official repository for SuperTweetEval, a unified benchmark of 12 heterogeneous NLP tasks. More details on the tasks and an evaluation of language models can be found in the reference paper, published in EMNLP 2023 (Findings).
Data Splits
All tasks provide custom training, validation and test splits.
task | dataset | load dataset | description | number of instances
Topic… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/super_tweeteval.
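A hedged sketch of how the per-task splits can be pulled down; the assumption (not stated in the truncated table above) is that each task is exposed as its own configuration, so the configurations are discovered rather than hard-coded.

```python
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("cardiffnlp/super_tweeteval")
print(configs)  # expected: one configuration per task

task = load_dataset("cardiffnlp/super_tweeteval", configs[0])
print(task)     # the custom train/validation/test splits noted above
```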
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GREEN Dataset
We share the dataset used to train the LLM metric introduced in "GREEN: Generative Radiology Report Evaluation and Error Notation". GREEN is an evaluation metric for radiology reports that uses language models to identify and explain clinically significant errors, offering better alignment with expert preferences and more interpretable results than existing metrics. The method provides both quantitative scores and qualitative explanations, and has been validated… See the full description on the dataset page: https://huggingface.co/datasets/StanfordAIMI/GREEN.
ExponentialScience/is-dlt-models-evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they are not included in the training sets of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
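A short loading sketch (not part of the card): each row exposes the prompt (signature plus docstring), the canonical solution, and the unit tests used for functional-correctness checks.

```python
from datasets import load_dataset

# HumanEval ships as a single "test" split of 164 problems.
humaneval = load_dataset("openai/openai_humaneval", split="test")
problem = humaneval[0]
print(problem["prompt"])              # function signature + docstring to complete
print(problem["canonical_solution"])  # reference implementation
print(problem["test"])                # unit tests run against generated code
```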
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SLR-Bench: Scalable Logical Reasoning Benchmark for LLMs
🆕 June 2024: Reward Model & Evaluation Pipeline Released! Systematically evaluate model-generated rules via symbolic execution, fully automatic and verifiable. Supports evaluation and RLVR. 👉 Demo on Hugging Face Spaces
SLR-Bench is a scalable, fully-automated benchmark designed to systematically evaluate and train Large Language Models (LLMs) in logical reasoning via inductive logic programming (ILP) tasks. Built with… See the full description on the dataset page: https://huggingface.co/datasets/AIML-TUDA/SLR-Bench.
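A hedged loading sketch; the excerpt names neither configurations nor splits, so both are discovered at runtime rather than hard-coded.

```python
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("AIML-TUDA/SLR-Bench")
print(configs)

slr = load_dataset("AIML-TUDA/SLR-Bench", configs[0])
print(slr)  # splits and column names of the first configuration
```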
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FLEURS
Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers of the train sets are different from the speakers of the dev/test sets. Multilingual fine-tuning is used, and the "unit error rate" (characters, signs) is averaged over all languages. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
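A brief loading sketch, assuming the usual per-language configuration names such as "en_us" and an installed audio backend (e.g. soundfile); neither assumption comes from the excerpt above.

```python
from datasets import load_dataset

# Each FLEURS configuration is one language; splits are train/validation/test.
fleurs_en = load_dataset("google/fleurs", "en_us", split="validation")
sample = fleurs_en[0]
print(sample["transcription"])           # reference transcript
print(sample["audio"]["sampling_rate"])  # decoded waveform metadata
```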
jlbaker361/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
lilyyellow/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Overview
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 evaluation.
Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
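A hedged loading sketch; split and column names are printed rather than assumed, since the excerpt does not enumerate them.

```python
from datasets import load_dataset

dna = load_dataset("LibrAI/do-not-answer")
print(dna)                  # splits and sizes
split = next(iter(dna))
print(dna[split].features)  # question / risk-taxonomy columns as defined by the repo
print(dna[split][0])        # one prompt a responsible model should decline
```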
Dataset Card for H4 Code Evaluation Prompts
These are a filtered set of prompts for evaluating code instruction models. They cover a variety of languages and task types. Currently, we used ChatGPT (GPT-3.5-turbo) to generate them, so we encourage using them only for qualitative evaluation and not to train your models. The generation of this data is similar to something like CodeAlpaca, which you can download here, but we intend to make these tasks both a) more challenging… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MT Bench by LMSYS
This set of evaluation prompts was created by the LMSYS org for better evaluation of chat models. For more information, see the paper.
Dataset loading
To load this dataset, use 🤗 datasets:
from datasets import load_dataset
data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
Dataset creation
To create the dataset, we did the following with our internal tooling:
rename turns to prompts, add empty reference to remaining prompts… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts.
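As a hedged follow-up to the loading snippet above: the column names implied by the creation notes (e.g. prompt, reference) are assumptions about the repo layout, so this sketch prints the features to confirm them before use.

```python
from datasets import load_dataset

data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
print(data.features)  # confirm the prompt / reference / category columns
print(data[0])        # one multi-turn evaluation prompt
```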
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ChatGPT Gemini Claude Perplexity Human Evaluation Multi Aspect Review Dataset
Introduction
Human evaluations and reviews with scalar scores of AI service responses are very useful for LLM Finetuning, Human Preference Alignment, Few-Shot Learning, Bad Case Shooting, etc., but extremely difficult to collect. This dataset is collected from the DeepNLP AI Service User Review panel (http://www.deepnlp.org/store), which is an open review website where users can give reviews and upload… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset.