Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model-Written Evaluation Datasets
This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to:
- Those who are interested in understanding the quality and properties of model-generated data
- Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/model-written-evals.
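A hedged exploration sketch (not from the card): assuming the repository stores the evaluations as JSON/JSONL files grouped by behavior, which the excerpt does not state, the files can be listed with huggingface_hub and a single file loaded with the generic json builder.

```python
from huggingface_hub import hf_hub_url, list_repo_files
from datasets import load_dataset

repo_id = "Anthropic/model-written-evals"
# List the data files in the dataset repo and keep only JSON/JSONL files.
files = [f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith((".jsonl", ".json"))]
print(files[:5])  # see which evaluation files are available

# Load the first file found; the choice of file here is illustrative, not prescribed.
url = hf_hub_url(repo_id=repo_id, filename=files[0], repo_type="dataset")
evals = load_dataset("json", data_files=url, split="train")
print(evals[0])
```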
Other license: https://choosealicense.com/licenses/other/
Dataset Card for GLUE
Dataset Summary
GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
Supported Tasks and Leaderboards
The leaderboard for the GLUE benchmark can be found on the GLUE website (https://gluebenchmark.com/). The benchmark comprises the following tasks:
ax
A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
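A small loading sketch (not part of the card) for the "ax" diagnostic set named above, using 🤗 datasets; "ax" ships only a test split because it is evaluation-only.

```python
from datasets import load_dataset

# Load the GLUE diagnostic ("ax") configuration; it only has a "test" split.
ax = load_dataset("nyu-mll/glue", "ax", split="test")
print(ax)     # row count and column names
print(ax[0])  # a single premise/hypothesis pair
```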
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
Dataset Summary
We provide here the data accompanying the paper: Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations.
Dataset Structure
Data Instances
We release the set of queries, as well as the autorater & human evaluation judgements collected for our experiments.
Data overview
List of queries:
Data Structure
The list… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ContextEval.
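A hedged loading sketch for inspecting the released queries and judgements; configuration and split names are discovered at runtime rather than assumed, since the excerpt does not list them.

```python
from datasets import get_dataset_config_names, load_dataset

# Discover whatever configurations the repo exposes (names are not taken from the card).
configs = get_dataset_config_names("allenai/ContextEval")
print(configs)

ctx = load_dataset("allenai/ContextEval", configs[0])
print(ctx)            # available splits and their sizes
split = next(iter(ctx))
print(ctx[split][0])  # one query / judgement record
```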
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MERA (Multimodal Evaluation for Russian-language Architectures)
Summary
MERA (Multimodal Evaluation for Russian-language Architectures) is a new open independent benchmark for the evaluation of SOTA models for the Russian language. The MERA benchmark unites industry and academic partners in one place to research the capabilities of fundamental models, draw attention to AI-related issues, foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/MERA.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ygl1020/Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
bigcode/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for BeaverTails-Evaluation
BeaverTails is an AI safety-focused collection comprising a series of datasets. This repository contains test prompts specifically designed for evaluating language model safety. It is important to note that although each prompt can be connected to multiple categories, only one category is labeled for each prompt. The 14 harm categories are defined as follows:
Animal Abuse: This involves any form of cruelty or harm inflicted on animals… See the full description on the dataset page: https://huggingface.co/datasets/PKU-Alignment/BeaverTails-Evaluation.
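A hedged loading sketch: the column names are not listed in the excerpt, so the snippet prints the dataset features instead of assuming them.

```python
from datasets import load_dataset

beaver = load_dataset("PKU-Alignment/BeaverTails-Evaluation")
print(beaver)                  # available splits and sizes
split = next(iter(beaver))     # take whichever split the repo exposes
print(beaver[split].features)  # actual column names, incl. the harm-category label
print(beaver[split][0])        # one prompt with its single labeled category
```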
Unknown license: https://choosealicense.com/licenses/unknown/
SuperTweetEval
Dataset Card for "super_tweeteval"
Dataset Summary
This is the official repository for SuperTweetEval, a unified benchmark of 12 heterogeneous NLP tasks. More details on the tasks and an evaluation of language models can be found in the reference paper, published in EMNLP 2023 (Findings).
Data Splits
All tasks provide custom training, validation and test splits.
task | dataset | load dataset | description | number of instances
Topic… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/super_tweeteval.
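A hedged sketch of how the per-task splits can be pulled down; the assumption (not stated in the truncated table above) is that each task is exposed as its own configuration, so the configurations are discovered rather than hard-coded.

```python
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("cardiffnlp/super_tweeteval")
print(configs)  # expected: one configuration per task

task = load_dataset("cardiffnlp/super_tweeteval", configs[0])
print(task)     # the custom train/validation/test splits noted above
```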
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GREEN Dataset
We share the dataset used to train the LLM metric introduced in "GREEN: Generative Radiology Report Evaluation and Error Notation". GREEN is an evaluation metric for radiology reports that uses language models to identify and explain clinically significant errors, offering better alignment with expert preferences and more interpretable results than existing metrics. The method provides both quantitative scores and qualitative explanations, and has been validated… See the full description on the dataset page: https://huggingface.co/datasets/StanfordAIMI/GREEN.
ExponentialScience/is-dlt-models-evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they are not included in the training sets of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
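A short loading sketch (not part of the card): each row exposes the prompt (signature plus docstring), the canonical solution, and the unit tests used for functional-correctness checks.

```python
from datasets import load_dataset

# HumanEval ships as a single "test" split of 164 problems.
humaneval = load_dataset("openai/openai_humaneval", split="test")
problem = humaneval[0]
print(problem["prompt"])              # function signature + docstring to complete
print(problem["canonical_solution"])  # reference implementation
print(problem["test"])                # unit tests run against generated code
```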
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SLR-Bench: Scalable Logical Reasoning Benchmark for LLMs
🆕 June 2024: Reward Model & Evaluation Pipeline Released! Systematically evaluate model-generated rules via symbolic execution, fully automatic and verifiable. Supports evaluation and RLVR. 👉 Demo on Hugging Face Spaces
SLR-Bench is a scalable, fully-automated benchmark designed to systematically evaluate and train Large Language Models (LLMs) in logical reasoning via inductive logic programming (ILP) tasks. Built with… See the full description on the dataset page: https://huggingface.co/datasets/AIML-TUDA/SLR-Bench.
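A hedged loading sketch; the excerpt names neither configurations nor splits, so both are discovered at runtime rather than hard-coded.

```python
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("AIML-TUDA/SLR-Bench")
print(configs)

slr = load_dataset("AIML-TUDA/SLR-Bench", configs[0])
print(slr)  # splits and column names of the first configuration
```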
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FLEURS
Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers of the train sets are different from the speakers of the dev/test sets. Multilingual fine-tuning is used, and the "unit error rate" (characters, signs) is averaged over all languages. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
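A brief loading sketch, assuming the usual per-language configuration names such as "en_us" and an installed audio backend (e.g. soundfile); neither assumption comes from the excerpt above.

```python
from datasets import load_dataset

# Each FLEURS configuration is one language; splits are train/validation/test.
fleurs_en = load_dataset("google/fleurs", "en_us", split="validation")
sample = fleurs_en[0]
print(sample["transcription"])           # reference transcript
print(sample["audio"]["sampling_rate"])  # decoded waveform metadata
```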
jlbaker361/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
lilyyellow/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Overview
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 evaluation.
Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
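A hedged loading sketch; split and column names are printed rather than assumed, since the excerpt does not enumerate them.

```python
from datasets import load_dataset

dna = load_dataset("LibrAI/do-not-answer")
print(dna)                  # splits and sizes
split = next(iter(dna))
print(dna[split].features)  # question / risk-taxonomy columns as defined by the repo
print(dna[split][0])        # one prompt a responsible model should decline
```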
Dataset Card for H4 Code Evaluation Prompts
These are a filtered set of prompts for evaluating code instruction models. They cover a variety of languages and task types. Currently, we used ChatGPT (GPT-3.5-turbo) to generate them, so we encourage using them only for qualitative evaluation and not to train your models. The generation of this data is similar to something like CodeAlpaca, which you can download here, but we intend to make these tasks both a) more challenging… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MT Bench by LMSYS
This set of evaluation prompts was created by the LMSYS org for better evaluation of chat models. For more information, see the paper.
Dataset loading
To load this dataset, use 🤗 datasets:
from datasets import load_dataset
data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
Dataset creation
To create the dataset, we did the following with our internal tooling:
rename turns to prompts, add empty reference to remaining prompts… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts.
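As a hedged follow-up to the loading snippet above: the column names implied by the creation notes (e.g. prompt, reference) are assumptions about the repo layout, so this sketch prints the features to confirm them before use.

```python
from datasets import load_dataset

data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
print(data.features)  # confirm the prompt / reference / category columns
print(data[0])        # one multi-turn evaluation prompt
```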
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ChatGPT Gemini Claude Perplexity Human Evaluation Multi Aspect Review Dataset
Introduction
Human evaluations and reviews with scalar scores of AI service responses are very useful for LLM Finetuning, Human Preference Alignment, Few-Shot Learning, Bad Case Shooting, etc., but extremely difficult to collect. This dataset is collected from the DeepNLP AI Service User Review panel (http://www.deepnlp.org/store), which is an open review website where users can give reviews and upload… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset.