Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for IFEval
Dataset Summary
This dataset contains the prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models. It contains around 500 "verifiable instructions", such as "write in more than 400 words" and "mention the keyword of AI at least 3 times", which can be verified by heuristics. To load the dataset, run:
from datasets import load_dataset
ifeval = load_dataset("google/IFEval")
Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/IFEval.
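To make "verified by heuristics" concrete, here is a minimal sketch of how two of the quoted constraints could be checked automatically. The split and field names ("train", "prompt") follow the dataset card, but the checker functions are simplified illustrations and are not the official IFEval verification code.

from datasets import load_dataset

# Simplified illustration of heuristic checks; not the official IFEval verifiers.
def more_than_n_words(text, n=400):
    # "write in more than 400 words"
    return len(text.split()) > n

def mentions_keyword_at_least(text, keyword="AI", times=3):
    # "mention the keyword of AI at least 3 times"
    return text.count(keyword) >= times

ifeval = load_dataset("google/IFEval", split="train")
print(ifeval[0]["prompt"])          # inspect one verifiable-instruction prompt

response = "..."                    # a model's response would go here
print(more_than_n_words(response), mentions_keyword_at_least(response))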
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
IFEval-Ko: Korean Instruction-Following Benchmark for LLMs
This dataset originates from the IFEval dataset.
A Korean-language README is also available. IFEval-Ko is a Korean adaptation of Google's open-source IFEval benchmark, used with the lm-evaluation-harness framework. It enables evaluation of large language models (LLMs) for their instruction-following capabilities in Korean.
Dataset Details
Original Source: google/IFEval
Adaptation Author: Allganize Inc. LLM TEAM |… See the full description on the dataset page: https://huggingface.co/datasets/allganize/IFEval-Ko.
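Since IFEval-Ko is intended to be run through lm-evaluation-harness, a hedged sketch of a Python-side invocation via the harness's simple_evaluate entry point is shown below. The task name "ifeval_ko" and the model checkpoint are illustrative assumptions; the dataset card is the authoritative reference for how the task is actually registered.

import lm_eval

# Hedged sketch: the task name "ifeval_ko" and the checkpoint are assumptions,
# not confirmed here; consult the dataset card for the real task setup.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["ifeval_ko"],
)
print(results["results"])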
We introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which uses a hybrid framework combining LLM and human annotators, expands upon IFEval by incorporating multi-turn sequences and translating the English prompts into 7 other languages, resulting in a dataset of 4,501 multilingual conversations, each with three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All of the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 average accuracy across all languages at the first turn to 0.707 at the third turn. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
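To make the per-turn degradation concrete, the sketch below aggregates average accuracy per turn from hypothetical per-conversation outcomes; the data structure is an illustrative assumption, not the Multi-IF release format.

from statistics import mean

# Hypothetical outcomes: results[turn] is a list of booleans, one per conversation,
# marking whether the instructions for that turn were followed. This structure is
# an illustrative assumption, not the Multi-IF release format.
results = {
    1: [True, True, False, True],
    2: [True, False, False, True],
    3: [False, False, False, True],
}

# Average accuracy per turn; in Multi-IF, accuracy drops as turns accumulate
# (e.g. o1-preview: 0.877 at turn 1 vs. 0.707 at turn 3, averaged over languages).
for turn in sorted(results):
    print(f"turn {turn}: accuracy = {mean(results[turn]):.3f}")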
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Persian IFEval
Dataset Summary
Persian IFEval is a localized and culturally adapted version of the IFEval benchmark, designed to evaluate the ability of language models to follow complex instructions in Persian. The dataset focuses on instruction-guided text generation, especially in cases that require adherence to specific constraints such as keyword inclusion, length limits, or structural properties. The dataset was translated from English using a combination of machine… See the full description on the dataset page: https://huggingface.co/datasets/MCINext/persian-ifeval.
Overview
This dataset contains the IFEval correctness preference evaluation set for Preference Proxy Evaluations (PPE). The prompts are sampled from IFEval. This dataset is meant for benchmarking and evaluation, not for training. Paper and code links are provided on the dataset page.
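As a rough illustration of the best-of-K protocol this set supports, the sketch below picks the highest-scoring of K candidate responses under a reward model and checks the pick against an IFEval-style verifier. The reward_model.score and verify_instructions callables are hypothetical placeholders, not part of the PPE release.

# Hedged sketch of best-of-K selection with a reward model. reward_model.score and
# verify_instructions are hypothetical placeholders, not part of the PPE release.
def best_of_k(prompt, candidates, reward_model):
    # Return the candidate the reward model scores highest for this prompt.
    return max(candidates, key=lambda c: reward_model.score(prompt, c))

def best_of_k_accuracy(examples, reward_model, verify_instructions):
    # Fraction of prompts where the reward model's pick passes the IFEval-style checks.
    hits = 0
    for ex in examples:
        chosen = best_of_k(ex["prompt"], ex["candidates"], reward_model)
        hits += bool(verify_instructions(ex, chosen))
    return hits / len(examples)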
License
User prompts are licensed under Apache-2.0, and model outputs are governed by the terms of use set by the respective model providers.
Citation
@misc{frick2024evaluaterewardmodelsrlhf, title={How to Evaluate Reward… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/PPE-IFEval-Best-of-K.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for IFEval Greek
The IFEval Greek dataset contains 541 prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models, manually translated into Greek. The dataset contains "verifiable instructions" such as "απάντησε με περισσότερες από 400 λέξεις" ("answer in more than 400 words") and "ανάφερε τη λέξη ΤΝ τουλάχιστον 3 φορές" ("mention the word AI at least 3 times"), which can be verified by heuristics.
Dataset Details
Dataset Description
Curated by: ILSP/Athena RC
Language(s) (NLP): el… See the full description on the dataset page: https://huggingface.co/datasets/ilsp/ifeval_greek.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Arabic IFEval is the first publicly available benchmark dataset specifically designed to evaluate large language models (LLMs) on instruction-following capabilities in Arabic. The dataset includes 404 high-quality, manually verified samples covering various constraints such as linguistic patterns, punctuation rules, and formatting guidelines.
Loading the Dataset
To load this dataset in Python using the 🤗 Datasets library, run the following:
from datasets import load_dataset
ifeval… See the full description on the dataset page: https://huggingface.co/datasets/inceptionai/Arabic_IFEval.
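Since the snippet above is cut off, here is a minimal loading sketch; the split name "train" is an assumption, so check the dataset page for the actual configuration.

from datasets import load_dataset

# Minimal loading sketch; the split name "train" is an assumption (see the dataset page).
arabic_ifeval = load_dataset("inceptionai/Arabic_IFEval", split="train")
print(arabic_ifeval[0])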
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for IFEval
Dataset Summary
This dataset contains the prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models. It contains around 500 "verifiable instructions", such as "write in more than 400 words" and "mention the keyword of AI at least 3 times", which can be verified by heuristics. To load the dataset, run:
from datasets import load_dataset
ifeval = load_dataset("mii-llm/ifeval-ita")
Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/mii-llm/ifeval-ita.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for IFEval_ca
IFEval_ca is a prompt dataset in Catalan, professionally translated from the original English version of the IFEval dataset.
Dataset Details
Dataset Description
IFEval_ca (Instruction-Following Eval benchmark, Catalan) is designed to evaluate chat- or instruction-fine-tuned language models. The dataset comprises 541 "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times"… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/IFEval_ca.