https://choosealicense.com/licenses/odc-by/
Code | Leaderboard | Results | Paper
RewardBench 2 Evaluation Dataset Card
The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:
Factuality (NEW!): Tests the ability of RMs to detect hallucinations and other basic errors in completions. Precise Instruction Following (NEW!): Tests the ability of RMs… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-2.
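As a quick orientation, here is a minimal loading sketch using the datasets library; the default configuration and the "test" split name are assumptions, so check the dataset card before relying on them.

# Minimal sketch: load RewardBench 2 and inspect one example.
# Assumes the default configuration and a "test" split.
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench-2", split="test")
print(ds)       # shows the available columns
print(ds[0])    # print one row to confirm the actual schema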
davidanugraha/reward-bench-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
john02171574/reward-bench-2-converted dataset hosted on Hugging Face and contributed by the HF Datasets community
Results for Holistic Evaluation of Reward Models (HERM) Benchmark
Here, you'll find the raw scores for the HERM project.
The repository is structured as follows.
├── best-of-n/ <- Nested directory for different completions on Best of N challenge
| ├── alpaca_eval/ <- results for each reward model
| | ├── tulu-13b/{org}/{model}.json
| | └── zephyr-7b/{org}/{model}.json
| └── mt_bench/
|… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-results.
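For readers who want to pull a single results file programmatically, here is a hedged sketch using huggingface_hub; the org and model names below are hypothetical placeholders, not actual entries in the repository.

# Sketch: download one best-of-n result file from the results repository.
# "some-org" and "some-model" are hypothetical placeholders.
import json
from huggingface_hub import hf_hub_download

org, model = "some-org", "some-model"
path = hf_hub_download(
    repo_id="allenai/reward-bench-results",
    repo_type="dataset",
    filename=f"best-of-n/alpaca_eval/tulu-13b/{org}/{model}.json",
)
with open(path) as f:
    scores = json.load(f)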
allenai/reward-bench-2-results dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
Multilingual Reward Bench (v1.0)
Reward models (RMs) have driven the development of state-of-the-art LLMs today, with unprecedented impact across the globe. However, their performance in multilingual settings still remains understudied. In order to probe reward model behavior on multilingual data, we present M-RewardBench, a benchmark for 23 typologically diverse languages. M-RewardBench contains prompt-chosen-rejected preference triples obtained by curating and translating chat… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabsCommunity/multilingual-reward-bench.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
fc-reward-bench
fc-reward-bench is a benchmark designed to evaluate reward model performance in function-calling tasks. It features 1,500 unique user inputs derived from the single-turn splits of the BFCL-v3 dataset. Each input is paired with both correct and incorrect function calls. Correct calls are sourced directly from BFCL, while incorrect calls are generated by 25 permissively licensed models.
Dataset Structure
Each entry in the dataset includes the following… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/fc-reward-bench.
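Since the field list is truncated above, the sketch below only loads the dataset and prints its column names rather than assuming a schema; the split name is an assumption.

# Sketch: load fc-reward-bench and inspect its columns before using specific fields.
from datasets import load_dataset

ds = load_dataset("ibm-research/fc-reward-bench", split="test")  # split name assumed
print(ds.column_names)   # confirm the real schema from the dataset itself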
AgentRewardBench
💾Code 📄Paper 🌐Website
🤗Dataset 💻Demo 🏆Leaderboard
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Xing Han Lù, Amirhossein Kazemnejad*, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy*
*Core Contributor
Loading dataset
You can use the huggingface_hub library to load the dataset. The dataset is available on Huggingface Hub at… See the full description on the dataset page: https://huggingface.co/datasets/McGill-NLP/agent-reward-bench.
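Because the exact loading snippet is truncated, here is a hedged sketch along the lines the card suggests, using huggingface_hub; the repository's internal file layout is not assumed, so the files are listed first.

# Sketch: list and download the AgentRewardBench dataset repository.
from huggingface_hub import list_repo_files, snapshot_download

files = list_repo_files("McGill-NLP/agent-reward-bench", repo_type="dataset")
print(files[:10])                     # inspect what the repository actually contains

local_dir = snapshot_download("McGill-NLP/agent-reward-bench", repo_type="dataset")
print(local_dir)                      # local path of the downloaded snapshot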
Ayush-Singh/reward-bench-Llama-2-13b-hf-yes-no dataset hosted on Hugging Face and contributed by the HF Datasets community
Ayush-Singh/reward-bench-gemma-2-2b-it-set3-scores dataset hosted on Hugging Face and contributed by the HF Datasets community
Ayush-Singh/RM-Bench-chat-Skywork-Reward-Llama-3.1-8B-v0.2-normal dataset hosted on Hugging Face and contributed by the HF Datasets community
NaykinYT/reward-general-bench dataset hosted on Hugging Face and contributed by the HF Datasets community
[📖 arXiv Paper] [📊 MM-RLHF Data] [📝 Homepage] [🏆 Reward Model] [🔮 MM-RewardBench] [🔮 MM-SafetyBench] [📈 Evaluation Suite]
The Next Step Forward in Multimodal LLM Alignment
[2025/02/10] 🔥 We are proud to open-source MM-RLHF, a comprehensive project for aligning Multimodal Large Language Models (MLLMs) with human preferences. This release includes:
A high-quality MLLM alignment dataset. A strong Critique-Based MLLM reward model and its training algorithm. A novel… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/MM-RLHF-RewardBench.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
[📖 arXiv Paper] [📊 R1-Reward Code] [📝 R1-Reward Model]
Training Multimodal Reward Model Through Stable Reinforcement Learning
🔥 We are proud to open-source R1-Reward, a comprehensive project for improving reward modeling through reinforcement learning. This release includes:
R1-Reward Model: A state-of-the-art (SOTA) multimodal reward model demonstrating substantial gains (Voting@15): a 13.5% improvement on VL Reward-Bench and a 3.5% improvement on MM-RLHF Reward-Bench.… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/R1-Reward-RL.
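To make the Voting@15 metric concrete, here is a hedged sketch of the general voting-at-k idea: sample k judgments from the reward model and keep the majority verdict. The judge function is hypothetical and stands in for whatever pairwise judging interface the model exposes.

# Sketch: majority voting over k sampled reward-model judgments (Voting@k).
# `judge` is a hypothetical callable returning "A" or "B" for a response pair.
from collections import Counter

def voting_at_k(judge, prompt, response_a, response_b, k=15):
    votes = Counter(judge(prompt, response_a, response_b) for _ in range(k))
    return votes.most_common(1)[0][0]   # the majority preference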
Ayush-Singh/RM-Bench-safety-response-Skywork-Reward-Llama-3.1-8B-v0.2-normal dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Libra Bench
Overview
Libra Bench is a reasoning-oriented reward model (RM) benchmark, systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models. It is specifically designed to evaluate pointwise judging accuracy with respect to correctness. These attributes keep Libra Bench well aligned with contemporary research, where reasoning models are primarily assessed and optimized… See the full description on the dataset page: https://huggingface.co/datasets/meituan/Libra-Bench.
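Pointwise judging accuracy, as described here, reduces to agreement between the model's correct/incorrect verdicts and the gold labels; the sketch below assumes boolean verdicts and labels, which is an assumption about the data format.

# Sketch: pointwise judging accuracy = fraction of verdicts matching the gold labels.
def pointwise_accuracy(verdicts, gold):
    assert len(verdicts) == len(gold)
    return sum(v == g for v, g in zip(verdicts, gold)) / len(gold)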