https://choosealicense.com/licenses/odc-by/
Code | Leaderboard | Results | Paper
RewardBench 2 Evaluation Dataset Card
The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:
Factuality (NEW!): Tests the ability of RMs to detect hallucinations and other basic errors in completions. Precise Instruction Following (NEW!): Tests the ability of RMs… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-2.
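As a quick orientation, here is a minimal loading sketch using the datasets library; the default configuration and the "test" split name are assumptions, so check the dataset card before relying on them.

# Minimal sketch: load RewardBench 2 and inspect one example.
# Assumes the default configuration and a "test" split.
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench-2", split="test")
print(ds)       # shows the available columns
print(ds[0])    # print one row to confirm the actual schema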
davidanugraha/reward-bench-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
john02171574/reward-bench-2-converted dataset hosted on Hugging Face and contributed by the HF Datasets community
Results for Holistic Evaluation of Reward Models (HERM) Benchmark
Here, you'll find the raw scores for the HERM project.
The repository is structured as follows.
├── best-of-n/ <- Nested directory for different completions on Best of N challenge
| ├── alpaca_eval/ <- results for each reward model
| | ├── tulu-13b/{org}/{model}.json
| | └── zephyr-7b/{org}/{model}.json
| └── mt_bench/
|… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-results.
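For readers who want to pull a single results file programmatically, here is a hedged sketch using huggingface_hub; the org and model names below are hypothetical placeholders, not actual entries in the repository.

# Sketch: download one best-of-n result file from the results repository.
# "some-org" and "some-model" are hypothetical placeholders.
import json
from huggingface_hub import hf_hub_download

org, model = "some-org", "some-model"
path = hf_hub_download(
    repo_id="allenai/reward-bench-results",
    repo_type="dataset",
    filename=f"best-of-n/alpaca_eval/tulu-13b/{org}/{model}.json",
)
with open(path) as f:
    scores = json.load(f)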
allenai/reward-bench-2-results dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
Multilingual Reward Bench (v1.0)
Reward models (RMs) have driven the development of state-of-the-art LLMs today, with unprecedented impact across the globe. However, their performance in multilingual settings still remains understudied. In order to probe reward model behavior on multilingual data, we present M-RewardBench, a benchmark for 23 typologically diverse languages. M-RewardBench contains prompt-chosen-rejected preference triples obtained by curating and translating chat… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabsCommunity/multilingual-reward-bench.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
fc-reward-bench
fc-reward-bench is a benchmark designed to evaluate reward model performance in function-calling tasks. It features 1,500 unique user inputs derived from the single-turn splits of the BFCL-v3 dataset. Each input is paired with both correct and incorrect function calls. Correct calls are sourced directly from BFCL, while incorrect calls are generated by 25 permissively licensed models.
Dataset Structure
Each entry in the dataset includes the following… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/fc-reward-bench.
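Since the field list is truncated above, the sketch below only loads the dataset and prints its column names rather than assuming a schema; the split name is an assumption.

# Sketch: load fc-reward-bench and inspect its columns before using specific fields.
from datasets import load_dataset

ds = load_dataset("ibm-research/fc-reward-bench", split="test")  # split name assumed
print(ds.column_names)   # confirm the real schema from the dataset itself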
AgentRewardBench
💾Code 📄Paper 🌐Website
🤗Dataset 💻Demo 🏆Leaderboard
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Xing Han Lù, Amirhossein Kazemnejad*, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy*
*Core Contributor
Loading dataset
You can use the huggingface_hub library to load the dataset. The dataset is available on Huggingface Hub at… See the full description on the dataset page: https://huggingface.co/datasets/McGill-NLP/agent-reward-bench.
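Because the exact loading snippet is truncated, here is a hedged sketch along the lines the card suggests, using huggingface_hub; the repository's internal file layout is not assumed, so the files are listed first.

# Sketch: list and download the AgentRewardBench dataset repository.
from huggingface_hub import list_repo_files, snapshot_download

files = list_repo_files("McGill-NLP/agent-reward-bench", repo_type="dataset")
print(files[:10])                     # inspect what the repository actually contains

local_dir = snapshot_download("McGill-NLP/agent-reward-bench", repo_type="dataset")
print(local_dir)                      # local path of the downloaded snapshot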
Ayush-Singh/reward-bench-Llama-2-13b-hf-yes-no dataset hosted on Hugging Face and contributed by the HF Datasets community
Ayush-Singh/reward-bench-gemma-2-2b-it-set3-scores dataset hosted on Hugging Face and contributed by the HF Datasets community
Ayush-Singh/RM-Bench-chat-Skywork-Reward-Llama-3.1-8B-v0.2-normal dataset hosted on Hugging Face and contributed by the HF Datasets community
NaykinYT/reward-general-bench dataset hosted on Hugging Face and contributed by the HF Datasets community
[📖 arXiv Paper] [📊 MM-RLHF Data] [📝 Homepage] [🏆 Reward Model] [🔮 MM-RewardBench] [🔮 MM-SafetyBench] [📈 Evaluation Suite]
The Next Step Forward in Multimodal LLM Alignment
[2025/02/10] 🔥 We are proud to open-source MM-RLHF, a comprehensive project for aligning Multimodal Large Language Models (MLLMs) with human preferences. This release includes:
A high-quality MLLM alignment dataset. A strong Critique-Based MLLM reward model and its training algorithm. A novel… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/MM-RLHF-RewardBench.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
[📖 arXiv Paper] [📊 R1-Reward Code] [📝 R1-Reward Model]
Training Multimodal Reward Model Through Stable Reinforcement Learning
🔥 We are proud to open-source R1-Reward, a comprehensive project for improving reward modeling through reinforcement learning. This release includes:
R1-Reward Model: A state-of-the-art (SOTA) multimodal reward model demonstrating substantial gains (Voting@15): a 13.5% improvement on VL Reward-Bench and a 3.5% improvement on MM-RLHF Reward-Bench.… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/R1-Reward-RL.
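To make the Voting@15 metric concrete, here is a hedged sketch of the general voting-at-k idea: sample k judgments from the reward model and keep the majority verdict. The judge function is hypothetical and stands in for whatever pairwise judging interface the model exposes.

# Sketch: majority voting over k sampled reward-model judgments (Voting@k).
# `judge` is a hypothetical callable returning "A" or "B" for a response pair.
from collections import Counter

def voting_at_k(judge, prompt, response_a, response_b, k=15):
    votes = Counter(judge(prompt, response_a, response_b) for _ in range(k))
    return votes.most_common(1)[0][0]   # the majority preference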
Ayush-Singh/RM-Bench-safety-response-Skywork-Reward-Llama-3.1-8B-v0.2-normal dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Libra Bench
Overview
Libra Bench is a reasoning-oriented reward model (RM) benchmark, systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models. It is specifically designed to evaluate pointwise judging accuracy with respect to correctness. These attributes keep Libra Bench well aligned with contemporary research, where reasoning models are primarily assessed and optimized… See the full description on the dataset page: https://huggingface.co/datasets/meituan/Libra-Bench.
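Pointwise judging accuracy, as described here, reduces to agreement between the model's correct/incorrect verdicts and the gold labels; the sketch below assumes boolean verdicts and labels, which is an assumption about the data format.

# Sketch: pointwise judging accuracy = fraction of verdicts matching the gold labels.
def pointwise_accuracy(verdicts, gold):
    assert len(verdicts) == len(gold)
    return sum(v == g for v, g in zip(verdicts, gold)) / len(gold)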