Open Data Commons Attribution License (ODC-By) https://choosealicense.com/licenses/odc-by/
Code | Leaderboard | Results | Paper
RewardBench 2 Evaluation Dataset Card
The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:
Factuality (NEW!): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
Precise Instruction Following (NEW!): Tests the ability of RMs… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-2.
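For orientation, here is a minimal, hedged sketch of pulling the dataset with the Hugging Face datasets library. The split name "test" is an assumption, so the code prints the schema rather than hard-coding field names:

```python
from datasets import load_dataset

# Minimal sketch: load RewardBench 2 and inspect its structure.
# The split name "test" is an assumption; check the dataset card for actual splits.
ds = load_dataset("allenai/reward-bench-2", split="test")
print(ds.column_names)  # discover the schema instead of assuming field names
print(ds[0])            # one evaluation example
```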
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains the data presented in RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. Code: https://github.com/jinzhuoran/RAG-RewardBench/
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🏆 [VideoGen-RewardBench Leaderboard]
Introduction
VideoGen-RewardBench is a comprehensive benchmark designed to evaluate the performance of video reward models on modern text-to-video (T2V) systems. Building on the third-party VideoGen-Eval (Zeng et al., 2024), we construct 26.5k (prompt, Video A, Video B) triplets and employ expert annotators to provide pairwise preference labels. These annotations are based on key evaluation dimensions—Visual Quality (VQ), Motion… See the full description on the dataset page: https://huggingface.co/datasets/KwaiVGI/VideoGen-RewardBench.
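To make the pairwise setup concrete, here is a hedged sketch of how such triplets are typically scored against a video reward model. `score_video` and the field names are hypothetical placeholders, not the benchmark's actual schema:

```python
# Hedged sketch of pairwise-preference evaluation on (prompt, Video A, Video B)
# triplets. `score_video` and the field names ("prompt", "video_a", "video_b",
# "preference") are hypothetical; consult the dataset card for the real schema.
def pairwise_accuracy(triplets, score_video):
    correct = 0
    for ex in triplets:
        a = score_video(ex["prompt"], ex["video_a"])
        b = score_video(ex["prompt"], ex["video_b"])
        model_pref = "A" if a > b else "B"
        correct += model_pref == ex["preference"]  # human pairwise label
    return correct / len(triplets)
```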
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
website | paper
AceMath-RewardBench Evaluation Dataset Card
The AceMath-RewardBench evaluation dataset evaluates capabilities of a math reward model using the best-of-N (N=8) setting for 7 datasets:
GSM8K: 1319 questions
Math500: 500 questions
Minerva Math: 272 questions
Gaokao 2023 en: 385 questions
OlympiadBench: 675 questions
College Math: 2818 questions
MMLU STEM: 3018 questions
Each example in the dataset contains:
A mathematical question
64 solution attempts with varying… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/AceMath-RewardBench.
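A hedged sketch of the best-of-N protocol described above: draw N=8 of the stored solution attempts, let the reward model pick its highest-scoring candidate, and count whether that pick is correct. `reward` and `is_correct` are hypothetical callables, and the field names are placeholders for the card's actual schema:

```python
import random

# Hedged sketch of best-of-N (N=8) evaluation: sample 8 of the 64 stored
# solution attempts, take the reward model's top pick, and check correctness.
def best_of_n_accuracy(examples, reward, is_correct, n=8, seed=0):
    rng = random.Random(seed)
    hits = 0
    for ex in examples:
        candidates = rng.sample(ex["solutions"], n)  # 8 of the 64 attempts
        best = max(candidates, key=lambda s: reward(ex["question"], s))
        hits += is_correct(ex["question"], best)
    return hits / len(examples)
```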
heyyjudes/rewardbench dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
[📖 arXiv Paper] [📊 Training Code] [📝 Homepage] [🏆 Reward Model] [🔮 MM-RewardBench] [🔮 MM-SafetyBench] [📈 Evaluation Suite]
The Next Step Forward in Multimodal LLM Alignment
[2025/02/10] 🔥 We are proud to open-source MM-RLHF, a comprehensive project for aligning Multimodal Large Language Models (MLLMs) with human preferences. This release includes:
A high-quality MLLM alignment dataset.
A strong Critique-Based MLLM reward model and its training algorithm.
A novel… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/MM-RLHF.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
fc-reward-bench (HF Papers) (arXiv)
fc-reward-bench is a benchmark designed to evaluate reward model performance in function-calling tasks. It features 1,500 unique user inputs derived from the single-turn splits of the BFCL-v3 dataset. Each input is paired with both correct and incorrect function calls. Correct calls are sourced directly from BFCL, while incorrect calls are generated by 25 permissively licensed models.
Performance of ToolRM, top reward models from… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/fc-reward-bench.
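As an illustration of the evaluation this setup implies, here is a hedged sketch of a pairwise win-rate: the reward model should score the correct function call above the incorrect one for the same input. `score_call` and the field names are hypothetical placeholders:

```python
# Hedged sketch: score each user input's correct vs. incorrect function call and
# count how often the reward model ranks the correct call higher.
def function_call_winrate(examples, score_call):
    wins = 0
    for ex in examples:
        good = score_call(ex["user_input"], ex["correct_call"])
        bad = score_call(ex["user_input"], ex["incorrect_call"])
        wins += good > bad
    return wins / len(examples)
```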
Open Data Commons Attribution License (ODC-By) https://choosealicense.com/licenses/odc-by/
Multilingual Reward Bench (v1.0)
Reward models (RMs) have driven the development of state-of-the-art LLMs today, with unprecedented impact across the globe. However, their performance in multilingual settings still remains understudied. In order to probe reward model behavior on multilingual data, we present M-RewardBench, a benchmark for 23 typologically diverse languages. M-RewardBench contains prompt-chosen-rejected preference triples obtained by curating and translating chat… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabsCommunity/multilingual-reward-bench.
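To illustrate the intended use, a hedged sketch of per-language accuracy over the prompt-chosen-rejected triples; `score` is a hypothetical reward-model callable and the field names are assumptions, not the card's confirmed schema:

```python
from collections import defaultdict

# Hedged sketch: per-language accuracy on prompt-chosen-rejected triples.
def per_language_accuracy(triples, score):
    totals, correct = defaultdict(int), defaultdict(int)
    for ex in triples:
        lang = ex["language"]
        totals[lang] += 1
        correct[lang] += (
            score(ex["prompt"], ex["chosen"]) > score(ex["prompt"], ex["rejected"])
        )
    return {lang: correct[lang] / totals[lang] for lang in totals}
```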
Reward Bench overlap with Skyworks Preferences 80k
This dataset includes the overlap between the Skyworks prompts, which are used to train top reward models, and the original RewardBench test set. More information can be found here.
rubricreward/reward-bench dataset hosted on Hugging Face and contributed by the HF Datasets community
saepark/rewardbench-binarized-preferences-nomedical dataset hosted on Hugging Face and contributed by the HF Datasets community
stochastic-parrots/rewardbench-lighteval dataset hosted on Hugging Face and contributed by the HF Datasets community
PatronusAI/glider-reward-bench-suite-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
rewardbench_eval_1203280225: RewardBench CLI Eval. Outputs
See https://github.com/allenai/reward-bench for more details. Built with the rewardbench CLI tool.
Accuracy: 0.475
Command used to run:
python /scratch/yl13579/.conda/envs/rewardbench/bin/rewardbench --model=sfairXC/FsfairX-LLaMA3-RM-v0.1 --dataset=Yuhan123/eval_data_oasst1_1k_en_their --split=train --chat_template=raw --batch_size=16 --push_results_to_hub --upload_model_metadata_to_hf
Configs
args:… See the full description on the dataset page: https://huggingface.co/datasets/Yuhan123/rewardbench_eval_1203280225.
rewardbench_eval_1702280225: RewardBench CLI Eval. Outputs
See https://github.com/allenai/reward-bench for more details. Built with the rewardbench CLI tool.
Accuracy: 0.597
Command used to run:
python /scratch/yl13579/.conda/envs/rewardbench/bin/rewardbench --model=sfairXC/FsfairX-LLaMA3-RM-v0.1 --dataset=Yuhan123/mistral-7b-our-lowest-update-gen_vs_mistral-7b --split=train --chat_template=raw --push_results_to_hub --upload_model_metadata_to_hf
Configs
args:… See the full description on the dataset page: https://huggingface.co/datasets/Yuhan123/rewardbench_eval_1702280225.
rewardbench_eval_2000300924: RewardBench CLI Eval. Outputs
See https://github.com/allenai/reward-bench for more details. Built with the rewardbench CLI tool.
Command used to run:
python /home/nathanl/.local/bin/rewardbench --model vwxyzjn/reward_modeling_EleutherAI_pythia-14m --dataset HuggingFaceH4/no_robots --split test --batch_size 128 --tokenizer=EleutherAI/pythia-14m --push_results_to_hub --chat_template oasst_pythia
Configs
args: {'batch_size': 128… See the full description on the dataset page: https://huggingface.co/datasets/natolambert/rewardbench_eval_2000300924.
saumyamalik/helpsteer2-rewardbench-contamination dataset hosted on Hugging Face and contributed by the HF Datasets community
multilingual-reward-bench/code-python dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By) https://choosealicense.com/licenses/odc-by/
Repurposed allenai/reward-bench dataset for a binary classification task, using the following script:

import random

import pandas as pd
from datasets import Dataset, load_dataset
from sklearn.model_selection import GroupKFold

data_df = load_dataset("allenai/reward-bench", split="raw").to_pandas()

rng = random.Random(43)

examples = []
for idx, row in data_df.iterrows():
    if rng.random() > 0.5:
        response_a = row["chosen"]
        response_b = row["rejected"]
        label = 0
        … See the full description on the dataset page: https://huggingface.co/datasets/rbiswasfc/reward-bench-binary-classification.
rewardbench_eval_0328290924: RewardBench CLI Eval. Outputs
See https://github.com/allenai/rewardbench for more details
Configs
args: {'batch_size': 128, 'chat_template': 'oasst_pythia', 'dataset': 'HuggingFaceH4/no_robots', 'debug': False, 'force_truncation': False, 'hf_entity': 'natolambert', 'hf_name': 'rewardbench_eval_0328290924', 'load_json': False, 'max_length': 512, 'model': 'vwxyzjn/reward_modeling_EleutherAI_pythia-14m', 'not_quantized': False… See the full description on the dataset page: https://huggingface.co/datasets/natolambert/rewardbench_eval_0328290924.