Open Data Commons Attribution License (ODC-By) https://choosealicense.com/licenses/odc-by/
Code | Leaderboard | Results | Paper
RewardBench 2 Evaluation Dataset Card
The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:
Factuality (NEW!): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
Precise Instruction Following (NEW!): Tests the ability of RMs… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-2.
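For orientation, here is a minimal, hedged sketch of pulling the dataset with the Hugging Face datasets library. The split name "test" is an assumption, so the code prints the schema rather than hard-coding field names:

```python
from datasets import load_dataset

# Minimal sketch: load RewardBench 2 and inspect its structure.
# The split name "test" is an assumption; check the dataset card for actual splits.
ds = load_dataset("allenai/reward-bench-2", split="test")
print(ds.column_names)  # discover the schema instead of assuming field names
print(ds[0])            # one evaluation example
```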
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains the data presented in RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. Code: https://github.com/jinzhuoran/RAG-RewardBench/
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🏆 [VideoGen-RewardBench Leaderboard]
Introduction
VideoGen-RewardBench is a comprehensive benchmark designed to evaluate the performance of video reward models on modern text-to-video (T2V) systems. Building on the third-party VideoGen-Eval (Zeng et al., 2024), we construct 26.5k (prompt, Video A, Video B) triplets and employ expert annotators to provide pairwise preference labels. These annotations are based on key evaluation dimensions—Visual Quality (VQ), Motion… See the full description on the dataset page: https://huggingface.co/datasets/KwaiVGI/VideoGen-RewardBench.
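To make the pairwise setup concrete, here is a hedged sketch of how such triplets are typically scored against a video reward model. `score_video` and the field names are hypothetical placeholders, not the benchmark's actual schema:

```python
# Hedged sketch of pairwise-preference evaluation on (prompt, Video A, Video B)
# triplets. `score_video` and the field names ("prompt", "video_a", "video_b",
# "preference") are hypothetical; consult the dataset card for the real schema.
def pairwise_accuracy(triplets, score_video):
    correct = 0
    for ex in triplets:
        a = score_video(ex["prompt"], ex["video_a"])
        b = score_video(ex["prompt"], ex["video_b"])
        model_pref = "A" if a > b else "B"
        correct += model_pref == ex["preference"]  # human pairwise label
    return correct / len(triplets)
```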
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
website | paper
AceMath-RewardBench Evaluation Dataset Card
The AceMath-RewardBench evaluation dataset evaluates capabilities of a math reward model using the best-of-N (N=8) setting for 7 datasets:
GSM8K: 1319 questions
Math500: 500 questions
Minerva Math: 272 questions
Gaokao 2023 en: 385 questions
OlympiadBench: 675 questions
College Math: 2818 questions
MMLU STEM: 3018 questions
Each example in the dataset contains:
A mathematical question
64 solution attempts with varying… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/AceMath-RewardBench.
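A hedged sketch of the best-of-N protocol described above: draw N=8 of the stored solution attempts, let the reward model pick its highest-scoring candidate, and count whether that pick is correct. `reward` and `is_correct` are hypothetical callables, and the field names are placeholders for the card's actual schema:

```python
import random

# Hedged sketch of best-of-N (N=8) evaluation: sample 8 of the 64 stored
# solution attempts, take the reward model's top pick, and check correctness.
def best_of_n_accuracy(examples, reward, is_correct, n=8, seed=0):
    rng = random.Random(seed)
    hits = 0
    for ex in examples:
        candidates = rng.sample(ex["solutions"], n)  # 8 of the 64 attempts
        best = max(candidates, key=lambda s: reward(ex["question"], s))
        hits += is_correct(ex["question"], best)
    return hits / len(examples)
```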
heyyjudes/rewardbench dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
[📖 arXiv Paper] [📊 Training Code] [📝 Homepage] [🏆 Reward Model] [🔮 MM-RewardBench] [🔮 MM-SafetyBench] [📈 Evaluation Suite]
The Next Step Forward in Multimodal LLM Alignment
[2025/02/10] 🔥 We are proud to open-source MM-RLHF, a comprehensive project for aligning Multimodal Large Language Models (MLLMs) with human preferences. This release includes:
A high-quality MLLM alignment dataset.
A strong Critique-Based MLLM reward model and its training algorithm.
A novel… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/MM-RLHF.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
fc-reward-bench (HF Papers) (arXiv)
fc-reward-bench is a benchmark designed to evaluate reward model performance in function-calling tasks. It features 1,500 unique user inputs derived from the single-turn splits of the BFCL-v3 dataset. Each input is paired with both correct and incorrect function calls. Correct calls are sourced directly from BFCL, while incorrect calls are generated by 25 permissively licensed models.
Performance of ToolRM, top reward models from… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/fc-reward-bench.
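As an illustration of the evaluation this setup implies, here is a hedged sketch of a pairwise win-rate: the reward model should score the correct function call above the incorrect one for the same input. `score_call` and the field names are hypothetical placeholders:

```python
# Hedged sketch: score each user input's correct vs. incorrect function call and
# count how often the reward model ranks the correct call higher.
def function_call_winrate(examples, score_call):
    wins = 0
    for ex in examples:
        good = score_call(ex["user_input"], ex["correct_call"])
        bad = score_call(ex["user_input"], ex["incorrect_call"])
        wins += good > bad
    return wins / len(examples)
```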
Open Data Commons Attribution License (ODC-By) https://choosealicense.com/licenses/odc-by/
Multilingual Reward Bench (v1.0)
Reward models (RMs) have driven the development of state-of-the-art LLMs today, with unprecedented impact across the globe. However, their performance in multilingual settings still remains understudied. In order to probe reward model behavior on multilingual data, we present M-RewardBench, a benchmark for 23 typologically diverse languages. M-RewardBench contains prompt-chosen-rejected preference triples obtained by curating and translating chat… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabsCommunity/multilingual-reward-bench.
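To illustrate the intended use, a hedged sketch of per-language accuracy over the prompt-chosen-rejected triples; `score` is a hypothetical reward-model callable and the field names are assumptions, not the card's confirmed schema:

```python
from collections import defaultdict

# Hedged sketch: per-language accuracy on prompt-chosen-rejected triples.
def per_language_accuracy(triples, score):
    totals, correct = defaultdict(int), defaultdict(int)
    for ex in triples:
        lang = ex["language"]
        totals[lang] += 1
        correct[lang] += (
            score(ex["prompt"], ex["chosen"]) > score(ex["prompt"], ex["rejected"])
        )
    return {lang: correct[lang] / totals[lang] for lang in totals}
```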
Reward Bench overlap with Skyworks Preferences 80k
This dataset includes the overlap between the Skyworks prompts, which are used to train top reward models, and the original RewardBench test set. More information can be found here.
rubricreward/reward-bench dataset hosted on Hugging Face and contributed by the HF Datasets community
saepark/rewardbench-binarized-preferences-nomedical dataset hosted on Hugging Face and contributed by the HF Datasets community
stochastic-parrots/rewardbench-lighteval dataset hosted on Hugging Face and contributed by the HF Datasets community
PatronusAI/glider-reward-bench-suite-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
rewardbench_eval_1203280225: RewardBench CLI Eval. Outputs
See https://github.com/allenai/reward-bench for more details. Built with the rewardbench CLI tool.
Accuracy: 0.475
Command used to run:
python /scratch/yl13579/.conda/envs/rewardbench/bin/rewardbench --model=sfairXC/FsfairX-LLaMA3-RM-v0.1 --dataset=Yuhan123/eval_data_oasst1_1k_en_their --split=train --chat_template=raw --batch_size=16 --push_results_to_hub --upload_model_metadata_to_hf
Configs
args:… See the full description on the dataset page: https://huggingface.co/datasets/Yuhan123/rewardbench_eval_1203280225.
rewardbench_eval_1702280225: RewardBench CLI Eval. Outputs
See https://github.com/allenai/reward-bench for more details. Built with the rewardbench CLI tool.
Accuracy: 0.597
Command used to run:
python /scratch/yl13579/.conda/envs/rewardbench/bin/rewardbench --model=sfairXC/FsfairX-LLaMA3-RM-v0.1 --dataset=Yuhan123/mistral-7b-our-lowest-update-gen_vs_mistral-7b --split=train --chat_template=raw --push_results_to_hub --upload_model_metadata_to_hf
Configs
args:… See the full description on the dataset page: https://huggingface.co/datasets/Yuhan123/rewardbench_eval_1702280225.
rewardbench_eval_2000300924: RewardBench CLI Eval. Outputs
See https://github.com/allenai/reward-bench for more details. Built with the rewardbench CLI tool.
Command used to run:
python /home/nathanl/.local/bin/rewardbench --model vwxyzjn/reward_modeling_EleutherAI_pythia-14m --dataset HuggingFaceH4/no_robots --split test --batch_size 128 --tokenizer=EleutherAI/pythia-14m --push_results_to_hub --chat_template oasst_pythia
Configs
args: {'batch_size': 128… See the full description on the dataset page: https://huggingface.co/datasets/natolambert/rewardbench_eval_2000300924.
saumyamalik/helpsteer2-rewardbench-contamination dataset hosted on Hugging Face and contributed by the HF Datasets community
multilingual-reward-bench/code-python dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By) https://choosealicense.com/licenses/odc-by/
Repurposed allenai/reward-bench dataset for a binary classification task, using the following script:

import random

import pandas as pd
from datasets import Dataset, load_dataset
from sklearn.model_selection import GroupKFold

data_df = load_dataset("allenai/reward-bench", split="raw").to_pandas()

rng = random.Random(43)

examples = []
for idx, row in data_df.iterrows():
    if rng.random() > 0.5:
        response_a = row["chosen"]
        response_b = row["rejected"]
        label = 0
        … See the full description on the dataset page: https://huggingface.co/datasets/rbiswasfc/reward-bench-binary-classification.
rewardbench_eval_0328290924: RewardBench CLI Eval. Outputs
See https://github.com/allenai/rewardbench for more details
Configs
args: {'batch_size': 128, 'chat_template': 'oasst_pythia', 'dataset': 'HuggingFaceH4/no_robots', 'debug': False, 'force_truncation': False, 'hf_entity': 'natolambert', 'hf_name': 'rewardbench_eval_0328290924', 'load_json': False, 'max_length': 512, 'model': 'vwxyzjn/reward_modeling_EleutherAI_pythia-14m', 'not_quantized': False… See the full description on the dataset page: https://huggingface.co/datasets/natolambert/rewardbench_eval_0328290924.