53 datasets found
  1. reward-bench-2

    • huggingface.co
    Updated Jun 3, 2025
    Cite
    Ai2 (2025). reward-bench-2 [Dataset]. https://huggingface.co/datasets/allenai/reward-bench-2
    Dataset updated
    Jun 3, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Code | Leaderboard | Results | Paper

      RewardBench 2 Evaluation Dataset Card
    

    The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:

    Factuality (NEW!): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
    Precise Instruction Following (NEW!): Tests the ability of RMs… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-2.

  2. RAG-RewardBench

    • huggingface.co
    Updated Dec 18, 2024
    Cite
    Zhuoran Jin (2024). RAG-RewardBench [Dataset]. https://huggingface.co/datasets/jinzhuoran/RAG-RewardBench
    Dataset updated
    Dec 18, 2024
    Authors
    Zhuoran Jin
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This repository contains the data presented in RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. Code: https://github.com/jinzhuoran/RAG-RewardBench/

  3. VideoGen-RewardBench

    • huggingface.co
    Updated Feb 9, 2025
    Cite
    Kuaishou Visual Generation and Interaction Center (2025). VideoGen-RewardBench [Dataset]. https://huggingface.co/datasets/KwaiVGI/VideoGen-RewardBench
    Dataset updated
    Feb 9, 2025
    Dataset provided by
    Kuaishou Technology (https://www.kwai.com/)
    Authors
    Kuaishou Visual Generation and Interaction Center
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    🏆 [VideoGen-RewardBench Leaderboard]

      Introduction
    

    VideoGen-RewardBench is a comprehensive benchmark designed to evaluate the performance of video reward models on modern text-to-video (T2V) systems. Building on the third-party VideoGen-Eval (Zeng et al., 2024), we construct 26.5k (prompt, Video A, Video B) triplets and employ expert annotators to provide pairwise preference labels. These annotations are based on key evaluation dimensions: Visual Quality (VQ), Motion… See the full description on the dataset page: https://huggingface.co/datasets/KwaiVGI/VideoGen-RewardBench.

  4. AceMath-RewardBench

    • huggingface.co
    Updated Jan 17, 2025
    Cite
    NVIDIA (2025). AceMath-RewardBench [Dataset]. https://huggingface.co/datasets/nvidia/AceMath-RewardBench
    Dataset updated
    Jan 17, 2025
    Dataset provided by
    Nvidia (http://nvidia.com/)
    Authors
    NVIDIA
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    website | paper

      AceMath-RewardBench Evaluation Dataset Card
    

    The AceMath-RewardBench evaluation dataset evaluates capabilities of a math reward model using the best-of-N (N=8) setting for 7 datasets:

    GSM8K: 1319 questions
    Math500: 500 questions
    Minerva Math: 272 questions
    Gaokao 2023 en: 385 questions
    OlympiadBench: 675 questions
    College Math: 2818 questions
    MMLU STEM: 3018 questions

    Each example in the dataset contains:

    A mathematical question
    64 solution attempts with varying… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/AceMath-RewardBench.
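
    As a rough illustration of the best-of-N setting mentioned above, the sketch below estimates best-of-N accuracy from per-candidate reward scores. It is not taken from the AceMath code: the function name, the (reward, is_correct) input layout, and the choice to draw N candidates at random from each question's pool over several trials are all assumptions made for this example.

```python
import random


def best_of_n_accuracy(questions, n=8, trials=100, seed=0):
    """Estimate best-of-n accuracy for a reward model.

    questions: one list per question, each holding (reward, is_correct)
    tuples for that question's candidate solutions. This layout is
    hypothetical and chosen only for the example.
    """
    rng = random.Random(seed)
    hits = 0
    total = 0
    for _ in range(trials):
        for candidates in questions:
            # Draw n candidates without replacement from the pool
            # (raises ValueError if the pool has fewer than n entries).
            subset = rng.sample(candidates, n)
            # The reward model "selects" the highest-scoring candidate.
            best = max(subset, key=lambda c: c[0])
            hits += int(best[1])
            total += 1
    return hits / total


if __name__ == "__main__":
    toy = [
        [(0.9, True), (0.1, False)],   # reward model ranks the correct answer first
        [(0.2, True), (0.8, False)],   # reward model prefers the wrong answer
    ]
    print(best_of_n_accuracy(toy, n=2, trials=3))
```

    With a pool of 64 attempts per question, calling the function with n=8 would average over random 8-candidate draws; whether the benchmark itself samples this way is not stated in the truncated description.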

  5. rewardbench

    • huggingface.co
    Updated Aug 29, 2024
    Cite
    Judy Hanwen Shen (2024). rewardbench [Dataset]. https://huggingface.co/datasets/heyyjudes/rewardbench
    Dataset updated
    Aug 29, 2024
    Authors
    Judy Hanwen Shen
    Description

    heyyjudes/rewardbench dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. MM-RLHF

    • huggingface.co
    Updated Feb 17, 2025
    + more versions
    Cite
    Yi-Fan Zhang (2025). MM-RLHF [Dataset]. https://huggingface.co/datasets/yifanzhang114/MM-RLHF
    Dataset updated
    Feb 17, 2025
    Authors
    Yi-Fan Zhang
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    [📖 arXiv Paper] [📊 Training Code] [📝 Homepage] [🏆 Reward Model] [🔮 MM-RewardBench] [🔮 MM-SafetyBench] [📈 Evaluation Suite]

      The Next Step Forward in Multimodal LLM Alignment
    

    [2025/02/10] 🔥 We are proud to open-source MM-RLHF, a comprehensive project for aligning Multimodal Large Language Models (MLLMs) with human preferences. This release includes:

    A high-quality MLLM alignment dataset. A strong Critique-Based MLLM reward model and its training algorithm. A novel… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/MM-RLHF.

  7. fc-reward-bench

    • huggingface.co
    Cite
    IBM Research, fc-reward-bench [Dataset]. https://huggingface.co/datasets/ibm-research/fc-reward-bench
    Dataset provided by
    IBM (http://ibm.com/)
    IBM Research
    Authors
    IBM Research
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    fc-reward-bench (HF papers) (arxiv)

    fc-reward-bench is a benchmark designed to evaluate reward model performance in function-calling tasks. It features 1,500 unique user inputs derived from the single-turn splits of the BFCL-v3 dataset. Each input is paired with both correct and incorrect function calls. Correct calls are sourced directly from BFCL, while incorrect calls are generated by 25 permissively licensed models.

    Performance of ToolRM, top reward models from… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/fc-reward-bench.

  8. multilingual-reward-bench

    • huggingface.co
    Updated May 15, 2025
    Cite
    Cohere Labs Community (2025). multilingual-reward-bench [Dataset]. http://doi.org/10.57967/hf/3352
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Cohere Labs Community
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Multilingual Reward Bench (v1.0)

    Reward models (RMs) have driven the development of state-of-the-art LLMs today, with unprecedented impact across the globe. However, their performance in multilingual settings still remains understudied. In order to probe reward model behavior on multilingual data, we present M-RewardBench, a benchmark for 23 typologically diverse languages. M-RewardBench contains prompt-chosen-rejected preference triples obtained by curating and translating chat… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabsCommunity/multilingual-reward-bench.
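
    A benchmark built from prompt-chosen-rejected triples is commonly scored by checking how often a reward model rates the chosen response above the rejected one. The sketch below shows that computation under stated assumptions: `pairwise_accuracy` and the `score(prompt, response)` callable are hypothetical names for this example, not part of M-RewardBench or any library, and the length-based scorer is only a stand-in for a real reward model.

```python
def pairwise_accuracy(triples, score):
    """Fraction of triples where the chosen response outscores the rejected one.

    triples: iterable of (prompt, chosen, rejected) strings.
    score:   callable (prompt, response) -> float; a stand-in for a reward model.
    """
    correct = 0
    total = 0
    for prompt, chosen, rejected in triples:
        # A triple counts as correct only on a strict preference for `chosen`.
        correct += int(score(prompt, chosen) > score(prompt, rejected))
        total += 1
    return correct / total if total else 0.0


def toy_length_score(prompt, response):
    # Hypothetical scorer for illustration only: longer responses score higher.
    return float(len(response))


if __name__ == "__main__":
    toy_triples = [
        ("What is 2 + 2?", "The answer is 4.", "5"),
        ("Name a prime.", "7", "Nine is a prime number."),
    ]
    print(pairwise_accuracy(toy_triples, toy_length_score))
```

    Treating ties as incorrect is a design choice made here; a real evaluation may handle ties differently or report per-language breakdowns.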

  9. skyworks-rewardbench-contamination

    • huggingface.co
    Updated Oct 10, 2024
    Cite
    Nathan Lambert (2024). skyworks-rewardbench-contamination [Dataset]. https://huggingface.co/datasets/natolambert/skyworks-rewardbench-contamination
    Dataset updated
    Oct 10, 2024
    Authors
    Nathan Lambert
    Description

    Reward Bench overlap with Skyworks Preferences 80k

    This dataset includes the overlap between the SkyWorks prompts, which are being used to train top reward models, with the original test set. More information found here.

  10. reward-bench

    • huggingface.co
    Updated Sep 29, 2025
    + more versions
    Cite
    rubricreward (2025). reward-bench [Dataset]. https://huggingface.co/datasets/rubricreward/reward-bench
    Dataset updated
    Sep 29, 2025
    Dataset authored and provided by
    rubricreward
    Description

    rubricreward/reward-bench dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. rewardbench-binarized-preferences-nomedical

    • huggingface.co
    Updated Mar 19, 2024
    Cite
    Sangeon Park (2024). rewardbench-binarized-preferences-nomedical [Dataset]. https://huggingface.co/datasets/saepark/rewardbench-binarized-preferences-nomedical
    Dataset updated
    Mar 19, 2024
    Authors
    Sangeon Park
    Description

    saepark/rewardbench-binarized-preferences-nomedical dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. rewardbench-lighteval

    • huggingface.co
    Cite
    Stochastic Parrots, rewardbench-lighteval [Dataset]. https://huggingface.co/datasets/stochastic-parrots/rewardbench-lighteval
    Dataset authored and provided by
    Stochastic Parrots
    Description

    stochastic-parrots/rewardbench-lighteval dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. glider-reward-bench-suite-chat

    • huggingface.co
    Updated Dec 19, 2024
    + more versions
    Cite
    Patronus AI (2024). glider-reward-bench-suite-chat [Dataset]. https://huggingface.co/datasets/PatronusAI/glider-reward-bench-suite-chat
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    Patronus AI, Inc.
    Authors
    Patronus AI
    Description

    PatronusAI/glider-reward-bench-suite-chat dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. rewardbench_eval_1203280225

    • huggingface.co
    Updated Apr 20, 2016
    + more versions
    Cite
    Liu (2016). rewardbench_eval_1203280225 [Dataset]. https://huggingface.co/datasets/Yuhan123/rewardbench_eval_1203280225
    Dataset updated
    Apr 20, 2016
    Authors
    Liu
    Description

    rewardbench_eval_1203280225: RewardBench CLI Eval. Outputs

    See https://github.com/allenai/reward-bench for more details.
    Built with the rewardbench CLI tool.
    Accuracy: 0.475
    Command used to run: python /scratch/yl13579/.conda/envs/rewardbench/bin/rewardbench --model=sfairXC/FsfairX-LLaMA3-RM-v0.1 --dataset=Yuhan123/eval_data_oasst1_1k_en_their --split=train --chat_template=raw --batch_size=16 --push_results_to_hub --upload_model_metadata_to_hf

      Configs
    

    args:… See the full description on the dataset page: https://huggingface.co/datasets/Yuhan123/rewardbench_eval_1203280225.

  15. rewardbench_eval_1702280225

    • huggingface.co
    + more versions
    Cite
    Liu, rewardbench_eval_1702280225 [Dataset]. https://huggingface.co/datasets/Yuhan123/rewardbench_eval_1702280225
    Authors
    Liu
    Description

    rewardbench_eval_1702280225: RewardBench CLI Eval. Outputs

    See https://github.com/allenai/reward-bench for more details.
    Built with the rewardbench CLI tool.
    Accuracy: 0.597
    Command used to run: python /scratch/yl13579/.conda/envs/rewardbench/bin/rewardbench --model=sfairXC/FsfairX-LLaMA3-RM-v0.1 --dataset=Yuhan123/mistral-7b-our-lowest-update-gen_vs_mistral-7b --split=train --chat_template=raw --push_results_to_hub --upload_model_metadata_to_hf

      Configs
    

    args:… See the full description on the dataset page: https://huggingface.co/datasets/Yuhan123/rewardbench_eval_1702280225.

  16. rewardbench_eval_2000300924

    • huggingface.co
    Updated Oct 1, 2024
    + more versions
    Cite
    Nathan Lambert (2024). rewardbench_eval_2000300924 [Dataset]. https://huggingface.co/datasets/natolambert/rewardbench_eval_2000300924
    Dataset updated
    Oct 1, 2024
    Authors
    Nathan Lambert
    Description

    rewardbench_eval_2000300924: RewardBench CLI Eval. Outputs

    See https://github.com/allenai/reward-bench for more details.
    Built with the rewardbench CLI tool.
    Command used to run: python /home/nathanl/.local/bin/rewardbench --model vwxyzjn/reward_modeling_EleutherAI_pythia-14m --dataset HuggingFaceH4/no_robots --split test --batch_size 128 --tokenizer=EleutherAI/pythia-14m --push_results_to_hub --chat_template oasst_pythia

      Configs
    

    args: {'batch_size': 128… See the full description on the dataset page: https://huggingface.co/datasets/natolambert/rewardbench_eval_2000300924.

  17. helpsteer2-rewardbench-contamination

    • huggingface.co
    Updated Sep 30, 2021
    Cite
    Saumya Malik (2021). helpsteer2-rewardbench-contamination [Dataset]. https://huggingface.co/datasets/saumyamalik/helpsteer2-rewardbench-contamination
    Dataset updated
    Sep 30, 2021
    Authors
    Saumya Malik
    Description

    saumyamalik/helpsteer2-rewardbench-contamination dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. code-python

    • huggingface.co
    Updated Oct 15, 2024
    + more versions
    Cite
    multilingual-reward-bench (2024). code-python [Dataset]. https://huggingface.co/datasets/multilingual-reward-bench/code-python
    Dataset updated
    Oct 15, 2024
    Dataset authored and provided by
    multilingual-reward-bench
    Description

    multilingual-reward-bench/code-python dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. reward-bench-binary-classification

    • huggingface.co
    Updated Dec 15, 2021
    Cite
    Raja Biswas (2021). reward-bench-binary-classification [Dataset]. https://huggingface.co/datasets/rbiswasfc/reward-bench-binary-classification
    Dataset updated
    Dec 15, 2021
    Authors
    Raja Biswas
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Repurposed allenai/reward-bench dataset for binary classification task, using the following script:

    import random

    import pandas as pd
    from datasets import Dataset, load_dataset
    from sklearn.model_selection import GroupKFold

    data_df = load_dataset("allenai/reward-bench", split="raw").to_pandas()

    rng = random.Random(43)

    examples = []
    for idx, row in data_df.iterrows():
        if rng.random() > 0.5:
            response_a = row["chosen"]
            response_b = row["rejected"]
            label = 0
            …

    See the full description on the dataset page: https://huggingface.co/datasets/rbiswasfc/reward-bench-binary-classification.

  20. rewardbench_eval_0328290924

    • huggingface.co
    Updated Sep 29, 2024
    + more versions
    Cite
    Nathan Lambert (2024). rewardbench_eval_0328290924 [Dataset]. https://huggingface.co/datasets/natolambert/rewardbench_eval_0328290924
    Dataset updated
    Sep 29, 2024
    Authors
    Nathan Lambert
    Description

    rewardbench_eval_0328290924: RewardBench CLI Eval. Outputs

    See https://github.com/allenai/rewardbench for more details

      Configs
    

    args: {'batch_size': 128, 'chat_template': 'oasst_pythia', 'dataset': 'HuggingFaceH4/no_robots', 'debug': False, 'force_truncation': False, 'hf_entity': 'natolambert', 'hf_name': 'rewardbench_eval_0328290924', 'load_json': False, 'max_length': 512, 'model': 'vwxyzjn/reward_modeling_EleutherAI_pythia-14m', 'not_quantized': False… See the full description on the dataset page: https://huggingface.co/datasets/natolambert/rewardbench_eval_0328290924.

reward-bench-2

allenai/reward-bench-2

37 scholarly articles cite this dataset (View in Google Scholar)
