16 datasets found
  1. reward-bench-2

    • huggingface.co
    Updated Jun 3, 2025
    Cite
    Ai2 (2025). reward-bench-2 [Dataset]. https://huggingface.co/datasets/allenai/reward-bench-2
    Dataset updated
    Jun 3, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Code | Leaderboard | Results | Paper

      RewardBench 2 Evaluation Dataset Card
    

    The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:

    Factuality (NEW!): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
    Precise Instruction Following (NEW!): Tests the ability of RMs… See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-2.

  2. reward-bench-2

    • huggingface.co
    Updated Jul 11, 2025
    Cite
    David Anugraha (2025). reward-bench-2 [Dataset]. https://huggingface.co/datasets/davidanugraha/reward-bench-2
    Dataset updated
    Jul 11, 2025
    Authors
    David Anugraha
    Description

    davidanugraha/reward-bench-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. reward-bench-2-converted

    • huggingface.co
    Updated Jun 19, 2025
    Cite
    john02171574 (2025). reward-bench-2-converted [Dataset]. https://huggingface.co/datasets/john02171574/reward-bench-2-converted
    Dataset updated
    Jun 19, 2025
    Authors
    john02171574
    Description

    john02171574/reward-bench-2-converted dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. reward-bench-results

    • huggingface.co
    Updated Apr 30, 2025
    Cite
    Ai2 (2025). reward-bench-results [Dataset]. https://huggingface.co/datasets/allenai/reward-bench-results
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    Description

    Results for Holistic Evaluation of Reward Models (HERM) Benchmark

    Here, you'll find the raw scores for the HERM project. The repository is structured as follows.

    ├── best-of-n/                  <- Nested directory for different completions on Best of N challenge
    |   ├── alpaca_eval/                └── results for each reward model
    |   |   ├── tulu-13b/{org}/{model}.json
    |   |   └── zephyr-7b/{org}/{model}.json
    |   └── mt_bench/
    |…

    See the full description on the dataset page: https://huggingface.co/datasets/allenai/reward-bench-results.
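The visible part of the layout above suggests one JSON file per reward model, stored under `best-of-n/<benchmark>/<generator>/{org}/{model}.json`. A minimal sketch of building such a path, assuming that pattern holds for the entries shown; the helper name `best_of_n_path` and the placeholder org and model names are hypothetical, and the listing is truncated, so other directories may follow a different scheme:

```python
from pathlib import PurePosixPath


def best_of_n_path(benchmark: str, generator: str, org: str, model: str) -> PurePosixPath:
    """Build the relative path of one results file (hypothetical helper).

    Mirrors the visible pattern best-of-n/<benchmark>/<generator>/{org}/{model}.json.
    It only formats the path and does not check that the file exists.
    """
    return PurePosixPath("best-of-n") / benchmark / generator / org / f"{model}.json"


# Placeholder org/model names, for illustration only.
example = best_of_n_path("alpaca_eval", "tulu-13b", "some-org", "some-model")
print(example)
```

Using `PurePosixPath` keeps the forward-slash separators of the repository layout regardless of the local operating system.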

  5. reward-bench-2-results

    • huggingface.co
    Updated Jun 3, 2025
    Cite
    Ai2 (2025). reward-bench-2-results [Dataset]. https://huggingface.co/datasets/allenai/reward-bench-2-results
    Dataset updated
    Jun 3, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    Description

    allenai/reward-bench-2-results dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. multilingual-reward-bench

    • huggingface.co
    Updated May 15, 2025
    Cite
    Cohere Labs Community (2025). multilingual-reward-bench [Dataset]. http://doi.org/10.57967/hf/3352
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Cohere Labs Community
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Multilingual Reward Bench (v1.0)

    Reward models (RMs) have driven the development of state-of-the-art LLMs today, with unprecedented impact across the globe. However, their performance in multilingual settings still remains understudied. In order to probe reward model behavior on multilingual data, we present M-RewardBench, a benchmark for 23 typologically diverse languages. M-RewardBench contains prompt-chosen-rejected preference triples obtained by curating and translating chat… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabsCommunity/multilingual-reward-bench.

  7. fc-reward-bench

    • huggingface.co
    Cite
    IBM Research, fc-reward-bench [Dataset]. https://huggingface.co/datasets/ibm-research/fc-reward-bench
    Dataset provided by
    IBM (http://ibm.com/)
    IBM Research
    Authors
    IBM Research
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    fc-reward-bench

    fc-reward-bench is a benchmark designed to evaluate reward model performance in function-calling tasks. It features 1,500 unique user inputs derived from the single-turn splits of the BFCL-v3 dataset. Each input is paired with both correct and incorrect function calls. Correct calls are sourced directly from BFCL, while incorrect calls are generated by 25 permissively licensed models.

      Dataset Structure
    

    Each entry in the dataset includes the following… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/fc-reward-bench.

  8. agent-reward-bench

    • huggingface.co
    Updated Apr 15, 2025
    Cite
    McGill NLP Group (2025). agent-reward-bench [Dataset]. https://huggingface.co/datasets/McGill-NLP/agent-reward-bench
    Dataset updated
    Apr 15, 2025
    Dataset authored and provided by
    McGill NLP Group
    Description

    AgentRewardBench

    💾Code 📄Paper 🌐Website

    🤗Dataset 💻Demo 🏆Leaderboard

    AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

    Xing Han Lù, Amirhossein Kazemnejad*, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy

    *Core Contributor

      Loading dataset
    

    You can use the huggingface_hub library to load the dataset. The dataset is available on Huggingface Hub at… See the full description on the dataset page: https://huggingface.co/datasets/McGill-NLP/agent-reward-bench.

  9. reward-bench-Llama-2-13b-hf-yes-no

    • huggingface.co
    Updated Jan 14, 2025
    Cite
    Ayush Singh (2025). reward-bench-Llama-2-13b-hf-yes-no [Dataset]. https://huggingface.co/datasets/Ayush-Singh/reward-bench-Llama-2-13b-hf-yes-no
    Dataset updated
    Jan 14, 2025
    Authors
    Ayush Singh
    Description

    Ayush-Singh/reward-bench-Llama-2-13b-hf-yes-no dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. reward-bench-gemma-2-2b-it-set3-scores

    • huggingface.co
    Cite
    Ayush Singh, reward-bench-gemma-2-2b-it-set3-scores [Dataset]. https://huggingface.co/datasets/Ayush-Singh/reward-bench-gemma-2-2b-it-set3-scores
    Authors
    Ayush Singh
    Description

    Ayush-Singh/reward-bench-gemma-2-2b-it-set3-scores dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. RM-Bench-chat-Skywork-Reward-Llama-3.1-8B-v0.2-normal

    • huggingface.co
    Updated Jan 11, 2025
    Cite
    Ayush Singh (2025). RM-Bench-chat-Skywork-Reward-Llama-3.1-8B-v0.2-normal [Dataset]. https://huggingface.co/datasets/Ayush-Singh/RM-Bench-chat-Skywork-Reward-Llama-3.1-8B-v0.2-normal
    Dataset updated
    Jan 11, 2025
    Authors
    Ayush Singh
    Description

    Ayush-Singh/RM-Bench-chat-Skywork-Reward-Llama-3.1-8B-v0.2-normal dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. reward-general-bench

    • huggingface.co
    Updated May 15, 2024
    Cite
    Yannik Krone (2024). reward-general-bench [Dataset]. https://huggingface.co/datasets/NaykinYT/reward-general-bench
    Dataset updated
    May 15, 2024
    Authors
    Yannik Krone
    Description

    NaykinYT/reward-general-bench dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. MM-RLHF-RewardBench

    • huggingface.co
    Updated Feb 17, 2025
    Cite
    Yi-Fan Zhang (2025). MM-RLHF-RewardBench [Dataset]. https://huggingface.co/datasets/yifanzhang114/MM-RLHF-RewardBench
    Dataset updated
    Feb 17, 2025
    Authors
    Yi-Fan Zhang
    Description

    [📖 arXiv Paper] [📊 MM-RLHF Data] [📝 Homepage] [🏆 Reward Model] [🔮 MM-RewardBench] [🔮 MM-SafetyBench] [📈 Evaluation Suite]

      The Next Step Forward in Multimodal LLM Alignment
    

    [2025/02/10] 🔥 We are proud to open-source MM-RLHF, a comprehensive project for aligning Multimodal Large Language Models (MLLMs) with human preferences. This release includes:

    A high-quality MLLM alignment dataset. A strong Critique-Based MLLM reward model and its training algorithm. A novel… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/MM-RLHF-RewardBench.

  14. R1-Reward-RL

    • huggingface.co
    Updated May 6, 2025
    Cite
    Yi-Fan Zhang (2025). R1-Reward-RL [Dataset]. https://huggingface.co/datasets/yifanzhang114/R1-Reward-RL
    Dataset updated
    May 6, 2025
    Authors
    Yi-Fan Zhang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    [📖 arXiv Paper] [📊 R1-Reward Code] [📝 R1-Reward Model]

      Training Multimodal Reward Model Through Stable Reinforcement Learning
    

    🔥 We are proud to open-source R1-Reward, a comprehensive project for improving reward modeling through reinforcement learning. This release includes:

    R1-Reward Model: A state-of-the-art (SOTA) multimodal reward model demonstrating substantial gains (Voting@15): 13.5% improvement on VL Reward-Bench. 3.5% improvement on MM-RLHF Reward-Bench.… See the full description on the dataset page: https://huggingface.co/datasets/yifanzhang114/R1-Reward-RL.

  15. RM-Bench-safety-response-Skywork-Reward-Llama-3.1-8B-v0.2-normal

    • huggingface.co
    Updated Jan 11, 2025
    Cite
    Ayush Singh (2025). RM-Bench-safety-response-Skywork-Reward-Llama-3.1-8B-v0.2-normal [Dataset]. https://huggingface.co/datasets/Ayush-Singh/RM-Bench-safety-response-Skywork-Reward-Llama-3.1-8B-v0.2-normal
    Dataset updated
    Jan 11, 2025
    Authors
    Ayush Singh
    Description

    Ayush-Singh/RM-Bench-safety-response-Skywork-Reward-Llama-3.1-8B-v0.2-normal dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. Libra-Bench

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    meituan (2025). Libra-Bench [Dataset]. https://huggingface.co/datasets/meituan/Libra-Bench
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    Meituan (https://www.aia.com/)
    Authors
    meituan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Libra Bench

      Overview
    

    Libra Bench is a sophisticated, reasoning-oriented reward model (RM) benchmark, systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models. The Libra Bench is specifically designed to evaluate pointwise judging accuracy with respect to correctness. These attributes ensure that Libra Bench is well aligned with contemporary research, where reasoning models are primarily assessed and optimized… See the full description on the dataset page: https://huggingface.co/datasets/meituan/Libra-Bench.


reward-bench-2

allenai/reward-bench-2

33 scholarly articles cite this dataset