HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy).
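A minimal sketch of loading the dataset with the Hugging Face `datasets` library; the hub ID "Rowan/hellaswag" and the field names shown are assumptions based on the commonly published copy, not confirmed by this card.

```python
from datasets import load_dataset

# Load the validation split (hub ID assumed)
hellaswag = load_dataset("Rowan/hellaswag", split="validation")

ex = hellaswag[0]
print(ex["ctx"])      # context to be completed (assumed field name)
print(ex["endings"])  # candidate endings; "label" marks the correct one (assumed)
```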
MIT License: https://opensource.org/licenses/MIT
Dataset Summary
Existing LLM-based tools and coding agents respond to every issue and generate a patch for every case, even when the input is vague or their own output is incorrect; they have no mechanism to abstain when confidence is low. BouncerBench tests whether AI agents know when not to act.
This is one of three datasets released as part of the paper "Is Your Automated Software Engineer Trustworthy?".
input_bouncer: tasks on bug-report text. The model decides if a report is… See the full description on the dataset page: https://huggingface.co/datasets/uw-swag/bouncerbench-lite.
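A minimal sketch of loading this dataset with the `datasets` library, assuming "input_bouncer" (taken from the task description above) is a valid config name; the field access is illustrative only.

```python
from datasets import load_dataset

# Hub ID from the card above; the config name is an assumption
bouncer = load_dataset("uw-swag/bouncerbench-lite", "input_bouncer")
print(bouncer)  # inspect available splits and columns before use
```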
License: unknown (https://choosealicense.com/licenses/unknown/)
Dataset Card for Situations With Adversarial Generations
Dataset Summary
Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate what might come next ("then, she examined the engine"). SWAG (Situations With Adversarial Generations) is a large-scale dataset for this task of grounded commonsense inference, unifying natural language inference and physically grounded reasoning. The dataset consists of 113k… See the full description on the dataset page: https://huggingface.co/datasets/allenai/swag.
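A minimal sketch of loading SWAG with the `datasets` library; the "regular" config name and the field names (startphrase, ending0–ending3, label) are assumptions based on the commonly published hub copy.

```python
from datasets import load_dataset

# Load the validation split of the assumed "regular" config
swag = load_dataset("allenai/swag", "regular", split="validation")

ex = swag[0]
print(ex["startphrase"])                      # partial description of a situation
print([ex[f"ending{i}"] for i in range(4)])   # four candidate continuations
print(ex["label"])                            # index of the gold ending (assumed)
```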