Dataset Card for "winogrande"
Dataset Summary
WinoGrande is a new collection of 44k problems, inspired by the Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011) but adjusted to improve both scale and robustness against dataset-specific bias. Formulated as a fill-in-the-blank task with binary options, the goal is to choose the correct option for a given sentence that requires commonsense reasoning.
Supported Tasks and Leaderboards
More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.
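For orientation, a minimal loading sketch with the Hugging Face datasets library; the winogrande_xl config and the field names shown are taken from the dataset page (other configs such as winogrande_debiased also exist), and older datasets versions may additionally require trust_remote_code=True:

```python
from datasets import load_dataset

# Load one of the published configurations (xs/s/m/l/xl/debiased).
ds = load_dataset("allenai/winogrande", "winogrande_xl")

example = ds["train"][0]
print(example["sentence"])   # fill-in-the-blank sentence containing "_"
print(example["option1"], example["option2"], example["answer"])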
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: Evaluation code for each benchmark dataset is under preparation and will be released soon to support standardized model assessment.
Dataset Card for Ko-WinoGrande
Dataset Summary
Ko-WinoGrande is a Korean adaptation of the WinoGrande dataset, which tests language models' commonsense reasoning through pronoun resolution tasks. Each item is a fill-in-the-blank sentence with two possible antecedents. Models must determine which choice best fits the blank given the… See the full description on the dataset page: https://huggingface.co/datasets/thunder-research-group/SNU_Ko-WinoGrande.
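A minimal zero-shot scoring sketch for this binary format, assuming the original WinoGrande field names (a sentence with a "_" blank, option1, option2) carry over, and using skt/kogpt2-base-v2 as a stand-in Korean causal LM (any Korean-capable model would work):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in Korean causal LM; swap in any Korean-capable model.
model_name = "skt/kogpt2-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability of a sentence under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token.
    return -out.loss.item() * (ids.shape[1] - 1)

def predict(sentence: str, option1: str, option2: str) -> str:
    """Fill the blank with each option and pick the likelier sentence."""
    lp1 = sentence_logprob(sentence.replace("_", option1))
    lp2 = sentence_logprob(sentence.replace("_", option2))
    return "1" if lp1 >= lp2 else "2"
```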
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
"The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense.
To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4–79.1%, which is ~15–35% (absolute) below human performance of 94.0%, depending on the amount of training data allowed (2%–100% respectively).
Furthermore, we establish new state-of-the-art results on five related benchmarks: WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation."
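To make the AfLite step concrete, here is a rough single-stage sketch of the filtering principle. The paper's actual algorithm is iterative and ensemble-based; the embeddings, labels, and thresholds below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite_filter(X, y, n_rounds=20, train_frac=0.5, cutoff=0.75, seed=0):
    """Drop instances that linear probes over embeddings classify correctly
    too often; what remains is the harder, less biased subset.

    X: (n, d) precomputed embeddings; y: (n,) binary labels (0/1).
    Returns the indices of retained instances.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_correct = np.zeros(n)
    n_seen = np.zeros(n)
    for _ in range(n_rounds):
        # Random train/holdout partition; score only held-out instances.
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        n_correct[te] += clf.predict(X[te]) == y[te]
        n_seen[te] += 1
    predictability = n_correct / np.maximum(n_seen, 1)
    return np.where(predictability < cutoff)[0]
```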
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a deduplicated subset of the XL train split of WinoGrande, as used in the paper How Much Can We Forget about Data Contamination?. The deduplication was performed using this script. The data fields are the same as in https://huggingface.co/datasets/allenai/winogrande, with the additional "split-id" column that can be used to partition the benchmark questions into different subsets. The dataset can be used as a plug-in replacement if you want to work with the deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/sbordt/forgetting-contamination-winogrande.
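A sketch of partitioning the deduplicated benchmark by its "split-id" column; the split name "train" is an assumption, so check the dataset page for specifics:

```python
from datasets import load_dataset

# Load the deduplicated subset and group questions by "split-id".
ds = load_dataset("sbordt/forgetting-contamination-winogrande", split="train")
split_ids = sorted(set(ds["split-id"]))
partitions = {
    sid: ds.filter(lambda ex, s=sid: ex["split-id"] == s)
    for sid in split_ids
}
```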
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
Overview
The AraDiCE dataset is designed to evaluate dialectal and cultural capabilities in large language models (LLMs). The dataset consists of post-edited versions of various benchmark datasets, curated for validation in cultural and dialectal contexts relevant to Arabic. In this repository, we provide the WinoGrande split of the data.
Evaluation
We used the lm-evaluation-harness framework for… See the full description on the dataset page: https://huggingface.co/datasets/QCRI/AraDiCE-WinoGrande.
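As a pointer, a sketch of invoking the harness from Python on the standard winogrande task (a v0.4-style API is assumed; argument names vary across harness versions, and the AraDiCE-specific task identifiers are not listed in this card):

```python
import lm_eval

# Evaluate a Hugging Face model on the standard winogrande task; the
# AraDiCE dialectal task names would replace "winogrande" here and are
# assumptions not listed in this card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["winogrande"],
    num_fewshot=0,
)
print(results["results"]["winogrande"])
```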
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rainbow is a multi-task benchmark for common-sense reasoning that draws on six existing QA datasets: aNLI, Cosmos QA, HellaSWAG, Physical IQa, Social IQa, and WinoGrande.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fluid Language Model Benchmarking
This dataset provides IRT models for ARC Challenge, GSM8K, HellaSwag, MMLU, TruthfulQA, and WinoGrande. Furthermore, it contains results for pretraining checkpoints of Amber-6.7B, K2-65B, OLMo1-7B, OLMo2-7B, Pythia-2.8B, and Pythia-6.9B, evaluated on these six benchmarks.
🚀 Usage
For utilities to use the dataset and to replicate the results from the paper, please see the corresponding GitHub… See the full description on the dataset page: https://huggingface.co/datasets/allenai/fluid-benchmarking.
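For context on what an IRT model provides here, a minimal sketch of the standard two-parameter logistic (2PL) formulation; whether the paper uses exactly this parameterization is an assumption, so see the GitHub repository for specifics:

```python
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response model: probability that a model with ability
    `theta` answers an item with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# A more able model (higher theta) is more likely to solve a hard item:
print(p_correct(theta=1.0, a=1.5, b=0.5))  # ~0.68
```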
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Slovenian LLM Evaluation Dataset
This dataset is designed for evaluating Slovenian language models and builds upon gordicaleksa/slovenian-llm-eval-v0, which translated some popular English benchmarks into Slovenian using Google Translate. We have further improved the quality of the Slovenian translations. The dataset contains the following benchmarks:
ARC Challenge, ARC Easy, BoolQ, GSM8K, HellaSwag, NQ Open, OpenBookQA, PIQA, TriviaQA, TruthfulQA, Winogrande… See the full description on the dataset page: https://huggingface.co/datasets/cjvt/slovenian-llm-eval.
Llama 3.1 Community License: https://choosealicense.com/licenses/llama3.1/
Dataset Card for Llama-3.1-8B Evaluation Result Details
This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.
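Since each task's details live in a timestamped subset, one way to discover and load them is sketched below; the dataset is gated, so Hub authentication and license acceptance may be required:

```python
from datasets import get_dataset_config_names, load_dataset

# Subset names embed upload timestamps, so discover them rather than
# hard-coding. Access may require accepting the Llama license on the Hub.
configs = get_dataset_config_names("meta-llama/Llama-3.1-8B-evals")
wino_cfgs = [c for c in configs if "winogrande" in c.lower()]
details = load_dataset("meta-llama/Llama-3.1-8B-evals", wino_cfgs[0])
```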