Dataset Card for "winogrande"
Dataset Summary
WinoGrande is a new collection of 44k problems, inspired by the Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011) but adjusted to improve both scale and robustness against dataset-specific bias. Formulated as a fill-in-the-blank task with binary options, the goal is to choose the correct option for a given sentence that requires commonsense reasoning.
Supported Tasks and Leaderboards
More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.
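For orientation, a minimal loading sketch with the Hugging Face datasets library; the winogrande_xl config and the field names shown are taken from the dataset page (other configs such as winogrande_debiased also exist), and older datasets versions may additionally require trust_remote_code=True:

```python
from datasets import load_dataset

# Load one of the published configurations (xs/s/m/l/xl/debiased).
ds = load_dataset("allenai/winogrande", "winogrande_xl")

example = ds["train"][0]
print(example["sentence"])   # fill-in-the-blank sentence containing "_"
print(example["option1"], example["option2"], example["answer"])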
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: Evaluation code for each benchmark dataset is under preparation and will be released soon to support standardized model assessment.
Dataset Card for Ko-WinoGrande
Dataset Summary
Ko-WinoGrande is a Korean adaptation of the WinoGrande dataset, which tests language models' commonsense reasoning through pronoun resolution tasks. Each item is a fill-in-the-blank sentence with two possible antecedents. Models must determine which choice best fits the blank given the… See the full description on the dataset page: https://huggingface.co/datasets/thunder-research-group/SNU_Ko-WinoGrande.
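A minimal zero-shot scoring sketch for this binary format, assuming the original WinoGrande field names (a sentence with a "_" blank, option1, option2) carry over, and using skt/kogpt2-base-v2 as a stand-in Korean causal LM (any Korean-capable model would work):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in Korean causal LM; swap in any Korean-capable model.
model_name = "skt/kogpt2-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability of a sentence under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token.
    return -out.loss.item() * (ids.shape[1] - 1)

def predict(sentence: str, option1: str, option2: str) -> str:
    """Fill the blank with each option and pick the likelier sentence."""
    lp1 = sentence_logprob(sentence.replace("_", option1))
    lp2 = sentence_logprob(sentence.replace("_", option2))
    return "1" if lp1 >= lp2 else "2"
```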
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
"The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense.
To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4–79.1%, which is ~15–35% (absolute) below human performance of 94.0%, depending on the amount of training data allowed (2%–100% respectively).
Furthermore, we establish new state-of-the-art results on five related benchmarks: WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation."
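To make the AfLite step concrete, here is a rough single-stage sketch of the filtering principle. The paper's actual algorithm is iterative and ensemble-based; the embeddings, labels, and thresholds below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite_filter(X, y, n_rounds=20, train_frac=0.5, cutoff=0.75, seed=0):
    """Drop instances that linear probes over embeddings classify correctly
    too often; what remains is the harder, less biased subset.

    X: (n, d) precomputed embeddings; y: (n,) binary labels (0/1).
    Returns the indices of retained instances.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_correct = np.zeros(n)
    n_seen = np.zeros(n)
    for _ in range(n_rounds):
        # Random train/holdout partition; score only held-out instances.
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        n_correct[te] += clf.predict(X[te]) == y[te]
        n_seen[te] += 1
    predictability = n_correct / np.maximum(n_seen, 1)
    return np.where(predictability < cutoff)[0]
```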
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a deduplicated subset of the XL train split of WinoGrande, as used in the paper How Much Can We Forget about Data Contamination?. The deduplication was performed using this script. The data fields are the same as in https://huggingface.co/datasets/allenai/winogrande, with the additional "split-id" column that can be used to partition the benchmark questions into different subsets. The dataset can be used as a plug-in replacement if you want to work with the deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/sbordt/forgetting-contamination-winogrande.
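A sketch of partitioning the deduplicated benchmark by its "split-id" column; the split name "train" is an assumption, so check the dataset page for specifics:

```python
from datasets import load_dataset

# Load the deduplicated subset and group questions by "split-id".
ds = load_dataset("sbordt/forgetting-contamination-winogrande", split="train")
split_ids = sorted(set(ds["split-id"]))
partitions = {
    sid: ds.filter(lambda ex, s=sid: ex["split-id"] == s)
    for sid in split_ids
}
```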
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
Overview
The AraDiCE dataset is designed to evaluate dialectal and cultural capabilities in large language models (LLMs). The dataset consists of post-edited versions of various benchmark datasets, curated for validation in cultural and dialectal contexts relevant to Arabic. In this repository, we provide the WinoGrande split of the data.
Evaluation
We used the lm-evaluation-harness framework for… See the full description on the dataset page: https://huggingface.co/datasets/QCRI/AraDiCE-WinoGrande.
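As a pointer, a sketch of invoking the harness from Python on the standard winogrande task (a v0.4-style API is assumed; argument names vary across harness versions, and the AraDiCE-specific task identifiers are not listed in this card):

```python
import lm_eval

# Evaluate a Hugging Face model on the standard winogrande task; the
# AraDiCE dialectal task names would replace "winogrande" here and are
# assumptions not listed in this card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["winogrande"],
    num_fewshot=0,
)
print(results["results"]["winogrande"])
```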
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rainbow is a multi-task benchmark for common-sense reasoning that draws on six existing QA datasets: aNLI, Cosmos QA, HellaSWAG, Physical IQa, Social IQa, and WinoGrande.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fluid Language Model Benchmarking
This dataset provides IRT models for ARC Challenge, GSM8K, HellaSwag, MMLU, TruthfulQA, and WinoGrande. Furthermore, it contains results for pretraining checkpoints of Amber-6.7B, K2-65B, OLMo1-7B, OLMo2-7B, Pythia-2.8B, and Pythia-6.9B, evaluated on these six benchmarks.
🚀 Usage
For utilities to use the dataset and to replicate the results from the paper, please see the corresponding GitHub… See the full description on the dataset page: https://huggingface.co/datasets/allenai/fluid-benchmarking.
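For context on what an IRT model provides here, a minimal sketch of the standard two-parameter logistic (2PL) formulation; whether the paper uses exactly this parameterization is an assumption, so see the GitHub repository for specifics:

```python
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response model: probability that a model with ability
    `theta` answers an item with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# A more able model (higher theta) is more likely to solve a hard item:
print(p_correct(theta=1.0, a=1.5, b=0.5))  # ~0.68
```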
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Slovenian LLM Evaluation Dataset
This dataset is designed for evaluating Slovenian language models and builds upon gordicaleksa/slovenian-llm-eval-v0, which translated some popular English benchmarks into Slovenian using Google Translate. We have further improved the quality of the Slovenian translations. The dataset contains the following benchmarks:
ARC Challenge, ARC Easy, BoolQ, GSM8K, HellaSwag, NQ Open, OpenBookQA, PIQA, TriviaQA, TruthfulQA, Winogrande… See the full description on the dataset page: https://huggingface.co/datasets/cjvt/slovenian-llm-eval.
Llama 3.1 Community License: https://choosealicense.com/licenses/llama3.1/
Dataset Card for Llama-3.1-8B Evaluation Result Details
This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.
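Since each task's details live in a timestamped subset, one way to discover and load them is sketched below; the dataset is gated, so Hub authentication and license acceptance may be required:

```python
from datasets import get_dataset_config_names, load_dataset

# Subset names embed upload timestamps, so discover them rather than
# hard-coding. Access may require accepting the Llama license on the Hub.
configs = get_dataset_config_names("meta-llama/Llama-3.1-8B-evals")
wino_cfgs = [c for c in configs if "winogrande" in c.lower()]
details = load_dataset("meta-llama/Llama-3.1-8B-evals", wino_cfgs[0])
```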