9 datasets found
  1. winogrande

    • huggingface.co
    Updated Oct 28, 2022
    Cite
    Ai2 (2022). winogrande [Dataset]. https://huggingface.co/datasets/allenai/winogrande
    Explore at:
    Dataset updated
    Oct 28, 2022
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    Description

    Dataset Card for "winogrande"

      Dataset Summary
    

    WinoGrande is a collection of 44k problems inspired by the Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve scale and robustness against dataset-specific bias. Formulated as a fill-in-the-blank task with binary options, the goal is to choose the right option for a given sentence that requires commonsense reasoning.

      Supported Tasks and Leaderboards
    

    More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.
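    The fill-in-the-blank format can be illustrated with a small sketch. The field names (sentence, option1, option2, answer) follow the Hugging Face dataset card; the item itself is the classic trophy/suitcase Winograd example, used here purely for illustration:

```python
# A WinoGrande-style item (illustrative; field names follow the
# Hugging Face dataset card: sentence, option1, option2, answer).
item = {
    "sentence": "The trophy didn't fit in the suitcase because _ was too large.",
    "option1": "the trophy",
    "option2": "the suitcase",
    "answer": "1",  # "1" selects option1, "2" selects option2
}

def fill_blank(item, choice):
    """Substitute the chosen option into the blank."""
    option = item["option1"] if choice == "1" else item["option2"]
    return item["sentence"].replace("_", option)

print(fill_blank(item, item["answer"]))
```

    A model is scored by comparing the likelihood of the sentence with each option filled in and picking the more probable completion.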

  2. SNU_Ko-WinoGrande

    • huggingface.co
    Updated Aug 20, 2025
    Cite
    THUNDER Research Group (2025). SNU_Ko-WinoGrande [Dataset]. https://huggingface.co/datasets/thunder-research-group/SNU_Ko-WinoGrande
    Explore at:
    Dataset updated
    Aug 20, 2025
    Dataset authored and provided by
    THUNDER Research Group
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Note: Evaluation code for each benchmark dataset is under preparation and will be released soon to support standardized model assessment.

      Dataset Card for Ko-WinoGrande
    
    
    
    
    
      Dataset Summary
    

    Ko-WinoGrande is a Korean adaptation of the WinoGrande dataset, which tests language models' commonsense reasoning through pronoun resolution tasks. Each item is a fill-in-the-blank sentence with two possible antecedents. Models must determine which choice best fits the blank given the… See the full description on the dataset page: https://huggingface.co/datasets/thunder-research-group/SNU_Ko-WinoGrande.

  3. WinoGrande

    • opendatalab.com
    • tensorflow.org
    • +1more
    zip
    Updated Sep 30, 2023
    Cite
    Allen Institute for Artificial Intelligence (2023). WinoGrande [Dataset]. https://opendatalab.com/OpenDataLab/WinoGrande
    Explore at:
    zip (14491841 bytes). Available download formats
    Dataset updated
    Sep 30, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    University of Washington
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    "The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense.

    To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4–79.1%, which is roughly 15–35% (absolute) below human performance of 94.0%, depending on the amount of training data allowed (2%–100%, respectively).

    Furthermore, we establish new state-of-the-art results on five related benchmarks — WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation."

  4. forgetting-contamination-winogrande

    • huggingface.co
    Updated Sep 5, 2025
    Cite
    Sebastian Bordt (2025). forgetting-contamination-winogrande [Dataset]. https://huggingface.co/datasets/sbordt/forgetting-contamination-winogrande
    Explore at:
    Dataset updated
    Sep 5, 2025
    Authors
    Sebastian Bordt
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is a deduplicated subset of the XL train split of WinoGrande, as used in the paper How Much Can We Forget about Data Contamination?. The deduplication was performed using this script. The data fields are the same as in https://huggingface.co/datasets/allenai/winogrande, with the additional "split-id" column that can be used to partition the benchmark questions into different subsets. The dataset can be used as a plug-in replacement if you want to work with the deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/sbordt/forgetting-contamination-winogrande.
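    As a sketch of how a "split-id" column like the one described above could be used to partition benchmark questions into subsets (the rows here are invented placeholders; real rows would come from the Hugging Face dataset):

```python
from collections import defaultdict

# Invented placeholder rows carrying the "split-id" column described above.
rows = [
    {"sentence": "example A", "split-id": 0},
    {"sentence": "example B", "split-id": 1},
    {"sentence": "example C", "split-id": 0},
]

def partition_by_split_id(rows):
    """Group benchmark questions into subsets keyed by their split-id."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row["split-id"]].append(row)
    return dict(subsets)

subsets = partition_by_split_id(rows)
print({k: len(v) for k, v in subsets.items()})  # e.g. {0: 2, 1: 1}
```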

  5. AraDiCE-WinoGrande

    • huggingface.co
    Updated May 18, 2025
    + more versions
    Cite
    Qatar Computing Research Institute (2025). AraDiCE-WinoGrande [Dataset]. https://huggingface.co/datasets/QCRI/AraDiCE-WinoGrande
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 18, 2025
    Dataset authored and provided by
    Qatar Computing Research Institute
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

      Overview
    

    The AraDiCE dataset is designed to evaluate dialectal and cultural capabilities in large language models (LLMs). It consists of post-edited versions of various benchmark datasets, curated for validation in cultural and dialectal contexts relevant to Arabic. This repository contains the WinoGrande split of the data.

      Evaluation
    

    We have used the lm-harness eval framework for… See the full description on the dataset page: https://huggingface.co/datasets/QCRI/AraDiCE-WinoGrande.

  6. Rainbow

    • opendatalab.com
    zip
    Updated Mar 12, 2020
    Cite
    Allen Institute for Artificial Intelligence (2020). Rainbow [Dataset]. https://opendatalab.com/OpenDataLab/Rainbow
    Explore at:
    zip (162849661 bytes). Available download formats
    Dataset updated
    Mar 12, 2020
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rainbow is a multi-task benchmark for commonsense reasoning that draws on six existing QA datasets: aNLI, Cosmos QA, HellaSwag, Physical IQa, Social IQa, and WinoGrande.

  7. fluid-benchmarking

    • huggingface.co
    Updated Sep 15, 2025
    Cite
    Ai2 (2025). fluid-benchmarking [Dataset]. https://huggingface.co/datasets/allenai/fluid-benchmarking
    Explore at:
    Dataset updated
    Sep 15, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fluid Language Model Benchmarking

    This dataset provides IRT models for ARC Challenge, GSM8K, HellaSwag, MMLU, TruthfulQA, and WinoGrande. Furthermore, it contains results for pretraining checkpoints of Amber-6.7B, K2-65B, OLMo1-7B, OLMo2-7B, Pythia-2.8B, and Pythia-6.9B, evaluated on these six benchmarks.

      🚀 Usage
    

    For utilities to use the dataset and to replicate the results from the paper, please see the corresponding GitHub… See the full description on the dataset page: https://huggingface.co/datasets/allenai/fluid-benchmarking.
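    The IRT models mentioned above can be sketched with the textbook two-parameter logistic (2PL) item response function, where an examinee (here, a language model) with ability θ answers an item of discrimination a and difficulty b correctly with probability 1 / (1 + exp(−a(θ − b))). Whether the dataset uses exactly this parameterization is an assumption; the sketch only illustrates the general idea:

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: probability of a correct
    response given ability theta, item discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An item whose difficulty equals the model's ability is answered
# correctly half the time.
print(p_correct(theta=0.0, a=1.0, b=0.0))  # 0.5
```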

  8. slovenian-llm-eval

    • huggingface.co
    Updated Dec 18, 2024
    Cite
    Center za jezikovne vire in tehnologije Univerze v Ljubljani (2024). slovenian-llm-eval [Dataset]. https://huggingface.co/datasets/cjvt/slovenian-llm-eval
    Explore at:
    Dataset updated
    Dec 18, 2024
    Dataset authored and provided by
    Center za jezikovne vire in tehnologije Univerze v Ljubljani
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Slovenian LLM Evaluation Dataset

    This dataset is designed for evaluating Slovenian language models and builds upon gordicaleksa/slovenian-llm-eval-v0, which translated some of the popular English benchmarks into Slovenian using Google Translate. We have further improved the quality of the Slovenian translations. The dataset contains the following benchmarks:

    ARC Challenge, ARC Easy, BoolQ, GSM8K, HellaSwag, NQ Open, OpenBookQA, PIQA, TriviaQA, TruthfulQA, Winogrande… See the full description on the dataset page: https://huggingface.co/datasets/cjvt/slovenian-llm-eval.

  9. Llama-3.1-8B-evals

    • huggingface.co
    Updated Jul 23, 2024
    Cite
    Meta Llama (2024). Llama-3.1-8B-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals
    Explore at:
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Meta (http://meta.com/)
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.1/

    Description

    Dataset Card for Llama-3.1-8B Evaluation Result Details

    This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.

