100+ datasets found
  1. evaluation-results

    • huggingface.co
    Updated Aug 31, 2022
    Cite
    BigScience Workshop (2022). evaluation-results [Dataset]. https://huggingface.co/datasets/bigscience/evaluation-results
    6 scholarly articles cite this dataset (per Google Scholar)
    Dataset updated
    Aug 31, 2022
    Dataset authored and provided by
    BigScience Workshop
    Description

    @misc{muennighoff2022crosslingual,
      title={Crosslingual Generalization through Multitask Finetuning},
      author={Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel},
      year={2022},
      eprint={2211.01786},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

  2. Evaluation

    • huggingface.co
    Updated Sep 29, 2024
    + more versions
    Cite
    Lunlin Yang (2024). Evaluation [Dataset]. https://huggingface.co/datasets/ygl1020/Evaluation
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 29, 2024
    Authors
    Lunlin Yang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    ygl1020/Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. evaluate-dependents

    • huggingface.co
    Updated Jul 19, 2023
    + more versions
    Cite
    Hugging Face OSS Metrics (2023). evaluate-dependents [Dataset]. https://huggingface.co/datasets/open-source-metrics/evaluate-dependents
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face OSS Metrics
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    evaluate metrics

    This dataset contains metrics about the huggingface/evaluate package.
    Number of repositories in the dataset: 106
    Number of packages in the dataset: 3

      Package dependents
    

    This contains the data available in the used-by tab on GitHub.

      Package & Repository star count
    

    This section shows the package and repository star count, individually.

    Package Repository

    There is 1 package with more than 1000 stars. There are 2 repositories… See the full description on the dataset page: https://huggingface.co/datasets/open-source-metrics/evaluate-dependents.
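The star-count statistics described above reduce to counting entries over a threshold. A minimal sketch; the package and repository names and star counts below are invented for illustration, not taken from the dataset:

```python
# Hypothetical star counts illustrating how the "more than 1000 stars"
# statistics could be computed. Names and numbers are made up.
package_stars = {"evaluate": 1800, "pkg-a": 120, "pkg-b": 40}
repo_stars = {"repo-x": 2500, "repo-y": 1300, "repo-z": 90}

def count_over(stars, threshold=1000):
    """Count entries whose star count exceeds the threshold."""
    return sum(1 for s in stars.values() if s > threshold)

print(count_over(package_stars))  # 1
print(count_over(repo_stars))     # 2
```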

  4. evaluation

    • huggingface.co
    Updated May 4, 2023
    + more versions
    Cite
    BigCode (2023). evaluation [Dataset]. https://huggingface.co/datasets/bigcode/evaluation
    Dataset updated
    May 4, 2023
    Dataset authored and provided by
    BigCode
    Description

    bigcode/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. media

    • huggingface.co
    Updated Jun 27, 2022
    + more versions
    Cite
    evaluate (2022). media [Dataset]. https://huggingface.co/datasets/evaluate/media
    Dataset updated
    Jun 27, 2022
    Dataset authored and provided by
    evaluate
    Description

    evaluate/media dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Data from: MERA

    • huggingface.co
    Updated Mar 17, 2025
    Cite
    MERA (2025). MERA [Dataset]. https://huggingface.co/datasets/MERA-evaluation/MERA
    Dataset updated
    Mar 17, 2025
    Authors
    MERA
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MERA (Multimodal Evaluation for Russian-language Architectures)

      Summary
    

    MERA (Multimodal Evaluation for Russian-language Architectures) is a new open independent benchmark for the evaluation of SOTA models for the Russian language. The MERA benchmark unites industry and academic partners in one place to research the capabilities of fundamental models, draw attention to AI-related issues, foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/MERA.

  7. seven-wonders-eval

    • huggingface.co
    Updated Sep 28, 2024
    Cite
    ZanSara (2024). seven-wonders-eval [Dataset]. https://huggingface.co/datasets/ZanSara/seven-wonders-eval
    Dataset updated
    Sep 28, 2024
    Authors
    ZanSara
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Seven Wonders (Evaluation Dataset)

    Eval dataset generated with RAGAS for https://huggingface.co/datasets/ZanSara/seven-wonders Original data from https://huggingface.co/datasets/bilgeyucel/seven-wonders

  8. glue

    • huggingface.co
    • tensorflow.google.cn
    • +1more
    Updated Mar 6, 2024
    + more versions
    Cite
    NYU Machine Learning for Language (2024). glue [Dataset]. https://huggingface.co/datasets/nyu-mll/glue
    Dataset updated
    Mar 6, 2024
    Dataset authored and provided by
    NYU Machine Learning for Language
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for GLUE

      Dataset Summary
    

    GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

      Supported Tasks and Leaderboards
    

    The leaderboard for the GLUE benchmark can be found on the GLUE website. The benchmark comprises the following tasks:

      ax
    

    A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
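Multi-task benchmarks like GLUE typically collapse per-task results into one headline number via an unweighted macro-average. A minimal sketch; the task names are real GLUE tasks, but the scores below are invented for illustration:

```python
# Unweighted macro-average over per-task scores, the usual way a
# GLUE-style benchmark reports a single number. Scores are invented.
task_scores = {
    "cola": 0.60, "sst2": 0.94, "mrpc": 0.88, "qqp": 0.90,
    "mnli": 0.86, "qnli": 0.91, "rte": 0.70, "wnli": 0.56,
}

def macro_average(scores):
    """Mean over tasks, each task weighted equally."""
    return sum(scores.values()) / len(scores)

print(macro_average(task_scores))
```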

  9. fleurs

    • huggingface.co
    • opendatalab.com
    Updated Jun 4, 2022
    + more versions
    Cite
    Google (2022). fleurs [Dataset]. https://huggingface.co/datasets/google/fleurs
    Dataset updated
    Jun 4, 2022
    Dataset authored and provided by
    Google: http://google.com/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FLEURS

    Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers in the train sets are different from the speakers in the dev/test sets. Multilingual fine-tuning is used, and the "unit error rate" (characters, signs) of all languages is averaged. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
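The "unit error rate" averaged by FLEURS is an edit-distance-based metric. A minimal character-error-rate sketch (standard Levenshtein distance divided by reference length); the example strings are invented:

```python
# Character error rate, the kind of per-language "unit error rate"
# a speech benchmark like FLEURS averages across languages.
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

print(cer("bonjour", "bonsoir"))  # 2 edits over 7 reference characters
```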

  10. Llama-3.1-8B-evals

    • huggingface.co
    Updated Jul 23, 2024
    Cite
    Meta Llama (2024). Llama-3.1-8B-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Meta: http://meta.com/
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.1/

    Description

    Dataset Card for Llama-3.1-8B Evaluation Result Details

    This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.

  11. eka-medical-asr-evaluation-dataset

    • huggingface.co
    Updated Nov 14, 2025
    Cite
    Eka Care (2025). eka-medical-asr-evaluation-dataset [Dataset]. https://huggingface.co/datasets/ekacare/eka-medical-asr-evaluation-dataset
    Dataset updated
    Nov 14, 2025
    Dataset authored and provided by
    Eka Care
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Eka Medical ASR Evaluation Dataset

      Dataset Overview and Sourcing
    

    The Eka Medical ASR Evaluation Dataset enables comprehensive evaluation of automatic speech recognition systems designed to transcribe medical speech into accurate text, a fundamental component of any medical scribe system. This dataset captures the unique challenges of processing medical terminology, particularly branded drug names specific to the Indian context. The dataset comprises 3,900+… See the full description on the dataset page: https://huggingface.co/datasets/ekacare/eka-medical-asr-evaluation-dataset.

  12. wmt-mqm-human-evaluation

    • huggingface.co
    Updated Feb 16, 2024
    Cite
    Ricardo Rei (2024). wmt-mqm-human-evaluation [Dataset]. https://huggingface.co/datasets/RicardoRei/wmt-mqm-human-evaluation
    Dataset updated
    Feb 16, 2024
    Authors
    Ricardo Rei
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Summary

    This dataset contains all MQM human annotations from previous WMT Metrics shared tasks and the MQM annotations from Experts, Errors, and Context. The data is organised into 8 columns:

    lp: language pair
    src: input text
    mt: translation
    ref: reference translation
    score: MQM score
    system: MT engine that produced the translation
    annotators: number of annotators
    domain: domain of the input text (e.g. news)
    year: collection year

    You can also find the original data here.… See the full description on the dataset page: https://huggingface.co/datasets/RicardoRei/wmt-mqm-human-evaluation.
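Rows with the eight columns listed above are easy to slice once loaded, e.g. averaging MQM scores per language pair. A minimal sketch; the sample rows below are invented for illustration, not taken from the dataset:

```python
# Hypothetical rows shaped like the dataset's 8 columns
# (lp, src, mt, ref, score, system, annotators, domain, year).
rows = [
    {"lp": "en-de", "src": "Hello", "mt": "Hallo", "ref": "Hallo",
     "score": 0.0, "system": "sysA", "annotators": 1,
     "domain": "news", "year": 2022},
    {"lp": "en-de", "src": "Good night", "mt": "Gute Nacht!",
     "ref": "Gute Nacht", "score": -1.0, "system": "sysB",
     "annotators": 1, "domain": "news", "year": 2022},
    {"lp": "zh-en", "src": "你好", "mt": "Hello", "ref": "Hello",
     "score": 0.0, "system": "sysA", "annotators": 2,
     "domain": "news", "year": 2022},
]

def mean_score_by_lp(rows, lp):
    """Average MQM score over rows for one language pair."""
    scores = [r["score"] for r in rows if r["lp"] == lp]
    return sum(scores) / len(scores)

print(mean_score_by_lp(rows, "en-de"))  # -0.5
```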

  13. InfiBench

    • huggingface.co
    Updated Jun 12, 2024
    Cite
    Linyi Li (2024). InfiBench [Dataset]. http://doi.org/10.57967/hf/2474
    Dataset updated
    Jun 12, 2024
    Authors
    Linyi Li
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    InfiBench (Data Part)

    Note: for the full description, please visit our main website https://infi-coder.github.io/infibench. This repo contains all data of our code LLM evaluation dataset InfiBench. suite_v2.1.yaml lists the cases, and suite_v2.1_data.csv records all data (prompt, reference answer, evaluation metric). The data can be consumed directly by our automatic evaluation tool to evaluate any model's responses.

      Dataset Card
    

    Name: InfiBench
    Description: Evaluation… See the full description on the dataset page: https://huggingface.co/datasets/llylly001/InfiBench.
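Since the description says suite_v2.1_data.csv holds one record per case (prompt, reference answer, evaluation metric), such rows can be read with the standard csv module. The header names and sample row below are assumptions for illustration, not the dataset's actual schema:

```python
import csv
import io

# In-memory CSV shaped like the description above; header names
# (prompt, reference_answer, metric) are assumed, not verified.
sample = io.StringIO(
    "prompt,reference_answer,metric\n"
    '"How do I reverse a list in Python?","lst[::-1]",keyword_match\n'
)
rows = list(csv.DictReader(sample))
print(rows[0]["reference_answer"])  # lst[::-1]
```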

  14. Medical_Multimodal_Evaluation_Data

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    FreedomAI (2025). Medical_Multimodal_Evaluation_Data [Dataset]. https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    FreedomAI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Evaluation Guide

    This dataset is used to evaluate medical multimodal LLMs, as used in HuatuoGPT-Vision. It includes benchmarks such as VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, and MMMU-Medical-Tracks.
    To get started:

    Download the dataset and extract the images.zip file.
    Find evaluation code on our GitHub: HuatuoGPT-Vision.

    This open-source release aims to simplify the evaluation of medical multimodal capabilities in large models. Please cite the relevant benchmark… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data.

  15. model-written-evals

    • huggingface.co
    Updated Jan 12, 2023
    Cite
    Anthropic (2023). model-written-evals [Dataset]. https://huggingface.co/datasets/Anthropic/model-written-evals
    Dataset updated
    Jan 12, 2023
    Dataset authored and provided by
    Anthropic: https://anthropic.com/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model-Written Evaluation Datasets

    This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to:

    Those who are interested in understanding the quality and properties of model-generated data
    Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/model-written-evals.

  16. Llama-3.2-3B-evals

    • huggingface.co
    + more versions
    Cite
    Meta Llama, Llama-3.2-3B-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-evals
    Dataset provided by
    Meta: http://meta.com/
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.2/

    Description

    Dataset Card for Meta Evaluation Result Details for Llama-3.2-3B

    This dataset contains the Meta evaluation result details for Llama-3.2-3B. The dataset has been created from 8 evaluation tasks: needle_in_haystack, mmlu, squad, quac, drop, arc_challenge, multi_needle, agieval_english. Each task detail can be found as a specific subset in each configuration, and each subset is named using the task name plus the timestamp of the upload time and ends with… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-evals.

  17. T-Eval

    • huggingface.co
    Updated Feb 18, 2024
    Cite
    Zehui Chen (2024). T-Eval [Dataset]. https://huggingface.co/datasets/lovesnowbest/T-Eval
    Dataset updated
    Feb 18, 2024
    Authors
    Zehui Chen
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

      ✨ Introduction
    

    This is an evaluation harness for the benchmark described in T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step. [Paper] [Project Page] [LeaderBoard] [HuggingFace]

    Large language models (LLM) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and… See the full description on the dataset page: https://huggingface.co/datasets/lovesnowbest/T-Eval.

  18. toxicity_benchmark-evaluation

    • huggingface.co
    Updated Sep 12, 2024
    + more versions
    Cite
    Onepane.ai (2024). toxicity_benchmark-evaluation [Dataset]. https://huggingface.co/datasets/onepaneai/toxicity_benchmark-evaluation
    Dataset updated
    Sep 12, 2024
    Dataset provided by
    Onepane.ai
    Description

    onepaneai/toxicity_benchmark-evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. sla-evaluate

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Speech and Machine Intelligence Laboratory (2025). sla-evaluate [Dataset]. https://huggingface.co/datasets/ntnu-smil/sla-evaluate
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Speech and Machine Intelligence Laboratory
    Description

    ntnu-smil/sla-evaluate dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. LLaVA-UHD-v2-Evaluation

    • huggingface.co
    Updated Mar 5, 2025
    Cite
    YipengZhang (2025). LLaVA-UHD-v2-Evaluation [Dataset]. https://huggingface.co/datasets/YipengZhang/LLaVA-UHD-v2-Evaluation
    Dataset updated
    Mar 5, 2025
    Authors
    YipengZhang
    Description

    YipengZhang/LLaVA-UHD-v2-Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
