100+ datasets found
  1. evaluation-results

    • huggingface.co
    Updated Aug 31, 2022
    Cite
    BigScience Workshop (2022). evaluation-results [Dataset]. https://huggingface.co/datasets/bigscience/evaluation-results
    6 scholarly articles cite this dataset (per Google Scholar)
    Dataset updated
    Aug 31, 2022
    Dataset authored and provided by
    BigScience Workshop
    Description

    @misc{muennighoff2022crosslingual,
      title={Crosslingual Generalization through Multitask Finetuning},
      author={Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel},
      year={2022},
      eprint={2211.01786},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

  2. Evaluation

    • huggingface.co
    Updated Sep 29, 2024
    + more versions
    Cite
    Lunlin Yang (2024). Evaluation [Dataset]. https://huggingface.co/datasets/ygl1020/Evaluation
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 29, 2024
    Authors
    Lunlin Yang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    ygl1020/Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. evaluate-dependents

    • huggingface.co
    Updated Jul 19, 2023
    + more versions
    Cite
    Hugging Face OSS Metrics (2023). evaluate-dependents [Dataset]. https://huggingface.co/datasets/open-source-metrics/evaluate-dependents
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face OSS Metrics
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    evaluate metrics

    This dataset contains metrics about the huggingface/evaluate package.
    Number of repositories in the dataset: 106
    Number of packages in the dataset: 3

      Package dependents
    

    This contains the data available in the used-by tab on GitHub.

      Package & Repository star count
    

    This section shows the package and repository star count, individually.

    Package Repository

    There is 1 package with more than 1000 stars. There are 2 repositories… See the full description on the dataset page: https://huggingface.co/datasets/open-source-metrics/evaluate-dependents.
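The star-count statistics described above reduce to counting entries over a threshold. A minimal sketch; the package and repository names and star counts below are invented for illustration, not taken from the dataset:

```python
# Hypothetical star counts illustrating how the "more than 1000 stars"
# statistics could be computed. Names and numbers are made up.
package_stars = {"evaluate": 1800, "pkg-a": 120, "pkg-b": 40}
repo_stars = {"repo-x": 2500, "repo-y": 1300, "repo-z": 90}

def count_over(stars, threshold=1000):
    """Count entries whose star count exceeds the threshold."""
    return sum(1 for s in stars.values() if s > threshold)

print(count_over(package_stars))  # 1
print(count_over(repo_stars))     # 2
```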

  4. evaluation

    • huggingface.co
    Updated May 4, 2023
    + more versions
    Cite
    BigCode (2023). evaluation [Dataset]. https://huggingface.co/datasets/bigcode/evaluation
    Dataset updated
    May 4, 2023
    Dataset authored and provided by
    BigCode
    Description

    bigcode/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. media

    • huggingface.co
    Updated Jun 27, 2022
    + more versions
    Cite
    evaluate (2022). media [Dataset]. https://huggingface.co/datasets/evaluate/media
    Dataset updated
    Jun 27, 2022
    Dataset authored and provided by
    evaluate
    Description

    evaluate/media dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Data from: MERA

    • huggingface.co
    Updated Mar 17, 2025
    Cite
    MERA (2025). MERA [Dataset]. https://huggingface.co/datasets/MERA-evaluation/MERA
    Dataset updated
    Mar 17, 2025
    Authors
    MERA
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MERA (Multimodal Evaluation for Russian-language Architectures)

      Summary
    

    MERA (Multimodal Evaluation for Russian-language Architectures) is a new open independent benchmark for the evaluation of SOTA models for the Russian language. The MERA benchmark unites industry and academic partners in one place to research the capabilities of fundamental models, draw attention to AI-related issues, foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/MERA.

  7. seven-wonders-eval

    • huggingface.co
    Updated Sep 28, 2024
    Cite
    ZanSara (2024). seven-wonders-eval [Dataset]. https://huggingface.co/datasets/ZanSara/seven-wonders-eval
    Dataset updated
    Sep 28, 2024
    Authors
    ZanSara
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Seven Wonders (Evaluation Dataset)

    Eval dataset generated with RAGAS for https://huggingface.co/datasets/ZanSara/seven-wonders Original data from https://huggingface.co/datasets/bilgeyucel/seven-wonders

  8. glue

    • huggingface.co
    • tensorflow.google.cn
    • +1more
    Updated Mar 6, 2024
    + more versions
    Cite
    NYU Machine Learning for Language (2024). glue [Dataset]. https://huggingface.co/datasets/nyu-mll/glue
    Dataset updated
    Mar 6, 2024
    Dataset authored and provided by
    NYU Machine Learning for Language
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for GLUE

      Dataset Summary
    

    GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

      Supported Tasks and Leaderboards
    

    The leaderboard for the GLUE benchmark can be found on the GLUE website. The benchmark comprises the following tasks:

      ax
    

    A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
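Multi-task benchmarks like GLUE typically collapse per-task results into one headline number via an unweighted macro-average. A minimal sketch; the task names are real GLUE tasks, but the scores below are invented for illustration:

```python
# Unweighted macro-average over per-task scores, the usual way a
# GLUE-style benchmark reports a single number. Scores are invented.
task_scores = {
    "cola": 0.60, "sst2": 0.94, "mrpc": 0.88, "qqp": 0.90,
    "mnli": 0.86, "qnli": 0.91, "rte": 0.70, "wnli": 0.56,
}

def macro_average(scores):
    """Mean over tasks, each task weighted equally."""
    return sum(scores.values()) / len(scores)

print(macro_average(task_scores))
```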

  9. fleurs

    • huggingface.co
    • opendatalab.com
    Updated Jun 4, 2022
    + more versions
    Cite
    Google (2022). fleurs [Dataset]. https://huggingface.co/datasets/google/fleurs
    Dataset updated
    Jun 4, 2022
    Dataset authored and provided by
    Google: http://google.com/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FLEURS

    Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers in the train sets are different from the speakers in the dev/test sets. Multilingual fine-tuning is used, and the "unit error rate" (characters, signs) of all languages is averaged. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
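The "unit error rate" averaged by FLEURS is an edit-distance-based metric. A minimal character-error-rate sketch (standard Levenshtein distance divided by reference length); the example strings are invented:

```python
# Character error rate, the kind of per-language "unit error rate"
# a speech benchmark like FLEURS averages across languages.
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

print(cer("bonjour", "bonsoir"))  # 2 edits over 7 reference characters
```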

  10. Llama-3.1-8B-evals

    • huggingface.co
    Updated Jul 23, 2024
    Cite
    Meta Llama (2024). Llama-3.1-8B-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Meta: http://meta.com/
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.1/

    Description

    Dataset Card for Llama-3.1-8B Evaluation Result Details

    This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.

  11. eka-medical-asr-evaluation-dataset

    • huggingface.co
    Updated Nov 14, 2025
    Cite
    Eka Care (2025). eka-medical-asr-evaluation-dataset [Dataset]. https://huggingface.co/datasets/ekacare/eka-medical-asr-evaluation-dataset
    Dataset updated
    Nov 14, 2025
    Dataset authored and provided by
    Eka Care
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Eka Medical ASR Evaluation Dataset

      Dataset Overview and Sourcing
    

    The Eka Medical ASR Evaluation Dataset enables comprehensive evaluation of automatic speech recognition systems designed to transcribe medical speech into accurate text, a fundamental component of any medical scribe system. This dataset captures the unique challenges of processing medical terminology, particularly branded drug names specific to the Indian context. The dataset comprises 3,900+… See the full description on the dataset page: https://huggingface.co/datasets/ekacare/eka-medical-asr-evaluation-dataset.

  12. wmt-mqm-human-evaluation

    • huggingface.co
    Updated Feb 16, 2024
    Cite
    Ricardo Rei (2024). wmt-mqm-human-evaluation [Dataset]. https://huggingface.co/datasets/RicardoRei/wmt-mqm-human-evaluation
    Dataset updated
    Feb 16, 2024
    Authors
    Ricardo Rei
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Summary

    This dataset contains all MQM human annotations from previous WMT Metrics shared tasks and the MQM annotations from Experts, Errors, and Context. The data is organised into 8 columns:

    lp: language pair
    src: input text
    mt: translation
    ref: reference translation
    score: MQM score
    system: MT engine that produced the translation
    annotators: number of annotators
    domain: domain of the input text (e.g. news)
    year: collection year

    You can also find the original data here.… See the full description on the dataset page: https://huggingface.co/datasets/RicardoRei/wmt-mqm-human-evaluation.
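Rows with the eight columns listed above are easy to slice once loaded, e.g. averaging MQM scores per language pair. A minimal sketch; the sample rows below are invented for illustration, not taken from the dataset:

```python
# Hypothetical rows shaped like the dataset's 8 columns
# (lp, src, mt, ref, score, system, annotators, domain, year).
rows = [
    {"lp": "en-de", "src": "Hello", "mt": "Hallo", "ref": "Hallo",
     "score": 0.0, "system": "sysA", "annotators": 1,
     "domain": "news", "year": 2022},
    {"lp": "en-de", "src": "Good night", "mt": "Gute Nacht!",
     "ref": "Gute Nacht", "score": -1.0, "system": "sysB",
     "annotators": 1, "domain": "news", "year": 2022},
    {"lp": "zh-en", "src": "你好", "mt": "Hello", "ref": "Hello",
     "score": 0.0, "system": "sysA", "annotators": 2,
     "domain": "news", "year": 2022},
]

def mean_score_by_lp(rows, lp):
    """Average MQM score over rows for one language pair."""
    scores = [r["score"] for r in rows if r["lp"] == lp]
    return sum(scores) / len(scores)

print(mean_score_by_lp(rows, "en-de"))  # -0.5
```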

  13. InfiBench

    • huggingface.co
    Updated Jun 12, 2024
    Cite
    Linyi Li (2024). InfiBench [Dataset]. http://doi.org/10.57967/hf/2474
    Dataset updated
    Jun 12, 2024
    Authors
    Linyi Li
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    InfiBench (Data Part)

    Note: for the full description, please visit our main website https://infi-coder.github.io/infibench. This repo contains all data of our code LLM evaluation dataset InfiBench. suite_v2.1.yaml lists the cases, and suite_v2.1_data.csv records all data (prompt, reference answer, evaluation metric). The data can be consumed directly by our automatic evaluation tool to evaluate any model's responses.

      Dataset Card
    

    Name: InfiBench
    Description: Evaluation… See the full description on the dataset page: https://huggingface.co/datasets/llylly001/InfiBench.
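Since the description says suite_v2.1_data.csv holds one record per case (prompt, reference answer, evaluation metric), such rows can be read with the standard csv module. The header names and sample row below are assumptions for illustration, not the dataset's actual schema:

```python
import csv
import io

# In-memory CSV shaped like the description above; header names
# (prompt, reference_answer, metric) are assumed, not verified.
sample = io.StringIO(
    "prompt,reference_answer,metric\n"
    '"How do I reverse a list in Python?","lst[::-1]",keyword_match\n'
)
rows = list(csv.DictReader(sample))
print(rows[0]["reference_answer"])  # lst[::-1]
```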

  14. Medical_Multimodal_Evaluation_Data

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    FreedomAI (2025). Medical_Multimodal_Evaluation_Data [Dataset]. https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    FreedomAI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Evaluation Guide

    This dataset is used to evaluate medical multimodal LLMs, as used in HuatuoGPT-Vision. It includes benchmarks such as VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, and MMMU-Medical-Tracks.
    To get started:

    Download the dataset and extract the images.zip file.
    Find evaluation code on our GitHub: HuatuoGPT-Vision.

    This open-source release aims to simplify the evaluation of medical multimodal capabilities in large models. Please cite the relevant benchmark… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data.

  15. model-written-evals

    • huggingface.co
    Updated Jan 12, 2023
    Cite
    Anthropic (2023). model-written-evals [Dataset]. https://huggingface.co/datasets/Anthropic/model-written-evals
    Dataset updated
    Jan 12, 2023
    Dataset authored and provided by
    Anthropic: https://anthropic.com/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model-Written Evaluation Datasets

    This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to:

    Those who are interested in understanding the quality and properties of model-generated data
    Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/model-written-evals.

  16. Llama-3.2-3B-evals

    • huggingface.co
    + more versions
    Cite
    Meta Llama, Llama-3.2-3B-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-evals
    Dataset provided by
    Meta: http://meta.com/
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.2/

    Description

    Dataset Card for Meta Evaluation Result Details for Llama-3.2-3B

    This dataset contains the Meta evaluation result details for Llama-3.2-3B. The dataset has been created from 8 evaluation tasks: needle_in_haystack, mmlu, squad, quac, drop, arc_challenge, multi_needle, agieval_english. Each task detail can be found as a specific subset in each configuration, and each subset is named using the task name plus the timestamp of the upload time and ends with… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-evals.

  17. T-Eval

    • huggingface.co
    Updated Feb 18, 2024
    Cite
    Zehui Chen (2024). T-Eval [Dataset]. https://huggingface.co/datasets/lovesnowbest/T-Eval
    Dataset updated
    Feb 18, 2024
    Authors
    Zehui Chen
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

      ✨ Introduction
    

    This is an evaluation harness for the benchmark described in T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step. [Paper] [Project Page] [LeaderBoard] [HuggingFace]

    Large language models (LLM) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and… See the full description on the dataset page: https://huggingface.co/datasets/lovesnowbest/T-Eval.

  18. toxicity_benchmark-evaluation

    • huggingface.co
    Updated Sep 12, 2024
    + more versions
    Cite
    Onepane.ai (2024). toxicity_benchmark-evaluation [Dataset]. https://huggingface.co/datasets/onepaneai/toxicity_benchmark-evaluation
    Dataset updated
    Sep 12, 2024
    Dataset provided by
    Onepane.ai
    Description

    onepaneai/toxicity_benchmark-evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. sla-evaluate

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Speech and Machine Intelligence Laboratory (2025). sla-evaluate [Dataset]. https://huggingface.co/datasets/ntnu-smil/sla-evaluate
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Speech and Machine Intelligence Laboratory
    Description

    ntnu-smil/sla-evaluate dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. LLaVA-UHD-v2-Evaluation

    • huggingface.co
    Updated Mar 5, 2025
    Cite
    YipengZhang (2025). LLaVA-UHD-v2-Evaluation [Dataset]. https://huggingface.co/datasets/YipengZhang/LLaVA-UHD-v2-Evaluation
    Dataset updated
    Mar 5, 2025
    Authors
    YipengZhang
    Description

    YipengZhang/LLaVA-UHD-v2-Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
