@misc{muennighoff2022crosslingual,
  title={Crosslingual Generalization through Multitask Finetuning},
  author={Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel},
  year={2022},
  eprint={2211.01786},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ygl1020/Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
evaluate metrics
This dataset contains metrics about the huggingface/evaluate package. Number of repositories in the dataset: 106. Number of packages in the dataset: 3.
Package dependents
This contains the data available in the used-by tab on GitHub.
Package & Repository star count
This section shows the package and repository star count, individually.
There is 1 package that has more than 1000 stars. There are 2 repositories… See the full description on the dataset page: https://huggingface.co/datasets/open-source-metrics/evaluate-dependents.
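As a quick-inspection sketch (not part of the dataset card), the dataset can be explored with the datasets library; the configuration and split names are not documented above, so the example discovers them at runtime rather than assuming them.

```python
# Sketch: inspect the open-source-metrics/evaluate-dependents dataset.
from datasets import get_dataset_config_names, load_dataset

repo_id = "open-source-metrics/evaluate-dependents"

# Discover the available configurations instead of assuming their names.
configs = get_dataset_config_names(repo_id)
print(configs)

# Load the first configuration; the "train" split name is an assumption.
ds = load_dataset(repo_id, configs[0], split="train")
print(ds.column_names, len(ds))
```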

bigcode/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

evaluate/media dataset hosted on Hugging Face and contributed by the HF Datasets community

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MERA (Multimodal Evaluation for Russian-language Architectures)
Summary
MERA (Multimodal Evaluation for Russian-language Architectures) is a new open independent benchmark for the evaluation of SOTA models for the Russian language. The MERA benchmark unites industry and academic partners in one place to research the capabilities of fundamental models, draw attention to AI-related issues, foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/MERA.

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Seven Wonders (Evaluation Dataset)
Evaluation dataset generated with RAGAS for https://huggingface.co/datasets/ZanSara/seven-wonders
Original data from https://huggingface.co/datasets/bilgeyucel/seven-wonders

Other license: https://choosealicense.com/licenses/other/
Dataset Card for GLUE
Dataset Summary
GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
Supported Tasks and Leaderboards
The leaderboard for the GLUE benchmark can be found at https://gluebenchmark.com/leaderboard. The benchmark comprises the following tasks:
ax
A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
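As a usage sketch (not part of the original card), each GLUE task, including the ax diagnostic set, is loaded by passing the task name as the configuration name.

```python
# Sketch: load GLUE tasks by passing the task name as the config.
from datasets import load_dataset

# The "ax" diagnostic set only provides a test split (labels are withheld).
ax = load_dataset("nyu-mll/glue", "ax", split="test")
print(ax[0])

# Other tasks, e.g. MNLI, also expose train and validation splits.
mnli = load_dataset("nyu-mll/glue", "mnli", split="validation_matched")
print(len(mnli))
```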

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FLEURS
Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers of the train sets are different from speakers of the dev/test sets. Multilingual fine-tuning is used and the "unit error rate" (characters, signs) of all languages is averaged. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
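A minimal loading sketch (not from the card): each language is a separate configuration such as en_us, and streaming avoids downloading the full audio archives up front; the config name here is only an example.

```python
# Sketch: stream one FLEURS language config without downloading everything.
# Depending on your `datasets` version, trust_remote_code=True may be required.
from datasets import load_dataset

# Each language is its own config (e.g. "en_us"); "all" combines the 102 languages.
fleurs = load_dataset("google/fleurs", "en_us", split="test", streaming=True)

sample = next(iter(fleurs))
print(sample["transcription"])           # reference text
print(sample["audio"]["sampling_rate"])  # 16 kHz speech
```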

Llama 3.1 license: https://choosealicense.com/licenses/llama3.1/
Dataset Card for Llama-3.1-8B Evaluation Result Details
This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.
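Because each subset name embeds an upload timestamp, a sketch like the one below lists the configurations at runtime instead of hard-coding them. Access to meta-llama repositories is gated, so a Hugging Face login with accepted license terms is assumed.

```python
# Sketch: discover the timestamped eval subsets instead of hard-coding names.
# Assumes `huggingface-cli login` has been run and the Llama license accepted.
from datasets import get_dataset_config_names, load_dataset

repo_id = "meta-llama/Llama-3.1-8B-evals"

configs = get_dataset_config_names(repo_id)
print(configs)  # one configuration per evaluation task

# Load one configuration; without a split argument this returns a DatasetDict
# keyed by the timestamped subset names described above.
details = load_dataset(repo_id, configs[0])
print(details)
```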

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Eka Medical ASR Evaluation Dataset
Dataset Overview and Sourcing
The Eka Medical ASR Evaluation Dataset enables comprehensive evaluation of automatic speech recognition systems designed to transcribe medical speech into accurate text, a fundamental component of any medical scribe system. This dataset captures the unique challenges of processing medical terminology, particularly branded drugs specific to the Indian context. The dataset comprises over 3,900… See the full description on the dataset page: https://huggingface.co/datasets/ekacare/eka-medical-asr-evaluation-dataset.
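As an illustrative sketch (not from the card), ASR output can be scored against references with word error rate from the evaluate library; the dataset's split and column names are not documented above, so the example uses toy strings.

```python
# Sketch: word error rate (WER) scoring with the `evaluate` library,
# one common way to evaluate an ASR system on a dataset like this one.
# Requires the `jiwer` package, which the WER metric depends on.
import evaluate

wer_metric = evaluate.load("wer")

references = ["patient was prescribed paracetamol 650 mg twice daily"]
predictions = ["patient was prescribed paracetamol 650 milligram twice daily"]

# WER = (substitutions + insertions + deletions) / reference word count
print("WER:", wer_metric.compute(predictions=predictions, references=references))
```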

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
This dataset contains all MQM human annotations from previous WMT Metrics shared tasks and the MQM annotations from Experts, Errors, and Context. The data is organised into the following columns:
lp: language pair
src: input text
mt: translation
ref: reference translation
score: MQM score
system: MT engine that produced the translation
annotators: number of annotators
domain: domain of the input text (e.g. news)
year: collection year
You can also find the original data here… See the full description on the dataset page: https://huggingface.co/datasets/RicardoRei/wmt-mqm-human-evaluation.
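A small sketch of slicing the annotations by language pair and domain (a single train split is assumed, which is how flat annotation dumps are typically published on the Hub):

```python
# Sketch: load the MQM annotations and slice them by language pair and domain.
from datasets import load_dataset

mqm = load_dataset("RicardoRei/wmt-mqm-human-evaluation", split="train")
df = mqm.to_pandas()

# e.g. English->German annotations from the news domain
# (the "en-de" value is an assumption; inspect df["lp"].unique() first)
en_de = df[(df["lp"] == "en-de") & (df["domain"] == "news")]
print(len(en_de), en_de["score"].mean())
```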

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
InfiBench (Data Part)
Note: For the full description, please visit our main website https://infi-coder.github.io/infibench. This repo contains all data of our code LLM evaluation dataset InfiBench: suite_v2.1.yaml lists the cases and suite_v2.1_data.csv records all data (prompt, reference answer, evaluation metric). The data can be directly consumed by our automatic evaluation tool to evaluate any model's responses.
Dataset Card
Name: InfiBench
Description: Evaluation… See the full description on the dataset page: https://huggingface.co/datasets/llylly001/InfiBench.
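Since the card names suite_v2.1_data.csv as the flat data file, a minimal sketch for pulling it with huggingface_hub and inspecting it with pandas might look like this (the file is assumed to sit at the repository root, and the exact column names beyond prompt, reference answer, and evaluation metric are not listed here):

```python
# Sketch: download the InfiBench data CSV named on the card and inspect it.
import pandas as pd
from huggingface_hub import hf_hub_download

csv_path = hf_hub_download(
    repo_id="llylly001/InfiBench",
    filename="suite_v2.1_data.csv",
    repo_type="dataset",
)

df = pd.read_csv(csv_path)
print(df.shape)
print(df.columns.tolist())  # expected to include prompt / reference answer / metric fields
```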

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Evaluation Guide
This dataset is used to evaluate medical multimodal LLMs, as used in HuatuoGPT-Vision. It includes benchmarks such as VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, and MMMU-Medical-Tracks.
To get started:
Download the dataset and extract the images.zip file.
Find evaluation code on our GitHub: HuatuoGPT-Vision.
This open-source release aims to simplify the evaluation of medical multimodal capabilities in large models. Please cite the relevant benchmark… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data.
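Following the steps above, a minimal sketch for fetching the data and unpacking images.zip (the archive name comes from the guide; the extraction layout is otherwise an assumption):

```python
# Sketch: download the evaluation data and extract the images.zip archive
# mentioned in the guide above.
import zipfile
from pathlib import Path

from huggingface_hub import snapshot_download

local_dir = Path(
    snapshot_download(
        repo_id="FreedomIntelligence/Medical_Multimodal_Evaluation_Data",
        repo_type="dataset",
    )
)

archive = local_dir / "images.zip"  # location within the repo is assumed
with zipfile.ZipFile(archive) as zf:
    zf.extractall(local_dir / "images")

print("extracted to", local_dir / "images")
```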

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model-Written Evaluation Datasets
This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to:
Those who are interested in understanding the quality and properties of model-generated data
Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/model-written-evals.
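The repository stores the evaluations as JSONL files grouped by behavior category; since the exact file paths are not listed here, a sketch can enumerate them first and then load one with the generic json builder.

```python
# Sketch: list the JSONL evaluation files in the repo, download one, and load it.
from datasets import load_dataset
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "Anthropic/model-written-evals"

files = [f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith(".jsonl")]
print(files[:5])  # behavior-category subdirectories, e.g. persona or sycophancy files

local_path = hf_hub_download(repo_id, files[0], repo_type="dataset")
evals = load_dataset("json", data_files=local_path, split="train")
print(evals[0])
```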

Llama 3.2 license: https://choosealicense.com/licenses/llama3.2/
Dataset Card for Meta Evaluation Result Details for Llama-3.2-3B
This dataset contains the Meta evaluation result details for Llama-3.2-3B. The dataset has been created from 8 evaluation tasks. The tasks are: needle_in_haystack, mmlu, squad, quac, drop, arc_challenge, multi_needle, agieval_english. Each task detail can be found as a specific subset in each configuration, and each subset is named using the task name plus the timestamp of the upload time and ends with… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-evals.

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step
✨ Introduction
This is an evaluation harness for the benchmark described in T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step. [Paper] [Project Page] [LeaderBoard] [HuggingFace]
Large language models (LLMs) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and… See the full description on the dataset page: https://huggingface.co/datasets/lovesnowbest/T-Eval.

onepaneai/toxicity_benchmark-evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community

ntnu-smil/sla-evaluate dataset hosted on Hugging Face and contributed by the HF Datasets community

YipengZhang/LLaVA-UHD-v2-Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community