MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MERA (Multimodal Evaluation for Russian-language Architectures)
Summary
MERA (Multimodal Evaluation for Russian-language Architectures) is a new open, independent benchmark for evaluating SOTA models for the Russian language. The MERA benchmark unites industry and academic partners to research the capabilities of foundation models, draw attention to AI-related issues, and foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/MERA.
bigcode/evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for BeaverTails-Evaluation
BeaverTails is an AI safety-focused collection comprising a series of datasets. This repository contains test prompts specifically designed for evaluating language model safety. It is important to note that although each prompt can be connected to multiple categories, only one category is labeled for each prompt. The 14 harm categories are defined as follows:
Animal Abuse: This involves any form of cruelty or harm inflicted on animals… See the full description on the dataset page: https://huggingface.co/datasets/PKU-Alignment/BeaverTails-Evaluation.
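To make the one-label-per-prompt structure concrete, here is a minimal loading sketch that tallies prompts per harm category. The split and column names ("test", "prompt", "category") are assumptions drawn from the card, not verified against the schema:

```python
from collections import Counter

from datasets import load_dataset

# Split and column names are assumptions; inspect ds.column_names first.
ds = load_dataset("PKU-Alignment/BeaverTails-Evaluation", split="test")

# Each prompt carries exactly one labeled category, even if it could
# plausibly fall under several of the 14 harm categories.
counts = Counter(example["category"] for example in ds)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```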
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model-Written Evaluation Datasets
This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to:
- Those who are interested in understanding the quality and properties of model-generated data
- Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/model-written-evals.
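Because the repository hosts raw .jsonl files grouped by behavior (persona, sycophancy, advanced AI risks), one way to load a single eval is to point load_dataset's generic JSON loader at a file URL. The path below ("persona/agreeableness.jsonl") is an assumption about the layout and may differ:

```python
from datasets import load_dataset

# Hypothetical file path; browse the repo to find the actual .jsonl files.
url = (
    "https://huggingface.co/datasets/Anthropic/model-written-evals"
    "/resolve/main/persona/agreeableness.jsonl"
)
ds = load_dataset("json", data_files=url, split="train")
print(ds[0])  # typically a question plus matching / not-matching answers
```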
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ygl1020/Evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Overview
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, in which a 600M-parameter fine-tuned BERT-like evaluator achieves results comparable to human annotators and GPT-4.
Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
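A minimal loading sketch, assuming a single default split; the column layout is whatever the card defines (the paper's taxonomy suggests fields like the question text and its risk area), so inspect it rather than relying on this listing:

```python
from datasets import load_dataset

# Split name is an assumption; check the dataset viewer if this fails.
ds = load_dataset("LibrAI/do-not-answer", split="train")
print(len(ds), "prompts that responsible models should refuse")
print(ds.column_names)
print(ds[0])
```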
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
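Since every GLUE task ships as a configuration of a single Hugging Face dataset, loading one is a two-argument call. A minimal sketch using the SST-2 task (any other config name, e.g. "mrpc" or "qnli", works the same way):

```python
from datasets import load_dataset
import evaluate

# Each GLUE task is a named configuration of the "glue" dataset.
sst2 = load_dataset("glue", "sst2")
print(sst2)              # train / validation / test splits
print(sst2["train"][0])  # {"sentence": ..., "label": ..., "idx": ...}

# The evaluate package mirrors the task names for metrics.
metric = evaluate.load("glue", "sst2")
print(metric.compute(predictions=[0, 1], references=[0, 1]))  # {'accuracy': 1.0}
```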
Dataset Card for AutoTrain Evaluator
This repository contains model predictions generated by AutoTrain for the following task and dataset:
Task: Summarization
Model: kworts/BARTxiv
Dataset: cnn_dailymail
Config: 3.0.0
Split: test
To run new evaluation jobs, visit Hugging Face's automatic model evaluator.
Contributions
Thanks to @Raj P Sini for evaluating this model.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
evaluate metrics
This dataset contains metrics about the huggingface/evaluate package.
Number of repositories in the dataset: 106
Number of packages in the dataset: 3
Package dependents
This contains the data available in the "Used by" tab on GitHub.
Package & Repository star count
This section shows the star counts for the package and for its repository, individually.
[Charts: Package star count; Repository star count]
There is 1 package with more than 1000 stars. There are 2 repositories… See the full description on the dataset page: https://huggingface.co/datasets/open-source-metrics/evaluate-dependents.
Dataset Card for H4 Code Evaluation Prompts
These are a filtered set of prompts for evaluating code instruction models, covering a variety of languages and task types. Currently, we used ChatGPT (GPT-3.5-turbo) to generate them, so we encourage using them only for qualitative evaluation and not for training your models. The generation of this data is similar to that of CodeAlpaca, which you can download here, but we intend to make these tasks both a) more challenging… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts.
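For the qualitative, not-for-training use the card recommends, a quick way to eyeball the prompts is to sample a few rows. The split and column names below are assumptions:

```python
from datasets import load_dataset

# Split name is an assumption; inspect the repo if "train" is not present.
ds = load_dataset("HuggingFaceH4/code_evaluation_prompts", split="train")
print(ds.column_names)
for example in ds.select(range(3)):
    print(example)
```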
Dataset Card for Evaluation run of huggingface/nebius/google/gemma-3-27b-it
Dataset automatically created during the evaluation run of model huggingface/nebius/google/gemma-3-27b-it. The dataset is composed of 1 configuration, corresponding to the evaluated task. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the… See the full description on the dataset page: https://huggingface.co/datasets/mfuntowicz/details_huggingface_nebius_google_gemma-3-27b-it_private.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure they would not appear in the training sets of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
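HumanEval's schema is documented: each problem provides "task_id", "prompt", "canonical_solution", "test", and "entry_point". A common pattern is to score completions with the code_eval (pass@k) metric from the evaluate package; the sketch below substitutes the canonical solution for a model completion, so it should report pass@1 = 1.0:

```python
import os

import evaluate
from datasets import load_dataset

# code_eval executes untrusted generated code, so it must be opted into.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

problems = load_dataset("openai/openai_humaneval", split="test")
problem = problems[0]

code_eval = evaluate.load("code_eval")
pass_at_k, _ = code_eval.compute(
    # The test field defines check(); call it on the problem's entry point.
    references=[problem["test"] + f"\ncheck({problem['entry_point']})"],
    # One list of candidate completions per problem; here, the reference body.
    predictions=[[problem["prompt"] + problem["canonical_solution"]]],
    k=[1],
)
print(pass_at_k)  # {'pass@1': 1.0}
```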
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ChatGPT Gemini Claude Perplexity Human Evaluation Multi Aspect Review Dataset
Introduction
Human evaluations and reviews with scalar scores of AI service responses are very useful for LLM fine-tuning, human preference alignment, few-shot learning, bad-case troubleshooting, etc., but extremely difficult to collect. This dataset is collected from the DeepNLP AI Service User Review panel (http://www.deepnlp.org/store), an open review website where users give reviews and upload… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset.
evaluate/media dataset hosted on Hugging Face and contributed by the HF Datasets community
ExponentialScience/is-dlt-models-evaluation dataset hosted on Hugging Face and contributed by the HF Datasets community
Llama 3.1 Community License: https://choosealicense.com/licenses/llama3.1/
Dataset Card for Llama-3.1-405B Evaluation Result Details
This dataset contains the Meta evaluation result details for Llama-3.1-405B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-405B-evals.
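Since the subsets are named with the task name plus an upload timestamp, the config names are easiest to discover programmatically. A minimal sketch (the repository is gated, so an authenticated Hugging Face token may be required):

```python
from datasets import get_dataset_config_names

# One config per evaluation task; names embed the upload timestamp.
configs = get_dataset_config_names("meta-llama/Llama-3.1-405B-evals")
for name in configs:
    print(name)
```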
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Neural-Code-Search-Evaluation-Dataset presents an evaluation dataset consisting of natural language query and code snippet pairs and a search corpus consisting of code snippets collected from the most popular Android repositories on GitHub.
Llama 3.1 Community License: https://choosealicense.com/licenses/llama3.1/
Dataset Card for Llama-3.1-8B Evaluation Result Details
This dataset contains the Meta evaluation result details for Llama-3.1-8B. The dataset has been created from 12 evaluation tasks. These tasks are triviaqa_wiki, mmlu_pro, commonsenseqa, winogrande, mmlu, boolq, squad, quac, drop, bbh, arc_challenge, agieval_english. Each task detail can be found as a specific subset in each configuration and each subset is named using the task name plus the timestamp of the upload time and… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-evals.
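Once a config name is known, one task's details load like any other subset. The config string and the "latest" split below are assumptions about the naming scheme described above, not verified values:

```python
from datasets import load_dataset

# Hypothetical config name; list the real ones with get_dataset_config_names.
config = "Llama-3.1-8B-evals__mmlu__details"
details = load_dataset("meta-llama/Llama-3.1-8B-evals", name=config, split="latest")
print(details[0])
```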
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Text To Image Test Prompt Library
A comprehensive collection of evaluation prompts for testing text-to-image AI models across diverse parameters and use cases.
Overview
This repository contains a structured set of test prompts designed to evaluate the capabilities of text-to-image generation models. Rather than focusing on formal evaluation metrics, these prompts are intended for end users who want to test how well a model might perform for their specific use cases.… See the full description on the dataset page: https://huggingface.co/datasets/danielrosehill/Text-To-Image-Test-Prompts.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended for hyperparameter tuning, and the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
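A minimal loading sketch, assuming each of the 52 subjects is exposed as its own configuration; the subject name below is an assumption, and per the card only the dev split carries explanations for few-shot prompting:

```python
from datasets import load_dataset

# Hypothetical subject config; list real ones with get_dataset_config_names.
ceval = load_dataset("ceval/ceval-exam", name="computer_network")
print(ceval)            # dev (five exemplars), val, test splits per subject
print(ceval["dev"][0])  # question, options, answer, explanation (assumed fields)
```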