MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure they are not included in the training sets of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code". It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
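Below is a minimal sketch of the functional-correctness check the harness performs on a single problem, using the dataset's published fields (prompt, canonical_solution, test, entry_point). The real harness executes model completions in a sandboxed process with timeouts and aggregates results into pass@k; this sketch only illustrates the idea.

```python
# Minimal sketch of checking one HumanEval problem for functional correctness.
from datasets import load_dataset

problems = load_dataset("openai/openai_humaneval", split="test")
task = problems[0]  # fields: task_id, prompt, canonical_solution, test, entry_point

# A model's completion would normally replace canonical_solution here;
# the reference solution is used so the sketch is self-contained.
candidate_program = task["prompt"] + task["canonical_solution"]

namespace = {}
exec(candidate_program, namespace)                   # define the candidate function
exec(task["test"], namespace)                        # define check(candidate)
namespace["check"](namespace[task["entry_point"]])   # raises AssertionError on failure
print(f"{task['task_id']} passed")
```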
HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
tomreichel/proofdb-human-eval dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Inoichan
Released under MIT
RefineBench/Human-Eval dataset hosted on Hugging Face and contributed by the HF Datasets community
Instruct HumanEval
Summary
InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring, and its header to create a flexible setting that allows the evaluation of instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that lets the model elicit its best capabilities. Here is an example of use; the prompt can be built as follows… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/instructhumaneval.
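As a rough illustration of the prompt construction described above, here is a minimal sketch; the column names (instruction, context) and the chat delimiters are assumptions, not the card's exact API, and should be swapped for whatever your instruction-tuned model was trained with.

```python
# Rough sketch of assembling an instruction prompt from InstructHumanEval.
from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval", split="test")
example = ds[0]

# Model-specific delimiters (assumed here for illustration).
user_token, assistant_token = "<|user|>", "<|assistant|>"

prompt = (
    f"{user_token}\n{example['instruction']}\n"
    f"{assistant_token}\n{example['context']}\n"
)
print(prompt)
```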
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides how-to articles from wikihow.com and their summaries, written as coherent paragraphs. The dataset itself is available at wikisum.zip and contains the article, the summary, the wikihow url, and an official fold (train, val, or test). In addition, human evaluation results are available at wikisum-human-eval.zip. These consist of human evaluations of the Pegasus system's summaries, annotators' responses regarding the difficulty of the task, and words they marked as unknown.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
This dataset contains all DA human annotations from previous WMT News Translation shared tasks. The data is organised into the following columns:
lp: language pair
src: input text
mt: translation
ref: reference translation
score: z score
raw: direct assessment
annotators: number of annotators
domain: domain of the input text (e.g. news)
year: collection year
You can also find the original data for each year in the results section https://www.statmt.org/wmt{YEAR}/results.html… See the full description on the dataset page: https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation.
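A minimal sketch of loading these annotations and filtering one language pair, assuming a single train split and the column names listed above:

```python
# Load the WMT DA human annotations and keep one language pair.
from datasets import load_dataset

ds = load_dataset("RicardoRei/wmt-da-human-evaluation", split="train")
en_de = ds.filter(lambda row: row["lp"] == "en-de")

print(len(en_de), "en-de annotations")
print(en_de[0]["src"], "->", en_de[0]["mt"], "| z score:", en_de[0]["score"])
```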
Robust Summarization Evaluation Benchmark is a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Evaluation results using automatic and human evaluation metrics for baselines, ablation, and our proposed model on the CMU_DoG dataset.
lmong/human-eval-instructions dataset hosted on Hugging Face and contributed by the HF Datasets community
GENIE, which stands for GENeratIve Evaluation, is a system designed to standardize human evaluations across different text generation tasks. It was introduced to produce consistent evaluations that are reproducible over time and across different populations. The system is instantiated with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. For each task, GENIE offers a leaderboard that automatically crowdsources annotations for submissions, evaluating them along axes such as correctness, conciseness, and fluency.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Evaluation results using automatic and human evaluation metrics for baselines, ablation, and our proposed model on the Topical Chat Frequent dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ChatGPT Gemini Claude Perplexity Human Evaluation Multi Aspect Review Dataset
Introduction
Human evaluations and reviews with scalar scores of AI service responses are very useful for LLM fine-tuning, human preference alignment, few-shot learning, bad-case troubleshooting, etc., but they are extremely difficult to collect. This dataset is collected from the DeepNLP AI Service User Review panel (http://www.deepnlp.org/store), an open review website where users give reviews and upload… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
For a detailed description, see Hugging Face.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A snippet of dialogue from a Topical Chat conversation.
RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:
Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.
Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
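As a rough sketch of that computation, ASR is simply the fraction of harmful prompts whose responses are judged unsafe; the judge function below is a hypothetical stand-in for RedEval's LLM-as-judge step.

```python
# Sketch of the Attack Success Rate (ASR) over a set of model responses.
def attack_success_rate(responses, judge_is_unsafe):
    """Fraction of responses the judge flags as unsafe."""
    if not responses:
        return 0.0
    unsafe = sum(1 for r in responses if judge_is_unsafe(r))
    return unsafe / len(responses)

# Example: 3 of 4 responses judged unsafe -> ASR = 0.75
demo = ["resp_a", "resp_b", "resp_c", "resp_d"]
print(attack_success_rate(demo, lambda r: r != "resp_b"))
```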
Question Banks:
HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.
Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.
Installation:
Create a conda environment: conda create --name redeval -c conda-forge python=3.11
Activate the environment: conda activate redeval
Install required packages: pip install -r requirements.txt
Store API keys in the api_keys directory for use by the LLM as a judge and the generate_responses.py script for closed-source models.
Prompt Templates:
Choose a prompt template for red-teaming:
Chain of Utterances (CoU): Effective at breaking safety guardrails.
Chain of Thoughts (CoT)
Standard prompt
Suffix prompt
Note: Different LLMs may require slight variations in the prompt template.
How to Perform Red-Teaming:
Step 0: Decide on the prompt template.
Step 1: Generate model outputs on harmful questions by providing a path to the question bank and the red-teaming prompt.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distribution of emotion classes in the Topical Chat dataset.