100+ datasets found
  1. openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAIhttps://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they were not included in the training sets of code generation models.

      Supported Tasks and Leaderboards
    
    
    
    
    
      Languages
    

    The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
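
    For quick reference, here is a minimal sketch of loading the dataset with the Hugging Face datasets library. The split name ("test") and the field names (task_id, prompt, canonical_solution, test, entry_point) follow the dataset card, but treat them as assumptions to verify there.

```python
# Minimal sketch: load HumanEval and inspect one problem.
from datasets import load_dataset

humaneval = load_dataset("openai/openai_humaneval", split="test")

problem = humaneval[0]
print(problem["task_id"])      # e.g. "HumanEval/0"
print(problem["prompt"])       # function signature + docstring
print(problem["entry_point"])  # name of the function the unit tests call
```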

  2. HumanEval Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 31, 2024
    Cite
    Mark Chen; Jerry Tworek; Heewoo Jun; Qiming Yuan; Henrique Ponde de Oliveira Pinto; Jared Kaplan; Harri Edwards; Yuri Burda; Nicholas Joseph; Greg Brockman; Alex Ray; Raul Puri; Gretchen Krueger; Michael Petrov; Heidy Khlaaf; Girish Sastry; Pamela Mishkin; Brooke Chan; Scott Gray; Nick Ryder; Mikhail Pavlov; Alethea Power; Lukasz Kaiser; Mohammad Bavarian; Clemens Winter; Philippe Tillet; Felipe Petroski Such; Dave Cummings; Matthias Plappert; Fotios Chantzis; Elizabeth Barnes; Ariel Herbert-Voss; William Hebgen Guss; Alex Nichol; Alex Paino; Nikolas Tezak; Jie Tang; Igor Babuschkin; Suchir Balaji; Shantanu Jain; William Saunders; Christopher Hesse; Andrew N. Carr; Jan Leike; Josh Achiam; Vedant Misra; Evan Morikawa; Alec Radford; Matthew Knight; Miles Brundage; Mira Murati; Katie Mayer; Peter Welinder; Bob McGrew; Dario Amodei; Sam McCandlish; Ilya Sutskever; Wojciech Zaremba (2024). HumanEval Dataset [Dataset]. https://paperswithcode.com/dataset/humaneval
    Dataset updated
    Dec 31, 2024
    Authors
    Mark Chen; Jerry Tworek; Heewoo Jun; Qiming Yuan; Henrique Ponde de Oliveira Pinto; Jared Kaplan; Harri Edwards; Yuri Burda; Nicholas Joseph; Greg Brockman; Alex Ray; Raul Puri; Gretchen Krueger; Michael Petrov; Heidy Khlaaf; Girish Sastry; Pamela Mishkin; Brooke Chan; Scott Gray; Nick Ryder; Mikhail Pavlov; Alethea Power; Lukasz Kaiser; Mohammad Bavarian; Clemens Winter; Philippe Tillet; Felipe Petroski Such; Dave Cummings; Matthias Plappert; Fotios Chantzis; Elizabeth Barnes; Ariel Herbert-Voss; William Hebgen Guss; Alex Nichol; Alex Paino; Nikolas Tezak; Jie Tang; Igor Babuschkin; Suchir Balaji; Shantanu Jain; William Saunders; Christopher Hesse; Andrew N. Carr; Jan Leike; Josh Achiam; Vedant Misra; Evan Morikawa; Alec Radford; Matthew Knight; Miles Brundage; Mira Murati; Katie Mayer; Peter Welinder; Bob McGrew; Dario Amodei; Sam McCandlish; Ilya Sutskever; Wojciech Zaremba
    Description

    This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code". It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software-interview questions.
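
    The harness scores functional correctness with an unbiased pass@k estimator: generate n samples per problem, count the c samples that pass every unit test, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A small sketch of that estimator, using the numerically stable product form described in the paper rather than large binomial coefficients:

```python
# Unbiased pass@k estimator: n samples generated per problem,
# c of them pass all unit tests, k is the evaluation budget.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 of which pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # roughly 0.88
```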

  3. HumanEval-X Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 9, 2025
    + more versions
    Cite
    Qinkai Zheng; Xiao Xia; Xu Zou; Yuxiao Dong; Shan Wang; Yufei Xue; Zihan Wang; Lei Shen; Andi Wang; Yang Li; Teng Su; Zhilin Yang; Jie Tang (2025). HumanEval-X Dataset [Dataset]. https://paperswithcode.com/dataset/humaneval-x
    Dataset updated
    Jun 9, 2025
    Authors
    Qinkai Zheng; Xiao Xia; Xu Zou; Yuxiao Dong; Shan Wang; Yufei Xue; Zihan Wang; Lei Shen; Andi Wang; Yang Li; Teng Su; Zhilin Yang; Jie Tang
    Description

    HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
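
    As a rough sketch of how the benchmark can be consumed per language, assuming the public Hugging Face mirror THUDM/humaneval-x with one configuration per language ("python", "cpp", "java", "js", "go") and a "test" split:

```python
# Sketch: iterate over the five HumanEval-X language subsets.
# Repository id, configuration names, and split are assumptions
# about the public mirror; check the dataset page before use.
from datasets import load_dataset

for lang in ["python", "cpp", "java", "js", "go"]:
    subset = load_dataset("THUDM/humaneval-x", lang, split="test")
    print(lang, len(subset))  # 164 problems per language, 820 in total
```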

  4. proofdb-human-eval

    • huggingface.co
    Updated May 4, 2024
    + more versions
    Cite
    Tom Reichel (2024). proofdb-human-eval [Dataset]. https://huggingface.co/datasets/tomreichel/proofdb-human-eval
    Dataset updated
    May 4, 2024
    Authors
    Tom Reichel
    Description

    tomreichel/proofdb-human-eval dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. openai-human-eval

    • kaggle.com
    Updated Apr 10, 2024
    Cite
    Inoichan (2024). openai-human-eval [Dataset]. https://www.kaggle.com/datasets/inoueu1/openai-human-eval/versions/1
    Dataset updated
    Apr 10, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Inoichan
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Inoichan

    Released under MIT


  6. Human-Eval

    • huggingface.co
    Updated Jan 1, 2025
    + more versions
    Cite
    RefineBench (2025). Human-Eval [Dataset]. https://huggingface.co/datasets/RefineBench/Human-Eval
    Dataset updated
    Jan 1, 2025
    Dataset authored and provided by
    RefineBench
    Description

    RefineBench/Human-Eval dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. instructhumaneval

    • huggingface.co
    • opendatalab.com
    Updated Jun 29, 2023
    Cite
    CodeParrot (2023). instructhumaneval [Dataset]. https://huggingface.co/datasets/codeparrot/instructhumaneval
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    Description

    Instruct HumanEval

      Summary
    

    InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring, and its header to create a flexible setting that allows evaluating instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that lets the model elicit its best capabilities. Here is an example of use; the prompt can be built as follows… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/instructhumaneval.
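
    The usage example is truncated above; a hedged sketch of the idea follows. The field names ("instruction", "context") and the chat delimiters are assumptions for illustration, so confirm the actual schema and your model's instruction-tuning delimiters on the dataset card.

```python
# Hypothetical sketch: wrap one InstructHumanEval record in the
# delimiters a chat/instruct model was tuned with. Field names
# "instruction" and "context" are assumed, not confirmed.
from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval", split="test")
sample = ds[0]

user_token, assistant_token = "<|user|>", "<|assistant|>"  # model-specific
prompt = (
    f"{user_token}\n{sample['instruction']}\n"
    f"{assistant_token}\n{sample['context']}"
)
print(prompt)
```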

  8. Data from: WikiSum: Coherent Summarization Dataset for Efficient...

    • registry.opendata.aws
    Updated May 20, 2021
    Cite
    Amazon (2021). WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation [Dataset]. https://registry.opendata.aws/wikisum/
    Dataset updated
    May 20, 2021
    Dataset provided by
    Amazon.comhttp://amazon.com/
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides how-to articles from wikihow.com and their summaries, written as a coherent paragraph. The dataset itself is available at wikisum.zip and contains the article, the summary, the wikiHow URL, and an official fold (train, val, or test). In addition, human evaluation results are available at wikisum-human-eval.zip, which consists of human evaluations of summaries from the Pegasus system, annotators' responses regarding the difficulty of the task, and words they marked as unknown.

  9. wmt-da-human-evaluation

    • huggingface.co
    Updated Feb 21, 2023
    Cite
    Ricardo Rei (2023). wmt-da-human-evaluation [Dataset]. https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation
    Dataset updated
    Feb 21, 2023
    Authors
    Ricardo Rei
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Summary

    This dataset contains all DA human annotations from previous WMT News Translation shared tasks. The data is organised into the following columns:

    lp: language pair
    src: input text
    mt: translation
    ref: reference translation
    score: z-score
    raw: direct assessment
    annotators: number of annotators
    domain: domain of the input text (e.g. news)
    year: collection year

    You can also find the original data for each year in the results section https://www.statmt.org/wmt{YEAR}/results.html… See the full description on the dataset page: https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation.
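
    A short sketch of loading the annotations and filtering them to one language pair. The column names follow the list above; the single "train" split is an assumption to check against the dataset page.

```python
# Sketch: load the WMT DA human annotations and keep recent en-de rows.
from datasets import load_dataset

da = load_dataset("RicardoRei/wmt-da-human-evaluation", split="train")
en_de = da.filter(lambda row: row["lp"] == "en-de" and row["year"] >= 2020)
print(len(en_de))
print(en_de[0]["src"], "->", en_de[0]["mt"], "| score:", en_de[0]["score"])
```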

  10. Robust Summarization Evaluation Benchmark Dataset

    • paperswithcode.com
    Updated Dec 14, 2022
    Cite
    Yixin Liu; Alexander R. Fabbri; PengFei Liu; Yilun Zhao; Linyong Nan; Ruilin Han; Simeng Han; Shafiq Joty; Chien-Sheng Wu; Caiming Xiong; Dragomir Radev (2022). Robust Summarization Evaluation Benchmark Dataset [Dataset]. https://paperswithcode.com/dataset/robust-summarization-evaluation-benchmark
    Dataset updated
    Dec 14, 2022
    Authors
    Yixin Liu; Alexander R. Fabbri; PengFei Liu; Yilun Zhao; Linyong Nan; Ruilin Han; Simeng Han; Shafiq Joty; Chien-Sheng Wu; Caiming Xiong; Dragomir Radev
    Description

    Robust Summarization Evaluation Benchmark is a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets.

  11. Evaluation results using automatic and human evaluation metrics for...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Deeksha Varshney; Asif Ekbal; Mrigank Tiwari; Ganesh Prasad Nagaraja (2023). Evaluation results using automatic and human evaluation metrics for baselines, ablation, and our proposed model on the CMU_DoG dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0280458.t005
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Deeksha Varshney; Asif Ekbal; Mrigank Tiwari; Ganesh Prasad Nagaraja
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Evaluation results using automatic and human evaluation metrics for baselines, ablation, and our proposed model on the CMU_DoG dataset.

  12. human-eval-instructions

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    XUANMAO LI (2022). human-eval-instructions [Dataset]. https://huggingface.co/datasets/lmong/human-eval-instructions
    Dataset updated
    Jan 1, 2022
    Authors
    XUANMAO LI
    Description

    lmong/human-eval-instructions dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. HumanEval-ET Dataset

    • paperswithcode.com
    Updated Jun 14, 2023
    Cite
    (2023). HumanEval-ET Dataset [Dataset]. https://paperswithcode.com/dataset/humaneval-et
    Dataset updated
    Jun 14, 2023
    Description

    Extended test cases for HumanEval, as well as generated code.

  14. GENIE Dataset

    • paperswithcode.com
    Updated Mar 24, 2022
    Cite
    Daniel Khashabi; Gabriel Stanovsky; Jonathan Bragg; Nicholas Lourie; Jungo Kasai; Yejin Choi; Noah A. Smith; Daniel S. Weld (2022). GENIE Dataset [Dataset]. https://paperswithcode.com/dataset/genie
    Dataset updated
    Mar 24, 2022
    Authors
    Daniel Khashabi; Gabriel Stanovsky; Jonathan Bragg; Nicholas Lourie; Jungo Kasai; Yejin Choi; Noah A. Smith; Daniel S. Weld
    Description

    GENIE, which stands for GENeratIve Evaluation, is a system designed to standardize human evaluations across different text generation tasks. It was introduced to produce consistent evaluations that are reproducible over time and across different populations. The system is instantiated with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. For each task, GENIE offers a leaderboard that automatically crowdsources annotations for submissions, evaluating them along axes such as correctness, conciseness, and fluency.

  15. Evaluation results using automatic and human evaluation metrics for...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Deeksha Varshney; Asif Ekbal; Mrigank Tiwari; Ganesh Prasad Nagaraja (2023). Evaluation results using automatic and human evaluation metrics for baselines, ablation, and our proposed model on the Topical Chat Frequent dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0280458.t004
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Deeksha Varshney; Asif Ekbal; Mrigank Tiwari; Ganesh Prasad Nagaraja
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Evaluation results using automatic and human evaluation metrics for baselines, ablation, and our proposed model on the Topical Chat Frequent dataset.

  16. ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset...

    • huggingface.co
    Updated Nov 12, 2024
    Cite
    DeepNLP (2024). ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset [Dataset]. https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset
    Dataset updated
    Nov 12, 2024
    Authors
    DeepNLP
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ChatGPT Gemini Claude Perplexity Human Evaluation Multi Aspect Review Dataset

      Introduction
    

    Human evaluations and reviews with scalar scores of AI service responses are very useful for LLM fine-tuning, human preference alignment, few-shot learning, bad-case shooting, etc., but extremely difficult to collect. This dataset is collected from the DeepNLP AI Service User Review panel (http://www.deepnlp.org/store), which is an open review website where users can give reviews and upload… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/ChatGPT-Gemini-Claude-Perplexity-Human-Evaluation-Multi-Aspects-Review-Dataset.

  17. LJS-MOS-120: Human MOS Ratings for 120 Samples of the LJ Speech Dataset

    • zenodo.org
    bin
    Updated May 12, 2025
    Cite
    Stefan Taubert (2025). LJS-MOS-120: Human MOS Ratings for 120 Samples of the LJ Speech Dataset [Dataset]. http://doi.org/10.57967/hf/5368
    Available download formats: bin
    Dataset updated
    May 12, 2025
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Stefan Taubert
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    For a detailed description, see Hugging Face.

  18. A snippet of dialogue from the topical chat conversation.

    • figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Deeksha Varshney; Asif Ekbal; Mrigank Tiwari; Ganesh Prasad Nagaraja (2023). A snippet of dialogue from the topical chat conversation. [Dataset]. http://doi.org/10.1371/journal.pone.0280458.t001
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Deeksha Varshney; Asif Ekbal; Mrigank Tiwari; Ganesh Prasad Nagaraja
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A snippet of dialogue from the topical chat conversation.

  19. RedEval Dataset

    • paperswithcode.com
    Updated Apr 10, 2024
    Cite
    Rishabh Bhardwaj; Soujanya Poria (2024). RedEval Dataset [Dataset]. https://paperswithcode.com/dataset/redeval
    Dataset updated
    Apr 10, 2024
    Authors
    Rishabh Bhardwaj; Soujanya Poria
    Description

    RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:

    Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.

    Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).

    Question Banks:

    HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
    DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
    CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
    AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.

    Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.

    Installation:

    Create a conda environment: conda create --name redeval -c conda-forge python=3.11
    Activate the environment: conda activate redeval
    Install required packages: pip install -r requirements.txt
    Store API keys in the api_keys directory for use by the LLM-as-a-judge and the generate_responses.py script for closed-source models.

    Prompt Templates:

    Choose a prompt template for red-teaming:

    Chain of Utterances (CoU): effective at breaking safety guardrails.
    Chain of Thoughts (CoT)
    Standard prompt
    Suffix prompt

    Note: Different LLMs may require slight variations in the prompt template.

    How to Perform Red-Teaming:

    Step 0: Decide on the prompt template.
    Step 1: Generate model outputs on harmful questions by providing a path to the question bank and the red-teaming prompt.

  20. Distribution of emotion classes in topical chat dataset.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Deeksha Varshney; Asif Ekbal; Mrigank Tiwari; Ganesh Prasad Nagaraja (2023). Distribution of emotion classes in topical chat dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0280458.t002
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Deeksha Varshney; Asif Ekbal; Mrigank Tiwari; Ganesh Prasad Nagaraja
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of emotion classes in topical chat dataset.
