83 datasets found
  1. openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAI (https://openai.com/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure they were not included in the training sets of code generation models.

      Supported Tasks and Leaderboards
    
    
    
    
    
      Languages
    

    The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
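
    As a quick illustration of the structure described above, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch, assuming the standard HumanEval field names (task_id, prompt, canonical_solution, test, entry_point):

    from datasets import load_dataset

    # Load the single "test" split containing the 164 handwritten problems.
    humaneval = load_dataset("openai/openai_humaneval", split="test")
    print(len(humaneval))  # 164

    problem = humaneval[0]
    print(problem["task_id"])      # e.g. "HumanEval/0"
    print(problem["prompt"])       # function signature plus docstring shown to the model
    print(problem["entry_point"])  # name of the function the unit tests call
    print(problem["test"])         # unit tests used to check functional correctness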

  2. HumanEval Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jul 16, 2021
    Cite
    Mark Chen; Jerry Tworek; Heewoo Jun; Qiming Yuan; Henrique Ponde de Oliveira Pinto; Jared Kaplan; Harri Edwards; Yuri Burda; Nicholas Joseph; Greg Brockman; Alex Ray; Raul Puri; Gretchen Krueger; Michael Petrov; Heidy Khlaaf; Girish Sastry; Pamela Mishkin; Brooke Chan; Scott Gray; Nick Ryder; Mikhail Pavlov; Alethea Power; Lukasz Kaiser; Mohammad Bavarian; Clemens Winter; Philippe Tillet; Felipe Petroski Such; Dave Cummings; Matthias Plappert; Fotios Chantzis; Elizabeth Barnes; Ariel Herbert-Voss; William Hebgen Guss; Alex Nichol; Alex Paino; Nikolas Tezak; Jie Tang; Igor Babuschkin; Suchir Balaji; Shantanu Jain; William Saunders; Christopher Hesse; Andrew N. Carr; Jan Leike; Josh Achiam; Vedant Misra; Evan Morikawa; Alec Radford; Matthew Knight; Miles Brundage; Mira Murati; Katie Mayer; Peter Welinder; Bob McGrew; Dario Amodei; Sam McCandlish; Ilya Sutskever; Wojciech Zaremba (2021). HumanEval Dataset [Dataset]. https://paperswithcode.com/dataset/humaneval
    Dataset updated
    Jul 16, 2021
    Authors
    Mark Chen; Jerry Tworek; Heewoo Jun; Qiming Yuan; Henrique Ponde de Oliveira Pinto; Jared Kaplan; Harri Edwards; Yuri Burda; Nicholas Joseph; Greg Brockman; Alex Ray; Raul Puri; Gretchen Krueger; Michael Petrov; Heidy Khlaaf; Girish Sastry; Pamela Mishkin; Brooke Chan; Scott Gray; Nick Ryder; Mikhail Pavlov; Alethea Power; Lukasz Kaiser; Mohammad Bavarian; Clemens Winter; Philippe Tillet; Felipe Petroski Such; Dave Cummings; Matthias Plappert; Fotios Chantzis; Elizabeth Barnes; Ariel Herbert-Voss; William Hebgen Guss; Alex Nichol; Alex Paino; Nikolas Tezak; Jie Tang; Igor Babuschkin; Suchir Balaji; Shantanu Jain; William Saunders; Christopher Hesse; Andrew N. Carr; Jan Leike; Josh Achiam; Vedant Misra; Evan Morikawa; Alec Radford; Matthew Knight; Miles Brundage; Mira Murati; Katie Mayer; Peter Welinder; Bob McGrew; Dario Amodei; Sam McCandlish; Ilya Sutskever; Wojciech Zaremba
    Description

    This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code". It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, some of which are comparable to simple software interview questions.
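
    The harness scores model samples with the pass@k metric; below is a minimal sketch of the unbiased pass@k estimator given in that paper, where n is the number of samples generated for a problem and c is the number of those samples that pass all unit tests:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 200 samples drawn for a problem, 37 of which pass its tests.
    print(pass_at_k(n=200, c=37, k=1))
    print(pass_at_k(n=200, c=37, k=100))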

  3. HumanEval-X Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 9, 2025
    + more versions
    Cite
    Qinkai Zheng; Xiao Xia; Xu Zou; Yuxiao Dong; Shan Wang; Yufei Xue; Zihan Wang; Lei Shen; Andi Wang; Yang Li; Teng Su; Zhilin Yang; Jie Tang (2025). HumanEval-X Dataset [Dataset]. https://paperswithcode.com/dataset/humaneval-x
    Dataset updated
    Jun 9, 2025
    Authors
    Qinkai Zheng; Xiao Xia; Xu Zou; Yuxiao Dong; Shan Wang; Yufei Xue; Zihan Wang; Lei Shen; Andi Wang; Yang Li; Teng Su; Zhilin Yang; Jie Tang
    Description

    HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
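
    A hedged sketch of loading one language subset with the Hugging Face datasets library; the hub path "THUDM/humaneval-x" and the config names below are assumptions, so check the dataset page for the exact identifiers:

    from datasets import load_dataset

    # 164 problems per language, 820 samples in total across the five languages.
    for lang in ["python", "cpp", "java", "js", "go"]:
        subset = load_dataset("THUDM/humaneval-x", lang, split="test")
        print(lang, len(subset))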

  4. HumanEval-ET Dataset

    • paperswithcode.com
    Updated Jun 14, 2023
    Cite
    (2023). HumanEval-ET Dataset [Dataset]. https://paperswithcode.com/dataset/humaneval-et
    Dataset updated
    Jun 14, 2023
    Description

    Extended test cases for HumanEval, along with generated code.

  5. instructhumaneval

    • huggingface.co
    • opendatalab.com
    Updated Jun 29, 2023
    Cite
    CodeParrot (2023). instructhumaneval [Dataset]. https://huggingface.co/datasets/codeparrot/instructhumaneval
    Dataset updated
    Jun 29, 2023
    Dataset authored and provided by
    CodeParrot
    Description

    Instruct HumanEval

      Summary
    

    InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring, and its header to create a flexible setting that allows evaluating instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that lets the model elicit its best capabilities. Here is an example of use; the prompt can be built as follows… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/instructhumaneval.
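
    A hedged sketch of the prompt-building idea described above; the column names (signature, docstring, context) and the chat delimiters are illustrative assumptions, so substitute the fields and delimiters your instruction-tuned model actually expects:

    from datasets import load_dataset

    ds = load_dataset("codeparrot/instructhumaneval", split="test")
    example = ds[0]

    # Turn the extracted signature and docstring into a natural-language instruction.
    instruction = (
        f"Write a Python function with the signature `{example['signature']}` "
        f"that does the following:\n{example['docstring']}"
    )

    # Wrap the instruction in the delimiters used during instruction tuning
    # (<|user|> / <|assistant|> are placeholders here) and prime the answer with
    # the function header so the model completes the body.
    prompt = f"<|user|>\n{instruction}\n<|assistant|>\n{example['context']}\n"
    print(prompt)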

  6. humaneval-pro

    • huggingface.co
    Updated Dec 31, 2024
    + more versions
    Cite
    CodeEval-Pro (2024). humaneval-pro [Dataset]. https://huggingface.co/datasets/CodeEval-Pro/humaneval-pro
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    CodeEval, Inc.
    Authors
    CodeEval-Pro
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Evaluation dataset for HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task (arxiv.org/abs/2412.21199).

  7. openai-humaneval

    • opendatalab.com
    zip
    Updated Dec 16, 2023
    Cite
    Anthropic AI (2023). openai-humaneval [Dataset]. https://opendatalab.com/OpenDataLab/openai-humaneval
    Available download formats: zip
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    OpenAI (https://openai.com/)
    Zipline
    Anthropic AI
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure they were not included in the training sets of code generation models.

  8. Comparison in capability with HumanEval benchmark for generative AI programs...

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Comparison in capability with HumanEval benchmark for generative AI programs 2023 [Dataset]. https://www.statista.com/statistics/1447778/humaneval-benchmark-comparison-of-major-ai-programs/
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    Worldwide
    Description

    Claude 2, developed by the rising startup Anthropic, is the most capable generative AI large language model on the current market. It reached a success ratio of ** percent on the HumanEval benchmark. This is particularly noteworthy because it was a 0-shot evaluation, meaning none of the benchmarked AI programs had seen data of this sort or been trained on the tasks beforehand. Claude 2 was therefore the quickest at absorbing and understanding the task given to it.

  9. Reorganized-humaneval

    • huggingface.co
    Updated Mar 11, 2025
    + more versions
    Cite
    Yixin He (2025). Reorganized-humaneval [Dataset]. https://huggingface.co/datasets/HeyixInn0/Reorganized-humaneval
    Dataset updated
    Mar 11, 2025
    Authors
    Yixin He
    Description

    HeyixInn0/Reorganized-humaneval dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. HumanEval-V-Benchmark

    • huggingface.co
    Updated May 2, 2025
    Cite
    HumanEval-V (2025). HumanEval-V-Benchmark [Dataset]. https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    HumanEval-V
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks

    📄 Paper • 🏠 Home Page • 💻 GitHub Repository • 🏆 Leaderboard • 🤗 Dataset Viewer

    HumanEval-V is a novel benchmark designed to evaluate the diagram understanding and reasoning capabilities of Large Multimodal Models (LMMs) in programming contexts. Unlike existing benchmarks, HumanEval-V focuses on coding tasks that require sophisticated visual reasoning over… See the full description on the dataset page: https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark.

  11. humaneval-mbpp-testgen-qa

    • huggingface.co
    Updated Apr 2, 2023
    + more versions
    Cite
    Oliver Stanley (2023). humaneval-mbpp-testgen-qa [Dataset]. https://huggingface.co/datasets/OllieStanley/humaneval-mbpp-testgen-qa
    Dataset updated
    Apr 2, 2023
    Authors
    Oliver Stanley
    Description

    Dataset Card for "humaneval-mbpp-testgen-qa"

    This dataset contains prompt-reply (question-answer) pairs in which the prompt asks for Python unit tests that test the functionality described in a specific docstring. The responses are the generated unit tests.

  12. HumanEval-r

    • huggingface.co
    Updated May 3, 2025
    + more versions
    Cite
    Pierre QI (2025). HumanEval-r [Dataset]. https://huggingface.co/datasets/pierreqi/HumanEval-r
    Dataset updated
    May 3, 2025
    Authors
    Pierre QI
    Description

    pierreqi/HumanEval-r dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. humaneval-for-solidity-25

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    BrainDAO (2025). humaneval-for-solidity-25 [Dataset]. https://huggingface.co/datasets/braindao/humaneval-for-solidity-25
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    BrainDAO
    Description

    braindao/humaneval-for-solidity-25 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    Samuel Moor-Smith (2022). humaneval [Dataset]. https://huggingface.co/datasets/smoorsmith/humaneval
    Dataset updated
    Jan 1, 2022
    Authors
    Samuel Moor-Smith
    Description

    smoorsmith/humaneval dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. humaneval

    • huggingface.co
    Cite
    abhaygupta, humaneval [Dataset]. https://huggingface.co/datasets/abhaygupta1266/humaneval
    Authors
    abhaygupta
    Description

    abhaygupta1266/humaneval dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. Data Set: Promoting Open Science in Test-Driven Software Experiments

    • zenodo.org
    bin
    Updated Feb 5, 2024
    Cite
    Marcus Kessel; Colin Atkinson (2024). Data Set: Promoting Open Science in Test-Driven Software Experiments [Dataset]. http://doi.org/10.5281/zenodo.8208246
    Available download formats: bin
    Dataset updated
    Feb 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marcus Kessel; Colin Atkinson
    License

    Attribution 4.0, CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Data set for reproduction purposes (tabular data is stored using Apache's Parquet format).
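
    Since the tabular data ships as Parquet, it can typically be read directly with pandas; a minimal sketch with a hypothetical file name:

    import pandas as pd

    # "results.parquet" is a placeholder; use the actual file names from the archive.
    df = pd.read_parquet("results.parquet")  # requires pyarrow or fastparquet
    print(df.shape)
    print(df.head())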

  17. quantized-llama-3.1-humaneval-evals

    • huggingface.co
    Updated Oct 10, 2024
    Cite
    Neural Magic (2024). quantized-llama-3.1-humaneval-evals [Dataset]. https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals
    Dataset updated
    Oct 10, 2024
    Dataset authored and provided by
    Neural Magic
    Description

    Coding Benchmark Results

    The coding benchmark results were obtained with the EvalPlus library.

    Model                                               HumanEval pass@1   HumanEval+ pass@1
    meta-llama_Meta-Llama-3.1-405B-Instruct             67.3               67.5
    neuralmagic_Meta-Llama-3.1-405B-Instruct-W8A8-FP8   66.7               66.6
    neuralmagic_Meta-Llama-3.1-405B-Instruct-W4A16      66.5               66.4
    neuralmagic_Meta-Llama-3.1-405B-Instruct-W8A8-INT8  64.3               64.8
    neuralmagic_Meta-Llama-3.1-70B-Instruct-W8A8-FP8    58.1               57.7
    neuralmagic_Meta-Llama-3.1-70B-Instruct-W4A16       57.1               …

    … See the full description on the dataset page: https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals.
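
    A hedged sketch of how samples are typically prepared for EvalPlus scoring, which then reports both HumanEval and HumanEval+ pass@1; the helpers get_human_eval_plus and write_jsonl follow EvalPlus's documented workflow but should be verified against the library before use:

    from evalplus.data import get_human_eval_plus, write_jsonl

    def generate_one_completion(prompt: str) -> str:
        # Placeholder: call the model under evaluation and return its completion.
        return "    pass\n"

    # One generated solution per problem, written in the JSONL format the
    # EvalPlus evaluator consumes.
    samples = [
        dict(task_id=task_id,
             solution=problem["prompt"] + generate_one_completion(problem["prompt"]))
        for task_id, problem in get_human_eval_plus().items()
    ]
    write_jsonl("samples.jsonl", samples)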

  18. HumanEval-Multimodal

    • huggingface.co
    Updated May 29, 2025
    + more versions
    Cite
    Abdul Waheed (2025). HumanEval-Multimodal [Dataset]. https://huggingface.co/datasets/macabdul9/HumanEval-Multimodal
    Dataset updated
    May 29, 2025
    Authors
    Abdul Waheed
    Description

    macabdul9/HumanEval-Multimodal dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. HumanEval-Mojo

    • huggingface.co
    Updated Oct 27, 2024
    + more versions
    Cite
    Nishat Raihan (2024). HumanEval-Mojo [Dataset]. https://huggingface.co/datasets/md-nishat-008/HumanEval-Mojo
    Dataset updated
    Oct 27, 2024
    Authors
    Nishat Raihan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🔥 Mojo-Coder 🔥 State-of-the-art Language Model for Mojo Programming

    🎯 Background and Motivation

    Mojo programming language, developed by Modular, has emerged as a game-changing technology in high-performance computing and AI development. Despite its growing popularity and impressive capabilities (up to 68,000x faster than Python!), existing LLMs struggle with Mojo code generation. Mojo-Coder addresses this gap by providing specialized support for Mojo programming, built upon… See the full description on the dataset page: https://huggingface.co/datasets/md-nishat-008/HumanEval-Mojo.

  20. NAACL'2025 Artifact

    • figshare.com
    zip
    Updated Oct 16, 2024
    Cite
    Yuzhe Ou (2024). NAACL'2025 Artifact [Dataset]. http://doi.org/10.6084/m9.figshare.27241386.v1
    Available download formats: zip
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yuzhe Ou
    License

    Attribution 4.0, CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Large Language Models (LLMs) have shown impressive capabilities in generating code, yet they often produce hallucinations—unfounded or incorrect outputs—that compromise the functionality of the generated code. This study investigates the application of local uncertainty quantification methods to detect hallucinations at the line level in code generated by LLMs. We focus on evaluating these methods in the context of two prominent code generation tasks, HumanEval and MBPP. We experiment with both open-source and black-box models. For each model, we generate code, calculate line-level uncertainty scores using various uncertainty quantification methods, and assess the correlation of these scores with the presence of hallucinations as identified by test case failures. Our empirical results are evaluated using metrics such as AUROC and AUPR to determine the effectiveness of these methods in detecting hallucinations, providing insights into their reliability and practical utility in enhancing the accuracy of code generation by LLMs.
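
    A minimal sketch of the scoring step described above, correlating line-level uncertainty scores with hallucination labels via AUROC and AUPR; the arrays are illustrative placeholders rather than the artifact's actual data:

    from sklearn.metrics import roc_auc_score, average_precision_score

    # One uncertainty score per generated line (higher = more uncertain) and a
    # binary label marking whether the line was judged hallucinated, e.g. because
    # the surrounding function failed its test cases.
    uncertainty_scores = [0.91, 0.12, 0.55, 0.08, 0.73]
    hallucination_labels = [1, 0, 1, 0, 0]

    auroc = roc_auc_score(hallucination_labels, uncertainty_scores)
    aupr = average_precision_score(hallucination_labels, uncertainty_scores)
    print(f"AUROC: {auroc:.3f}  AUPR: {aupr:.3f}")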
