MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure they would not appear in the training sets of code generation models.
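A minimal sketch of loading the dataset with the Hugging Face `datasets` library and inspecting one problem; the field names (`task_id`, `prompt`, `entry_point`, `test`) follow the dataset card and are assumed here to be unchanged:

```python
# Sketch: load OpenAI HumanEval and look at one of the 164 problems.
from datasets import load_dataset

ds = load_dataset("openai/openai_humaneval", split="test")
print(len(ds))  # 164

sample = ds[0]
print(sample["task_id"])      # e.g. "HumanEval/0"
print(sample["prompt"])       # function signature + docstring
print(sample["entry_point"])  # name of the function the tests call
print(sample["test"])         # unit tests used to check a completion
```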
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code". It is used to measure the functional correctness of programs synthesized from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, some of which are comparable to simple software interview questions.
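Functional correctness in this setting is reported with the pass@k metric. Below is a short sketch of the unbiased estimator described in the paper, written from the published formula rather than copied from the released harness; `n` is the number of samples generated per problem and `c` the number that pass all unit tests.

```python
# Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), computed in a
# numerically stable product form, then averaged across problems.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for one problem, 37 of them pass all tests.
print(pass_at_k(200, 37, 1))   # ~0.185 (equals c / n for k = 1)
print(pass_at_k(200, 37, 10))  # higher, since any of 10 tries may succeed
```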
HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
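A hedged sketch of loading one language split, assuming the benchmark is hosted on the Hub as `THUDM/humaneval-x` with per-language configurations such as `"python"`, `"cpp"`, `"java"`, `"js"`, and `"go"`; the repository id, configuration names, and split name are assumptions, not taken from this description:

```python
# Sketch: load the Python split of HumanEval-X (identifiers assumed, see above).
# Newer versions of `datasets` may require trust_remote_code=True for
# script-based datasets.
from datasets import load_dataset

ds_py = load_dataset("THUDM/humaneval-x", "python", split="test")
print(len(ds_py))
print(ds_py[0].keys())
```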
Instruct HumanEval
Summary
InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring, and its header to create a flexible setting that allows the evaluation of instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that lets the model elicit its best capabilities. Here is an example of use. The prompt can be built as follows… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/instructhumaneval.
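A minimal sketch of assembling such a prompt, assuming the columns `instruction` and `context` described on the dataset card and placeholder chat delimiters (`<user>`, `<assistant>`) standing in for a specific model's instruction-tuning tokens:

```python
# Sketch: build an instruction prompt from one InstructHumanEval example.
# Column names and delimiters are assumptions; substitute the delimiters your
# instruction-tuned model was trained with.
from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval", split="test")
ex = ds[0]

user_token, assistant_token = "<user>", "<assistant>"
prompt = f"{user_token}\n{ex['instruction']}\n{assistant_token}\n{ex['context']}"
print(prompt)
```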
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Evaluation dataset for HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task (arxiv.org/abs/2412.21199).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure they would not appear in the training sets of code generation models.
Claude 2, developed by the rising startup Anthropic, is the most capable large language model generative AI on the current market. It reached a success ratio of ** percent on the HumanEval benchmark. This is particularly noteworthy because the evaluation is 0-shot, meaning none of the benchmarked AI programs had seen data of this sort or been trained on the tasks beforehand. In other words, Claude 2 was the quickest at absorbing and understanding the tasks given to it.
HeyixInn0/Reorganized-humaneval dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
📄 Paper • 🏠 Home Page • 💻 GitHub Repository • 🏆 Leaderboard • 🤗 Dataset Viewer
HumanEval-V is a novel benchmark designed to evaluate the diagram understanding and reasoning capabilities of Large Multimodal Models (LMMs) in programming contexts. Unlike existing benchmarks, HumanEval-V focuses on coding tasks that require sophisticated visual reasoning over… See the full description on the dataset page: https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark.
Dataset Card for "humaneval-mbpp-testgen-qa"
This dataset contains prompt-reply (question-answer) pairs in which the prompt asks for Python unit tests that test the functionality described in a specific docstring. The responses are the generated unit tests.
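A hypothetical illustration of what one such pair might look like; the field names and wording below are invented for illustration and are not drawn from the dataset itself:

```python
# Illustrative (made-up) prompt-reply pair in the style described above.
pair = {
    "prompt": (
        "Write a Python unit test for the functionality described by this "
        "docstring:\n\"\"\"Return the sum of two integers a and b.\"\"\""
    ),
    "response": (
        "import unittest\n\n"
        "class TestAdd(unittest.TestCase):\n"
        "    def test_add(self):\n"
        "        self.assertEqual(add(2, 3), 5)\n"
    ),
}
print(pair["response"])
```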
pierreqi/HumanEval-r dataset hosted on Hugging Face and contributed by the HF Datasets community
braindao/humaneval-for-solidity-25 dataset hosted on Hugging Face and contributed by the HF Datasets community
smoorsmith/humaneval dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for reproduction purposes (tabular data is stored using Apache's Parquet format).
Coding Benchmark Results
The coding benchmark results were obtained with the EvalPlus library.
| Model | HumanEval pass@1 | HumanEval+ pass@1 |
| --- | --- | --- |
| meta-llama_Meta-Llama-3.1-405B-Instruct | 67.3 | 67.5 |
| neuralmagic_Meta-Llama-3.1-405B-Instruct-W8A8-FP8 | 66.7 | 66.6 |
| neuralmagic_Meta-Llama-3.1-405B-Instruct-W4A16 | 66.5 | 66.4 |
| neuralmagic_Meta-Llama-3.1-405B-Instruct-W8A8-INT8 | 64.3 | 64.8 |
| neuralmagic_Meta-Llama-3.1-70B-Instruct-W8A8-FP8 | 58.1 | 57.7 |
| neuralmagic_Meta-Llama-3.1-70B-Instruct-W4A16 | 57.1 | … |

… See the full description on the dataset page: https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals.
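A minimal sketch of loading results like these from a Parquet file (as used for the reproduction data above) and comparing the two pass@1 columns; the file name and column names are assumptions chosen for illustration:

```python
# Sketch: read assumed Parquet results and compute the HumanEval+ vs HumanEval
# pass@1 delta per model. The schema here is hypothetical.
import pandas as pd

df = pd.read_parquet("coding_benchmark_results.parquet")
df["pass1_delta"] = df["humaneval_plus_pass@1"] - df["humaneval_pass@1"]
print(df.sort_values("humaneval_pass@1", ascending=False).head())
```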
macabdul9/HumanEval-Multimodal dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🔥 Mojo-Coder 🔥 State-of-the-art Language Model for Mojo Programming
🎯 Background and Motivation
Mojo programming language, developed by Modular, has emerged as a game-changing technology in high-performance computing and AI development. Despite its growing popularity and impressive capabilities (up to 68,000x faster than Python!), existing LLMs struggle with Mojo code generation. Mojo-Coder addresses this gap by providing specialized support for Mojo programming, built upon… See the full description on the dataset page: https://huggingface.co/datasets/md-nishat-008/HumanEval-Mojo.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Large Language Models (LLMs) have shown impressive capabilities in generating code, yet they often produce hallucinations—unfounded or incorrect outputs—that compromise the functionality of the generated code. This study investigates the application of local uncertainty quantification methods to detect hallucinations at the line level in code generated by LLMs. We focus on evaluating these methods in the context of two prominent code generation tasks, HumanEval and MBPP. We experiment with both open-source and black-box models. For each model, we generate code, calculate line-level uncertainty scores using various uncertainty quantification methods, and assess the correlation of these scores with the presence of hallucinations as identified by test case failures. Our empirical results are evaluated using metrics such as AUROC and AUPR to determine the effectiveness of these methods in detecting hallucinations, providing insights into their reliability and practical utility in enhancing the accuracy of code generation by LLMs.
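As a concrete illustration of the kind of method the study evaluates, here is a minimal sketch of one local uncertainty baseline: score each generated line by its mean token negative log-likelihood and measure with AUROC how well those scores separate lines implicated in test failures. The inputs (token log-probabilities and hallucination labels) are assumed to be available; this is not the paper's exact pipeline.

```python
# Sketch of a line-level uncertainty baseline and its AUROC evaluation.
import numpy as np
from sklearn.metrics import roc_auc_score

def line_uncertainty(token_logprobs_per_line):
    """Mean negative log-likelihood per line; higher means more uncertain."""
    return np.array([-np.mean(lps) for lps in token_logprobs_per_line])

# Toy example: three generated lines, the last one flagged by failing tests.
logprobs = [[-0.1, -0.2], [-0.05, -0.1, -0.3], [-2.0, -1.5]]
labels = np.array([0, 0, 1])          # 1 = line associated with a test failure
scores = line_uncertainty(logprobs)
print(roc_auc_score(labels, scores))  # 1.0 on this toy example
```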