MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they are not included in the training set of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
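For a quick start, the problems can be loaded directly from the Hugging Face Hub. The snippet below is a minimal sketch that assumes the `datasets` library is installed and uses the dataset's published field names (task_id, prompt, canonical_solution, test, entry_point).

```python
# Minimal sketch: load OpenAI HumanEval from the Hugging Face Hub
# (assumes `pip install datasets`).
from datasets import load_dataset

ds = load_dataset("openai/openai_humaneval", split="test")  # 164 problems

example = ds[0]
print(example["task_id"])       # e.g. "HumanEval/0"
print(example["prompt"])        # function signature + docstring
print(example["entry_point"])   # name of the function the unit tests call
print(example["test"])          # unit tests defining check(candidate)
```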
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
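As an illustration of how the benchmark might be consumed, the sketch below loads each language split. The Hub path THUDM/humaneval-x, the configuration names, and the split name are assumptions based on the public release, so adjust them if the hosting differs.

```python
# Illustrative sketch: load the per-language splits of HumanEval-X.
# The repository id "THUDM/humaneval-x" and config/split names are assumptions.
from datasets import load_dataset

for lang in ["python", "cpp", "java", "js", "go"]:
    ds = load_dataset("THUDM/humaneval-x", lang, split="test")
    print(lang, len(ds))  # 164 problems per language, 820 samples in total
```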
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The OpenAI HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. Each problem includes a function signature, docstring, body, and several unit tests, all handwritten to ensure they are not included in the training sets of code generation models. Each problem also specifies an entry point, the function the unit tests call, making it an ideal dataset for testing the ability of natural language processing and machine learning models to generate Python programs from scratch.
To use this dataset, simply download the zip file and extract it. The resulting directory will contain the following files:
- canonical_solution.py: The solution to the problem. (String)
- entry_point.py: The entry point for the problem. (String)
- prompt.txt: The prompt for the problem. (String)
- test.py: The unit tests for the problem.
- The dataset could be used to develop a model that generates programs from natural language.
- The dataset could be used to develop a model that completes or debugs programs.
- The dataset could be used to develop a model that writes unit tests for programs.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: test.csv

| Column name | Description |
|:--|:--|
| prompt | A natural language description of the programming problem. (String) |
| canonical_solution | The correct Python code solution to the problem. (String) |
| test | A set of unit tests that the generated code must pass in order to be considered correct. (String) |
| entry_point | The starting point for the generated code. (String) |
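To make the column roles concrete, the sketch below shows one way the fields could be combined into an executable check for a model-generated completion. The pandas-based loading and the plain exec call are illustrative assumptions, not the official evaluation harness, which sandboxes untrusted completions before running them.

```python
# Illustrative sketch (not the official harness): combine the test.csv columns
# into a runnable check for a candidate completion. Real evaluations should
# sandbox untrusted model output before executing it.
import pandas as pd

df = pd.read_csv("test.csv")
row = df.iloc[0]

candidate_completion = row["canonical_solution"]  # stand-in for model output

program = (
    row["prompt"]                        # signature + docstring
    + candidate_completion               # body of the function
    + "\n" + row["test"]                 # defines check(candidate)
    + f"\ncheck({row['entry_point']})"   # run the unit tests
)
exec(program, {})  # raises AssertionError if any unit test fails
```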
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they are not included in the training set of code generation models.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for HumanEvalPack
Dataset Summary
HumanEvalPack is an extension of OpenAI's HumanEval to cover 6 total languages across 3 tasks. The Python split is exactly the same as OpenAI's Python HumanEval. The other splits are translated by humans (similar to HumanEval-X but with additional cleaning, see here). Refer to the OctoPack paper for more details.
Languages: Python, JavaScript, Java, Go, C++, Rust
OctoPack 🐙🎒: Data: CommitPack, 4TB of GitHub commits… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/humanevalpack.
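As a rough sketch, the per-language splits might be loaded as follows; the configuration names ("python", "js", "java", "go", "cpp", "rust") are assumed to match the listed languages.

```python
# Illustrative sketch: load the Python split of HumanEvalPack.
# Config names are assumed to mirror the supported languages.
from datasets import load_dataset

ds = load_dataset("bigcode/humanevalpack", "python", split="test")
print(len(ds), ds.column_names)
```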
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Gen-Verse/HumanEval dataset hosted on Hugging Face and contributed by the HF Datasets community
Instruct HumanEval
Summary
InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring, and its header to create a flexible setting that allows the evaluation of instruction-tuned LLMs. The delimiters used during instruction tuning can be used to build an instruction that allows the model to elicit its best capabilities. Here is an example of use; the prompt can be built as follows… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/instructhumaneval.
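For illustration, an instruction-style prompt could be assembled from the extracted pieces roughly as follows. The field names ("signature", "docstring", "context") and the chat delimiters are assumptions for this sketch, not the dataset's exact schema; adapt them to your model's chat template.

```python
# Illustrative sketch: build an instruction prompt for an instruction-tuned model.
# Field names and the <|user|>/<|assistant|> delimiters are assumptions.
from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval", split="test")
ex = ds[0]

instruction = (
    f"Write a Python function {ex['signature']} that does the following:\n"
    f"{ex['docstring']}"
)
prompt = f"<|user|>\n{instruction}\n<|assistant|>\n{ex['context']}\n"
print(prompt)
```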
HeyixInn0/Reorganized-humaneval dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Evaluation dataset for HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task (arxiv.org/abs/2412.21199).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Inoichan
Released under MIT
eitanturok/humaneval-fix-starcoder dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BabelCode-HumanEval (BC-HumanEval) dataset converts the HumanEval dataset released by OpenAI to 16 programming languages.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
📄 Paper • 🏠 Home Page • 💻 GitHub Repository • 🏆 Leaderboard • 🤗 Dataset Viewer
HumanEval-V is a novel benchmark designed to evaluate the diagram understanding and reasoning capabilities of Large Multimodal Models (LMMs) in programming contexts. Unlike existing benchmarks, HumanEval-V focuses on coding tasks that require sophisticated visual reasoning over… See the full description on the dataset page: https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the replication package for the paper "The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models" by Fernando Vallecillos Ruiz, Max Hort, and Leon Moonen, accepted for the research track of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025). A preprint of the paper is included.
The source code is distributed under the MIT license, and except for 3rd party datasets that come with their own license, all documentation, data, models and results in this repository are distributed under the CC BY 4.0 license.
This repository contains the necessary scripts, data, and resources to replicate the experiments presented in our conference paper. The structure of this repository has been organized to facilitate ease of use for researchers interested in reproducing our results, conducting similar analyses, or building upon our work.
| Folder | Description |
|---|---|
| analysis | Contains Jupyter notebook scripts used to generate tables and visual analyses. These scripts assist in visualizing results, comparing metrics, and summarizing data from the experiments. The outputs can be easily exported for further use. |
| apr_training | Contains the dataset used for the Automated Program Repair (APR) training phase. This data is utilized by the scripts in train_src/ for fine-tuning the models. |
| benchmarks | Includes JSON files representing different benchmarks, specifically HumanEval-Java and Defects4J. In this work, we have primarily focused on and revised HumanEval-Java. |
| inference_and_validation_src | Contains Python scripts used to generate patches and validate them across different benchmarks. These scripts play a critical role in producing and assessing model outputs. |
| inference_scripts | Bash scripts used to automate the process of submitting inference and validation jobs to the compute cluster. This facilitates multiple iterations of inference and validation in a streamlined manner. |
| models* | Stores the fine-tuned machine learning models used in the experiments. These models are the output of the fine-tuning process and are referenced by the inference scripts. |
| results | Contains all the outputs from the models in JSON format, generated during the inference process. These files represent the raw experimental results. |
| train_src | Python scripts for model fine-tuning. These scripts include methods for performing both full model training and LoRA fine-tuning for parameter-efficient updates. |
| validation_benchmark_dataset | Contains the benchmark datasets used during validation. |
* Note that all contents except for the model files from the models/ folder are included in the compressed zip file in this Zenodo repository. The model files are uploaded separately to the repository to facilitate individual downloads, as several of them are relatively large (9.5-11.2 GB).
analysis/
This folder contains Jupyter notebook scripts used to generate tables and visual analyses of the experimental data. These scripts are designed to assist in visualizing results, comparing performance metrics, and summarizing experimental outcomes. Researchers can easily export the generated tables to spreadsheets for further processing or visualization. The outputs help in validating the experiment's consistency and provide insights into the performance of various model configurations.
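As a hedged illustration of that workflow, a notebook cell could summarize the JSON outputs and export the table for a spreadsheet. The results/ layout and the "patches"/"plausible" keys below are assumptions used only for this example, not the repository's actual schema.

```python
# Illustrative sketch: summarize JSON result files and export them to CSV.
# Directory layout and JSON keys are assumptions for this example.
import json
from pathlib import Path

import pandas as pd

rows = []
for path in Path("results").glob("*.json"):
    data = json.loads(path.read_text())
    patches = data.get("patches", [])
    rows.append({
        "run": path.stem,
        "num_patches": len(patches),
        "num_plausible": sum(p.get("plausible", False) for p in patches),
    })

summary = pd.DataFrame(rows).sort_values("run")
summary.to_csv("analysis_summary.csv", index=False)  # open in any spreadsheet tool
print(summary)
```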
inference_and_validation_src/
The Python scripts in this folder are used for generating patches and validating them against predefined benchmarks. We utilize the "Fire" library to parse parameters and execute the relevant methods efficiently.
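As a sketch of how a Fire-based entry point typically looks (the function name and parameters here are hypothetical, not the repository's actual interface):

```python
# Illustrative sketch of a Fire-based CLI entry point; the function name and
# parameters are hypothetical, not the repository's actual interface.
import fire


def generate_patches(model_path: str, benchmark: str, num_outputs: int = 10):
    """Generate candidate patches for every bug in the given benchmark."""
    print(f"Loading {model_path}; generating {num_outputs} patches per bug on {benchmark}")
    # ... model loading and patch generation would happen here ...


if __name__ == "__main__":
    # Exposes the function as a CLI, e.g.:
    #   python generate.py --model_path=models/my-model --benchmark=HumanEval-Java
    fire.Fire(generate_patches)
```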
train_src/
This folder contains the scripts used for model fine-tuning:
- full_finetune.py: This script performs full fine-tuning of a model on a given training dataset. It updates all trainable parameters to achieve optimal model performance on the target task.
- lora_finetune.py: This script implements LoRA (Low-Rank Adaptation) fine-tuning. LoRA is a parameter-efficient fine-tuning approach in which only a smaller subset of model parameters is updated, making it effective for resource-constrained tasks.
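For readers unfamiliar with LoRA, the following is a minimal sketch of parameter-efficient fine-tuning with the Hugging Face peft library. The base model, target modules, and hyperparameters are illustrative assumptions, not the exact configuration used in lora_finetune.py.

```python
# Minimal LoRA sketch with Hugging Face `peft`; model name, target modules,
# and hyperparameters are illustrative assumptions, not the settings used
# in lora_finetune.py.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")  # assumed base model

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (model-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```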
inference_scripts/
These Bash scripts are designed to automate the inference process by submitting multiple iterations of inference and validation jobs to the compute cluster. The scripts create job dependencies, ensuring that all necessary tasks are completed in a logical sequence.
The available inference scripts include:
- model_inferencing_adjustable_FULL_d4j_big.sh: Executes inference for specified model configurations with multiple iterations and outputs per iteration.
- model_inferencing_adjustable_FULL_d4j_lora_big.sh: Similar to the previous script, but optimized for LoRA-based models.

These scripts accept three parameters, one of which selects the model from the models/ folder.

We hope this package serves as a useful resource for reproducing and expanding upon our research results. Please cite this work by referring to the published paper:
Fernando Vallecillos Ruiz, Max Hort, and Leon Moonen, 2025. The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models. In proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025), ACM, 12 pages.
@inproceedings{ruiz2025:art,
title = {{The Art of Repair: Optimizing Iterative Program Repair with
Instruction-Tuned Models}},
author = {Ruiz, Fernando Vallecillos and Hort, Max and Moonen, Leon},
booktitle = {{Proceedings of the 29th International Conference on Evaluation
and Assessment in Software Engineering (EASE)}},
year = {2025},
pages = {12},
publisher = {{ACM}},
language = {en}
}
The replication package is archived on Zenodo with DOI: 10.5281/zenodo.15294695.
braindao/humaneval-for-solidity-25 dataset hosted on Hugging Face and contributed by the HF Datasets community
Here you can find the solutions generated by the Code Llama models for the HumanEval and MultiPL-E benchmarks used in the Big Code Models Leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard.
meng-lab/AdaDecode-CodeLlama-13B-Instruct-HumanEval dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "humaneval-mbpp-codegen-qa"
This dataset contains prompt-reply (question-answer) pairs where the prompt asks to create a Python function that satisfies the functionality described in a specified docstring, and the responses are the generated functions.
Leon-Leee/humaneval dataset hosted on Hugging Face and contributed by the HF Datasets community