BigCodeBench is an easy-to-use benchmark for code generation with practical and challenging programming tasks¹. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting¹. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls¹.
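To make the task style concrete, here is a hypothetical example in the spirit of a BigCodeBench problem. It is not drawn from the dataset; the function name `task_func` and the instruction are invented for illustration. The point is that the prompt is a function signature plus a natural-language instruction, and a correct solution typically has to combine calls from more than one library.

```python
# Hypothetical example in the style of a BigCodeBench task (not taken from
# the dataset): the prompt is a signature plus an instruction, and the
# solution combines calls from several standard libraries.
import collections
import re

def task_func(text):
    """
    Split `text` into words (alphanumeric sequences, case-insensitive),
    count how often each word occurs, and return the counts as a
    collections.Counter.
    """
    words = re.findall(r"[a-zA-Z0-9]+", text.lower())
    return collections.Counter(words)

if __name__ == "__main__":
    print(task_func("The cat saw the dog; the dog ran."))
    # Counter({'the': 3, 'dog': 2, 'cat': 1, 'saw': 1, 'ran': 1})
```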
Here are some key features of BigCodeBench:
- Precise evaluation & ranking: it provides a leaderboard of the latest LLM rankings before and after rigorous evaluation¹.
- Pre-generated samples: BigCodeBench accelerates code-intelligence research by open-sourcing LLM-generated samples for a wide range of models¹.
- Execution environment: the execution environment in BigCodeBench is less constrained than EvalPlus, so it can support tasks with diverse library dependencies¹.
- Test evaluation: BigCodeBench relies on unittest to evaluate the generated code¹ (a simplified sketch of this mechanism follows the list).
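The sketch below shows how unittest-based scoring can work in principle. It is not the actual BigCodeBench harness, which additionally sandboxes execution, manages library dependencies, and enforces timeouts; the candidate solution and test class here are invented for illustration.

```python
# Simplified sketch of unittest-based scoring (not the real BigCodeBench
# harness): a model-generated solution is exec'd into a namespace, the
# task's unittest.TestCase is loaded into the same namespace, and the
# pass/fail outcome is read from the unittest result object.
import unittest

candidate_solution = """
def task_func(numbers):
    return sorted(set(numbers))
"""

test_code = """
import unittest

class TestTaskFunc(unittest.TestCase):
    def test_dedup_and_sort(self):
        self.assertEqual(task_func([3, 1, 2, 3]), [1, 2, 3])
"""

namespace = {}
exec(candidate_solution, namespace)   # load the generated code
exec(test_code, namespace)            # load the task's test suite

suite = unittest.TestLoader().loadTestsFromTestCase(namespace["TestTaskFunc"])
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("PASS" if result.wasSuccessful() else "FAIL")
```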
(1) GitHub - bigcode-project/bigcodebench: BigCodeBench: The Next .... https://github.com/bigcode-project/bigcodebench/.
Related datasets hosted on the Hugging Face Hub under the bigcode organization (a minimal loading sketch follows this list):
- bigcode/bigcodebench-hard
- bigcode/bigcodebench-hard-results
- bigcode/bigcodebench-perf
- bigcode/bigcodebench-hard-perf
- bigcode/bigcodebench-hard-solve-rate
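As a minimal sketch, any of the dataset names above can be pulled with the Hugging Face `datasets` library. The exact split and column names vary by release, so the code inspects the returned object rather than assuming a particular split.

```python
# Minimal sketch: load one of the datasets listed above from the Hub.
# Split/config names are version-dependent, so inspect the DatasetDict
# instead of hard-coding a split name.
from datasets import load_dataset

ds = load_dataset("bigcode/bigcodebench-hard")  # repo name as listed above
print(ds)                     # shows the available splits and their columns
first_split = next(iter(ds))
print(ds[first_split][0])     # peek at one record
```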
Evaluation dataset for HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task (arxiv.org/abs/2412.21199).