Dataset Card for "livebench/coding"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored… See the full description on the dataset page: https://huggingface.co/datasets/livebench/coding.
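As a quick orientation, a minimal sketch of loading this card's dataset with the Hugging Face `datasets` library (the split name is an assumption; check the dataset page for the actual configuration):

```python
from datasets import load_dataset

# Assumed split name; the dataset page above lists the real configurations.
coding = load_dataset("livebench/coding", split="test")
print(coding[0])  # inspect one question record
```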
https://choosealicense.com/licenses/cc/
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
🏠 Home Page • 💻 GitHub Repository • 🏆 Leaderboard
LiveCodeBench is a "live", continuously updating benchmark for holistically evaluating the code-related capabilities of LLMs. In particular, it evaluates LLMs across a range of capabilities, including code generation, self-repair, test output prediction, and code execution. This is the code generation scenario of LiveCodeBench. It is also… See the full description on the dataset page: https://huggingface.co/datasets/livecodebench/test_generation.
minimario/livecodebench-execute dataset hosted on Hugging Face and contributed by the HF Datasets community
PrimeIntellect/LiveCodeBench-v5 dataset hosted on Hugging Face and contributed by the HF Datasets community
xin1997/livecodebench-code-generation_all_only_input dataset hosted on Hugging Face and contributed by the HF Datasets community
test-gen/livecodebench dataset hosted on Hugging Face and contributed by the HF Datasets community
QAQAQAQAQ/LiveCodeBench-Pro dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LiveCodeBench-CPP: An Extension of LiveCodeBench for Contamination Free Evaluation in C++
Overview
LiveCodeBench-CPP includes 279 problems from the release_v5 of LiveCodeBench, covering the period from October 2024 to January 2025. These problems are sourced from AtCoder (175 problems) and LeetCode (104 problems).
AtCoder Problems: These require generated solutions to read inputs from standard input (stdin) and write outputs to standard output (stdout). For unit testing… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/LiveCodeBench-CPP.
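To make the stdin/stdout requirement concrete, here is a minimal Python sketch of how a harness might unit-test a compiled C++ solution against such problems (the binary path, test-case shape, and whitespace handling are illustrative assumptions, not the dataset's actual harness):

```python
# Hedged sketch: stdin/stdout unit testing for AtCoder-style problems.
import subprocess

def run_stdio_test(binary_path: str, stdin_text: str, expected_stdout: str,
                   timeout_s: float = 10.0) -> bool:
    """Feed the test input to the compiled solution and compare outputs."""
    result = subprocess.run(
        [binary_path],          # assumed path to the compiled C++ binary
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    # Compare ignoring trailing whitespace, a common judge convention.
    return result.stdout.strip() == expected_stdout.strip()

# Hypothetical test case:
# run_stdio_test("./solution", "5\n1 2 3 4 5\n", "15\n")
```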
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
We use the Stdio input/output format here. For example, for a task that computes the sum of a list, the input and output are in the following format: input = "5\n1 2 3 4 5\n", output = "15"
CodeContests and CodeForces use this format; however, MBPP and part of LiveCodeBench use a functional input/output format, such as assert sum_function([1, 2, 3, 4, 5]) == 15.
In this project, we have converted the functional format to the Stdio format to achieve consistency. Paper | Code… See the full description on the dataset page: https://huggingface.co/datasets/Gen-Verse/LiveCodeBench.
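To make the two formats concrete, a minimal sketch using the sum task from the example above (`sum_function` comes from the quoted assertion; the stdio wrapper is an illustration, not the project's actual conversion code):

```python
# Functional format: the solution is a function checked by assertions.
def sum_function(nums):
    return sum(nums)

assert sum_function([1, 2, 3, 4, 5]) == 15

# Stdio format: the same task reads from stdin and writes to stdout.
# Input "5\n1 2 3 4 5\n" yields output "15".
import sys

def main():
    data = sys.stdin.read().split()
    n = int(data[0])                        # first token: list length
    nums = [int(x) for x in data[1:1 + n]]  # next n tokens: the list
    print(sum(nums))

if __name__ == "__main__":
    main()
```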
Groq/LiveCodeBench-CodeGeneration dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Rano23/Livecodebench-subset-50 dataset hosted on Hugging Face and contributed by the HF Datasets community
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Version 2.0. Original dataset: https://huggingface.co/datasets/livecodebench/code_generation_lite. Translated to Thai by iApp Technology.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthia S1 27B LiveCodeBench Outputs
Done generating outputs. Evaluating now...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Visualization of Code Generation Task Case Samples
Dataset samples can be visualized in the Dataset Viewer. The sampling procedure is guided by the Elo distribution introduced in our method, as sketched below. The original dataset is release_v5 of livecodebench/code_generation_lite from Hugging Face. Samples/origin: 879/880.
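A loose sketch of what weighted, distribution-guided sampling of this kind could look like (the per-problem weights and the helper below are purely illustrative assumptions; the method's actual procedure is described in the paper):

```python
import numpy as np

def sample_guided(problems, weights, k, seed=0):
    """Weighted sampling without replacement: each problem's weight is
    derived from a target (e.g. Elo-based) distribution. Illustrative only."""
    rng = np.random.default_rng(seed)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()  # normalize weights to a probability distribution
    idx = rng.choice(len(problems), size=k, replace=False, p=p)
    return [problems[i] for i in idx]

# Hypothetical usage: 880 problems, keep 879, weights from an assumed
# Elo-based target density.
# sampled = sample_guided(problems, elo_weights, k=879)
```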
License
This repository is licensed under the Apache License 2.0.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data
Our training dataset consists of 24K problems paired with their test cases:
7.5K TACO Verified problems.
16K verified coding problems from PrimeIntellect's SYNTHETIC-1.
600 LiveCodeBench (v5) problems submitted between May 1, 2023 and July 31, 2024.
Our test dataset consists of:
LiveCodeBench (v5) problems between August 1, 2024 and February 1, 2025.
Codeforces problems from Qwen/CodeElo.
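A hedged sketch of how the LiveCodeBench date window above might be reproduced with the `datasets` library (the `version_tag` argument follows LiveCodeBench's documented loading pattern, but the `contest_date` field name and its date format are assumptions; verify against the actual schema):

```python
from datasets import load_dataset

# Assumption: release_v5 of the "lite" code-generation split carries a
# per-problem contest date we can window on.
lcb = load_dataset(
    "livecodebench/code_generation_lite",
    version_tag="release_v5",
    split="test",
)

def in_window(example, start="2024-08-01", end="2025-02-01"):
    # Assumed ISO-like date string; lexicographic comparison then works.
    d = str(example["contest_date"])[:10]
    return start <= d <= end

eval_problems = lcb.filter(in_window)
print(len(eval_problems), "problems in the evaluation window")
```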
Format
Each row in the dataset contains:
problem: The coding problem… See the full description on the dataset page: https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset.
mlfoundations-dev/OpenThinker-7B_eval_03-11-25_18-35-31_0981
Precomputed model outputs for evaluation.
Evaluation Results
Summary
| Metric | LiveCodeBench | AIME24 | AIME25 | AMC23 | GPQADiamond | MATH500 |
|--------|---------------|--------|--------|-------|-------------|---------|
| Accuracy | 38.9 | 32.0 | 24.0 | 71.0 | 29.8 | 83.0 |
LiveCodeBench
Average Accuracy: 38.94% ± 0.69%
Number of Runs: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 38.36% | 196 | 511 |
| 2 | 38.16% | 195 | 511 |
| 3 | 40.31% | 206 | 511 |
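The ± term is consistent with the standard error of the mean over the per-run accuracies above (sample standard deviation divided by √n); a quick check under that assumption:

```python
import statistics as st

runs = [38.36, 38.16, 40.31]  # per-run accuracies from the table above
mean = st.mean(runs)
stderr = st.stdev(runs) / len(runs) ** 0.5  # sample std / sqrt(n)
print(f"{mean:.2f}% ± {stderr:.2f}%")  # -> 38.94% ± 0.69%
```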
AIME24… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations-dev/OpenThinker-7B_eval_03-11-25_18-35-31_0981.
mlfoundations-dev/hero_run_2_fix_conversations_eval_03-18-25_01-58-28_0981
Precomputed model outputs for evaluation.
Evaluation Results
Summary
| Metric | LiveCodeBench | AIME24 | AIME25 | AMC23 | GPQADiamond | MATH500 |
|--------|---------------|--------|--------|-------|-------------|---------|
| Accuracy | 55.6 | 50.0 | 33.3 | 89.5 | 49.3 | 88.4 |
LiveCodeBench
Average Accuracy: 55.58% ± 0.79%
Number of Runs: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 54.99% | 281 | 511 |
| 2 | 57.14% | 292 | 511 |
| 3 | 54.60% | 279 | 511 |

… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations-dev/hero_run_2_fix_conversations_eval_03-18-25_01-58-28_0981.
mlfoundations-dev/a1_science_camel_biology_1744691454_eval_1331
Precomputed model outputs for evaluation.
Evaluation Results
Summary
| Metric | AIME24 | AMC23 | MATH500 | GPQADiamond | JEEBench | MMLUPro | LiveCodeBench | CodeElo |
|--------|--------|-------|---------|-------------|----------|---------|---------------|---------|
| Accuracy | 11.7 | 51.2 | 70.2 | 27.8 | 31.3 | 27.6 | 0.1 | 2.4 |
AIME24
Average Accuracy: 11.67% ± 1.51%
Number of Runs: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 16.67% | 5 | 30 |
| 2 | 3.33% | 1 | 30 |
| 3 | 13.33% | 4 | 30 |
| 4 | 6.67% | 2 | 30 |

… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations-dev/a1_science_camel_biology_1744691454_eval_1331.
mlfoundations-dev/herorun_1_1_3epoch_eval_03-18-25_00-30-48_0981
Precomputed model outputs for evaluation.
Evaluation Results
Summary
| Metric | LiveCodeBench | AIME24 | AIME25 | AMC23 | GPQADiamond | MATH500 |
|--------|---------------|--------|--------|-------|-------------|---------|
| Accuracy | 43.1 | 46.0 | 28.0 | 80.5 | 44.4 | 87.0 |
LiveCodeBench
Average Accuracy: 43.12% ± 1.03%
Number of Runs: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 45.01% | 230 | 511 |
| 2 | 41.49% | 212 | 511 |
| 3 | 42.86% | 219 | 511 |
AIME24… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations-dev/herorun_1_1_3epoch_eval_03-18-25_00-30-48_0981.
mlfoundations-dev/Light-R1-32B_1743569788_eval_0981
Precomputed model outputs for evaluation.
Evaluation Results
Summary
| Metric | AIME24 | AIME25 | AMC23 | MATH500 | GPQADiamond | LiveCodeBench |
|--------|--------|--------|-------|---------|-------------|---------------|
| Accuracy | 75.3 | 55.3 | 95.5 | 90.2 | 22.6 | 55.7 |
AIME24
Average Accuracy: 75.33% ± 2.42%
Number of Runs: 5

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 83.33% | 25 | 30 |
| 2 | 76.67% | 23 | 30 |
| 3 | 73.33% | 22 | 30 |
| 4 | 66.67% | 20 | 30 |
| 5 | 76.67% | 23 | 30 |

… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations-dev/Light-R1-32B_1743569788_eval_0981.