MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
iidai/codenet dataset hosted on Hugging Face and contributed by the HF Datasets community
License: https://choosealicense.com/licenses/cdla-permissive-2.0/
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/systemk/codenet.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for CodeNet-16K
Dataset Summary
The CodeNet-16K dataset consists of 16,500 Python attempts from the CodeNet dataset, which have been carefully filtered and deduplicated to create a high-quality dataset for code generation tasks. The dataset includes problem descriptions, input/output descriptions, and sample test cases for each problem.
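Filtering and deduplication of submission attempts is commonly done by hashing a normalized form of each source file. A minimal sketch of that idea follows; the normalization rule (stripping comments and blank lines) is an illustrative assumption, not the dataset's actual pipeline:

```python
import hashlib

def normalize(code: str) -> str:
    """Strip comments and blank lines so trivially different attempts collide."""
    lines = []
    for line in code.splitlines():
        line = line.split("#", 1)[0].rstrip()
        if line:
            lines.append(line)
    return "\n".join(lines)

def deduplicate(attempts):
    """Keep only the first attempt for each normalized-code hash."""
    seen, unique = set(), []
    for code in attempts:
        digest = hashlib.sha256(normalize(code).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique
```

Any stronger normalization (identifier renaming, AST canonicalization) would catch more near-duplicates at the cost of more false merges.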
Dataset Details
Dataset Sources
Repository: … See the full description on the dataset page: https://huggingface.co/datasets/sumukshashidhar-archive/CodeNet-16K.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Source-code-related tasks for machine learning have become important with the growing demand for software production. Our main goal with this dataset is to support bug detection and repair.
The dataset is based on the CodeNet project and contains Python code submissions for online coding competitions. The data is obtained by selecting consecutive attempts by a single user that resulted in fixing a buggy submission. The data is thus represented as code pairs, annotated with the diff and the error of each changed instruction. We have already tokenized all the source code files and kept the same format as in the original dataset.
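Annotating a buggy/fixed pair with its diff can be sketched with the standard library; this is illustrative only, and the dataset's actual annotation format may differ:

```python
import difflib

def annotate_pair(buggy: str, fixed: str) -> str:
    """Return the unified diff between a buggy submission and its fix."""
    diff = difflib.unified_diff(
        buggy.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile="buggy.py",
        tofile="fixed.py",
    )
    return "".join(diff)
```

Lines prefixed with `-` are the buggy instructions and lines prefixed with `+` are their fixes, which is the code-pair structure described above.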
CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
Our goal is to create a bug detection and repair pipeline for online coding competition problems.
This dataset was created by tiffie_1
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifact repository for the paper Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code, accepted at ICSE 2024, Lisbon, Portugal. Authors are Rangeet Pan*, Ali Reza Ibrahimzada*, Rahul Krishna, Divya Sankar, Lambert Pougeum Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand.
Install
This repository contains the source code for reproducing the results in our paper. Please start by cloning this repository:
git clone https://github.com/Intelligent-CAT-Lab/PLTranslationEmpirical
We recommend using a virtual environment for running the scripts. Please download conda 23.11.0 from this link. You can create a virtual environment using the following command:
conda create -n plempirical python=3.10.13
After creating the virtual environment, you can activate it using the following command:
conda activate plempirical
You can run the following command to make sure that you are using the correct version of Python:
python3 --version && pip3 --version
Dependencies
To install all software dependencies, please execute the following command:
pip3 install -r requirements.txt
As for hardware dependencies, we used 16 NVIDIA A100 GPUs with 80 GB of memory each for model inference. The models can be run on any combination of GPUs as long as the reader properly distributes the model weights across them. We did not perform weight distribution since we had enough memory (80 GB) per GPU.
Moreover, for compiling and testing the generated translations, we used Python 3.10, g++ 11, GCC Clang 14.0, Java 11, Go 1.20, Rust 1.73, and .Net 7.0.14 for Python, C++, C, Java, Go, Rust, and C#, respectively. Overall, we recommend using a machine with Linux OS and at least 32GB of RAM for running the scripts.
For running scripts of alternative approaches, you need to make sure you have installed C2Rust, CxGO, and Java2C# on your machine. Please refer to their repositories for installation instructions. For Java2C#, you need to create a .csproj file like below:
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>
</Project>
Dataset
We uploaded the dataset we used in our empirical study to Zenodo. The dataset is organized as follows:
CodeNet
AVATAR
Evalplus
Apache Commons-CLI
Click
Please download and unzip the dataset.zip file from Zenodo. After unzipping, you should see the following directory structure:
PLTranslationEmpirical
├── dataset
│   ├── codenet
│   ├── avatar
│   ├── evalplus
│   └── real-life-cli
├── ...
The structure of each dataset is as follows:
CodeNet & Avatar: Each directory in these datasets corresponds to a source language and includes two directories, Code and TestCases, for code snippets and test cases, respectively. Each code snippet has an id in its filename, and this id is used as a prefix for the test I/O files.
Evalplus: The source language code snippets follow a similar structure as CodeNet and Avatar. However, as a one time effort, we manually created the test cases in the target Java language inside a maven project, evalplus_java. To evaluate the translations from an LLM, we recommend moving the generated Java code snippets to the src/main/java directory of the maven project and then running the command mvn clean test surefire-report:report -Dmaven.test.failure.ignore=true to compile, test, and generate reports for the translations.
Real-life Projects: The real-life-cli directory represents two real-life CLI projects from Java and Python. These datasets only contain code snippets as files and no test cases. As mentioned in the paper, the authors manually evaluated the translations for these datasets.
Scripts
We provide bash scripts for reproducing our results in this work. First, we discuss the translation script. For doing translation with a model and dataset, first you need to create a .env file in the repository and add the following:
OPENAI_API_KEY=
LLAMA2_AUTH_TOKEN=
STARCODER_AUTH_TOKEN=
For example, the following command translates all Python -> Java code snippets in the codenet dataset with GPT-4, with top-k sampling k=50, top-p sampling p=0.95, and temperature=0.7 on GPU gpu_id=0:
bash scripts/translate.sh GPT-4 codenet Python Java 50 0.95 0.7 0
For CodeGeeX, the model weights need to be placed in the repository so that the directory structure looks as follows:
PLTranslationEmpirical
├── dataset
│   ├── codenet
│   ├── avatar
│   ├── evalplus
│   └── real-life-cli
├── CodeGeeX
│   ├── codegeex
│   │   ├── codegeex_13b.pt  # this file is the model weight
│   │   ├── ...
├── ...
You can run the following command to translate all Python -> Java code snippets in codenet dataset with the CodeGeeX while top-k sampling is k=50, top-p sampling is p=0.95, and temperature=0.2 on GPU gpu_id=0:
bash scripts/translate.sh CodeGeeX codenet Python Java 50 0.95 0.2 0
Similarly, for StarCoder:
bash scripts/translate.sh StarCoder codenet Python Java 50 0.95 0.2 0
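The 50 / 0.95 / 0.2 arguments above are the standard top-k, top-p (nucleus), and temperature sampling parameters. A minimal sketch of how they filter a next-token distribution (for intuition only, not the translation scripts' actual implementation):

```python
import math
import random

def sample_next(logits, k=50, p=0.95, temperature=0.2, rng=random):
    """Temperature-scale logits, keep the top-k tokens, then the smallest
    nucleus whose cumulative probability reaches p, and sample from it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [q / total for q in probs]
    # top-k: restrict to the k most likely token indices
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # top-p: keep the smallest prefix with cumulative mass >= p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

At temperature 0.2 the distribution is sharpened, so sampling is close to greedy; at 0.7 (used for repair above) more diverse candidates survive the nucleus cutoff.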
For the transpiler baselines (C2Rust, CxGO, Java2C#), run:
bash scripts/translate_transpiler.sh codenet C Rust c2rust fix_reports
bash scripts/translate_transpiler.sh codenet C Go cxgo fix_reports
bash scripts/translate_transpiler.sh codenet Java C# java2c# fix_reports
bash scripts/translate_transpiler.sh avatar Java C# java2c# fix_reports
To compile and test the generated translations, run:
bash scripts/test_avatar.sh Python Java GPT-4 fix_reports 1
bash scripts/test_codenet.sh Python Java GPT-4 fix_reports 1
bash scripts/test_evalplus.sh Python Java GPT-4 fix_reports 1
To repair unsuccessful translations (compile, runtime, or incorrect-output errors), run:
bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 compile
bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 runtime
bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 incorrect
To clean the raw generations of a model on a dataset, run:
bash scripts/clean_generations.sh StarCoder codenet
Please note that for the above commands, you can change the dataset and model name to execute the same thing for other datasets and models. Moreover, you can refer to /prompts for different vanilla and repair prompts used in our study.
Artifacts
Please download the artifacts.zip file from our Zenodo repository. We have organized the artifacts as follows:
RQ1 - Translations: This directory contains the translations from all LLMs and for all datasets. We have added an Excel file showing a detailed breakdown of the translation results.
RQ2 - Manual Labeling: This directory contains an Excel file with the manual labeling results for all translation bugs.
RQ3 - Alternative Approaches: This directory contains the translations from all alternative approaches (i.e., C2Rust, CxGO, Java2C#). We have added an Excel file showing a detailed breakdown of the translation results.
RQ4 - Mitigating Translation Bugs: This directory contains the fix results of GPT-4, StarCoder, CodeGen, and Llama 2. We have added an Excel file showing a detailed breakdown of the fix results.
Contact
We look forward to hearing your feedback. Please contact Rangeet Pan or Ali Reza Ibrahimzada for any questions or comments.
lhkhiem28/CodeNet dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CodeNet Compiler Errors (Re-compiled 2026)
Dataset Summary
This dataset contains source code submissions from Project CodeNet that fail to compile. Unlike the original dataset metadata (which reflects compiler versions from 2011–2020), this dataset was re-executed in a modern Debian environment (2026) to generate up-to-date compiler error messages. It is designed for research in:
Automated Program Repair (APR): Fixing compile-time errors. Compiler Error Explanation: … See the full description on the dataset page: https://huggingface.co/datasets/criyle/codenet-compile-errors.
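Re-generating up-to-date error messages amounts to re-running each submission through a current toolchain and capturing its stderr. A minimal sketch for Python sources follows; the function name and return convention are illustrative, not the dataset's actual schema:

```python
import subprocess
import sys

def compile_errors(path):
    """Byte-compile one submission with the current interpreter.
    Returns the compiler's stderr text on failure, or None if it compiles."""
    result = subprocess.run(
        [sys.executable, "-m", "py_compile", path],
        capture_output=True,
        text=True,
    )
    return result.stderr if result.returncode != 0 else None
```

For compiled languages the same pattern applies with `gcc`, `g++`, `javac`, etc. in place of `py_compile`, which is presumably how a modern-toolchain re-run would be scripted.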
Founded in 2019, CodeNet BizTech operates in the Edtech sector offering advanced digital marketing training and a range of development services including web, software, and app development. The company also provides digital marketing agency services, Android development, graphics design, content writing, and logo design. Its diverse offerings cater to the growing demand for digital skills and development solutions in the current market. CodeNet BizTech aims to equip individuals and businesses with essential tools and knowledge to succeed in the digital landscape.
didula-wso2/diverse-codenet dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
sumukshashidhar-archive/CodeNet-24K dataset hosted on Hugging Face and contributed by the HF Datasets community
Terms of use: https://www.reportaziende.it/termini_e_condizioni_d_uso_del_servizio
Historical revenue series and financial indicators analyzed using artificial intelligence.
ShijiaD/codenet-python-AST-NIT dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Introduction
This is the dataset Project_CodeNet_Python800 and Project_CodeNet_Java250 from Project CodeNet (arXiv). We are not the authors of Project CodeNet, but we are the authors of the Heterogeneous Directed Hypergraph Neural Network (HDHGN) in the paper Heterogeneous Directed Hypergraph Neural Network over Abstract Syntax Tree (AST) for Code Classification (official, arXiv). Our HDHGN model utilizes the Python800 and Java250 datasets. The original official dataset links Python800 and … See the full description on the dataset page: https://huggingface.co/datasets/qiankunmu/Project_CodeNet_Python800_and_Java250.
This dataset is extracted from CodeNet, Python only. I merged the data into one single table, including metadata, problem descriptions, and test input/output.
small: accepted status only
big: all statuses, including accepted
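Merging per-problem metadata, descriptions, and test I/O into one table can be sketched as a key join; the field names here are assumptions for illustration, not the dataset's actual schema:

```python
def merge_tables(metadata, descriptions, tests):
    """Join three problem-keyed mappings into one row per problem id."""
    rows = []
    for pid, meta in metadata.items():
        row = {"problem_id": pid, **meta}        # copy metadata columns
        row["description"] = descriptions.get(pid, "")
        row["tests"] = tests.get(pid, [])        # list of (input, output) pairs
        rows.append(row)
    return rows
```

The "small" split would then be produced by filtering rows on the status column before writing the table out.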
deepcopy/ds4sd-synth-code-net-small dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for CodeContests
Dataset Summary
CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode. It consists of programming problems from a variety of sources:
Site URL Source
Aizu https://judge.u-aizu.ac.jp CodeNet
AtCoder https://atcoder.jp CodeNet
CodeChef https://www.codechef.com description2code
Codeforces https://codeforces.com description2code and Codeforces
HackerEarth … See the full description on the dataset page: https://huggingface.co/datasets/Imandra/code_contests.