License: MIT (https://opensource.org/licenses/MIT)
iidai/codenet dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CDLA-Permissive-2.0 (https://choosealicense.com/licenses/cdla-permissive-2.0/)
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
Curated by: [More Information Needed]
Funded by [optional]: [More Information Needed]
Shared by [optional]: [More Information Needed]
Language(s) (NLP): [More Information Needed]
License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/systemk/codenet.
License: Apache-2.0 (https://www.apache.org/licenses/LICENSE-2.0)
📊 Dataset Card for 🏆 CodeNet-16K
Dataset Summary
The 🏆 CodeNet-16K dataset consists of 16,500 Python attempts from the CodeNet dataset, which have been carefully filtered and deduplicated to create a high-quality dataset for code generation tasks. The dataset includes problem descriptions, input/output descriptions, and sample test cases for each problem.
Dataset Details
Dataset Sources
Repository:… See the full description on the dataset page: https://huggingface.co/datasets/sumuks/CodeNet-16K.
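As a usage sketch (not from the dataset card), the dataset can presumably be loaded with the Hugging Face datasets library; the split name and schema below are assumptions, so inspect the keys before relying on specific fields:

from datasets import load_dataset

# A minimal sketch: load CodeNet-16K from the Hugging Face Hub.
# The split name "train" is an assumption; check the dataset page first.
ds = load_dataset("sumuks/CodeNet-16K", split="train")

example = ds[0]
print(example.keys())  # inspect the actual schema before using specific fields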
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Source-code-related machine learning tasks have become important as the demand for software production grows. The main goal of this dataset is to support bug detection and repair.
The dataset is based on the CodeNet project and contains Python code submissions to online coding competitions. The data were obtained by selecting consecutive attempts by a single user that resulted in fixing a buggy submission. Each data point is therefore a code pair, annotated with the diff and the error of each changed instruction. All source code files are pre-tokenized, and the format of the original dataset is preserved.
CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
Our goal is to create a bug detection and repair pipeline for online coding competition problems.
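As a rough illustration of how such annotated code pairs might be consumed, here is a sketch; the file name and the record layout (buggy/fixed sources plus per-instruction diff and error annotations) are assumptions based on the description above, not a documented schema:

import json

# A hedged sketch: iterate buggy/fixed submission pairs.
# "code_pairs.jsonl" and the field names below are hypothetical.
with open("code_pairs.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        buggy, fixed = pair["buggy"], pair["fixed"]
        for change in pair["diff"]:  # one entry per changed instruction
            print(change.get("error"), change)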
This dataset was created by Diana Vostrova
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Artifact repository for the paper Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code, accepted at ICSE 2024, Lisbon, Portugal. Authors are Rangeet Pan*, Ali Reza Ibrahimzada*, Rahul Krishna, Divya Sankar, Lambert Pougeum Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand.
Install
This repository contains the source code for reproducing the results in our paper. Please start by cloning this repository:
git clone https://github.com/Intelligent-CAT-Lab/PLTranslationEmpirical
We recommend using a virtual environment for running the scripts. Please download conda 23.11.0 from this link. You can create a virtual environment using the following command:
conda create -n plempirical python=3.10.13
After creating the virtual environment, you can activate it using the following command:
conda activate plempirical
You can run the following command to make sure that you are using the correct version of Python:
python3 --version && pip3 --version
Dependencies
To install all software dependencies, please execute the following command:
pip3 install -r requirements.txt
As for hardware dependencies, we used 16 NVIDIA A100 GPUs with 80 GB of memory each for running inference with the models. The models can be run on any combination of GPUs as long as the reader properly distributes the model weights across them. We did not perform weight distribution since we had enough memory (80 GB) per GPU.
Moreover, for compiling and testing the generated translations, we used Python 3.10, g++ 11, GCC Clang 14.0, Java 11, Go 1.20, Rust 1.73, and .Net 7.0.14 for Python, C++, C, Java, Go, Rust, and C#, respectively. Overall, we recommend using a machine with a Linux OS and at least 32 GB of RAM for running the scripts.
For running scripts of alternative approaches, you need to make sure you have installed C2Rust, CxGO, and Java2C# on your machine. Please refer to their repositories for installation instructions. For Java2C#, you need to create a .csproj file like the one below:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>
</Project>
Dataset
We uploaded the dataset we used in our empirical study to Zenodo. The dataset is organized as follows:
CodeNet
AVATAR
Evalplus
Apache Commons-CLI
Click
Please download and unzip the dataset.zip file from Zenodo. After unzipping, you should see the following directory structure:
PLTranslationEmpirical
├── dataset
│   ├── codenet
│   ├── avatar
│   ├── evalplus
│   ├── real-life-cli
├── ...
The structure of each dataset is as follows:
CodeNet & Avatar: Each directory in these datasets corresponds to a source language, and each contains two directories, Code and TestCases, holding code snippets and test cases, respectively. Each code snippet has an id in its filename, and that id is used as a prefix for its test I/O files.
Evalplus: The source language code snippets follow a similar structure to CodeNet and Avatar. However, as a one-time effort, we manually created the test cases in the target Java language inside a Maven project, evalplus_java. To evaluate the translations from an LLM, we recommend moving the generated Java code snippets to the src/main/java directory of the Maven project and then running the command mvn clean test surefire-report:report -Dmaven.test.failure.ignore=true to compile, test, and generate reports for the translations.
Real-life Projects: The real-life-cli directory represents two real-life CLI projects from Java and Python. These datasets only contain code snippets as files and no test cases. As mentioned in the paper, the authors manually evaluated the translations for these datasets.
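Given the Code/TestCases layout described above, a rough sketch for checking one CodeNet Python snippet against its test I/O could look like the following; the example id and the input/output file suffixes are assumptions built on the id-prefix convention, not the repository's documented naming:

import subprocess
from pathlib import Path

# A hedged sketch: run a snippet on its test input and compare the output.
# The id and the _in/_out suffixes are hypothetical; adjust to the real layout.
snippet_id = "p00001"
code = Path(f"dataset/codenet/Python/Code/{snippet_id}.py")
test_in = Path(f"dataset/codenet/Python/TestCases/{snippet_id}_in.txt")
test_out = Path(f"dataset/codenet/Python/TestCases/{snippet_id}_out.txt")

result = subprocess.run(
    ["python3", str(code)],
    input=test_in.read_text(),
    capture_output=True,
    text=True,
    timeout=10,
)
print("PASS" if result.stdout.strip() == test_out.read_text().strip() else "FAIL")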
Scripts
We provide bash scripts for reproducing our results in this work. First, we discuss the translation script. To run translation with a model and dataset, you need to create a .env file in the repository and add the following:
OPENAI_API_KEY=<your OpenAI API key>
LLAMA2_AUTH_TOKEN=<your Llama 2 auth token>
STARCODER_AUTH_TOKEN=<your StarCoder auth token>
For example, you can run the following command to translate all Python -> Java code snippets in the codenet dataset with GPT-4, with top-k sampling k=50, top-p sampling p=0.95, and temperature=0.7 on GPU gpu_id=0:
bash scripts/translate.sh GPT-4 codenet Python Java 50 0.95 0.7 0
For CodeGeeX, you additionally need to download the model weights and place them in the repository as follows:
PLTranslationEmpirical
├── dataset
│   ├── codenet
│   ├── avatar
│   ├── evalplus
│   ├── real-life-cli
├── CodeGeeX
│   ├── codegeex
│   │   ├── codegeex_13b.pt # this file is the model weight
│   │   ├── ...
├── ...
You can run the following command to translate all Python -> Java code snippets in the codenet dataset with CodeGeeX, with top-k sampling k=50, top-p sampling p=0.95, and temperature=0.2 on GPU gpu_id=0:
bash scripts/translate.sh CodeGeeX codenet Python Java 50 0.95 0.2 0
Similarly, for StarCoder:
bash scripts/translate.sh StarCoder codenet Python Java 50 0.95 0.2 0
To run the alternative transpiler-based approaches, use:
bash scripts/translate_transpiler.sh codenet C Rust c2rust fix_report
bash scripts/translate_transpiler.sh codenet C Go cxgo fix_reports
bash scripts/translate_transpiler.sh codenet Java C# java2c# fix_reports
bash scripts/translate_transpiler.sh avatar Java C# java2c# fix_reports
To compile and test the generated translations, run:
bash scripts/test_avatar.sh Python Java GPT-4 fix_reports 1
bash scripts/test_codenet.sh Python Java GPT-4 fix_reports 1
bash scripts/test_evalplus.sh Python Java GPT-4 fix_reports 1
To repair unsuccessful translations with a model, run the repair script once per error category (compile, runtime, incorrect):
bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 compile
bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 runtime
bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 incorrect
To clean the generations of a model on a dataset, run:
bash scripts/clean_generations.sh StarCoder codenet
Please note that for the above commands, you can change the dataset and model names to run the same steps for other datasets and models. Moreover, you can refer to /prompts for the different vanilla and repair prompts used in our study.
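For instance, a small driver script could sweep the translate script over several model/dataset combinations; this sketch is not part of the repository, and the model and dataset names are just examples:

import subprocess

# A hedged sketch: batch-run scripts/translate.sh over model/dataset pairs.
models = ["GPT-4", "StarCoder"]              # example subset
datasets = ["codenet", "avatar", "evalplus"]

for model in models:
    for ds in datasets:
        subprocess.run(
            ["bash", "scripts/translate.sh", model, ds,
             "Python", "Java", "50", "0.95", "0.7", "0"],
            check=True,
        )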
Artifacts
Please download the artifacts.zip file from our Zenodo repository. We have organized the artifacts as follows:
RQ1 - Translations: This directory contains the translations from all LLMs and for all datasets. We have added an Excel file showing a detailed breakdown of the translation results.
RQ2 - Manual Labeling: This directory contains an Excel file with the manual labeling results for all translation bugs.
RQ3 - Alternative Approaches: This directory contains the translations from all alternative approaches (i.e., C2Rust, CxGO, Java2C#). We have added an Excel file showing a detailed breakdown of the translation results.
RQ4 - Mitigating Translation Bugs: This directory contains the fix results of GPT-4, StarCoder, CodeGen, and Llama 2. We have added an Excel file showing a detailed breakdown of the fix results.
Contact
We look forward to hearing your feedback. Please contact Rangeet Pan or Ali Reza Ibrahimzada for any questions or comments 🙏.
This dataset was created by tiffie_1
License: Apache-2.0 (https://www.apache.org/licenses/LICENSE-2.0)
sumuks/CodeNet-24K dataset hosted on Hugging Face and contributed by the HF Datasets community
didula-wso2/CodeNet dataset hosted on Hugging Face and contributed by the HF Datasets community
IBM Research is a renowned organization that has been pushing the boundaries of innovation for decades. With a strong focus on Semiconductors, Artificial Intelligence, Quantum Computing, and Hybrid Cloud, IBM Research is at the forefront of driving advancements in these cutting-edge fields. From developing powerful code models to exploring the potential of in-memory computing devices, IBM Research is committed to exploring new possibilities and solving the world's toughest challenges.
IBM Research is also home to various open-source projects and tools, including Project CodeNet, Project Debater for Academic Use, and GT4SD, among others. These projects aim to accelerate hypothesis generation in scientific discovery, facilitate code search and completion, and create more intelligent systems. With a strong commitment to research, innovation, and collaboration, IBM Research is a leading authority in its field, and its work has the potential to shape the future of technology and humanity.
License: http://www.companywall.rs/Home/Licence
This dataset includes financial statements, accounts and account freezes, and real estate. The data include revenue, expenses, profit, assets, liabilities, and information about real estate owned by the company. Financial data, financial summary, company summary, entrepreneurs, craftsmen, associations, business entities.
ShijiaD/codenet-python-AST-NIT dataset hosted on Hugging Face and contributed by the HF Datasets community
License: https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to free-code.net (Domain). Get insights into ownership history and changes over time.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Comparison of LLMs for Automated Code Refactoring
This project compares various Large Language Models (LLMs) for the task of automated code refactoring. The framework allows you to test and evaluate multiple models on real-world Python code samples.
Features
- Supports 5 popular LLMs for code refactoring
- Works on sample Python files in batch
- Flexible model configuration via model.yaml
- Designed for experimentation and evaluation in research or production
- Uses the CodeNet dataset
Specify one of the following in CodeBase/model.yaml:
- Qwen/CodeQwen1.5-7B-Chat
- deepseek-ai/deepseek-coder-6.7b-instruct
- meta-llama/Llama-3.2-3B-Instruct
- Qwen/Qwen2.5-Coder-7B-Instruct
- microsoft/Phi-4-mini-instruct
Specify one of the following in CodeBase/prompts.yaml:
- few_shot
- zero_shot
- rci
Dependencies:
- pip install -r requirements.txt
Data:
- data/Python_wrapped
- data/Problem_descriptions
- selected_files.txt
Run:
- inference.main
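As an illustration of the configuration flow described above, a loader might read the two YAML files like this; the key names are assumptions, since the entry does not document the schema:

import yaml  # pip install pyyaml

# A hedged sketch: read the model and prompt selections from the YAML configs.
# The commented key names are hypothetical; inspect the real files for the schema.
with open("CodeBase/model.yaml") as f:
    model_cfg = yaml.safe_load(f)
with open("CodeBase/prompts.yaml") as f:
    prompt_cfg = yaml.safe_load(f)

print(model_cfg)   # e.g. {"model": "Qwen/Qwen2.5-Coder-7B-Instruct"}
print(prompt_cfg)  # e.g. {"prompt": "zero_shot"}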
This is a programming problem-solving dataset extracted from the CodeNet platform, available in three different scale versions. Each sample includes:
submission_id: Submission ID
problem_id: Problem ID
status: Submission status (e.g., "Accepted")
code: Solution code
input: Test input
output: Expected output
problem_description: Problem description
Dataset Versions
'tiny': A small subset… See the full description on the dataset page: https://huggingface.co/datasets/lizn-zn/CodeNet_Extracted.
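A minimal loading sketch, assuming the 'tiny' version is exposed as a named configuration on the Hugging Face Hub (it may instead be a split; check the dataset page), and using the fields listed above:

from datasets import load_dataset

# A hedged sketch: load the tiny version and keep accepted submissions only.
# Whether "tiny" is a configuration or a split is an assumption; adjust as needed.
ds = load_dataset("lizn-zn/CodeNet_Extracted", "tiny", split="train")
accepted = ds.filter(lambda row: row["status"] == "Accepted")

sample = accepted[0]
print(sample["problem_id"], sample["submission_id"])
print(sample["problem_description"])
print(sample["code"])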
License: https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to elegant-code.net (Domain). Get insights into ownership history and changes over time.
License: https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to pieces-of-code.net (Domain). Get insights into ownership history and changes over time.
License: https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to c-code.net (Domain). Get insights into ownership history and changes over time.
This dataset is extracted from CodeNet, Python only. I merged the data into a single table, including metadata, problem descriptions, and test input/output.
small: accepted status only
big: all statuses, including accepted