CodeXGLUE is a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. CodeXGLUE stands for General Language Understanding Evaluation benchmark for CODE. It includes 14 datasets for 10 diversified code intelligence tasks covering the following scenarios:
code-code: clone detection, defect detection, cloze test, code completion, code repair, and code-to-code translation
text-code: natural language code search, text-to-code generation
code-text: code summarization
text-text: documentation translation
A brief summary of CodeXGLUE is provided in the figure, including tasks, datasets, languages, split sizes, baseline systems, providers, and a short definition of each task. Datasets highlighted in BLUE are newly introduced.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_ct_code_to_text"
Dataset Summary
CodeXGLUE code-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text. The dataset we use comes from CodeSearchNet, and we filter it as follows:
Remove examples whose code cannot be parsed into an abstract syntax tree.
Remove examples whose documentation contains fewer than 3 or more than 256 tokens.
Remove examples whose documentation contains special tokens (e.g. … See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text.
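The three filters above can be sketched in plain Python. This is only a rough sketch: the official preprocessing scripts use language-specific parsers and tokenizers rather than `ast` and whitespace splitting, and the special-token list shown here is illustrative, not the official one.

```python
import ast

def keep_example(code: str, doc: str) -> bool:
    """Return True if the (code, doc) pair survives all three filters."""
    # Filter 1: drop examples whose code cannot be parsed into an AST.
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    # Filter 2: drop examples whose documentation has < 3 or > 256 tokens.
    tokens = doc.split()
    if len(tokens) < 3 or len(tokens) > 256:
        return False
    # Filter 3: drop examples whose documentation contains special tokens
    # (the substrings below are illustrative placeholders).
    if any(special in doc for special in ("<img", "http://", "https://")):
        return False
    return True
```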
Concode dataset
Concode is a widely used code generation dataset from Iyer et al.'s EMNLP 2018 paper "Mapping Language to Code in Programmatic Context". It is a large dataset of over 100,000 examples consisting of Java classes from online code repositories; the accompanying paper develops a new encoder-decoder architecture that models the interaction between the method documentation and the class environment. Data statistics of the Concode dataset are shown in the table below:
Train 100… See the full description on the dataset page: https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CONCODE.
https://github.com/microsoft/CodeXGLUE#license
CodeXGLUE code-to-code-trans dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans. The dataset is collected from several public repos, including Lucene (http://lucene.apache.org/), POI (http://poi.apache.org/), JGit (https://github.com/eclipse/jgit/) and Antlr (https://github.com/antlr/). We collect both the Java and C# versions of the code and find the parallel functions. After removing duplicates and functions with empty bodies, we split the whole dataset into training, validation and test sets.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_code_completion_token"
Dataset Summary
CodeXGLUE CodeCompletion-token dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-token. The task is to predict the next code token given the context of previous tokens. Models are evaluated by token-level accuracy. Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool could improve software… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_token.
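The token-level accuracy metric named above can be sketched as follows. This is a minimal sketch assuming pre-tokenized predictions and references of matching length; the function name is illustrative, not from the official evaluation script.

```python
def token_accuracy(predictions, references):
    """Fraction of positions where the predicted token equals the
    ground-truth token, across all sequences."""
    correct = total = 0
    for pred_seq, ref_seq in zip(predictions, references):
        for pred_tok, ref_tok in zip(pred_seq, ref_seq):
            correct += int(pred_tok == ref_tok)
            total += 1
    return correct / total if total else 0.0
```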
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_defect_detection"
Dataset Summary
CodeXGLUE Defect-detection dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection. Given a piece of source code, the task is to identify whether it is insecure code that may attack software systems, e.g. through resource leaks, use-after-free vulnerabilities, or DoS attacks. We treat the task as binary classification (0/1), where 1 stands for insecure code and 0 for secure… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_defect_detection.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_code_refinement"
Dataset Summary
CodeXGLUE code-refinement dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement. We use the dataset released by this paper (https://arxiv.org/pdf/1812.08693.pdf). The source side is a Java function with bugs and the target side is the refined one. All function and variable names are normalized. The dataset contains two subsets (i.e. small and medium) based on… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_refinement.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_clone_detection_big_clone_bench"
Dataset Summary
CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench. Given two codes as input, the task is binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score. The dataset we use is BigCloneBench, filtered following the paper… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench.
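The F1 metric used for this clone/non-clone classification can be sketched in plain Python (a minimal sketch of the standard binary F1 over 0/1 labels; in practice one would typically call a library routine such as sklearn.metrics.f1_score):

```python
def f1_score(preds, labels):
    """Binary F1 for 0/1 predictions, treating label 1 (clone) as positive."""
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```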
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_tc_text_to_code"
Dataset Summary
CodeXGLUE text-to-code dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code. The dataset we use is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.
Supported Tasks and Leaderboards
machine-translation: The dataset can be used to train a model for generating Java code from an English natural… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_tc_text_to_code.
https://github.com/microsoft/CodeXGLUE#license
CodeXGLUE text-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text. The dataset we use is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.
ShijiaD/CodeXGLUE-Code-Description dataset hosted on Hugging Face and contributed by the HF Datasets community
https://github.com/microsoft/CodeXGLUE#license
CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv. The dataset we use comes from CodeSearchNet, and we filter it as follows:
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_code_completion_line"
Dataset Summary
CodeXGLUE CodeCompletion-line dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line. The task is to complete the unfinished line given the previous context. Models are evaluated by exact match and edit similarity. We propose the line completion task to test a model's ability to autocomplete a line. Most code completion systems behave well at token-level completion, but fail in… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_line.
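The two metrics named above can be sketched as follows. This is an approximate sketch: the official CodeXGLUE evaluation computes edit similarity with a Levenshtein-ratio implementation, while the stdlib `difflib.SequenceMatcher` used here is a similar but not identical stand-in; function names are illustrative.

```python
import difflib

def exact_match(pred: str, ref: str) -> bool:
    """Whitespace-insensitive exact match of the completed line."""
    return pred.strip() == ref.strip()

def edit_similarity(pred: str, ref: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return difflib.SequenceMatcher(None, pred, ref).ratio()
```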
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this online repository, we release the source code of each of the selected techniques as well as the experiment results from each technique (stored in the Results.zip file). For each technique, we also provide our scripts to fine-tune the approach on the CodeSearchNet-Python dataset. For example, finetune.sh/inference.sh are used to fine-tune/evaluate CodeBERT and are located under "CodeBERT/CodeBERT".
Our evaluation dataset CodeSearchNet is a well-known benchmark and it can be downloaded on its official webpage.
The code to calculate the evaluation metrics is reused from CodeBLEU.
Below is a piece of code generated by CodeT5. In this case, CodeT5 generates a statement recurrently, which leads to a syntactic error. Despite that, the code itself fulfills certain functionality, which is why it still achieves a CodeBLEU of 24.9%.
def makeMimiLocal(filename):
try:
with open(filename, 'rb') as f:
data = f.read()
except IOError:
data = b''
data = data.decode('utf-8')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\x00', b'\x00')
data = data.replace(b'\
We also release the 100 randomly selected queries, as well as the code generated by ChatGPT, in chatGPT.jsonl.
CAPYBARA. This dataset is published as part of the paper "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries". It includes both the training/evaluation data and the raw data.
The data_split folder contains .pickle files with the test and validation repos; the train repos are all the remaining repos. It also contains a .pickle file with a dictionary that specifies the optimization level for each repository.
In the processed_data folder, the processed datasets can be found in .csv format. The columns of the CSV are the summaries, the original documentation, the repo, the source and decompiled code, the function name, and a unique identifier. We also include the deduplicated samples in separate CSVs.
The processed training files can be found in the training_data folder. Source C, decompiled, demiStripped, and stripped code can each be found in their corresponding folders and are split into deduplicated and regular datasets. The data is further split into .jsonl files for the train, test, and validation sets. These .jsonl files can be loaded into CodeT5 and CodeXGlue as is.
The raw_data folder contains all the stripped and decompiled functions without any pre-processing applied. The columns of the CSV are the repo, the location, the original code, the corresponding decompiled code, the function name, a unique identifier key, and the corresponding documentation for both the decompiled and stripped functions.
License: Copyright 2022 ########## Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
ShijiaD/CodeXGLUE-AST-Docstring dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_clone_detection_poj_104"
Dataset Summary
CodeXGLUE Clone-detection-POJ-104 dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104. Given a piece of code and a collection of candidates as input, the task is to return the top-K codes with the same semantics. Models are evaluated by MAP score. We use the POJ-104 dataset for this task.
Supported Tasks and Leaderboards
document-retrieval: The… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_cloze_testing_all"
Dataset Summary
CodeXGLUE ClozeTesting-all dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all. Cloze tests are widely adopted in Natural Language Processing to evaluate the performance of trained language models. The task aims to predict the answer for a blank given the blank's context, which can be formulated as a multi-choice classification problem. Here we… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all.
MIT Licensehttps://opensource.org/licenses/MIT
This dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/Clone-detection-POJ-104 in Semeru
CodeXGLUE -- Clone Detection (POJ-104)
Task Definition
Given a piece of code and a collection of candidates as input, the task is to return the top-K codes with the same semantics. Models are evaluated by MAP@R score. MAP@R is defined as the mean of… See the full description on the dataset page: https://huggingface.co/datasets/semeru/Code-Code-CloneDetection-POJ104.
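The MAP@R computation can be sketched as follows. This is a rough sketch under the usual definition (for each query, R is the number of relevant items, and precision is averaged over the positions of correct hits within the top-R retrieved results); the function names are illustrative, not from the official evaluator.

```python
def average_precision_at_r(retrieved, relevant):
    """AP@R for one query: average precision@k at each correct hit
    among the top-R retrieved items, divided by R = |relevant|."""
    r = len(relevant)
    hits, precisions = 0, []
    for k, item in enumerate(retrieved[:r], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / r if r else 0.0

def map_at_r(all_retrieved, all_relevant):
    """Mean of AP@R over all queries."""
    aps = [average_precision_at_r(ret, rel)
           for ret, rel in zip(all_retrieved, all_relevant)]
    return sum(aps) / len(aps) if aps else 0.0
```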