https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_ct_code_to_text"
Dataset Summary
CodeXGLUE code-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text
The dataset comes from CodeSearchNet and is filtered as follows:
Remove examples whose code cannot be parsed into an abstract syntax tree.
Remove examples whose documentation has fewer than 3 or more than 256 tokens.
Remove examples whose documentation contains special tokens (e.g. or… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text.
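The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the CodeXGLUE preprocessing script: it assumes Python source checked with the standard `ast` module and documentation that is already tokenized. Because the list of special tokens is truncated in the summary, `special_markers` is a hypothetical parameter that the caller must supply.

```python
import ast


def keep_example(code, doc_tokens, special_markers=()):
    """Return True if a (code, documentation) pair passes all three filters."""
    # Filter 1: the code must parse into an abstract syntax tree.
    # Assumption: the code is Python, so the standard parser is enough.
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False

    # Filter 2: documentation must have between 3 and 256 tokens.
    if len(doc_tokens) < 3 or len(doc_tokens) > 256:
        return False

    # Filter 3: documentation must not contain special tokens.
    # The real token list is not given here; markers are caller-supplied.
    doc_text = " ".join(doc_tokens)
    return not any(marker in doc_text for marker in special_markers)
```

Other languages in CodeSearchNet would need their own parser in place of `ast.parse`.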
Concode dataset
A large dataset with over 100,000 examples consisting of Java classes from online code repositories, introduced together with an encoder-decoder architecture that models the interaction between the method documentation and the class environment. Concode is a widely used code generation dataset from Iyer et al.'s EMNLP 2018 paper "Mapping Language to Code in Programmatic Context". Data statistics of the Concode dataset are shown in the table below:
Train 100… See the full description on the dataset page: https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CONCODE.
https://github.com/microsoft/CodeXGLUE#license
CodeXGLUE code-to-code-trans dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans
The dataset is collected from several public repositories, including Lucene (http://lucene.apache.org/), POI (http://poi.apache.org/), JGit (https://github.com/eclipse/jgit/) and Antlr (https://github.com/antlr/). We collect both the Java and C# versions of the code and find the parallel functions. After removing duplicates and functions with an empty body, we split the whole dataset into training, validation and test sets.
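As a rough sketch of the cleaning and splitting steps described above: the brace-based empty-body check, the split fractions, and the function names below are all illustrative assumptions, not the procedure actually used to build the dataset.

```python
import random


def is_empty_body(function_source):
    """Heuristic: nothing but whitespace between the outermost braces."""
    start = function_source.find("{")
    end = function_source.rfind("}")
    if start == -1 or end <= start:
        return True  # no braces found, so treat it as having no body
    return function_source[start + 1:end].strip() == ""


def clean_and_split(pairs, valid_frac=0.1, test_frac=0.1, seed=0):
    """Deduplicate (java, csharp) pairs, drop empty bodies, then split."""
    seen = set()
    kept = []
    for java, csharp in pairs:
        key = (java.strip(), csharp.strip())
        if key in seen:
            continue  # exact duplicate pair
        seen.add(key)
        if is_empty_body(java) or is_empty_body(csharp):
            continue
        kept.append((java, csharp))

    # Hypothetical split fractions; a fixed seed keeps the split reproducible.
    random.Random(seed).shuffle(kept)
    n_test = int(len(kept) * test_frac)
    n_valid = int(len(kept) * valid_frac)
    test = kept[:n_test]
    valid = kept[n_test:n_test + n_valid]
    train = kept[n_test + n_valid:]
    return train, valid, test
```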
This dataset was created by SarthakPaswan
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_code_refinement"
Dataset Summary
CodeXGLUE code-refinement dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement
We use the dataset released by this paper (https://arxiv.org/pdf/1812.08693.pdf). The source side is a Java function with bugs and the target side is the refined one. All function and variable names are normalized. Their dataset contains two subsets (i.e. small and medium) based on… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_refinement.
ShijiaD/CodeXGLUE-Code-Docstring-Test dataset hosted on Hugging Face and contributed by the HF Datasets community
ShijiaD/CodeXGLUE-SBT-Docstring dataset hosted on Hugging Face and contributed by the HF Datasets community
https://github.com/microsoft/CodeXGLUE#license
CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench
Given two code snippets as input, the task is binary classification (0/1), where 1 stands for semantic equivalence and 0 otherwise. Models are evaluated by F1 score. The dataset is BigCloneBench, filtered following the paper "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree".
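For reference, the F1 score for this kind of 0/1 labeling can be computed as in the sketch below. It treats label 1 (semantically equivalent) as the positive class; the official evaluator may average differently, so this is the textbook definition rather than a copy of that script.

```python
def f1_score(gold, pred):
    """Binary F1 with label 1 as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```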
long methods
https://github.com/microsoft/CodeXGLUE#license
CodeXGLUE text-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text
The dataset is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.
https://github.com/microsoft/CodeXGLUE#license
CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv
The dataset comes from CodeSearchNet and is filtered as follows:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this online repository, we release the source code of each of the selected techniques as well as the experiment results from each technique (which are stored in the Results.zip file). For each technique, we also provide our scripts to fine-tune it on the CodeSearchNet-Python dataset. For example, finetune.sh and inference.sh are used to fine-tune and evaluate CodeBERT, respectively, and are located under "CodeBERT/CodeBERT".
Our evaluation dataset, CodeSearchNet, is a well-known benchmark and can be downloaded from its official webpage.
The code to calculate the evaluation metrics is reused from CodeBLEU.
Below is a piece of code generated by CodeT5. In this case, CodeT5 generates the same statement repeatedly, which leads to a syntax error. Even so, the code fulfills part of the intended functionality, which is why it still achieves a CodeBLEU of 24.9%.
def makeMimiLocal(filename):
    try:
        with open(filename, 'rb') as f:
            data = f.read()
    except IOError:
        data = b''
    data = data.decode('utf-8')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\x00', b'\x00')
    data = data.replace(b'\
We also release the 100 randomly selected queries, as well as the code generated by ChatGPT for them, in chatGPT.jsonl.
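Degenerate repetition like that in the CodeT5 output above can be flagged with a simple heuristic. The sketch below is illustrative only and is not part of the released repository; `longest_repeat` is a hypothetical helper that reports the longest run of identical consecutive non-blank lines.

```python
from itertools import groupby


def longest_repeat(code):
    """Length of the longest run of identical consecutive non-blank lines."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    # groupby clusters adjacent equal lines; the largest cluster is the run.
    return max((len(list(group)) for _, group in groupby(lines)), default=0)
```

A threshold on this value (for example, flagging outputs with a run of five or more) would be a design choice for the person applying it, not something the source prescribes.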
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_tc_text_to_code"
Dataset Summary
CodeXGLUE text-to-code dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code
The dataset is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.
Supported Tasks and Leaderboards
machine-translation: The dataset can be used to train a model for generating Java code from an English natural… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_tc_text_to_code.
http://www.apache.org/licenses/LICENSE-2.0
CAPYBARA
This dataset is published as part of the paper: "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries". It includes both the training/evaluation data as well as the raw data.
The data_split folder contains .pickle files with the test and validation repos; the train repos are all the remaining repos. It also contains a .pickle file with a dictionary that specifies the optimization level for each repository.
In the processed_data folder, the processed datasets can be found in .csv format. The columns of the CSV are the summaries, the original documentation, the repo, the source and decompiled code, the function name and a unique identifier. We also include the deduplicated samples in separate CSVs.
The processed training files can be found in the training_data folder. Source C, decompiled, demiStripped, and stripped can each be found in their corresponding folders and are split into deduplicated and regular datasets. The data is further split into .jsonl files for the train, test, and validation sets. These .jsonl files can be loaded into CodeT5 and CodeXGLUE as is.
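For inspecting these .jsonl files outside of CodeT5 or CodeXGLUE, a generic reader such as the sketch below is usually enough. It assumes one JSON object per line, which is the usual JSON Lines convention; the field names inside each record are not specified here, so none are assumed.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Because the function is a generator, large splits can be streamed record by record instead of being loaded into memory at once.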
The raw_data folder contains all the stripped and decompiled functions without any pre-processing applied. The columns of the CSV are repo, the location, the original code, the corresponding decompiled code, the function name, a unique identifier key, and the corresponding documentation for both the decompiled and stripped functions.
License
Copyright 2022 ##########
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
AhmedSSoliman/CodeXGLUE-CodeTans dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_cloze_testing_all"
Dataset Summary
CodeXGLUE ClozeTesting-all dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all
Cloze tests are widely adopted in Natural Language Processing to evaluate the performance of trained language models. The task is to predict the answer for each blank from its context, which can be formulated as a multi-choice classification problem. Here we… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_clone_detection_poj_104"
Dataset Summary
CodeXGLUE Clone-detection-POJ-104 dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104
Given a piece of code and a collection of candidates as input, the task is to return the top K code snippets with the same semantics. Models are evaluated by MAP score. We use the POJ-104 dataset for this task.
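Mean average precision (MAP) for this retrieval setup can be sketched as below. This is the plain textbook definition over a full ranking; the official evaluator may truncate the ranking or normalize differently, so treat it as an approximation. A candidate counts as relevant when its label matches the query's label, which is an assumption about how "same semantics" is encoded.

```python
def average_precision(ranked_labels, query_label):
    """Average precision of one ranked candidate list for one query."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of average precision over (ranked_labels, query_label) pairs."""
    return sum(average_precision(r, q) for r, q in queries) / len(queries)
```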
Supported Tasks and Leaderboards
document-retrieval: The… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/CodeCompletion-token/dataset/py150 in Semeru
CodeXGLUE -- Code Completion (token level)
Update 2021.07.30: We update the code completion dataset with literals normalized to avoid sensitive information. Here is the introduction and pipeline for token level code completion task.… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-CodeCompletion-TokenLevel-Python.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans in Semeru
CodeXGLUE -- Code2Code Translation
Task Definition
Code translation aims to migrate legacy software from one programming language on a platform to another. In CodeXGLUE, given a piece of Java (C#) code, the task is to translate the code into C#… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-translation-java-csharp.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/Method-Generation/dataset/codexglue_method_generation in Semeru
CodeXGLUE -- Method Generation
Here is the introduction and pipeline for the method generation task.
Task Definition
Method generation is the prediction of a method body implementation conditioned on a signature, a… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-MethodGeneration.