33 datasets found
  1. CodeXGLUE Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Mar 29, 2023
    Cite
    Shuai Lu; Daya Guo; Shuo Ren; JunJie Huang; Alexey Svyatkovskiy; Ambrosio Blanco; Colin Clement; Dawn Drain; Daxin Jiang; Duyu Tang; Ge Li; Lidong Zhou; Linjun Shou; Long Zhou; Michele Tufano; Ming Gong; Ming Zhou; Nan Duan; Neel Sundaresan; Shao Kun Deng; Shengyu Fu; Shujie Liu (2023). CodeXGLUE Dataset [Dataset]. https://paperswithcode.com/dataset/codexglue
    Dataset updated
    Mar 29, 2023
    Authors
    Shuai Lu; Daya Guo; Shuo Ren; JunJie Huang; Alexey Svyatkovskiy; Ambrosio Blanco; Colin Clement; Dawn Drain; Daxin Jiang; Duyu Tang; Ge Li; Lidong Zhou; Linjun Shou; Long Zhou; Michele Tufano; Ming Gong; Ming Zhou; Nan Duan; Neel Sundaresan; Shao Kun Deng; Shengyu Fu; Shujie Liu
    Description

    CodeXGLUE is a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. CodeXGLUE stands for General Language Understanding Evaluation benchmark for CODE. It includes 14 datasets for 10 diversified code intelligence tasks covering the following scenarios:

    • code-code (clone detection, defect detection, cloze test, code completion, code repair, and code-to-code translation)
    • text-code (natural language code search, text-to-code generation)
    • code-text (code summarization)
    • text-text (documentation translation)

    A brief summary of CodeXGLUE is provided in the figure, including tasks, datasets, language, sizes in various states, baseline systems, providers, and short definitions of each task. Datasets highlighted in BLUE are newly introduced.

  2. code_x_glue_ct_code_to_text

    • huggingface.co
    Cite
    Google, code_x_glue_ct_code_to_text [Dataset]. https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_ct_code_to_text"

      Dataset Summary
    

    CodeXGLUE code-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text

    The dataset we use comes from CodeSearchNet, and we filter it as follows:

    Remove examples whose code cannot be parsed into an abstract syntax tree. Remove examples whose documentation has fewer than 3 or more than 256 tokens. Remove examples whose documentation contains special tokens (e.g. or… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text.
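
The filtering rules above can be sketched in Python as follows. This is only an illustration, not the CodeXGLUE preprocessing code: the function names `parses_as_python` and `keep_example` are hypothetical, the parse check covers Python only (via the standard-library `ast` module; the other languages would need their own parsers), and the special-token rule is left out because its details are truncated in the description above.

```python
import ast
from typing import Sequence


def parses_as_python(code: str) -> bool:
    """Return True if `code` can be parsed into a Python abstract syntax tree."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # SyntaxError: malformed code; ValueError: e.g. source containing null bytes.
        return False
    return True


def keep_example(code: str, doc_tokens: Sequence[str],
                 min_tokens: int = 3, max_tokens: int = 256) -> bool:
    """Apply the two filters that are fully stated in the description."""
    if not parses_as_python(code):
        return False
    # Drop documents with fewer than 3 or more than 256 tokens.
    return min_tokens <= len(doc_tokens) <= max_tokens
```

How the documentation is split into tokens is also an assumption here; the sketch simply takes an already tokenized sequence.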

  3. CodeXGLUE-CONCODE

    • huggingface.co
    • opendatalab.com
    Updated Jan 30, 2022
    Cite
    Ahmed Soliman (2022). CodeXGLUE-CONCODE [Dataset]. https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CONCODE
    Dataset updated
    Jan 30, 2022
    Authors
    Ahmed Soliman
    Description

    Concode dataset

    A large dataset with over 100,000 examples consisting of Java classes from online code repositories. Concode is a widely used code generation dataset from the EMNLP 2018 paper "Mapping Language to Code in Programmatic Context" by Iyer et al., which also develops a new encoder-decoder architecture that models the interaction between the method documentation and the class environment. Data statistics of the Concode dataset are shown in the table below:

    Examples

    Train 100… See the full description on the dataset page: https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CONCODE.

  4. code-x-glue-cc-code-to-code-trans

    • opendatalab.com
    • huggingface.co
    zip
    Updated Dec 16, 2023
    Cite
    Sun Yat-Sen University (2023). code-x-glue-cc-code-to-code-trans [Dataset]. https://opendatalab.com/OpenDataLab/code-x-glue-cc-code-to-code-trans
    Available download formats: zip
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Peking University
    Sun Yat-Sen University
    Beihang University
    License

    https://github.com/microsoft/CodeXGLUE#license

    Description

    CodeXGLUE code-to-code-trans dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans

    The dataset is collected from several public repos, including Lucene (http://lucene.apache.org/), POI (http://poi.apache.org/), JGit (https://github.com/eclipse/jgit/) and Antlr (https://github.com/antlr/). We collect both the Java and C# versions of the code and find the parallel functions. After removing duplicates and functions with an empty body, we split the whole dataset into training, validation and test sets.

  5. code_x_glue_cc_code_completion_token

    • huggingface.co
    Updated Aug 20, 2021
    Cite
    Google (2021). code_x_glue_cc_code_completion_token [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_token
    Dataset updated
    Aug 20, 2021
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_code_completion_token"

      Dataset Summary
    

    CodeXGLUE CodeCompletion-token dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-token

    Predict the next code token given the context of previous tokens. Models are evaluated by token-level accuracy. Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool could improve software… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_token.
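
Token-level accuracy, the metric named above, can be sketched as the share of reference positions where the predicted token matches. This is a minimal illustration under stated assumptions, not the benchmark's evaluation script: the function name `token_accuracy` is hypothetical, predictions are assumed to be aligned position by position with the references, and positions missing from a prediction are counted as wrong.

```python
from typing import Sequence


def token_accuracy(predictions: Sequence[Sequence[str]],
                   references: Sequence[Sequence[str]]) -> float:
    """Fraction of reference positions whose predicted token is identical."""
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        for i, ref_tok in enumerate(ref):
            total += 1
            # A position beyond the end of the prediction counts as a miss.
            if i < len(pred) and pred[i] == ref_tok:
                correct += 1
    return correct / total if total else 0.0
```

For example, a prediction `["a", "b", "c"]` against the reference `["a", "x", "c"]` matches at two of three positions.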

  6. CodeXGLUE-CONCODE

    • huggingface.co
    Updated Jan 30, 2022
    Cite
    Frederic (2022). CodeXGLUE-CONCODE [Dataset]. https://huggingface.co/datasets/FredericFan/CodeXGLUE-CONCODE
    Dataset updated
    Jan 30, 2022
    Authors
    Frederic
    Description

    Dataset Card for "concode-preprocessed"

    More Information needed

  7. code_x_glue_cc_defect_detection

    • huggingface.co
    Updated May 16, 2024
    Cite
    Google (2024). code_x_glue_cc_defect_detection [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_defect_detection
    Dataset updated
    May 16, 2024
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_defect_detection"

      Dataset Summary
    

    CodeXGLUE Defect-detection dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection

    Given a piece of source code, the task is to identify whether it is insecure code that may be used to attack software systems, such as resource leaks, use-after-free vulnerabilities and DoS attacks. We treat the task as binary classification (0/1), where 1 stands for insecure code and 0 for secure… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_defect_detection.

  8. code_x_glue_cc_code_refinement

    • huggingface.co
    • opendatalab.com
    Updated Nov 23, 2024
    Cite
    Google (2024). code_x_glue_cc_code_refinement [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_code_refinement
    Dataset updated
    Nov 23, 2024
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_code_refinement"

      Dataset Summary
    

    CodeXGLUE code-refinement dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement

    We use the dataset released by this paper (https://arxiv.org/pdf/1812.08693.pdf). The source side is a Java function with bugs and the target side is the refined one. All function and variable names are normalized. Their dataset contains two subsets (i.e. small and medium) based on… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_refinement.

  9. code_x_glue_cc_clone_detection_big_clone_bench

    • huggingface.co
    • opendatalab.com
    Updated Apr 3, 2025
    Cite
    Google (2025). code_x_glue_cc_clone_detection_big_clone_bench [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_clone_detection_big_clone_bench"

      Dataset Summary
    

    CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench

    Given two code snippets as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score. The dataset we use is BigCloneBench and filtered following the paper… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench.
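
For the 0/1 labels described above, the F1 score of the positive class (label 1, semantic equivalence) can be computed as in the sketch below. The function name `binary_f1` is hypothetical, and the benchmark's own evaluation script may compute or average F1 differently, so treat this as the textbook definition rather than the official metric.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With `y_true = [1, 1, 0, 0]` and `y_pred = [1, 0, 1, 0]` there is one true positive, one false positive and one false negative, so precision and recall are both 0.5.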

  10. code_x_glue_tc_text_to_code

    • huggingface.co
    Updated Dec 5, 2023
    + more versions
    Cite
    Google (2023). code_x_glue_tc_text_to_code [Dataset]. https://huggingface.co/datasets/google/code_x_glue_tc_text_to_code
    Dataset updated
    Dec 5, 2023
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_tc_text_to_code"

      Dataset Summary
    

    CodeXGLUE text-to-code dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code

    The dataset we use is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.

      Supported Tasks and Leaderboards
    

    machine-translation: The dataset can be used to train a model for generating Java code from an English natural… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_tc_text_to_code.

  11. code-x-glue-tt-text-to-text

    • opendatalab.com
    • huggingface.co
    zip
    Updated Jan 9, 2024
    Cite
    Sun Yat-Sen University (2024). code-x-glue-tt-text-to-text [Dataset]. https://opendatalab.com/OpenDataLab/code-x-glue-tt-text-to-text
    Available download formats: zip
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    Peking University
    Sun Yat-Sen University
    Beihang University
    License

    https://github.com/microsoft/CodeXGLUE#license

    Description

    CodeXGLUE text-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text

    The dataset we use is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.

  12. CodeXGLUE-Code-Description

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    ShijiaDong (2025). CodeXGLUE-Code-Description [Dataset]. https://huggingface.co/datasets/ShijiaD/CodeXGLUE-Code-Description
    Dataset updated
    Jul 30, 2025
    Authors
    ShijiaDong
    Description

    ShijiaD/CodeXGLUE-Code-Description dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. code-x-glue-tc-nl-code-search-adv

    • opendatalab.com
    • huggingface.co
    zip
    Updated Dec 18, 2023
    Cite
    Sun Yat-Sen University (2023). code-x-glue-tc-nl-code-search-adv [Dataset]. https://opendatalab.com/OpenDataLab/code-x-glue-tc-nl-code-search-adv
    Available download formats: zip
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Peking University
    Sun Yat-Sen University
    Beihang University
    License

    https://github.com/microsoft/CodeXGLUE#license

    Description

    CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv

    The dataset we use comes from CodeSearchNet, and we filter it as follows:

  14. code_x_glue_cc_code_completion_line

    • huggingface.co
    Updated Oct 6, 2023
    Cite
    Google (2023). code_x_glue_cc_code_completion_line [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_line
    Dataset updated
    Oct 6, 2023
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_code_completion_line"

      Dataset Summary
    

    CodeXGLUE CodeCompletion-line dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line

    Complete the unfinished line given the previous context. Models are evaluated by exact match and edit similarity. We propose a line completion task to test a model's ability to autocomplete a line. Most code completion systems perform well at token-level completion, but fail in… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_line.
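
The two metrics named above, exact match and edit similarity, can be sketched as follows. This is an illustration under assumptions, not the benchmark's scoring code: the function names are hypothetical, exact match is taken after stripping surrounding whitespace, and edit similarity is taken as one minus the character-level Levenshtein distance divided by the longer string's length. The official script may normalise differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    """1.0 for identical strings, 0.0 when every character differs."""
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))


def exact_match(pred: str, ref: str) -> bool:
    """True when the completed line equals the reference, ignoring outer whitespace."""
    return pred.strip() == ref.strip()
```

Edit similarity gives partial credit to a completion that is nearly right, which exact match alone cannot do.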

  15. The Artifact of the ESEC/FSE 2023 Paper Titled "Natural Language to Code:...

    • zenodo.org
    bin, zip
    Updated Aug 18, 2023
    Cite
    Shangwen Wang (2023). The Artifact of the ESEC/FSE 2023 Paper Titled "Natural Language to Code: How Far are We?" [Dataset]. http://doi.org/10.5281/zenodo.7546358
    Available download formats: zip, bin
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shangwen Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this online repository, we release the source code of each of the selected techniques as well as the experiment results from each technique (which are stored in the Results.zip file). For each technique, we also provide our scripts to fine-tune the approach on the CodeSearchNet-Python dataset. For example, finetune.sh/inference.sh are used to fine-tune/evaluate CodeBERT, and they are under "CodeBERT/CodeBERT".

    Our evaluation dataset, CodeSearchNet, is a well-known benchmark and can be downloaded from its official webpage.

    The code to calculate the evaluation metrics is reused from CodeBLEU.

    Below is a piece of code generated by CodeT5. In this case, CodeT5 generates the same statement repeatedly, which leads to a syntactic error. Despite that, the code itself fulfills certain functionalities, which is why it can achieve a CodeBLEU of 24.9%.

    def makeMimiLocal(filename):
      try:
        with open(filename, 'rb') as f:
          data = f.read()
      except IOError:
        data = b''
      data = data.decode('utf-8')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\
    

    We also release the 100 randomly-selected queries as well as the code generated by ChatGPT in the chatGPT.jsonl.

  16. CAPYBARA: Decompiled Binary Functions and Related Summaries

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Oct 20, 2022
    Cite
    Ali Al-Kaswan; Toufique Ahmed; Maliheh Izadi; Anand Ashok Sawant; Prem Devanbu; Arie Van Deursen (2022). CAPYBARA: Decompiled Binary Functions and Related Summaries [Dataset]. http://doi.org/10.5281/zenodo.7229809
    Dataset updated
    Oct 20, 2022
    Authors
    Ali Al-Kaswan; Toufique Ahmed; Maliheh Izadi; Anand Ashok Sawant; Prem Devanbu; Arie Van Deursen
    Description

    CAPYBARA

    This dataset is published as part of the paper: "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries". It includes both the training/evaluation data as well as the raw data.

    The data_split folder contains .pickle files with the test and validation repos; the train repos are all the remaining repos. It also contains a .pickle file with a dictionary that specifies the optimization level for each repository.

    In the processed_data folder, the processed datasets can be found in .csv format. The columns of the CSV are the summaries, the original documentation, the repo, the source and decompiled code, the function name and a unique identifier. We also include the deduplicated samples in separate CSVs.

    The processed training files can be found in the training_data folder. Source C, decompiled, demiStripped, and stripped can each be found in their corresponding folders and are split into deduplicated and regular datasets. The data is further split into .jsonl files, for the train, test, and validation sets. These .jsonl files can be loaded into CodeT5 and CodeXGlue as is.

    The raw_data folder contains all the stripped and decompiled functions without any pre-processing applied. The columns of the CSV are repo, the location, the original code, the corresponding decompiled code, the function name, a unique identifier key, and the corresponding documentation for both the decompiled and stripped functions.

    License

    Copyright 2022 ##########

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
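
The .jsonl splits mentioned in the CAPYBARA description hold one JSON object per line, so they can be read with the standard library alone. The sketch below is a generic reader, not code from the CAPYBARA artifact: the function name `read_jsonl` is hypothetical, and the field names in the usage example are invented for illustration, since the description does not list the keys used in the .jsonl files.

```python
import io
import json
from typing import Any, Dict, Iterable, List


def read_jsonl(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse an iterable of text lines, one JSON object per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            records.append(json.loads(line))
    return records


# Usage with an in-memory stand-in for a file; "code" and "summary" are
# hypothetical keys chosen only for this example.
sample = io.StringIO('{"code": "int f() {}", "summary": "stub"}\n\n{"code": "", "summary": ""}\n')
records = read_jsonl(sample)
```

An open file handle can be passed in place of the `StringIO` object, since iterating over a text file also yields lines.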

  17. CodeXGLUE-AST-Docstring

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    ShijiaDong (2025). CodeXGLUE-AST-Docstring [Dataset]. https://huggingface.co/datasets/ShijiaD/CodeXGLUE-AST-Docstring
    Dataset updated
    Jul 30, 2025
    Authors
    ShijiaDong
    Description

    ShijiaD/CodeXGLUE-AST-Docstring dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. code_x_glue_cc_clone_detection_poj104

    • huggingface.co
    • opendatalab.com
    Cite
    Google, code_x_glue_cc_clone_detection_poj104 [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_clone_detection_poj_104"

      Dataset Summary
    

    CodeXGLUE Clone-detection-POJ-104 dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104

    Given a piece of code and a collection of candidates as the input, the task is to return the top K candidates with the same semantics. Models are evaluated by MAP score. We use the POJ-104 dataset for this task.

      Supported Tasks and Leaderboards
    

    document-retrieval: The… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104.

  19. code_x_glue_cc_cloze_testing_all

    • huggingface.co
    Updated Feb 12, 2022
    + more versions
    Cite
    Google (2022). code_x_glue_cc_cloze_testing_all [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all
    Dataset updated
    Feb 12, 2022
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_cloze_testing_all"

      Dataset Summary
    

    CodeXGLUE ClozeTesting-all dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all

    Cloze tests are widely adopted in Natural Language Processing to evaluate the performance of trained language models. The task aims to predict the answer for a blank given its context, which can be formulated as a multi-choice classification problem. Here we… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all.

  20. Code-Code-CloneDetection-POJ104

    • huggingface.co
    Cite
    Semeru Lab, Code-Code-CloneDetection-POJ104 [Dataset]. https://huggingface.co/datasets/semeru/Code-Code-CloneDetection-POJ104
    Dataset authored and provided by
    Semeru Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset is imported from CodeXGLUE and pre-processed using their script.

      Where to find in Semeru:
    

    The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/Clone-detection-POJ-104 in Semeru

      CodeXGLUE -- Clone Detection (POJ-104)
    
    
    
    
    
      Task Definition
    

    Given a piece of code and a collection of candidates as the input, the task is to return the top K candidates with the same semantics. Models are evaluated by MAP@R score. MAP@R is defined as the mean of… See the full description on the dataset page: https://huggingface.co/datasets/semeru/Code-Code-CloneDetection-POJ104.
