24 datasets found
  1. code_x_glue_ct_code_to_text

    • huggingface.co
    Updated Oct 10, 2023
    Cite
    Google (2023). code_x_glue_ct_code_to_text [Dataset]. https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 10, 2023
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_ct_code_to_text"

      Dataset Summary
    

    CodeXGLUE code-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text. The dataset comes from CodeSearchNet and is filtered as follows:

    • Remove examples whose code cannot be parsed into an abstract syntax tree.
    • Remove examples whose documentation has fewer than 3 or more than 256 tokens.
    • Remove examples whose documentation contains special tokens (e.g. or… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text.
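
The first two filtering rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's built-in ast module stands in for whichever parser the dataset authors used, documentation tokens are approximated by whitespace splitting, and the names keep_example and parses_to_ast are hypothetical. The special-token rule is left out because its token list is truncated in the description.

```python
import ast

# Bounds taken from the description: drop docs with < 3 or > 256 tokens.
MIN_DOC_TOKENS = 3
MAX_DOC_TOKENS = 256


def parses_to_ast(code: str) -> bool:
    """Return True if the snippet parses as Python source (assumed parser)."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def keep_example(code: str, docstring: str) -> bool:
    """Apply the AST and token-count filters to one (code, doc) pair."""
    # Assumption: whitespace tokenization approximates the authors' tokenizer.
    n_tokens = len(docstring.split())
    return parses_to_ast(code) and MIN_DOC_TOKENS <= n_tokens <= MAX_DOC_TOKENS
```

For example, keep_example("def f(x):\n    return x + 1\n", "Add one to x.") keeps the pair, while a snippet with a syntax error or a two-word docstring is dropped.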

  2. CodeXGLUE-CONCODE

    • huggingface.co
    • opendatalab.com
    Updated Jan 30, 2022
    Cite
    Ahmed Soliman (2022). CodeXGLUE-CONCODE [Dataset]. https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CONCODE
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 30, 2022
    Authors
    Ahmed Soliman
    Description

    Concode dataset

    Concode is a large dataset of over 100,000 examples consisting of Java classes from online code repositories. It is a widely used code generation dataset from Iyer et al.'s EMNLP 2018 paper "Mapping Language to Code in Programmatic Context", which also develops a new encoder-decoder architecture that models the interaction between the method documentation and the class environment. Data statistics of the Concode dataset are shown in the table below:

    Examples

    Train 100… See the full description on the dataset page: https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CONCODE.

  3. code-x-glue-cc-code-to-code-trans

    • opendatalab.com
    • huggingface.co
    zip
    Updated Dec 16, 2023
    Cite
    Peking University (2023). code-x-glue-cc-code-to-code-trans [Dataset]. https://opendatalab.com/OpenDataLab/code-x-glue-cc-code-to-code-trans
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Sun Yat-Sen University
    Beihang University
    Peking University
    License

    https://github.com/microsoft/CodeXGLUE#license

    Description

    CodeXGLUE code-to-code-trans dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans. The dataset is collected from several public repositories, including Lucene (http://lucene.apache.org/), POI (http://poi.apache.org/), JGit (https://github.com/eclipse/jgit/) and Antlr (https://github.com/antlr/). We collect both the Java and C# versions of the code and find the parallel functions. After removing duplicates and functions with an empty body, we split the whole dataset into training, validation and test sets.

  4. CodeXGLUE Processed

    • kaggle.com
    zip
    Updated May 29, 2025
    Cite
    SarthakPaswan (2025). CodeXGLUE Processed [Dataset]. https://www.kaggle.com/datasets/sarthakpaswan20/codexglue-processed
    Explore at:
    Available download formats: zip (411905729 bytes)
    Dataset updated
    May 29, 2025
    Authors
    SarthakPaswan
    Description

    Dataset

    This dataset was created by SarthakPaswan

    Contents

  5. code_x_glue_cc_code_refinement

    • huggingface.co
    • opendatalab.com
    Updated Nov 23, 2024
    Cite
    Google (2024). code_x_glue_cc_code_refinement [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_code_refinement
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 23, 2024
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_code_refinement"

      Dataset Summary
    

    CodeXGLUE code-refinement dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement. We use the dataset released by this paper (https://arxiv.org/pdf/1812.08693.pdf). The source side is a Java function with bugs and the target side is the refined one. All function and variable names are normalized. Their dataset contains two subsets (i.e., small and medium) based on… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_refinement.

  6. CodeXGLUE-Code-Docstring-Test

    • huggingface.co
    Updated Oct 1, 2025
    + more versions
    Cite
    ShijiaDong (2025). CodeXGLUE-Code-Docstring-Test [Dataset]. https://huggingface.co/datasets/ShijiaD/CodeXGLUE-Code-Docstring-Test
    Explore at:
    Dataset updated
    Oct 1, 2025
    Authors
    ShijiaDong
    Description

    ShijiaD/CodeXGLUE-Code-Docstring-Test dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. CodeXGLUE-SBT-Docstring

    • huggingface.co
    Updated Sep 21, 2025
    Cite
    ShijiaDong (2025). CodeXGLUE-SBT-Docstring [Dataset]. https://huggingface.co/datasets/ShijiaD/CodeXGLUE-SBT-Docstring
    Explore at:
    Dataset updated
    Sep 21, 2025
    Authors
    ShijiaDong
    Description

    ShijiaD/CodeXGLUE-SBT-Docstring dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. code-x-glue-cc-clone-detection-big-clone-bench

    • opendatalab.com
    • huggingface.co
    zip
    Updated Dec 30, 2023
    Cite
    Queen’s University (2023). code-x-glue-cc-clone-detection-big-clone-bench [Dataset]. https://opendatalab.com/OpenDataLab/code-x-glue-cc-clone-detection-big-clone-bench
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    University of Saskatchewan
    Queen’s University
    License

    https://github.com/microsoft/CodeXGLUE#license

    Description

    CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench. Given two code snippets as input, the task is binary classification (0/1), where 1 stands for semantic equivalence and 0 for anything else. Models are evaluated by F1 score. The dataset is BigCloneBench, filtered following the paper "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree".
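
As a rough illustration of the metric named above, F1 for the positive class (label 1, semantic equivalence) can be computed as below. This is a minimal sketch, not the benchmark's official evaluator, which may use a different averaging scheme; the function name binary_f1 is hypothetical.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score of the positive class (label 1) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive prediction: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With gold labels [1, 0, 1, 1, 0] and predictions [1, 1, 0, 1, 0], precision and recall are both 2/3, so the score is 2/3.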

  9. Know Your LLM with Code Smell: Role-Aware Code Smell Summaries Corpus

    • ieee-dataport.org
    Updated Oct 25, 2025
    Cite
    Zhengyi ZHUO (2025). Know Your LLM with Code Smell: Role-Aware Code Smell Summaries Corpus [Dataset]. https://ieee-dataport.org/documents/know-your-llm-code-smell-role-aware-code-smell-summaries-corpus
    Explore at:
    Dataset updated
    Oct 25, 2025
    Authors
    Zhengyi ZHUO
    Description

    long methods

  10. code-x-glue-tt-text-to-text

    • opendatalab.com
    • huggingface.co
    zip
    Updated Jan 9, 2024
    Cite
    Sun Yat-Sen University (2024). code-x-glue-tt-text-to-text [Dataset]. https://opendatalab.com/OpenDataLab/code-x-glue-tt-text-to-text
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    Sun Yat-Sen University
    Beihang University
    Peking University
    License

    https://github.com/microsoft/CodeXGLUE#license

    Description

    CodeXGLUE text-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text. The dataset is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.

  11. code-x-glue-tc-nl-code-search-adv

    • opendatalab.com
    • huggingface.co
    zip
    Updated Dec 18, 2023
    Cite
    Sun Yat-Sen University (2023). code-x-glue-tc-nl-code-search-adv [Dataset]. https://opendatalab.com/OpenDataLab/code-x-glue-tc-nl-code-search-adv
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Sun Yat-Sen University
    Beihang University
    Peking University
    License

    https://github.com/microsoft/CodeXGLUE#license

    Description

    CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv. The dataset comes from CodeSearchNet and is filtered as follows:

  12. The Artifact of the ESEC/FSE 2023 Paper Titled "Natural Language to Code:...

    • zenodo.org
    bin, zip
    Updated Aug 18, 2023
    Cite
    Shangwen Wang (2023). The Artifact of the ESEC/FSE 2023 Paper Titled "Natural Language to Code: How Far are We?" [Dataset]. http://doi.org/10.5281/zenodo.7546358
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shangwen Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this online repository, we release the source code of each of the selected techniques as well as the experiment results from each technique (which are stored in the Results.zip file). For each technique, we also provide our scripts to fine-tune this approach on the CodeSearchNet-Python dataset. For example, finetune.sh/inference.sh are used to fine-tune/evaluate CodeBERT and they are under "CodeBERT/CodeBERT".

    Our evaluation dataset CodeSearchNet is a well-known benchmark and it can be downloaded from its official webpage.

    The code to calculate the evaluation metrics is reused from CodeBLEU.

    Below is a piece of code generated by CodeT5. In this case, CodeT5 generates the same statement repeatedly, which leads to a syntax error. Despite that, the code itself fulfills certain functionalities, and that is why it can achieve a CodeBLEU of 24.9%.

    def makeMimiLocal(filename):
      try:
        with open(filename, 'rb') as f:
          data = f.read()
      except IOError:
        data = b''
      data = data.decode('utf-8')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\x00', b'\x00')
      data = data.replace(b'\
    

    We also release the 100 randomly-selected queries as well as the code generated by ChatGPT in the chatGPT.jsonl.

  13. code_x_glue_tc_text_to_code

    • huggingface.co
    Updated Dec 5, 2023
    + more versions
    Cite
    Google (2023). code_x_glue_tc_text_to_code [Dataset]. https://huggingface.co/datasets/google/code_x_glue_tc_text_to_code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 5, 2023
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_tc_text_to_code"

      Dataset Summary
    

    CodeXGLUE text-to-code dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code. The dataset is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.

      Supported Tasks and Leaderboards
    

    machine-translation: The dataset can be used to train a model for generating Java code from an English natural… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_tc_text_to_code.

  14. CAPYBARA: Decompiled Binary Functions and Related Summaries

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1 more
    Updated Jan 4, 2023
    Cite
    Ali Al-Kaswan; Toufique Ahmed; Maliheh Izadi; Anand Ashok Sawant; Prem Devanbu; Arie van Deursen (2023). CAPYBARA: Decompiled Binary Functions and Related Summaries [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7229808
    Explore at:
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Delft University of Technology
    University of California, Davis
    Authors
    Ali Al-Kaswan; Toufique Ahmed; Maliheh Izadi; Anand Ashok Sawant; Prem Devanbu; Arie van Deursen
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    CAPYBARA

    This dataset is published as part of the paper: "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries". It includes both the training/evaluation data as well as the raw data.

    The data_split folder contains .pickle files with the test and validation repos, the train repos are all the remaining repos. It also contains a .pickle file with a dictionary that specifies the optimization level for each repository.

    In the processed_data folder, the processed datasets can be found in .csv format. The columns of the CSV are the summaries, the original documentation, the repo, the source and decompiled code, the function name and a unique identifier. We also include the deduplicated samples in separate CSVs.

    The processed training files can be found in the training_data folder. Source C, decompiled, demiStripped, and stripped can each be found in their corresponding folders and are split into deduplicated and regular datasets. The data is further split into .jsonl files, for the train, test, and validation sets. These .jsonl files can be loaded into CodeT5 and CodeXGlue as is.

    The raw_data folder contains all the stripped and decompiled functions without any pre-processing applied. The columns of the CSV are repo, the location, the original code, the corresponding decompiled code, the function name, a unique identifier key, and the corresponding documentation for both the decompiled and stripped functions.
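
For readers who want to inspect the .jsonl splits mentioned above outside of CodeT5 or CodeXGLUE, a generic JSON Lines reader might look like this. It is a sketch under assumptions: each non-empty line is taken to hold one JSON object, no field names are assumed because the description does not list them, and parse_jsonl is a hypothetical helper name.

```python
import json
from typing import Any, Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)
```

An open text file can be passed directly, since iterating a file handle yields its lines: records = list(parse_jsonl(open(path, encoding="utf-8"))), where path points at one of the split files.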

    License

    Copyright 2022 ##########

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

  15. CodeXGLUE-CodeTans

    • huggingface.co
    Updated Jul 11, 2017
    Cite
    Ahmed Soliman (2017). CodeXGLUE-CodeTans [Dataset]. https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CodeTans
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 11, 2017
    Authors
    Ahmed Soliman
    Description

    AhmedSSoliman/CodeXGLUE-CodeTans dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. code_x_glue_cc_cloze_testing_all

    • huggingface.co
    Updated Feb 12, 2022
    + more versions
    Cite
    Google (2022). code_x_glue_cc_cloze_testing_all [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2022
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_cloze_testing_all"

      Dataset Summary
    

    CodeXGLUE ClozeTesting-all dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all. Cloze tests are widely adopted in Natural Language Processing to evaluate the performance of trained language models. The task is to predict the answer for the blank given its context, which can be formulated as a multi-choice classification problem. Here we… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all.

  17. code_x_glue_cc_clone_detection_poj104

    • huggingface.co
    • opendatalab.com
    Cite
    Google, code_x_glue_cc_clone_detection_poj104 [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_clone_detection_poj_104"

      Dataset Summary
    

    CodeXGLUE Clone-detection-POJ-104 dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104. Given a code snippet and a collection of candidates as input, the task is to return the top K snippets with the same semantics. Models are evaluated by MAP score. We use the POJ-104 dataset for this task.
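
The MAP score mentioned above can be illustrated with a plain mean-average-precision sketch. It assumes each query comes with a full ranking of candidate labels, so every relevant candidate appears somewhere in the list, and that a candidate is relevant when its label equals the query's label. The official evaluator may truncate rankings or differ in other details, and the function names are hypothetical.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def average_precision(ranked_labels: Sequence[Hashable], query_label: Hashable) -> float:
    """Average of precision@k over the ranks k where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    queries: Iterable[Tuple[Hashable, Sequence[Hashable]]]
) -> float:
    """Mean of average_precision over (query_label, ranked_labels) pairs."""
    scores = [average_precision(ranked, label) for label, ranked in queries]
    return sum(scores) / len(scores) if scores else 0.0
```

For a query labelled "a" with ranked candidate labels ["a", "b", "a"], the relevant items sit at ranks 1 and 3, giving (1/1 + 2/3) / 2 as the average precision.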

      Supported Tasks and Leaderboards
    

    document-retrieval: The… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104.

  18. code-code-CodeCompletion-TokenLevel-Python

    • huggingface.co
    Updated Jan 27, 2025
    + more versions
    Cite
    Semeru Lab (2025). code-code-CodeCompletion-TokenLevel-Python [Dataset]. https://huggingface.co/datasets/semeru/code-code-CodeCompletion-TokenLevel-Python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 27, 2025
    Dataset authored and provided by
    Semeru Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset is imported from CodeXGLUE and pre-processed using their script.

      Where to find in Semeru:
    

    The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/CodeCompletion-token/dataset/py150 in Semeru

      CodeXGLUE -- Code Completion (token level)
    

    Update 2021.07.30: We update the code completion dataset with literals normalized to avoid sensitive information. Here is the introduction and pipeline for token level code completion task.… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-CodeCompletion-TokenLevel-Python.

  19. code-code-translation-java-csharp

    • huggingface.co
    Updated Jul 11, 2017
    Cite
    Semeru Lab (2017). code-code-translation-java-csharp [Dataset]. https://huggingface.co/datasets/semeru/code-code-translation-java-csharp
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 11, 2017
    Dataset authored and provided by
    Semeru Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset is imported from CodeXGLUE and pre-processed using their script.

      Where to find in Semeru:
    

    The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans in Semeru

      CodeXGLUE -- Code2Code Translation
    
    
    
    
    
      Task Definition
    

    Code translation aims to migrate legacy software from one programming language in a platform to another. In CodeXGLUE, given a piece of Java (C#) code, the task is to translate the code into C#… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-translation-java-csharp.

  20. code-code-MethodGeneration

    • huggingface.co
    Updated Dec 11, 2022
    Cite
    Semeru Lab (2022). code-code-MethodGeneration [Dataset]. https://huggingface.co/datasets/semeru/code-code-MethodGeneration
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 11, 2022
    Dataset authored and provided by
    Semeru Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset is imported from CodeXGLUE and pre-processed using their script.

      Where to find in Semeru:
    

    The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/Method-Generation/dataset/codexglue_method_generation in Semeru

      CodeXGLUE -- Method Generation
    

    Here is the introduction and pipeline for method generation task.

      Task Definition
    

    Method generation is the prediction of a method body implementation conditioned on a signature, a… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-MethodGeneration.

code_x_glue_ct_code_to_text

CodeXGlueCtCodeToText

google/code_x_glue_ct_code_to_text

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)