45 datasets found
  1. CodeSearchNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 30, 2024
    Cite
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt (2024). CodeSearchNet Dataset [Dataset]. https://paperswithcode.com/dataset/codesearchnet
    Dataset updated
    Dec 30, 2024
    Authors
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt
    Description

    The CodeSearchNet Corpus is a large dataset of functions with associated documentation, written in Go, Java, JavaScript, PHP, Python, and Ruby and drawn from open source projects on GitHub. The CodeSearchNet Corpus includes:

    * Six million methods overall
    * Two million of which have associated documentation (docstrings, JavaDoc, and more)
    * Metadata that indicates the original location (repository or line number, for example) where the data was found

  2. codesearchnet

    • huggingface.co
    Updated Apr 30, 2024
    Cite
    Sentence Transformers (2024). codesearchnet [Dataset]. https://huggingface.co/datasets/sentence-transformers/codesearchnet
    Dataset updated
    Apr 30, 2024
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for CodeSearchNet

    This dataset is a collection of comment-code pairs of various programming languages. See code_search_net for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

      Dataset Subsets

      pair subset

    Columns: "comment", "code"
    Column types: str, str
    Examples: { 'comment': 'Computes the new parent id for the node being moved.

    @return int', 'code': "protected function…

    See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/codesearchnet.

  3. code-search-net-python

    • huggingface.co
    Updated Dec 27, 2023
    Cite
    Fernando Tarin Morales (2023). code-search-net-python [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-python
    Dataset updated
    Dec 27, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-python"

      Dataset Description
    

    Homepage: None
    Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python
    Paper: None
    Leaderboard: None
    Point of Contact: @Nan-Do

      Dataset Summary
    

    This dataset is the Python portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.

  4. CodeSearchNet-ccr

    • huggingface.co
    Updated Aug 13, 2024
    + more versions
    Cite
    CoIR (2024). CodeSearchNet-ccr [Dataset]. https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet-ccr
    Dataset updated
    Aug 13, 2024
    Dataset authored and provided by
    CoIR
    Description

    To evaluate with the MTEB evaluation framework's version of the dataset, use the code below:

    import mteb
    import logging
    from sentence_transformers import SentenceTransformer
    from mteb import MTEB

    logger = logging.getLogger(__name__)

    model_name = 'intfloat/e5-base-v2'
    model = SentenceTransformer(model_name)
    tasks = mteb.get_tasks(
        tasks=[
            "AppsRetrieval",
            "CodeFeedbackMT",
            "CodeFeedbackST",
            "CodeTransOceanContest",
            "CodeTransOceanDL"…

    See the full description on the dataset page: https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet-ccr.

  5. CodeSearchNet

    • kaggle.com
    Updated Apr 18, 2023
    Cite
    Om Duggineni (2023). CodeSearchNet [Dataset]. https://www.kaggle.com/omduggineni/codesearchnet/code
    Dataset updated
    Apr 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Om Duggineni
    Description

    Dataset

    This dataset was created by Om Duggineni

    Contents

  6. code-search-net-ruby

    • huggingface.co
    Updated May 18, 2023
    Cite
    Fernando Tarin Morales (2023). code-search-net-ruby [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-ruby
    Dataset updated
    May 18, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-ruby"

      Dataset Summary
    

    This dataset is the Ruby portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.

      Languages
    

    The dataset's comments are in English and the functions are coded in Ruby

      Data Splits
    

    Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-ruby.

  7. code-search-net-javascript

    • huggingface.co
    Updated Nov 13, 2023
    + more versions
    Cite
    Fernando Tarin Morales (2023). code-search-net-javascript [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-javascript
    Dataset updated
    Nov 13, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-javascript"

      Dataset Summary
    

    This dataset is the JavaScript portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.

      Languages
    

    The dataset's comments are in English and the functions are coded in JavaScript

      Data Splits
    

    Train, test, validation labels are… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-javascript.

  8. code-search-net-java

    • huggingface.co
    Updated Apr 29, 2025
    Cite
    Fernando Tarin Morales (2025). code-search-net-java [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-java
    Dataset updated
    Apr 29, 2025
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-java"

      Dataset Summary
    

    This dataset is the Java portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.

      Languages
    

    The dataset's comments are in English and the functions are coded in Java

      Data Splits
    

    Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-java.

  9. code_search_net

    • huggingface.co
    Updated Sep 10, 2024
    Cite
    Claudio Spiess (2024). code_search_net [Dataset]. https://huggingface.co/datasets/claudios/code_search_net
    Dataset updated
    Sep 10, 2024
    Authors
    Claudio Spiess
    License

    https://choosealicense.com/licenses/other/

    Description

    CodeSearchNet

    This is an unofficial reupload of the code_search_net dataset in the parquet format. I have also removed the columns func_code_tokens, func_documentation_tokens, and split_name as they are not relevant. The original repository relies on a Python module that is downloaded and executed to unpack the dataset, which is a potential security risk but importantly raises an annoying warning. As a plus, parquets load faster. Original model card:

      Dataset Card for… See the full description on the dataset page: https://huggingface.co/datasets/claudios/code_search_net.
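
    As a rough illustration of the column removal described above, the sketch below drops the three named columns from a single record. The record layout and the remaining field names ("code", "docstring") are hypothetical placeholders and are not taken from the dataset card.

```python
# Columns the reupload is described as removing.
DROPPED_COLUMNS = ("func_code_tokens", "func_documentation_tokens", "split_name")


def drop_columns(record, dropped=DROPPED_COLUMNS):
    """Return a copy of the record without the dropped columns."""
    return {key: value for key, value in record.items() if key not in dropped}


# Hypothetical record; "code" and "docstring" are illustrative field names only.
example = {
    "code": "def add(a, b):\n    return a + b",
    "docstring": "Add two numbers.",
    "func_code_tokens": ["def", "add", "(", "a", ",", "b", ")", ":"],
    "func_documentation_tokens": ["Add", "two", "numbers", "."],
    "split_name": "train",
}
cleaned = drop_columns(example)
```

    The function returns a new dictionary, so the original record is left untouched.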
    
  10. CodeSearchNet-Python

    • huggingface.co
    Updated Aug 26, 2023
    + more versions
    Cite
    Ahmed Soliman (2023). CodeSearchNet-Python [Dataset]. https://huggingface.co/datasets/AhmedSSoliman/CodeSearchNet-Python
    Dataset updated
    Aug 26, 2023
    Authors
    Ahmed Soliman
    License

    https://choosealicense.com/licenses/ms-pl/

    Description

    AhmedSSoliman/CodeSearchNet-Python dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. code-search-net-go

    • huggingface.co
    Updated May 18, 2023
    Cite
    Fernando Tarin Morales (2023). code-search-net-go [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-go
    Dataset updated
    May 18, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-go"

      Dataset Summary
    

    This dataset is the Go portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.

      Languages
    

    The dataset's comments are in English and the functions are coded in Go

      Data Splits
    

    Train, test, validation labels are included in the dataset as… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-go.

  12. The Self-Inflicted Collapse: How Recursive Training Undermines Large...

    • figshare.com
    txt
    Updated Mar 13, 2025
    Cite
    123 (2025). The Self-Inflicted Collapse: How Recursive Training Undermines Large Language Models in Automated Software Engineering Tasks [Dataset]. http://doi.org/10.6084/m9.figshare.28559318.v2
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    figshare
    Authors
    123
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Large Language Models (LLMs) have revolutionized natural language processing and are now integral to various automated software engineering tasks, such as code generation, vulnerability detection, and code summarization. However, the way these models are trained critically affects their long-term performance. In particular, recursive self-training, where models are continuously fine-tuned on data generated by their own outputs, poses a significant challenge, as it can lead to the gradual accumulation of errors and a phenomenon known as model collapse.

    This paper, "The Self-Inflicted Collapse: How Recursive Training Undermines Large Language Models in Automated Software Engineering Tasks," investigates the impact of recursive training on LLMs. Our study leverages three well-known datasets:

    * HumanEval is used for the code generation task, providing a collection of programming problems with reference solutions to measure accuracy through the pass@1 metric.
    * CodeSearchNet serves the code summarization task, offering paired code snippets and human-written summaries, with performance evaluated using BLEU-4 scores.
    * ReVeal Dataset is employed for the vulnerability detection task, containing annotated smart contract code and detailed vulnerability reports, with performance assessed via the F1 score.

    We benchmark six models (ChatGPT 4o, ChatGPT 4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, and Llama 3.2) across these tasks. First, baseline performance is established by fine-tuning each model exclusively on high-quality human-generated data. Then, we simulate a recursive training scenario in which the models are continuously fine-tuned on their own generated outputs over 10 generations. Performance is monitored through various metrics, including pass@1, F1 score, BLEU-4, and perplexity, to capture how recursive self-training affects each model's predictive capability.

    Our experimental results reveal a consistent pattern of performance degradation when models are trained solely on their own outputs. As the generations progress, key metrics decline and perplexity increases, providing quantitative evidence of model collapse. This study highlights the risks associated with recursive self-training and underscores the need for improved training paradigms to maintain the robustness of LLMs in automated software engineering applications.
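
    Three of the metrics named in the description can be sketched in a few lines. This is a minimal illustration using the standard textbook definitions, not the paper's evaluation code: it assumes one generated solution per problem for pass@1, binary labels for F1, and natural-log token probabilities for perplexity. BLEU-4 is left out because it needs n-gram matching against reference summaries.

```python
import math


def pass_at_1(passed):
    """Fraction of problems whose single generated solution passed its tests."""
    return sum(1 for ok in passed if ok) / len(passed)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

    Under these definitions, a falling pass@1 or F1 and a rising perplexity across generations would correspond to the degradation the description reports.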

  13. codesearchnet-python-pep8-v1

    • huggingface.co
    Updated Apr 23, 2023
    + more versions
    Cite
    Kejian Shi (2023). codesearchnet-python-pep8-v1 [Dataset]. https://huggingface.co/datasets/kejian/codesearchnet-python-pep8-v1
    Dataset updated
    Apr 23, 2023
    Authors
    Kejian Shi
    Description

    kejian/codesearchnet-python-pep8-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. CodeSearchNet-go-qrels

    • huggingface.co
    Updated Aug 14, 2024
    + more versions
    Cite
    CoIR (2024). CodeSearchNet-go-qrels [Dataset]. https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet-go-qrels
    Dataset updated
    Aug 14, 2024
    Dataset authored and provided by
    CoIR
    Description

    Dataset Card for "CodeSearchNet-go-qrels"

    More Information needed

  15. Cleaned-CodeSearchNet

    • huggingface.co
    Updated Jun 8, 2025
    Cite
    Hunter Paxton (2025). Cleaned-CodeSearchNet [Dataset]. https://huggingface.co/datasets/Hunter-Pax/Cleaned-CodeSearchNet
    Dataset updated
    Jun 8, 2025
    Authors
    Hunter Paxton
    Description

    Hunter-Pax/Cleaned-CodeSearchNet dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. ruCoir-CodeSearchNet-python-qrels

    • huggingface.co
    Updated Feb 21, 2025
    + more versions
    Cite
    NLP Core Team (2025). ruCoir-CodeSearchNet-python-qrels [Dataset]. https://huggingface.co/datasets/NLPCoreTeam/ruCoir-CodeSearchNet-python-qrels
    Dataset updated
    Feb 21, 2025
    Dataset authored and provided by
    NLP Core Team
    Description

    NLPCoreTeam/ruCoir-CodeSearchNet-python-qrels dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. codesearchnet_challenge

    • huggingface.co
    Updated Aug 24, 2023
    + more versions
    Cite
    ir-datasets (2023). codesearchnet_challenge [Dataset]. https://huggingface.co/datasets/irds/codesearchnet_challenge
    Dataset updated
    Aug 24, 2023
    Dataset authored and provided by
    ir-datasets
    Description

    Dataset Card for codesearchnet/challenge

    The codesearchnet/challenge dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.

      Data
    

    This dataset provides:

    queries (i.e., topics); count=99

    qrels: (relevance assessments); count=4,006

    For docs, use irds/codesearchnet

      Usage
    

    from datasets import load_dataset

    queries = load_dataset('irds/codesearchnet_challenge', 'queries')
    for record in queries:…

    See the full description on the dataset page: https://huggingface.co/datasets/irds/codesearchnet_challenge.
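
    To show how relevance assessments (qrels) are typically consumed, here is a small sketch that indexes qrels by query and scores a ranked list with precision@k. The tuple layout (query_id, doc_id, relevance), the identifiers, and the rule that relevance above zero counts as relevant are all assumptions made for illustration; the actual field names in this dataset may differ.

```python
from collections import defaultdict


def build_qrels_index(qrels):
    """Map query_id -> {doc_id: relevance} from (query_id, doc_id, relevance) tuples."""
    index = defaultdict(dict)
    for query_id, doc_id, relevance in qrels:
        index[query_id][doc_id] = relevance
    return index


def precision_at_k(index, query_id, ranked_doc_ids, k):
    """Share of the top-k ranked documents judged relevant (relevance > 0)."""
    judged = index.get(query_id, {})
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if judged.get(doc_id, 0) > 0)
    return hits / k


# Hypothetical assessments and ranking, for illustration only.
qrels = [("q1", "d1", 2), ("q1", "d2", 0), ("q1", "d3", 1), ("q2", "d9", 1)]
index = build_qrels_index(qrels)
score = precision_at_k(index, "q1", ["d1", "d2", "d3", "d4"], k=4)
```

    Unjudged documents are treated as not relevant here, which is a common but not universal convention.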

  18. code_x_glue_tc_nl_code_search_adv

    • huggingface.co
    • opendatalab.com
    Updated Jul 27, 2023
    Cite
    Google (2023). code_x_glue_tc_nl_code_search_adv [Dataset]. https://huggingface.co/datasets/google/code_x_glue_tc_nl_code_search_adv
    Dataset updated
    Jul 27, 2023
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_tc_nl_code_search_adv"

      Dataset Summary
    

    CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv. The dataset comes from CodeSearchNet and is filtered as follows:

    * Remove examples whose code cannot be parsed into an abstract syntax tree.
    * Remove examples whose documentation has fewer than 3 or more than 256 tokens.
    * Remove examples whose documentation contains special tokens…

    See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_tc_nl_code_search_adv.
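
    The first two filtering rules can be sketched as follows. This is an illustration under stated assumptions, not the CodeXGLUE preprocessing script: it assumes the code is Python (so the standard `ast` module can stand in for the parser) and counts documentation tokens by whitespace splitting. The third rule is truncated in the description and is deliberately not implemented.

```python
import ast

MIN_DOC_TOKENS = 3    # lower bound from the description
MAX_DOC_TOKENS = 256  # upper bound from the description


def parses_to_ast(code):
    """True if the source parses as Python (assumed language for this sketch)."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def keep_example(code, documentation):
    """Apply the AST rule and the documentation-length rule."""
    n_tokens = len(documentation.split())  # whitespace tokenization is an assumption
    return parses_to_ast(code) and MIN_DOC_TOKENS <= n_tokens <= MAX_DOC_TOKENS
```

    For example, `keep_example("def f(x):\n    return x + 1\n", "Add one to x.")` keeps the pair, while unparsable code or a two-word docstring is dropped.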

  19. CodeSearchNet-Python

    • huggingface.co
    Updated Apr 9, 2023
    Cite
    Abubakar Wazih Tushar (2023). CodeSearchNet-Python [Dataset]. https://huggingface.co/datasets/algo-tushar/CodeSearchNet-Python
    Dataset updated
    Apr 9, 2023
    Authors
    Abubakar Wazih Tushar
    Description

    algo-tushar/CodeSearchNet-Python dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. codesearchnet-knowlang

    • huggingface.co
    Updated Mar 23, 2025
    Cite
    Gabhyun Kim (2025). codesearchnet-knowlang [Dataset]. https://huggingface.co/datasets/gabykim/codesearchnet-knowlang
    Dataset updated
    Mar 23, 2025
    Authors
    Gabhyun Kim
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    gabykim/codesearchnet-knowlang dataset hosted on Hugging Face and contributed by the HF Datasets community
