The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found
Dataset Card for CodeSearchNet
This dataset is a collection of comment-code pairs in various programming languages. See code_search_net for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
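As a minimal, self-contained sketch of how such a comment-code pair feeds contrastive embedding training, consider the record below; the field contents are invented for illustration, only the ("comment", "code") schema comes from the card:

```python
# Hypothetical record with the dataset's ("comment", "code") schema;
# the field contents are invented for illustration.
record = {
    "comment": "Return the sum of two integers.",
    "code": "def add(a, b):\n    return a + b",
}

# Sentence Transformers-style contrastive losses (e.g.
# MultipleNegativesRankingLoss) consume such records as
# (anchor, positive) text pairs:
anchor, positive = record["comment"], record["code"]
print(anchor)
print(positive)
```

During training, the other codes in a batch serve as in-batch negatives for each comment, which is why plain (comment, code) pairs are sufficient.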
Dataset Subsets
pair subset
Columns: "comment", "code"
Column types: str, str
Examples:
{ 'comment': 'Computes the new parent id for the node being moved.\n@return int', 'code': "protected function… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/codesearchnet.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-python"
Dataset Description
Homepage: None
Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python
Paper: None
Leaderboard: None
Point of Contact: @Nan-Do
Dataset Summary
This dataset is the Python portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
To evaluate with the MTEB framework's version of this dataset, use the code below:

import mteb
import logging
from sentence_transformers import SentenceTransformer
from mteb import MTEB

logger = logging.getLogger(__name__)

model_name = 'intfloat/e5-base-v2'
model = SentenceTransformer(model_name)
tasks = mteb.get_tasks(
    tasks=[
        "AppsRetrieval",
        "CodeFeedbackMT",
        "CodeFeedbackST",
        "CodeTransOceanContest",
        "CodeTransOceanDL"…

See the full description on the dataset page: https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet-ccr.
This dataset was created by Om Duggineni
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-ruby"
Dataset Summary
This dataset is the Ruby portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in Ruby.
Data Splits
Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-ruby.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-javascript"
Dataset Summary
This dataset is the JavaScript portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in JavaScript.
Data Splits
Train, test, validation labels are… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-javascript.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-java"
Dataset Summary
This dataset is the Java portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in Java.
Data Splits
Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-java.
https://choosealicense.com/licenses/other/
CodeSearchNet
This is an unofficial reupload of the code_search_net dataset in Parquet format. I have also removed the columns func_code_tokens, func_documentation_tokens, and split_name, as they are not relevant. The original repository relies on a Python module that is downloaded and executed to unpack the dataset, which is a potential security risk and, more practically, raises an annoying warning. As a plus, Parquet files load faster. Original model card:
Dataset Card for… See the full description on the dataset page: https://huggingface.co/datasets/claudios/code_search_net.
https://choosealicense.com/licenses/ms-pl/
AhmedSSoliman/CodeSearchNet-Python dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-go"
Dataset Summary
This dataset is the Go portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in Go.
Data Splits
Train, test, validation labels are included in the dataset as… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-go.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Large Language Models (LLMs) have revolutionized natural language processing and are now integral to various automated software engineering tasks, such as code generation, vulnerability detection, and code summarization. However, the way these models are trained critically affects their long-term performance. In particular, recursive self-training, where models are continuously fine-tuned on data generated by their own outputs, poses a significant challenge, as it can lead to the gradual accumulation of errors and a phenomenon known as model collapse. This paper, "The Self-Inflicted Collapse: How Recursive Training Undermines Large Language Models in Automated Software Engineering Tasks," investigates the impact of recursive training on LLMs. Our study leverages three well-known datasets:
* HumanEval is used for the code generation task, providing a collection of programming problems with reference solutions to measure accuracy through the pass@1 metric.
* CodeSearchNet serves the code summarization task, offering paired code snippets and human-written summaries, with performance evaluated using BLEU-4 scores.
* The ReVeal dataset is employed for the vulnerability detection task, containing annotated smart contract code and detailed vulnerability reports, with performance assessed via the F1 score.
We benchmark six models (ChatGPT 4o, ChatGPT 4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, and Llama 3.2) across these tasks. First, baseline performance is established by fine-tuning each model exclusively on high-quality human-generated data. Then, we simulate a recursive training scenario in which the models are continuously fine-tuned on their own generated outputs over 10 generations.
Performance is monitored through various metrics, including pass@1, F1 score, BLEU-4, and perplexity, to capture how recursive self-training affects each model's predictive capability. Our experimental results reveal a consistent pattern of performance degradation when models are trained solely on their own outputs. As the generations progress, key metrics decline and perplexity increases, providing quantitative evidence of model collapse. This study highlights the risks associated with recursive self-training and underscores the need for improved training paradigms to maintain the robustness of LLMs in automated software engineering applications.
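For reference, the pass@1 metric mentioned above is conventionally computed with the unbiased pass@k estimator introduced with HumanEval; a minimal sketch in Python (the function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k samples drawn without replacement from n generations is correct,
    given that c of the n generations pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=10 samples of which c=4 are correct, pass@1 reduces to c/n = 0.4
print(pass_at_k(10, 4, 1))
```

For k=1 the estimator simplifies to the fraction of correct samples, which is why pass@1 can be read as per-problem accuracy averaged over the benchmark.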
kejian/codesearchnet-python-pep8-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "CodeSearchNet-go-qrels"
More Information needed
Hunter-Pax/Cleaned-CodeSearchNet dataset hosted on Hugging Face and contributed by the HF Datasets community
NLPCoreTeam/ruCoir-CodeSearchNet-python-qrels dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for codesearchnet/challenge
The codesearchnet/challenge dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
queries (i.e., topics); count=99
qrels: (relevance assessments); count=4,006
For docs, use irds/codesearchnet
Usage
from datasets import load_dataset
queries = load_dataset('irds/codesearchnet_challenge', 'queries')
for record in queries:
    … See the full description on the dataset page: https://huggingface.co/datasets/irds/codesearchnet_challenge.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_tc_nl_code_search_adv"
Dataset Summary
CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv. The dataset we use comes from CodeSearchNet, and we filter it as follows:
* Remove examples whose code cannot be parsed into an abstract syntax tree.
* Remove examples where the document has fewer than 3 or more than 256 tokens.
* Remove examples where documents contain special tokens… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_tc_nl_code_search_adv.
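The filters above can be sketched as a small Python function. This is a sketch under stated assumptions: the helper name and the whitespace tokenizer are mine (CodeXGLUE's pipeline uses its own tokenization), and the special-token list is abbreviated:

```python
import ast

# Abbreviated, illustrative list; the real filter checks more tokens.
SPECIAL_TOKENS = ("<img", "http://", "https://")

def keep_example(code: str, doc: str) -> bool:
    """Apply the three CodeXGLUE-style filters described above (sketch)."""
    # 1. Drop examples whose code cannot be parsed into an AST
    #    (works for the Python subset this dataset is built from).
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    # 2. Drop documents with fewer than 3 or more than 256 tokens
    #    (naive whitespace tokenization here).
    n_tokens = len(doc.split())
    if n_tokens < 3 or n_tokens > 256:
        return False
    # 3. Drop documents containing special tokens.
    if any(tok in doc for tok in SPECIAL_TOKENS):
        return False
    return True

print(keep_example("def f(x):\n    return x + 1", "Add one to x."))
```

Running `keep_example` over each (code, docstring) pair and keeping only the True cases reproduces the shape of the filtering described above.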
algo-tushar/CodeSearchNet-Python dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
gabykim/codesearchnet-knowlang dataset hosted on Hugging Face and contributed by the HF Datasets community