The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found
Dataset Card for CodeSearchNet
This dataset is a collection of comment-code pairs in various programming languages. See code_search_net for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
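As a minimal, self-contained sketch of how such a comment-code pair feeds contrastive embedding training, consider the record below; the field contents are invented for illustration, only the ("comment", "code") schema comes from the card:

```python
# Hypothetical record with the dataset's ("comment", "code") schema;
# the field contents are invented for illustration.
record = {
    "comment": "Return the sum of two integers.",
    "code": "def add(a, b):\n    return a + b",
}

# Sentence Transformers-style contrastive losses (e.g.
# MultipleNegativesRankingLoss) consume such records as
# (anchor, positive) text pairs:
anchor, positive = record["comment"], record["code"]
print(anchor)
print(positive)
```

During training, the other codes in a batch serve as in-batch negatives for each comment, which is why plain (comment, code) pairs are sufficient.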
Dataset Subsets
pair subset
Columns: "comment", "code"
Column types: str, str
Examples:
{ 'comment': 'Computes the new parent id for the node being moved.\n@return int', 'code': "protected function… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/codesearchnet.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-python"
Dataset Description
Homepage: None
Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python
Paper: None
Leaderboard: None
Point of Contact: @Nan-Do
Dataset Summary
This dataset is the Python portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
To evaluate with the MTEB framework's version of this dataset, use the code below:

import mteb
import logging
from sentence_transformers import SentenceTransformer
from mteb import MTEB

logger = logging.getLogger(__name__)

model_name = 'intfloat/e5-base-v2'
model = SentenceTransformer(model_name)
tasks = mteb.get_tasks(
    tasks=[
        "AppsRetrieval",
        "CodeFeedbackMT",
        "CodeFeedbackST",
        "CodeTransOceanContest",
        "CodeTransOceanDL"…

See the full description on the dataset page: https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet-ccr.
This dataset was created by Om Duggineni
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-ruby"
Dataset Summary
This dataset is the Ruby portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in Ruby.
Data Splits
Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-ruby.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-javascript"
Dataset Summary
This dataset is the JavaScript portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in JavaScript.
Data Splits
Train, test, validation labels are… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-javascript.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-java"
Dataset Summary
This dataset is the Java portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in Java.
Data Splits
Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-java.
https://choosealicense.com/licenses/other/
CodeSearchNet
This is an unofficial reupload of the code_search_net dataset in Parquet format. I have also removed the columns func_code_tokens, func_documentation_tokens, and split_name, as they are not relevant. The original repository relies on a Python module that is downloaded and executed to unpack the dataset, which is a potential security risk and, more practically, raises an annoying warning. As a plus, Parquet files load faster. Original model card:
Dataset Card for… See the full description on the dataset page: https://huggingface.co/datasets/claudios/code_search_net.
https://choosealicense.com/licenses/ms-pl/
AhmedSSoliman/CodeSearchNet-Python dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-go"
Dataset Summary
This dataset is the Go portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in Go.
Data Splits
Train, test, validation labels are included in the dataset as… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-go.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Large Language Models (LLMs) have revolutionized natural language processing and are now integral to various automated software engineering tasks, such as code generation, vulnerability detection, and code summarization. However, the way these models are trained critically affects their long-term performance. In particular, recursive self-training, where models are continuously fine-tuned on data generated by their own outputs, poses a significant challenge, as it can lead to the gradual accumulation of errors and a phenomenon known as model collapse. This paper, "The Self-Inflicted Collapse: How Recursive Training Undermines Large Language Models in Automated Software Engineering Tasks," investigates the impact of recursive training on LLMs. Our study leverages three well-known datasets:
* HumanEval is used for the code generation task, providing a collection of programming problems with reference solutions to measure accuracy through the pass@1 metric.
* CodeSearchNet serves the code summarization task, offering paired code snippets and human-written summaries, with performance evaluated using BLEU-4 scores.
* The ReVeal dataset is employed for the vulnerability detection task, containing annotated smart contract code and detailed vulnerability reports, with performance assessed via the F1 score.
We benchmark six models (ChatGPT 4o, ChatGPT 4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, and Llama 3.2) across these tasks. First, baseline performance is established by fine-tuning each model exclusively on high-quality human-generated data. Then, we simulate a recursive training scenario in which the models are continuously fine-tuned on their own generated outputs over 10 generations.
Performance is monitored through various metrics, including pass@1, F1 score, BLEU-4, and perplexity, to capture how recursive self-training affects each model's predictive capability. Our experimental results reveal a consistent pattern of performance degradation when models are trained solely on their own outputs. As the generations progress, key metrics decline and perplexity increases, providing quantitative evidence of model collapse. This study highlights the risks associated with recursive self-training and underscores the need for improved training paradigms to maintain the robustness of LLMs in automated software engineering applications.
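For reference, the pass@1 metric mentioned above is conventionally computed with the unbiased pass@k estimator introduced with HumanEval; a minimal sketch in Python (the function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k samples drawn without replacement from n generations is correct,
    given that c of the n generations pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=10 samples of which c=4 are correct, pass@1 reduces to c/n = 0.4
print(pass_at_k(10, 4, 1))
```

For k=1 the estimator simplifies to the fraction of correct samples, which is why pass@1 can be read as per-problem accuracy averaged over the benchmark.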
kejian/codesearchnet-python-pep8-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "CodeSearchNet-go-qrels"
More Information needed
Hunter-Pax/Cleaned-CodeSearchNet dataset hosted on Hugging Face and contributed by the HF Datasets community
NLPCoreTeam/ruCoir-CodeSearchNet-python-qrels dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for codesearchnet/challenge
The codesearchnet/challenge dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
queries (i.e., topics); count=99
qrels: (relevance assessments); count=4,006
For docs, use irds/codesearchnet
Usage
from datasets import load_dataset
queries = load_dataset('irds/codesearchnet_challenge', 'queries')
for record in queries:
    … See the full description on the dataset page: https://huggingface.co/datasets/irds/codesearchnet_challenge.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_tc_nl_code_search_adv"
Dataset Summary
CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv. The dataset we use comes from CodeSearchNet, and we filter it as follows:
* Remove examples whose code cannot be parsed into an abstract syntax tree.
* Remove examples where the document has fewer than 3 or more than 256 tokens.
* Remove examples where documents contain special tokens… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_tc_nl_code_search_adv.
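The filters above can be sketched as a small Python function. This is a sketch under stated assumptions: the helper name and the whitespace tokenizer are mine (CodeXGLUE's pipeline uses its own tokenization), and the special-token list is abbreviated:

```python
import ast

# Abbreviated, illustrative list; the real filter checks more tokens.
SPECIAL_TOKENS = ("<img", "http://", "https://")

def keep_example(code: str, doc: str) -> bool:
    """Apply the three CodeXGLUE-style filters described above (sketch)."""
    # 1. Drop examples whose code cannot be parsed into an AST
    #    (works for the Python subset this dataset is built from).
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    # 2. Drop documents with fewer than 3 or more than 256 tokens
    #    (naive whitespace tokenization here).
    n_tokens = len(doc.split())
    if n_tokens < 3 or n_tokens > 256:
        return False
    # 3. Drop documents containing special tokens.
    if any(tok in doc for tok in SPECIAL_TOKENS):
        return False
    return True

print(keep_example("def f(x):\n    return x + 1", "Add one to x."))
```

Running `keep_example` over each (code, docstring) pair and keeping only the True cases reproduces the shape of the filtering described above.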
algo-tushar/CodeSearchNet-Python dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
gabykim/codesearchnet-knowlang dataset hosted on Hugging Face and contributed by the HF Datasets community