100+ datasets found
  1. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    + more versions
    Cite
    BigCode (2022). the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog
    

    Release Description

    v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.

    v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
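
    A minimal sketch of streaming one language subset of The Stack with the Hugging Face datasets library, so the multi-terabyte corpus never has to be downloaded in full. The per-language data_dir layout ("data/python") and the "content" field name follow the dataset card as I recall it; treat them as assumptions and check the card for your version.

    from datasets import load_dataset

    # Stream a single-language slice instead of materializing ~3TB locally.
    stack_python = load_dataset(
        "bigcode/the-stack",
        data_dir="data/python",   # assumed per-language directory layout
        split="train",
        streaming=True,
    )

    for i, example in enumerate(stack_python):
        print(example["content"][:200])  # assumed field holding the source file text
        if i == 2:
            break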

  2. code-generation-dataset

    • huggingface.co
    Updated May 31, 2025
    Cite
    M Mashhudur Rahim (2025). code-generation-dataset [Dataset]. https://huggingface.co/datasets/XythicK/code-generation-dataset
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    M Mashhudur Rahim
    Description

    📄 Code Generation Dataset

    A large-scale dataset curated for training and evaluating code generation models. This dataset contains high-quality code snippets, prompts, and metadata suitable for various code synthesis tasks, including prompt completion, function generation, and docstring-to-code translation.

      📦 Dataset Summary
    

    The code-generation-dataset provides:

    ✅ Prompts describing coding tasks
    ✅ Code solutions in Python (or other languages, if applicable)
    ✅ Metadata… See the full description on the dataset page: https://huggingface.co/datasets/XythicK/code-generation-dataset.
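
    A minimal, schema-agnostic sketch of peeking at this dataset with the Hugging Face datasets library; no split or column names are assumed, since the card excerpt above only says it contains prompts, code solutions, and metadata.

    from datasets import load_dataset

    ds = load_dataset("XythicK/code-generation-dataset")
    print(ds)  # shows the available splits and column names

    first_split = next(iter(ds.values()))
    example = first_split[0]
    print({k: str(v)[:80] for k, v in example.items()})  # peek at the first record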

  3. BigCodeBench Dataset

    • paperswithcode.com
    Updated Jun 21, 2024
    Cite
    Terry Yue Zhuo; Minh Chien Vu; Jenny Chim; Han Hu; Wenhao Yu; Ratnadira Widyasari; Imam Nur Bani Yusuf; Haolan Zhan; Junda He; Indraneil Paul; Simon Brunner; Chen Gong; Thong Hoang; Armel Randy Zebaze; Xiaoheng Hong; Wen-Ding Li; Jean Kaddour; Ming Xu; Zhihan Zhang; Prateek Yadav; Naman jain; Alex Gu; Zhoujun Cheng; Jiawei Liu; Qian Liu; Zijian Wang; Binyuan Hui; Niklas Muennighoff; David Lo; Daniel Fried; Xiaoning Du; Harm de Vries; Leandro von Werra (2024). BigCodeBench Dataset [Dataset]. https://paperswithcode.com/dataset/bigcodebench
    Explore at:
    Dataset updated
    Jun 21, 2024
    Authors
    Terry Yue Zhuo; Minh Chien Vu; Jenny Chim; Han Hu; Wenhao Yu; Ratnadira Widyasari; Imam Nur Bani Yusuf; Haolan Zhan; Junda He; Indraneil Paul; Simon Brunner; Chen Gong; Thong Hoang; Armel Randy Zebaze; Xiaoheng Hong; Wen-Ding Li; Jean Kaddour; Ming Xu; Zhihan Zhang; Prateek Yadav; Naman jain; Alex Gu; Zhoujun Cheng; Jiawei Liu; Qian Liu; Zijian Wang; Binyuan Hui; Niklas Muennighoff; David Lo; Daniel Fried; Xiaoning Du; Harm de Vries; Leandro von Werra
    Description

    BigCodeBench is an easy-to-use benchmark for code generation with practical and challenging programming tasks¹. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting¹. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls¹.

    Here are some key features of BigCodeBench:
    • Precise evaluation & ranking: It provides a leaderboard with the latest LLM rankings before and after rigorous evaluation¹.
    • Pre-generated samples: BigCodeBench accelerates code intelligence research by open-sourcing LLM-generated samples for various models¹.
    • Execution environment: The execution environment in BigCodeBench is less bounded than EvalPlus, to support tasks with diverse library dependencies¹.
    • Test evaluation: BigCodeBench relies on unittest for evaluating the generated code¹.

    (1) GitHub - bigcode-project/bigcodebench: BigCodeBench: The Next .... https://github.com/bigcode-project/bigcodebench/.
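
    As noted above, BigCodeBench scores generated code with unittest-based tests. The snippet below only illustrates that idea with a made-up task and solution, not the benchmark's own harness: the model-generated source is exec'd into a namespace and a small TestCase is run against it.

    import textwrap
    import unittest

    # Hypothetical model output for a toy task (not an actual BigCodeBench task).
    generated_code = textwrap.dedent("""
        def moving_average(values, window):
            return [sum(values[i:i + window]) / window
                    for i in range(len(values) - window + 1)]
    """)

    namespace = {}
    exec(generated_code, namespace)  # load the generated function into a namespace

    class TestMovingAverage(unittest.TestCase):
        def test_basic(self):
            self.assertEqual(namespace["moving_average"]([1, 2, 3, 4], 2),
                             [1.5, 2.5, 3.5])

    if __name__ == "__main__":
        unittest.main()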

  4. LAIL

    • figshare.com
    zip
    Updated Jul 30, 2024
    Cite
    Jia Li (2024). LAIL [Dataset]. http://doi.org/10.6084/m9.figshare.22014596.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    figshare
    Authors
    Jia Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAIL is a Large language model-Aware selection approach for In-context-Learning-based code generation. LAIL uses LLMs themselves to select examples: the LLM labels a candidate example as a positive or a negative example for a given requirement.

    Requirements: openai, tqdm, java. We also provide a script (/Evaluation/evaluation_setup.sh) to help set up the programming language dependencies used in evaluation:
    bash evaluation_setup.sh

    Dataset. The datasets contain DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset collected from real-world code repositories and aligned with them in multiple dimensions, so we take DevEval (../Dataset/DevEval) as the example of how the data is processed.
    train.jsonl and test.jsonl: (1) We randomly select two domains to evaluate LAIL and the baselines: the scientific engineering domain and the text processing domain. (2) We randomly split the tasks of the two domains into a training set and a test set, which yields 101 examples in the training set and 49 examples in the test set. (3) Given a requirement from a repository, we use tree-sitter to parse the repository and acquire all of its functions. (4) The functions contained in the repository form the candidate pool, from which LAIL and the baselines retrieve a few functions as demonstration examples.
    The source data and test_source data folders consist of the original code repositories collected from GitHub. The estimate_prompt folder contains the constructed prompts used to estimate candidate examples. The generation_prompt folder contains the constructed prompts whose demonstration examples were selected by LAIL and the different baselines. For example: (1) the ICL_LAIL folder provides the ids of the examples selected by LAIL in LAIL_id, and developers can directly use these prompts with codellama_completion.py to generate programs; (2) after generating programs, developers need to post-process them with process_generation.py; (3) finally, developers evaluate the generated programs with the source code in the Evaluation folder.

    LAIL

    Estimate candidate examples by LLMs themselves. We leverage the LLM itself to estimate candidate examples; the code is stored in the LAIL/estimate_examples package. Taking DevEval as the example: (1) the /Dataset/DevEval/estimate_prompt folder contains the constructed estimation prompts; (2) developers run the following command to estimate candidate examples with CodeLlama-7B:
    bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt
    (3) According to the probability feedback of the LLM, we acquire the positive and negative examples.

    Train a neural retriever. We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the /LAIL/LAIL/retriever/train folder:
    export CUDA_VISIBLE_DEVICES=0
    nohup python run.py \
        --output_dir=/saved_models \
        --model_type=roberta \
        --config_name=microsoft/graphcodebert-base \
        --model_name_or_path=microsoft/graphcodebert-base \
        --tokenizer_name=microsoft/graphcodebert-base \
        --do_train \
        --train_data_file=/id.jsonl \
        --epoch 100 \
        --block_size 128 \
        --train_batch_size 16 \
        --learning_rate 1e-4 \
        --max_grad_norm 1.0 \
        --seed 123456 >mbpp.txt 2>&1 &

    Select a few demonstration examples using the trained retriever. Given a test requirement, developers use the trained retriever to select a few demonstration examples. The code is stored in the /LAIL/LAIL/retriever/train folder:
    bash run_inference.sh ../Dataset/DevEval

    Code Generation. (1) After acquiring the prompt context consisting of a few selected examples, developers input a test requirement and the prompt context into the LLM and acquire the desired programs. For example, developers use CodeLlama (../LAIL/ICL_LAIL/codellama_completion.py) to generate programs:
    export CUDA_VISIBLE_DEVICES=0
    torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 --output_base=output_random --get_logits=False
    (2) After generating programs, developers need to process the generated programs with ../LAIL/ICL_LAIL/process_generation.py:
    python process_generation.py

    Baselines. This paper contains seven baselines that use different approaches to select demonstration examples for ICL-based code generation. (1) The source code is in the baselines folder, and each baseline is in an individual folder. Developers can acquire the selected examples of all baselines by running:
    python baselines.py
    (2) Then, developers use /baselines/make_prompt.py to construct a prompt context from the selected candidate examples:
    python make_prompt.py ICLCoder ICLCoder -1

    Evaluation. In this paper, we use Pass@k to evaluate the performance of LAIL and the baselines with the source code in LAIL/Evaluation. Since DevEval is a repository-level code generation dataset that is complex to evaluate, developers can use the pipeline in /LAIL/Evaluation/ to evaluate the different approaches.

    Citation. If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.
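
    The evaluation above reports Pass@k. A common way to compute it per task is the unbiased estimator popularized by the Codex paper, pass@k = 1 - C(n-c, k) / C(n, k), where n samples were generated and c of them passed; the sketch below is that generic formula, not code taken from the LAIL repository.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k for one task: n samples generated, c of them correct."""
        if n - c < k:
            return 1.0  # every size-k subset contains at least one correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 20 samples per task, 3 of them pass the tests.
    print(round(pass_at_k(20, 3, 1), 4))  # 0.15
    print(round(pass_at_k(20, 3, 5), 4))  # ~0.6009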

  5. Improving LLM Code Generation via Testing and Static Analysis Feedback

    • figshare.com
    zip
    Updated Sep 11, 2024
    Cite
    Vincenzo Arceri (2024). Improving LLM Code Generation via Testing and Static Analysis Feedback [Dataset]. http://doi.org/10.6084/m9.figshare.26984716.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    figshare
    Authors
    Vincenzo Arceri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    • assertion: Scripts for the (in)correctness analysis, plus results for the first-generation and repair-phase experiments.
    • compilation: Scripts to refactor the generated files and obtain the files that compile.
    • correctness_stats: Aggregate stats for the (in)correctness analysis.
    • dataset: Contains the dataset used for the experiments (100_clean_tasks.json) and other additional files.
    • files_to_analyze_strict: Files to analyze in the phases after generation; these are the 89 files that compile for all the models.
    • first_gen_output_prompt*: Generated output for the first generation of the prompt experiments.
    • generation: Script for interacting with and prompting the models to obtain the output for each phase.
    • infer: Vulnerability reports created by Infer for the first generation and the vulnerability repair phase, plus scripts for running Infer.
    • infer_stats: Vulnerability stats for the first generation and the repair phase, including the repair prompt experiments.
    • iterations-correctness: Generated output for the correctness repair experiments at each iteration.
    • iterations-vulnerabilities: Generated output for the vulnerability repair experiments at each iteration.
    • prompt_experiments: Contains prompts and some results for the prompt experiments that we ran.
    • regeneration_output_correctness_prompt*: Generated output for the correctness repair experiments.
    • regeneration_output_vulnerability_prompt*: Generated output for the vulnerability repair experiments.
    • self_correctness_output_prompt*: Generated output for the self-correctness experiments.
    • self_safety_output_prompt*: Generated output for the self-safety experiments.
    • self_correctness_stats: Generated stats for the self-correctness experiments.
    • self_safety_stats: Generated stats for the self-safety experiments.
    • stats: Python script for obtaining different stats.
    • folder_descr.txt: This file; a description of the folders in this directory.
    • README.md: Pipeline description with some results reported in the paper.
  6. CodeSearchNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 30, 2024
    Cite
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt (2024). CodeSearchNet Dataset [Dataset]. https://paperswithcode.com/dataset/codesearchnet
    Explore at:
    Dataset updated
    Dec 30, 2024
    Authors
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt
    Description

    The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
    • Six million methods overall
    • Two million of which have associated documentation (docstrings, JavaDoc, and more)
    • Metadata that indicates the original location (repository or line number, for example) where the data was found
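
    A minimal sketch of iterating over documentation/code pairs from the Hugging Face mirror of the corpus. The dataset id "code_search_net" and the field names below are assumptions based on that mirror's card; inspect the schema if they differ.

    from datasets import load_dataset

    csn = load_dataset("code_search_net", "python", split="train", streaming=True)

    for i, ex in enumerate(csn):
        print(ex["func_documentation_string"][:80])  # assumed field: the docstring
        print(ex["func_code_string"][:80])           # assumed field: the function body
        if i == 1:
            break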

  7. AIMO-24: Model (openai-community/gpt2-large)

    • kaggle.com
    zip
    Updated Apr 7, 2024
    Cite
    Dinh Thoai Tran @ randrise.com (2024). AIMO-24: Model (openai-community/gpt2-large) [Dataset]. https://www.kaggle.com/datasets/dinhttrandrise/aimo-24-model-openai-community-gpt2-large
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 7, 2024
    Authors
    Dinh Thoai Tran @ randrise.com
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    language: en

    license: mit

    GPT-2 Large

    Table of Contents

    Model Details

    Model Description: GPT-2 Large is the 774M parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is pretrained on English text using a causal language modeling (CLM) objective.

    How to Get Started with the Model

    Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

    >>> from transformers import pipeline, set_seed
    >>> generator = pipeline('text-generation', model='gpt2-large')
    >>> set_seed(42)
    >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
    
    [{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
     {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
     {'generated_text': "Hello, I'm a language model, why does this matter for you?
    
    When I hear new languages, I tend to start thinking in terms"},
     {'generated_text': "Hello, I'm a language model, a functional language...
    
    I don't need to know anything else. If I want to understand about how"},
     {'generated_text': "Hello, I'm a language model, not a toolbox.
    
    In a nutshell, a language model is a set of attributes that define how"}]
    

    Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import GPT2Tokenizer, GPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = GPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

    and in TensorFlow:

    from transformers import GPT2Tokenizer, TFGPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = TFGPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

    Uses

    Direct Use

    In their model card about GPT-2, OpenAI wrote:

    The primary intended users of these models are AI researchers and practitioners.

    We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.

    Downstream Use

    In their model card about GPT-2, OpenAI wrote:

    Here are some secondary use cases we believe are likely:

    • Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
    • Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
    • Entertainment: Creation of games, chat bots, and amusing generations.

    Misuse and Out-of-scope Use

    In their model card about GPT-2, OpenAI wrote:

    Because large-scale language models like GPT-2 ...

  8. Code Training Model Generation Software Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Code Training Model Generation Software Report [Dataset]. https://www.marketreportanalytics.com/reports/code-training-model-generation-software-52268
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Code Training Model Generation Software market is experiencing rapid growth, driven by the increasing demand for efficient and accurate code generation. The market, estimated at $2 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This robust expansion is fueled by several key factors. Firstly, the rising complexity of software development necessitates tools that automate repetitive tasks and accelerate the coding process. Secondly, the growing adoption of AI and machine learning across various industries is creating a significant demand for code generation solutions that can handle increasingly sophisticated algorithms and data structures. Thirdly, the availability of large, publicly accessible datasets for training these models is further fueling innovation and market expansion. The market is segmented by application (enterprise and personal use) and deployment type (cloud-based and on-premises), with cloud-based solutions gaining significant traction due to their scalability and cost-effectiveness. Leading players like OpenAI, GitHub, and others are driving innovation and competition, fostering the development of more powerful and user-friendly tools. The geographical distribution of the market shows strong growth across North America and Europe, fueled by a high concentration of technology companies and a mature software development ecosystem. Asia Pacific is also witnessing substantial growth, driven by a rapidly expanding tech sector and increasing digital adoption. However, market penetration in regions like the Middle East and Africa remains relatively low, presenting significant future growth opportunities. While the market faces challenges like data security concerns and the need for continuous model training and updates, the overall outlook remains positive, with significant potential for further expansion driven by ongoing advancements in AI and machine learning technologies. The growing adoption of DevOps methodologies and the need for faster software development cycles are further solidifying the long-term growth trajectory of the code training model generation software market.

  9. github-code

    • huggingface.co
    Cite
    CodeParrot, github-code [Dataset]. https://huggingface.co/datasets/codeparrot/github-code
    Explore at:
    Dataset authored and provided by
    CodeParrot
    License

    https://choosealicense.com/licenses/other/

    Description

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.
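
    A minimal sketch of streaming this dataset rather than downloading all ~1TB. The per-language filter and the "code"/"language" field names follow the dataset card as I recall it; treat them as assumptions.

    from datasets import load_dataset

    gh = load_dataset(
        "codeparrot/github-code",
        split="train",
        streaming=True,
        languages=["Python"],  # assumed loader argument for language filtering
    )

    first = next(iter(gh))
    print(first["language"], len(first["code"]))  # assumed field names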

  10. ClassEval Dataset

    • paperswithcode.com
    Updated Aug 2, 2023
    Cite
    Xueying Du; Mingwei Liu; Kaixin Wang; Hanlin Wang; Junwei Liu; Yixuan Chen; Jiayi Feng; Chaofeng Sha; Xin Peng; Yiling Lou (2023). ClassEval Dataset [Dataset]. https://paperswithcode.com/dataset/classeval
    Explore at:
    Dataset updated
    Aug 2, 2023
    Authors
    Xueying Du; Mingwei Liu; Kaixin Wang; Hanlin Wang; Junwei Liu; Yixuan Chen; Jiayi Feng; Chaofeng Sha; Xin Peng; Yiling Lou
    Description

    In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark, ClassEval, of 100 class-level Python code generation tasks with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Our results yield the following main findings. First, all existing LLMs perform much worse on class-level code generation than on standalone method-level code generation benchmarks like HumanEval, and method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 still show a dominant lead over other LLMs on class-level code generation, with a second tier of Instruct-Starcoder, Instruct-Codegen, and Wizardcoder at very similar performance. Third, generating the entire class all at once (i.e. the holistic generation strategy) is the best strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e. incremental and compositional) is the better strategy for the other models, which have limited ability to understand long instructions and to utilize intermediate information. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes.

  11. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    Updated May 18, 2024
    + more versions
    Cite
    Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
    Explore at:
    Dataset updated
    May 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

    The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

    Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

    Automatic classifications of code_blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.

    The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    kernels_meta2.csv may contain kernels without a Kaggle score, but with their place on the leaderboard (rank).

    The Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.

  12. Data from: Data Science Problems

    • github.com
    • opendatalab.com
    Updated Feb 8, 2022
    + more versions
    Cite
    (2022). Data Science Problems [Dataset]. https://github.com/microsoft/DataScienceProblems
    Explore at:
    Dataset updated
    Feb 8, 2022
    License

    https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt

    Description

    Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state of the art results and other properties of the dataset.

  13. Data from: CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java...

    • zenodo.org
    application/gzip, bin
    Updated Apr 28, 2025
    Cite
    Kaihang Jiang; Jin Bihui; Nie Pengyu; Kaihang Jiang; Jin Bihui; Nie Pengyu (2025). CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java Repositories [Dataset]. http://doi.org/10.5281/zenodo.15293313
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kaihang Jiang; Jin Bihui; Nie Pengyu; Kaihang Jiang; Jin Bihui; Nie Pengyu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers often face the tedious task of upgrading their codebase to new programming language versions. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting their applicability in automating code upgrade. However, there exists no benchmark for evaluating the code upgrade ability of LLMs, as distilling code changes related to programming language evolution from real-world software repositories’ commit histories is a complex challenge.
    In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade, focusing on the code changes related to the evolution of Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7–23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features; and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build configurations. Our proposed dataset provides high-quality samples by filtering irrelevant and noisy changes and verifying the compilability of upgraded code. Moreover, CoUpJava reveals diversity in code upgrade scenarios, ranging from small, fine-grained refactorings to large-scale repository modifications.

  14. Databricks Dolly 15K Dataset

    • kaggle.com
    Updated Apr 13, 2023
    Cite
    Snehil Sanyal (2023). Databricks Dolly 15K Dataset [Dataset]. https://www.kaggle.com/datasets/snehilsanyal/databricks-dolly-15k-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Snehil Sanyal
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset was taken from the Databricks GitHub repository. It is made public by Databricks for research and commercial use cases. The repository originally provides a JSONL file, which was used to create the CSV file included in this dataset.

    Summary

    Blog post: Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM

    databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.

    Supported Tasks: - Training LLMs - Synthetic Data Generation - Data Augmentation

    Languages: English Version: 1.0

    Owner: Databricks, Inc.

    Dataset Overview

    databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

    Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.

    For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]) which we recommend users remove for downstream applications.
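
    A minimal sketch of the recommended cleanup, assuming the Hugging Face release databricks/databricks-dolly-15k (the Kaggle copy above is a CSV of the same records): bracketed citation markers such as [42] are stripped from the context field before downstream use.

    import re
    from datasets import load_dataset

    dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

    def strip_citations(text: str) -> str:
        """Remove [12]-style Wikipedia citation markers."""
        return re.sub(r"\[\d+\]", "", text)

    cleaned = [strip_citations(row["context"]) for row in dolly]
    print(cleaned[0][:200])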

    Intended Uses

    While immediately valuable for instruction fine tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation in the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.

    Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short responses, with the resulting text associated to the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.

    Dataset

    Purpose of Collection

    As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

    Sources

    • Human-generated data: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
    • Wikipedia: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization) contributors selected passages from Wikipedia for particular subsets of instruction categories. No guidance was given to annotators as to how to select the target passages.

    Annotator Guidelines

    To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous co...

  15. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonimous authors; Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonimous authors; Anonimous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
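
    A minimal pandas sketch of the joins described above (the file paths are placeholders for wherever the CSVs are stored): code blocks are linked to notebook metadata via kernel_id, and notebooks to competitions via comp_name.

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels_meta = pd.read_csv("kernels_meta2.csv")        # Code4ML 2.0 kernels
    competitions = pd.read_csv("competitions_meta_2.csv")  # enriched competition info

    snippets = (
        code_blocks
        .merge(kernels_meta, on="kernel_id", how="inner")
        .merge(competitions, on="comp_name", how="left")
    )

    # Each row now pairs a code snippet with its notebook metadata and competition.
    print(snippets[["code_block", "kaggle_score", "comp_name"]].head())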

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  16. Raw Data for Research Paper: Analyzing Prominent LLMs: An empirical study on...

    • zenodo.org
    Updated Feb 3, 2025
    + more versions
    Cite
    Anonymous; Anonymous (2025). Raw Data for Research Paper: Analyzing Prominent LLMs: An empirical study on solving LeetCode problems [Dataset]. http://doi.org/10.5281/zenodo.14791416
    Explore at:
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Replication Package contains data and results of evaluating the performance and complexity of Large Language Models (LLMs) in solving programming challenges in LeetCode. It was developed with the paper "Analyzing Prominent LLMs: An Empirical Study on Solving LeetCode Problems," submitted to the 29th International Conference on Evaluation and Assessment in Software Engineering (2025). The dataset includes prompt templates, problem IDs, model-generated code solutions, and a spreadsheet with the raw data. Further details about the processed data, visualizations, and scripts will be provided with the final version of the paper.

  17. MultiPL-E

    • huggingface.co
    Updated Jan 6, 2025
    + more versions
    Cite
    Northeastern University PRL (2025). MultiPL-E [Dataset]. https://huggingface.co/datasets/nuprl-staging/MultiPL-E
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 6, 2025
    Dataset authored and provided by
    Northeastern University PRL
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for MultiPL-E

      Dataset Summary
    

    MultiPL-E is a dataset for evaluating large language models for code generation that supports 22 programming languages. It takes the OpenAI HumanEval and the Mostly Basic Python Programs (MBPP) benchmarks and uses little compilers to translate them to other languages. It is easy to add support for new languages and benchmarks. The dataset is divided into several configurations named SRCDATA-LANG, where SRCDATA is either… See the full description on the dataset page: https://huggingface.co/datasets/nuprl-staging/MultiPL-E.
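
    A minimal sketch of discovering and loading one SRCDATA-LANG configuration with the datasets library; the "test" split name is an assumption, and no field names are assumed.

    from datasets import get_dataset_config_names, load_dataset

    configs = get_dataset_config_names("nuprl-staging/MultiPL-E")
    print(configs[:10])  # SRCDATA-LANG names, e.g. HumanEval or MBPP per target language

    cfg = configs[0]
    ds = load_dataset("nuprl-staging/MultiPL-E", cfg, split="test")  # split name assumed
    print(ds[0])  # inspect the translated prompt and test fields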

  18. DA-Code

    • huggingface.co
    Cite
    Jianwen Luo, DA-Code [Dataset]. https://huggingface.co/datasets/Jianwen2003/DA-Code
    Explore at:
    Authors
    Jianwen Luo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [EMNLP2024] DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

    DA-Code is a comprehensive evaluation dataset designed to assess the data analysis and code generation capabilities of LLMs in agent-based data science tasks. Our papers and experiment reports have been published on arXiv.

      Dataset Overview
    

    500 complex real-world data analysis tasks across Data Wrangling (DW), Machine Learning (ML), and Exploratory Data Analysis (EDA). Tasks cover… See the full description on the dataset page: https://huggingface.co/datasets/Jianwen2003/DA-Code.

  19. APPS Dataset

    • paperswithcode.com
    Updated May 21, 2021
    + more versions
    Cite
    Dan Hendrycks; Steven Basart; Saurav Kadavath; Mantas Mazeika; Akul Arora; Ethan Guo; Collin Burns; Samir Puranik; Horace He; Dawn Song; Jacob Steinhardt (2021). APPS Dataset [Dataset]. https://paperswithcode.com/dataset/apps
    Explore at:
    Dataset updated
    May 21, 2021
    Authors
    Dan Hendrycks; Steven Basart; Saurav Kadavath; Mantas Mazeika; Akul Arora; Ethan Guo; Collin Burns; Samir Puranik; Horace He; Dawn Song; Jacob Steinhardt
    Description

    The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS benchmark attempts to mirror how human programmers are evaluated, by posing coding problems in unrestricted natural language and evaluating the correctness of solutions. The problems range in difficulty from introductory to collegiate competition level and measure coding ability as well as problem-solving.

    The Automated Programming Progress Standard, abbreviated APPS, consists of 10,000 coding problems in total, with 131,836 test cases for checking solutions and 232,444 ground-truth solutions written by humans. Problems can be complicated, as the average length of a problem is 293.2 words. The data are split evenly into training and test sets, with 5,000 problems each. In the test set, every problem has multiple test cases, and the average number of test cases is 21.2. Each test case is specifically designed for the corresponding problem, enabling us to rigorously evaluate program functionality.
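
    A minimal sketch of loading APPS through the "codeparrot/apps" Hugging Face mirror (the mirror id and field names are assumptions; the benchmark is also distributed via its GitHub release) and confirming the 5,000/5,000 train/test split described above.

    from datasets import load_dataset

    apps = load_dataset("codeparrot/apps")
    print({split: len(apps[split]) for split in apps})  # expected: 5000 train / 5000 test

    problem = apps["test"][0]
    print(problem["question"][:200])  # assumed field: the natural-language problem statement
    print(problem["difficulty"])      # assumed field: introductory / interview / competition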

  20. CoNaLa Dataset

    • paperswithcode.com
    Updated May 31, 2024
    + more versions
    Cite
    Pengcheng Yin; Bowen Deng; Edgar Chen; Bogdan Vasilescu; Graham Neubig (2024). CoNaLa Dataset [Dataset]. https://paperswithcode.com/dataset/conala
    Explore at:
    Dataset updated
    May 31, 2024
    Authors
    Pengcheng Yin; Bowen Deng; Edgar Chen; Bogdan Vasilescu; Graham Neubig
    Description

    CMU CoNaLa, the Code/Natural Language Challenge dataset, is a joint project of the Carnegie Mellon University NeuLab and Strudel labs. Its purpose is to test the generation of code snippets from natural language. The data comes from StackOverflow questions. There are 2,379 training and 500 test examples that were manually annotated. Every example has a natural language intent and its corresponding Python snippet. In addition to the manually annotated dataset, there are also 598,237 mined intent-snippet pairs. These examples are similar to the hand-annotated ones, except that they include a probability indicating whether the pair is valid.
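
    A minimal sketch of using both parts described above: the manually annotated intent/snippet pairs and the mined pairs with their validity probability. The Hugging Face id "neulab/conala" and the "prob" field name are assumptions; the data is also distributed as JSON/JSONL files on the CoNaLa website.

    from datasets import load_dataset

    curated = load_dataset("neulab/conala", "curated", split="train")
    mined = load_dataset("neulab/conala", "mined", split="train", streaming=True)

    print(curated[0]["intent"], "->", curated[0]["snippet"])

    # Keep only mined pairs whose validity probability clears a threshold.
    confident = (ex for ex in mined if ex.get("prob", 0.0) > 0.5)
    print(next(confident))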
