100+ datasets found
  1. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    + more versions
    Cite
    BigCode (2022). the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog
    

    Release Description

    v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.

    v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
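
    A minimal sketch of streaming one language subset of The Stack with the Hugging Face datasets library, so the multi-terabyte corpus never has to be downloaded in full. The per-language data_dir layout ("data/python") and the "content" field name follow the dataset card as I recall it; treat them as assumptions and check the card for your version.

    from datasets import load_dataset

    # Stream a single-language slice instead of materializing ~3TB locally.
    stack_python = load_dataset(
        "bigcode/the-stack",
        data_dir="data/python",   # assumed per-language directory layout
        split="train",
        streaming=True,
    )

    for i, example in enumerate(stack_python):
        print(example["content"][:200])  # assumed field holding the source file text
        if i == 2:
            break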

  2. code-generation-dataset

    • huggingface.co
    Updated May 31, 2025
    Cite
    M Mashhudur Rahim (2025). code-generation-dataset [Dataset]. https://huggingface.co/datasets/XythicK/code-generation-dataset
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    M Mashhudur Rahim
    Description

    📄 Code Generation Dataset

    A large-scale dataset curated for training and evaluating code generation models. This dataset contains high-quality code snippets, prompts, and metadata suitable for various code synthesis tasks, including prompt completion, function generation, and docstring-to-code translation.

      📦 Dataset Summary
    

    The code-generation-dataset provides:

    ✅ Prompts describing coding tasks
    ✅ Code solutions in Python (or other languages, if applicable)
    ✅ Metadata… See the full description on the dataset page: https://huggingface.co/datasets/XythicK/code-generation-dataset.
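
    A minimal, schema-agnostic sketch of peeking at this dataset with the Hugging Face datasets library; no split or column names are assumed, since the card excerpt above only says it contains prompts, code solutions, and metadata.

    from datasets import load_dataset

    ds = load_dataset("XythicK/code-generation-dataset")
    print(ds)  # shows the available splits and column names

    first_split = next(iter(ds.values()))
    example = first_split[0]
    print({k: str(v)[:80] for k, v in example.items()})  # peek at the first record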

  3. BigCodeBench Dataset

    • paperswithcode.com
    Updated Jun 21, 2024
    Cite
    Terry Yue Zhuo; Minh Chien Vu; Jenny Chim; Han Hu; Wenhao Yu; Ratnadira Widyasari; Imam Nur Bani Yusuf; Haolan Zhan; Junda He; Indraneil Paul; Simon Brunner; Chen Gong; Thong Hoang; Armel Randy Zebaze; Xiaoheng Hong; Wen-Ding Li; Jean Kaddour; Ming Xu; Zhihan Zhang; Prateek Yadav; Naman jain; Alex Gu; Zhoujun Cheng; Jiawei Liu; Qian Liu; Zijian Wang; Binyuan Hui; Niklas Muennighoff; David Lo; Daniel Fried; Xiaoning Du; Harm de Vries; Leandro von Werra (2024). BigCodeBench Dataset [Dataset]. https://paperswithcode.com/dataset/bigcodebench
    Explore at:
    Dataset updated
    Jun 21, 2024
    Authors
    Terry Yue Zhuo; Minh Chien Vu; Jenny Chim; Han Hu; Wenhao Yu; Ratnadira Widyasari; Imam Nur Bani Yusuf; Haolan Zhan; Junda He; Indraneil Paul; Simon Brunner; Chen Gong; Thong Hoang; Armel Randy Zebaze; Xiaoheng Hong; Wen-Ding Li; Jean Kaddour; Ming Xu; Zhihan Zhang; Prateek Yadav; Naman jain; Alex Gu; Zhoujun Cheng; Jiawei Liu; Qian Liu; Zijian Wang; Binyuan Hui; Niklas Muennighoff; David Lo; Daniel Fried; Xiaoning Du; Harm de Vries; Leandro von Werra
    Description

    BigCodeBench is an easy-to-use benchmark for code generation with practical and challenging programming tasks¹. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting¹. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls¹.

    Here are some key features of BigCodeBench:
    • Precise evaluation & ranking: It provides a leaderboard with the latest LLM rankings before and after rigorous evaluation¹.
    • Pre-generated samples: BigCodeBench accelerates code intelligence research by open-sourcing LLM-generated samples for various models¹.
    • Execution environment: The execution environment in BigCodeBench is less bounded than EvalPlus, to support tasks with diverse library dependencies¹.
    • Test evaluation: BigCodeBench relies on unittest for evaluating the generated code¹.

    (1) GitHub - bigcode-project/bigcodebench: BigCodeBench: The Next .... https://github.com/bigcode-project/bigcodebench/.
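
    As noted above, BigCodeBench scores generated code with unittest-based tests. The snippet below only illustrates that idea with a made-up task and solution, not the benchmark's own harness: the model-generated source is exec'd into a namespace and a small TestCase is run against it.

    import textwrap
    import unittest

    # Hypothetical model output for a toy task (not an actual BigCodeBench task).
    generated_code = textwrap.dedent("""
        def moving_average(values, window):
            return [sum(values[i:i + window]) / window
                    for i in range(len(values) - window + 1)]
    """)

    namespace = {}
    exec(generated_code, namespace)  # load the generated function into a namespace

    class TestMovingAverage(unittest.TestCase):
        def test_basic(self):
            self.assertEqual(namespace["moving_average"]([1, 2, 3, 4], 2),
                             [1.5, 2.5, 3.5])

    if __name__ == "__main__":
        unittest.main()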

  4. LAIL

    • figshare.com
    zip
    Updated Jul 30, 2024
    Cite
    Jia Li (2024). LAIL [Dataset]. http://doi.org/10.6084/m9.figshare.22014596.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    figshare
    Authors
    Jia Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAIL is a Large language model-Aware selection approach for In-context-Learning-based code generation. LAIL uses LLMs themselves to select examples: the LLM labels a candidate example as a positive or a negative example for a given requirement.

    Requirements: openai, tqdm, java. We also provide a script (/Evaluation/evaluation_setup.sh) to help set up the programming language dependencies used in evaluation:
    bash evaluation_setup.sh

    Dataset. The datasets contain DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset collected from real-world code repositories and aligned with them in multiple dimensions, so we take DevEval (../Dataset/DevEval) as the example of how the data is processed.
    train.jsonl and test.jsonl: (1) We randomly select two domains to evaluate LAIL and the baselines: the scientific engineering domain and the text processing domain. (2) We randomly split the tasks of the two domains into a training set and a test set, which yields 101 examples in the training set and 49 examples in the test set. (3) Given a requirement from a repository, we use tree-sitter to parse the repository and acquire all of its functions. (4) The functions contained in the repository form the candidate pool, from which LAIL and the baselines retrieve a few functions as demonstration examples.
    The source data and test_source data folders consist of the original code repositories collected from GitHub. The estimate_prompt folder contains the constructed prompts used to estimate candidate examples. The generation_prompt folder contains the constructed prompts whose demonstration examples were selected by LAIL and the different baselines. For example: (1) the ICL_LAIL folder provides the ids of the examples selected by LAIL in LAIL_id, and developers can directly use these prompts with codellama_completion.py to generate programs; (2) after generating programs, developers need to post-process them with process_generation.py; (3) finally, developers evaluate the generated programs with the source code in the Evaluation folder.

    LAIL

    Estimate candidate examples by LLMs themselves. We leverage the LLM itself to estimate candidate examples; the code is stored in the LAIL/estimate_examples package. Taking DevEval as the example: (1) the /Dataset/DevEval/estimate_prompt folder contains the constructed estimation prompts; (2) developers run the following command to estimate candidate examples with CodeLlama-7B:
    bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt
    (3) According to the probability feedback of the LLM, we acquire the positive and negative examples.

    Train a neural retriever. We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the /LAIL/LAIL/retriever/train folder:
    export CUDA_VISIBLE_DEVICES=0
    nohup python run.py \
        --output_dir=/saved_models \
        --model_type=roberta \
        --config_name=microsoft/graphcodebert-base \
        --model_name_or_path=microsoft/graphcodebert-base \
        --tokenizer_name=microsoft/graphcodebert-base \
        --do_train \
        --train_data_file=/id.jsonl \
        --epoch 100 \
        --block_size 128 \
        --train_batch_size 16 \
        --learning_rate 1e-4 \
        --max_grad_norm 1.0 \
        --seed 123456 >mbpp.txt 2>&1 &

    Select a few demonstration examples using the trained retriever. Given a test requirement, developers use the trained retriever to select a few demonstration examples. The code is stored in the /LAIL/LAIL/retriever/train folder:
    bash run_inference.sh ../Dataset/DevEval

    Code Generation. (1) After acquiring the prompt context consisting of a few selected examples, developers input a test requirement and the prompt context into the LLM and acquire the desired programs. For example, developers use CodeLlama (../LAIL/ICL_LAIL/codellama_completion.py) to generate programs:
    export CUDA_VISIBLE_DEVICES=0
    torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 --output_base=output_random --get_logits=False
    (2) After generating programs, developers need to process the generated programs with ../LAIL/ICL_LAIL/process_generation.py:
    python process_generation.py

    Baselines. This paper contains seven baselines that use different approaches to select demonstration examples for ICL-based code generation. (1) The source code is in the baselines folder, and each baseline is in an individual folder. Developers can acquire the selected examples of all baselines by running:
    python baselines.py
    (2) Then, developers use /baselines/make_prompt.py to construct a prompt context from the selected candidate examples:
    python make_prompt.py ICLCoder ICLCoder -1

    Evaluation. In this paper, we use Pass@k to evaluate the performance of LAIL and the baselines with the source code in LAIL/Evaluation. Since DevEval is a repository-level code generation dataset that is complex to evaluate, developers can use the pipeline in /LAIL/Evaluation/ to evaluate the different approaches.

    Citation. If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.
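
    The evaluation above reports Pass@k. A common way to compute it per task is the unbiased estimator popularized by the Codex paper, pass@k = 1 - C(n-c, k) / C(n, k), where n samples were generated and c of them passed; the sketch below is that generic formula, not code taken from the LAIL repository.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k for one task: n samples generated, c of them correct."""
        if n - c < k:
            return 1.0  # every size-k subset contains at least one correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 20 samples per task, 3 of them pass the tests.
    print(round(pass_at_k(20, 3, 1), 4))  # 0.15
    print(round(pass_at_k(20, 3, 5), 4))  # ~0.6009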

  5. Improving LLM Code Generation via Testing and Static Analysis Feedback

    • figshare.com
    zip
    Updated Sep 11, 2024
    Cite
    Vincenzo Arceri (2024). Improving LLM Code Generation via Testing and Static Analysis Feedback [Dataset]. http://doi.org/10.6084/m9.figshare.26984716.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    figshare
    Authors
    Vincenzo Arceri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    • assertion: Scripts for the (in)correctness analysis, plus results for the first-generation and repair-phase experiments.
    • compilation: Scripts to refactor the generated files and obtain the files that compile.
    • correctness_stats: Aggregate stats for the (in)correctness analysis.
    • dataset: Contains the dataset used for the experiments (100_clean_tasks.json) and other additional files.
    • files_to_analyze_strict: Files to analyze in the phases after generation; these are the 89 files that compile for all the models.
    • first_gen_output_prompt*: Generated output for the first generation of the prompt experiments.
    • generation: Script for interacting with and prompting the models to obtain the output for each phase.
    • infer: Vulnerability reports created by Infer for the first generation and the vulnerability repair phase, plus scripts for running Infer.
    • infer_stats: Vulnerability stats for the first generation and the repair phase, including the repair prompt experiments.
    • iterations-correctness: Generated output for the correctness repair experiments at each iteration.
    • iterations-vulnerabilities: Generated output for the vulnerability repair experiments at each iteration.
    • prompt_experiments: Contains prompts and some results for the prompt experiments that we ran.
    • regeneration_output_correctness_prompt*: Generated output for the correctness repair experiments.
    • regeneration_output_vulnerability_prompt*: Generated output for the vulnerability repair experiments.
    • self_correctness_output_prompt*: Generated output for the self-correctness experiments.
    • self_safety_output_prompt*: Generated output for the self-safety experiments.
    • self_correctness_stats: Generated stats for the self-correctness experiments.
    • self_safety_stats: Generated stats for the self-safety experiments.
    • stats: Python script for obtaining different stats.
    • folder_descr.txt: This file; a description of the folders in this directory.
    • README.md: Pipeline description with some results reported in the paper.
  6. CodeSearchNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 30, 2024
    Cite
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt (2024). CodeSearchNet Dataset [Dataset]. https://paperswithcode.com/dataset/codesearchnet
    Explore at:
    Dataset updated
    Dec 30, 2024
    Authors
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt
    Description

    The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
    • Six million methods overall
    • Two million of which have associated documentation (docstrings, JavaDoc, and more)
    • Metadata that indicates the original location (repository or line number, for example) where the data was found
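
    A minimal sketch of iterating over documentation/code pairs from the Hugging Face mirror of the corpus. The dataset id "code_search_net" and the field names below are assumptions based on that mirror's card; inspect the schema if they differ.

    from datasets import load_dataset

    csn = load_dataset("code_search_net", "python", split="train", streaming=True)

    for i, ex in enumerate(csn):
        print(ex["func_documentation_string"][:80])  # assumed field: the docstring
        print(ex["func_code_string"][:80])           # assumed field: the function body
        if i == 1:
            break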

  7. AIMO-24: Model (openai-community/gpt2-large)

    • kaggle.com
    zip
    Updated Apr 7, 2024
    Cite
    Dinh Thoai Tran @ randrise.com (2024). AIMO-24: Model (openai-community/gpt2-large) [Dataset]. https://www.kaggle.com/datasets/dinhttrandrise/aimo-24-model-openai-community-gpt2-large
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 7, 2024
    Authors
    Dinh Thoai Tran @ randrise.com
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    language: en

    license: mit

    GPT-2 Large

    Table of Contents

    Model Details

    Model Description: GPT-2 Large is the 774M parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is pretrained on English text using a causal language modeling (CLM) objective.

    How to Get Started with the Model

    Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

    >>> from transformers import pipeline, set_seed
    >>> generator = pipeline('text-generation', model='gpt2-large')
    >>> set_seed(42)
    >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
    
    [{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
     {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
     {'generated_text': "Hello, I'm a language model, why does this matter for you?
    
    When I hear new languages, I tend to start thinking in terms"},
     {'generated_text': "Hello, I'm a language model, a functional language...
    
    I don't need to know anything else. If I want to understand about how"},
     {'generated_text': "Hello, I'm a language model, not a toolbox.
    
    In a nutshell, a language model is a set of attributes that define how"}]
    

    Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import GPT2Tokenizer, GPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = GPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

    and in TensorFlow:

    from transformers import GPT2Tokenizer, TFGPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = TFGPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

    Uses

    Direct Use

    In their model card about GPT-2, OpenAI wrote:

    The primary intended users of these models are AI researchers and practitioners.

    We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.

    Downstream Use

    In their model card about GPT-2, OpenAI wrote:

    Here are some secondary use cases we believe are likely:

    • Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
    • Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
    • Entertainment: Creation of games, chat bots, and amusing generations.

    Misuse and Out-of-scope Use

    In their model card about GPT-2, OpenAI wrote:

    Because large-scale language models like GPT-2 ...

  8. Code Training Model Generation Software Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Code Training Model Generation Software Report [Dataset]. https://www.marketreportanalytics.com/reports/code-training-model-generation-software-52268
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Code Training Model Generation Software market is experiencing rapid growth, driven by the increasing demand for efficient and accurate code generation. The market, estimated at $2 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This robust expansion is fueled by several key factors. Firstly, the rising complexity of software development necessitates tools that automate repetitive tasks and accelerate the coding process. Secondly, the growing adoption of AI and machine learning across various industries is creating a significant demand for code generation solutions that can handle increasingly sophisticated algorithms and data structures. Thirdly, the availability of large, publicly accessible datasets for training these models is further fueling innovation and market expansion. The market is segmented by application (enterprise and personal use) and deployment type (cloud-based and on-premises), with cloud-based solutions gaining significant traction due to their scalability and cost-effectiveness. Leading players like OpenAI, GitHub, and others are driving innovation and competition, fostering the development of more powerful and user-friendly tools. The geographical distribution of the market shows strong growth across North America and Europe, fueled by a high concentration of technology companies and a mature software development ecosystem. Asia Pacific is also witnessing substantial growth, driven by a rapidly expanding tech sector and increasing digital adoption. However, market penetration in regions like the Middle East and Africa remains relatively low, presenting significant future growth opportunities. While the market faces challenges like data security concerns and the need for continuous model training and updates, the overall outlook remains positive, with significant potential for further expansion driven by ongoing advancements in AI and machine learning technologies. The growing adoption of DevOps methodologies and the need for faster software development cycles are further solidifying the long-term growth trajectory of the code training model generation software market.

  9. github-code

    • huggingface.co
    Cite
    CodeParrot, github-code [Dataset]. https://huggingface.co/datasets/codeparrot/github-code
    Explore at:
    Dataset authored and provided by
    CodeParrot
    License

    https://choosealicense.com/licenses/other/

    Description

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.
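
    A minimal sketch of streaming this dataset rather than downloading all ~1TB. The per-language filter and the "code"/"language" field names follow the dataset card as I recall it; treat them as assumptions.

    from datasets import load_dataset

    gh = load_dataset(
        "codeparrot/github-code",
        split="train",
        streaming=True,
        languages=["Python"],  # assumed loader argument for language filtering
    )

    first = next(iter(gh))
    print(first["language"], len(first["code"]))  # assumed field names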

  10. ClassEval Dataset

    • paperswithcode.com
    Updated Aug 2, 2023
    Cite
    Xueying Du; Mingwei Liu; Kaixin Wang; Hanlin Wang; Junwei Liu; Yixuan Chen; Jiayi Feng; Chaofeng Sha; Xin Peng; Yiling Lou (2023). ClassEval Dataset [Dataset]. https://paperswithcode.com/dataset/classeval
    Explore at:
    Dataset updated
    Aug 2, 2023
    Authors
    Xueying Du; Mingwei Liu; Kaixin Wang; Hanlin Wang; Junwei Liu; Yixuan Chen; Jiayi Feng; Chaofeng Sha; Xin Peng; Yiling Lou
    Description

    In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark, ClassEval, of 100 class-level Python code generation tasks with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Our results yield the following main findings. First, all existing LLMs perform much worse on class-level code generation than on standalone method-level code generation benchmarks like HumanEval, and method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 still show a dominant lead over other LLMs on class-level code generation, with a second tier of Instruct-Starcoder, Instruct-Codegen, and Wizardcoder at very similar performance. Third, generating the entire class all at once (i.e. the holistic generation strategy) is the best strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e. incremental and compositional) is the better strategy for the other models, which have limited ability to understand long instructions and to utilize intermediate information. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes.

  11. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    Updated May 18, 2024
    + more versions
    Cite
    Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
    Explore at:
    Dataset updated
    May 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

    The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

    Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

    Automatic classifications of code_blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.

    The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    kernels_meta2.csv may contain kernels without a Kaggle score, but with their place on the leaderboard (rank).

    The Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.

  12. Data from: Data Science Problems

    • github.com
    • opendatalab.com
    Updated Feb 8, 2022
    + more versions
    Cite
    (2022). Data Science Problems [Dataset]. https://github.com/microsoft/DataScienceProblems
    Explore at:
    Dataset updated
    Feb 8, 2022
    License

    https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt

    Description

    Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state of the art results and other properties of the dataset.

  13. Data from: CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java...

    • zenodo.org
    application/gzip, bin
    Updated Apr 28, 2025
    Cite
    Kaihang Jiang; Jin Bihui; Nie Pengyu; Kaihang Jiang; Jin Bihui; Nie Pengyu (2025). CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java Repositories [Dataset]. http://doi.org/10.5281/zenodo.15293313
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kaihang Jiang; Jin Bihui; Nie Pengyu; Kaihang Jiang; Jin Bihui; Nie Pengyu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers often face the tedious task of upgrading their codebase to new programming language versions. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting their applicability in automating code upgrade. However, there exists no benchmark for evaluating the code upgrade ability of LLMs, as distilling code changes related to programming language evolution from real-world software repositories’ commit histories is a complex challenge.
    In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade, focusing on the code changes related to the evolution of Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7–23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features; and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build configurations. Our proposed dataset provides high-quality samples by filtering irrelevant and noisy changes and verifying the compilability of upgraded code. Moreover, CoUpJava reveals diversity in code upgrade scenarios, ranging from small, fine-grained refactorings to large-scale repository modifications.

  14. Databricks Dolly 15K Dataset

    • kaggle.com
    Updated Apr 13, 2023
    Cite
    Snehil Sanyal (2023). Databricks Dolly 15K Dataset [Dataset]. https://www.kaggle.com/datasets/snehilsanyal/databricks-dolly-15k-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Snehil Sanyal
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset was taken from the Databricks GitHub repository. It is made public by Databricks for research and commercial use cases. The repository originally provides a JSONL file, which was used to create the CSV file included in this dataset.

    Summary

    Blog post: Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM

    databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.

    Supported Tasks: - Training LLMs - Synthetic Data Generation - Data Augmentation

    Languages: English Version: 1.0

    Owner: Databricks, Inc.

    Dataset Overview

    databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

    Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.

    For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]) which we recommend users remove for downstream applications.
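
    A minimal sketch of the recommended cleanup, assuming the Hugging Face release databricks/databricks-dolly-15k (the Kaggle copy above is a CSV of the same records): bracketed citation markers such as [42] are stripped from the context field before downstream use.

    import re
    from datasets import load_dataset

    dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

    def strip_citations(text: str) -> str:
        """Remove [12]-style Wikipedia citation markers."""
        return re.sub(r"\[\d+\]", "", text)

    cleaned = [strip_citations(row["context"]) for row in dolly]
    print(cleaned[0][:200])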

    Intended Uses

    While immediately valuable for instruction fine tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation in the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.

    Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short responses, with the resulting text associated to the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.

    Dataset

    Purpose of Collection

    As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

    Sources

    • Human-generated data: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
    • Wikipedia: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization) contributors selected passages from Wikipedia for particular subsets of instruction categories. No guidance was given to annotators as to how to select the target passages.

    Annotator Guidelines

    To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous co...

  15. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonimous authors; Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonimous authors; Anonimous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
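
    A minimal pandas sketch of the joins described above (the file paths are placeholders for wherever the CSVs are stored): code blocks are linked to notebook metadata via kernel_id, and notebooks to competitions via comp_name.

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels_meta = pd.read_csv("kernels_meta2.csv")        # Code4ML 2.0 kernels
    competitions = pd.read_csv("competitions_meta_2.csv")  # enriched competition info

    snippets = (
        code_blocks
        .merge(kernels_meta, on="kernel_id", how="inner")
        .merge(competitions, on="comp_name", how="left")
    )

    # Each row now pairs a code snippet with its notebook metadata and competition.
    print(snippets[["code_block", "kaggle_score", "comp_name"]].head())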

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  16. Raw Data for Research Paper: Analyzing Prominent LLMs: An empirical study on...

    • zenodo.org
    Updated Feb 3, 2025
    + more versions
    Cite
    Anonymous; Anonymous (2025). Raw Data for Research Paper: Analyzing Prominent LLMs: An empirical study on solving LeetCode problems [Dataset]. http://doi.org/10.5281/zenodo.14791416
    Explore at:
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Replication Package contains data and results of evaluating the performance and complexity of Large Language Models (LLMs) in solving programming challenges in LeetCode. It was developed with the paper "Analyzing Prominent LLMs: An Empirical Study on Solving LeetCode Problems," submitted to the 29th International Conference on Evaluation and Assessment in Software Engineering (2025). The dataset includes prompt templates, problem IDs, model-generated code solutions, and a spreadsheet with the raw data. Further details about the processed data, visualizations, and scripts will be provided with the final version of the paper.

  17. MultiPL-E

    • huggingface.co
    Updated Jan 6, 2025
    + more versions
    Cite
    Northeastern University PRL (2025). MultiPL-E [Dataset]. https://huggingface.co/datasets/nuprl-staging/MultiPL-E
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 6, 2025
    Dataset authored and provided by
    Northeastern University PRL
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for MultiPL-E

      Dataset Summary
    

    MultiPL-E is a dataset for evaluating large language models for code generation that supports 22 programming languages. It takes the OpenAI HumanEval and the Mostly Basic Python Programs (MBPP) benchmarks and uses little compilers to translate them to other languages. It is easy to add support for new languages and benchmarks. The dataset is divided into several configurations named SRCDATA-LANG, where SRCDATA is either… See the full description on the dataset page: https://huggingface.co/datasets/nuprl-staging/MultiPL-E.
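
    A minimal sketch of discovering and loading one SRCDATA-LANG configuration with the datasets library; the "test" split name is an assumption, and no field names are assumed.

    from datasets import get_dataset_config_names, load_dataset

    configs = get_dataset_config_names("nuprl-staging/MultiPL-E")
    print(configs[:10])  # SRCDATA-LANG names, e.g. HumanEval or MBPP per target language

    cfg = configs[0]
    ds = load_dataset("nuprl-staging/MultiPL-E", cfg, split="test")  # split name assumed
    print(ds[0])  # inspect the translated prompt and test fields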

  18. DA-Code

    • huggingface.co
    Cite
    Jianwen Luo, DA-Code [Dataset]. https://huggingface.co/datasets/Jianwen2003/DA-Code
    Explore at:
    Authors
    Jianwen Luo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [EMNLP2024] DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

    DA-Code is a comprehensive evaluation dataset designed to assess the data analysis and code generation capabilities of LLMs in agent-based data science tasks. Our papers and experiment reports have been published on arXiv.

      Dataset Overview
    

    500 complex real-world data analysis tasks across Data Wrangling (DW), Machine Learning (ML), and Exploratory Data Analysis (EDA). Tasks cover… See the full description on the dataset page: https://huggingface.co/datasets/Jianwen2003/DA-Code.

  19. APPS Dataset

    • paperswithcode.com
    Updated May 21, 2021
    + more versions
    Cite
    Dan Hendrycks; Steven Basart; Saurav Kadavath; Mantas Mazeika; Akul Arora; Ethan Guo; Collin Burns; Samir Puranik; Horace He; Dawn Song; Jacob Steinhardt (2021). APPS Dataset [Dataset]. https://paperswithcode.com/dataset/apps
    Explore at:
    Dataset updated
    May 21, 2021
    Authors
    Dan Hendrycks; Steven Basart; Saurav Kadavath; Mantas Mazeika; Akul Arora; Ethan Guo; Collin Burns; Samir Puranik; Horace He; Dawn Song; Jacob Steinhardt
    Description

    The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS benchmark attempts to mirror how human programmers are evaluated, by posing coding problems in unrestricted natural language and evaluating the correctness of solutions. The problems range in difficulty from introductory to collegiate competition level and measure coding ability as well as problem-solving.

    The Automated Programming Progress Standard, abbreviated APPS, consists of 10,000 coding problems in total, with 131,836 test cases for checking solutions and 232,444 ground-truth solutions written by humans. Problems can be complicated, as the average length of a problem is 293.2 words. The data are split evenly into training and test sets, with 5,000 problems each. In the test set, every problem has multiple test cases, and the average number of test cases is 21.2. Each test case is specifically designed for the corresponding problem, enabling us to rigorously evaluate program functionality.
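
    A minimal sketch of loading APPS through the "codeparrot/apps" Hugging Face mirror (the mirror id and field names are assumptions; the benchmark is also distributed via its GitHub release) and confirming the 5,000/5,000 train/test split described above.

    from datasets import load_dataset

    apps = load_dataset("codeparrot/apps")
    print({split: len(apps[split]) for split in apps})  # expected: 5000 train / 5000 test

    problem = apps["test"][0]
    print(problem["question"][:200])  # assumed field: the natural-language problem statement
    print(problem["difficulty"])      # assumed field: introductory / interview / competition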

  20. CoNaLa Dataset

    • paperswithcode.com
    Updated May 31, 2024
    + more versions
    Cite
    Pengcheng Yin; Bowen Deng; Edgar Chen; Bogdan Vasilescu; Graham Neubig (2024). CoNaLa Dataset [Dataset]. https://paperswithcode.com/dataset/conala
    Explore at:
    Dataset updated
    May 31, 2024
    Authors
    Pengcheng Yin; Bowen Deng; Edgar Chen; Bogdan Vasilescu; Graham Neubig
    Description

    CMU CoNaLa, the Code/Natural Language Challenge dataset, is a joint project of the Carnegie Mellon University NeuLab and Strudel labs. Its purpose is to test the generation of code snippets from natural language. The data comes from StackOverflow questions. There are 2,379 training and 500 test examples that were manually annotated. Every example has a natural language intent and its corresponding Python snippet. In addition to the manually annotated dataset, there are also 598,237 mined intent-snippet pairs. These examples are similar to the hand-annotated ones, except that they include a probability indicating whether the pair is valid.
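
    A minimal sketch of using both parts described above: the manually annotated intent/snippet pairs and the mined pairs with their validity probability. The Hugging Face id "neulab/conala" and the "prob" field name are assumptions; the data is also distributed as JSON/JSONL files on the CoNaLa website.

    from datasets import load_dataset

    curated = load_dataset("neulab/conala", "curated", split="train")
    mined = load_dataset("neulab/conala", "mined", split="train", streaming=True)

    print(curated[0]["intent"], "->", curated[0]["snippet"])

    # Keep only mined pairs whose validity probability clears a threshold.
    confident = (ex for ex in mined if ex.get("prob", 0.0) > 0.5)
    print(next(confident))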
