https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog

| Release | Description |
|---------|-------------|
| v1.0 | Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: three of the included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size. |
| v1.1 | The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack. |
📄 Code Generation Dataset
A large-scale dataset curated for training and evaluating code generation models. This dataset contains high-quality code snippets, prompts, and metadata suitable for various code synthesis tasks, including prompt completion, function generation, and docstring-to-code translation.
📦 Dataset Summary
The code-generation-dataset provides:
✅ Prompts describing coding tasks
✅ Code solutions in Python (or other languages, if applicable)
✅ Metadata… See the full description on the dataset page: https://huggingface.co/datasets/XythicK/code-generation-dataset.
BigCodeBench is an easy-to-use benchmark for code generation with practical and challenging programming tasks¹. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting¹. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls¹.
Here are some key features of BigCodeBench:
- Precise evaluation & ranking: it provides a leaderboard for the latest LLM rankings before and after rigorous evaluation¹.
- Pre-generated samples: BigCodeBench accelerates code intelligence research by open-sourcing LLM-generated samples for various models¹.
- Execution environment: the execution environment in BigCodeBench is less bounded than EvalPlus, to support tasks with diverse library dependencies¹.
- Test evaluation: BigCodeBench relies on unittest for evaluating the generated code¹.
(1) GitHub - bigcode-project/bigcodebench: BigCodeBench: The Next .... https://github.com/bigcode-project/bigcodebench/.
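Since the description notes that evaluation is unittest-based, the sketch below illustrates that general style of checking a generated solution. It is a minimal, generic example and not BigCodeBench's actual harness; the generated snippet and the test case are made up for illustration.

```python
# Generic sketch: validate a model-generated function with unittest.
# Illustrative only — not BigCodeBench's evaluation harness.
import unittest

generated_code = """
def add(a, b):
    return a + b
"""  # pretend this string came from an LLM

namespace = {}
exec(generated_code, namespace)  # materialize the generated function

class TestGeneratedSolution(unittest.TestCase):
    def test_add(self):
        self.assertEqual(namespace["add"](2, 3), 5)

if __name__ == "__main__":
    unittest.main()
```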
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide a setup script (/Evaluation/evaluation_setup.sh) to help set up the programming language dependencies that are used in evaluation:

```bash
bash evaluation_setup.sh
```
###### Dataset
The datasets contain DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset, which is collected from real-world code repositories. The dataset aligns with real-world code repositories in multiple dimensions. Thus, we take DevEval as the example to demonstrate how to process the dataset. Take ../Dataset/DevEval as an example.

train.jsonl and test.jsonl:
(1) We randomly select two domains to evaluate LAIL and the baselines, including the scientific engineering domain and the text processing domain.
(2) We randomly split the tasks of the two domains into a training set and a test set. Finally, we acquire 101 examples in the training set and 49 examples in the test set.
(3) Given a requirement from a repository, we use tree-sitter to parse the repository and acquire all functions of the repository.
(4) We treat the functions contained in the repository as the candidate pool. Then LAIL and the baselines retrieve a few functions from the candidate pool as demonstration examples.

The source data and test_source data folders consist of the original code repositories collected from GitHub. The estimate_prompt folder contains the constructed prompts to estimate candidate examples. The generation_prompt folder contains the constructed prompts where the demonstration examples are selected by LAIL and the different baselines. For example:
(1) The ICL_LAIL folder provides the selected examples' ids in LAIL_id, as chosen by our LAIL. Developers can directly use these provided prompts through codellama_completion.py to generate programs.
(2) After generating programs, developers need to process the generated programs with process_generation.py.
(3) Finally, developers evaluate the generated programs with the source code in the Evaluation folder.

## LAIL
### Estimate candidate examples by LLMs themselves
We leverage LLMs themselves to estimate candidate examples. The code is stored in the LAIL/estimate_examples package. Take DevEval as an example:
(1) The /Dataset/DevEval/estimate_prompt folder contains the constructed prompts to estimate candidate examples.
(2) Developers run the following command to estimate candidate examples with CodeLlama-7B:

```bash
bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt
```

(3) According to the probability feedback of the LLMs, we acquire the positive and negative examples.

###### Train a neural retriever
(1) We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the /LAIL/LAIL/retriever/train folder.

```bash
export CUDA_VISIBLE_DEVICES=0
nohup python run.py \
  --output_dir=/saved_models \
  --model_type=roberta \
  --config_name=microsoft/graphcodebert-base \
  --model_name_or_path=microsoft/graphcodebert-base \
  --tokenizer_name=microsoft/graphcodebert-base \
  --do_train \
  --train_data_file=/id.jsonl \
  --epoch 100 \
  --block_size 128 \
  --train_batch_size 16 \
  --learning_rate 1e-4 \
  --max_grad_norm 1.0 \
  --seed 123456 >mbpp.txt 2>&1 &
```
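For intuition about what "train a neural retriever with contrastive learning" involves, the following is a minimal, illustrative sketch of an InfoNCE-style objective over GraphCodeBERT embeddings. It is not the repository's run.py; the requirement and candidate strings, the mean-pooling choice, and the temperature are all assumptions made for the example.

```python
# Minimal, illustrative sketch of contrastive retriever training
# (not the repository's run.py). Assumes torch and transformers are installed.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
encoder = AutoModel.from_pretrained("microsoft/graphcodebert-base")

def embed(texts):
    # Mean-pool the last hidden states into one vector per text.
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # (B, H)

def info_nce_loss(requirement, positive, negatives, temperature=0.05):
    # Score the requirement against the positive and each negative candidate;
    # the loss pushes the positive example to the top of the ranking.
    q = embed([requirement])                              # (1, H)
    cands = embed([positive] + negatives)                 # (1 + K, H)
    sims = F.cosine_similarity(q, cands) / temperature    # (1 + K,)
    labels = torch.zeros(1, dtype=torch.long)             # index 0 = positive
    return F.cross_entropy(sims.unsqueeze(0), labels)

# Hypothetical requirement, positive, and negatives purely for illustration.
loss = info_nce_loss(
    "Parse a config file and return a dict of options.",
    "def parse_config(path): ...",
    ["def send_mail(to): ...", "class Cache: ..."],
)
loss.backward()
```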
## Select a few demonstration examples using the trained retriever
(2) Given a test requirement, developers use the trained retriever to select a few demonstration examples. The code is stored in the /LAIL/LAIL/retriever/train folder.

```bash
bash run_inference.sh ../Dataset/DevEval
```

###### Code Generation
(1) After acquiring the prompt context consisting of a few selected examples, developers input a test requirement and the prompt context into LLMs and acquire the desired programs. For example, developers use CodeLlama (../LAIL/ICL_LAIL/codellama_completion.py) to generate programs:

```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b \
  ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 \
  --output_base=output_random --get_logits=False
```

(2) After generating programs, developers need to process the generated programs with ../LAIL/ICL_LAIL/process_generation.py:

```bash
python process_generation.py
```
###### Baselines
This paper contains seven baselines that use different approaches to select demonstration examples for ICL-based code generation.
(1) The source code is in the baselines folder and each baseline is in an individual folder. Developers can acquire the selected examples of all baselines by running the source code as follows:

```bash
python baselines.py
```

(2) Then, developers use /baselines/make_prompt.py to construct a prompt context using the selected candidate examples as follows:

```bash
python make_prompt.py ICLCoder ICLCoder -1
```
###### Evaluation
In this paper, we use Pass@k to evaluate the performance of LAIL and the baselines with the source code in the LAIL/Evaluation folder. Since DevEval is a repository-level code generation dataset that is complex to evaluate, developers can use the following pipeline to evaluate different approaches with the source code in /LAIL/Evaluation/.
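For reference, the Pass@k metric mentioned above is usually computed with the standard unbiased estimator from the Codex paper. The snippet below is a generic sketch of that estimator, not the evaluation code shipped in LAIL/Evaluation.

```python
# Unbiased pass@k estimator: given n generated samples per task, of which c pass
# the tests, estimate the probability that at least one of k drawn samples is correct.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 generations per task, 3 of them pass the tests.
print(pass_at_k(20, 3, 1), pass_at_k(20, 3, 10))
```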
## Citation
If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Model Description: GPT-2 Large is the 774M-parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is pretrained on English text using a causal language modeling (CLM) objective.
Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2-large')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
 {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
 {'generated_text': "Hello, I'm a language model, why does this matter for you?\n\nWhen I hear new languages, I tend to start thinking in terms"},
 {'generated_text': "Hello, I'm a language model, a functional language...\n\nI don't need to know anything else. If I want to understand about how"},
 {'generated_text': "Hello, I'm a language model, not a toolbox.\n\nIn a nutshell, a language model is a set of attributes that define how"}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = TFGPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
In their model card about GPT-2, OpenAI wrote:
The primary intended users of these models are AI researchers and practitioners.
We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.
In their model card about GPT-2, OpenAI wrote:
Here are some secondary use cases we believe are likely:
- Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
- Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
- Entertainment: Creation of games, chat bots, and amusing generations.
In their model card about GPT-2, OpenAI wrote:
Because large-scale language models like GPT-2 ...
https://www.marketreportanalytics.com/privacy-policy
The Code Training Model Generation Software market is experiencing rapid growth, driven by the increasing demand for efficient and accurate code generation. The market, estimated at $2 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This robust expansion is fueled by several key factors. Firstly, the rising complexity of software development necessitates tools that automate repetitive tasks and accelerate the coding process. Secondly, the growing adoption of AI and machine learning across various industries is creating a significant demand for code generation solutions that can handle increasingly sophisticated algorithms and data structures. Thirdly, the availability of large, publicly accessible datasets for training these models is further fueling innovation and market expansion.

The market is segmented by application (enterprise and personal use) and deployment type (cloud-based and on-premises), with cloud-based solutions gaining significant traction due to their scalability and cost-effectiveness. Leading players like OpenAI, GitHub, and others are driving innovation and competition, fostering the development of more powerful and user-friendly tools.

The geographical distribution of the market shows strong growth across North America and Europe, fueled by a high concentration of technology companies and a mature software development ecosystem. Asia Pacific is also witnessing substantial growth, driven by a rapidly expanding tech sector and increasing digital adoption. However, market penetration in regions like the Middle East and Africa remains relatively low, presenting significant future growth opportunities. While the market faces challenges like data security concerns and the need for continuous model training and updates, the overall outlook remains positive, with significant potential for further expansion driven by ongoing advancements in AI and machine learning technologies. The growing adoption of DevOps methodologies and the need for faster software development cycles are further solidifying the long-term growth trajectory of the code training model generation software market.
https://choosealicense.com/licenses/other/
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.
In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark, ClassEval, of 100 class-level Python code generation tasks, with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we have the following main findings. First, we find that all existing LLMs show much worse performance on class-level code generation than on standalone method-level code generation benchmarks like HumanEval, and that method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, we find that GPT-4 and GPT-3.5 still exhibit dominant superiority over the other LLMs on class-level code generation, and the second-tier models include Instruct-Starcoder, Instruct-Codegen, and WizardCoder, with very similar performance. Third, we find that generating the entire class all at once (i.e. the holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e. incremental and compositional) is the better strategy for the other models, which have limited ability to understand long instructions and utilize intermediate information. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.
The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented in a separate table (markup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to the semantic class and subclass (vertices.csv).
Snippets information (code_blocks.csv) can be mapped to kernels metadata via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data, kernels_meta.csv includes only notebooks with an available Kaggle score.
Automatic classifications of code blocks are stored in data_with_preds.csv. The mapping of this table to code_blocks.csv can be done through the code_blocks_index column, which corresponds to code_blocks indices.
The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural descriptions of the competitions are retrieved with the help of an LLM.
kernels_meta2.csv may contain kernels without a Kaggle score but with a leaderboard position (rank).
The Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.
https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt
Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so this benchmark can not only test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state-of-the-art results and other properties of the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers often face the tedious task of upgrading their codebase to new programming language versions. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting their applicability in automating code upgrade. However, there exists no benchmark for evaluating the code upgrade ability of LLMs, as distilling code changes related to programming language evolution from real-world software repositories’ commit histories is a complex challenge.
In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade, focusing on the code changes related to the evolution of Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7–23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features; and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build configurations. Our proposed dataset provides high-quality samples by filtering irrelevant and noisy changes and verifying the compilability of upgraded code. Moreover, CoUpJava reveals diversity in code upgrade scenarios, ranging from small, fine-grained refactorings to large-scale repository modifications.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset was taken from the GitHub repository. It is made public by Databricks for research and commercial use cases. Originally the repository provides a JSONL file, which was used to create the CSV file included in this dataset.
Blog post: Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM
databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation
Languages: English
Version: 1.0
Owner: Databricks, Inc.
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.
Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.
For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]) which we recommend users remove for downstream applications.
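One lightweight way to do that removal is a small regular expression over the context field. This is a generic sketch following the [42]-style example above, not part of the dataset's tooling.

```python
# Strip bracketed Wikipedia citation numbers such as [42] from a context string.
import re

def strip_citations(context: str) -> str:
    return re.sub(r"\[\d+\]", "", context)

print(strip_citations("Paris is the capital of France.[42] It lies on the Seine.[7]"))
# -> "Paris is the capital of France. It lies on the Seine."
```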
While immediately valuable for instruction fine-tuning of large language models, as a corpus of human-generated instruction prompts this dataset also presents a valuable opportunity for synthetic data generation using the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.
Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short responses, with the resulting text associated to the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.
As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.
To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous co...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure

| Column | Description |
|--------|-------------|
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |

Table 2. kernels_meta.csv structure

| Column | Description |
|--------|-------------|
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |

Table 3. competitions_meta.csv structure

| Column | Description |
|--------|-------------|
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |

Table 4. markup_data.csv structure

| Column | Description |
|--------|-------------|
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example:
- code_blocks.csv can be joined with kernels_meta.csv via the kernel_id column.
- kernels_meta.csv is linked to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
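As a quick illustration of these joins, the following pandas sketch merges the tables along the keys described above. The column names follow Tables 1–3; the file paths are assumptions (the unpacked CSVs in the working directory).

```python
# Join the Code4ML tables along the keys described above.
# Paths are assumptions; column names follow Tables 1-3.
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

# code snippets -> notebook metadata -> competition metadata
snippets = code_blocks.merge(kernels_meta, on="kernel_id", how="left")
snippets = snippets.merge(competitions_meta, on="comp_name", how="left")
print(snippets[["code_block", "kaggle_score", "comp_name"]].head())
```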
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural descriptions of the competitions are retrieved with the help of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling the training and evaluation of models in areas such as code generation, code understanding, and natural language processing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Replication Package contains data and results of evaluating the performance and complexity of Large Language Models (LLMs) in solving programming challenges in LeetCode. It was developed with the paper "Analyzing Prominent LLMs: An Empirical Study on Solving LeetCode Problems," submitted to the 29th International Conference on Evaluation and Assessment in Software Engineering (2025). The dataset includes prompt templates, problem IDs, model-generated code solutions, and a spreadsheet with the raw data. Further details about the processed data, visualizations, and scripts will be provided with the final version of the paper.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MultiPL-E
Dataset Summary
MultiPL-E is a dataset for evaluating large language models for code generation that supports 22 programming languages. It takes the OpenAI HumanEval and the Mostly Basic Python Programs (MBPP) benchmarks and uses little compilers to translate them to other languages. It is easy to add support for new languages and benchmarks. The dataset is divided into several configurations named SRCDATA-LANG, where SRCDATA is either… See the full description on the dataset page: https://huggingface.co/datasets/nuprl-staging/MultiPL-E.
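To load a single configuration with the Hugging Face datasets library, something like the sketch below should work. The config name "humaneval-lua" is an assumed example of the SRCDATA-LANG naming scheme described above, and the field and split names are assumptions based on common usage rather than details stated on this card.

```python
# Load one MultiPL-E configuration, named SRCDATA-LANG as described above.
# "humaneval-lua" and the "prompt" field are assumed examples, not confirmed by the card.
from datasets import load_dataset

multipl_e = load_dataset("nuprl-staging/MultiPL-E", "humaneval-lua", split="test")
print(multipl_e[0]["prompt"])
```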
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
[EMNLP2024] DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
DA-Code is a comprehensive evaluation dataset designed to assess the data analysis and code generation capabilities of LLMs in agent-based data science tasks. Our paper and experiment reports have been published on arXiv.
Dataset Overview
500 complex real-world data analysis tasks across Data Wrangling (DW), Machine Learning (ML), and Exploratory Data Analysis (EDA). Tasks cover… See the full description on the dataset page: https://huggingface.co/datasets/Jianwen2003/DA-Code.
The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS benchmark attempts to mirror how human programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions. The problems range in difficulty from introductory to collegiate competition level and measure coding ability as well as problem-solving.
The Automated Programming Progress Standard, abbreviated APPS, consists of 10,000 coding problems in total, with 131,836 test cases for checking solutions and 232,444 ground-truth solutions written by humans. Problems can be complicated, as the average length of a problem is 293.2 words. The data are split evenly into training and test sets, with 5,000 problems each. In the test set, every problem has multiple test cases, and the average number of test cases is 21.2. Each test case is specifically designed for the corresponding problem, enabling us to rigorously evaluate program functionality.
CMU CoNaLa, the Code/Natural Language Challenge dataset, is a joint project of the Carnegie Mellon University NeuLab and Strudel labs. Its purpose is to test the generation of code snippets from natural language. The data comes from StackOverflow questions. There are 2,379 training and 500 test examples that were manually annotated. Every example has a natural language intent and its corresponding Python snippet. In addition to the manually annotated dataset, there are also 598,237 mined intent-snippet pairs. These examples are similar to the hand-annotated ones except that they include a probability that the pair is valid.