https://choosealicense.com/licenses/other/
The Stack v2
The dataset consists of 4 versions:
bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2 but further near-deduplicated
bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
bigcode/the-stack-v2-train-smol-ids: based on the…
See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
Release Description
v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 1.5TB in size.
v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-dedup.
The Stack contains over 3TB of permissively-licensed source code files covering 30 programming languages crawled from GitHub. The dataset was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs).
Reset23/the-stack-v2-java dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about companies in Dearborn. It has 129 rows. It features 2 columns including stack.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about companies. It has 3,456,808 rows. It features 2 columns including stack. It is 81% filled with non-null values.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about companies in Brasília. It has 215 rows. It features 2 columns including stack.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about companies in Mosta. It has 1 row. It features 2 columns including stack.
Reset23/the-stack-v2-filtered2-cpp dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about companies in Bubikon. It has 10 rows. It features 2 columns including stack.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about companies in Derby. It has 391 rows. It features 2 columns including stack.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dicomized distinct proprietary files of microscope imaging modalities
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dicomized distinct proprietary files of microscope imaging modalities
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dicomized distinct proprietary files of microscope imaging modalities
Involves data where a robot interacts with 5.1 cm colored blocks to complete an order-fulfillment style block stacking task. It contains dynamic scenes and real time-series data in a less constrained environment than comparable datasets. There are nearly 12,000 stacking attempts and over 2 million frames of real data.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Every year, Stack Overflow conducts a large survey of people on the site, covering topics such as programming languages, salary, and code style. This year, they amassed more than 64,000 responses from 213 countries.
Data
The data is made up of two files:
1. survey_results_public.csv - CSV file with the main survey results, one respondent per row and one column per answer
2. survey_results_schema.csv - CSV file with the survey schema, i.e., the questions that correspond to each column name
Acknowledgements
The data is taken directly from Stack Overflow and licensed under the ODbL license.
Surveys
internet,Information Technology,coding
51248
Free
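The two Stack Overflow survey files described above are meant to be read together: the schema file says which question each column of the results file answers. The following is a minimal sketch of that pairing, not the publisher's own tooling. It assumes the schema file has `Column` and `Question` headers and that unanswered cells are empty or `NA`; the tiny in-memory CSV strings and their question texts are made-up stand-ins for the real files.

```python
import csv
import io


def load_schema(schema_file):
    """Map each results column name to its survey question text.

    Assumes the schema CSV has 'Column' and 'Question' headers.
    """
    return {row["Column"]: row["Question"] for row in csv.DictReader(schema_file)}


def describe_respondent(row, schema):
    """Return {question: answer} for one respondent, skipping blanks.

    Columns missing from the schema keep their raw column name.
    'NA' as the missing-value marker is an assumption.
    """
    return {
        schema.get(column, column): answer
        for column, answer in row.items()
        if answer not in ("", "NA")
    }


# Hypothetical stand-ins for survey_results_schema.csv and
# survey_results_public.csv (the real files are much larger).
SCHEMA_CSV = (
    "Column,Question\n"
    "Country,In which country do you live?\n"
    "Salary,What is your salary?\n"
)
RESULTS_CSV = (
    "Respondent,Country,Salary\n"
    "1,Canada,NA\n"
    "2,Brazil,50000\n"
)

schema = load_schema(io.StringIO(SCHEMA_CSV))
rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))
first = describe_respondent(rows[0], schema)
second = describe_respondent(rows[1], schema)
```

With the real files, the `io.StringIO(...)` arguments would be replaced by file handles opened with `open(path, newline="", encoding="utf-8")`.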
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about companies in China. It has 32,433 rows. It features 2 columns including stack.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Software engineering Q&A websites (e.g., Stack Overflow) harness the collective expertise of users to address technical queries. Over time, these platforms evolve into valuable repositories of software engineering knowledge. Such repositories serve as essential resources for developers looking for solutions to common programming problems. On Stack Overflow, developers may approach answering questions in various ways. Gaining insight into how developers formulate their answers on Stack Overflow can enhance knowledge sharing and streamline the process of finding solutions. Furthermore, such insights could also inform improvements in Generative Artificial Intelligence (GenAI) tools so that generated source code is easier to comprehend, as AI-generated answers are known to include irrelevant information and hallucinations. In this study, we seek to deepen the understanding of how solutions are presented on Stack Overflow. We conducted an empirical study of programming questions that are answered with a Solution Snippet to understand how a Solution Snippet is presented and how it should be adapted when it is reused. Our study resulted in two categorizations: 1) eight categories of how Solution Snippets are presented in Stack Overflow answers and 2) five categories of how Solution Snippets could be adapted for reuse. We then analyzed these categorizations and discussed the implications. We anticipate that Stack Overflow will remain a valuable resource for the foreseeable future, and the insights revealed in our paper lay the groundwork for improving program comprehension of Solution Snippets on Stack Overflow and in GenAI tools.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is about: (Table 2) SPECMAP stack of stable oxygen isotopes covering the last 300 000 years. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.726602 for more information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dicomized distinct proprietary files of microscope imaging modalities