License: https://choosealicense.com/licenses/other/
The Stack v2
The dataset consists of 4 versions:
bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2 but further near-deduplicated
bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
bigcode/the-stack-v2-train-smol-ids: based on the…
See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.
License: https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
Release Description
v1.0 Initial release of The Stack. Included 30 programming languages and 18 permissive licenses. Note: three of the included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size.
v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
License: https://choosealicense.com/licenses/other/
The Stack v2 Subset with File Contents (Python, Java, JavaScript, C, C++)
TempestTeam/dataset-the-stack-v2-dedup-sub
Dataset Summary
This dataset is a language-filtered and self-contained subset of bigcode/the-stack-v2-dedup, part of the BigCode Project. It contains only files written in the following programming languages:
Python, Java, JavaScript, C, C++
Unlike the original dataset, which only includes metadata and Software Heritage IDs, this subset includes… See the full description on the dataset page: https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub.
License: https://choosealicense.com/licenses/other/
Dataset Card for The Stack Metadata
Changelog
Release Description
v1.1 This is the first release of the metadata. It is for The Stack v1.1
v1.2 Metadata dataset matching The Stack v1.2
Dataset Summary
This is a set of additional information about the repositories used for The Stack. It contains file paths and detected licenses, as well as some other information about the repositories.
Supported Tasks and Leaderboards
The main task is to recreate… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-metadata.
Reset23/the-stack-v2-java dataset hosted on Hugging Face and contributed by the HF Datasets community
Reset23/the-stack-v2-python dataset hosted on Hugging Face and contributed by the HF Datasets community
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
An update of The Stack v2 dataset: https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids. All repos from the original dataset were queried through the GitHub API and re-downloaded, so later changes to each repo are included and the metadata is refreshed. Processing took more than 10 days because of GraphQL limits.
Filtering rules
Removed repos with no update in the last 6 years (no updates since September 2019)
Removed files with a single line
Removed repos with a single file
Removed repos with more than 99%…
See the full description on the dataset page: https://huggingface.co/datasets/thepowerfuldeez/the-stack-v2-train-smol-ids-updated.
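The three fully stated rules above can be sketched in a few lines of Python. This is only an illustration, not the dataset author's pipeline: the repo dictionary layout, the `filter_repos` name, the exact cutoff date, and the order in which the rules run are all assumptions, and the truncated "more than 99%…" rule is left out because its condition is not given.

```python
from datetime import datetime, timezone

# Assumed cutoff for "no updates since September 2019".
CUTOFF = datetime(2019, 9, 1, tzinfo=timezone.utc)


def filter_repos(repos, cutoff=CUTOFF):
    """Apply the three stated rules to a list of repo dicts.

    Each repo is assumed (hypothetically) to look like:
    {"name": str, "updated_at": datetime, "files": {path: content}}
    """
    kept = []
    for repo in repos:
        # Rule 1: drop repos whose last update is older than the cutoff.
        if repo["updated_at"] < cutoff:
            continue
        # Rule 2: drop files with at most one line (empty files included).
        files = {
            path: content
            for path, content in repo["files"].items()
            if len(content.splitlines()) > 1
        }
        # Rule 3: drop repos left with at most one file.
        # Assumption: this check runs after the per-file rule.
        if len(files) <= 1:
            continue
        kept.append({**repo, "files": files})
    return kept
```

Running the per-file rule before the per-repo rule means a repo can be dropped because its other files were one-liners; the source does not say which order was used.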
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset
This is a synthetic dataset created to apply the fineweb-edu method to multiple programming languages. The existing Python-edu subset of HuggingFaceTB/smollm-corpus was limited to Python. This dataset covers multiple languages by extracting and scoring 30k samples from each of 21 programming languages in bigcode/the-stack-dedup. Specifically, it uses the first 30k samples of devngho/the-stack-mini-nonshuffled that are under the MIT, Apache 2.0, BSD 2-clause, or BSD 3-clause licenses. It is similar to devngho/the_stack_llm_annotations, but the scoring uses Qwen2.5-Coder-32B instead of Qwen2.5-32B. See the full description on the dataset page: https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created using LeRobot.
Dataset Structure
meta/info.json:
{
  "codebase_version": "v2.1",
  "robot_type": "Franka",
  "total_episodes": 6,
  "total_frames": 3130,
  "total_tasks": 1,
  "total_videos": 12,
  "total_chunks": 1,
  "chunks_size": 1000,
  "fps": 10,
  "splits": { "train": "0:6" },
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": …
See the full description on the dataset page: https://huggingface.co/datasets/Mohamedal/bowls-stack-2-test.
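The `data_path` entry is a Python format string, so the per-episode file names can be reproduced with `str.format`. The sketch below copies the values shown in `meta/info.json`; the `episode_data_path` helper is hypothetical, and deriving the chunk number as `episode_index // chunks_size` is an assumption about how the chunks are assigned, not something the excerpt states.

```python
# Values copied from the meta/info.json excerpt.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
CHUNKS_SIZE = 1000
TOTAL_EPISODES = 6
TOTAL_FRAMES = 3130
FPS = 10


def episode_data_path(episode_index, chunks_size=CHUNKS_SIZE):
    """Fill in the data_path template for one episode (hypothetical helper)."""
    # Assumption: episodes are grouped into consecutive chunks of chunks_size.
    episode_chunk = episode_index // chunks_size
    return DATA_PATH.format(episode_chunk=episode_chunk, episode_index=episode_index)


# Paths for the six episodes of the "train" split ("0:6").
paths = [episode_data_path(i) for i in range(TOTAL_EPISODES)]

# Total recording length implied by the frame count and frame rate, in seconds.
duration_s = TOTAL_FRAMES / FPS
```

With 6 episodes and a chunk size of 1000, every episode lands in chunk 000, which agrees with `"total_chunks": 1`.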
NOTE: Please see EleutherAI/proof-pile-2
This is a cherry-picked repackaging of the algebraic-stack segment of the proof-pile-2 dataset as parquet files.
License
see EleutherAI/proof-pile-2
Citation
see EleutherAI/proof-pile-2
Stack V2 Edu
Description
We filter the Stack V2 to only include code from openly licensed repositories, based on the license detection performed by the creators of Stack V2. When multiple licenses are detected in a single repository, we ensure that all of the licenses are on the Blue Oak Council certified license list. Per-document license information is available in the license entry of the metadata field of each example. Code for collecting, processing, and preparing… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/stackv2_edu_filtered.
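The "all detected licenses must be on the list" rule amounts to a subset check. The following is a minimal sketch under stated assumptions: `ALLOWED` is a small placeholder set and not the actual Blue Oak Council list, `repo_is_openly_licensed` is a hypothetical name, and rejecting repositories with no detected license is an assumption the description does not confirm.

```python
# Placeholder allow-list for illustration only; the real certified list is
# longer and should be taken from the Blue Oak Council directly.
ALLOWED = frozenset({"MIT", "Apache-2.0", "BSD-3-Clause"})


def repo_is_openly_licensed(detected_licenses, allowed=ALLOWED):
    """Return True only if every detected license is on the allow-list."""
    detected = set(detected_licenses)
    # Assumption: a repo with no detected license is not kept.
    if not detected:
        return False
    return detected <= set(allowed)


# Example: one disallowed license is enough to exclude the repository.
examples = {
    "single": repo_is_openly_licensed(["MIT"]),
    "mixed_ok": repo_is_openly_licensed(["MIT", "Apache-2.0"]),
    "mixed_bad": repo_is_openly_licensed(["MIT", "GPL-3.0-only"]),
}
```

A subset test, rather than an "any license matches" test, reflects the stated requirement that all licenses in a multi-license repository be on the list.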
Reset23/the-stack-v2-filtered-c dataset hosted on Hugging Face and contributed by the HF Datasets community
Reset23/the-stack-v2-blamed2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Reset23/the-stack-v2-new-c dataset hosted on Hugging Face and contributed by the HF Datasets community
jadechoghari/genesis-stack-cube-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
CohenQu/the-stack-v2-dedup-Python_10k_01 dataset hosted on Hugging Face and contributed by the HF Datasets community
A dataset randomly sampled from the algebraic-stack segment of proof-pile-2: https://huggingface.co/datasets/EleutherAI/proof-pile-2
License
see EleutherAI/proof-pile-2
Reset23/the-stack-v2-blamed-python dataset hosted on Hugging Face and contributed by the HF Datasets community
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Only the R code from https://huggingface.co/datasets/bigcode/the-stack-v2, with file contents downloaded and ready to use.