100+ datasets found

the-stack-v2
huggingface.co
Updated Mar 1, 2024
Cite
BigCode (2024). the-stack-v2 [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-v2
Dataset updated
Mar 1, 2024
Dataset authored and provided by
BigCode
License
https://choosealicense.com/licenses/other/
Description
The Stack v2

The dataset consists of 4 versions:

bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2 but further near-deduplicated
bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
bigcode/the-stack-v2-train-smol-ids: based on the… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.

the-stack
huggingface.co
opendatalab.com
Updated Oct 27, 2022
Cite
BigCode (2022). the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
Dataset updated
Oct 27, 2022
Dataset authored and provided by
BigCode
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for The Stack

Changelog

Release Description

v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.

v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.

dataset-the-stack-v2-dedup-sub
huggingface.co
Cite
TempestTeam, dataset-the-stack-v2-dedup-sub [Dataset]. https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub
Dataset authored and provided by
TempestTeam
License
https://choosealicense.com/licenses/other/
Description
The Stack v2 Subset with File Contents (Python, Java, JavaScript, C, C++)

TempestTeam/dataset-the-stack-v2-dedup-sub

Dataset Summary

This dataset is a language-filtered and self-contained subset of bigcode/the-stack-v2-dedup, part of the BigCode Project. It contains only files written in the following programming languages:

Python 🐍 Java ☕ JavaScript 📜 C ⚙️ C++ ⚙️

Unlike the original dataset, which only includes metadata and Software Heritage IDs, this subset includes… See the full description on the dataset page: https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub.

the-stack-metadata
huggingface.co
Updated Apr 16, 2023
Cite
BigCode (2023). the-stack-metadata [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-metadata
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 16, 2023
Dataset authored and provided by
BigCode
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for The Stack Metadata

Changelog

Release Description

v1.1 This is the first release of the metadata. It is for The Stack v1.1

v1.2 Metadata dataset matching The Stack v1.2

Dataset Summary

This is a set of additional information for the repositories used for The Stack. It contains file paths and detected licenses, as well as some other information about the repositories.

Supported Tasks and Leaderboards

The main task is to recreate… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-metadata.

the-stack-v2-java
huggingface.co
Cite
ShangyiGeng, the-stack-v2-java [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-java
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
ShangyiGeng
Description
Reset23/the-stack-v2-java dataset hosted on Hugging Face and contributed by the HF Datasets community

the-stack-v2-python
huggingface.co
Cite
ShangyiGeng, the-stack-v2-python [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-python
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
ShangyiGeng
Description
Reset23/the-stack-v2-python dataset hosted on Hugging Face and contributed by the HF Datasets community

the-stack-v2-train-smol-ids-updated
huggingface.co
Updated Sep 19, 2025
Cite
George Grigorev (2025). the-stack-v2-train-smol-ids-updated [Dataset]. https://huggingface.co/datasets/thepowerfuldeez/the-stack-v2-train-smol-ids-updated
Dataset updated
Sep 19, 2025
Authors
George Grigorev
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
An update of The Stack V2 dataset: https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids. All repos from the original dataset were looked up with the GitHub API and re-downloaded, so later changes to each repo are included and the metadata is updated. Processing took 10+ days due to GraphQL limits.

Filtering rules

Removed repos with no update in the last 6 years (no updates since September 2019)
Removed files with a single line
Removed repos with a single file
Removed repos with more than 99%… See the full description on the dataset page: https://huggingface.co/datasets/thepowerfuldeez/the-stack-v2-train-smol-ids-updated.
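The three filtering rules that are fully visible above could be sketched roughly as follows. This is only an illustration, not the dataset author's code: the record layout (`name`, `last_updated`, `files`), the exact cutoff day, and the order in which the rules are applied are all assumptions, and the truncated fourth rule is deliberately left out.

```python
from datetime import datetime, timezone

# Assumed cutoff: the listing only says "no updates since September 2019",
# so the first day of that month is a guess made for this sketch.
CUTOFF = datetime(2019, 9, 1, tzinfo=timezone.utc)


def filter_repos(repos):
    """Apply the three visible rules to hypothetical repo records.

    Each record is assumed to look like:
    {"name": str, "last_updated": datetime, "files": {path: content}}
    """
    kept = []
    for repo in repos:
        # Rule 1: drop repos with no update since the cutoff.
        if repo["last_updated"] < CUTOFF:
            continue
        # Rule 2: drop files that consist of a single line.
        files = {
            path: content
            for path, content in repo["files"].items()
            if len(content.splitlines()) > 1
        }
        # Rule 3: drop repos left with a single file (or none).
        # Assumption: this check runs after the per-file filtering.
        if len(files) <= 1:
            continue
        kept.append({**repo, "files": files})
    return kept
```

Checking the file count after the per-file rule is a design choice of this sketch; applying it before would keep repos that end up with one multi-line file.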

the-stack-llm-annotations-v2
huggingface.co
Updated Nov 21, 2024
Cite
devngho (2024). the-stack-llm-annotations-v2 [Dataset]. https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2
Dataset updated
Nov 21, 2024
Authors
devngho
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset

This is a synthetic dataset created to apply the fineweb-edu method to multiple programming languages. The existing Python-edu subset of HuggingFaceTB/smollm-corpus was limited to Python. This dataset covers multiple languages by sampling and scoring 30k samples from each of 21 programming languages in bigcode/the-stack-dedup. Specifically, it uses the first 30k samples of devngho/the-stack-mini-nonshuffled that are under the MIT, Apache 2.0, BSD 2-clause, or BSD 3-clause licenses. It is similar to devngho/the_stack_llm_annotations, but uses Qwen2.5-Coder-32B instead of Qwen2.5-32B for scoring. See the full description on the dataset page: https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2.

bowls-stack-2-test
huggingface.co
Updated Jul 27, 2025
Cite
Mohamed Alaoui Mhamdi (2025). bowls-stack-2-test [Dataset]. https://huggingface.co/datasets/Mohamedal/bowls-stack-2-test
Dataset updated
Jul 27, 2025
Authors
Mohamed Alaoui Mhamdi
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This dataset was created using LeRobot.

Dataset Structure

meta/info.json: { "codebase_version": "v2.1", "robot_type": "Franka", "total_episodes": 6, "total_frames": 3130, "total_tasks": 1, "total_videos": 12, "total_chunks": 1, "chunks_size": 1000, "fps": 10, "splits": { "train": "0:6" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/Mohamedal/bowls-stack-2-test.
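As a small illustration of how the `data_path` template in the metadata above expands, the sketch below formats it with Python's `str.format`. The rule that an episode's chunk is `episode_index // chunks_size` is an assumption made for this example and is not stated in the listing; `episode_file` is a hypothetical helper name.

```python
# Template and chunk size copied from the meta/info.json excerpt above.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
CHUNKS_SIZE = 1000


def episode_file(episode_index, chunks_size=CHUNKS_SIZE):
    """Return the parquet path for one episode.

    Assumption: episodes are assigned to chunks by integer division.
    """
    return DATA_PATH.format(
        episode_chunk=episode_index // chunks_size,
        episode_index=episode_index,
    )


# The "train" split is "0:6", read here as episodes 0 through 5.
train_paths = [episode_file(i) for i in range(6)]
# train_paths[5] == "data/chunk-000/episode_000005.parquet"
```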

algebraic-stack
huggingface.co
Cite
sparverius, algebraic-stack [Dataset]. https://huggingface.co/datasets/typeof/algebraic-stack
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
sparverius
Description
NOTE: Please see EleutherAI/proof-pile-2

This is a cherry-picked repackaging of the algebraic-stack segment from the proof-pile-2 dataset as parquet files

License

see EleutherAI/proof-pile-2

Citation

see EleutherAI/proof-pile-2

stackv2_edu_filtered
huggingface.co
Cite
Common Pile, stackv2_edu_filtered [Dataset]. https://huggingface.co/datasets/common-pile/stackv2_edu_filtered
Dataset authored and provided by
Common Pile
Description
Stack V2 Edu

Description

We filter the Stack V2 to only include code from openly licensed repositories, based on the license detection performed by the creators of Stack V2. When multiple licenses are detected in a single repository, we ensure that all of the licenses are on the Blue Oak Council certified license list. Per-document license information is available in the license entry of the metadata field of each example. Code for collecting, processing, and preparing… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/stackv2_edu_filtered.
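The multi-license rule described above (keep a repository only if every detected license is on an approved list) can be sketched as a subset check. This is a hedged illustration, not the Common Pile code: the `ALLOWED` set is a placeholder and not the actual Blue Oak Council list, and treating repositories with no detected license as excluded is an assumption.

```python
# Placeholder allow-list for illustration only; NOT the real
# Blue Oak Council certified license list.
ALLOWED = frozenset({"MIT", "Apache-2.0", "BSD-3-Clause"})


def repo_is_openly_licensed(detected_licenses, allowed=ALLOWED):
    """Return True only if every detected license is on the allow-list.

    Assumption: a repository with no detected license is rejected.
    """
    licenses = set(detected_licenses)
    return bool(licenses) and licenses <= allowed
```

For example, a repository with detected licenses `["MIT", "GPL-3.0"]` would be rejected under this sketch because one of its licenses is outside the placeholder list.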

the-stack-v2-filtered-c
huggingface.co
Cite
ShangyiGeng, the-stack-v2-filtered-c [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-filtered-c
Authors
ShangyiGeng
Description
Reset23/the-stack-v2-filtered-c dataset hosted on Hugging Face and contributed by the HF Datasets community

the-stack-v2-blamed2
huggingface.co
Cite
ShangyiGeng, the-stack-v2-blamed2 [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-blamed2
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
ShangyiGeng
Description
Reset23/the-stack-v2-blamed2 dataset hosted on Hugging Face and contributed by the HF Datasets community

the-stack-v2-new-c
huggingface.co
Cite
ShangyiGeng, the-stack-v2-new-c [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-new-c
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
ShangyiGeng
Description
Reset23/the-stack-v2-new-c dataset hosted on Hugging Face and contributed by the HF Datasets community

genesis-stack-cube-2
huggingface.co
Updated Jun 1, 2025
Cite
Jade Choghari (2025). genesis-stack-cube-2 [Dataset]. https://huggingface.co/datasets/jadechoghari/genesis-stack-cube-2
Dataset updated
Jun 1, 2025
Authors
Jade Choghari
Description
jadechoghari/genesis-stack-cube-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

the-stack-v2-dedup-Python_10k_01
huggingface.co
Cite
Yuxiao Qu, the-stack-v2-dedup-Python_10k_01 [Dataset]. https://huggingface.co/datasets/CohenQu/the-stack-v2-dedup-Python_10k_01
Authors
Yuxiao Qu
Description
CohenQu/the-stack-v2-dedup-Python_10k_01 dataset hosted on Hugging Face and contributed by the HF Datasets community

algebraic-stack-small
huggingface.co
Updated Sep 12, 2024
Cite
issei fujimoto (2024). algebraic-stack-small [Dataset]. https://huggingface.co/datasets/if001/algebraic-stack-small
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 12, 2024
Authors
issei fujimoto
Description
A dataset randomly sampled from the algebraic-stack segment of proof-pile-2: https://huggingface.co/datasets/EleutherAI/proof-pile-2. License: see EleutherAI/proof-pile-2

the-stack-v2-extra-python-content
huggingface.co
Updated Sep 19, 2025
Cite
George Grigorev (2025). the-stack-v2-extra-python-content [Dataset]. https://huggingface.co/datasets/thepowerfuldeez/the-stack-v2-extra-python-content
Dataset updated
Sep 19, 2025
Authors
George Grigorev
Description
580M tokens

the-stack-v2-blamed-python
huggingface.co
Cite
ShangyiGeng, the-stack-v2-blamed-python [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-blamed-python
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
ShangyiGeng
Description
Reset23/the-stack-v2-blamed-python dataset hosted on Hugging Face and contributed by the HF Datasets community

the-stack-v2-r-code
huggingface.co
Updated Oct 10, 2025
Cite
Ahmad Zaenal (2025). the-stack-v2-r-code [Dataset]. https://huggingface.co/datasets/zaenalium/the-stack-v2-r-code
Dataset updated
Oct 10, 2025
Authors
Ahmad Zaenal
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Only the R code from https://huggingface.co/datasets/bigcode/the-stack-v2, with file contents downloaded and ready to use.
