100+ datasets found
  1. the-stack-v2

    • huggingface.co
    Updated Mar 1, 2024
    + more versions
    Cite
    BigCode (2024). the-stack-v2 [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-v2
    Explore at:
    Dataset updated
    Mar 1, 2024
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    The Stack v2

    The dataset consists of 4 versions:

    bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
    bigcode/the-stack-v2-dedup: based on the bigcode/the-stack-v2 but further near-deduplicated
    bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
    bigcode/the-stack-v2-train-smol-ids: based on the…

    See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.

  2. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    + more versions
    Cite
    BigCode (2022). the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog
    

    Release Description

    v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.

    v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.

  3. dataset-the-stack-v2-dedup-sub

    • huggingface.co
    Cite
    TempestTeam, dataset-the-stack-v2-dedup-sub [Dataset]. https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub
    Explore at:
    Dataset authored and provided by
    TempestTeam
    License

    https://choosealicense.com/licenses/other/

    Description

    The Stack v2 Subset with File Contents (Python, Java, JavaScript, C, C++)

    TempestTeam/dataset-the-stack-v2-dedup-sub

      Dataset Summary
    

    This dataset is a language-filtered and self-contained subset of bigcode/the-stack-v2-dedup, part of the BigCode Project. It contains only files written in the following programming languages:

    Python 🐍, Java ☕, JavaScript 📜, C ⚙️, C++ ⚙️

    Unlike the original dataset, which only includes metadata and Software Heritage IDs, this subset includes… See the full description on the dataset page: https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub.

  4. the-stack-metadata

    • huggingface.co
    Updated Apr 16, 2023
    Cite
    BigCode (2023). the-stack-metadata [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 16, 2023
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack Metadata

      Changelog
    

    Release Description

    v1.1 This is the first release of the metadata. It is for The Stack v1.1

    v1.2 Metadata dataset matching The Stack v1.2

      Dataset Summary
    

    This is a set of additional information for the repositories used in The Stack. It contains file paths, detected licenses, and some other information about the repositories.

      Supported Tasks and Leaderboards
    

    The main task is to recreate… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-metadata.

  5. the-stack-v2-java

    • huggingface.co
    + more versions
    Cite
    ShangyiGeng, the-stack-v2-java [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-java
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    ShangyiGeng
    Description

    Reset23/the-stack-v2-java dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. the-stack-v2-python

    • huggingface.co
    + more versions
    Cite
    ShangyiGeng, the-stack-v2-python [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    ShangyiGeng
    Description

    Reset23/the-stack-v2-python dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. the-stack-v2-train-smol-ids-updated

    • huggingface.co
    Updated Sep 19, 2025
    Cite
    George Grigorev (2025). the-stack-v2-train-smol-ids-updated [Dataset]. https://huggingface.co/datasets/thepowerfuldeez/the-stack-v2-train-smol-ids-updated
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    George Grigorev
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    An update of The Stack V2 dataset: https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids. All repos from the original dataset were parsed with the GitHub API and re-downloaded, so their later updates are kept and the metadata is refreshed. Processing took 10+ days due to GraphQL limits.

      Filtering rules
    

    Removed repos with no update in the last 6 years (no updates since September 2019)
    Removed files with a single line
    Removed repos with a single file
    Removed repos with more than 99%…

    See the full description on the dataset page: https://huggingface.co/datasets/thepowerfuldeez/the-stack-v2-train-smol-ids-updated.
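As a rough illustration only, the three fully stated filtering rules could be sketched in Python as below. The `Repo` class, its field names, and the order in which the rules run are assumptions made for this sketch, not the dataset author's code; the fourth rule is truncated in the listing and is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class Repo:
    # Hypothetical container; the real dataset schema may differ.
    name: str
    last_update: datetime   # time of the most recent update
    files: Dict[str, str]   # file path -> file content


# Cutoff from the description: no updates since September 2019.
CUTOFF = datetime(2019, 9, 1, tzinfo=timezone.utc)


def filter_repo(repo: Repo) -> Optional[Repo]:
    """Return the filtered repo, or None if the repo is removed."""
    # Rule 1: remove repos with no update since the cutoff.
    if repo.last_update < CUTOFF:
        return None
    # Rule 2: remove files with a single line (or no lines at all).
    kept = {path: text for path, text in repo.files.items()
            if len(text.splitlines()) > 1}
    # Rule 3: remove repos left with a single file.
    # Assumption: this check runs after the per-file filter.
    if len(kept) <= 1:
        return None
    return Repo(repo.name, repo.last_update, kept)
```

Calling `filter_repo` on each repository and discarding the `None` results would yield the filtered collection under these assumptions.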

  8. the-stack-llm-annotations-v2

    • huggingface.co
    Updated Nov 21, 2024
    + more versions
    Cite
    devngho (2024). the-stack-llm-annotations-v2 [Dataset]. https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2
    Explore at:
    Dataset updated
    Nov 21, 2024
    Authors
    devngho
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    ์ด ๋ฐ์ดํ„ฐ์…‹์€ fineweb-edu์˜ ๋ฐฉ๋ฒ•์„ ์—ฌ๋Ÿฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด ๋งŒ๋“ค์–ด์ง„ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค. ๊ธฐ์กด์— ์กด์žฌํ•˜๋˜ HuggingFaceTB/smollm-corpus์˜ Python-edu๋Š” Python์œผ๋กœ๋งŒ ํ•œ์ •๋˜์–ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ์…‹์€ bigcode/the-stack-dedup์—์„œ 21๊ฐœ์˜ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด์—์„œ ๊ฐ๊ฐ 30k ์ƒ˜ํ”Œ์„ ์ถ”์ถœํ•ด ํ‰๊ฐ€ํ•ด ์—ฌ๋Ÿฌ ์–ธ์–ด์— ๋Œ€์‘ํ•ฉ๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ๋Š” devngho/the-stack-mini-nonshuffled์˜ MIT, Apache 2.0, BSD 2-clause, BSD 3-clause ๋ผ์ด์„ ์Šค์ธ ์ฒซ 30k ์ƒ˜ํ”Œ์ด ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. devngho/the_stack_llm_annotations์™€ ์œ ์‚ฌํ•˜๋‚˜, ํ‰๊ฐ€์— Qwen2.5-32B ๋Œ€์‹  Qwen2.5-Coder-32B๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. This synthetic dataset was created to apply the methods ofโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2.

  9. bowls-stack-2-test

    • huggingface.co
    Updated Jul 27, 2025
    + more versions
    Cite
    Mohamed Alaoui Mhamdi (2025). bowls-stack-2-test [Dataset]. https://huggingface.co/datasets/Mohamedal/bowls-stack-2-test
    Explore at:
    Dataset updated
    Jul 27, 2025
    Authors
    Mohamed Alaoui Mhamdi
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset was created using LeRobot.

      Dataset Structure
    

    meta/info.json: { "codebase_version": "v2.1", "robot_type": "Franka", "total_episodes": 6, "total_frames": 3130, "total_tasks": 1, "total_videos": 12, "total_chunks": 1, "chunks_size": 1000, "fps": 10, "splits": { "train": "0:6" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/Mohamedal/bowls-stack-2-test.
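The `data_path` value in this metadata is a Python-style format template. A minimal sketch of how it might expand into per-episode file names follows; the rule that an episode's chunk is its index integer-divided by `chunks_size` is an assumption of this sketch, and `episode_file` is a hypothetical helper name.

```python
# Template and chunk size copied from the meta/info.json excerpt.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
CHUNKS_SIZE = 1000


def episode_file(episode_index: int, chunks_size: int = CHUNKS_SIZE) -> str:
    """Build the relative parquet path for one episode."""
    # Assumption: episodes are grouped into chunks of `chunks_size` by index.
    episode_chunk = episode_index // chunks_size
    return DATA_PATH.format(episode_chunk=episode_chunk,
                            episode_index=episode_index)


# The "train" split "0:6" is read here as episodes 0 through 5.
train_paths = [episode_file(i) for i in range(0, 6)]
```

With six episodes and a chunk size of 1000, every path in this sketch falls under `data/chunk-000/`, which is consistent with `"total_chunks": 1`.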

  10. algebraic-stack

    • huggingface.co
    Cite
    sparverius, algebraic-stack [Dataset]. https://huggingface.co/datasets/typeof/algebraic-stack
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    sparverius
    Description

    NOTE: Please see EleutherAI/proof-pile-2

    This is a cherry-picked repackaging of the algebraic-stack segment from the proof-pile-2 dataset as parquet files

      License
    

    see EleutherAI/proof-pile-2

      Citation
    

    see EleutherAI/proof-pile-2

  11. stackv2_edu_filtered

    • huggingface.co
    + more versions
    Cite
    Common Pile, stackv2_edu_filtered [Dataset]. https://huggingface.co/datasets/common-pile/stackv2_edu_filtered
    Explore at:
    Dataset authored and provided by
    Common Pile
    Description

    Stack V2 Edu

      Description
    

    We filter the Stack V2 to only include code from openly licensed repositories, based on the license detection performed by the creators of Stack V2. When multiple licenses are detected in a single repository, we ensure that all of the licenses are on the Blue Oak Council certified license list. Per-document license information is available in the license entry of the metadata field of each example. Code for collecting, processing, and preparing… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/stackv2_edu_filtered.
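The "all detected licenses must be on the list" rule can be illustrated with a short Python sketch. The `ALLOWED` set is a tiny hypothetical stand-in rather than the actual Blue Oak Council list, and rejecting repositories with no detected license is an assumption of this sketch.

```python
from typing import Iterable, Set

# Hypothetical stand-in allowlist; the real list is much longer.
ALLOWED: Set[str] = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}


def repo_is_openly_licensed(detected: Iterable[str],
                            allowed: Set[str] = ALLOWED) -> bool:
    """True only if at least one license was detected and all are allowed."""
    licenses = list(detected)
    # Assumption: a repo with no detected license is not kept.
    return bool(licenses) and all(lic in allowed for lic in licenses)
```

A repository with both `MIT` and a license outside the set would be rejected here, matching the stated requirement that every detected license be on the list.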

  12. the-stack-v2-filtered-c

    • huggingface.co
    + more versions
    Cite
    ShangyiGeng, the-stack-v2-filtered-c [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-filtered-c
    Explore at:
    Authors
    ShangyiGeng
    Description

    Reset23/the-stack-v2-filtered-c dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. the-stack-v2-blamed2

    • huggingface.co
    + more versions
    Cite
    ShangyiGeng, the-stack-v2-blamed2 [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-blamed2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    ShangyiGeng
    Description

    Reset23/the-stack-v2-blamed2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. the-stack-v2-new-c

    • huggingface.co
    + more versions
    Cite
    ShangyiGeng, the-stack-v2-new-c [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-new-c
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    ShangyiGeng
    Description

    Reset23/the-stack-v2-new-c dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. genesis-stack-cube-2

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Jade Choghari (2025). genesis-stack-cube-2 [Dataset]. https://huggingface.co/datasets/jadechoghari/genesis-stack-cube-2
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Jade Choghari
    Description

    jadechoghari/genesis-stack-cube-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. the-stack-v2-dedup-Python_10k_01

    • huggingface.co
    + more versions
    Cite
    Yuxiao Qu, the-stack-v2-dedup-Python_10k_01 [Dataset]. https://huggingface.co/datasets/CohenQu/the-stack-v2-dedup-Python_10k_01
    Explore at:
    Authors
    Yuxiao Qu
    Description

    CohenQu/the-stack-v2-dedup-Python_10k_01 dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. algebraic-stack-small

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    issei fujimoto (2024). algebraic-stack-small [Dataset]. https://huggingface.co/datasets/if001/algebraic-stack-small
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2024
    Authors
    issei fujimoto
    Description

    A dataset randomly sampled from the algebraic-stack portion of proof-pile-2: https://huggingface.co/datasets/EleutherAI/proof-pile-2. License: see EleutherAI/proof-pile-2.

  18. the-stack-v2-extra-python-content

    • huggingface.co
    Updated Sep 19, 2025
    Cite
    George Grigorev (2025). the-stack-v2-extra-python-content [Dataset]. https://huggingface.co/datasets/thepowerfuldeez/the-stack-v2-extra-python-content
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    George Grigorev
    Description

    580M tokens

  19. the-stack-v2-blamed-python

    • huggingface.co
    Cite
    ShangyiGeng, the-stack-v2-blamed-python [Dataset]. https://huggingface.co/datasets/Reset23/the-stack-v2-blamed-python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    ShangyiGeng
    Description

    Reset23/the-stack-v2-blamed-python dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. the-stack-v2-r-code

    • huggingface.co
    Updated Oct 10, 2025
    Cite
    Ahmad Zaenal (2025). the-stack-v2-r-code [Dataset]. https://huggingface.co/datasets/zaenalium/the-stack-v2-r-code
    Explore at:
    Dataset updated
    Oct 10, 2025
    Authors
    Ahmad Zaenal
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Only the R code from https://huggingface.co/datasets/bigcode/the-stack-v2, with file contents downloaded and ready to use.
