4 datasets found
  1. rStar-Coder

    • huggingface.co
    Updated Jul 15, 2025
    Microsoft (2025). rStar-Coder [Dataset]. https://huggingface.co/datasets/microsoft/rStar-Coder
    Dataset updated
    Jul 15, 2025
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    rStar-Coder Dataset

    Project GitHub | Paper

      Dataset Description
    

    rStar-Coder is a large-scale competitive code problem dataset containing 418K programming problems, 580K long-reasoning solutions, and rich test cases of varying difficulty levels. This dataset aims to enhance code reasoning capabilities in large language models, particularly in handling competitive code problems. Experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/rStar-Coder.

  2. rStar-Coder-seed-test

    • huggingface.co
    Updated Jul 30, 2025
    chiruan (2025). rStar-Coder-seed-test [Dataset]. https://huggingface.co/datasets/chiruan/rStar-Coder-seed-test
    Dataset updated
    Jul 30, 2025
    Authors
    chiruan
    Description

    chiruan/rStar-Coder-seed-test dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. cleand_microsoft_rStar-Coder

    • huggingface.co
    LLMTeamAkiyama, cleand_microsoft_rStar-Coder [Dataset]. https://huggingface.co/datasets/LLMTeamAkiyama/cleand_microsoft_rStar-Coder
    Dataset authored and provided by
    LLMTeamAkiyama
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

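    Source data: https://huggingface.co/datasets/microsoft/rStar-Coder
    Record count: 269,863
    Average token count: 11,674
    Maximum token count: 31,184
    Total token count: 3,150,447,484
    File format: JSONL
    File size: unknown

    Processing

    Uses the synthetic_sft subset.
    Tokenization is slow, so records are filtered by character count: seed_question < 6000, generation < 80000.
    Think tags removed; records with incomplete think tags excluded.
    Tokenization (updated for speed).
    Repetition removal.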
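The character-count filter and think-tag cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the uploader's actual script. It assumes each record is a dict with `seed_question` and `generation` keys, that the tags have the form `<think>...</think>`, and that the length check runs before tag removal. The function names are hypothetical.

```python
import re

# Thresholds taken from the description; they are character counts, not tokens.
MAX_QUESTION_CHARS = 6000
MAX_GENERATION_CHARS = 80000

# Assumption: reasoning is wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def clean_record(record):
    """Return a cleaned copy of the record, or None if it should be dropped."""
    question = record["seed_question"]
    generation = record["generation"]

    # Cheap length filter first, so later steps see fewer records.
    if len(question) >= MAX_QUESTION_CHARS or len(generation) >= MAX_GENERATION_CHARS:
        return None

    # Remove complete think blocks.
    stripped = THINK_BLOCK.sub("", generation)

    # A leftover tag means the block was unbalanced (incomplete); drop the record.
    if "<think>" in stripped or "</think>" in stripped:
        return None

    return {**record, "generation": stripped.strip()}


def clean_records(records):
    """Apply clean_record to every record and keep the survivors."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]
```

Filtering on character length before any tokenization keeps the expensive step off records that would be discarded anyway, which matches the motivation given in the description. The repetition-removal and tokenization steps are not shown because the description does not say how they were done.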

  4. qwen3-coder-480b-distill-mini

    • huggingface.co
    Updated Aug 18, 2003
    JIRONG (2003). qwen3-coder-480b-distill-mini [Dataset]. https://huggingface.co/datasets/Jackrong/qwen3-coder-480b-distill-mini
    Dataset updated
    Aug 18, 2003
    Authors
    JIRONG
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    qwen3-coder-480b-distill-mini

      Short Description
    

    This dataset is distilled using Qwen3-Coder-480B-A35B-Instruct. We extracted 10,000 code questions from microsoft/rStar-Coder as seed problems, distilled them with 32K context, and after cleaning and filtering, 9,543 samples remain. License: Apache-2.0.

      Dataset Overview
    

    Seed Source: 10,000 code reasoning problems sampled from microsoft/rStar-Coder.
    Distillation Model: Qwen3-Coder-480B-A35B-Instruct (480B… See the full description on the dataset page: https://huggingface.co/datasets/Jackrong/qwen3-coder-480b-distill-mini.


microsoft/rStar-Coder

13 scholarly articles cite this dataset (View in Google Scholar)