4 datasets found

rStar-Coder
huggingface.co
Updated Jul 15, 2025
Cite
Microsoft (2025). rStar-Coder [Dataset]. https://huggingface.co/datasets/microsoft/rStar-Coder
Dataset updated
Jul 15, 2025
Dataset authored and provided by
Microsoft (http://microsoft.com/)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
rStar-Coder Dataset

Project GitHub | Paper

Dataset Description

rStar-Coder is a large-scale competitive code problem dataset containing 418K programming problems, 580K long-reasoning solutions, and rich test cases of varying difficulty levels. This dataset aims to enhance code reasoning capabilities in large language models, particularly in handling competitive code problems. Experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/rStar-Coder.
rStar-Coder-seed-test
huggingface.co
Updated Jul 30, 2025
Cite
chiruan (2025). rStar-Coder-seed-test [Dataset]. https://huggingface.co/datasets/chiruan/rStar-Coder-seed-test
Dataset updated
Jul 30, 2025
Authors
chiruan
Description
chiruan/rStar-Coder-seed-test dataset hosted on Hugging Face and contributed by the HF Datasets community
cleand_microsoft_rStar-Coder
huggingface.co
Cite
LLMTeamAkiyama, cleand_microsoft_rStar-Coder [Dataset]. https://huggingface.co/datasets/LLMTeamAkiyama/cleand_microsoft_rStar-Coder
Dataset authored and provided by
LLMTeamAkiyama
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
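Source data: https://huggingface.co/datasets/microsoft/rStar-Coder
Number of records: 269,863
Average token count: 11,674
Maximum token count: 31,184
Total token count: 3,150,447,484
File format: JSONL
File size: unknown

Processing

Uses the synthetic_sft subset.
Because token-level processing is slow, records are filtered by character count: seed_question < 6000, generation < 80000.
Records where think-tag removal was incomplete are excluded.
Tokenization (updated for speed).
Repetition removal.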
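The character-count filter described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's actual script: the field names `seed_question` and `generation` and the two thresholds come from the description, while the helper names (`keep_record`, `filter_jsonl_lines`) and the rule for detecting an incompletely removed think tag (any leftover `<think>` or `</think>` in the generation) are assumptions made for the example.

```python
import json
import re

# Thresholds taken from the description (character counts, not tokens).
MAX_QUESTION_CHARS = 6000
MAX_GENERATION_CHARS = 80000

# Assumption: a leftover opening or closing think tag means removal was incomplete.
LEFTOVER_THINK_TAG = re.compile(r"</?think>")


def keep_record(record: dict) -> bool:
    """Return True if a record passes the length and think-tag checks."""
    question = record.get("seed_question", "")
    generation = record.get("generation", "")
    if len(question) >= MAX_QUESTION_CHARS:
        return False
    if len(generation) >= MAX_GENERATION_CHARS:
        return False
    if LEFTOVER_THINK_TAG.search(generation):
        return False
    return True


def filter_jsonl_lines(lines):
    """Parse JSONL lines and keep only the records that pass keep_record."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keep_record(record):
            kept.append(record)
    return kept
```

Filtering on `len()` of the raw strings avoids running a tokenizer over every record, which matches the stated reason for using character counts; tokenization can then be applied only to the records that survive.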
qwen3-coder-480b-distill-mini
huggingface.co
Updated Aug 18, 2025
Cite
JIRONG (2025). qwen3-coder-480b-distill-mini [Dataset]. https://huggingface.co/datasets/Jackrong/qwen3-coder-480b-distill-mini
Dataset updated
Aug 18, 2025
Authors
JIRONG
License
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
qwen3-coder-480b-distill-mini

Short Description

This dataset is distilled using Qwen3-Coder-480B-A35B-Instruct. We extracted 10,000 code questions from microsoft/rStar-Coder as seed problems and distilled them with a 32K context; after cleaning and filtering, 9,543 samples remain. License: Apache-2.0.

Dataset Overview

Seed Source: 10,000 code reasoning problems sampled from microsoft/rStar-Coder.
Distillation Model: Qwen3-Coder-480B-A35B-Instruct (480B… See the full description on the dataset page: https://huggingface.co/datasets/Jackrong/qwen3-coder-480b-distill-mini.
microsoft/rStar-Coder: 13 scholarly articles cite this dataset (View in Google Scholar)
