Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
rStar-Coder Dataset
Project GitHub | Paper
Dataset Description
rStar-Coder is a large-scale competitive code problem dataset containing 418K programming problems, 580K long-reasoning solutions, and rich test cases of varying difficulty levels. This dataset aims to enhance code reasoning capabilities in large language models, particularly in handling competitive code problems. Experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/rStar-Coder.
chiruan/rStar-Coder-seed-test dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
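Source data: https://huggingface.co/datasets/microsoft/rStar-Coder
Number of records: 269,863
Average token count: 11,674
Maximum token count: 31,184
Total token count: 3,150,447,484
File format: JSONL
File size: unknown
Processing steps:
Uses synthetic_sft. Because token processing is expensive, records are filtered by character count: seed_question < 6000, generation < 80000. think tags are removed, records where this is incomplete are excluded, tokenization is applied (with a speed-improvement update), and repetitions are removed.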
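The character-count filter and think-tag cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's actual pipeline. It assumes `seed_question` and `generation` are string fields on each record, that think tags are written as `<think>...</think>`, and that an "incomplete" record is one whose opening and closing tags do not match. The helper names and the order of the steps are hypothetical.

```python
import re

# Thresholds quoted in the description (character counts, not tokens).
MAX_SEED_QUESTION_CHARS = 6000
MAX_GENERATION_CHARS = 80000

# Assumption: reasoning is wrapped in literal <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def has_balanced_think_tags(text: str) -> bool:
    """Assumed reading of 'incomplete': tag counts do not match."""
    return text.count("<think>") == text.count("</think>")


def strip_think(text: str) -> str:
    """Remove every complete think block and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def filter_records(records):
    """Keep records under both character limits with well-formed think tags."""
    kept = []
    for rec in records:
        if len(rec["seed_question"]) >= MAX_SEED_QUESTION_CHARS:
            continue
        if len(rec["generation"]) >= MAX_GENERATION_CHARS:
            continue
        if not has_balanced_think_tags(rec["generation"]):
            continue
        kept.append({**rec, "generation": strip_think(rec["generation"])})
    return kept
```

Filtering on `len()` before any tokenization avoids running a tokenizer over records that would be discarded anyway, which matches the stated motivation that token processing is expensive.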
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
qwen3-coder-480b-distill-mini
Short Description
This dataset is distilled using Qwen3-Coder-480B-A35B-Instruct. We extracted 10,000 code questions from microsoft/rStar-Coder as seed problems, distilled them with a 32K context, and kept 9,543 samples after cleaning and filtering. License: Apache-2.0.
Dataset Overview
Seed Source: 10,000 code reasoning problems sampled from microsoft/rStar-Coder.
Distillation Model: Qwen3-Coder-480B-A35B-Instruct (480B… See the full description on the dataset page: https://huggingface.co/datasets/Jackrong/qwen3-coder-480b-distill-mini.