Applying On-Policy Distillation (OPD) to Tool-Integrated Reasoning (TIR) suffers from cascading error propagation: incorrect tool calls inject out-of-distribution observations that progressively amplify the student-teacher distribution shift, rendering the teacher's token-level supervision unreliable or even harmful.
SOD (Step-wise On-policy Distillation) addresses this by introducing an adaptive step-level weighting mechanism that:
- Suppresses distillation loss on steps where the student has drifted far from the teacher (erroneous pattern)
- Restores full supervision when the student recovers alignment (recovery pattern)
- Maintains dense token-level guidance on well-aligned steps (stable pattern)
All of this comes at negligible additional computational cost: the divergence metric reuses log-probabilities already computed in the OPD forward pass.
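
This README does not spell out the divergence metric or the weighting function, so the following is only a minimal sketch of the idea under stated assumptions. It assumes the step divergence is the mean student-minus-teacher log-probability gap on the student's own tokens, and that the weight decays exponentially with that gap. The names `step_divergence`, `step_weight`, and `temperature` are hypothetical and are not taken from the SOD codebase.

```python
import math
from typing import List, Tuple

def step_divergence(student_logps: List[float], teacher_logps: List[float]) -> float:
    """Mean per-token log-prob gap (student minus teacher) over one step.

    On student-sampled tokens this is a single-sample estimate of the
    reverse KL, built only from log-probabilities OPD already computes.
    """
    gaps = [s - t for s, t in zip(student_logps, teacher_logps)]
    return sum(gaps) / len(gaps)

def step_weight(divergence: float, temperature: float = 1.0) -> float:
    """ASSUMED weighting: 1.0 when aligned, decaying toward 0 as drift grows."""
    return math.exp(-max(divergence, 0.0) / temperature)

def weighted_distill_loss(
    steps_student: List[List[float]],
    steps_teacher: List[List[float]],
    temperature: float = 1.0,
) -> Tuple[float, List[float]]:
    """Token-level log-prob gap, scaled by one weight per step.

    In real training the weight would be treated as a constant
    (no gradient flows through it); plain floats make that implicit here.
    """
    total, n_tokens, weights = 0.0, 0, []
    for s_lp, t_lp in zip(steps_student, steps_teacher):
        w = step_weight(step_divergence(s_lp, t_lp), temperature)
        weights.append(w)
        total += w * sum(s - t for s, t in zip(s_lp, t_lp))
        n_tokens += len(s_lp)
    return total / n_tokens, weights

# Three toy steps: stable, erroneous (teacher finds the tokens very unlikely), recovery.
student = [[-0.5, -0.6], [-0.2, -0.3], [-0.4, -0.5]]
teacher = [[-0.6, -0.7], [-4.2, -5.3], [-0.5, -0.5]]
loss, weights = weighted_distill_loss(student, teacher)
print(loss, weights)
```

In this toy run the middle step receives a weight close to zero, so its unreliable teacher signal contributes little to the loss, while the first and last steps keep nearly full supervision.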
Experiments on challenging math, science, and code benchmarks show that SOD achieves up to a 20.86% relative improvement in average score over the second-best baseline. Notably, our 0.6B student reaches 26.13% average@32 on AIME 2025.
Performance comparison of the Qwen3 series on 4 benchmarks. We report average@32.
| Params | Method | AIME 2024 | AIME 2025 | GPQA | LiveCodeBench | Average |
|---|---|---|---|---|---|---|
| 0.6B | Vanilla | 7.71 | 12.81 | 13.24 | 14.89 | 12.16 |
| 0.6B | SFT | 5.67 | 5.42 | 15.20 | 9.61 | 8.97 |
| 0.6B | GRPO | 4.06 | 4.90 | 20.38 | 15.95 | 11.32 |
| 0.6B | OPD | 16.82 | 22.95 | 17.76 | 22.65 | 20.04 |
| 0.6B | OPSD_gt | 12.63 | 17.04 | 17.32 | 16.73 | 15.93 |
| 0.6B | OPSD_hint | 9.77 | 14.12 | 15.98 | 12.65 | 13.13 |
| 0.6B | SOD | 20.84 | 26.13 | 22.19 | 27.72 | 24.22 |
| 1.7B | Vanilla | 9.90 | 8.96 | 26.80 | 22.73 | 17.10 |
| 1.7B | SFT | 26.77 | 22.40 | 29.85 | 24.63 | 25.91 |
| 1.7B | GRPO | 25.63 | 21.67 | 33.55 | 20.70 | 25.39 |
| 1.7B | OPD | 43.86 | 37.04 | 31.73 | 32.45 | 36.27 |
| 1.7B | OPSD_gt | 33.85 | 24.69 | 35.02 | 22.73 | 29.07 |
| 1.7B | OPSD_hint | 34.42 | 21.43 | 33.46 | 23.12 | 28.11 |
| 1.7B | SOD | 50.83 | 41.72 | 38.72 | 40.63 | 42.98 |
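
The headline gain can be checked against the Average column of the table above. The short calculation below assumes the quoted figure is the relative gain of SOD over OPD, the second-best method at both model sizes. That reading matches the 0.6B row, but it is an interpretation rather than something the README states.

```python
# Average column from the table above: (SOD, second-best baseline OPD).
averages = {"0.6B": (24.22, 20.04), "1.7B": (42.98, 36.27)}

def relative_gain(sod: float, baseline: float) -> float:
    """Relative improvement of SOD over the baseline, in percent."""
    return 100.0 * (sod - baseline) / baseline

gains = {size: round(relative_gain(*pair), 2) for size, pair in averages.items()}
print(gains)  # {'0.6B': 20.86, '1.7B': 18.5}
```
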

    git clone https://github.com/YoungZ365/SOD.git
    conda create -n SOD python=3.11
    conda activate SOD
    cd SOD
    bash scripts/install_vllm_sglang_mcore.sh
    pip install -e .[vllm]

Download the following datasets:

| Dataset | Link | Usage |
|---|---|---|
| 3K Agentic SFT Data | 🤗 HuggingFace | Cold-start SFT |
| 30K Agentic RL Data | 🤗 HuggingFace | RL / Distillation Training |
| Evaluation Benchmarks | 🤗 HuggingFace | AIME2024/2025, GPQA-Diamond, LiveCodeBench |
Configure SandboxFusion for code execution:
- Local Deployment: Refer to SandboxFusion deployment docs
- Cloud Service: Use Volcano Engine Code Sandbox
After obtaining an API endpoint, configure it in:
- `recipe/demystify/sandbox_fusion_tool_config.yaml`
- The function `check_correctness` in `verl/utils/reward_score/livecodebench/code_math.py`
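
Before wiring the endpoint into the two locations above, it can help to confirm that the sandbox actually responds. The sketch below assumes the `/run_code` route and the `code` / `language` payload fields described in the SandboxFusion documentation. The URL `http://localhost:8080` is a placeholder, and the helper names are hypothetical. Check the route and fields against your own deployment before relying on it.

```python
import json
import urllib.request

def build_run_code_request(endpoint: str, code: str,
                           language: str = "python") -> urllib.request.Request:
    """Build a POST request for the (assumed) SandboxFusion /run_code route."""
    payload = json.dumps({"code": code, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/run_code",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def smoke_test(endpoint: str, timeout: float = 30.0) -> dict:
    """Send a trivial program and return the decoded JSON response."""
    request = build_run_code_request(endpoint, "print(1 + 1)")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Usage (requires a running sandbox; the URL is a placeholder):
#   print(smoke_test("http://localhost:8080"))
```
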
Configure `examples/SOD/run_sft.sh` with your paths:
- `MODEL_PATH`: Base model path (e.g., Qwen3-1.7B or Qwen3-0.6B)
- `TRAIN_DATA`: Path to the SFT `.parquet` file
- `SAVE_PATH`: Directory to save SFT checkpoints
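
A mistyped path typically only surfaces after the job has started, so a quick pre-flight check can save a failed launch. The helper below is a hypothetical sketch and not part of the repository. It assumes `MODEL_PATH` is a local directory rather than a hub identifier, and that `SAVE_PATH` may not exist yet.

```python
from pathlib import Path
from typing import List

def check_sft_paths(model_path: str, train_data: str, save_path: str) -> List[str]:
    """Return human-readable problems; an empty list means the paths look sane."""
    problems = []
    # Assumption: the base model has been downloaded to a local directory.
    if not Path(model_path).is_dir():
        problems.append(f"MODEL_PATH is not a directory: {model_path}")
    data = Path(train_data)
    if data.suffix != ".parquet":
        problems.append(f"TRAIN_DATA is not a .parquet file: {train_data}")
    elif not data.is_file():
        problems.append(f"TRAIN_DATA does not exist: {train_data}")
    save = Path(save_path)
    # The save directory may be created later, but it must not be a regular file.
    if save.exists() and not save.is_dir():
        problems.append(f"SAVE_PATH exists but is not a directory: {save_path}")
    return problems

# Usage with placeholder values; substitute the paths set in run_sft.sh.
for problem in check_sft_paths("/path/to/Qwen3-1.7B", "/path/to/sft.parquet", "/path/to/ckpt"):
    print(problem)
```
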

Launch SFT with `bash examples/SOD/run_sft.sh`.

After SFT, merge the model checkpoint:

    python3 -m verl.model_merger merge --backend fsdp \
        --local_dir <checkpoint_dir>/global_step_xxx \
        --target_dir <checkpoint_dir>/global_step_xxx/huggingface

Configure `examples/SOD/run_sod.sh` with your paths:
- `MODEL_PATH`: Path to the SFT student model
- `TEACHER_MODEL_PATH`: Path to the teacher model (e.g., a GRPO-trained 4B model)
- `TRAIN_DATA`: Path to the RL `.parquet` file (30K dataset)
- Evaluation data paths for AIME2024/2025

Launch SOD training with `bash examples/SOD/run_sod.sh`.

Training Resources: 8× NVIDIA H20 96GB GPUs, batch size 64.
You can monitor training dynamics and evaluation results via Weights & Biases (wandb).
We support evaluation on AIME 2024/2025, GPQA-Diamond, and LiveCodeBench-v6.
Taking AIME as an example:

    bash examples/SOD/eval/run_eval_aime.sh

You can then view the average@32 / pass@32 / maj@32 metrics in your wandb project.
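
For readers unfamiliar with the three metrics, the sketch below shows how they are conventionally computed from k sampled answers per problem (k = 32 here). It uses the standard definitions and exact string matching. The repository's own scoring, including answer normalization and tie-breaking in the majority vote, may differ, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

def sample_metrics(answers: List[str], reference: str) -> Dict[str, float]:
    """Metrics for one problem given k sampled final answers."""
    correct = [answer == reference for answer in answers]
    # Ties go to the answer that appeared first (Counter keeps insertion order).
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return {
        "average": sum(correct) / len(answers),      # fraction of samples that are correct
        "pass": float(any(correct)),                 # at least one sample is correct
        "maj": float(majority_answer == reference),  # majority-voted answer is correct
    }

def benchmark_metrics(problems: List[Tuple[List[str], str]]) -> Dict[str, float]:
    """Average each per-problem metric over the whole benchmark."""
    per_problem = [sample_metrics(answers, ref) for answers, ref in problems]
    return {key: sum(m[key] for m in per_problem) / len(per_problem)
            for key in ("average", "pass", "maj")}

# Toy example with k = 4 samples per problem instead of 32.
toy = [(["42", "42", "17", "42"], "42"), (["1", "2", "2", "3"], "1")]
print(benchmark_metrics(toy))  # {'average': 0.5, 'pass': 1.0, 'maj': 0.5}
```
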
If needed, you can download our distilled models directly:
| Model | Link | Description |
|---|---|---|
| SOD-0.6B | 🤗 HuggingFace | SOD distilled from Qwen3-4B teacher |
| SOD-1.7B | 🤗 HuggingFace | SOD distilled from Qwen3-4B teacher |
| SOD-GRPO_teacher-4B | 🤗 HuggingFace | GRPO-trained Qwen3-4B teacher model |
All models are also available in our HuggingFace Collection.

    @article{zhong2026sod,
      title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
      author={Zhong, Qiyong and Zheng, Mao and Song, Mingyang and Lin, Xin and Sun, Jie and Jiang, Houcheng and Wang, Xiang and Fang, Junfeng},
      journal={arXiv preprint arXiv:2605.07725},
      year={2026}
    }

Our implementation builds upon the excellent codebases of VeRL, Open-AgentRL, and ReTool. We sincerely thank these projects for their valuable contributions to the community.

