Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
Abstract
RLRT enhances self-distillation by reinforcing successful student decisions that deviate from teacher predictions, enabling more effective exploration in reinforcement learning with verifiable rewards (RLVR).
Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both derived from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. We therefore propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, those tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
Community
Enough with the obedient student. Time to rebel.
So far in LLM post-training, on-policy self-distillation pulls the student toward the teacher (the same model, but one that has seen a correct solution). But what happens if we force the student to mimic the teacher even on paths it already solved correctly? Its own reasoning gets erased.
We introduce RLRT (RLVR with Reversed Teacher). Instead of pulling the student toward the teacher, we amplify the tokens where the student diverged from the teacher (who has seen a correct solution) yet still arrived at the correct answer. These tokens depart from one correct path yet remain verified, making them both self-driven and valuable exploration.
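To make the idea concrete, here is a minimal sketch (not the paper's implementation) of how a reversed-teacher signal could be folded into a GRPO-style update: on verified-correct rollouts, tokens where the student's log-probability diverges from the teacher's receive an extra advantage weight. The function name, threshold, and bonus below are hypothetical placeholders; the paper defines the actual divergence measure and weighting.

```python
import torch

def rlrt_token_weights(student_logprobs: torch.Tensor,
                       teacher_logprobs: torch.Tensor,
                       is_correct: bool,
                       divergence_threshold: float = 1.0,
                       bonus: float = 0.5) -> torch.Tensor:
    """Per-token advantage multipliers for a GRPO-style update (illustrative sketch).

    student_logprobs / teacher_logprobs: log-probs of the sampled rollout
    tokens under the student and under the teacher (the same model,
    conditioned on a correct solution). Shape: (seq_len,).
    is_correct: whether the verifier accepted the rollout.
    """
    weights = torch.ones_like(student_logprobs)
    if is_correct:
        # Tokens the student chose that the teacher would not have predicted:
        # a large positive gap marks a self-driven divergence from the teacher.
        divergence = student_logprobs - teacher_logprobs
        self_driven = divergence > divergence_threshold
        # Reinforce these verified, self-driven tokens more strongly.
        weights = torch.where(self_driven, weights + bonus, weights)
    return weights
```

In a GRPO loop, these weights would simply multiply the per-token advantage before the policy-gradient step; on incorrect rollouts the update is left unchanged.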
And the results?
RLRT consistently outperforms GRPO, self-distillation, and exploration baselines across base, instruction-tuned, and thinking-tuned Qwen3 models on 6 math benchmarks (AIME24/25/26, HMMT26, AMC23, MATH500), with gains up to +18.0%.
Largest gains on base models, where the policy had substantial headroom to explore.
![[1] rlrt_concept](https://cdn-uploads.huggingface.co/production/uploads/63e48f6d9db5da2dc1f6288e/BNi9GN5i2Owpy3B526xUh.png)