Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
Abstract
RLRT enhances self-distillation by reinforcing successful student decisions that deviate from teacher predictions, enabling more effective exploration in reinforcement learning with verifiable rewards (RLVR).
Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both derived from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. We therefore propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, those tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
Community
Enough with the obedient student. Time to rebel.
So far in LLM post-training, on-policy self-distillation pulls the student toward the teacher (the same model, but one that has seen a correct solution). But what happens if we force the student to mimic the teacher even on paths it already solved correctly? Its own reasoning gets erased.
We introduce RLRT (RLVR with Reversed Teacher). Instead of pulling the student toward the teacher, we amplify the tokens where the student diverged from the teacher (who has seen a correct solution) yet still arrived at the correct answer. These tokens depart from one correct path yet remain verified, making them both self-driven and valuable exploration.
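To make the idea concrete, here is a minimal sketch (not the paper's implementation) of how a reversed-teacher signal could be folded into a GRPO-style update: on verified-correct rollouts, tokens where the student's log-probability diverges from the teacher's receive an extra advantage weight. The function name, threshold, and bonus below are hypothetical placeholders; the paper defines the actual divergence measure and weighting.

```python
import torch

def rlrt_token_weights(student_logprobs: torch.Tensor,
                       teacher_logprobs: torch.Tensor,
                       is_correct: bool,
                       divergence_threshold: float = 1.0,
                       bonus: float = 0.5) -> torch.Tensor:
    """Per-token advantage multipliers for a GRPO-style update (illustrative sketch).

    student_logprobs / teacher_logprobs: log-probs of the sampled rollout
    tokens under the student and under the teacher (the same model,
    conditioned on a correct solution). Shape: (seq_len,).
    is_correct: whether the verifier accepted the rollout.
    """
    weights = torch.ones_like(student_logprobs)
    if is_correct:
        # Tokens the student chose that the teacher would not have predicted:
        # a large positive gap marks a self-driven divergence from the teacher.
        divergence = student_logprobs - teacher_logprobs
        self_driven = divergence > divergence_threshold
        # Reinforce these verified, self-driven tokens more strongly.
        weights = torch.where(self_driven, weights + bonus, weights)
    return weights
```

In a GRPO loop, these weights would simply multiply the per-token advantage before the policy-gradient step; on incorrect rollouts the update is left unchanged.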
And the results?
RLRT consistently outperforms GRPO, self-distillation, and exploration baselines across base, instruction-tuned, and thinking-tuned Qwen3 models on 6 math benchmarks (AIME24/25/26, HMMT26, AMC23, MATH500), with gains up to +18.0%.
Largest gains on base models, where the policy had substantial headroom to explore.
![[1] rlrt_concept](https://cdn-uploads.huggingface.co/production/uploads/63e48f6d9db5da2dc1f6288e/BNi9GN5i2Owpy3B526xUh.png)