arxiv:2605.11609

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Published on May 12 · Submitted by floyed shen on May 20

#3 Paper of the day

rednote-hilab


Authors:

Dongcheng Zhao ,

Abstract

Anti-Self-Distillation reverses the direction of knowledge transfer in self-distillation to improve math reasoning efficiency and accuracy.

AI-generated summary

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
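For readers unfamiliar with the quantity the abstract refers to, the standard definition of pointwise mutual information between a generated token and the privileged context can be written as below. The notation is an assumption for illustration and may differ from the paper's: $x$ is the problem, $c$ the privileged context (e.g., a verified solution), $y_t$ the token at position $t$, and $\pi_\theta$ the shared model, so the numerator plays the role of the teacher and the denominator the student.

```latex
\mathrm{PMI}(y_t ; c \mid x, y_{<t})
  = \log \frac{\pi_\theta(y_t \mid x, c, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}
```

Under this reading, a positive value marks a token on which the context raises the teacher's confidence (the abstract's structural connectives and verifiable claims), and a negative value marks a token on which it lowers it (deliberation tokens such as "Wait", "Let", "Maybe").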


Community

floyed

Paper submitter about 18 hours ago

AntiSD reaches GRPO's accuracy in 2–10× fewer training steps and improves final accuracy by up to +11.5 points on AIME 2024/2025, HMMT 2025, and BeyondAIME — consistent across 4B–30B dense and MoE models.

Standard self-distillation in reasoning RL pulls the student toward a teacher conditioned on a verified solution. The privileged context makes the teacher sharp on template tokens but unsure on the deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. Descending the student-teacher divergence therefore reinforces templates at the cost of reasoning.

AntiSD flips the sign: instead of descending the divergence, we ascend a bounded Jensen–Shannon divergence between student and teacher, with an entropy-triggered gate that disables the term once the teacher entropy collapses. No token-level reward shaping, no length normalization, no schedule heuristics.
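To make the two ingredients concrete, here is a minimal NumPy sketch of a per-position Jensen–Shannon divergence with an entropy gate. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the function name `gated_js_bonus`, the `entropy_floor` threshold and its value, and the choice to zero the term below the floor are all hypothetical, and how the term enters the RL objective is not shown.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    # Shannon entropy (in nats) of each distribution along the last axis.
    return -(p * np.log(p + eps)).sum(axis=-1)

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence per position; bounded above by ln 2.
    m = 0.5 * (p + q)
    kl_pm = (p * (np.log(p + eps) - np.log(m + eps))).sum(axis=-1)
    kl_qm = (q * (np.log(q + eps) - np.log(m + eps))).sum(axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def gated_js_bonus(student_logits, teacher_logits, entropy_floor=0.1):
    # Hypothetical helper: the quantity to be *ascended* at each position,
    # set to zero wherever the teacher's entropy has collapsed below the
    # (assumed) floor.
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    gate = entropy(q) > entropy_floor
    return np.where(gate, js_divergence(p, q), 0.0)

# Two positions, vocabulary of three tokens.
student = np.zeros((2, 3))
teacher = np.array([[2.0, 0.0, 0.0],    # moderately confident teacher
                    [50.0, 0.0, 0.0]])  # collapsed, near one-hot teacher
bonus = gated_js_bonus(student, teacher)
```

In this toy input the first position keeps a positive term, while the second is gated off because the teacher distribution is nearly one-hot. The upper bound of ln 2 on the Jensen–Shannon divergence is what keeps the term bounded without extra clipping.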

Code: https://github.com/FloyedShen/AntiSD
Paper: https://www.alphaxiv.org/abs/2605.11609