arxiv:2606.05152

Reinforcement Learning from Rich Feedback with Distributional DAgger

Published on Jun 3 · Submitted by Rishabh Agrawal on Jun 8

Abstract

Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods.

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
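To make the two families of objectives named in the abstract concrete, here is a minimal sketch of the textbook per-state definitions for categorical distributions: forward cross-entropy (expert-weighted log-likelihood of the student) and reverse KL (student-weighted log-ratio). This is not the paper's implementation; the function names, the toy distributions, and the simple averaging over student-visited states are all assumptions made for illustration.

```python
import math

def forward_cross_entropy(expert_probs, student_probs, eps=1e-12):
    """H(expert, student) = -sum_a expert(a) * log student(a)."""
    return -sum(p * math.log(q + eps) for p, q in zip(expert_probs, student_probs))

def reverse_kl(expert_probs, student_probs, eps=1e-12):
    """KL(student || expert) = sum_a student(a) * log(student(a) / expert(a))."""
    return sum(
        q * (math.log(q + eps) - math.log(p + eps))
        for p, q in zip(expert_probs, student_probs)
    )

def mean_forward_cross_entropy(expert_dists, student_dists):
    """Average forward cross-entropy over states visited by the student.

    Hypothetical helper: each list entry is the next-action distribution
    at one state along a student rollout.
    """
    losses = [forward_cross_entropy(e, s) for e, s in zip(expert_dists, student_dists)]
    return sum(losses) / len(losses)

# Toy distributions over three actions (illustrative values only).
expert = [0.7, 0.2, 0.1]
student = [0.3, 0.4, 0.3]

print("forward CE:", forward_cross_entropy(expert, student))
print("reverse KL:", reverse_kl(expert, student))
print("mean forward CE over two states:",
      mean_forward_cross_entropy([expert, expert], [student, expert]))
```

Note that the forward cross-entropy only needs the expert's probabilities as weights, which is consistent with the abstract's remark that the objective admits a black-box expert; the reverse KL weights by the student's own probabilities instead.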

Community


We found something surprising about existing self-distillation methods:

Even when the feedback-conditioned "teacher" is better than the student, imitating it can still make the student worse.

This is particularly striking because self-distillation has become one of the most promising ways to go beyond RLVR, where every token receives the same trajectory-level reward.
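As a rough illustration of the contrast drawn here, the sketch below compares RLVR-style credit, where one trajectory-level bit is copied to every token, with a per-token signal computed from expert-student disagreement. It is a toy under stated assumptions: the function names and the example distributions are hypothetical, and the per-token term is the standard cross-entropy, not necessarily the exact quantity used in the paper.

```python
import math

def rlvr_token_credit(num_tokens, is_correct):
    """Copy a single correctness bit to every token of the response."""
    reward = 1.0 if is_correct else 0.0
    return [reward] * num_tokens

def per_token_cross_entropy(expert_dists, student_dists, eps=1e-12):
    """One expert-student cross-entropy value per token position."""
    return [
        -sum(p * math.log(q + eps) for p, q in zip(e, s))
        for e, s in zip(expert_dists, student_dists)
    ]

# A three-token response over a two-symbol vocabulary (illustrative values).
expert_dists = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
student_dists = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]

print(rlvr_token_credit(3, is_correct=False))   # identical value at every position
print(per_token_cross_entropy(expert_dists, student_dists))  # varies by position
```

In this toy, the uniform signal cannot tell the three positions apart, while the per-token values are largest at the position where the student and expert disagree most.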

So we asked:

Can we learn from rich feedback in a way that actually guarantees monotonic policy improvement?

Introducing DistIL: Reinforcement Learning from Rich Feedback with Distributional DAgger.

Core idea: view rich-feedback RL as distributional imitation learning.

This gives:
• monotonic policy improvement guarantees
• regret bounds

And empirically, DistIL improves over RLVR, SDPO, and OPSD on:
• science reasoning
• coding
• mathematical reasoning

Paper: https://arxiv.org/pdf/2606.05152

