Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
Abstract
RLVR training frequently suffers from policy entropy collapse; this paper revisits the role of Gradient-Preserving Clipping in entropy dynamics and proposes dynamic clipping-threshold schedules for precise entropy control.
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continued training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first verify, theoretically and empirically, how specific importance-sampling-ratio regions contribute to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism that uses dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory-decay schedules. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
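As a rough illustration of the kind of mechanism the abstract describes, the sketch below pairs a PPO-style clipped surrogate with a hypothetical oscillatory-decay schedule for the clipping threshold, and applies the clamp straight-through as one plausible reading of gradient-preserving clipping. The function names, constants, and schedule shape are illustrative assumptions, not the paper's exact formulation.

```python
import math

import torch


def oscillatory_decay_eps(step: int, total_steps: int,
                          eps_base: float = 0.2,
                          eps_amp: float = 0.1,
                          period: int = 200) -> float:
    """Hypothetical oscillatory-decay schedule for the clipping threshold.

    The oscillation amplitude shrinks linearly over training, so the clip
    range fluctuates early on and settles near eps_base by the end. The
    constants and exact shape are illustrative, not the paper's schedule.
    """
    decay = 1.0 - step / max(total_steps, 1)
    return eps_base + eps_amp * decay * math.sin(2.0 * math.pi * step / period)


def clipped_surrogate(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      eps_low: float,
                      eps_high: float,
                      gradient_preserving: bool = True) -> torch.Tensor:
    """PPO-style clipped surrogate with asymmetric, schedulable thresholds.

    With gradient_preserving=True the clamp is applied straight-through:
    the forward value is clipped, but gradients still flow through the raw
    importance ratio. This is one plausible reading of gradient-preserving
    clipping, not necessarily the paper's exact operator.
    """
    ratio = torch.exp(logp_new - logp_old)              # importance sampling ratio
    clamped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    if gradient_preserving:
        clamped = ratio + (clamped - ratio).detach()    # straight-through clamp
    return -torch.min(ratio * advantages, clamped * advantages).mean()


if __name__ == "__main__":
    # Toy usage with random data; real log-probs/advantages come from rollouts.
    logp_old = torch.randn(8)
    logp_new = (logp_old + 0.1 * torch.randn(8)).requires_grad_(True)
    advantages = torch.randn(8)
    eps_high = oscillatory_decay_eps(step=500, total_steps=2000)
    loss = clipped_surrogate(logp_new, logp_old, advantages,
                             eps_low=0.2, eps_high=eps_high)
    loss.backward()
    print(float(loss), logp_new.grad.abs().mean().item())
```

In this reading, the schedule widens or narrows the clip range over training, changing which importance-ratio regions contribute gradients and thereby nudging policy entropy up or down.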
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models (2026)
- Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR (2026)
- DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning (2026)
- ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization (2026)
- Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning (2026)
- Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards (2025)
- Rethinking the Trust Region in LLM Reinforcement Learning (2026)