DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning Paper • 2605.25604 • Published 4 days ago • 128
SkillOpt: Executive Strategy for Self-Evolving Agent Skills Paper • 2605.23904 • Published 7 days ago • 186
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information Paper • 2605.11609 • Published 17 days ago • 193
RLPR: Extrapolating RLVR to General Domains without Verifiers Paper • 2506.18254 • Published Jun 23, 2025 • 35
Reinforcement-aware Knowledge Distillation for LLM Reasoning Paper • 2602.22495 • Published Feb 26 • 5
Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning Paper • 2602.01058 • Published Feb 1 • 45
LLM Embeddings Explained: A Visual and Intuitive Guide • Running • 345 • How Language Models Turn Text into Meaning, From Traditional