Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training Paper • 2602.01511 • Published 5 days ago • 13
Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning Paper • 2510.03259 • Published Sep 26, 2025 • 57
OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment Paper • 2510.07743 • Published Oct 9, 2025 • 10
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning Paper • 2505.16421 • Published May 22, 2025 • 19
Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models Paper • 2505.16265 • Published May 22, 2025 • 8
Discriminative Finetuning of Generative Large Language Models without Reward Models and Preference Data Paper • 2502.18679 • Published Feb 25, 2025 • 2