Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges Paper • 2604.13602 • Published 27 days ago • 32
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation Paper • 2601.14691 • Published Jan 21 • 1
ThinkPRM Collection Process Reward Models that Think -- https://arxiv.org/abs/2504.16828 • 8 items • Updated Jul 29, 2025 • 6
view article Article SmolLM3: smol, multilingual, long-context reasoner +21 eliebak, cmpatino, anton-l, edbeeching, m-ric, nouamanetazi, akseljoonas, guipenedo, hynky, clefourrier, SaylorTwift, kashif, qgallouedec, hlarcher, glutamatt, Xenova, reach-vb, ngxson, craffel, lewtun, loubnabnl, lvwerra, thomwolf • Jul 8, 2025 • 773
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists Paper • 2506.01241 • Published Jun 2, 2025 • 9
ThinkPRM Collection Process Reward Models that Think -- https://arxiv.org/abs/2504.16828 • 8 items • Updated Jul 29, 2025 • 6