Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders • Paper • 2605.16339 • Published May 7