Jailbreak Distillation: Renewable Safety Benchmarking Paper • 2505.22037 • Published May 28, 2025 • 1
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety Paper • 2510.08240 • Published Oct 9, 2025 • 41
Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models Paper • 2510.21978 • Published Oct 24, 2025 • 16
Reasoning over mathematical objects: on-policy reward modeling and test-time aggregation Paper • 2603.18886 • Published Mar 19, 2026 • 6
The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts Paper • 2401.13136 • Published Jan 23, 2024
Certified Mitigation of Worst-Case LLM Copyright Infringement Paper • 2504.16046 • Published Apr 22, 2025 • 13
Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data Paper • 2404.03862 • Published Apr 5, 2024
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements Paper • 2410.08968 • Published Oct 11, 2024 • 14
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning Paper • 2410.01044 • Published Oct 1, 2024 • 35