Process Reward Models that Think -- https://arxiv.org/abs/2504.16828
AI & ML interests
Factuality, reasoning, alignment, LLM applications
spaces 7
Running
LudoBench
🎲
Multimodal Game Reasoning Benchmark [ICLR 2026]
Running
Answer Convergence Early Stopping
🛑
Demo for the EMNLP paper "Answer Convergence as a Signal..."
Runtime error
FactRBench
🏆
View and analyze long-form factuality leaderboard
Running
3
ExpertLongBench
🚀
Leaderboard for ExpertLongBench
Sleeping
1
ManyICLBench
🚀
Leaderboard for ManyICLBench
Running
MLRC-BENCH
📊
Display model performance rankings
datasets 13
launch/LudoBench
Viewer • Updated • 638 • 180
launch/ExpertLongBench
Preview • Updated • 639 • 10
launch/thinkprm-1K-verification-cots
Viewer • Updated • 1k • 23 • 6
launch/ManyICLBench
Viewer • Updated • 66 • 569 • 1
launch/CMV
Viewer • Updated • 133 • 18
launch/FactRBench
Viewer • Updated • 1.06k • 84 • 1
launch/FactBench
Viewer • Updated • 1k • 87 • 3
launch/CLASH
Viewer • Updated • 345 • 93 • 4
launch/gov_report
Viewer • Updated • 58.4k • 379 • 11
launch/gov_report_qs
Viewer • Updated • 7.87k • 76 • 4