Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers
Abstract
Reasoning traces in large language models reveal that partial-input success in multiple-choice question answering can indicate valid inference rather than shallow shortcuts, suggesting a potential method to distinguish problematic from less problematic reasoning patterns.
Large language models (LLMs) now give reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet, a concern is that LLMs do not solve MCQs as intended, as work finds LLMs sans reasoning succeed in MCQA without using the question, i.e., choices-only. Such partial-input success is often linked to trivial shortcuts, but reasoning traces could reveal if choices-only strategies are truly shallow. To examine these strategies, we have reasoning LLMs solve MCQs in full and choices-only inputs; test-time reasoning often boosts accuracy in full and in choices-only, half the time. While possibly due to shallow shortcuts, choices-only success is barely affected by the length of reasoning traces, and after finding traces pass faithfulness tests, we show they use less problematic strategies like inferring missing questions. In all, we challenge claims that partial-input success is always a flaw, so we propose how reasoning traces could separate problematic data from less problematic reasoning.
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper