Brevity Constraints Reverse Performance Hierarchies in Language Models
Abstract
Large language models can underperform smaller ones because their verbose responses introduce errors; constraining output length reveals their superior latent capabilities and improves performance across benchmarks.
Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite having 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous, scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate that this behavior stems from correctable prompt design rather than fundamental capability limitations. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models -- direct inversions of the original gaps. These reversals demonstrate that large models possess superior latent capabilities that universal prompting masks. We validate our findings through three independent contamination tests and show that inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.
Community
We evaluate 31 language models (0.5B–405B parameters) across 1,485 problems
from five standard benchmarks and identify a systematic but correctable failure
mode: on 7.7% of problems, small models (≤10B) outperform large models (≥70B)
by an average of 28.4 percentage points (Cohen's d = 1.34).
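For concreteness, the headline gap and effect size can be reproduced from per-problem accuracies with a short script. This is a minimal sketch with placeholder data; it is not the paper's evaluation code.

```python
import numpy as np

def cohens_d(a, b):
    """Effect size from two samples, using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
                 / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Placeholder per-problem accuracies on the flagged subset (0 = wrong, 1 = correct)
small_acc = np.array([1, 1, 0, 1, 1, 1, 0, 1], dtype=float)
large_acc = np.array([0, 1, 0, 0, 1, 0, 0, 1], dtype=float)

gap_pp = 100 * (small_acc.mean() - large_acc.mean())
print(f"gap = {gap_pp:.1f}pp, Cohen's d = {cohens_d(small_acc, large_acc):.2f}")
```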
The mechanism is scale-dependent verbosity. Large models spontaneously
generate responses 59% longer than those of small models on these problems — not through
more explicit reasoning steps (9.1 vs 10.5), but through verbose implicit
elaboration that accumulates errors. We call this overthinking.
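A minimal sketch of how verbosity could be quantified. The word-count measure is the obvious proxy for length; `count_explicit_steps` is an assumed heuristic, not the paper's step-counting criterion.

```python
import re

def word_count(response: str) -> int:
    """Length of a response in whitespace-separated words."""
    return len(response.split())

def count_explicit_steps(response: str) -> int:
    """Heuristic: count lines that look like enumerated reasoning steps.
    This is an assumed proxy, not the paper's definition."""
    pattern = re.compile(r"^\s*(step\s*\d+|\d+[.)])", re.IGNORECASE)
    return sum(bool(pattern.match(line)) for line in response.splitlines())

def verbosity_ratio(large_responses, small_responses):
    """Mean large-model response length relative to mean small-model length."""
    mean_len = lambda rs: sum(map(word_count, rs)) / len(rs)
    return mean_len(large_responses) / mean_len(small_responses)
```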
A simple intervention reverses the hierarchy. Adding brevity constraints
("answer in under 50 words") improves large model accuracy by +26.3pp and
reduces the performance gap by 67%. Critically, on GSM8K and MMLU-STEM, the
gap doesn't just close — it fully reverses: large models go from losing by
13.1pp and 27.3pp to winning by 7.7pp and 15.9pp respectively.
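A sketch of the intervention itself: the wording of the constraint follows the quoted instruction, while `model_generate` and the problem format are assumed interfaces rather than anything specified by the paper.

```python
def with_brevity(prompt: str, max_words: int = 50) -> str:
    """Append a brevity constraint of the kind used in the intervention."""
    return f"{prompt}\n\nAnswer in under {max_words} words."

def compare_accuracy(model_generate, problems):
    """Accuracy with and without the brevity constraint.

    Assumed interfaces: model_generate(prompt) -> str, and each problem is
    a dict {'prompt': str, 'check': callable mapping an answer to bool}.
    """
    base = sum(p["check"](model_generate(p["prompt"])) for p in problems)
    brief = sum(p["check"](model_generate(with_brevity(p["prompt"]))) for p in problems)
    n = len(problems)
    return base / n, brief / n
```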
This effect is architecture-independent, replicating across Llama, Qwen,
Gemma, and Mistral families (5/5 datasets each), and operates continuously
across the full parameter spectrum (Pearson r = −0.388, p = 0.0035).
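The continuous-scaling claim reduces to a correlation between (log) parameter count and accuracy on the affected problems. A sketch with placeholder numbers; the real values are in the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder values: model size in billions of parameters and accuracy
# on the flagged subset (illustrative only, not the paper's data)
sizes_b = np.array([0.5, 1.5, 3, 7, 13, 34, 70, 180, 405])
accuracy = np.array([0.62, 0.66, 0.68, 0.61, 0.55, 0.49, 0.42, 0.40, 0.37])

r, p = pearsonr(np.log10(sizes_b), accuracy)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")  # a negative r indicates inverse scaling
```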
Three independent contamination tests (response diversity: 89–100% unique;
length variability: CV = 0.31–1.21; error taxonomy: 41–82% over-reasoning
failures) confirm genuine capability differences rather than memorization
artifacts.
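The first two contamination checks are straightforward to compute. A minimal sketch, assuming a model's outputs for a problem are available as plain strings; the function names are illustrative.

```python
import numpy as np

def response_diversity(responses):
    """Fraction of unique responses; values near 1 argue against memorization."""
    return len(set(responses)) / len(responses)

def length_cv(responses):
    """Coefficient of variation of response lengths (in words)."""
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    return lengths.std(ddof=1) / lengths.mean()
```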
The implication: inverse scaling on standard benchmarks reflects prompt
design failure, not architectural limitation. Large models possess superior
latent capabilities that universal prompting obscures. Scale-aware prompt
engineering — not larger models or retraining — is sufficient to recover them.