Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
Abstract
Majority voting improves mathematical reasoning but is limited by correlated errors; diverse reasoning strategies and model capability are more impactful than prompt engineering.
Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to assign different reasoning strategies to different voters. This approach, Diverse Prompt Mixer, is tested under AIMO 3 competition constraints: 3 models, 23+ experiments, 50 IMO-level problems, one H100 80 GB GPU, and a 5-hour limit. Every prompt-level intervention fails: high-temperature sampling already decorrelates errors, and weaker strategies reduce accuracy more than they reduce correlation. At equal N=8 and across every optimization tested, the 8-point model capability gap dominates. The gap between the best majority-vote score (42/50) and pass@20 (≈45.5) is selection loss, not prompt loss; a verifier-based selector could close it, while prompt engineering cannot.
Diverse Prompt Mixer assigns different reasoning strategies to majority-voting members to decorrelate errors. Tested on 50 IMO-level problems (1×H100, 5-hour limit, 3 models, 23+ experiments). It does not work.
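Majority voting itself is simple: sample N attempts, extract each final answer, and return the most frequent one. A minimal sketch (the function name and first-occurrence tie-breaking are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer across attempts.

    Counter.most_common breaks ties by insertion order, so the
    earliest-seen answer wins a tie.
    """
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from N=8 attempts on one problem.
print(majority_vote([840, 840, 336, 840, 113, 336, 840, 42]))  # 840
```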
Why it fails:
High-temperature sampling already pushes pairwise error correlation to zero or below (mean ρ̂ = −0.348 across 19 computable points). There is no correlation headroom left. Diverse prompts reduce per-attempt accuracy more than they reduce correlation.
What dominates:
At equal N=8, the 8-point model capability gap (gpt-oss-120b at 39.3 vs. gpt-oss-20b at 31.0) is 4× larger than any prompt-level effect (±2 points). Scaling N beyond the compute budget backfires.
Where the real gap is:
The model's pass@20 ≈ 45.5, but majority voting peaks at 42: roughly 3.5 points of selection loss. A verifier-based selector could close it. Prompt engineering cannot.
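pass@k figures like the pass@20 above are conventionally computed with the unbiased estimator popularized by Chen et al. (2021); that this paper uses exactly this estimator is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n samples of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem stats: 20 samples drawn, 3 correct.
print(round(pass_at_k(20, 3, 8), 3))   # ≈ 0.807
print(pass_at_k(20, 3, 20))            # 1.0: k = n and c > 0
```

The aggregate pass@20 of a benchmark is then the mean of this quantity over problems; selection loss is the gap between that ceiling and what the voting rule actually recovers.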
