Papers
arxiv:2603.27844

Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Published on Apr 16 · Submitted by Natapong Nitarach (Schwyter) on Apr 17

Abstract

Majority voting improves mathematical reasoning but is limited by correlated errors; on AIMO 3, model capability matters far more than prompt-level interventions such as assigning diverse reasoning strategies.

AI-generated summary

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to assign different reasoning strategies to different voters. This approach, Diverse Prompt Mixer, is tested on the AIMO 3 competition setup: 3 models, 23+ experiments, 50 IMO-level problems, one H100 80 GB GPU, and a 5-hour limit. Every prompt-level intervention fails. High-temperature sampling already decorrelates errors, and weaker strategies reduce accuracy more than they reduce correlation. Across an 8-point capability gap at equal N=8, and under every optimization tested, model capability dominates. The gap between the best majority-vote score (42/50) and pass@20 (~45.5) is selection loss, not prompt loss. A verifier-based selector could close it; prompt engineering cannot.
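To make the baseline concrete, here is a minimal sketch (not the paper's code) of majority voting over N sampled answers. The tie-breaking behavior and the integer answer format are illustrative assumptions:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among N attempts.

    Ties are broken arbitrarily by Counter ordering; a real pipeline
    would normalize answers (e.g., parse \\boxed{...}) before voting.
    """
    return Counter(answers).most_common(1)[0][0]

# Example: 8 attempts on one problem. If errors are correlated,
# the same wrong answer can repeat and outvote the correct one.
attempts = [41, 41, 41, 17, 17, 17, 17, 17]
print(majority_vote(attempts))  # 17 wins, even if 41 is the true answer
```

This is why correlated errors matter: majority voting only helps when wrong answers are spread across many distinct values rather than concentrated on one.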

Community

Paper author · Paper submitter


[Figure 3: p vs. score]

Diverse Prompt Mixer assigns different reasoning strategies to majority-voting members to decorrelate errors. Tested on 50 IMO-level problems (1×H100, 5-hour limit, 3 models, 23+ experiments). It does not work.

Why it fails:
High-temperature sampling already pushes pairwise error correlation to zero or below (mean ρ̂ = −0.348 across 19 computable points). There is no correlation headroom left. Diverse prompts reduce per-attempt accuracy more than they reduce correlation.
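A hedged sketch of how a pairwise error correlation like ρ̂ could be computed between two voters. The paper's exact estimator may differ; treating each voter as a binary error indicator (1 = wrong, 0 = right) per problem is an assumption here:

```python
import math

def pairwise_error_correlation(errors_a, errors_b):
    """Pearson correlation between two voters' binary error indicators
    (1 = wrong, 0 = right) over the same set of problems."""
    n = len(errors_a)
    ma = sum(errors_a) / n
    mb = sum(errors_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(errors_a, errors_b)) / n
    sa = math.sqrt(sum((a - ma) ** 2 for a in errors_a) / n)
    sb = math.sqrt(sum((b - mb) ** 2 for b in errors_b) / n)
    if sa == 0 or sb == 0:
        return float("nan")  # undefined if a voter is always right or always wrong
    return cov / (sa * sb)

# Negative correlation means the voters fail on *different* problems,
# which is exactly what majority voting wants.
a = [1, 0, 1, 0, 0, 1]
b = [0, 1, 0, 1, 1, 0]
print(pairwise_error_correlation(a, b))  # -1.0 (perfectly anti-correlated)
```

A mean ρ̂ of −0.348 means errors are already anti-correlated on average, so there is no remaining correlation for prompt diversity to remove.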

What dominates:
At equal N=8, the 8-point model capability gap (gpt-oss-120b at 39.3 vs. gpt-oss-20b at 31.0) is 4× larger than the effect of any prompt optimization (±2 points). Scaling N beyond the compute budget backfires.

Where the real gap is:
The model's pass@20 ≈ 45.5, but majority voting peaks at 42: roughly 3.5 points of selection loss. A verifier-based selector could close it. Prompt engineering cannot.
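For reference, pass@k is commonly estimated with the standard combinatorial estimator sketched below (a general formula, not this paper's code): given n total samples of which c are correct, it gives the probability that a random size-k subset contains at least one correct sample.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn, c: correct samples, k: subset size.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers for illustration: 20 samples, 5 correct.
print(pass_at_k(20, 5, 1))   # 0.25
print(pass_at_k(20, 5, 20))  # 1.0
```

Pass@k measures whether a correct answer exists anywhere in the sample pool; majority voting measures whether it can be selected. The difference between the two is exactly the selection loss described above.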


Get this paper in your agent:

hf papers read 2603.27844
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
