Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
Abstract
Harness self-evolution in LLM agents separates into two capabilities that behave differently: harness-updating effectiveness is roughly flat across model capability tiers, while harness-benefit is non-monotonic, with mid-tier models benefiting most.
LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.
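One way to picture the separation between harness-updating and harness-benefit is as two averages over an evolver-by-solver grid of gains. The sketch below is a minimal illustration under stated assumptions, not the paper's evaluation protocol: the tier names, the grid layout, the function names `gain`, `harness_updating` and `harness_benefit`, and every number in it are hypothetical placeholders invented for the example.

```python
from statistics import mean

# Placeholder numbers invented for illustration only; they are NOT results
# from the paper. Scores are tasks solved out of 100.

# Score of each task-solving model with the original (un-evolved) harness.
BASE_SCORE = {"weak": 20, "mid": 40, "strong": 70}

# Score of each solver when it runs with a harness updated by a given evolver.
# Keys are (evolver, solver) pairs; the tier labels are hypothetical.
EVOLVED_SCORE = {
    ("small", "weak"): 22, ("small", "mid"): 52, ("small", "strong"): 76,
    ("large", "weak"): 23, ("large", "mid"): 53, ("large", "strong"): 75,
}

EVOLVERS = sorted({evolver for evolver, _ in EVOLVED_SCORE})
SOLVERS = list(BASE_SCORE)


def gain(evolver: str, solver: str) -> int:
    """Improvement of `solver` when it uses the harness updated by `evolver`."""
    return EVOLVED_SCORE[(evolver, solver)] - BASE_SCORE[solver]


def harness_updating(evolver: str) -> float:
    """Average gain produced by one evolver's updates, across all solvers."""
    return mean(gain(evolver, solver) for solver in SOLVERS)


def harness_benefit(solver: str) -> float:
    """Average gain one solver obtains from updated harnesses, across all evolvers."""
    return mean(gain(evolver, solver) for evolver in EVOLVERS)


if __name__ == "__main__":
    for evolver in EVOLVERS:
        print(f"harness-updating[{evolver}] = {harness_updating(evolver):.2f}")
    for solver in SOLVERS:
        print(f"harness-benefit[{solver}] = {harness_benefit(solver):.2f}")
```

Averaging over the other axis is only one possible aggregation; it is chosen here because it keeps the evolver's contribution (who wrote the update) apart from the solver's contribution (who has to activate and follow it).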
Community
This paper studies self-evolving LLM agents that improve by updating external harnesses, and separates harness-evolution from base task-solving capability along two dimensions: harness-updating, which writes useful persistent updates, and harness-benefit, which measures whether agents can benefit from those updates in future tasks.
The key findings are twofold. First, stronger base models are not necessarily better harness updaters: evolvers across capability tiers yield surprisingly similar gains, with even the smaller Qwen3.5-9B evolver matching much stronger models such as Claude Opus 4.6. Second, harness-benefit is non-monotonic: weak models benefit little, mid-tier models benefit the most, and strong models benefit less than mid-tier models. The paper further shows that weak models often fail to activate relevant harness artifacts or follow them faithfully.
Overall, the paper suggests that the main bottleneck in self-evolving agents may be less about using the strongest evolver and more about enabling agents to invoke and follow updated harnesses effectively.
Our code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.