arxiv:2605.30621

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Published on May 28

Submitted by Minhua Lin on Jun 1

Amazon


Authors:

Abstract

Research reveals that harness self-evolution capabilities in LLM agents show unexpected patterns: harness-updating effectiveness is consistent across model capabilities, while harness-benefit follows a non-monotonic trend with mid-tier models performing best.

AI-generated summary

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.
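One way to picture the two capabilities is as the two margins of an evolver-by-solver gain table. The sketch below is only an illustration under assumptions: the abstract does not give the paper's metric definitions, so the `gain`, `harness_updating`, and `harness_benefit` functions, the tier names, and every number are hypothetical toy values, not results from the paper.

```python
from statistics import mean

# Hypothetical toy scores (NOT results from the paper): task success rate of
# each solver tier when run with the unmodified base harness.
base_score = {"weak": 0.20, "mid": 0.40, "strong": 0.70}

# Hypothetical toy scores: success rate of each solver when it runs with a
# harness that was updated by the given evolver. Keys are (evolver, solver).
evolved_score = {
    ("small_evolver", "weak"): 0.22,
    ("small_evolver", "mid"): 0.52,
    ("small_evolver", "strong"): 0.76,
    ("large_evolver", "weak"): 0.23,
    ("large_evolver", "mid"): 0.51,
    ("large_evolver", "strong"): 0.76,
}

evolvers = sorted({evolver for evolver, _ in evolved_score})
solvers = list(base_score)


def gain(evolver: str, solver: str) -> float:
    """Score change for a solver when it uses this evolver's harness."""
    return evolved_score[(evolver, solver)] - base_score[solver]


def harness_updating(evolver: str) -> float:
    """Assumed proxy: mean gain of an evolver's updates across all solvers."""
    return mean(gain(evolver, solver) for solver in solvers)


def harness_benefit(solver: str) -> float:
    """Assumed proxy: mean gain a solver obtains across all evolvers."""
    return mean(gain(evolver, solver) for evolver in evolvers)


if __name__ == "__main__":
    for evolver in evolvers:
        print(f"harness-updating[{evolver}] = {harness_updating(evolver):.3f}")
    for solver in solvers:
        print(f"harness-benefit[{solver}] = {harness_benefit(solver):.3f}")
```

The toy numbers are chosen only to mirror the qualitative pattern the abstract describes: both evolver rows average to about the same gain (a flat harness-updating margin), while the solver columns peak at the mid tier (a non-monotonic harness-benefit margin).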


Community

ventr1c

Paper submitter about 2 hours ago

This paper studies self-evolving LLM agents that improve by updating external harnesses. It separates harness evolution from base task-solving capability along two dimensions: harness-updating, the ability to write useful persistent updates, and harness-benefit, the ability to benefit from those updates in future tasks.

The key findings are twofold. First, stronger base models are not necessarily better harness updaters: evolvers across capability tiers yield surprisingly similar gains, with even the smaller Qwen3.5-9B evolver matching much stronger models such as Claude Opus 4.6. Second, harness-benefit is non-monotonic: weak models benefit little, mid-tier models benefit the most, and strong models benefit less than mid-tier models. The paper further shows that weak models often fail to activate relevant harness artifacts or follow them faithfully.

Overall, the paper suggests that the main bottleneck in self-evolving agents may be less about using the strongest evolver and more about enabling agents to invoke and follow updated harnesses effectively.

Our code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.


