Papers
arxiv:2604.17609

Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity

Published on Apr 19 · Submitted by Leon Engländer on Apr 21

Abstract

LLM-based agents fail to exploit discovered unexpected information despite recognizing it, indicating a lack of environmental curiosity that depends on tools, compute, and training data distribution.

AI-generated summary

LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect on or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments to deliberately expose a task's solution to a model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact with, or exploit, them in only 37-50% of cases. This gap is starkest in AppWorld: agents see documentation stating that a command "returns the complete solution to this task" in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: the tools available in the agent scaffold, test-time compute, and the training data distribution. We find that configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.

Community


LLM agents are assumed to integrate environmental observations into their reasoning. It turns out they don't.

We inject complete solutions into agent environments as a file or API endpoint. Agents discover them in almost every run and ignore them almost always. Starkest example: on AppWorld, gpt-oss-120b sees a CLI command documented as "returns the complete solution to this task" in 97.54% of runs and calls it in 0.53%. Same pattern for GLM-4.7 and other models, across Terminal-Bench, SWE-Bench, and AppWorld.
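As a concrete sketch of the file-based injection setup, the snippet below drops a complete solution into the agent's working directory before the run starts. The filename and wording are illustrative assumptions, not the paper's exact setup.

```python
import tempfile
from pathlib import Path


def inject_solution(workdir: str, solution: str) -> Path:
    """Write the task's full solution where the agent can discover it.

    The filename "SOLUTION.md" and the header text are hypothetical; the
    paper also injects solutions as API endpoints, which is not shown here.
    """
    path = Path(workdir) / "SOLUTION.md"
    path.write_text("# Complete solution to this task\n\n" + solution)
    return path


workdir = tempfile.mkdtemp()  # stand-in for the agent's sandbox
injected = inject_solution(workdir, "run `make fix && make test`")
print(injected.read_text().splitlines()[0])  # → "# Complete solution to this task"
```

Whether the agent ever reads this file (discovery) and whether it then acts on the contents (interaction) are measured separately.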

We call this missing capability environmental curiosity: the ability to recognize and investigate unexpected but relevant observations. It matters because agents operating in novel environments need to catch subtle, unexpected, but highly relevant information to succeed, not just execute memorized patterns. And we find that configurations that maximize environmental curiosity also achieve the best performance on the unmodified benchmarks.

Agent trajectory on AppWorld. Agent runs cli --help, sees a solution command documented as displaying the solution, then ignores it and explores cli simple_note --help instead. 97.54% of runs discover the solution API; 0.53% call it.

Agents Lack Environmental Curiosity

We propose two metrics to measure environmental curiosity: discovery@k (whether the agent surfaces relevant information) and interaction@k (whether the agent acts on it). The gap between the two is consistent across models and benchmarks.

Bar chart comparing discovery@1 and interaction@1 for gpt-oss-120b, GLM-4.7, and Command A across Terminal-Bench, AppWorld, and SWE-Bench. Discovery bars are consistently high; interaction bars are much lower, with AppWorld showing the largest gap.
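The two metrics can be sketched as follows, assuming each run has already been labeled with whether the agent surfaced the injected solution (discovered) and whether it acted on it (interacted); the labeling itself is the hard part and is not shown. The run format and field names are assumptions for illustration.

```python
def metric_at_k(runs: list[dict], key: str, k: int) -> float:
    """Fraction of tasks where at least one of the first k runs has `key` True.

    Each run is a dict like {"task": id, "discovered": bool, "interacted": bool}.
    With key="discovered" this is discovery@k; with key="interacted", interaction@k.
    """
    by_task: dict[str, list[dict]] = {}
    for r in runs:
        by_task.setdefault(r["task"], []).append(r)
    hits = sum(any(r[key] for r in task_runs[:k]) for task_runs in by_task.values())
    return hits / len(by_task)


runs = [
    {"task": "t1", "discovered": True, "interacted": False},
    {"task": "t1", "discovered": True, "interacted": True},
    {"task": "t2", "discovered": True, "interacted": False},
]
print(metric_at_k(runs, "discovered", 1))  # discovery@1 = 1.0
print(metric_at_k(runs, "interacted", 2))  # interaction@2 = 0.5
```

The gap the paper reports is exactly the difference between these two numbers at the same k.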

Three test-time factors shape environmental curiosity

Tool availability. Adding str_replace_editor (the default SWE-agent editing tool) alongside bash increases pass@1 but consistently reduces interaction with discovered solutions. Agents default to learned tool-specific patterns rather than examining their environment.

Two line charts on SWE-Bench. Left: pass@1 increases when adding str_replace_editor to bash. Right: probability of interaction given discovery decreases when the editor is added, across all models and scaffolds.

Reasoning budget. Increasing gpt-oss-120b's reasoning effort from low to high triples interaction@1. This is not an artifact of better discovery, since discovery is consistently high: the probability of interaction given discovery rises from 17.65% (low) to 45.69% (high).

Line chart of interaction@n on Terminal-Bench for gpt-oss-120b at low, medium, and high reasoning. Higher reasoning budgets yield substantially higher interaction rates.

Prompting. Explicit instructions to explore the environment improve both interaction and pass@1. The prompt that maximizes interaction is also the best-performing prompt on the unmodified benchmark.

Narrow fine-tuning suppresses curiosity

We fine-tune the same base model on three task distributions and compare. Narrow in-distribution training reduces curiosity: on AppWorld with injected solutions, AppWorld-SFT achieves higher pass@1 than the broader T-Bench-SFT (44.2 vs 34.5) but lower interaction@10 (26.9 vs 41.5). Narrow training compresses the solution space the agent explores. Curiosity also does not transfer across domains: on each solution-injected benchmark, the in-domain model achieves higher interaction rates and better pass@10 scaling than the out-of-domain one. The same pattern appears on the original, unmodified benchmarks: narrow training wins at pass@1, broader training wins at pass@k.

Two pass@n curves on the unmodified benchmarks. Left (AppWorld): narrow-trained AppWorld-SFT wins at pass@1 but is overtaken by broader T-Bench-SFT at higher k. Right (Terminal-Bench): T-Bench-SFT outperforms AppWorld-SFT across all k.
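For reference, pass@k curves like these are typically computed with the standard unbiased estimator of Chen et al. (2021): given n samples per task of which c are correct, pass@k = 1 − C(n−c, k) / C(n, k). This is a generic sketch of that estimator, not the paper's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct: a single draw succeeds with probability 3/10.
print(round(pass_at_k(n=10, c=3, k=1), 4))  # → 0.3
```

Averaging this quantity over tasks gives the benchmark-level pass@k plotted above.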

Discussion

Current agents run the ReAct loop:

Action → Observation → Reasoning → Next Action

Environmental curiosity requires reflecting on whether observations fit the agent's current model of the environment:

Action → Observation → Reasoning and reflecting on observations → Next Action
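The contrast between the two loops can be sketched schematically. Every name here (the llm and reflect callables, the environment interface) is an illustrative stub, not an interface from the paper.

```python
def react_loop(task, env, llm, max_turns=20):
    """Standard ReAct: reason over the observation, emit the next action."""
    obs = env.reset(task)
    for _ in range(max_turns):
        action = llm(task=task, observation=obs)  # Reasoning -> Next Action
        obs = env.step(action)                    # Action -> Observation


def curious_react_loop(task, env, llm, reflect, max_turns=20):
    """Reflective variant: first ask whether the observation fits the agent's
    current model of the environment, then let that judgment steer the action."""
    obs = env.reset(task)
    for _ in range(max_turns):
        surprise = reflect(task=task, observation=obs)  # does obs fit expectations?
        action = llm(task=task, observation=obs, surprise=surprise)
        obs = env.step(action)


# Minimal demo with stub components.
class DummyEnv:
    def __init__(self):
        self.log = []

    def reset(self, task):
        return "initial observation"

    def step(self, action):
        self.log.append(action)
        return f"result of {action}"


env = DummyEnv()
curious_react_loop(
    task="demo",
    env=env,
    llm=lambda task, observation, surprise=None: "investigate" if surprise else "continue",
    reflect=lambda task, observation: "unexpected" in observation,
    max_turns=3,
)
print(env.log)  # → ['continue', 'continue', 'continue']
```

The difference is a single extra step per turn, but the paper's SFT experiments below suggest that getting models to actually use it is an open problem.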

Even with all test-time factors jointly optimized, agents ignore discovered solutions in the majority of trials. The gap is not just a matter of inference-time configuration; it is inherent to how LLMs are trained. We see three main open questions:

  • Does post-training suppress environmental curiosity that pre-training may produce, or does it never emerge? Measuring this in base models is hard because curiosity can only be observed through agentic behavior.
  • We tried three SFT setups to teach the reflective loop (curious first turns via rejection sampling, mid-trajectory file removal, masked adversarial turns). None worked. Training for environmental curiosity is an open problem.
  • Outcome-driven metrics like pass@k reward rigid plan execution the same as adaptive reasoning. Process-oriented metrics that assess whether agents ground reasoning in observations are a necessary complement.

📜 https://arxiv.org/abs/2604.17609


Work by Cohere ❤️


