arxiv:2606.20515

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Published on Jun 18

Submitted by leoli on Jun 19

Ropedia

Authors:

Shulin Tian ,

Abstract

S-Agent is a spatial reasoning framework that enhances visual language models with temporal memory and hierarchical spatial tools to enable continuous 3D world understanding from multi-view imagery.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, comprising Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories (S-300K) yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
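To make the idea of scene-centric evidence accumulation concrete, here is a minimal Python sketch. It is not the authors' implementation: every name in it (Detection2D, SceneMemory, lift_to_3d, count_objects) is hypothetical, the 2D detections are hand-written rather than produced by a real detector, the lifting step is plain pinhole back-projection, and the camera is assumed to translate without rotating. It only illustrates why merging 3D evidence across frames can give a different count than summing per-frame detections.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Detection2D:
    """Hypothetical 2D grounding result: box centre in pixels plus a depth estimate."""
    label: str
    u: float      # pixel column
    v: float      # pixel row
    depth: float  # metres along the camera z-axis


@dataclass
class SceneMemory:
    """Hypothetical scene state: one 3D point per distinct object seen so far."""
    merge_radius: float = 0.5  # assumed threshold (metres) for "same object"
    objects: list = field(default_factory=list)

    def add(self, label, point):
        """Store the point unless a same-label object already lies within merge_radius."""
        for known_label, known_point in self.objects:
            if known_label == label and math.dist(known_point, point) < self.merge_radius:
                return False
        self.objects.append((label, point))
        return True

    def count(self, label):
        return sum(1 for known_label, _ in self.objects if known_label == label)


def lift_to_3d(det, fx, fy, cx, cy, cam_pos):
    """Pinhole back-projection into world coordinates (identity rotation assumed)."""
    x = (det.u - cx) / fx * det.depth
    y = (det.v - cy) / fy * det.depth
    return (x + cam_pos[0], y + cam_pos[1], det.depth + cam_pos[2])


def count_objects(frames, label, intrinsics):
    """Accumulate evidence over frames; return the scene-level count and a step log."""
    scene = SceneMemory()
    agent_log = []  # stand-in for an agent-side record of reasoning steps
    for index, (cam_pos, detections) in enumerate(frames):
        for det in detections:
            if det.label != label:
                continue
            is_new = scene.add(det.label, lift_to_3d(det, *intrinsics, cam_pos))
            agent_log.append(f"frame {index}: {det.label} {'new' if is_new else 'seen before'}")
    return scene.count(label), agent_log


# Illustrative data: (fx, fy, cx, cy) and two frames, the second taken 1 m to the right.
INTRINSICS = (500.0, 500.0, 320.0, 240.0)
FRAMES = [
    ((0.0, 0.0, 0.0), [Detection2D("chair", 420.0, 240.0, 2.0)]),
    ((1.0, 0.0, 0.0), [Detection2D("chair", 170.0, 240.0, 2.0),
                       Detection2D("chair", 570.0, 240.0, 2.0)]),
]
```

Calling count_objects(FRAMES, "chair", INTRINSICS) on this toy data yields a scene-level count of 2, whereas summing the per-frame detections would give 3, because the chair visible in both frames is merged into a single entry once it is lifted into a shared 3D frame.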

Community

lifuguan

Paper submitter 2 days ago

The first agentic model for Spatial Intelligence.
S-Agent turns perception into action: grounding, reconstructing, and reasoning with tools to solve complex spatial tasks step by step.

jhegedus

1 day ago (edited about 9 hours ago)

My goal is to put time into GPT, via text, using phase parameters. Here are some PDFs I made: https://gpt2pdfsite.vercel.app/

noahml

about 14 hours ago

Neat paper. I like the shift from frame-centric recognition to treating spatial reasoning as spatio-temporal evidence accumulation. It makes a lot of sense that VLMs struggle with 3D space when they are just looking at isolated, static snapshots.

Since this approach is training-free, how much of the performance gain is limited by the quality of the individual spatial tools versus the agent's ability to plan and sequence those tool calls?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/decf3fe8-9ff6-4fde-8d04-49497660ec51