arxiv:2606.08952

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Published on Jun 8 · Submitted by RSW on Jun 15

Abstract

The AlloSpatial framework enhances spatial reasoning in foundation models by converting egocentric observations into structured allocentric representations and enabling reliable spatial cognition through cognitive mapping and tool-use reasoning.

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees (ASTs) and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
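To make the egocentric-to-allocentric transformation concrete, here is a minimal 2D sketch. It assumes each camera pose (position and heading) in a shared world frame is already known. The function name `ego_to_allo` and the axis conventions are illustrative assumptions, not taken from the paper or from World2Mind.

```python
import math

def ego_to_allo(point_ego, cam_pose):
    """Map a 2D point from a camera (egocentric) frame to the world (allocentric) frame.

    Assumed convention (hypothetical, not from the paper):
      - egocentric x points forward, y points to the left of the camera;
      - cam_pose is (x, y, heading), heading in radians, counter-clockwise from world +x.
    """
    px, py = point_ego
    cx, cy, theta = cam_pose
    # Rotate the local offset by the camera heading, then translate by the camera position.
    wx = cx + px * math.cos(theta) - py * math.sin(theta)
    wy = cy + px * math.sin(theta) + py * math.cos(theta)
    return (wx, wy)

# The same object observed from two different viewpoints resolves to one world location.
obs_a = ego_to_allo((2.0, 0.0), (0.0, 0.0, 0.0))           # 2 m ahead, camera at origin facing +x
obs_b = ego_to_allo((1.0, 0.0), (2.0, -1.0, math.pi / 2))  # 1 m ahead, camera at (2, -1) facing +y
```

Both observations land at approximately (2, 0) in the world frame, which is the property that lets per-view detections be merged into a single map entry.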

Community

Paper submitter

Spatial reasoning is one of the most stubborn blind spots in today's multimodal foundation models. A system that can write code, analyze images, and hold a fluent conversation will still stumble on something as basic as how far the door is from the telephone, or whether turning left will bring you closer to the sofa. AlloSpatial, a new framework from researchers at Beihang University, traces this failure to a single missing ingredient: models reason from fleeting, first-person snapshots of a scene, but they never build the stable, map-like, bird's-eye representation that brains rely on to find their way through the world.

The team's answer is to give models that missing mental map. At the center of AlloSpatial is World2Mind, a plug-and-play "cognitive mapping sandbox" that turns egocentric video or a handful of multi-view photos into structured allocentric priors. Its key data structure, the Allocentric-Spatial Tree, compresses noisy 3D reconstructions into a compact, machine-readable layout of objects — their positions, footprints, orientations, and how they nest inside one another — alongside route maps that capture what is walkable and where the camera has been. Crucially, the model does not simply trust these maps. A three-stage Spatial Reasoning Harness makes it decide when a question actually needs spatial tooling, gather visual and map evidence through separate channels, and then cross-examine the two for conflicts before committing to an answer.
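As a rough illustration of what such a tree could look like, the sketch below stores each object's position, footprint, orientation, and nested children, then answers two queries of the kind mentioned above: distance between objects and left/right relative direction. All class names, field names, and the example room are hypothetical. The actual Allocentric-Spatial Tree schema is not specified on this page.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ASTNode:
    """One object in a hypothetical allocentric tree (field names are assumptions)."""
    name: str
    position: tuple            # (x, y) centre in world metres
    footprint: tuple           # (width, depth) in metres
    orientation: float = 0.0   # heading in radians, counter-clockwise from world +x
    children: list = field(default_factory=list)  # objects nested in or on this one

    def find(self, name):
        """Depth-first search for a node by name; returns None if absent."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

def distance(tree, a, b):
    """Euclidean distance between the centres of two named objects."""
    return math.dist(tree.find(a).position, tree.find(b).position)

def relative_direction(tree, standing_at, facing, target):
    """Side on which `target` lies when standing at one object and facing another."""
    s, f, t = (tree.find(n).position for n in (standing_at, facing, target))
    fx, fy = f[0] - s[0], f[1] - s[1]   # facing vector
    tx, ty = t[0] - s[0], t[1] - s[1]   # vector to the target
    cross = fx * ty - fy * tx           # sign of the 2D cross product gives the side
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "ahead or behind"

# A made-up room: the telephone is nested under the table it sits on.
room = ASTNode("room", (2.0, 2.0), (5.0, 5.0), children=[
    ASTNode("door", (0.0, 0.0), (1.0, 0.1)),
    ASTNode("sofa", (4.0, 0.0), (2.0, 0.9)),
    ASTNode("table", (2.0, 3.0), (1.2, 0.8), children=[
        ASTNode("telephone", (2.0, 3.0), (0.2, 0.2)),
    ]),
])

door_to_phone = distance(room, "door", "telephone")
side = relative_direction(room, "door", "sofa", "telephone")
```

In this toy layout, `door_to_phone` is the square root of 13 (about 3.61 m) and `side` is `"left"`. Because both answers come from geometry rather than pixels, a harness could compare them against what the visual channel reports before committing to an answer.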

The results are striking. As a training-free add-on, AlloSpatial lifts frontier systems including GPT-5.2, Claude-4.6-Opus, and Gemini-3-Pro by roughly 5 to 18 points on the VSI-Bench and MindCube spatial benchmarks, with the biggest gains exactly where you would expect — relative direction, route planning, and viewpoint-dependent reasoning. More provocatively, when the same structured maps are baked into small open-weight Qwen3-VL agents through cold-start fine-tuning and reinforcement learning, the resulting 4B and 8B models outscore far larger general-purpose models and purpose-built spatial systems, while answering in roughly a third as many tokens. Even with the pictures taken away entirely, the spatial trees alone carry enough information to support competent reasoning. It is a compelling demonstration that the path to spatially capable AI may run less through ever-bigger visual encoders and more through giving models the right kind of spatial memory.


