Papers
arxiv:2604.19945

Visual Reasoning through Tool-supervised Reinforcement Learning

Published on Apr 21 · Submitted by Qihua Dong on Apr 23
Abstract

A novel Tool-supervised Reinforcement Learning framework is presented that enables multimodal large language models to effectively learn tool-use for complex visual reasoning through a two-stage curriculum approach.

AI-generated summary

In this paper, we investigate how Multimodal Large Language Models can effectively master tool use for solving complex visual reasoning tasks. To this end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework that provides direct tool supervision for more effective tool-use learning. We focus on a set of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a reinforcement learning curriculum in which the first stage is optimized solely with a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while tool calls remain available. In this way, the model masters tool calling before it uses tools to complete visual reasoning tasks, avoiding potential optimization conflicts among these heterogeneous objectives. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
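To make the tool set concrete, here is a hypothetical pure-Python sketch of the "simple, native, and interpretable" visual tools named in the abstract (zoom-in, rotate, flip, draw point/line), treating an image as a 2-D list of pixel values. All function names and signatures are illustrative assumptions, not the authors' implementation, which would operate on real image tensors.

```python
# Hypothetical sketch of the paper's native visual tools, operating on
# an image represented as a 2-D list of pixel values. Names/signatures
# are assumptions for illustration, not the authors' code.

def zoom_in(img, top, left, bottom, right):
    """Return the cropped sub-region (no upscaling in this sketch)."""
    return [row[left:right] for row in img[top:bottom]]

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]

def draw_point(img, x, y, value=1):
    """Mark a single pixel, returning a modified copy."""
    out = [row[:] for row in img]
    out[y][x] = value
    return out

def draw_line(img, x0, y0, x1, y1, value=1):
    """Draw a straight line between two points (Bresenham's algorithm)."""
    out = [row[:] for row in img]
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx, sy = (1 if x0 < x1 else -1), (1 if y0 < y1 else -1)
    err = dx + dy
    x, y = x0, y0
    while True:
        out[y][x] = value
        if (x, y) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x += sx
        if e2 <= dx:
            err += dx
            y += sy
    return out
```

Because each tool is a small deterministic transform, whether a call is well-formed (valid name, in-bounds arguments) is easy to check automatically, which is what makes the tool supervision "easy to collect."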

Community


TL;DR: Most "think-with-image" / tool-use MLLMs end up with short reasoning chains and a single tool (usually zoom-in). ToolsRL uses a two-stage RL curriculum to teach the model to call a small set of native visual tools (zoom / rotate / flip / draw point/line) before training it to compose them, which yields longer, more flexible tool chains that actually solve the task.

Current tool-augmented MLLMs tend to collapse to one or two zoom-ins and call it a day, because mixing "learn to call the tool" with "learn to answer correctly" creates conflicting gradients during RL. We decouple the two: stage 1 trains tool calling under tool-specific rewards, stage 2 trains accuracy while allowing tool use. The result is models that compose multiple tools per query (see DeepEyes vs. ToolsRL below) and reason longer when needed. We hope this could be a useful data point for the agentic-MLLM community.
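The two-stage decoupling described above can be sketched as a reward schedule. Everything below is an illustrative assumption (trajectory format, reward shapes, function names), not the paper's actual reward implementation:

```python
# Hypothetical sketch of the two-stage ToolsRL reward schedule.
# Trajectory format, reward shapes, and names are assumptions.

TOOLS = {"zoom_in", "rotate", "flip", "draw_point", "draw_line"}

def tool_reward(trajectory):
    """Stage 1: reward the fraction of well-formed calls to known tools,
    regardless of whether the final answer is correct."""
    calls = [s for s in trajectory if s["type"] == "tool_call"]
    if not calls:
        return 0.0
    valid = sum(1 for c in calls
                if c["name"] in TOOLS and c.get("args") is not None)
    return valid / len(calls)

def accuracy_reward(trajectory, gold_answer):
    """Stage 2: reward final-answer correctness; tool calls are still
    allowed but no longer directly rewarded."""
    final = trajectory[-1]
    return 1.0 if (final["type"] == "answer"
                   and final["text"] == gold_answer) else 0.0

def curriculum_reward(trajectory, gold_answer, stage):
    """Switch objectives between curriculum stages, so tool-calling
    gradients and accuracy gradients never compete in the same phase."""
    if stage == 1:
        return tool_reward(trajectory)
    return accuracy_reward(trajectory, gold_answer)
```

The point of the split is that stage 1 never penalizes a wrong answer and stage 2 never scores tool syntax, so the two objectives cannot pull the policy in conflicting directions within a single RL phase.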

(Figure: DeepEyes vs. ToolsRL tool-use comparison; image rebuttal_compare_deepeyes-1)


