arxiv:2603.03143

Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Published on Mar 3 · Submitted by JiYuan Wang on Mar 11
#1 Paper of the day
Abstract

RL3DEdit uses reinforcement learning with rewards from a 3D foundation model to achieve multi-view consistent 3D editing from 2D editing priors.

AI-generated summary

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
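The reward design described in the abstract, combining VGGT's output confidence maps with pose estimation errors, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the function name `consistency_reward`, the weighting scheme, and the assumption that confidence maps and pose errors arrive as plain arrays are all hypothetical.

```python
import numpy as np

def consistency_reward(confidence_maps, pose_errors, alpha=1.0, beta=1.0):
    """Hypothetical reward combining two signals from a 3D verifier:
    - confidence_maps: (V, H, W) per-pixel confidences in [0, 1] for V edited views
    - pose_errors:     (V,) non-negative per-view pose estimation errors
    Higher confidence and lower pose error yield a larger scalar reward.
    """
    conf_term = float(np.mean(confidence_maps))  # reward confident geometry
    pose_term = float(np.mean(pose_errors))      # penalize inconsistent poses
    return alpha * conf_term - beta * pose_term

# Toy check: a confident, low-error edit scores above a shaky, high-error one.
good = consistency_reward(np.full((4, 8, 8), 0.9), np.zeros(4))
bad = consistency_reward(np.full((4, 8, 8), 0.3), np.full(4, 0.5))
assert good > bad
```

In an RL loop, a scalar like this would score each batch of edited views and drive a policy-gradient update of the 2D editor; the paper's actual reward terms and optimization details may differ.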

Community

Paper author · Paper submitter (edited about 21 hours ago)

"While generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning as a feasible solution."
-- RL3DEdit

Paper: https://arxiv.org/abs/2603.03143

Project Page: https://amap-ml.github.io/RL3DEdit/

Code: https://github.com/AMAP-ML/RL3DEdit


Good paper!
But if our 2D editing models are now being supervised entirely by the confidence maps of frozen 3D foundation models (like VGGT), aren't we just shifting the bottleneck? While the authors note that data-driven verifiers are harder to "reward-hack" than traditional methods, how can we prevent the RL policy from eventually exploiting the verifier's blind spots?

And what happens to our edits when the 3D verifier itself has inherent geometric biases for out-of-distribution scenes?

One thing I'm a bit worried about: what happens when VGGT hits tricky, highly reflective textures?
If the internal pose estimator gets tripped up there, I'm concerned the RL policy might just take the easy way out. Could it end up spawning adversarial artifacts just to game the confidence score, rather than actually learning true 3D consistency?

Paper author

Thank you for your interest and insightful comments! You are right. In our experiments, we indeed observed that in certain scenarios, VGGT does not output stable confidence maps that reflect 3D consistency as we might hope. However, the priors VGGT learned from millions of training images remain exceptionally powerful. While imperfect, they are more than sufficient to drive meaningful progress in 3D editing, especially given the current scarcity of 3D data.

Our empirical results demonstrate that the RL policy can achieve 3D-consistent outputs. Regarding the more complex corner cases you mentioned, we believe addressing them may be slightly premature for the field right now. Current 3D editing models still struggle to execute complex editing instructions even in simple scenes. It may be more practical to first mature the foundational capabilities of 3D editing before tackling these edge cases.

Thank you!

