arxiv:2604.13517

Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO

Published on May 30

Submitted by Jing Sun on May 26


Authors:

Jing Sun

Abstract

Research demonstrates how temporal routing in reinforcement learning can create numerical shortcuts instead of reliable temporal abstractions, identifying vulnerabilities in differentiable and heuristic routing approaches and proposing target decoupling as a structural solution.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Temporal credit assignment in reinforcement learning is often approached by introducing value estimates at multiple discount factors. A natural next step is to let the actor dynamically route among these temporal heads, using either differentiable attention or heuristic uncertainty weights. This paper argues that such routing can create a numerical shortcut rather than a reliable temporal abstraction. We study this issue in a controlled PPO setting on LunarLander-v2, using the environment as a visual sandbox for diagnosing failure modes. First, we formalize Surrogate Objective Hacking: a differentiable softmax router exposed to the PPO surrogate receives a direct gradient toward advantage heads that are numerically favorable for the current update, even when this routing change does not correspond to improved physical control. Because unnormalized advantages at different discount factors have different effective scales, this creates a scale-discrepancy vulnerability. Second, we identify the Paradox of Temporal Uncertainty in gradient-free error-based routing: short-horizon heads can receive the largest routing share because their prediction targets are easier, even when they are less aligned with delayed task success. As a structural response, we study Target Decoupling: the critic may retain multi-timescale auxiliary heads, but the actor is updated only with the long-horizon advantage. Target Decoupling is not presented as a broad performance booster; in this run set it removes the exploitable actor-side routing pathway and improves the observed worst-seed return. Code is available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
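The scale-discrepancy argument can be made concrete with a toy calculation. The sketch below is not taken from the paper's code. It assumes a single hypothetical episode with a large delayed bonus, uses the raw return at t=0 as a stand-in for each head's unnormalized advantage, and assumes the importance ratio is 1, so the surrogate reduces to the softmax-weighted mix of head advantages. Under those assumptions the gradient with respect to router logit k is w_k (A_k - sum_j w_j A_j), which depends only on the numerical size of each head's advantage.

```python
import math

def discounted_returns(rewards, gamma):
    """Monte Carlo discounted return G_t for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_logit_gradient(logits, head_advantages):
    """Gradient of sum_k softmax(z)_k * A_k with respect to the logits z."""
    w = softmax(logits)
    mixed = sum(wk * ak for wk, ak in zip(w, head_advantages))
    return [wk * (ak - mixed) for wk, ak in zip(w, head_advantages)]

# Hypothetical toy episode: small per-step reward, then a large delayed bonus.
rewards = [0.1] * 49 + [100.0]
gammas = [0.5, 0.99, 0.999]

# Stand-in for unnormalized advantages: return at t=0 with a zero baseline.
head_advantages = [discounted_returns(rewards, g)[0] for g in gammas]

# Start from a uniform router and inspect which logits the surrogate pushes up.
grad = router_logit_gradient([0.0, 0.0, 0.0], head_advantages)

for g, a, d in zip(gammas, head_advantages, grad):
    print(f"gamma={g}: advantage={a:.2f}, d(surrogate)/d(logit)={d:+.2f}")
```

In this toy setup the three heads see the same trajectory, yet their advantages differ by orders of magnitude, and the router gradient is positive only for heads whose advantage exceeds the current mix. Nothing in that gradient refers to whether the lander is actually controlled better.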


Community

ben-dlwlrma

Paper author Paper submitter 7 days ago

Motivation: Fusing multi-timescale signals in RL can trigger optimization pathologies. We identify two specific failure modes in Actor-Critic architectures: Surrogate Objective Hacking (policy gradients exploiting dynamic routing weights) and the Paradox of Temporal Uncertainty (myopic degeneration under gradient-free routing).
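To illustrate the second failure mode, here is a minimal sketch of error-based routing. The exact routing rule used in the paper is not shown on this page, so the inverse-error weighting below is an assumption, as are the toy outcomes and the 49-step delay. Each head predicts the return at the start state with a constant guess, and the episode ends in either a large bonus or a large penalty.

```python
def inverse_error_weights(errors, eps=1e-8):
    """Normalize 1/(error + eps) so that low-error heads get more weight.

    This weighting rule is an assumption made for illustration.
    """
    inv = [1.0 / (e + eps) for e in errors]
    total = sum(inv)
    return [v / total for v in inv]

def start_state_return(outcome, gamma, delay=49):
    """Return at t=0 when the only reward is `outcome`, received after `delay` steps."""
    return (gamma ** delay) * outcome

gammas = [0.5, 0.99, 0.999]
outcomes = [100.0, -100.0]  # hypothetical: landing bonus vs. crash penalty

errors = []
for g in gammas:
    targets = [start_state_return(o, g) for o in outcomes]
    prediction = sum(targets) / len(targets)  # best constant guess for this head
    mse = sum((t - prediction) ** 2 for t in targets) / len(targets)
    errors.append(mse)

weights = inverse_error_weights(errors)

for g, e, w in zip(gammas, errors, weights):
    print(f"gamma={g}: mse={e:.3g}, routing weight={w:.6f}")
```

The short-horizon head has almost no error because its target barely depends on the delayed outcome, so it takes nearly all of the routing weight while carrying almost no information about landing versus crashing.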

Method: We propose a Target Decoupling architecture ("Representation over Routing"). We remove routing aggregation from the Actor. Instead, the Critic fits multiple temporal horizons as an auxiliary representation learning task, while the Actor updates solely on the long-term advantage.
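A minimal per-sample sketch of this separation follows. It is an illustration under stated assumptions, not the repository's implementation: the function names, the `aux_weight` coefficient, the squared-error critic loss, and the use of `target - value` as the advantage are all hypothetical choices. The only property it is meant to show is that the actor term reads the long-horizon head and nothing else.

```python
def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def decoupled_losses(ratio, head_values, head_targets, aux_weight=0.5):
    """Return (actor_loss, critic_loss) for one sample.

    head_values / head_targets are ordered by discount factor; the last
    entry is the long-horizon head. aux_weight is a hypothetical coefficient.
    """
    long_idx = len(head_values) - 1

    # Critic: every head regresses to its own target; short-horizon heads
    # act as auxiliary representation-learning tasks.
    critic_loss = 0.0
    for k, (value, target) in enumerate(zip(head_values, head_targets)):
        weight = 1.0 if k == long_idx else aux_weight
        critic_loss += weight * (value - target) ** 2

    # Actor: only the long-horizon advantage enters the surrogate, so there
    # is no routing weight for the policy gradient to exploit.
    advantage = head_targets[long_idx] - head_values[long_idx]
    actor_loss = -ppo_clip_term(ratio, advantage)
    return actor_loss, critic_loss

# Changing the short-horizon heads alters the critic loss but not the actor loss.
targets = [0.3, 60.0, 90.0]
actor_a, critic_a = decoupled_losses(1.1, [0.2, 50.0, 80.0], targets)
actor_b, critic_b = decoupled_losses(1.1, [5.0, 10.0, 80.0], targets)
print(actor_a == actor_b, critic_a != critic_b)
```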

Results: On the LunarLander-v2 delayed-reward benchmark, our decoupled agent avoids the "hovering for survival" local optimum. It eliminates policy collapse and stably surpasses the "Environment Solved" threshold without hyperparameter hacking.

Code and reproducible scripts are open-sourced in the repo.

avahal

7 days ago

the core idea that really sticks is target decoupling: keep multi-timescale predictions on the critic for auxiliary representation learning, while the actor updates are driven only by long-horizon advantages. this separation seems to block the gradient hijacking channel they expose with surrogate objectives, which explains why naive fusion often destabilizes learning in delayed reward tasks. i’d love to see a more explicit ablation on how the critic's auxiliary losses interact with value variance, since a couple of minor differences in those terms can swing policy stability. the arxivlens breakdown helped me parse the method details and the exact routing vs target setup, a nice quick walkthrough when you skim the paper (https://arxivlens.com/PaperView/Details/representation-over-routing-overcoming-surrogate-hacking-in-multi-timescale-ppo-256-832e3787). one question: would removing any long-horizon signal from the actor completely break performance on harder benchmarks, or is there a minimal long-horizon component that still preserves stability in noisy environments?

ben-dlwlrma

Paper author 7 days ago

Thanks for reading and sharing the ArxivLens summary.

Regarding the variance ablation: I agree. While Figure 6 shows the aggregated value loss dropping, explicitly plotting the variance of the long-term advantage under different auxiliary weights would better illustrate the scaffolding effect. It's a solid suggestion for a future revision.

To your question about the Actor's horizon: completely removing the long-horizon signal (e.g., dropping gamma to 0.5) definitely breaks performance. The agent falls back into the hovering local optimum because it loses the sparse delayed reward signal.

However, your intuition about noisy environments is correct. In highly stochastic tasks, gamma=0.999 might introduce too much variance to the Actor. There should be a "minimal effective horizon" (e.g., 0.95 or 0.99) that balances capturing the delayed reward with resisting environmental noise. Target decoupling is useful specifically here because it allows tuning this Actor horizon purely for the bias-variance tradeoff, without degrading the Critic's multi-timescale state representation.
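For intuition about the discount factors mentioned above, a common rule of thumb (not something stated in the paper) treats 1 / (1 - gamma) as the approximate number of steps over which rewards still carry meaningful weight. The helper name below is hypothetical.

```python
def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma) for a discount factor in [0, 1)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)

horizons = {gamma: effective_horizon(gamma) for gamma in (0.5, 0.95, 0.99, 0.999)}

for gamma, steps in horizons.items():
    print(f"gamma={gamma}: about {steps:.0f} steps")
```

By this estimate, gamma=0.5 looks only about 2 steps ahead, while 0.95, 0.99, and 0.999 correspond to roughly 20, 100, and 1000 steps. That is consistent with the reply above: a very short horizon cannot see a delayed landing reward, and a very long one accumulates more noise per estimate.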



Get this paper in your agent:

hf papers read 2604.13517

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2604.13517 in a dataset README.md to link it from this page.


Spaces citing this paper 1

Collections including this paper 1