Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO
Abstract
Research demonstrates how temporal routing in reinforcement learning can create numerical shortcuts instead of reliable temporal abstractions, identifying vulnerabilities in differentiable and heuristic routing approaches and proposing target decoupling as a structural solution.
Temporal credit assignment in reinforcement learning is often approached by introducing value estimates at multiple discount factors. A natural next step is to let the actor dynamically route among these temporal heads, using either differentiable attention or heuristic uncertainty weights. This paper argues that such routing can create a numerical shortcut rather than a reliable temporal abstraction. We study this issue in a controlled PPO setting on LunarLander-v2, using the environment as a visual sandbox for diagnosing failure modes. First, we formalize Surrogate Objective Hacking: a differentiable softmax router exposed to the PPO surrogate receives a direct gradient toward advantage heads that are numerically favorable for the current update, even when this routing change does not correspond to improved physical control. Because unnormalized advantages at different discount factors have different effective scales, this creates a scale-discrepancy vulnerability. Second, we identify the Paradox of Temporal Uncertainty in gradient-free error-based routing: short-horizon heads can receive the largest routing share because their prediction targets are easier, even when they are less aligned with delayed task success. As a structural response, we study Target Decoupling: the critic may retain multi-timescale auxiliary heads, but the actor is updated only with the long-horizon advantage. Target Decoupling is not presented as a broad performance booster; in this run set it removes the exploitable actor-side routing pathway and improves the observed worst-seed return. Code is available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
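As a rough illustration of the routing gradient described in the abstract, the sketch below differentiates a simplified, unclipped surrogate term, ratio * sum_k w_k * A_k with w = softmax(logits), with respect to the router logits. The function names, the three heads, and the advantage values are hypothetical and are not taken from the paper's code. The point is only that the logit of the head with the largest raw advantage receives the largest positive gradient, whether or not that head reflects better control.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_logit_grads(logits, head_advantages, ratio=1.0):
    """Gradient of ratio * sum_k w_k * A_k with respect to each router logit.

    For a softmax router this is ratio * w_j * (A_j - A_mix), where A_mix
    is the routed (mixed) advantage.
    """
    w = softmax(logits)
    mixed = sum(wk * ak for wk, ak in zip(w, head_advantages))
    return [ratio * wk * (ak - mixed) for wk, ak in zip(w, head_advantages)]

# Hypothetical unnormalized advantages for heads at gamma = 0.5, 0.9, 0.999.
# The long-horizon head has the largest scale simply because it sums more reward.
head_advantages = [0.2, 1.5, 12.0]
grads = router_logit_grads([0.0, 0.0, 0.0], head_advantages)
print([round(g, 3) for g in grads])  # [-1.456, -1.022, 2.478]
```

The gradients sum to zero, so any mass the router gives to the large-scale head is taken from the others. Normalizing each head's advantages before routing would change these magnitudes, which is why the scale discrepancy matters.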
Community
Motivation: Fusing multi-timescale signals in RL can trigger optimization pathologies. We identify two specific failure modes in Actor-Critic architectures: Surrogate Objective Hacking (policy gradients exploiting dynamic routing weights) and the Paradox of Temporal Uncertainty (myopic degeneration under gradient-free routing).
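The second failure mode can be seen in a toy inverse-error router. The weighting rule below (weights proportional to 1 / error) is an assumption chosen for illustration and may differ from the rule studied in the paper; the per-head error values are invented.

```python
def inverse_error_weights(errors, eps=1e-8):
    """Gradient-free routing: heads with smaller prediction error get more weight."""
    inv = [1.0 / (e + eps) for e in errors]
    total = sum(inv)
    return [v / total for v in inv]

# Hypothetical value-prediction errors for heads at gamma = 0.5, 0.9, 0.999.
# Short-horizon targets are easier to fit, so their errors are smaller.
hypothetical_errors = [0.05, 0.4, 3.0]
weights = inverse_error_weights(hypothetical_errors)
print([round(w, 2) for w in weights])  # [0.88, 0.11, 0.01]
```

Under these made-up numbers the shortest-horizon head takes about 88% of the routing share, even though it carries the least information about a delayed landing reward.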
Method: We propose a Target Decoupling architecture ("Representation over Routing"). We remove routing aggregation from the Actor. Instead, the Critic fits multiple temporal horizons as an auxiliary representation learning task, while the Actor updates solely on the long-term advantage.
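A minimal sketch of the decoupling idea follows, assuming Monte Carlo returns and a plain return-minus-value advantage in place of whatever estimator the paper actually uses (for example GAE). All names, such as `decoupled_targets` and `values_by_gamma`, are hypothetical. Every head contributes to the critic loss, but only the head at `actor_gamma` produces the advantage passed to the actor.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo discounted return for every step of one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def decoupled_targets(rewards, values_by_gamma, actor_gamma):
    """Return (critic_loss, actor_advantages).

    values_by_gamma maps each discount factor to that head's per-step
    value predictions for the episode.
    """
    critic_loss = 0.0
    actor_adv = None
    for gamma, values in values_by_gamma.items():
        returns = discounted_returns(rewards, gamma)
        errors = [g - v for g, v in zip(returns, values)]
        # Every head is an auxiliary regression target for the critic.
        critic_loss += sum(e * e for e in errors) / len(errors)
        if gamma == actor_gamma:
            # Only the long-horizon head reaches the actor; no routing weights.
            actor_adv = errors
    return critic_loss, actor_adv

# Toy episode with a single delayed reward and untrained (zero) value heads.
rewards = [0.0, 0.0, 0.0, 100.0]
values_by_gamma = {g: [0.0] * len(rewards) for g in (0.5, 0.9, 0.999)}
critic_loss, actor_adv = decoupled_targets(rewards, values_by_gamma, actor_gamma=0.999)
```

Because the actor never sees a mixture of heads, there is no router parameter for the policy gradient to adjust, which is the pathway the method removes.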
Results: On the LunarLander-v2 delayed-reward benchmark, our decoupled agent avoids the "hovering for survival" local optimum. It eliminates policy collapse and stably surpasses the "Environment Solved" threshold without hyperparameter hacking.
Code and reproducible scripts are open-sourced in the repo.
The core idea that really sticks is target decoupling: keep multi-timescale predictions on the critic for auxiliary representation learning, while the actor updates are driven only by long-horizon advantages. This separation seems to block the gradient hijacking channel they expose with surrogate objectives, which explains why naive fusion often destabilizes learning in delayed-reward tasks. I’d love to see a more explicit ablation on how the critic's auxiliary losses interact with value variance, since even minor differences in those terms can swing policy stability. The ArxivLens breakdown helped me parse the method details and the exact routing vs. target setup; it is a nice quick walkthrough when you skim the paper (https://arxivlens.com/PaperView/Details/representation-over-routing-overcoming-surrogate-hacking-in-multi-timescale-ppo-256-832e3787). One question: would removing all long-horizon signal from the actor completely break performance on harder benchmarks, or is there a minimal long-horizon component that still preserves stability in noisy environments?
Thanks for reading and sharing the ArxivLens summary.
Regarding the variance ablation: I agree. While Figure 6 shows the aggregated value loss dropping, explicitly plotting the variance of the long-term advantage under different auxiliary weights would better illustrate the scaffolding effect. It's a solid suggestion for a future revision.
To your question about the Actor's horizon: completely removing the long-horizon signal (e.g., dropping gamma to 0.5) definitely breaks performance. The agent falls back into the hovering local optimum because it loses the sparse delayed reward signal.
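To make the loss of the delayed signal concrete, the sketch below computes the weight gamma ** k that a reward k steps in the future receives in the discounted return. The 100-step delay is an arbitrary illustrative choice, not a figure from the paper.

```python
def delayed_reward_weight(gamma, steps):
    """Discount applied to a reward that arrives `steps` steps in the future."""
    return gamma ** steps

# Compare a short-horizon and a long-horizon discount for the same delay.
for gamma in (0.5, 0.999):
    print(gamma, delayed_reward_weight(gamma, 100))
```

At gamma = 0.5 the weight is below 1e-30, so a reward 100 steps away is effectively invisible to the actor, while at gamma = 0.999 it keeps roughly 90% of its value.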
However, your intuition about noisy environments is correct. In highly stochastic tasks, gamma=0.999 might introduce too much variance to the Actor. There should be a "minimal effective horizon" (e.g., 0.95 or 0.99) that balances capturing the delayed reward with resisting environmental noise. Target decoupling is useful specifically here because it allows tuning this Actor horizon purely for the bias-variance tradeoff, without degrading the Critic's multi-timescale state representation.
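A common rule of thumb, used here as an assumption rather than something stated in the paper, is that a discount factor gamma corresponds to an effective horizon of about 1 / (1 - gamma) steps. The sketch below applies it to the values mentioned above.

```python
def effective_horizon(gamma):
    """Approximate number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)

for gamma in (0.95, 0.99, 0.999):
    print(gamma, round(effective_horizon(gamma)))  # 20, 100, 1000
```

Moving the actor from 0.999 to 0.99 therefore shortens its horizon by roughly a factor of ten, which is the kind of bias-variance knob that decoupling leaves free to tune.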
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment (2026)
- Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR (2026)
- GAGPO: Generalized Advantage Grouped Policy Optimization (2026)
- AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning (2026)
- Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning (2026)
- LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models (2026)
- Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning (2026)