Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
Abstract
SPEED is a phase-asymmetric KV-visibility policy that reduces long-context inference cost in decoder-only language models by caching prompt tokens only in lower layers during prefill while maintaining full-depth attention during decoding.
Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches an average score of 51.2 on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33%, TPOT by 22%, and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
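As a rough illustration of the visibility rule described in the abstract, the sketch below shows one way the policy could be expressed. It is not the authors' implementation: the function name `kv_visibility`, the single-BoS anchor choice, and the "lower 75% of layers" cutoff rule are assumptions made here for illustration only.

```python
# Minimal sketch of a layer-asymmetric KV-visibility policy (illustrative, not the paper's code).
import torch

num_layers   = 32                              # Llama-3.1-8B depth
prefill_frac = 0.75                            # fraction of layers caching non-anchor prompt KV
cutoff       = int(prefill_frac * num_layers)  # prompt KV kept only in layers [0, cutoff)

def kv_visibility(layer: int, is_prompt: torch.Tensor, is_anchor: torch.Tensor) -> torch.Tensor:
    """Boolean mask over token positions whose KV states are materialized at `layer`.

    - Decode-phase tokens (not prompt) are cached at every layer (full depth).
    - Anchor prompt tokens (e.g. BoS) are cached at every layer.
    - Remaining prompt tokens are cached only in the lower `cutoff` layers.
    """
    if layer < cutoff:
        return torch.ones_like(is_prompt, dtype=torch.bool)  # lower layers: everything visible
    return (~is_prompt) | is_anchor                           # upper layers: drop non-anchor prompt KV

# Toy example: a 6-token prompt (position 0 is the BoS anchor) followed by 2 decoded tokens.
is_prompt = torch.tensor([1, 1, 1, 1, 1, 1, 0, 0], dtype=torch.bool)
is_anchor = torch.tensor([1, 0, 0, 0, 0, 0, 0, 0], dtype=torch.bool)

for layer in (0, cutoff - 1, cutoff, num_layers - 1):
    print(layer, kv_visibility(layer, is_prompt, is_anchor).int().tolist())
```

Under this reading, Decode attention at each layer simply operates over whatever KV entries exist at that layer, so upper layers never see non-anchor prompt tokens.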
Community
The paper studies why long prompts are expensive in decoder-only LMs: prompt KV states are usually materialized at every layer during Prefill and then attended to throughout Decode. SPEED keeps a small set of anchor prompt tokens visible in upper layers, stores non-anchor prompt tokens only in lower layers, and keeps new Decode tokens full-depth. In the Llama-3.1-8B study, using 75% of layers for Prefill tokens preserved benchmark quality to within 0.2 points of the full-depth average while improving TTFT and TPOT and reducing active KV memory at 128K context.
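For intuition on the reported 25.0% active-KV reduction, here is a back-of-envelope calculation using the public Llama-3.1-8B attention configuration (32 layers, 8 KV heads, head dimension 128) with an fp16 cache. It assumes essentially all of the 128K context consists of non-anchor prompt tokens, so the single-anchor and decode-token overheads are treated as negligible; this is a sanity check, not a reproduction of the paper's measurement.

```python
# Back-of-envelope active KV memory for Llama-3.1-8B at 128K context (fp16 cache),
# assuming nearly all tokens are non-anchor prompt tokens.
num_layers, kv_heads, head_dim = 32, 8, 128      # Llama-3.1-8B GQA configuration
bytes_per_elem, context_len    = 2, 128 * 1024   # fp16, 128K tokens

def kv_bytes(layers_with_prompt_kv: int) -> int:
    # K and V tensors (factor 2) per cached layer, per token.
    return 2 * layers_with_prompt_kv * kv_heads * head_dim * bytes_per_elem * context_len

full  = kv_bytes(num_layers)              # baseline: prompt KV cached in all 32 layers
speed = kv_bytes(int(0.75 * num_layers))  # SPEED: non-anchor prompt KV in the 24 lower layers

print(f"baseline: {full / 2**30:.1f} GiB, SPEED: {speed / 2**30:.1f} GiB, "
      f"saving: {100 * (1 - speed / full):.1f}%")
```

Dropping prompt KV from 8 of 32 layers yields exactly a 25.0% reduction under these assumptions, consistent with the figure quoted in the abstract.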
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling (2026)
- DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference (2026)
- SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models (2026)
- Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs (2026)
- Residual-Mass Accounting for Partial-KV Decoding (2026)
- UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification (2026)
- Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training (2026)