FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
Abstract
FastKV is a KV cache compression framework that reduces prefill and decoding latency by decoupling context computation from cache budget through token-selective propagation and independent key-value entry selection.
While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling the KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82× in prefill and 2.87× in decoding compared to the full-context baseline, while matching the accuracy of the baselines that only accelerate the decoding stage. Our code is available at https://github.com/dongwonjo/FastKV.
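The abstract describes two independent knobs: a TSP rate that controls how many tokens are propagated past the TSP layer during prefill, and a KV retention rate that controls how many key/value entries are kept in the cache for decoding. The sketch below is a minimal illustration of that decoupling, not the authors' implementation (see the GitHub repository for that); the function names, tensor shapes, and the attention-based importance scoring are assumptions made for this example.

```python
import torch

def token_importance(attn_weights: torch.Tensor) -> torch.Tensor:
    # attn_weights: [num_heads, num_queries, seq_len] attention probabilities
    # observed at the (assumed) TSP layer. Aggregate over heads and queries
    # to obtain one importance score per context token.
    return attn_weights.mean(dim=(0, 1))  # [seq_len]

def select_indices(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    # Keep the top `keep_ratio` fraction of tokens by score,
    # returned in their original positional order.
    k = max(1, int(scores.numel() * keep_ratio))
    topk = torch.topk(scores, k).indices
    return torch.sort(topk).values

def tsp_and_kv_selection(hidden, keys, values, attn_weights,
                         tsp_rate: float, kv_rate: float):
    """Decoupled selection (illustrative only):
      * TSP: choose which tokens' hidden states are forwarded to the layers
        after the TSP layer (controls prefill compute).
      * KV : independently choose which key/value entries are retained in the
        cache for decoding (controls memory and decoding latency).
    """
    scores = token_importance(attn_weights)

    tsp_idx = select_indices(scores, tsp_rate)   # tokens propagated onward
    kv_idx = select_indices(scores, kv_rate)     # entries kept in the KV cache

    pruned_hidden = hidden[:, tsp_idx, :]        # [batch, |tsp_idx|, d_model]
    cached_keys = keys[:, :, kv_idx, :]          # [batch, heads, |kv_idx|, d_head]
    cached_values = values[:, :, kv_idx, :]
    return pruned_hidden, cached_keys, cached_values

# Toy usage with hypothetical shapes: the last 32 queries score the context.
B, H, S, Dh, Dm = 1, 8, 1024, 64, 512
hidden = torch.randn(B, S, Dm)
keys = torch.randn(B, H, S, Dh)
values = torch.randn(B, H, S, Dh)
attn = torch.softmax(torch.randn(H, 32, S), dim=-1)
h, k, v = tsp_and_kv_selection(hidden, keys, values, attn,
                               tsp_rate=0.5, kv_rate=0.2)
```

Because the two ratios are passed separately, a small KV retention rate (for decoding memory) does not force an equally aggressive token drop during prefill, which is the flexibility the abstract attributes to decoupling.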
Community
Introducing FastKV, a novel KV cache compression method designed to enhance inference efficiency for long-context LLMs while maintaining high accuracy.
Paper: https://arxiv.org/abs/2502.01068
Github: https://github.com/dongwonjo/FastKV
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression (2024)
- HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing (2024)
- Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models (2025)
- DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs (2024)
- Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression (2024)
- SCBench: A KV Cache-Centric Analysis of Long-Context Methods (2024)
- SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation (2024)