FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
Abstract
FlashPrefill enables ultra-fast prefilling for large language models by discovering dynamic sparse attention patterns and using dynamic thresholding to eliminate long-tail distributions, achieving significant speedups across various sequence lengths.
Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
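The abstract's key idea, keeping only attention blocks whose weight clears a data-dependent threshold instead of sorting or accumulating scores, can be sketched as follows. This is a minimal illustration under assumed details: the function name, the mean-pooling of blocks, and the rule "keep blocks whose softmax weight is at least an `alpha` fraction of the row maximum" are hypothetical stand-ins, not the paper's exact algorithm.

```python
import numpy as np

def block_sparse_mask(q, k, block=64, alpha=0.05):
    """Hypothetical sketch of threshold-based block selection.

    Mean-pools queries/keys into blocks, scores block pairs, and keeps
    blocks whose softmax weight exceeds an alpha fraction of the row
    maximum, so the long tail is dropped without any sort or cumulative
    sum over attention scores.
    """
    n, d = q.shape
    nb = n // block
    qb = q[: nb * block].reshape(nb, block, d).mean(axis=1)  # pooled query blocks
    kb = k[: nb * block].reshape(nb, block, d).mean(axis=1)  # pooled key blocks
    s = qb @ kb.T / np.sqrt(d)                               # block-level scores
    w = np.exp(s - s.max(axis=1, keepdims=True))             # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    # Dynamic threshold: compare each weight to its row's peak, not a sorted list.
    return w >= alpha * w.max(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.standard_normal((256, 32))
k = rng.standard_normal((256, 32))
mask = block_sparse_mask(q, k)
print(mask.shape, mask.mean())  # block-level mask and fraction of blocks kept
```

Because the threshold is relative to each row's largest weight, every query block always retains at least its strongest key block, while near-zero tail blocks are discarded in a single elementwise comparison.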
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling (2026)
- RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference (2026)
- HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference (2026)
- S2O: Early Stopping for Sparse Attention via Online Permutation (2026)
- Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection (2026)
- Prism: Spectral-Aware Block-Sparse Attention (2026)
- FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill (2026)