Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding
Abstract
Domino is a speculative decoding framework that speeds up LLM inference by decoupling causal dependency modeling from autoregressive drafting: a parallel draft backbone proposes a whole block of tokens, and a lightweight causal refinement head corrects them. On Qwen3 models it reports up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
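The abstract does not give a formula for the base-anchored curriculum, so the following is only a minimal sketch of one way such a schedule could be written. It assumes two losses, one on the parallel backbone's preliminary distribution and one on the causally corrected final distribution, and a linear ramp between them. The function names, the linear ramp, and the `warmup_steps` and `ramp_steps` parameters are all hypothetical and are not taken from the paper.

```python
def curriculum_weights(step, warmup_steps, ramp_steps):
    """Return (base_weight, final_weight) for a given training step.

    Hypothetical schedule: the source only states that training first
    strengthens the parallel backbone and then gradually shifts toward the
    causally corrected final distribution. The linear ramp is an assumption.
    """
    if warmup_steps < 0 or ramp_steps <= 0:
        raise ValueError("warmup_steps must be >= 0 and ramp_steps must be > 0")
    if step < warmup_steps:
        # Phase 1: optimize only the backbone's preliminary distribution.
        return 1.0, 0.0
    # Phase 2: move weight linearly from the base loss to the final loss.
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return 1.0 - progress, progress


def combined_loss(base_loss, final_loss, step, warmup_steps=1000, ramp_steps=4000):
    """Weighted sum of the two losses under the assumed schedule."""
    w_base, w_final = curriculum_weights(step, warmup_steps, ramp_steps)
    return w_base * base_loss + w_final * final_loss
```

With the default values assumed here, the base loss carries all the weight for the first 1000 steps, the two losses are weighted equally at step 3000, and only the final loss remains from step 5000 onward. The actual schedule used by Domino may differ, for example by keeping a nonzero weight on the base loss throughout.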
Community
Domino is a speculative decoding method that improves parallel drafting by adding lightweight causal correction. It aims to retain the efficiency of block-parallel draft generation while recovering part of the causal dependency modeling lost in fully parallel draft models. Code and models are available at: https://github.com/jianuo-huang/Domino
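The idea of adding a lightweight causal correction on top of block-parallel drafts can be illustrated with a toy sketch. It assumes the backbone has already produced one logit vector per block position in a single parallel pass, and that a cheap correction function adjusts each position given the tokens drafted so far. Verification uses the common greedy-match rule from speculative decoding; the source does not say which acceptance rule Domino uses. The names `draft_block`, `verify_greedy`, and `toy_correction` are hypothetical, and the hand-written penalty stands in for the learned Domino head.

```python
from typing import Callable, List, Sequence

Logits = List[float]


def argmax(xs: Sequence[float]) -> int:
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def draft_block(base_logits: List[Logits],
                correction: Callable[[List[int], int], Logits],
                prefix: List[int]) -> List[int]:
    """Refine parallel base logits one position at a time.

    Only the cheap correction runs sequentially; the base logits for the
    whole block are assumed to come from one parallel backbone pass.
    """
    drafted: List[int] = []
    for pos, base in enumerate(base_logits):
        # The correction sees the committed prefix plus tokens drafted so far.
        delta = correction(prefix + drafted, pos)
        refined = [b + d for b, d in zip(base, delta)]
        drafted.append(argmax(refined))
    return drafted


def verify_greedy(drafted: List[int], target_tokens: List[int]) -> List[int]:
    """Greedy verification of a drafted block.

    target_tokens[i] is the target model's greedy token given the prefix
    plus drafted[:i], so it must have len(drafted) + 1 entries.
    """
    if len(target_tokens) != len(drafted) + 1:
        raise ValueError("target_tokens must have len(drafted) + 1 entries")
    accepted: List[int] = []
    for i, tok in enumerate(drafted):
        if tok != target_tokens[i]:
            break
        accepted.append(tok)
    # Append the target's own token: a fix at the first mismatch, or a
    # bonus token when the whole block was accepted.
    accepted.append(target_tokens[len(accepted)])
    return accepted


def toy_correction(tokens: List[int], pos: int) -> Logits:
    """Toy stand-in for a learned head: penalize repeating the last token."""
    delta = [0.0, 0.0, 0.0]
    if tokens:
        delta[tokens[-1]] = -1.0
    return delta


# Toy data: vocabulary of 3 tokens, block of 3 positions.
toy_base = [[2.0, 1.0, 0.0], [2.0, 1.5, 0.0], [0.0, 1.0, 0.5]]
toy_draft = draft_block(toy_base, toy_correction, prefix=[1])  # [0, 1, 2]
toy_output = verify_greedy(toy_draft, [0, 1, 0, 2])            # [0, 1, 0]
```

In this toy run the uncorrected argmax over the base logits would give `[0, 0, 1]`, while the prefix-dependent correction changes the draft to `[0, 1, 2]`. The target agrees with the first two tokens and replaces the third, so three tokens are produced from one verification pass.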
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration (2026)
- Draft-OPD: On-Policy Distillation for Speculative Draft Models (2026)
- PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding (2026)
- TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding (2026)
- Accelerating Speculative Decoding with Block Diffusion Draft Trees (2026)
- DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding (2026)
- SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding (2026)
Models citing this paper 2
Huang2020/Qwen3-8B-Domino-b16