arxiv:2601.02569

LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference

Published on Jan 5

AI-generated summary

LoRA-Drop accelerates autoregressive language model decoding by applying temporal compute scheduling with low-rank LoRA corrections, achieving faster inference and reduced memory usage without significant accuracy loss.

Abstract

Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present LoRA-Drop, a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic refresh steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce the KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing them periodically. Across LLaMA2-7B, LLaMA3-8B, Qwen2.5-7B, and Qwen2.5-14B, LoRA-Drop achieves up to 2.6× faster decoding and 45–55% KV-cache reduction while staying within 0.5 percentage points (pp) of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long-context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent safe zone of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive-capacity inference in LLMs. Code is available at https://github.com/hosseinbv/LoRA-Drop.git.
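Below is a minimal sketch of how the temporal schedule described in the abstract might look in PyTorch. It is illustrative only, assuming a decoder built from per-layer callables; the names (LoRACorrection, decode_step, refresh_interval, droppable) are assumptions for this sketch, not the paper's released implementation.

```python
# Illustrative sketch of temporal LoRA decoding (not the official code).
import torch
import torch.nn as nn


class LoRACorrection(nn.Module):
    """Low-rank correction applied in place of a skipped layer on LoRA steps."""

    def __init__(self, hidden_size: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(hidden_size, rank, bias=False)
        self.up = nn.Linear(rank, hidden_size, bias=False)
        nn.init.zeros_(self.up.weight)  # correction starts at zero

    def forward(self, reused_h: torch.Tensor, cur_h: torch.Tensor) -> torch.Tensor:
        # Reuse the hidden state cached from the last step this layer ran,
        # plus a low-rank update computed from the current-token input.
        return reused_h + self.up(self.down(cur_h))


def decode_step(layers, corrections, droppable, h, reuse_cache, step,
                refresh_interval: int = 4):
    """One decoding step under a fixed temporal schedule.

    On refresh steps (step % refresh_interval == 0) every layer executes and
    droppable layers cache their outputs. On LoRA steps, droppable layers skip
    their full computation (and their KV-cache update) and instead return the
    cached output corrected by a low-rank LoRA term.
    """
    refresh = step % refresh_interval == 0
    for i, layer in enumerate(layers):
        if refresh or i not in droppable:
            h = layer(h)                 # full block; KV cache is updated
            if i in droppable:
                reuse_cache[i] = h       # remember output for later LoRA steps
        else:
            h = corrections[i](reuse_cache[i], h)  # cheap low-rank correction
    return h, reuse_cache


# Toy usage with linear "layers" just to illustrate the control flow.
hidden = 16
layers = [nn.Linear(hidden, hidden) for _ in range(6)]
droppable = {2, 3, 4}
corrections = {i: LoRACorrection(hidden) for i in droppable}
h, cache = torch.randn(1, hidden), {}
for step in range(8):
    h, cache = decode_step(layers, corrections, droppable, h, cache, step)
```

In this reading, droppable layers neither execute nor write to the KV cache on LoRA steps, which is where the reported KV-cache savings would come from; the actual schedule and correction placement are defined in the linked repository.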
