An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU
Abstract
SlideFormer enables efficient fine-tuning of large language models on single GPUs through asynchronous processing, heterogeneous memory management, and optimized kernels, achieving higher throughput and reduced memory usage.
Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory demands exceed the capacity of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) a lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O; (2) a highly efficient heterogeneous memory management scheme that significantly reduces peak memory usage; (3) optimized Triton kernels that resolve key bottlenecks, together with integrated advanced I/O. This co-design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving CPU and GPU memory usage compared to baselines, and sustains >95% of peak performance on both NVIDIA and AMD GPUs.
Community
This is not a brand-new topic. Single-GPU training and fine-tuning of very large models under heterogeneous memory constraints has been explored for years, in systems ranging from STRONGHOLD (SC’22) to Ratel (ICDE’25) and now our work, SlideFormer. SlideFormer was publicly posted on March 17, 2026 and accepted at DAC 2026, making it an early public, peer-reviewed contribution in this line of research.
In SlideFormer, we study full-parameter LLM fine-tuning on a single GPU through a heterogeneous co-design across GPU / CPU / RAM / NVMe, with:
a lightweight asynchronous layer-sliding engine,
efficient heterogeneous memory management,
integrated advanced I/O and optimized Triton kernels.
SlideFormer enables fine-tuning of 123B+ models on a single RTX 4090, sustains >95% peak performance on both NVIDIA and AMD GPUs, and improves throughput by 1.40×–6.27× over baselines while substantially reducing memory usage.
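To make the layer-sliding idea concrete, here is a minimal, illustrative sketch (not SlideFormer's actual API; the function names, toy weight store, and stand-in compute are ours) of the core overlap pattern: only one layer is resident at a time, and the transfer of layer i+1 runs concurrently with the processing of layer i. A real implementation would use CUDA streams and pinned memory for NVMe/RAM-to-GPU transfers; the stdlib thread pool here stands in for that asynchronous I/O path.

```python
# Illustrative sketch of a layer-sliding engine: overlap per-layer "compute"
# with prefetch of the next layer's weights from slower memory tiers.
from concurrent.futures import ThreadPoolExecutor

def fetch(layer_id, store):
    # Stand-in for an NVMe/RAM -> GPU transfer of one layer's weights.
    return store[layer_id]

def compute(weights, x):
    # Stand-in for the GPU forward/backward pass of one layer.
    return x + sum(weights)

def sliding_window_pass(store, num_layers, x):
    """Keep only one layer resident at a time; prefetch layer i+1
    on a background thread while layer i is being processed."""
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(fetch, 0, store)
        for i in range(num_layers):
            weights = nxt.result()                    # wait for layer i
            if i + 1 < num_layers:
                nxt = io.submit(fetch, i + 1, store)  # start prefetching i+1
            x = compute(weights, x)                   # overlaps the prefetch
    return x

store = {i: [i, i + 1] for i in range(4)}   # toy weight store for 4 layers
print(sliding_window_pass(store, 4, 0.0))   # -> 16.0
```

The same structure extends naturally to a backward pass with asynchronous write-back of updated weights, which is where the overlap with CPU-side optimizer updates described above would come in.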
That said, we also want to emphasize that “single-GPU fine-tuning of 100B+ models” should mainly be viewed as a systems stress test / roofline-style extreme point for evaluating framework design and memory orchestration. In practice, for GPUs such as the RTX 4090 / RTX 5090 / RTX Pro 6000, the more realistic sweet spot for productive fine-tuning is often closer to the 3B–14B range, where turnaround time is much more practical.
Our code release is planned for May 2026, as we are still working on the next stage of this project. In the meantime, the core ideas and system design are already described in the paper and can largely be understood from the manuscript itself. We welcome discussion from the community, and we are also happy to see related ideas adopted, extended, or integrated into existing training frameworks.
Paper: arXiv:2603.16428
Code release: planned for May 2026