An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU
Abstract
SlideFormer enables efficient fine-tuning of large language models on single GPUs through asynchronous processing, heterogeneous memory management, and optimized kernels, achieving higher throughput and reduced memory usage.
Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory demands exceed the capacity of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) a lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O; (2) a highly efficient heterogeneous memory management scheme that significantly reduces peak memory usage; (3) optimized Triton kernels that resolve key bottlenecks, together with integrated advanced I/O. This co-design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving CPU and GPU memory usage compared to baselines, and sustains >95% of peak performance on both NVIDIA and AMD GPUs.
Community
This is not a brand-new topic. Single-GPU training and fine-tuning of very large models under heterogeneous memory constraints has been explored for years, in systems ranging from STRONGHOLD (SC’22) to Ratel (ICDE’25) and now our work, SlideFormer. SlideFormer was publicly posted on March 17, 2026 and accepted at DAC 2026, making it an early public, peer-reviewed contribution in this line of research.
In SlideFormer, we study full-parameter LLM fine-tuning on a single GPU through a heterogeneous co-design across GPU / CPU / RAM / NVMe, with:
a lightweight asynchronous layer-sliding engine,
efficient heterogeneous memory management,
integrated advanced I/O and optimized Triton kernels.
SlideFormer enables fine-tuning of 123B+ models on a single RTX 4090, sustains >95% peak performance on both NVIDIA and AMD GPUs, and improves throughput by 1.40×–6.27× over baselines while substantially reducing memory usage.
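To make the layer-sliding idea concrete, here is a minimal, illustrative sketch (not SlideFormer's actual API; the function names, toy weight store, and stand-in compute are ours) of the core overlap pattern: only one layer is resident at a time, and the transfer of layer i+1 runs concurrently with the processing of layer i. A real implementation would use CUDA streams and pinned memory for NVMe/RAM-to-GPU transfers; the stdlib thread pool here stands in for that asynchronous I/O path.

```python
# Illustrative sketch of a layer-sliding engine: overlap per-layer "compute"
# with prefetch of the next layer's weights from slower memory tiers.
from concurrent.futures import ThreadPoolExecutor

def fetch(layer_id, store):
    # Stand-in for an NVMe/RAM -> GPU transfer of one layer's weights.
    return store[layer_id]

def compute(weights, x):
    # Stand-in for the GPU forward/backward pass of one layer.
    return x + sum(weights)

def sliding_window_pass(store, num_layers, x):
    """Keep only one layer resident at a time; prefetch layer i+1
    on a background thread while layer i is being processed."""
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(fetch, 0, store)
        for i in range(num_layers):
            weights = nxt.result()                    # wait for layer i
            if i + 1 < num_layers:
                nxt = io.submit(fetch, i + 1, store)  # start prefetching i+1
            x = compute(weights, x)                   # overlaps the prefetch
    return x

store = {i: [i, i + 1] for i in range(4)}   # toy weight store for 4 layers
print(sliding_window_pass(store, 4, 0.0))   # -> 16.0
```

The same structure extends naturally to a backward pass with asynchronous write-back of updated weights, which is where the overlap with CPU-side optimizer updates described above would come in.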
That said, we also want to emphasize that “single-GPU fine-tuning of 100B+ models” should mainly be viewed as a systems stress test / roofline-style extreme point for evaluating framework design and memory orchestration. In practice, for GPUs such as the RTX 4090 / RTX 5090 / RTX Pro 6000, the more realistic sweet spot for productive fine-tuning is often closer to the 3B–14B range, where turnaround time is much more practical.
Our code release is planned for May 2026, as we are still working on the next stage of this project. In the meantime, the core ideas and system design are already described in the paper and can largely be understood from the manuscript itself. We welcome discussion from the community, and we are also happy to see related ideas adopted, extended, or integrated into existing training frameworks.
Paper: arXiv:2603.16428
Code release: planned for May 2026