EGM-Qwen3-VL-4B-SFT

[Project Page] [Code]

Model Summary

EGM-Qwen3-VL-4B-SFT is the supervised fine-tuning (SFT) checkpoint from the first stage of the EGM (Efficient Visual Grounding Language Models) training pipeline. It is built on top of Qwen3-VL-4B-Thinking.

This is an intermediate checkpoint intended for further reinforcement learning (RL) training. For the final model with the best performance, see nvidia/EGM-4B.

Training Details

SFT Stage

In the SFT stage, a proprietary vision-language model (VLM) generates detailed chain-of-thought reasoning steps for the visual grounding training data. The base Qwen3-VL-4B-Thinking model is then fine-tuned on this reasoning-augmented data so that it learns structured visual grounding with explicit reasoning.

This SFT checkpoint serves as the initialization for the subsequent RL stage (GRPO), which yields the final EGM-4B model.
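As a rough illustration of what the GRPO stage computes, the sketch below shows the group-relative advantage used by GRPO in general: several responses are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its group. The reward function used for EGM is not described in this card, so the reward values are placeholders, and the function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. Implementations differ on population vs. sample standard
    deviation; the population form is used here as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive a positive advantage and the rest a negative one, so no separate value model is needed to form the baseline.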

How to Use for RL Training

pip install -U huggingface_hub
huggingface-cli download nvidia/EGM-4B-SFT --local-dir ./models/EGM-4B-SFT

Then follow the installation and RL training instructions in the EGM repository.
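The same download can be scripted from Python, which may be convenient inside a training launcher. This is a minimal sketch that uses `snapshot_download` from `huggingface_hub`; the helper names `local_dir_for` and `download_checkpoint` and the `./models` root are illustrative choices, not part of the EGM repository.

```python
from pathlib import Path

REPO_ID = "nvidia/EGM-4B-SFT"


def local_dir_for(repo_id, root="./models"):
    """Map 'org/name' to '<root>/name', mirroring the CLI command's layout."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id=REPO_ID, root="./models"):
    """Fetch every file of the checkpoint into a local directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_checkpoint()` should leave the files under `./models/EGM-4B-SFT`, the same location the CLI command writes to.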

Model Architecture

| Component | Details |
|---|---|
| Architecture | Qwen3VLForConditionalGeneration |
| Precision | bfloat16 |
| Text Hidden Size | 2560 |
| Text Layers | 36 |
| Attention Heads | 32 (8 KV heads) |
| Text Intermediate Size | 9728 |
| Vision Hidden Size | 1024 |
| Vision Layers | 24 |
| Patch Size | 16 x 16 |
| Max Position Embeddings | 262,144 |
| Vocabulary Size | 151,936 |
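One way to sanity-check a downloaded copy is to compare its `config.json` against the values listed above. The sketch below does that with the standard library only. The numbers come from this card, but the JSON key paths (`text_config`, `vision_config`, `depth`, and so on) are assumptions about how the config file is laid out, so a reported mismatch may simply mean that a key is named differently.

```python
import json
from pathlib import Path

# Values copied from the architecture details in this card.
# The (section, key) paths are assumed, not confirmed by the card.
EXPECTED = {
    ("text_config", "hidden_size"): 2560,
    ("text_config", "num_hidden_layers"): 36,
    ("text_config", "num_attention_heads"): 32,
    ("text_config", "num_key_value_heads"): 8,
    ("text_config", "intermediate_size"): 9728,
    ("text_config", "max_position_embeddings"): 262144,
    ("text_config", "vocab_size"): 151936,
    ("vision_config", "hidden_size"): 1024,
    ("vision_config", "depth"): 24,
    ("vision_config", "patch_size"): 16,
}


def find_mismatches(config, expected=EXPECTED):
    """Return {(section, key): (expected, found)} for every differing value."""
    mismatches = {}
    for (section, key), want in expected.items():
        # A missing section or key yields None and is reported as a mismatch.
        found = config.get(section, {}).get(key)
        if found != want:
            mismatches[(section, key)] = (want, found)
    return mismatches


def check_checkpoint(model_dir):
    """Load <model_dir>/config.json and compare it with EXPECTED."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return find_mismatches(config)
```

For example, `check_checkpoint("./models/EGM-4B-SFT")` returns an empty dictionary when every assumed key is present and matches.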

Related Models

| Model | Description |
|---|---|
| nvidia/EGM-4B | Final RL-trained model (best performance) |
| nvidia/EGM-8B-SFT | SFT checkpoint for the 8B variant |
| nvidia/EGM-8B | Final RL-trained 8B model |

Citation

@article{zhan2026EGM,
    author = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng},
    title = {EGM: Efficient Visual Grounding Language Models},
    journal = {arXiv},
    year = {2026}
}

Acknowledgment

This repository benefits from Qwen3-VL, InternVL, verl and verl-internvl.
