---
license: apache-2.0
library_name: transformers
tags:
- vision-encoder
- distillation
- video-language
- siglip2
- dinov3
---
# VisionEncoder Checkpoints

Final model checkpoints from the **VisionEncoder** research project.

**Training code**: https://github.com/xiaomoguhz/VisionEncoder

## Contents

Each directory corresponds to one training pipeline in the code repo:

| Directory | Training code |
|---|---|
| `declip_siglip2/spatial_align/` | `declip_siglip2/`: DeCLIP spatial alignment distillation on SigLIP2 using DINOv2 / DINOv3 as teacher |
| `kd_mllm/s1_kd_pretrain/` | `ms-swift/kd_mllm/` stage-1 pretrain (`ms-swift/run_s1.sh`) |
| `kd_mllm/s1_siglip2_qwen3_4b/` | `ms-swift/kd_mllm/` stage-1, SigLIP2 + Qwen3-4B backbone |
| `kd_mllm/s2_siglip2_qwen3_4b_10pct/` | `ms-swift/kd_mllm/` stage-2 SFT on 10% data (`run_s2.sh`) |
| `self_refine/qwen3vl_2b_10pct/` | `ms-swift/self_refine/`: register token injection + auto-calibrated GP threshold loss |
| `video_mllm_swift/s1_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with SigLIP2 encoder |
| `video_mllm_swift/s1_declip_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with DeCLIP-SigLIP2 encoder |
| `video_mllm_swift/s2_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, SigLIP2 |
| `video_mllm_swift/s2_declip_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, DeCLIP-SigLIP2 |
| `video_mllm_swift/s2_image_only_10pct/` | Ablation: image-only stage-2 training |
| `ms-swift-data/` | Not a checkpoint: preprocessed SFT training data (`ms-swift/data/`) used by the pipelines above |
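
Because each checkpoint lives in its own directory, it may be convenient to download a single directory rather than the whole repository. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `patterns_for` and `download_checkpoint` and the default `local_dir` are hypothetical and not part of the project. It relies on `snapshot_download` accepting glob-style `allow_patterns`.

```python
from pathlib import Path

REPO_ID = "xiaomoguhzz/VisionEncoder"  # model repo that hosts the directories listed above


def patterns_for(subdir: str) -> list[str]:
    """Build an allow-pattern list that matches every file under one directory."""
    return [f"{subdir.strip('/')}/*"]


def download_checkpoint(subdir: str, local_dir: str = "checkpoints") -> Path:
    """Download one checkpoint directory and return its local path.

    `local_dir` is an arbitrary illustrative default, not a project convention.
    """
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(subdir),  # skip files outside `subdir`
        local_dir=local_dir,
    )
    return Path(root) / subdir.strip("/")
```

For example, `download_checkpoint("kd_mllm/s2_siglip2_qwen3_4b_10pct/")` would fetch only that directory. How each checkpoint is then loaded depends on the corresponding pipeline in the code repo.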

## Related repositories

- **Code**: https://github.com/xiaomoguhz/VisionEncoder
- **Evaluation data (~323 GB tarballs)**: https://huggingface.co/datasets/xiaomoguhzz/R3-Bench-data
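
Given the size of the evaluation data, it can help to list the dataset's files first and fetch tarballs one at a time. This is a rough sketch under stated assumptions: `huggingface_hub` is installed, and the archive suffixes in `TAR_SUFFIXES` are a guess, since the file names inside the dataset repo are not documented here. All helper names are hypothetical.

```python
DATASET_ID = "xiaomoguhzz/R3-Bench-data"  # dataset repo linked above

# Assumed archive suffixes; the card only says "tarballs".
TAR_SUFFIXES = (".tar", ".tar.gz", ".tgz")


def select_tarballs(files, keyword=""):
    """Keep tarball paths, optionally only those whose path contains `keyword`."""
    return sorted(f for f in files if f.endswith(TAR_SUFFIXES) and keyword in f)


def list_remote_tarballs(keyword=""):
    """List tarballs in the dataset repo without downloading them."""
    from huggingface_hub import list_repo_files  # lazy import; needs network access

    return select_tarballs(list_repo_files(DATASET_ID, repo_type="dataset"), keyword)


def fetch(filename, local_dir="r3_bench_data"):
    """Download a single file from the dataset repo and return its local path."""
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        DATASET_ID, filename, repo_type="dataset", local_dir=local_dir
    )
```

Calling `list_remote_tarballs()` and then `fetch(name)` for the entries of interest avoids pulling the full ~323 GB in one go.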