Penguin-VL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs.

Unlike most existing VLMs, which rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
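To illustrate the idea, here is a minimal, hypothetical sketch (not Penguin-VL's actual code): image patches are linearly projected into an LLM's hidden size and then processed by the same transformer block stack that previously consumed token embeddings. The `ToyLLMBackbone` below is a tiny stand-in for a pretrained text-only LLM; all class and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyLLMBackbone(nn.Module):
    """Stand-in for a pretrained text-only LLM's transformer layers."""
    def __init__(self, hidden=64, layers=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)

    def forward(self, x):
        return self.blocks(x)

class VisionEncoderFromLLM(nn.Module):
    """Vision encoder whose transformer weights come from a text LLM."""
    def __init__(self, backbone, patch=16, hidden=64):
        super().__init__()
        # Patchify + project: the only vision-specific parameters.
        self.proj = nn.Conv2d(3, hidden, kernel_size=patch, stride=patch)
        # In the real setting these weights would be copied from the LLM.
        self.backbone = backbone

    def forward(self, images):            # images: (B, 3, H, W)
        x = self.proj(images)             # (B, hidden, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, hidden)
        return self.backbone(x)           # patch features in LLM space

backbone = ToyLLMBackbone()
encoder = VisionEncoderFromLLM(backbone)
feats = encoder(torch.randn(2, 3, 64, 64))
print(feats.shape)  # (2, 16, 64): 16 patches, each in the LLM's hidden space
```

Because the patch features already live in the language backbone's representation space, no separate contrastive alignment stage is needed; only the thin patch-projection layer is new.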
```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "your_img.jpg"
images = load_image(image_path)

# Load the vision encoder; flash_attention_2 requires the flash-attn package.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess, then move tensors to the model's device (works with device_map="auto").
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).to(model.device) for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

image_features = model(**inputs)
```
| Model | Base Model | HF Repo |
|---|---|---|
| Penguin-VL-8B | Qwen3-8B | `tencent/Penguin-VL-8B` |
| Penguin-VL-2B | Qwen3-1.7B | `tencent/Penguin-VL-2B` |
| Penguin-VL-Encoder | Qwen3-0.6B | `tencent/Penguin-Encoder` |
Ablation Study:
For the main results, see the ablation section in our paper.
If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
...