
ViX-Ray — Fine-tuned Medical Vision-Language Models

Fine-tuned weights for Vietnamese chest X-ray report generation across 3 clinical tasks and 6 model architectures.

Best overall performance: Qwen2-VL-7B, which leads on all 3 tasks.


Tasks

| # | Task | Description |
|---|------|-------------|
| 1 | finding | Generate radiology findings from a chest X-ray image |
| 2 | impression | Generate the clinical impression (final diagnosis) from a chest X-ray image |
| 3 | multi | Multi-turn dialogue: findings → impression via conversation history |

Models

| Key | Base model | Size |
|---------|-------------------------|------|
| Intern | InternVL2.5-1B | 1B |
| Vintern | Vintern-1B-v3.5 | 1B |
| Qwen2B | Qwen2-VL-2B-Instruct | 2B |
| Qwen7B | Qwen2-VL-7B-Instruct ⭐ | 7B |
| MiniCPM | MiniCPM-V-2_6 | 8B |
| LaVy | LaVy-Instruct | 7B |

Quick Start

1. Install

pip install huggingface_hub transformers torch torchvision pillow

For Qwen models, also install:

pip install qwen-vl-utils

For Intern / Vintern models, also install:

pip install decord

For MiniCPM, pin versions:

pip install Pillow==10.1.0 torch==2.1.2 torchvision==0.16.2 transformers==4.40.0 sentencepiece==0.1.99 decord

2. Download a model zip

# task  : finding | impression | multi
# model : Intern | Vintern | Qwen2B | Qwen7B | MiniCPM | LaVy

huggingface-cli download presencesw/ViX-Ray <task>/<Model>.zip \
    --repo-type model --local-dir ./

Example — download the best model for finding:

huggingface-cli download presencesw/ViX-Ray finding/Qwen7B.zip \
    --repo-type model --local-dir ./

Download all models at once:

huggingface-cli download presencesw/ViX-Ray \
    --repo-type model --local-dir ./vix_ray_models
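
The same downloads can also be scripted from Python with huggingface_hub; a minimal sketch using the standard hf_hub_download and snapshot_download helpers (paths mirror the CLI examples above):

from huggingface_hub import hf_hub_download, snapshot_download

# Fetch a single archive (equivalent to the single-zip CLI example above)
zip_path = hf_hub_download(
    repo_id="presencesw/ViX-Ray",
    filename="finding/Qwen7B.zip",
    repo_type="model",
    local_dir="./",
)

# Or mirror the whole repository
snapshot_download(
    repo_id="presencesw/ViX-Ray",
    repo_type="model",
    local_dir="./vix_ray_models",
)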

3. Unzip

unzip <task>/<Model>.zip -d ./models/<task>/
# result: ./models/<task>/<Model>/

Or in Python:

import zipfile
with zipfile.ZipFile("<task>/<Model>.zip") as zf:
    zf.extractall("./models/<task>/")
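
If you mirrored the whole repository in step 2, a short loop extracts every archive into the same layout; a minimal sketch assuming the ./vix_ray_models directory from the bulk download above:

import zipfile
from pathlib import Path

# Extract every <task>/<Model>.zip into ./models/<task>/<Model>/
for zip_path in Path("./vix_ray_models").glob("*/*.zip"):
    out_dir = Path("./models") / zip_path.parent.name
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)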

4. Load & infer

Set model_path = "./models/<task>/<Model>" then use the snippet for your model family.

Qwen2-VL (Qwen2B / Qwen7B)

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_path = "./models/<task>/<Model>"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg"},
            {"type": "text",  "text": "Mô tả hình ảnh X-quang ngực này."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])

InternVL / Vintern (Intern / Vintern)

import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

model_path = "./models/<task>/<Model>"

model = AutoModel.from_pretrained(
    model_path, torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True, use_flash_attn=True, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB")),
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),
])

pixel_values = transform(Image.open("your_image.jpg")).unsqueeze(0).to(torch.bfloat16).cuda()
response = model.chat(tokenizer, pixel_values,
                      "<image>\nMô tả hình ảnh X-quang ngực này.",  # "Describe this chest X-ray image."
                      dict(max_new_tokens=512, do_sample=True))
print(response)

MiniCPM-V

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_path = "./models/<task>/<Model>"

model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    attn_implementation="sdpa", torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("your_image.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Mô tả hình ảnh X-quang ngực này."]}]  # "Describe this chest X-ray image."
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))

LaVy

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "./models/<task>/<Model>"

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

inputs = processor(
    images=Image.open("your_image.jpg").convert("RGB"),
    text="Mô tả hình ảnh X-quang ngực này.",
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])

Multi-turn (Task 3)

For the multi task, pass conversation history between turns:

# Turn 1 — findings
response1 = ...  # run inference as above

# Turn 2 — impression (append assistant turn then ask)
messages.append({"role": "assistant", "content": [{"type": "text", "text": response1}]})
messages.append({"role": "user",      "content": [{"type": "text", "text": "Kết luận bệnh gì?"}]})
response2 = ...  # run inference again with updated messages
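
For concreteness, the following sketch chains both turns with the Qwen2-VL family, reusing the model and processor loaded in step 4; the run_qwen helper is illustrative, not part of this repo:

from qwen_vl_utils import process_vision_info

def run_qwen(messages, max_new_tokens=512):
    # Same inference recipe as the Qwen2-VL snippet in step 4
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to("cuda")
    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

messages = [{"role": "user", "content": [
    {"type": "image", "image": "your_image.jpg"},
    {"type": "text", "text": "Mô tả hình ảnh X-quang ngực này."},  # "Describe this chest X-ray image."
]}]
findings = run_qwen(messages)  # Turn 1: findings

messages.append({"role": "assistant", "content": [{"type": "text", "text": findings}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Kết luận bệnh gì?"}]})  # "What is the diagnosis?"
impression = run_qwen(messages)  # Turn 2: impression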

See readme/<task>_<Model>.md for the full per-model multi-turn example.


Full Model Table

| Task | Model | Base | Zip path |
|------------|-----------|------------------|------------------------|
| finding | Intern | InternVL2.5-1B | finding/Intern.zip |
| finding | Vintern | Vintern-1B-v3.5 | finding/Vintern.zip |
| finding | Qwen2B | Qwen2-VL-2B | finding/Qwen2B.zip |
| finding | Qwen7B ⭐ | Qwen2-VL-7B | finding/Qwen7B.zip |
| finding | MiniCPM | MiniCPM-V-2_6 | finding/MiniCPM.zip |
| finding | LaVy | LaVy-Instruct | finding/LaVy.zip |
| impression | Intern | InternVL2.5-1B | impression/Intern.zip |
| impression | Vintern | Vintern-1B-v3.5 | impression/Vintern.zip |
| impression | Qwen2B | Qwen2-VL-2B | impression/Qwen2B.zip |
| impression | Qwen7B ⭐ | Qwen2-VL-7B | impression/Qwen7B.zip |
| impression | MiniCPM | MiniCPM-V-2_6 | impression/MiniCPM.zip |
| impression | LaVy | LaVy-Instruct | impression/LaVy.zip |
| multi | Intern | InternVL2.5-1B | multi/Intern.zip |
| multi | Vintern | Vintern-1B-v3.5 | multi/Vintern.zip |
| multi | Qwen2B | Qwen2-VL-2B | multi/Qwen2B.zip |
| multi | Qwen7B ⭐ | Qwen2-VL-7B | multi/Qwen7B.zip |
| multi | MiniCPM | MiniCPM-V-2_6 | multi/MiniCPM.zip |
| multi | LaVy | LaVy-Instruct | multi/LaVy.zip |

Per-model details (installation, full inference code) are in readme/<task>_<Model>.md.


Citation

If you use these models or the ViX-Ray dataset in your research, please cite:

@article{nguyen2026vix,
  title={ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models},
  author={Nguyen, Duy Vu Minh and Truong, Chinh Thanh and Tran, Phuc Hoang and Le, Hung Tuan and Dat, Nguyen Van-Thanh and Pham, Trung Hieu and Van Nguyen, Kiet},
  journal={arXiv preprint arXiv:2603.15513},
  year={2026}
}