Trustworthy and Fair SkinGPT-R1 for Democratizing Dermatological Reasoning across Diverse Ethnicities
Update: We will soon release the SkinGPT-R1-7B weights.
SkinGPT-R1 is a dermatological reasoning vision-language model for research and education.
The Chinese University of Hong Kong, Shenzhen
Disclaimer
This project is for research and educational use only. It is not a substitute for professional medical advice, diagnosis, or treatment.
License
This repository is released under CC BY-NC-SA 4.0. See LICENSE for details.
Overview
SkinGPT-R1/
├── checkpoints/
├── inference/
│   ├── full_precision/
│   └── int4_quantized/
├── requirements.txt
└── README.md
Checkpoint paths:
- Full precision: ./checkpoints/full_precision
- INT4 quantized: ./checkpoints/int4
Highlights
- Dermatology-oriented multimodal reasoning
- Full-precision and INT4 inference paths
- Multi-turn chat and API serving
- RTX 50 series-friendly, SDPA-backed INT4 runtime
Install
conda create -n skingpt-r1 python=3.10 -y
conda activate skingpt-r1
pip install -r requirements.txt
Attention Backend Notes
This repo uses two attention acceleration paths:
- flash_attention_2: external package, optional
- sdpa: PyTorch-native scaled dot-product attention
Recommended choice:
- RTX 50 series: use sdpa
- A100 / RTX 3090 / RTX 4090 / H100 and other GPUs explicitly listed by the FlashAttention project: you can try flash_attention_2
Practical notes:
- The current repo pins torch==2.4.0, and SDPA is already built into PyTorch in this version.
- FlashAttention's official README currently lists Ampere, Ada, and Hopper support for FlashAttention-2. It does not list RTX 50 / Blackwell consumer GPUs in that section, so this repo defaults to sdpa for that path.
- PyTorch 2.5 added a newer cuDNN SDPA backend for H100-class and newer GPUs, but this repo is pinned to PyTorch 2.4, so you should not assume those 2.5-specific gains here.
If you are on an RTX 5090 and flash-attn is unavailable or unstable in your environment, use the INT4 path in this repo, which is already configured with attn_implementation="sdpa".
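The fallback logic above can be sketched in a few lines. This is a minimal sketch, assuming only that importability of the flash-attn package is the availability signal; it does not check whether your GPU is actually on FlashAttention's supported-architecture list, so on RTX 50 series hardware you may still want to force sdpa:

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Choose the attn_implementation string for model loading.

    Prefers flash_attention_2 only when the flash-attn package is
    importable; otherwise falls back to PyTorch-native SDPA, which the
    pinned torch==2.4.0 already ships.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"
```

The returned string can be passed as the `attn_implementation` argument when loading the model.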
Usage
Full Precision
Single image:
bash inference/full_precision/run_infer.sh --image ./test_images/lesion.jpg
Multi-turn chat:
bash inference/full_precision/run_chat.sh --image ./test_images/lesion.jpg
API service:
bash inference/full_precision/run_api.sh
Default API port: 5900
INT4 Quantized
Single image:
bash inference/int4_quantized/run_infer.sh --image_path ./test_images/lesion.jpg
Multi-turn chat:
bash inference/int4_quantized/run_chat.sh --image ./test_images/lesion.jpg
API service:
bash inference/int4_quantized/run_api.sh
Default API port: 5901
The INT4 path uses:
- bitsandbytes 4-bit quantization
- attn_implementation="sdpa"
- the adapter-aware quantized model implementation in inference/int4_quantized/
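A back-of-envelope estimate shows why 4-bit weights matter on memory-constrained GPUs. This sketch assumes a 7B parameter count (inferred from the SkinGPT-R1-7B name) and counts weights only, ignoring activations, KV cache, and quantization metadata such as bitsandbytes block scales:

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight-only memory estimate in GiB; ignores activations,
    KV cache, and quantization overhead."""
    return n_params * bits_per_weight / 8 / 1024**3

fp16_gb = weight_memory_gb(7e9, 16)  # roughly 13 GiB of weights
int4_gb = weight_memory_gb(7e9, 4)   # roughly 3.3 GiB of weights
```

The roughly 4x reduction is what makes the INT4 path practical on consumer GPUs.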
GPU Selection
You do not need to add CUDA_VISIBLE_DEVICES=0 if the machine has only one visible GPU or if you are fine with the default CUDA device.
Use it only when you want to pin the process to a specific GPU, for example on a multi-GPU server:
CUDA_VISIBLE_DEVICES=0 bash inference/int4_quantized/run_infer.sh --image_path ./test_images/lesion.jpg
The same pattern also works for:
- inference/full_precision/run_infer.sh
- inference/full_precision/run_chat.sh
- inference/full_precision/run_api.sh
- inference/int4_quantized/run_chat.sh
- inference/int4_quantized/run_api.sh
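If you set the variable from inside Python instead of on the command line, note one detail of how CUDA_VISIBLE_DEVICES behaves. A minimal sketch (the GPU index "2" is just an example):

```python
import os

# CUDA_VISIBLE_DEVICES must be set before any CUDA-using library
# initializes the driver in this process; changing it afterwards
# has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # expose only physical GPU 2

# Inside the process, visible GPUs are renumbered from 0, so code
# that addresses "cuda:0" now runs on physical GPU 2.
```

Setting the variable on the shell command line, as shown above, avoids this ordering concern entirely.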
API Endpoints
Both API services expose the same endpoints:
- POST /v1/upload/{state_id}
- POST /v1/predict/{state_id}
- POST /v1/reset/{state_id}
- POST /diagnose/stream
- GET /health
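A minimal client sketch for these endpoints, using only the standard library. The host 127.0.0.1 and the request/response payload shapes are assumptions (the README documents only the paths and default ports); the per-session endpoints are POST and take a caller-chosen state_id:

```python
import urllib.request

BASE_URL = "http://127.0.0.1:5900"  # full-precision service; INT4 defaults to 5901

def session_url(action: str, state_id: str) -> str:
    """Build a per-session URL (action is one of: upload, predict, reset)."""
    return f"{BASE_URL}/v1/{action}/{state_id}"

def health_ok() -> bool:
    """GET /health and report whether the service answered 200."""
    with urllib.request.urlopen(f"{BASE_URL}/health") as resp:
        return resp.status == 200
```

Checking /health first is a simple way to confirm the service is up before uploading an image.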
Which One To Use
- Use full_precision when you want the original model path and best fidelity.
- Use int4_quantized when GPU memory is tight or when flash-attn is not a practical option in your environment.

