# Qwen3-VL-4B GSPO Fine-tuned
This model is fine-tuned from Qwen3-VL-4B-Instruct using GSPO (Group Sequence Policy Optimization).
## Training Details

- Algorithm: GSPO (Group Sequence Policy Optimization); see the objective sketch after the hyperparameters below
- Base Model: Qwen3-VL-4B-Instruct
- Training Data: Quinn777/merged1031_simplified_o3_easy
- Infrastructure: 2 nodes × 8 H100 GPUs (16 GPUs total)
- Checkpoint: global_step_250
### Hyperparameters
- Global Batch Size: 128
- Rollout Batch Size: 256
- Rollout Number: 4
- Max Prompt Length: 4096
- Max Response Length: 4096
- Total Epochs: 10
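
For reference, GSPO computes the importance ratio and applies PPO-style clipping at the sequence level (rather than per token, as in GRPO), with group-normalized advantages over the sampled responses for each prompt (here, groups of 4 rollouts). The snippet below is a simplified, illustrative sketch of that objective, not the training code used for this checkpoint; the function name, tensor shapes, and toy numbers are placeholders.

```python
import torch

def gspo_loss(logp_new, logp_old, rewards, response_lengths, clip_eps=0.2):
    """Illustrative GSPO objective for one prompt with a group of G responses.

    logp_new / logp_old: (G,) summed log-probs of each full response under the
        current policy and the rollout (old) policy.
    rewards: (G,) scalar rewards for the group.
    response_lengths: (G,) token counts |y_i| of each response.
    """
    # Group-normalized advantages, as in GRPO
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Sequence-level, length-normalized importance ratio:
    # s_i = (pi_new(y_i|x) / pi_old(y_i|x)) ** (1 / |y_i|)
    log_ratio = (logp_new - logp_old) / response_lengths
    ratio = torch.exp(log_ratio)

    # PPO-style clipping applied to the whole sequence
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv

    # Maximize the objective, i.e. minimize its negative
    return -torch.min(unclipped, clipped).mean()

# Toy example: one prompt with a group of 4 rollouts (matching rollout number = 4)
loss = gspo_loss(
    logp_new=torch.tensor([-42.0, -55.3, -38.1, -60.2], requires_grad=True),
    logp_old=torch.tensor([-41.5, -54.8, -39.0, -59.7]),
    rewards=torch.tensor([1.0, 0.0, 1.0, 0.0]),
    response_lengths=torch.tensor([120.0, 160.0, 110.0, 175.0]),
)
loss.backward()
```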
## Usage
```python
import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

# Qwen3-VL is a vision-language model, so it is loaded with the
# image-text-to-text auto class rather than AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained(
    "ttlynne/qwen3vl-4b-gspo-merged",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "ttlynne/qwen3vl-4b-gspo-merged",
    trust_remote_code=True,
)

# Example inference (text-only)
messages = [
    {"role": "user", "content": "Solve: 2x + 5 = 13"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
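
Because the base model is a vision-language model, the checkpoint also accepts image inputs. The snippet below is a minimal sketch that reuses the `model` object from the snippet above and follows the usual Qwen-VL chat-template convention; the image path and the question are placeholders, and the exact message format may need adjusting to your transformers version.

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "ttlynne/qwen3vl-4b-gspo-merged",
    trust_remote_code=True,
)

# Placeholder image path and question
image = Image.open("example.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem shown in the image."},
        ],
    }
]

prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```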
## Performance
Evaluation results to be added.
## Acknowledgements