# PRTS-4B — Primitive Reasoning and Tasking System
PRTS-4B is a Vision–Language–Action (VLA) foundation model that, for the first time, scales reward-label-free contrastive RL into VLA pre-training itself. By treating language instructions as goals and co-training a contrastive value head in the same forward pass as behavior cloning, PRTS equips a Qwen3-VL-4B backbone with a quantitative, language-grounded estimate of how close the current state is to satisfying the instruction.
The released checkpoint is the result of pre-training on ~167 B tokens of action-labeled and embodied-reasoning data on 64 × H100 GPUs.
📄 Paper: arXiv:2604.27472 · 💻 Code: github.com/TeleHuman/PRTS · 🌐 Project: rhodes-team-prts.github.io
## Highlights
- Goal-reachability awareness, end-to-end. The contrastive value head is co-trained inside the policy backbone — no separate value network, no curated reward dataset, no offline-RL post-training loop.
- Reward-label-free. Supervision comes purely from the temporal structure of demonstrations.
- Out-of-distribution wins grow with the shift. On 5 simulation suites and 14 real-world tasks, PRTS matches or exceeds the strongest prior VLAs at ¼–⅛ the post-training compute, with the gap widening off-distribution: novel-instruction following (+38.8 over π0.5), long-horizon execution, and recovery under human intervention.
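
To make the reward-label-free supervision concrete, below is a minimal, illustrative sketch of a temporal-contrastive (InfoNCE-style) value objective in the spirit of the highlights above. The function name, shapes, and batching scheme are assumptions for exposition, not the released PRTS training code.

```python
import torch
import torch.nn.functional as F

def contrastive_value_loss(state_emb: torch.Tensor,
                           goal_emb: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (state, goal) pairs.

    state_emb[i] and goal_emb[i] come from the SAME demonstration (the goal
    being a future frame / the language instruction), so diagonal pairs are
    positives and off-diagonal pairs are negatives -- no reward labels needed.
    """
    state_emb = F.normalize(state_emb, dim=-1)
    goal_emb = F.normalize(goal_emb, dim=-1)
    logits = state_emb @ goal_emb.t() / temperature        # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```

Under such an objective, the learned similarity behaves as a language-grounded reachability score, which is the quantity the value head is described as exposing.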
## Loading the checkpoint

The released model ships its own `modeling_*.py`, `configuration_*.py`, and `processing_*.py` next to the weights, so it can be loaded directly via `transformers` with `trust_remote_code=True`. No need to clone the GitHub repo for a smoke test.
### Recommended environment

| Component | Note |
|---|---|
| Python | 3.10+ (3.11+ recommended) |
| `transformers` | `== 4.57.3` |
| PyTorch | recent CUDA build from pytorch.org |
```bash
pip install "transformers==4.57.3" torch safetensors huggingface_hub \
    numpy pillow sentencepiece protobuf colorama tokenizers
pip install accelerate  # recommended for device_map="auto"
```
### From the Hub

```python
import torch
from transformers import AutoConfig, AutoModel, AutoProcessor

REPO_ID = "TeleEmbodied/PRTS-4B"

config = AutoConfig.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    REPO_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(REPO_ID, trust_remote_code=True)

print(config.model_type)          # prts_qwen3_vl
print(type(model).__name__)
print(type(processor).__name__)
```
## Prompt format

PRTS expects a single user turn containing camera images, a discretized proprioceptive state, and a language instruction, followed by an assistant turn that emits the action chunk. The full prompt is built from these constants (declared in `prts/constants.py` of the open-source repo):
| Token | Meaning |
|---|---|
| `<\|im_start\|>` `<\|im_end\|>` | Qwen-style turn delimiters |
| `<\|vision_start\|>` `<\|image_pad\|>` `<\|vision_end\|>` | One image placeholder block per camera |
| `<\|goal_repr\|>` | CRL value-head anchor tokens |
| `<\|action_start\|>` `<\|action_pad\|>` `<\|action_end\|>` | Slot the action expert fills with the predicted action-chunk tokens |
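
As a quick sanity check that these tokens exist in the released vocabulary, something like the following should work; it assumes the custom processor exposes the standard `tokenizer` attribute common to `transformers` multimodal processors.

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("TeleEmbodied/PRTS-4B", trust_remote_code=True)
tok = processor.tokenizer  # assumed attribute; standard for multimodal processors

for t in ["<|goal_repr|>", "<|action_start|>", "<|action_pad|>", "<|action_end|>"]:
    # convert_tokens_to_ids returns the unk id for unregistered tokens
    print(t, "->", tok.convert_tokens_to_ids(t))
```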
### Layout of one rollout step

```text
<|im_start|>system
You are a helpful physical assistant.<|im_end|>
<|im_start|>user
{cam_1_name}: <|vision_start|><|image_pad|><|vision_end|>
{cam_2_name}: <|vision_start|><|image_pad|><|vision_end|>
...
Proprioception (normalized to 0-1000 scale): {s_1} {s_2} ... {s_D}
Instruction: {language instruction}
Predict the next action chunk in low-level robotics action format.<|im_end|>
<|im_start|>assistant
<|action_start|><|action_token_1|>...<|action_token_999|><|action_end|><|im_end|>
```
### Field-by-field spec

- System message: fixed to `You are a helpful physical assistant.`
- Image block: one `{cam_name}: <|vision_start|><|image_pad|><|vision_end|>` line per camera.
- Proprioceptive state: the robot state is q01/q99-normalized to `[-1, 1]` per dimension (using stats from `compute_stats.py`), then linearly remapped to integers in `[0, 1000]` and rendered as a space-separated list. The line is prefixed with `Proprioception (normalized to 0-1000 scale):`; omit it entirely if the embodiment has no proprioception channel. Out-of-range values are clipped to the bounds (0 or 1000 after remapping); see the sketch after this list.
- Instruction: a free-form English natural-language goal (e.g. `Left gripper sequentially grasps two shoes and places them in the shoebox. Right gripper closes the shoebox.`).
- Suffix: always end the user turn with `Predict the next action chunk in low-level robotics action format.` when using PRTS to generate actions.
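
Putting the spec together, here is a minimal sketch of proprioception encoding and prompt assembly. The q01/q99 statistics, camera names, and both helper names are hypothetical placeholders; the released repo derives the stats via `compute_stats.py`.

```python
import numpy as np

def encode_proprio(state: np.ndarray, q01: np.ndarray, q99: np.ndarray) -> str:
    """q01/q99-normalize to [-1, 1], clip, then remap to integers in [0, 1000]."""
    norm = 2.0 * (state - q01) / (q99 - q01) - 1.0       # -> roughly [-1, 1]
    norm = np.clip(norm, -1.0, 1.0)                      # clip out-of-range dims
    ints = np.round((norm + 1.0) * 500.0).astype(int)    # [-1, 1] -> [0, 1000]
    return " ".join(str(v) for v in ints)

def build_prompt(cam_names: list[str], proprio: str, instruction: str) -> str:
    """Assemble the single user turn in the layout shown above."""
    image_lines = "\n".join(
        f"{name}: <|vision_start|><|image_pad|><|vision_end|>" for name in cam_names
    )
    return (
        "<|im_start|>system\n"
        "You are a helpful physical assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{image_lines}\n"
        f"Proprioception (normalized to 0-1000 scale): {proprio}\n"
        f"Instruction: {instruction}\n"
        "Predict the next action chunk in low-level robotics action format.<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

# Hypothetical usage:
# prompt = build_prompt(["cam_high", "cam_left_wrist"],
#                       encode_proprio(state, q01, q99),
#                       "Pick up the red block and place it in the bin.")
```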
## License
This model is released under CC BY-NC 4.0. Free for academic and non-commercial research; commercial use is not permitted under this license.
## Citation

If you find PRTS useful, please cite:

```bibtex
@article{zhang2026prts,
  title   = {PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations},
  author  = {Yang Zhang and Jiangyuan Zhao and Chenyou Fan and Fangzheng Yan and Tian Li and Haitong Tang and Sen Fu and Xuan'er Wu and Qizhen Weng and Weinan Zhang and Xiu Li and Chi Zhang and Chenjia Bai and Xuelong Li},
  journal = {arXiv preprint arXiv:2604.27472},
  year    = {2026},
}
```
## Acknowledgements
PRTS builds on Qwen3-VL, FlashAttention, LeRobot, and OpenPI. We thank the authors of Contrastive RL for the ideas behind the contrastive value formulation.