Video-Text-to-Text
Safetensors
qwen2_5_vl
robotic-manipulation
reinforcement-learning
chain-of-thought

PRIMO COT SFT 7B

This model is part of the PRIMO series and is trained for video-based reasoning in robotic manipulation settings. It is an ablation model, compared against our PRIMO R1 in the paper.

Model Description

Current video MLLMs often function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. PRIMO R1 transforms these models into active "Critics" by:

  • Reinforcement Learning: Leveraging outcome-based RL to incentivize explicit Chain-of-Thought (CoT) generation for progress estimation.
  • Temporal Anchoring: Constructing a structured temporal input that explicitly anchors the video sequence between initial and current state images.
  • Process Reasoning: Focusing on evaluating the current state against the intended task goal to detect failures and track progress.
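
The temporal anchoring described above can be sketched as a structured chat query in which the video clip is explicitly bracketed by the initial-state and current-state images. The snippet below follows the general Qwen2.5-VL message convention; the helper name, prompt wording, and file paths are illustrative assumptions, not the exact PRIMO pipeline.

```python
# Sketch of a temporally anchored progress query (assumed format, not the
# official PRIMO code). The video is bracketed between the initial-state
# and current-state images, and the model is asked for CoT progress reasoning.

def build_progress_query(initial_frame, current_frame, video_frames, task_goal):
    """Assemble a chat message anchoring the clip between the initial and
    current state images, then ask for a step-by-step progress estimate."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": initial_frame},   # initial-state anchor
            {"type": "video", "video": video_frames},    # intermediate observations
            {"type": "image", "image": current_frame},   # current-state anchor
            {"type": "text", "text": (
                f"Task goal: {task_goal}\n"
                "Reason step by step about how far the manipulation has "
                "progressed toward the goal, then report a progress estimate."
            )},
        ],
    }]

messages = build_progress_query(
    "init.png", "now.png",
    ["frame_00.png", "frame_01.png", "frame_02.png"],
    "place the red block in the bowl",
)
```

Such a message list can then be passed through the model's chat template and processor in the usual Qwen2.5-VL inference flow.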

Citations

If you find our work helpful for your research, please consider citing:

@misc{liu2026passiveobserveractivecritic,
      title={From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation}, 
      author={Yibin Liu and Yaxing Lyu and Daqi Gao and Zhixuan Liang and Weiliang Tang and Shilong Mu and Xiaokang Yang and Yao Mu},
      year={2026},
      eprint={2603.15600},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.15600}, 
}