arxiv:2602.12705

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Published on Feb 13

· Submitted by

kai on Feb 16

#2 Paper of the day

ByteDance

Upvote

Authors:

Xiaozhong Ji ,

Abstract

MedXIAOHE is a medical vision-language foundation model that enhances clinical understanding through entity-aware continual pretraining, reinforcement learning, and tool-augmented agentic training for reliable diagnostic reasoning.

AI-generated summary

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.