MemCode-VLA v11

Memory-conditioned visuomotor policy for robot manipulation.

Architecture: SmolVLM2-2.2B VLM → MoT dual-path 24-layer denoiser (LaST0 pattern) → Flow Matching action head → DeltaMem online memory → World Model Expert (LeWM-style) → Cascade Anchor Decoder (DiffusionDrive/BridgeDrive)

Training: 8×H100-80GB DDP, 45K steps (checkpoints at 5K intervals)

Config:

  • B_ep=32, W=48, 24 MoT layers
  • VLM: single layer 14/24 (GR00T N1 pattern)
  • Anchor: 512 anchors, Sinkhorn+centering+focal KL, cosine distance
  • DeltaMem: rank-8, per-layer delta-rule associative memory
  • World Model: LeWM-style ARPredictor, H=8 history, S=2 stride
  • CoT: LaST0 <|latent_pad|> pattern, 4 latent reasoning tokens
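
As a rough illustration of the "delta-rule associative memory" named in the DeltaMem entry, here is a minimal sketch of the standard delta-rule update, M ← M + β (v − M k) kᵀ. It is not taken from the repository: the class name `DeltaMemory`, the `beta` parameter, and the tiny dimensions are all assumptions, and the rank-8 factorization and per-layer placement listed above are not modeled.

```python
class DeltaMemory:
    """Toy delta-rule associative memory (hypothetical, not the repo's DeltaMem)."""

    def __init__(self, key_dim, value_dim):
        # M has shape (value_dim, key_dim) and starts empty.
        self.M = [[0.0] * key_dim for _ in range(value_dim)]

    def read(self, key):
        # Retrieve the value currently associated with `key`: M @ key.
        return [sum(m * k for m, k in zip(row, key)) for row in self.M]

    def write(self, key, value, beta=1.0):
        # Delta rule: correct M by the prediction error for this key,
        # M <- M + beta * (value - M @ key) key^T.
        pred = self.read(key)
        for i, row in enumerate(self.M):
            err = value[i] - pred[i]
            for j, k in enumerate(key):
                row[j] += beta * err * k


mem = DeltaMemory(key_dim=2, value_dim=2)
mem.write([0.6, 0.8], [1.0, -2.0])   # unit-norm key
mem.write([0.6, 0.8], [3.0, 0.5])    # same key: the old value is replaced
print(mem.read([0.6, 0.8]))
```

With a unit-norm key and beta=1.0, a second write to the same key overwrites the stored value rather than adding to it, which is the property that distinguishes the delta rule from a plain Hebbian (additive) update.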

Checkpoints:

Step     Action Loss   Anchor Eff Rank   WM Active
5000     -             -                 -
10000    -             -                 -
15000    -             -                 -
20000    -             -                 -
25000    -             -                 -
30000    -             -                 -
35000    -             -                 -
40000    0.020         326/512           0.145
45000    -             -                 -
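
The card does not define how "Anchor Eff Rank" is computed. One common proxy for how many codebook entries are effectively in use is the exponential of the Shannon entropy of the anchor-assignment histogram; the sketch below shows that proxy only, as an assumption. The function name `effective_count` and the toy counts are hypothetical, and the repository may use a different definition (for example one based on singular values).

```python
import math


def effective_count(counts):
    """exp(Shannon entropy) of a usage histogram (assumed proxy, not the repo's metric)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Uniform usage over 4 anchors counts as about 4 effective anchors;
# usage collapsed onto a single anchor counts as 1.
print(effective_count([10, 10, 10, 10]))
print(effective_count([40, 0, 0, 0]))
```

Under this proxy, a value of 326 out of 512 would mean the assignment distribution is about as spread out as uniform usage of 326 anchors.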

Data: RobotWin (clean+aug, 5%) + AgiBot World (45%) + InternData-A1 (45%)
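
The listed percentages sum to 95, so the sketch below treats them as relative sampling weights rather than exact fractions. It only illustrates per-sample source selection with the standard library; the names `MIX` and `sample_source` are hypothetical, and the actual data loader in the repository may mix sources differently (for example per batch or per episode).

```python
import random

# Relative weights copied from the Data line; they are normalized implicitly.
MIX = {"RobotWin": 5, "AgiBot World": 45, "InternData-A1": 45}


def sample_source(rng):
    """Pick one dataset name with probability proportional to its weight."""
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed for a reproducible draw sequence
draws = [sample_source(rng) for _ in range(10_000)]
print({name: draws.count(name) / len(draws) for name in MIX})
```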

Resume:

PYTORCH_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  -m xq_memcodevla.training.train train pretraining --resume

Code: https://github.com/guohetian/XQ-MemCodeVLA (branch: dev)

Papers: MemCode-VLA (memory + planning) + TokenAct (efficient execution)
