MemCode-VLA v11
Memory-conditioned visuomotor policy for robot manipulation.
Architecture: SmolVLM2-2.2B VLM → MoT dual-path 24-layer denoiser (LaST0 pattern) → Flow Matching action head → DeltaMem online memory → World Model Expert (LeWM-style) → Cascade Anchor Decoder (DiffusionDrive/BridgeDrive)
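
The Flow Matching action head can be illustrated with a minimal sampling sketch. This README does not state the probability path or the ODE solver, so the straight-line path, the fixed-step Euler integrator, and the names `euler_sample` and `make_straight_line_velocity` are assumptions for illustration only, not the repository's API. In the real model the velocity would come from the denoiser rather than from an oracle.

```python
def make_straight_line_velocity(x0, x1):
    """Oracle velocity for the straight path x_t = (1 - t) * x0 + t * x1.

    Along this path the velocity is the constant x1 - x0. A trained
    network would only approximate it (hypothetical stand-in).
    """
    v = [b - a for a, b in zip(x0, x1)]

    def velocity_fn(x, t):
        return v

    return velocity_fn


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 0.0]          # starting sample (would be Gaussian noise)
action = [1.0, -2.0]        # target action the oracle field points to
sample = euler_sample(make_straight_line_velocity(noise, action), noise)
```

With an exact straight-line velocity field, Euler integration lands on the target up to floating-point error. With a learned field, the number of steps trades accuracy for latency.
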
Training: 8×H100-80GB DDP, 45K steps (checkpoints at 5K intervals)
Config:
- B_ep=32, W=48, 24 MoT layers
- VLM: features from a single layer, 14 of 24 (GR00T N1 pattern)
- Anchor: 512 anchors, Sinkhorn+centering+focal KL, cosine distance
- DeltaMem: rank-8, per-layer delta-rule associative memory
- World Model: LeWM-style ARPredictor, H=8 history, S=2 stride
- CoT: LaST0 <|latent_pad|> pattern, 4 latent reasoning tokens
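
The delta-rule associative memory named in the DeltaMem entry can be sketched in its generic form: a state matrix S is corrected by the prediction error, S ← S + β (v − S k) kᵀ. The class name `DeltaRuleMemory`, the dense state, and the scalar write strength `beta` are assumptions. The README does not say how the rank-8 constraint or the per-layer wiring is implemented, and neither is modelled here.

```python
class DeltaRuleMemory:
    """Generic delta-rule key-value memory (illustrative, not the repo's code)."""

    def __init__(self, key_dim, value_dim):
        # State matrix S with shape (value_dim, key_dim), initialised to zero.
        self.S = [[0.0] * key_dim for _ in range(value_dim)]

    def read(self, key):
        """Return S @ key, the value currently associated with `key`."""
        return [sum(s * k for s, k in zip(row, key)) for row in self.S]

    def write(self, key, value, beta=1.0):
        """Apply S <- S + beta * (value - S @ key) key^T."""
        predicted = self.read(key)
        for row, v, p in zip(self.S, value, predicted):
            error = v - p
            for j, k in enumerate(key):
                row[j] += beta * error * k


mem = DeltaRuleMemory(key_dim=3, value_dim=2)
mem.write([1.0, 0.0, 0.0], [2.0, 3.0])
mem.write([0.0, 1.0, 0.0], [-1.0, 4.0])
mem.write([1.0, 0.0, 0.0], [5.0, 5.0])   # overwrites the first association
```

For a unit-norm key and `beta=1.0`, a write replaces the stored value for that key exactly, which is what distinguishes the delta rule from a purely additive Hebbian update.
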
Checkpoints:
| Step | Action Loss | Anchor Eff Rank | WM Active |
|---|---|---|---|
| 5000 | - | - | - |
| 10000 | - | - | - |
| 15000 | - | - | - |
| 20000 | - | - | - |
| 25000 | - | - | - |
| 30000 | - | - | - |
| 35000 | - | - | - |
| 40000 | 0.020 | 326/512 | 0.145 |
| 45000 | - | - | - |
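
The "Anchor Eff Rank" column is not defined in this README. One common reading, assumed here, is the exponential of the Shannon entropy of the anchor-usage distribution, which equals the anchor count when usage is uniform and 1 when a single anchor takes all assignments. The function name `effective_rank` is hypothetical, and the project may instead use a singular-value-based definition.

```python
import math


def effective_rank(usage_counts):
    """exp(entropy) of the normalised usage counts (assumed definition)."""
    total = sum(usage_counts)
    if total <= 0:
        raise ValueError("usage_counts must contain a positive total")
    entropy = 0.0
    for count in usage_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return math.exp(entropy)


uniform = effective_rank([1] * 512)              # all 512 anchors used equally
half_used = effective_rank([1] * 256 + [0] * 256)  # half the anchors never used
```

Under this reading, a value such as 326/512 would indicate that usage is spread over many anchors but is not uniform.
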
Data: RobotWin (clean+aug, 5%) + AgiBot World (45%) + InternData-A1 (45%)
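
A mixture like the one above can be sampled per example with relative weights. The sketch below is an assumption about how such a mixture might be drawn, not the repository's data loader. The listed shares sum to 95, so the weights are treated as relative and normalised. The names `MIXTURE`, `normalized_weights`, and `sample_source` are hypothetical.

```python
import random

# Relative weights copied from the Data line; they sum to 95, not 100.
MIXTURE = {
    "RobotWin": 5,
    "AgiBot World": 45,
    "InternData-A1": 45,
}


def normalized_weights(mixture):
    """Rescale relative weights so they sum to 1."""
    total = sum(mixture.values())
    return {name: weight / total for name, weight in mixture.items()}


def sample_source(rng, mixture):
    """Draw one dataset name in proportion to its relative weight."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded for repeatability
draws = [sample_source(rng, MIXTURE) for _ in range(1000)]
```

`random.choices` accepts relative weights directly, so the explicit normalisation is only needed when the probabilities themselves are reported or logged.
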
Resume:
PYTORCH_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
-m xq_memcodevla.training.train train pretraining --resume
Code: https://github.com/guohetian/XQ-MemCodeVLA (branch: dev)
Papers: MemCode-VLA (memory + planning) + TokenAct (efficient execution)