# VLAlert: Code & Models

Source code for **VLAlert**, a vision-language driver-alerting framework that produces structured per-frame safety `<|BELIEF|>` tokens from dashcam video and maps them to three alert actions: **SILENT / OBSERVE / ALERT**.

This repository contains the **training and evaluation code** for all model variants. Model weights / checkpoints are **not** included. The benchmark data and experimental results are hosted separately at [`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench).

## Architecture

```
8 dashcam frames
        │
        ▼
Qwen3-VL-4B + LoRA ──► [Analysis] reasoning + [Safety Assessment] <|BELIEF|> ... <|ACTION|>  (per frame)
        │
        ├─ belief span (mean-pool layers {20,24,28,32}) → z_t ∈ ℝ^10240 ─► DangerHead (14.8M)
        └─ close-tag hidden state (layer 33)            → r_t ∈ ℝ^2560  ─► PolicyHead (7.0M)
                                                                                │
                                               a_{t-1} feedback ◄──── FSM Decoder ──► Action a_t
```

## Repository Structure

```
lkalert/
  models/                        # model architectures
    danger_head.py               # per-frame + clip danger regressor (PMA aggregator)
    policy_head_v2.py            # GRU 3-class policy head (SILENT/OBSERVE/ALERT)
    adaptive_window.py           # adaptive temporal-window selection (VLAlert-X)
    components.py                # MultiQueryPMA aggregator, legacy heads
    belief_vlm.py                # integrated VLM + belief/action heads
    multichannel_belief.py       # LKAlert-MCB gated multi-channel fusion
    lora.py                      # LoRA implementation
  utils/, data/                  # core library
training/
  VLA/                           # belief-token SFT on Qwen3-VL-4B
    train_cot_belief_v2.py       # v2 SFT (belief + action per frame)
    train_vlalert_sft_v3.py      # v3 SFT (reasoning → belief, embedding loss option)
    cot_belief_dataset_v2.py
  Policy/                        # downstream head training
    train_danger_head.py         # DangerHead (5-seed)
    train_policy_head_v2.py      # PolicyHead (5-seed)
    train_vlalert_x.py           # VLAlert-X adaptive-window end-to-end
    train_head_dpo.py            # DPO preference fine-tuning
    train_head_kto.py            # KTO fine-tuning
    train_head_ppo.py            # PPO fine-tuning
  SFT/                           # Qwen2.5-VL-3B monolithic SFT (VLAlert-2.5)
  DPO/                           # preference-pair training
  pretrain*/                     # 2-stage vision-language pretraining
  Nexar/                         # CNN baselines (ResNet50-LSTM, R3D-18, MViT-V2-S)
tools/
  # data preparation
  relabel_dada_nexar.py          # action labels via risky_time + 2s rule
  relabel_dota_corpus.py         # BADAS-gated OBSERVE labels
  generate_beliefs.py            # rule-based belief content
  run_v1_gpt5_cot.py             # GPT-4o belief generation
  build_v5_benchmark.py          # unified benchmark builder
  # belief cache extraction
  make_cache_x_v2.py             # dual-stream cache (belief_content + policy_position)
  run_qwen3_cache_fast.py        # cache extraction with Conv3d→Linear patch
  # evaluation
  demo_compare_pipeline.py       # multi-model demo scoring + visualization
  score_*.py, compute_daus_v6.py
  # figures
  render_modelarchi_v4.py, render_belief_span.py
PATCH_conv3d_linear.md           # Conv3d→Linear acceleration (64× on Blackwell GPUs)
requirements.txt
```

## The Conv3d → Linear Patch

`PATCH_conv3d_linear.md` documents a 64× end-to-end speedup of Qwen3-VL vision patch embedding on Blackwell GPUs (RTX 5090), by replacing the degenerate `nn.Conv3d(kernel=stride)` patchification with a mathematically equivalent `nn.Linear`. This makes large-scale belief-cache extraction feasible (6 days → ~2 hours). Equivalence is proven and verified (`tools/verify_patch_embed_correctness.py`).

## Reproduction

1. Prepare benchmark annotations from [`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench).
2. **Stage 1: SFT**: `training/VLA/train_vlalert_sft_v3.py`
3. **Stage 2: cache extraction**: `tools/make_cache_x_v2.py`
4. **Stage 3: heads**: `training/Policy/train_danger_head.py`, `train_policy_head_v2.py`
5. **Evaluation**: `tools/score_*.py`, `tools/compute_daus_v6.py`

Paths in scripts use `PROJECT_ROOT` as a placeholder for the repository root.

## License

Code released for research review. The benchmark builds on Nexar, DADA-2000, DoTA, and DAD source datasets; see the dataset repository for source licenses and citations.
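
## Appendix: Conv3d → Linear equivalence sketch

The Conv3d → Linear patch described in this README relies on a general fact: when a convolution's kernel size equals its stride (no padding, no dilation, a single group), each output position sees exactly one non-overlapping patch, so the layer reduces to a matrix multiply over flattened patches. The following is a minimal, illustrative sketch of that idea, not the repository's implementation (see `PATCH_conv3d_linear.md` and `tools/verify_patch_embed_correctness.py` for that). The helper names `conv3d_to_linear` and `patchify` and the toy tensor sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn


def conv3d_to_linear(conv: nn.Conv3d) -> nn.Linear:
    """Build an nn.Linear that matches a Conv3d with kernel_size == stride.

    Hypothetical helper for illustration; only valid for non-overlapping,
    unpadded, undilated, single-group convolutions.
    """
    assert tuple(conv.kernel_size) == tuple(conv.stride)
    assert tuple(conv.padding) == (0, 0, 0)
    assert tuple(conv.dilation) == (1, 1, 1) and conv.groups == 1
    kt, kh, kw = conv.kernel_size
    linear = nn.Linear(
        conv.in_channels * kt * kh * kw,
        conv.out_channels,
        bias=conv.bias is not None,
        dtype=conv.weight.dtype,
        device=conv.weight.device,
    )
    with torch.no_grad():
        # Conv weight is (out, in, kt, kh, kw); flatten everything but `out`.
        linear.weight.copy_(conv.weight.reshape(conv.out_channels, -1))
        if conv.bias is not None:
            linear.bias.copy_(conv.bias)
    return linear


def patchify(x: torch.Tensor, kernel) -> torch.Tensor:
    """(B, C, T, H, W) -> (B, N, C*kt*kh*kw), each patch flattened in (C, kt, kh, kw) order."""
    b, c, t, h, w = x.shape
    kt, kh, kw = kernel
    x = x.reshape(b, c, t // kt, kt, h // kh, kh, w // kw, kw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)  # B, T', H', W', C, kt, kh, kw
    return x.reshape(b, -1, c * kt * kh * kw)


# Toy sizes (assumed for the example); float64 keeps the comparison tight.
torch.manual_seed(0)
conv = nn.Conv3d(3, 16, kernel_size=(2, 4, 4), stride=(2, 4, 4)).double()
linear = conv3d_to_linear(conv)
x = torch.randn(2, 3, 4, 8, 8, dtype=torch.float64)

with torch.no_grad():
    ref = conv(x).flatten(2).transpose(1, 2)      # (B, N, out_channels)
    out = linear(patchify(x, conv.kernel_size))   # (B, N, out_channels)

max_err = (ref - out).abs().max().item()
print(f"max abs difference: {max_err:.2e}")
```

The key detail is ordering: the patch flattening order in `patchify` must match the `(in, kt, kh, kw)` layout of the reshaped convolution weight, and the patch index must follow the same `(T', H', W')` order as the flattened convolution output. Any speedup from such a replacement depends on hardware and kernel selection; the sketch only checks numerical agreement.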