# VLAlert: Code & Models

Source code for **VLAlert**, a vision-language driver-alerting framework that
produces structured per-frame safety `<|BELIEF|>` tokens from dashcam video and
maps them to three alert actions: **SILENT / OBSERVE / ALERT**.

This repository contains the **training and evaluation code** for all model
variants. Model weights / checkpoints are **not** included. The benchmark data
and experimental results are hosted separately at
[`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench).

## Architecture

```
8 dashcam frames
      │
      ▼
Qwen3-VL-4B + LoRA  ──►  [Analysis] reasoning  +  [Safety Assessment]
                          <|BELIEF|> ... </|BELIEF|> <|ACTION|>  (per frame)
      │
      ├─ belief span (mean-pool layers {20,24,28,32}) → z_t ∈ ℝ^10240 ─► DangerHead (14.8M)
      └─ close-tag hidden state (layer 33)            → r_t ∈ ℝ^2560  ─► PolicyHead (7.0M)
                                                                            │
                                            a_{t-1} feedback ◄──── FSM Decoder ──► Action a_t
```
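
The dimensions in the diagram are consistent with concatenating four mean-pooled 2560-dimensional layer states (4 × 2560 = 10240). The following is a minimal NumPy sketch of that feature extraction, not the repository's implementation. It assumes the hidden states are available as a `[num_layers, seq_len, 2560]` array indexed by the layer numbers in the diagram, and that the four pooled layers are concatenated rather than averaged. The function name `pool_belief_features`, the span indices, and the 34-entry layer stack are hypothetical.

```python
import numpy as np

HIDDEN = 2560                      # per-layer hidden size implied by r_t in R^2560
BELIEF_LAYERS = (20, 24, 28, 32)   # layers named in the diagram
POLICY_LAYER = 33                  # layer named in the diagram


def pool_belief_features(hidden_states, span_start, span_end, close_idx):
    """Return (z_t, r_t) for one frame.

    hidden_states: array [num_layers, seq_len, HIDDEN] (assumed layout).
    span_start, span_end: token range [start, end) of the belief content.
    close_idx: position of the closing belief tag.
    """
    # Mean-pool the belief-span tokens separately in each selected layer.
    pooled = [
        hidden_states[layer, span_start:span_end].mean(axis=0)
        for layer in BELIEF_LAYERS
    ]
    # Concatenation (an assumption) gives 4 * 2560 = 10240 dimensions.
    z_t = np.concatenate(pooled)
    # The single hidden state at the close tag feeds the policy head.
    r_t = hidden_states[POLICY_LAYER, close_idx]
    return z_t, r_t


# Random stand-in for real VLM activations: 34 layers, 16 tokens.
rng = np.random.default_rng(0)
states = rng.standard_normal((34, 16, HIDDEN)).astype(np.float32)
z_t, r_t = pool_belief_features(states, span_start=5, span_end=11, close_idx=11)
print(z_t.shape, r_t.shape)  # (10240,) (2560,)
```

In this sketch `z_t` would be the DangerHead input and `r_t` the PolicyHead input; the heads themselves and the FSM decoder are not modeled here.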

## Repository Structure

```
lkalert/
  models/                  # model architectures
    danger_head.py         #   per-frame + clip danger regressor (PMA aggregator)
    policy_head_v2.py      #   GRU 3-class policy head (SILENT/OBSERVE/ALERT)
    adaptive_window.py     #   adaptive temporal-window selection (VLAlert-X)
    components.py          #   MultiQueryPMA aggregator, legacy heads
    belief_vlm.py          #   integrated VLM + belief/action heads
    multichannel_belief.py #   LKAlert-MCB gated multi-channel fusion
    lora.py                #   LoRA implementation
  utils/, data/            # core library

training/
  VLA/                     # belief-token SFT on Qwen3-VL-4B
    train_cot_belief_v2.py #   v2 SFT (belief + action per frame)
    train_vlalert_sft_v3.py#   v3 SFT (reasoning → belief, embedding loss option)
    cot_belief_dataset_v2.py
  Policy/                  # downstream head training
    train_danger_head.py   #   DangerHead (5-seed)
    train_policy_head_v2.py#   PolicyHead (5-seed)
    train_vlalert_x.py     #   VLAlert-X adaptive-window end-to-end
    train_head_dpo.py      #   DPO preference fine-tuning
    train_head_kto.py      #   KTO fine-tuning
    train_head_ppo.py      #   PPO fine-tuning
  SFT/                     # Qwen2.5-VL-3B monolithic SFT (VLAlert-2.5)
  DPO/                     # preference-pair training
  pretrain*/               # 2-stage vision-language pretraining
  Nexar/                   # CNN baselines (ResNet50-LSTM, R3D-18, MViT-V2-S)

tools/
  # data preparation
  relabel_dada_nexar.py    # action labels via risky_time + 2s rule
  relabel_dota_corpus.py   # BADAS-gated OBSERVE labels
  generate_beliefs.py      # rule-based belief content
  run_v1_gpt5_cot.py       # GPT-4o belief generation
  build_v5_benchmark.py    # unified benchmark builder
  # belief cache extraction
  make_cache_x_v2.py       # dual-stream cache (belief_content + policy_position)
  run_qwen3_cache_fast.py  # cache extraction with Conv3d→Linear patch
  # evaluation
  demo_compare_pipeline.py # multi-model demo scoring + visualization
  score_*.py, compute_daus_v6.py
  # figures
  render_modelarchi_v4.py, render_belief_span.py

PATCH_conv3d_linear.md     # Conv3d→Linear acceleration (64× on Blackwell GPUs)
requirements.txt
```

## The Conv3d → Linear Patch

`PATCH_conv3d_linear.md` documents a 64× end-to-end speedup of Qwen3-VL vision
patch embedding on Blackwell GPUs (RTX 5090), by replacing the degenerate
`nn.Conv3d(kernel=stride)` patchification with a mathematically equivalent
`nn.Linear`. This makes large-scale belief-cache extraction feasible
(6 days → ~2 hours). Equivalence is proven and verified
(`tools/verify_patch_embed_correctness.py`).
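
Why the two operations agree: when the kernel size equals the stride, the convolution windows never overlap, so each output position is a single dot product between the flattened kernel and one flattened patch. The sketch below illustrates this in plain NumPy, under stated assumptions. It is not the repository's patch or its verification script; the function names and the small tensor sizes are illustrative only, and the naive loop stands in for `nn.Conv3d`.

```python
import numpy as np


def conv3d_kernel_eq_stride(x, weight, bias):
    """Naive reference: 3-D convolution whose stride equals its kernel size.

    x: [C, T, H, W], weight: [O, C, kt, kh, kw], bias: [O].
    Returns [O, T // kt, H // kh, W // kw].
    """
    O, C, kt, kh, kw = weight.shape
    _, T, H, W = x.shape
    out = np.empty((O, T // kt, H // kh, W // kw), dtype=x.dtype)
    for t in range(T // kt):
        for h in range(H // kh):
            for w in range(W // kw):
                # Windows do not overlap: each input element is read once.
                patch = x[:, t * kt:(t + 1) * kt,
                          h * kh:(h + 1) * kh,
                          w * kw:(w + 1) * kw]
                out[:, t, h, w] = np.tensordot(weight, patch, axes=4) + bias
    return out


def patchify_linear(x, weight, bias):
    """Same result via one matrix multiply over flattened patches."""
    O, C, kt, kh, kw = weight.shape
    _, T, H, W = x.shape
    nt, nh, nw = T // kt, H // kh, W // kw
    # Split every axis into (block, within-block), then move block axes first.
    patches = x.reshape(C, nt, kt, nh, kh, nw, kw).transpose(1, 3, 5, 0, 2, 4, 6)
    patches = patches.reshape(nt * nh * nw, C * kt * kh * kw)
    # Flatten the kernel in the same (C, kt, kh, kw) order as the patches.
    w2d = weight.reshape(O, -1)
    out = patches @ w2d.T + bias          # [num_patches, O]
    return out.T.reshape(O, nt, nh, nw)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 8, 8))            # C=3, T=4, H=W=8
weight = rng.standard_normal((5, 3, 2, 4, 4))    # O=5, kernel (2, 4, 4)
bias = rng.standard_normal(5)

ref = conv3d_kernel_eq_stride(x, weight, bias)
fast = patchify_linear(x, weight, bias)
print(ref.shape, np.allclose(ref, fast))  # (5, 2, 2, 2) True
```

The linear form replaces many small windowed reads with one dense matrix multiply. The speedup reported above is specific to the hardware and model described in `PATCH_conv3d_linear.md`; this sketch only demonstrates numerical equivalence.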

## Reproduction

1. Prepare benchmark annotations from
   [`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench).
2. **Stage 1: SFT**: `training/VLA/train_vlalert_sft_v3.py`
3. **Stage 2: cache extraction**: `tools/make_cache_x_v2.py`
4. **Stage 3: heads**: `training/Policy/train_danger_head.py`, `training/Policy/train_policy_head_v2.py`
5. **Evaluation**: `tools/score_*.py`, `tools/compute_daus_v6.py`

Paths in scripts use `PROJECT_ROOT` as a placeholder for the repository root.
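
As a convenience, the stage scripts listed above can be checked against a local checkout before launching anything. The sketch below is a hypothetical helper, not part of the repository. It assumes the placeholder appears as the literal string `PROJECT_ROOT` at the start of a path; the names `STAGES`, `resolve`, and `missing_scripts` are illustrative. No command-line flags are shown, because the scripts' arguments are not documented here.

```python
from pathlib import Path

# Stage order and script paths as listed in the Reproduction steps.
STAGES = [
    ("sft", "PROJECT_ROOT/training/VLA/train_vlalert_sft_v3.py"),
    ("cache", "PROJECT_ROOT/tools/make_cache_x_v2.py"),
    ("danger_head", "PROJECT_ROOT/training/Policy/train_danger_head.py"),
    ("policy_head", "PROJECT_ROOT/training/Policy/train_policy_head_v2.py"),
]


def resolve(template: str, project_root: Path) -> Path:
    """Replace the literal PROJECT_ROOT prefix (assumed form) with a real directory."""
    return Path(template.replace("PROJECT_ROOT", str(project_root), 1))


def missing_scripts(project_root: Path) -> list[str]:
    """Return the stage names whose script is absent under project_root."""
    return [
        name for name, template in STAGES
        if not resolve(template, project_root).is_file()
    ]


if __name__ == "__main__":
    # Assumes the command is run from the repository checkout.
    print(missing_scripts(Path.cwd()))
```

An empty list means every stage script was found at the expected location.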

## License

Code released for research review. The benchmark builds on Nexar, DADA-2000,
DoTA, and DAD source datasets; see the dataset repository for source licenses
and citations.