# VLAlert: Code & Models
Source code for **VLAlert**, a vision-language driver-alerting framework that
produces structured per-frame safety `<|BELIEF|>` tokens from dashcam video and
maps them to three alert actions: **SILENT / OBSERVE / ALERT**.
This repository contains the **training and evaluation code** for all model
variants. Model weights / checkpoints are **not** included. The benchmark data
and experimental results are hosted separately at
[`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench).
## Architecture
```
8 dashcam frames
        │
        ▼
Qwen3-VL-4B + LoRA ──► [Analysis] reasoning + [Safety Assessment]
                       <|BELIEF|> ... </|BELIEF|> <|ACTION|>  (per frame)
        │
        ├─ belief span (mean-pool layers {20,24,28,32}) → z_t ∈ ℝ^10240 ─► DangerHead (14.8M)
        └─ close-tag hidden state (layer 33)            → r_t ∈ ℝ^2560  ─► PolicyHead (7.0M)
                                                                             │
        a_{t-1} feedback ───── FSM Decoder ──► Action a_t
```
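
The two feature streams in the diagram can be sketched as follows. This is a minimal NumPy illustration, not the repository's extraction code: the function name `pool_belief_span`, the span indices, and the layer-indexing convention are hypothetical, and it assumes the four per-layer means are concatenated, which is consistent with 4 × 2560 = 10240.

```python
import numpy as np

HIDDEN = 2560                     # per-layer width, matching r_t ∈ ℝ^2560 in the diagram
BELIEF_LAYERS = (20, 24, 28, 32)  # layers named in the diagram


def pool_belief_span(hidden_states, span_start, span_end, layers=BELIEF_LAYERS):
    """Mean-pool the belief span per layer, then concatenate (assumed).

    hidden_states: array of shape [num_layers, seq_len, hidden].
    The span covers token positions [span_start, span_end).
    """
    if not 0 <= span_start < span_end <= hidden_states.shape[1]:
        raise ValueError("empty or out-of-range belief span")
    # Average over the span tokens, separately for each selected layer.
    pooled = [hidden_states[layer, span_start:span_end].mean(axis=0) for layer in layers]
    # Assumption: the per-layer means are concatenated into one vector.
    return np.concatenate(pooled, axis=0)


# Toy stand-in for VLM hidden states: 34 layers (indices 0..33), 12 tokens.
rng = np.random.default_rng(0)
states = rng.standard_normal((34, 12, HIDDEN)).astype(np.float32)

z_t = pool_belief_span(states, span_start=5, span_end=9)  # DangerHead input
r_t = states[33, 9]  # close-tag position is hypothetically token 9; PolicyHead input
print(z_t.shape, r_t.shape)
```

In the real pipeline the span boundaries would come from locating the `<|BELIEF|>` and `</|BELIEF|>` tokens in the generated sequence; here they are fixed by hand.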
## Repository Structure
```
lkalert/
  models/                       # model architectures
    danger_head.py              # per-frame + clip danger regressor (PMA aggregator)
    policy_head_v2.py           # GRU 3-class policy head (SILENT/OBSERVE/ALERT)
    adaptive_window.py          # adaptive temporal-window selection (VLAlert-X)
    components.py               # MultiQueryPMA aggregator, legacy heads
    belief_vlm.py               # integrated VLM + belief/action heads
    multichannel_belief.py      # LKAlert-MCB gated multi-channel fusion
    lora.py                     # LoRA implementation
  utils/, data/                 # core library
training/
  VLA/                          # belief-token SFT on Qwen3-VL-4B
    train_cot_belief_v2.py      # v2 SFT (belief + action per frame)
    train_vlalert_sft_v3.py     # v3 SFT (reasoning → belief, embedding loss option)
    cot_belief_dataset_v2.py
  Policy/                       # downstream head training
    train_danger_head.py        # DangerHead (5-seed)
    train_policy_head_v2.py     # PolicyHead (5-seed)
    train_vlalert_x.py          # VLAlert-X adaptive-window end-to-end
    train_head_dpo.py           # DPO preference fine-tuning
    train_head_kto.py           # KTO fine-tuning
    train_head_ppo.py           # PPO fine-tuning
  SFT/                          # Qwen2.5-VL-3B monolithic SFT (VLAlert-2.5)
  DPO/                          # preference-pair training
  pretrain*/                    # 2-stage vision-language pretraining
  Nexar/                        # CNN baselines (ResNet50-LSTM, R3D-18, MViT-V2-S)
tools/
  # data preparation
  relabel_dada_nexar.py         # action labels via risky_time + 2s rule
  relabel_dota_corpus.py        # BADAS-gated OBSERVE labels
  generate_beliefs.py           # rule-based belief content
  run_v1_gpt5_cot.py            # GPT-4o belief generation
  build_v5_benchmark.py         # unified benchmark builder
  # belief cache extraction
  make_cache_x_v2.py            # dual-stream cache (belief_content + policy_position)
  run_qwen3_cache_fast.py       # cache extraction with Conv3d→Linear patch
  # evaluation
  demo_compare_pipeline.py      # multi-model demo scoring + visualization
  score_*.py, compute_daus_v6.py
  # figures
  render_modelarchi_v4.py, render_belief_span.py
PATCH_conv3d_linear.md          # Conv3d→Linear acceleration (64× on Blackwell GPUs)
requirements.txt
```
## The Conv3d → Linear Patch
`PATCH_conv3d_linear.md` documents a 64× end-to-end speedup of Qwen3-VL vision
patch embedding on Blackwell GPUs (RTX 5090) by replacing the degenerate
`nn.Conv3d(kernel=stride)` patchification with a mathematically equivalent
`nn.Linear`. This makes large-scale belief-cache extraction feasible
(6 days → ~2 hours). Equivalence is proven and verified
(`tools/verify_patch_embed_correctness.py`).
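
The equivalence rests on a simple observation: when the kernel size equals the stride and there is no padding, each output position sees exactly one non-overlapping patch, so the convolution reduces to a dot product between the flattened patch and the flattened kernel. The sketch below checks this numerically in NumPy rather than PyTorch. It is not the repository's patch or its verification script; the function names and toy tensor sizes are assumptions chosen for illustration.

```python
import numpy as np


def patch_embed_conv(x, weight, bias):
    """Reference: 3D convolution with kernel == stride and no padding.

    x: [C, T, H, W]; weight: [D, C, kt, kh, kw]; bias: [D]. Returns [N, D].
    """
    _, _, kt, kh, kw = weight.shape
    _, T, H, W = x.shape
    out = []
    for t in range(0, T, kt):
        for h in range(0, H, kh):
            for w in range(0, W, kw):
                window = x[:, t:t + kt, h:h + kh, w:w + kw]
                # Contract all four patch axes against the kernel.
                out.append(np.tensordot(weight, window, axes=4) + bias)
    return np.stack(out)


def patch_embed_linear(x, weight, bias):
    """Same result as one matrix multiply over flattened patches."""
    D, C, kt, kh, kw = weight.shape
    _, T, H, W = x.shape
    # Split each axis into (patch index, offset within patch).
    patches = x.reshape(C, T // kt, kt, H // kh, kh, W // kw, kw)
    # Patch indices first, then (C, kt, kh, kw) to match the kernel layout.
    patches = patches.transpose(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * kt * kh * kw)
    return patches @ weight.reshape(D, -1).T + bias


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 8, 8))          # toy clip: C=3, T=4, H=W=8
weight = rng.standard_normal((5, 3, 2, 4, 4))  # toy kernel: D=5, kt=2, kh=kw=4
bias = rng.standard_normal(5)

y_conv = patch_embed_conv(x, weight, bias)
y_lin = patch_embed_linear(x, weight, bias)
assert y_conv.shape == y_lin.shape == (8, 5)
assert np.allclose(y_conv, y_lin)
```

The linear form turns the whole embedding into a single dense matrix multiply. The speedup figure quoted above is the repository's own measurement and is not reproduced by this toy check.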
## Reproduction
1. Prepare benchmark annotations from
[`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench).
2. **Stage 1: SFT**: `training/VLA/train_vlalert_sft_v3.py`
3. **Stage 2: cache extraction**: `tools/make_cache_x_v2.py`
4. **Stage 3: heads**: `training/Policy/train_danger_head.py`, `training/Policy/train_policy_head_v2.py`
5. **Evaluation**: `tools/score_*.py`, `tools/compute_daus_v6.py`
Paths in scripts use `PROJECT_ROOT` as a placeholder for the repository root.
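
One way to handle the placeholder is to substitute it before use. The helper below is a hypothetical sketch, not part of the repository: the name `resolve_project_path` and the example checkout location are assumptions, and the scripts may expect the placeholder to be edited by hand instead.

```python
from pathlib import Path

PLACEHOLDER = "PROJECT_ROOT"


def resolve_project_path(path_template: str, repo_root: Path) -> Path:
    """Replace a leading PROJECT_ROOT component with the real repository root."""
    parts = Path(path_template).parts
    if parts and parts[0] == PLACEHOLDER:
        return repo_root.joinpath(*parts[1:])
    # Paths without the placeholder are returned unchanged.
    return Path(path_template)


repo_root = Path("/home/user/VLAlert")  # hypothetical checkout location
print(resolve_project_path("PROJECT_ROOT/tools/make_cache_x_v2.py", repo_root))
```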
## License
Code released for research review. The benchmark builds on Nexar, DADA-2000,
DoTA, and DAD source datasets; see the dataset repository for source licenses
and citations.