VLAlert: Code & Models
Source code for VLAlert, a vision-language driver-alerting framework that
produces structured per-frame safety <|BELIEF|> tokens from dashcam video and
maps them to three alert actions: SILENT / OBSERVE / ALERT.
This repository contains the training and evaluation code for all model
variants. Model weights / checkpoints are not included. The benchmark data
and experimental results are hosted separately at
AsianPlayer/VLAlert-Bench.
Architecture
8 dashcam frames
        │
        ▼
Qwen3-VL-4B + LoRA ──► [Analysis] reasoning + [Safety Assessment]
                       <|BELIEF|> ... </|BELIEF|> <|ACTION|>  (per frame)
        │
        ├─ belief span (mean-pool layers {20,24,28,32}) → z_t ∈ ℝ^10240 ──► DangerHead (14.8M)
        ├─ close-tag hidden state (layer 33) → r_t ∈ ℝ^2560 ──► PolicyHead (7.0M)
        │
a_{t-1} feedback ────► FSM Decoder ──► Action a_t
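The two feature streams in the diagram can be illustrated with a minimal sketch. It assumes that the per-layer mean-pooled belief-span vectors are concatenated (which matches 4 × 2560 = 10240), that hidden states arrive as one `[seq_len, 2560]` tensor per layer with index 0 holding the embedding output, and that the span indices are already known. The function name `extract_features` and the toy inputs are hypothetical and not taken from the repository.

```python
import torch

HIDDEN = 2560                      # implied by r_t in R^2560
BELIEF_LAYERS = (20, 24, 28, 32)   # layers named in the diagram
POLICY_LAYER = 33                  # layer named in the diagram


def extract_features(hidden_states, span_start, span_end, close_idx):
    """Return (z_t, r_t) for one frame.

    hidden_states: sequence of [seq_len, HIDDEN] tensors, one per layer
                   (index 0 assumed to be the embedding output).
    span_start, span_end: token range of the belief content (end exclusive).
    close_idx: position of the closing belief tag token.
    """
    # Mean-pool the belief span at each selected layer, then concatenate.
    pooled = [hidden_states[layer][span_start:span_end].mean(dim=0)
              for layer in BELIEF_LAYERS]
    z_t = torch.cat(pooled, dim=-1)                 # 4 * 2560 = 10240
    # Single hidden state at the close tag feeds the policy stream.
    r_t = hidden_states[POLICY_LAYER][close_idx]    # 2560
    return z_t, r_t


# Toy stand-in for model outputs: random tensors, not real activations.
torch.manual_seed(0)
seq_len, num_states = 12, 37   # 37 = embeddings + 36 layers (assumed)
hidden_states = tuple(torch.randn(seq_len, HIDDEN) for _ in range(num_states))

z_t, r_t = extract_features(hidden_states, span_start=3, span_end=8, close_idx=8)
print(tuple(z_t.shape), tuple(r_t.shape))
```

In this reading, `z_t` would be the DangerHead input and `r_t` the PolicyHead input; the actual extraction logic lives in the cache tools and may differ.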
Repository Structure
lkalert/
  models/                        # model architectures
    danger_head.py               # per-frame + clip danger regressor (PMA aggregator)
    policy_head_v2.py            # GRU 3-class policy head (SILENT/OBSERVE/ALERT)
    adaptive_window.py           # adaptive temporal-window selection (VLAlert-X)
    components.py                # MultiQueryPMA aggregator, legacy heads
    belief_vlm.py                # integrated VLM + belief/action heads
    multichannel_belief.py       # LKAlert-MCB gated multi-channel fusion
    lora.py                      # LoRA implementation
  utils/, data/                  # core library
training/
  VLA/                           # belief-token SFT on Qwen3-VL-4B
    train_cot_belief_v2.py       # v2 SFT (belief + action per frame)
    train_vlalert_sft_v3.py      # v3 SFT (reasoning → belief, embedding loss option)
    cot_belief_dataset_v2.py
  Policy/                        # downstream head training
    train_danger_head.py         # DangerHead (5-seed)
    train_policy_head_v2.py      # PolicyHead (5-seed)
    train_vlalert_x.py           # VLAlert-X adaptive-window end-to-end
    train_head_dpo.py            # DPO preference fine-tuning
    train_head_kto.py            # KTO fine-tuning
    train_head_ppo.py            # PPO fine-tuning
  SFT/                           # Qwen2.5-VL-3B monolithic SFT (VLAlert-2.5)
  DPO/                           # preference-pair training
  pretrain*/                     # 2-stage vision-language pretraining
  Nexar/                         # CNN baselines (ResNet50-LSTM, R3D-18, MViT-V2-S)
tools/
  # data preparation
  relabel_dada_nexar.py          # action labels via risky_time + 2s rule
  relabel_dota_corpus.py         # BADAS-gated OBSERVE labels
  generate_beliefs.py            # rule-based belief content
  run_v1_gpt5_cot.py             # GPT-4o belief generation
  build_v5_benchmark.py          # unified benchmark builder
  # belief cache extraction
  make_cache_x_v2.py             # dual-stream cache (belief_content + policy_position)
  run_qwen3_cache_fast.py        # cache extraction with Conv3d→Linear patch
  # evaluation
  demo_compare_pipeline.py       # multi-model demo scoring + visualization
  score_*.py, compute_daus_v6.py
  # figures
  render_modelarchi_v4.py, render_belief_span.py
PATCH_conv3d_linear.md           # Conv3d→Linear acceleration (64× on Blackwell GPUs)
requirements.txt
The Conv3d → Linear Patch
PATCH_conv3d_linear.md documents a 64× end-to-end speedup of Qwen3-VL vision
patch embedding on Blackwell GPUs (RTX 5090), obtained by replacing the
degenerate nn.Conv3d(kernel=stride) patchification with a mathematically
equivalent nn.Linear. Because the kernel size equals the stride, patches never
overlap, so each output vector is just a dot product between the flattened
patch and the flattened kernels. This makes large-scale belief-cache extraction
feasible (6 days → ~2 hours). Equivalence is proven and verified
(tools/verify_patch_embed_correctness.py).
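The equivalence can be checked on a small scale with standard PyTorch modules. The sketch below is an illustration only, not the repository's patch or its verification script: the sizes `C`, `T`, `P`, `D` are toy values chosen for speed, and it assumes each input is already cut into single patches, so the convolution produces exactly one output position per patch.

```python
import torch
from torch import nn

torch.manual_seed(0)
C, T, P, D = 3, 2, 16, 64   # channels, temporal patch, spatial patch, embed dim (toy values)

# Patchification as a convolution whose kernel size equals its stride.
conv = nn.Conv3d(C, D, kernel_size=(T, P, P), stride=(T, P, P), bias=True)

# Equivalent linear layer: reuse the same weights, flattened per output channel.
linear = nn.Linear(C * T * P * P, D, bias=True)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(D, -1))
    linear.bias.copy_(conv.bias)

# A batch of 10 patches, each exactly one kernel in size.
patches = torch.randn(10, C, T, P, P)
with torch.no_grad():
    out_conv = conv(patches).reshape(10, D)       # [10, D, 1, 1, 1] -> [10, D]
    out_linear = linear(patches.reshape(10, -1))  # flatten in (C, T, P, P) order

max_err = (out_conv - out_linear).abs().max().item()
print(f"max abs difference: {max_err:.2e}")
```

The two outputs should agree up to float32 rounding. The flattening order matters: both the weights and the patches are flattened in `(C, T, P, P)` order, which is what makes the copied weights line up.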
Reproduction
- Prepare benchmark annotations from AsianPlayer/VLAlert-Bench.
- Stage 1 (SFT): training/VLA/train_vlalert_sft_v3.py
- Stage 2 (cache extraction): tools/make_cache_x_v2.py
- Stage 3 (heads): training/Policy/train_danger_head.py, train_policy_head_v2.py
- Evaluation: tools/score_*.py, tools/compute_daus_v6.py
Paths in scripts use PROJECT_ROOT as a placeholder for the repository root.
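One way to keep the stage order and the PROJECT_ROOT substitution in one place is a small dry-run helper like the sketch below. It is hypothetical: it assumes the root is supplied through an environment variable named `PROJECT_ROOT` (the scripts themselves may expect the placeholder to be edited by hand), and it passes no arguments because the scripts' command-line options are not documented here. It only prints commands and runs nothing.

```python
import os
from pathlib import Path

# Assumption: the repository root comes from an environment variable named
# PROJECT_ROOT, falling back to the current directory.
PROJECT_ROOT = Path(os.environ.get("PROJECT_ROOT", ".")).resolve()

# Stage order and script paths as listed under Reproduction.
STAGES = [
    ("Stage 1: SFT", "training/VLA/train_vlalert_sft_v3.py"),
    ("Stage 2: cache extraction", "tools/make_cache_x_v2.py"),
    ("Stage 3: DangerHead", "training/Policy/train_danger_head.py"),
    ("Stage 3: PolicyHead", "training/Policy/train_policy_head_v2.py"),
    ("Evaluation", "tools/compute_daus_v6.py"),
]


def stage_commands(root):
    """Build one `python <script>` command per stage without running anything."""
    return [(name, ["python", str(Path(root) / rel)]) for name, rel in STAGES]


for name, cmd in stage_commands(PROJECT_ROOT):
    print(f"{name}: {' '.join(cmd)}")
```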
License
Code released for research review. The benchmark builds on Nexar, DADA-2000, DoTA, and DAD source datasets; see the dataset repository for source licenses and citations.