VLAlert — Code & Models

Source code for VLAlert, a vision-language driver-alerting framework that produces structured per-frame safety <|BELIEF|> tokens from dashcam video and maps them to three alert actions: SILENT / OBSERVE / ALERT.

This repository contains the training and evaluation code for all model variants. Model weights and checkpoints are not included. The benchmark data and experimental results are hosted separately at AsianPlayer/VLAlert-Bench.

Architecture

8 dashcam frames
      │
      ▼
Qwen3-VL-4B + LoRA  ──►  [Analysis] reasoning  +  [Safety Assessment]
                          <|BELIEF|> ... </|BELIEF|> <|ACTION|>  (per frame)
      │
      ├─ belief span (mean-pool layers {20,24,28,32}) → z_t ∈ ℝ^10240 ─► DangerHead (14.8M)
      └─ close-tag hidden state (layer 33)            → r_t ∈ ℝ^2560  ─► PolicyHead (7.0M)
                                                                            │
                                            a_{t-1} feedback ◄──── FSM Decoder ──► Action a_t
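
The two feature streams in the diagram can be illustrated with a small NumPy sketch. It is not the repository's extraction code. It assumes that z_t is built by mean-pooling each of the four listed layers over the belief-span tokens and concatenating the results (4 × 2560 = 10240), and that r_t is the layer-33 hidden state at the close-tag position. The function names, the toy 34-layer tensor, and the span indices are hypothetical.

```python
import numpy as np

HIDDEN = 2560                     # hidden size implied by r_t in R^2560
BELIEF_LAYERS = (20, 24, 28, 32)  # layers named in the diagram
POLICY_LAYER = 33                 # layer named for the close-tag state


def pool_belief_span(hidden_states, span_start, span_end):
    """Mean-pool each selected layer over the belief span, then concatenate.

    hidden_states: array of shape (num_layers, seq_len, HIDDEN)
    span_start, span_end: token range of the belief content (end exclusive)
    """
    per_layer = [
        hidden_states[layer, span_start:span_end].mean(axis=0)
        for layer in BELIEF_LAYERS
    ]
    return np.concatenate(per_layer)  # shape (len(BELIEF_LAYERS) * HIDDEN,)


def policy_state(hidden_states, close_tag_index):
    """Take the single hidden vector at the close-tag token."""
    return hidden_states[POLICY_LAYER, close_tag_index]


# Toy stand-in for per-layer hidden states: 34 layers, 12 tokens (assumed sizes).
rng = np.random.default_rng(0)
hs = rng.standard_normal((34, 12, HIDDEN)).astype(np.float32)

z_t = pool_belief_span(hs, span_start=3, span_end=9)  # input to DangerHead
r_t = policy_state(hs, close_tag_index=9)             # input to PolicyHead
print(z_t.shape, r_t.shape)
```

Under these assumptions the belief vector has 10240 entries and the policy vector 2560, which matches the dimensions shown in the diagram.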

Repository Structure

lkalert/
  models/                  # model architectures
    danger_head.py         #   per-frame + clip danger regressor (PMA aggregator)
    policy_head_v2.py      #   GRU 3-class policy head (SILENT/OBSERVE/ALERT)
    adaptive_window.py     #   adaptive temporal-window selection (VLAlert-X)
    components.py          #   MultiQueryPMA aggregator, legacy heads
    belief_vlm.py          #   integrated VLM + belief/action heads
    multichannel_belief.py #   LKAlert-MCB gated multi-channel fusion
    lora.py                #   LoRA implementation
  utils/, data/            # core library

training/
  VLA/                     # belief-token SFT on Qwen3-VL-4B
    train_cot_belief_v2.py #   v2 SFT (belief + action per frame)
    train_vlalert_sft_v3.py#   v3 SFT (reasoning → belief, embedding loss option)
    cot_belief_dataset_v2.py
  Policy/                  # downstream head training
    train_danger_head.py   #   DangerHead (5-seed)
    train_policy_head_v2.py#   PolicyHead (5-seed)
    train_vlalert_x.py     #   VLAlert-X adaptive-window end-to-end
    train_head_dpo.py      #   DPO preference fine-tuning
    train_head_kto.py      #   KTO fine-tuning
    train_head_ppo.py      #   PPO fine-tuning
  SFT/                     # Qwen2.5-VL-3B monolithic SFT (VLAlert-2.5)
  DPO/                     # preference-pair training
  pretrain*/               # 2-stage vision-language pretraining
  Nexar/                   # CNN baselines (ResNet50-LSTM, R3D-18, MViT-V2-S)

tools/
  # data preparation
  relabel_dada_nexar.py    # action labels via risky_time + 2s rule
  relabel_dota_corpus.py   # BADAS-gated OBSERVE labels
  generate_beliefs.py      # rule-based belief content
  run_v1_gpt5_cot.py       # GPT-4o belief generation
  build_v5_benchmark.py    # unified benchmark builder
  # belief cache extraction
  make_cache_x_v2.py       # dual-stream cache (belief_content + policy_position)
  run_qwen3_cache_fast.py  # cache extraction with Conv3d→Linear patch
  # evaluation
  demo_compare_pipeline.py # multi-model demo scoring + visualization
  score_*.py, compute_daus_v6.py
  # figures
  render_modelarchi_v4.py, render_belief_span.py

PATCH_conv3d_linear.md     # Conv3d→Linear acceleration (64× on Blackwell GPUs)
requirements.txt

The Conv3d → Linear Patch

PATCH_conv3d_linear.md documents a 64× end-to-end speedup of Qwen3-VL vision patch embedding on Blackwell GPUs (RTX 5090), by replacing the degenerate nn.Conv3d(kernel=stride) patchification with a mathematically equivalent nn.Linear. This makes large-scale belief-cache extraction feasible (6 days → ~2 hours). Equivalence is proven and verified (tools/verify_patch_embed_correctness.py).
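
The identity behind the patch can be checked on a toy example. The sketch below is a NumPy illustration of the general fact that a 3D convolution whose stride equals its kernel size is a linear map applied to each non-overlapping patch. It is not the code in PATCH_conv3d_linear.md or tools/verify_patch_embed_correctness.py, and the tensor sizes and function names are assumptions chosen for readability.

```python
import numpy as np


def conv3d_kernel_eq_stride(x, w, b):
    """Reference 3D convolution with stride == kernel size, written as loops.

    x: (C, T, H, W), w: (D, C, kt, kh, kw), b: (D,)
    returns: (D, T // kt, H // kh, W // kw)
    """
    D, C, kt, kh, kw = w.shape
    _, T, H, W = x.shape
    out = np.empty((D, T // kt, H // kh, W // kw), dtype=x.dtype)
    for i in range(T // kt):
        for j in range(H // kh):
            for k in range(W // kw):
                patch = x[:, i*kt:(i+1)*kt, j*kh:(j+1)*kh, k*kw:(k+1)*kw]
                # Contract the (C, kt, kh, kw) axes of w with the patch.
                out[:, i, j, k] = np.tensordot(w, patch, axes=4) + b
    return out


def linear_patch_embed(x, w, b):
    """Same result via one matrix multiply over flattened patches."""
    D, C, kt, kh, kw = w.shape
    _, T, H, W = x.shape
    nt, nh, nw = T // kt, H // kh, W // kw
    # Cut x into non-overlapping patches; flatten each in (C, kt, kh, kw) order.
    patches = x.reshape(C, nt, kt, nh, kh, nw, kw).transpose(1, 3, 5, 0, 2, 4, 6)
    patches = patches.reshape(nt * nh * nw, C * kt * kh * kw)
    out = patches @ w.reshape(D, -1).T + b  # (num_patches, D)
    return out.T.reshape(D, nt, nh, nw)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 8, 8))     # toy clip: C=3, T=4, H=W=8
w = rng.standard_normal((5, 3, 2, 4, 4))  # toy kernel: D=5, kernel (2, 4, 4)
b = rng.standard_normal(5)

ref = conv3d_kernel_eq_stride(x, w, b)
fast = linear_patch_embed(x, w, b)
print(np.allclose(ref, fast))
```

The linear form reuses the convolution weights reshaped to (D, C·kt·kh·kw), so no retraining is involved. The speedup reported above depends on the GPU kernels used and is not something this CPU sketch measures.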

Reproduction

Prepare benchmark annotations from AsianPlayer/VLAlert-Bench.
Stage 1 (SFT): training/VLA/train_vlalert_sft_v3.py
Stage 2 (cache extraction): tools/make_cache_x_v2.py
Stage 3 (heads): training/Policy/train_danger_head.py, training/Policy/train_policy_head_v2.py
Evaluation: tools/score_*.py, tools/compute_daus_v6.py

Paths in scripts use PROJECT_ROOT as a placeholder for the repository root.
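
As a convenience, the stage order and the placeholder substitution can be sketched as a dry-run planner. This is a hypothetical helper, not part of the repository. It assumes PROJECT_ROOT appears as the leading path component, and it prints bare script invocations because the scripts' command-line arguments are not documented here.

```python
from pathlib import Path

PLACEHOLDER = "PROJECT_ROOT"

# Stage order taken from the Reproduction section; arguments intentionally omitted.
STAGES = [
    ("Stage 1: SFT", "PROJECT_ROOT/training/VLA/train_vlalert_sft_v3.py"),
    ("Stage 2: cache extraction", "PROJECT_ROOT/tools/make_cache_x_v2.py"),
    ("Stage 3: DangerHead", "PROJECT_ROOT/training/Policy/train_danger_head.py"),
    ("Stage 3: PolicyHead", "PROJECT_ROOT/training/Policy/train_policy_head_v2.py"),
]


def resolve(path_template, repo_root):
    """Replace a leading PROJECT_ROOT component with a concrete directory."""
    parts = Path(path_template).parts
    if parts and parts[0] == PLACEHOLDER:
        return Path(repo_root).joinpath(*parts[1:])
    return Path(path_template)  # no placeholder: leave the path unchanged


def plan(repo_root):
    """Return (stage name, resolved script path) pairs in execution order."""
    return [(name, resolve(template, repo_root)) for name, template in STAGES]


# Dry run: print what would be executed for an example checkout location.
for name, script in plan("/path/to/VLAlert"):
    print(f"{name}: python {script}")
```

Replace /path/to/VLAlert with the actual checkout directory. Each script's own argument parser remains the authority on required options.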

License

Code released for research review. The benchmark builds on Nexar, DADA-2000, DoTA, and DAD source datasets; see the dataset repository for source licenses and citations.
