AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

AutoMoT is a unified Vision-Language-Action (VLA) framework for end-to-end autonomous driving. It features an Asynchronous Mixture-of-Transformers (MoT) architecture that unifies scene reasoning and action generation within a single model.

Paper: AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Project Page: https://automot-website.github.io/
Repository: https://github.com/OscarHuangWind/AutoMoT

Method Overview

AutoMoT employs a fast-slow inference mechanism through its Asynchronous Mixture-of-Transformers design:

Understanding Expert (4B): Performs low-frequency, high-level reasoning to understand complex driving scenes.
Action Expert (1.6B): Operates at a higher frequency to decode 3-second driving decisions and spatial-temporal waypoints via KV-cache bridging.

This architecture preserves the general reasoning capabilities of pre-trained Vision-Language Models (VLMs) while ensuring the low-latency inference required for real-time autonomous vehicle control.
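The fast-slow schedule described above can be sketched as a small simulation. This is a single-threaded illustration of the idea only, not the authors' implementation: the names `SceneCache`, `understanding_expert`, `action_expert`, and `run_fast_slow`, as well as the refresh ratio `slow_every=5`, are assumptions made for the example, and the "cache" here is a placeholder string rather than real attention keys and values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SceneCache:
    """Stand-in for the key/value cache produced by the slow expert."""
    step: int      # control step at which the cache was computed
    summary: str   # placeholder for the cached keys/values


def understanding_expert(step: int, observation: str) -> SceneCache:
    # Hypothetical slow path: expensive scene reasoning, run infrequently.
    return SceneCache(step=step, summary=f"scene@{step}:{observation}")


def action_expert(step: int, observation: str, cache: SceneCache) -> dict:
    # Hypothetical fast path: reuses the most recent cache instead of
    # re-running the slow expert on every control step.
    return {"step": step, "cache_step": cache.step, "cache_age": step - cache.step}


def run_fast_slow(observations: List[str], slow_every: int = 5) -> List[dict]:
    """Refresh the cache every `slow_every` steps; produce an action every step."""
    cache: Optional[SceneCache] = None
    outputs = []
    for step, obs in enumerate(observations):
        if cache is None or step % slow_every == 0:
            cache = understanding_expert(step, obs)  # low-frequency update
        outputs.append(action_expert(step, obs, cache))  # high-frequency decode
    return outputs


if __name__ == "__main__":
    for out in run_fast_slow([f"frame{i}" for i in range(8)], slow_every=5):
        print(out)
```

In this sketch the action path never waits for fresh reasoning: it acts on a cache that is at most `slow_every - 1` steps old. A real deployment would presumably run the two experts concurrently rather than in one loop.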

Benchmark Results

AutoMoT achieves state-of-the-art (SOTA) performance on the Bench2Drive 220-route closed-loop evaluation (CARLA 0.9.15):

Driving Score (DS): 89.42
Success Rate (SR): 74.09%
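
For readers unfamiliar with these two metrics, the sketch below shows how closed-loop scores of this kind are commonly aggregated from per-route results. It assumes the CARLA-leaderboard-style convention (Driving Score as the mean of route completion times an infraction penalty, Success Rate as the share of routes counted as successful); the `RouteResult` type, its field names, and the sample values are hypothetical and are not taken from the AutoMoT evaluation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RouteResult:
    route_completion: float    # percent of the route completed, 0..100
    infraction_penalty: float  # multiplicative penalty in (0, 1]; 1.0 means none
    success: bool              # route counted as successful by the benchmark


def driving_score(results: List[RouteResult]) -> float:
    """Mean over routes of completion scaled by the infraction penalty."""
    return sum(r.route_completion * r.infraction_penalty for r in results) / len(results)


def success_rate(results: List[RouteResult]) -> float:
    """Percentage of routes marked as successful."""
    return 100.0 * sum(r.success for r in results) / len(results)


if __name__ == "__main__":
    # Made-up sample routes, for illustration only.
    sample = [
        RouteResult(100.0, 1.0, True),
        RouteResult(100.0, 0.6, False),
        RouteResult(50.0, 1.0, False),
    ]
    print(f"DS = {driving_score(sample):.2f}, SR = {success_rate(sample):.2f}%")
```

As a sanity check on scale, 163 successful routes out of 220 would give 74.09% in a single run; the source does not state the underlying count, so this is only an arithmetic observation.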

Citation

If you find this work useful, please cite:

@article{huang2026automot,
  title   = {AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving},
  author  = {Wenhui Huang and Songyan Zhang and Qihang Huang and Zhidong Wang and Zhiqi Mao and Collister Chua and Zhan Chen and Long Chen and Chen Lv},
  journal = {arXiv preprint arXiv:2603.14851},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.14851}
}

@inproceedings{jia2024bench,
  title     = {Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving},
  author    = {Xiaosong Jia and Zhenjie Yang and Qifeng Li and Zhiyuan Zhang and Junchi Yan},
  booktitle = {NeurIPS 2024 Datasets and Benchmarks Track},
  year      = {2024}
}

Model size: 7B params (Safetensors). Tensor types: F32, BF16.

Paper for Oscar-Huang/AutoMoT: arXiv 2603.14851, published Mar 18