AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Paper: arXiv:2603.14851 (https://arxiv.org/abs/2603.14851)
AutoMoT is a unified Vision-Language-Action (VLA) framework for end-to-end autonomous driving. It features an Asynchronous Mixture-of-Transformers (MoT) architecture that unifies scene reasoning and action generation within a single model.
AutoMoT employs a fast-slow inference mechanism through its Asynchronous Mixture-of-Transformers design.
This architecture preserves the general reasoning capabilities of pre-trained Vision-Language Models (VLMs) while ensuring the low-latency inference required for real-time autonomous vehicle control.
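The fast-slow idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AutoMoT's implementation: `SlowContext`, `slow_reason`, `fast_act`, and the `slow_every` refresh interval are all hypothetical names, and the asynchronous schedule is simulated in a single loop rather than with separate threads or processes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SlowContext:
    """Cached output of the slow reasoning branch (hypothetical structure)."""
    step: int      # control step at which this context was computed
    summary: str   # stand-in for cached scene-reasoning features


def slow_reason(step: int, observation: float) -> SlowContext:
    # Placeholder for the expensive scene-reasoning pass (assumption).
    return SlowContext(step=step, summary=f"scene@{step}:obs={observation:.1f}")


def fast_act(observation: float, context: SlowContext) -> float:
    # Placeholder for the cheap action pass (assumption). A real action
    # branch would condition on the cached context; this dummy controller
    # only scales the observation.
    return 0.5 * observation


def run_fast_slow(
    observations: List[float], slow_every: int = 4
) -> List[Tuple[int, int, float]]:
    """Run the fast branch every step and refresh the slow context
    only every `slow_every` steps, reusing the cached context in between."""
    context: Optional[SlowContext] = None
    log: List[Tuple[int, int, float]] = []
    for step, obs in enumerate(observations):
        if context is None or step % slow_every == 0:
            context = slow_reason(step, obs)   # low-frequency update
        action = fast_act(obs, context)        # high-frequency update
        log.append((step, context.step, action))
    return log


if __name__ == "__main__":
    for step, ctx_step, action in run_fast_slow([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]):
        print(f"step={step} context_from_step={ctx_step} action={action:.2f}")
```

With `slow_every=4`, steps 0 to 3 reuse the context computed at step 0 and steps 4 and 5 reuse the one computed at step 4, so the per-step cost is dominated by the fast branch. The sketch shows only this scheduling pattern; it does not model how AutoMoT shares information between its reasoning and action components.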
AutoMoT reports state-of-the-art (SOTA) performance on the Bench2Drive 220-route closed-loop evaluation (CARLA 0.9.15).
If you find this work useful, please cite:
@article{huang2026automot,
title = {AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving},
author = {Wenhui Huang and Songyan Zhang and Qihang Huang and Zhidong Wang and Zhiqi Mao and Collister Chua and Zhan Chen and Long Chen and Chen Lv},
journal = {arXiv preprint arXiv:2603.14851},
year = {2026},
url = {https://arxiv.org/abs/2603.14851}
}
@inproceedings{jia2024bench,
title = {Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving},
author = {Xiaosong Jia and Zhenjie Yang and Qifeng Li and Zhiyuan Zhang and Junchi Yan},
booktitle = {NeurIPS 2024 Datasets and Benchmarks Track},
year = {2024}
}