AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models

By: Yuxuan Han, Kunyuan Wu, Qianyi Shao, Renxiang Xiao, Zilu Wang, Cansen Jiang, Yi Xiao, Liang Hu, Yunjiang Lou

Published: 2026-02-04

#cs.AI

Abstract

End-to-end autonomous driving has emerged as a promising paradigm. We propose AppleVLM, a vision-language model (VLM) enhanced with advanced perception and planning for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. The vision encoder fuses spatial-temporal information from multi-view images using a deformable transformer, improving the robustness of perception. A dedicated planning modality encodes explicit Bird's-Eye-View (BEV) spatial information, mitigating language biases. We deploy AppleVLM on an automated guided vehicle (AGV) platform and demonstrate real-world end-to-end autonomous driving in complex outdoor environments.
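The abstract gives no implementation details, but deformable-transformer fusion of multi-view features into a BEV representation typically follows a deformable-attention pattern: learnable BEV queries sample each camera's feature map at reference points plus predicted offsets, then aggregate the samples with predicted weights. Below is a minimal PyTorch sketch of that generic pattern; the module name `SimpleDeformableFusion`, all dimensions, the 0.1 offset scale, and the uniform averaging over views are illustrative assumptions, not AppleVLM's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableFusion(nn.Module):
    """Illustrative deformable-attention fusion of multi-view features
    into BEV queries. A generic sketch, not AppleVLM's architecture."""

    def __init__(self, dim: int = 64, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Each BEV query predicts per-point sampling offsets and weights.
        self.offset_head = nn.Linear(dim, num_points * 2)
        self.weight_head = nn.Linear(dim, num_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, bev_queries, view_feats, ref_points):
        # bev_queries: (B, Q, C) learnable BEV grid queries
        # view_feats:  (B, V, C, H, W) per-camera feature maps
        # ref_points:  (B, Q, V, 2) in [-1, 1]; each BEV cell projected
        #              into each camera view (assumed given by calibration)
        B, Q, C = bev_queries.shape
        V = view_feats.shape[1]
        offsets = self.offset_head(bev_queries).view(B, Q, 1, self.num_points, 2)
        weights = self.weight_head(bev_queries).softmax(dim=-1)  # (B, Q, P)
        # Sampling locations: reference point plus small learned offsets.
        locs = ref_points.unsqueeze(3) + 0.1 * offsets.tanh()    # (B, Q, V, P, 2)
        per_view = []
        for v in range(V):
            grid = locs[:, :, v]                                 # (B, Q, P, 2)
            per_view.append(F.grid_sample(view_feats[:, v], grid,
                                          align_corners=False))  # (B, C, Q, P)
        feat = torch.stack(per_view).mean(dim=0)                 # average over views
        fused = (feat * weights.unsqueeze(1)).sum(dim=-1)        # weight over points
        return self.proj(fused.transpose(1, 2))                  # (B, Q, C)

if __name__ == "__main__":
    B, Q, C, V, H, W = 1, 16, 64, 4, 8, 8
    fusion = SimpleDeformableFusion(dim=C)
    bev = torch.randn(B, Q, C)
    feats = torch.randn(B, V, C, H, W)
    refs = torch.rand(B, Q, V, 2) * 2 - 1   # placeholder projected reference points
    print(fusion(bev, feats, refs).shape)   # torch.Size([1, 16, 64])
```

In a real system the reference points would come from camera calibration rather than random values, and the temporal half of the spatial-temporal fusion would presumably recur over past BEV states; neither is specified in the abstract.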

