AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models

By: Yuxuan Han, Kunyuan Wu, Qianyi Shao, Renxiang Xiao, Zilu Wang, Cansen Jiang, Yi Xiao, Liang Hu, Yunjiang Lou

Published: 2026-02-04

#cs.AI

Abstract

End-to-end autonomous driving has emerged as a promising paradigm. We propose AppleVLM, a vision-language model (VLM) enhanced with advanced perception and planning for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. The vision encoder fuses spatial-temporal information from multi-view images using a deformable transformer, improving the robustness of perception. A dedicated planning modality encodes explicit Bird's-Eye-View (BEV) spatial information, mitigating language biases. We deploy AppleVLM on an automated guided vehicle (AGV) platform and demonstrate real-world end-to-end autonomous driving in complex outdoor environments.
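The abstract gives no implementation details, but deformable-transformer fusion of multi-view features into a BEV representation typically follows a deformable-attention pattern: learnable BEV queries sample each camera's feature map at reference points plus predicted offsets, then aggregate the samples with predicted weights. Below is a minimal PyTorch sketch of that generic pattern; the module name `SimpleDeformableFusion`, all dimensions, the 0.1 offset scale, and the uniform averaging over views are illustrative assumptions, not AppleVLM's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableFusion(nn.Module):
    """Illustrative deformable-attention fusion of multi-view features
    into BEV queries. A generic sketch, not AppleVLM's architecture."""

    def __init__(self, dim: int = 64, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Each BEV query predicts per-point sampling offsets and weights.
        self.offset_head = nn.Linear(dim, num_points * 2)
        self.weight_head = nn.Linear(dim, num_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, bev_queries, view_feats, ref_points):
        # bev_queries: (B, Q, C) learnable BEV grid queries
        # view_feats:  (B, V, C, H, W) per-camera feature maps
        # ref_points:  (B, Q, V, 2) in [-1, 1]; each BEV cell projected
        #              into each camera view (assumed given by calibration)
        B, Q, C = bev_queries.shape
        V = view_feats.shape[1]
        offsets = self.offset_head(bev_queries).view(B, Q, 1, self.num_points, 2)
        weights = self.weight_head(bev_queries).softmax(dim=-1)  # (B, Q, P)
        # Sampling locations: reference point plus small learned offsets.
        locs = ref_points.unsqueeze(3) + 0.1 * offsets.tanh()    # (B, Q, V, P, 2)
        per_view = []
        for v in range(V):
            grid = locs[:, :, v]                                 # (B, Q, P, 2)
            per_view.append(F.grid_sample(view_feats[:, v], grid,
                                          align_corners=False))  # (B, C, Q, P)
        feat = torch.stack(per_view).mean(dim=0)                 # average over views
        fused = (feat * weights.unsqueeze(1)).sum(dim=-1)        # weight over points
        return self.proj(fused.transpose(1, 2))                  # (B, Q, C)

if __name__ == "__main__":
    B, Q, C, V, H, W = 1, 16, 64, 4, 8, 8
    fusion = SimpleDeformableFusion(dim=C)
    bev = torch.randn(B, Q, C)
    feats = torch.randn(B, V, C, H, W)
    refs = torch.rand(B, Q, V, 2) * 2 - 1   # placeholder projected reference points
    print(fusion(bev, feats, refs).shape)   # torch.Size([1, 16, 64])
```

In a real system the reference points would come from camera calibration rather than random values, and the temporal half of the spatial-temporal fusion would presumably recur over past BEV states; neither is specified in the abstract.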

