CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?
By: Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang
Published: 2025-12-19
arXiv: cs.AI
Abstract
CitySeeker investigates how Vision-Language Models (VLMs) can perform embodied urban navigation while implicitly understanding and addressing human needs. We propose a framework that integrates visual perception, language understanding, and commonsense reasoning, enabling VLMs to navigate complex urban environments, interpret human instructions, and make decisions that respect user preferences and safety constraints. This research has significant implications for developing intelligent assistive navigation systems and autonomous vehicles.
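The abstract describes a closed perception-decision-action loop in which a VLM is repeatedly queried with the current observation and the user's instruction. The sketch below illustrates one plausible form of such a loop; it is not the paper's implementation, and all names (`query_vlm`, the environment interface, the discrete action set) are assumptions introduced for illustration.

```python
# Minimal sketch (not the CitySeeker implementation) of an embodied
# urban-navigation loop: at each step a VLM is shown the current street-view
# frame plus the instruction and returns a discrete action.
from dataclasses import dataclass
from typing import List, Tuple

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]  # assumed action space

@dataclass
class Observation:
    image_path: str              # current street-view frame (assumed format)
    position: Tuple[float, float]  # (lat, lon) reported by the simulator

def query_vlm(image_path: str, instruction: str, history: List[str]) -> str:
    """Placeholder for a VLM call; a real system would send the image and a
    prompt containing the instruction, candidate actions, and step history,
    then parse the chosen action from the model's text output."""
    # ... model-specific API call would go here ...
    return "forward"

def navigate(env, instruction: str, max_steps: int = 50) -> List[str]:
    """Run the perception-decision-action loop until the model stops."""
    history: List[str] = []
    obs = env.reset()
    for _ in range(max_steps):
        action = query_vlm(obs.image_path, instruction, history)
        if action not in ACTIONS:
            action = "stop"      # fall back on unparseable model output
        history.append(action)
        if action == "stop":
            break
        obs = env.step(action)   # simulator advances the agent one step
    return history
```

The loop treats the VLM as the policy: commonsense reasoning and preference handling would live inside the prompt and the model, while the environment only supplies observations and executes low-level actions.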