CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?
By: Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang
Published: 2025-12-19
arXiv: cs.AI
Abstract
CitySeeker investigates how Vision-Language Models (VLMs) can perform embodied urban navigation while implicitly understanding and addressing human needs. We propose a framework that integrates visual perception, language understanding, and commonsense reasoning, enabling VLMs to navigate complex urban environments, interpret human instructions, and make decisions that respect user preferences and safety constraints. This research has significant implications for developing intelligent assistive navigation systems and autonomous vehicles.
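The abstract describes a closed perception-decision-action loop in which a VLM is repeatedly queried with the current observation and the user's instruction. The sketch below illustrates one plausible form of such a loop; it is not the paper's implementation, and all names (`query_vlm`, the environment interface, the discrete action set) are assumptions introduced for illustration.

```python
# Minimal sketch (not the CitySeeker implementation) of an embodied
# urban-navigation loop: at each step a VLM is shown the current street-view
# frame plus the instruction and returns a discrete action.
from dataclasses import dataclass
from typing import List, Tuple

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]  # assumed action space

@dataclass
class Observation:
    image_path: str              # current street-view frame (assumed format)
    position: Tuple[float, float]  # (lat, lon) reported by the simulator

def query_vlm(image_path: str, instruction: str, history: List[str]) -> str:
    """Placeholder for a VLM call; a real system would send the image and a
    prompt containing the instruction, candidate actions, and step history,
    then parse the chosen action from the model's text output."""
    # ... model-specific API call would go here ...
    return "forward"

def navigate(env, instruction: str, max_steps: int = 50) -> List[str]:
    """Run the perception-decision-action loop until the model stops."""
    history: List[str] = []
    obs = env.reset()
    for _ in range(max_steps):
        action = query_vlm(obs.image_path, instruction, history)
        if action not in ACTIONS:
            action = "stop"      # fall back on unparseable model output
        history.append(action)
        if action == "stop":
            break
        obs = env.step(action)   # simulator advances the agent one step
    return history
```

The loop treats the VLM as the policy: commonsense reasoning and preference handling would live inside the prompt and the model, while the environment only supplies observations and executes low-level actions.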