BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
By: Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Dan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang, Bin Cui
Published: 2026-02-16
Abstract
Multimodal Large Language Models (MLLMs) are evolving into autonomous agents capable of multimodal web browsing and deep search. Existing benchmarks fall short on task complexity, evidence accessibility, and evaluation granularity. This paper introduces BrowseComp-V^3, a benchmark of 300 challenging questions across diverse domains that demand deep, multi-layered, cross-modal, multi-hop reasoning, with all evidence publicly searchable for reproducibility. Beyond final-answer accuracy, it incorporates an expert-validated, subgoal-driven process evaluation. Experiments show that even state-of-the-art models reach only 36% accuracy, exposing significant bottlenecks in multimodal information integration and fine-grained perception.
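The abstract does not spell out how the subgoal-driven process evaluation is aggregated alongside final-answer accuracy. As a rough illustration only, the Python sketch below assumes each question carries a list of expert-validated subgoals, each judged completed or not, and reports both final-answer accuracy and a mean subgoal completion rate. The `TaskResult` type, the `evaluate` function, and the equal-weight averaging are hypothetical, not taken from the paper.

```python
# Illustrative sketch only: the paper's actual scoring formulas are not given
# in the abstract, so the aggregation below is an assumption.
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """One agent run on one benchmark question (hypothetical schema)."""
    answer_correct: bool  # whether the final answer matched the reference
    # Expert-validated subgoals, each marked completed (True) or not (False);
    # assumed non-empty for every question.
    subgoals_completed: list[bool] = field(default_factory=list)


def evaluate(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate final-answer accuracy and a subgoal-level process score."""
    n = len(results)
    accuracy = sum(r.answer_correct for r in results) / n
    # Process score: per-run fraction of completed subgoals, averaged over runs.
    process = sum(
        sum(r.subgoals_completed) / len(r.subgoals_completed) for r in results
    ) / n
    return {"accuracy": accuracy, "subgoal_process_score": process}


# Example: one fully correct run, one run that completed only 1 of 3 subgoals.
runs = [
    TaskResult(answer_correct=True, subgoals_completed=[True, True, True]),
    TaskResult(answer_correct=False, subgoals_completed=[True, False, False]),
]
print(evaluate(runs))  # {'accuracy': 0.5, 'subgoal_process_score': 0.666...}
```

A process score of this kind rewards partial progress that a binary final-answer metric hides, which is presumably why the benchmark pairs the two rather than reporting accuracy alone.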