BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
By: Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Dan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang, Bin Cui
Published: 2026-02-16
Abstract
Multimodal Large Language Models (MLLMs) are evolving into autonomous agents capable of multimodal web browsing and deep search. Existing benchmarks fall short on task complexity, evidence accessibility, and evaluation granularity. This paper introduces BrowseComp-V^3, a benchmark of 300 challenging questions across diverse domains that demand deep, multi-layered, cross-modal, multi-hop reasoning, with all evidence publicly searchable for reproducibility. Beyond final-answer accuracy, it incorporates an expert-validated, subgoal-driven process evaluation. Experiments show that even state-of-the-art models reach only 36% accuracy, exposing significant bottlenecks in multimodal information integration and fine-grained perception.
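The abstract does not spell out how the subgoal-driven process evaluation is aggregated alongside final-answer accuracy. As a rough illustration only, the Python sketch below assumes each question carries a list of expert-validated subgoals, each judged completed or not, and reports both final-answer accuracy and a mean subgoal completion rate. The `TaskResult` type, the `evaluate` function, and the equal-weight averaging are hypothetical, not taken from the paper.

```python
# Illustrative sketch only: the paper's actual scoring formulas are not given
# in the abstract, so the aggregation below is an assumption.
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """One agent run on one benchmark question (hypothetical schema)."""
    answer_correct: bool  # whether the final answer matched the reference
    # Expert-validated subgoals, each marked completed (True) or not (False);
    # assumed non-empty for every question.
    subgoals_completed: list[bool] = field(default_factory=list)


def evaluate(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate final-answer accuracy and a subgoal-level process score."""
    n = len(results)
    accuracy = sum(r.answer_correct for r in results) / n
    # Process score: per-run fraction of completed subgoals, averaged over runs.
    process = sum(
        sum(r.subgoals_completed) / len(r.subgoals_completed) for r in results
    ) / n
    return {"accuracy": accuracy, "subgoal_process_score": process}


# Example: one fully correct run, one run that completed only 1 of 3 subgoals.
runs = [
    TaskResult(answer_correct=True, subgoals_completed=[True, True, True]),
    TaskResult(answer_correct=False, subgoals_completed=[True, False, False]),
]
print(evaluate(runs))  # {'accuracy': 0.5, 'subgoal_process_score': 0.666...}
```

A process score of this kind rewards partial progress that a binary final-answer metric hides, which is presumably why the benchmark pairs the two rather than reporting accuracy alone.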