CodeVision: A Code-as-Tool Framework for Multimodal Large Language Models
By: Wang Kai, Zhu Ling, Chen Hao
Published: 2025-12-04
Category: cs.AI
Abstract
Researchers from Zhejiang University and ByteDance introduce CodeVision, a "code-as-tool" framework that equips Multimodal Large Language Models (MLLMs) to interact with images programmatically. The approach makes MLLMs more robust to common image corruptions, which the model corrects through code, and it achieves state-of-the-art multi-tool reasoning, with emergent tool use and error recovery.
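To make the "code-as-tool" idea concrete, here is a minimal sketch of the general pattern, not CodeVision's actual implementation. It assumes a toy image stored as a grid of integers, a single hypothetical tool `rotate90_cw`, and a fixed list of code strings standing in for what a model would generate. The helper names `run_tool_code` and `solve` are illustrative only. The first attempt calls a tool that does not exist, the error is captured, and the second attempt undoes a rotation corruption.

```python
def rotate90_cw(img):
    """Rotate a row-major grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def run_tool_code(code, image):
    """Run a code string against `image`; return (result, error_message)."""
    # Hypothetical sandbox: the code sees only the image and the tool.
    namespace = {"image": image, "rotate90_cw": rotate90_cw}
    try:
        exec(code, namespace)
        return namespace["result"], None
    except Exception as exc:
        # The error text is what a model would read before retrying.
        return None, f"{type(exc).__name__}: {exc}"


def solve(attempts, image):
    """Try each code attempt in order and stop at the first success."""
    errors = []
    for code in attempts:
        result, error = run_tool_code(code, image)
        if error is None:
            return result, errors
        errors.append(error)
    return None, errors


upright = [[1, 2, 3], [4, 5, 6]]
corrupted = rotate90_cw(upright)  # simulate a rotation corruption

# Assumption: these strings stand in for successive model outputs.
attempts = [
    "result = rotate_ccw(image)",  # fails: no such tool
    "result = rotate90_cw(rotate90_cw(rotate90_cw(image)))",
]
restored, errors = solve(attempts, corrupted)
print(restored == upright, errors)
```

In a real system the second attempt would be generated after the model reads the first error, rather than being fixed in advance as it is here.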