CodeVision: A Code-as-Tool Framework for Multimodal Large Language Models

Researchers from Zhejiang University and ByteDance introduced CodeVision, a 'code-as-tool' framework that equips Multimodal Large Language Models (MLLMs) to programmatically interact with images. The approach significantly improves MLLM robustness by correcting common image corruptions and enables state-of-the-art multi-tool reasoning through emergent tool use and error recovery.
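To make the 'code-as-tool' idea concrete, here is a minimal sketch of the general pattern: the model writes a short program, a harness runs it against the image, and any error is returned so the model can try again. This is not CodeVision's actual implementation. The names `rotate90_cw` and `run_tool_code`, the toy list-of-rows image, and the two hard-coded "model attempts" are all hypothetical and stand in for a real image library and a real MLLM.

```python
from typing import Optional

# Toy grayscale image: a list of rows of pixel values (assumption for illustration).
Image = list[list[int]]


def rotate90_cw(img: Image) -> Image:
    """Rotate a row-major image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def run_tool_code(code: str, img: Image) -> tuple[Optional[Image], Optional[str]]:
    """Run model-written code against the image.

    The code is expected to read `image` and assign `result`.
    Returns (result, None) on success or (None, error message) on failure,
    so the error text can be fed back to the model for another attempt.
    NOTE: plain exec() is not a sandbox; a real system would isolate this.
    """
    namespace = {"image": img, "rotate90_cw": rotate90_cw}
    try:
        exec(code, namespace)
        return namespace["result"], None
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"


upright = [[1, 2], [3, 4]]
corrupted = rotate90_cw(upright)  # simulate a rotation corruption

# Hypothetical first attempt: calls a tool name that does not exist.
first_try = "result = rotate_90(image)"
# Hypothetical second attempt after seeing the error: three more
# clockwise turns complete a full rotation and undo the corruption.
second_try = (
    "result = image\n"
    "for _ in range(3):\n"
    "    result = rotate90_cw(result)\n"
)

out1, err1 = run_tool_code(first_try, corrupted)
out2, err2 = run_tool_code(second_try, corrupted)
```

In this sketch the first attempt fails and yields only an error string, while the second restores the upright image. Returning the error instead of raising it is the design choice that leaves room for the recovery behavior the summary describes.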
