CodeVision: A Code-as-Tool Framework for Multimodal Large Language Models

By: Wang Kai, Zhu Ling, Chen Hao

Published: 2025-12-04

#cs.AI

Abstract

Researchers from Zhejiang University and ByteDance introduce CodeVision, a "code-as-tool" framework that equips Multimodal Large Language Models (MLLMs) to interact with images programmatically. The authors report that the approach makes MLLMs markedly more robust by letting them correct common image corruptions, and that it achieves state-of-the-art multi-tool reasoning, with tool use and error recovery emerging as learned behaviors.
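
To make the "code-as-tool" idea concrete, here is a minimal sketch of how a model-written snippet might be executed against an image, with errors fed back so a later attempt can recover. It is an illustration under stated assumptions, not the CodeVision implementation: the image is a plain list of pixel rows, and the names rotate90, flip_horizontal, run_snippet, and solve are hypothetical. The candidate snippets stand in for successive model outputs.

```python
def rotate90(image):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a list-of-rows image left to right."""
    return [row[::-1] for row in image]


# Hypothetical tool set exposed to the generated code.
TOOLS = {"rotate90": rotate90, "flip_horizontal": flip_horizontal}


def run_snippet(snippet, image):
    """Run one generated snippet; return (image, error_message_or_None)."""
    namespace = {"image": image, **TOOLS}
    try:
        # exec is used only for illustration; a real system would sandbox this.
        exec(snippet, namespace)
    except Exception as exc:
        # On failure, keep the input image and report the error text.
        return image, f"{type(exc).__name__}: {exc}"
    return namespace["image"], None


def solve(candidate_snippets, image):
    """Try snippets in order, collecting errors, until one runs cleanly."""
    errors = []
    for snippet in candidate_snippets:
        image, error = run_snippet(snippet, image)
        if error is None:
            return image, errors
        errors.append(error)
    return image, errors


if __name__ == "__main__":
    # A 2x2 "image" that was rotated 90 degrees counter-clockwise.
    corrupted = [[2, 4], [1, 3]]
    # First attempt calls a tool that does not exist; second attempt recovers.
    attempts = ["image = rotate(image)", "image = rotate90(image)"]
    fixed, errors = solve(attempts, corrupted)
    print(fixed)
    print(errors)
```

In this toy run the first snippet fails because it calls an undefined tool, the error string is recorded, and the second snippet restores the upright image. In a real system the error text would be returned to the model as context for its next attempt.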
