Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

By: Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li

Published: 2026-01-27

#cs.AI

Abstract

Tencent researchers introduced Youtu-VL, a vision-language model framework that addresses fine-grained visual information loss through a "vision-as-target" optimization paradigm, achieving competitive performance across 75 benchmarks with significantly reduced hallucination. The model unifies a wide range of vision-centric and general multimodal tasks within a standard architecture, establishing a robust foundation for generalist visual agents.
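
The abstract names a "vision-as-target" optimization paradigm but does not spell out the objective. The sketch below illustrates one plausible reading, in which visual tokens are supervised as prediction targets alongside the usual text-token loss; the function name `unified_vl_loss`, the MSE visual term, and the `vision_weight` balance are illustrative assumptions, not the paper's confirmed formulation.

```python
# Hypothetical sketch of a "vision-as-target" training objective.
# Assumption: alongside the usual next-token cross-entropy on text,
# the model is also supervised to predict visual targets (e.g. patch
# features), so fine-grained visual information is not discarded.
# Names, shapes, and the weighting scheme are illustrative only.

import torch
import torch.nn.functional as F


def unified_vl_loss(
    text_logits: torch.Tensor,      # (B, T_text, vocab) predicted text-token logits
    text_labels: torch.Tensor,      # (B, T_text) ground-truth token ids, -100 = ignore
    vision_preds: torch.Tensor,     # (B, T_vis, D) predicted visual-token embeddings
    vision_targets: torch.Tensor,   # (B, T_vis, D) target visual features (e.g. frozen encoder)
    vision_weight: float = 1.0,     # hypothetical balance between the two terms
) -> torch.Tensor:
    """Combine standard language-modeling loss with a visual-target loss."""
    # Standard autoregressive text supervision (vision as input, text as target).
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_labels.reshape(-1),
        ignore_index=-100,
    )
    # "Vision-as-target" term: regress predicted visual tokens onto target features.
    vis_loss = F.mse_loss(vision_preds, vision_targets)
    return lm_loss + vision_weight * vis_loss


if __name__ == "__main__":
    # Toy shapes, just to show the call signature.
    B, T_text, V, T_vis, D = 2, 16, 32000, 64, 1024
    loss = unified_vl_loss(
        text_logits=torch.randn(B, T_text, V),
        text_labels=torch.randint(0, V, (B, T_text)),
        vision_preds=torch.randn(B, T_vis, D),
        vision_targets=torch.randn(B, T_vis, D),
    )
    print(loss.item())
```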
