EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

By: Li Wei, Chen Jing, Wang Yong

Published: 2025-12-04

#cs.AI

Abstract

Researchers at Huawei developed EMMA, a unified multimodal architecture for understanding, generation, and editing that uses 32x visual token compression and channel-wise feature fusion to improve efficiency. The model achieves 79.6% average accuracy across 11 understanding benchmarks and a 0.91 GenEval score for text-to-image generation, and it demonstrates emergent cross-lingual and multi-step instruction-following capabilities.
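The abstract names two efficiency mechanisms: 32x visual token compression and channel-wise feature fusion. The sketch below illustrates one plausible reading of each idea in NumPy; the function names, tensor shapes, and the grouping-plus-projection scheme are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, factor: int = 32) -> np.ndarray:
    """Illustrative token compression (assumed scheme, not EMMA's):
    group `factor` consecutive visual tokens and stack their features
    channel-wise, shrinking the sequence length by `factor`.
    (num_tokens, dim) -> (num_tokens // factor, dim * factor)."""
    n, d = tokens.shape
    assert n % factor == 0, "token count must be divisible by the factor"
    return tokens.reshape(n // factor, factor * d)

def channel_fuse(vision_feats: np.ndarray,
                 text_feats: np.ndarray,
                 w: np.ndarray) -> np.ndarray:
    """Illustrative channel-wise fusion: concatenate the two modalities
    along the feature (channel) axis, then project back to the model
    dimension with a linear map `w` of shape (dv + dt, d_model)."""
    fused = np.concatenate([vision_feats, text_feats], axis=-1)  # (n, dv + dt)
    return fused @ w  # (n, d_model)

# Toy example: 1024 visual tokens of dim 64 compress to 32 tokens of dim 2048.
rng = np.random.default_rng(0)
vis = rng.standard_normal((1024, 64))
compressed = compress_tokens(vis, factor=32)      # (32, 2048)
txt = rng.standard_normal((32, 16))               # aligned text features
w = rng.standard_normal((2048 + 16, 128))         # fusion projection
out = channel_fuse(compressed, txt, w)            # (32, 128)
```

The appeal of this kind of design, as the abstract suggests, is that the language model processes 32x fewer visual tokens per image, which cuts attention cost roughly quadratically in that factor.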
