EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
By: Li Wei, Chen Jing, Wang Yong
Published: 2025-12-04
#cs.AI
Abstract
Researchers at Huawei Inc. developed EMMA, a unified multimodal architecture for understanding, generation, and editing that uses 32x visual token compression and channel-wise feature fusion to improve efficiency. The model achieved 79.6% average accuracy across 11 understanding benchmarks and a GenEval score of 0.91 for text-to-image generation, and it demonstrated emergent cross-lingual and multi-step instruction-following capabilities.
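
The abstract does not say how the compression or the fusion is implemented, so the following is only a minimal sketch of why these two choices could reduce sequence length. It assumes that "32x" refers to per-side spatial downsampling of the image, and that channel-wise fusion means concatenating two aligned token sequences along the feature axis instead of appending one sequence to the other. Neither assumption is confirmed by the text above, and all function names are hypothetical.

```python
# Illustrative sketch only; not EMMA's actual implementation.
# Assumption: "32x" means each image side is downsampled by a factor of 32.

def token_count(height, width, downsample):
    """Number of visual tokens when each side is reduced by `downsample`."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (height // downsample) * (width // downsample)


def fuse_token_wise(a, b):
    """Append sequence b to sequence a: token counts add, channels unchanged."""
    return a + b


def fuse_channel_wise(a, b):
    """Concatenate features per token: token count unchanged, channels add."""
    if len(a) != len(b):
        raise ValueError("channel-wise fusion needs aligned token sequences")
    return [tok_a + tok_b for tok_a, tok_b in zip(a, b)]


if __name__ == "__main__":
    # Token budget for a 1024x1024 image at two hypothetical downsample factors.
    print(token_count(1024, 1024, 8))   # 16384
    print(token_count(1024, 1024, 32))  # 1024

    # Two toy feature sequences: 4 tokens, 2 channels each.
    sem = [[1.0, 2.0] for _ in range(4)]
    pix = [[3.0, 4.0] for _ in range(4)]

    tw = fuse_token_wise(sem, pix)
    cw = fuse_channel_wise(sem, pix)
    print(len(tw), len(tw[0]))  # 8 2
    print(len(cw), len(cw[0]))  # 4 4
```

Under these assumptions, channel-wise fusion keeps the sequence at 4 tokens where token-wise concatenation would double it to 8, which is the kind of saving that matters when attention cost grows with sequence length.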