BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation

By: Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, Wei Li, Jianyu Chen

Published: 2026-02-10

Subjects: cs.AI

Abstract

Equipping embodied agents with the ability to reason about tasks, foresee physical outcomes, and generate precise actions is essential for general-purpose manipulation. BagelVLA is a unified model that integrates linguistic planning, visual forecasting, and action generation within a single framework to enhance long-horizon manipulation. Initialized from a pretrained unified understanding-and-generation model, BagelVLA interleaves textual reasoning and visual prediction directly into the action execution loop. It introduces Residual Flow Guidance (RFG) for efficient modality coupling, using single-step denoising to produce predictive visual features with minimal latency. Extensive experiments demonstrate that BagelVLA significantly outperforms baselines on simulated and real-world benchmarks, especially on multi-stage reasoning tasks.
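To make the single-step denoising idea concrete, below is a minimal sketch assuming a rectified-flow formulation: a velocity network predicts the flow direction at t = 0, and one Euler step across the full [0, 1] interval maps Gaussian noise to an approximate future visual feature. All names, shapes, and the conditioning scheme (`SingleStepFlowHead`, `velocity_net`, `feat_dim`, `cond_dim`) are illustrative assumptions, not the paper's actual RFG implementation.

```python
import torch
import torch.nn as nn


class SingleStepFlowHead(nn.Module):
    """Hypothetical sketch of single-step flow denoising for
    predictive visual features (names and shapes are assumptions,
    not the paper's RFG module)."""

    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        self.feat_dim = feat_dim
        # Velocity network v(x_t, t | cond): takes the noisy feature,
        # a conditioning vector (e.g. current observation/plan tokens),
        # and the scalar time t.
        self.velocity_net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, 4 * feat_dim),
            nn.SiLU(),
            nn.Linear(4 * feat_dim, feat_dim),
        )

    @torch.no_grad()
    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        b = cond.shape[0]
        # Start from pure Gaussian noise at t = 0.
        x0 = torch.randn(b, self.feat_dim, device=cond.device)
        t = torch.zeros(b, 1, device=cond.device)
        v = self.velocity_net(torch.cat([x0, cond, t], dim=-1))
        # One Euler step across the whole interval: x1 ≈ x0 + (1 - 0) * v.
        return x0 + v


# Usage (illustrative dimensions):
head = SingleStepFlowHead(feat_dim=512, cond_dim=768)
pred_feat = head(torch.randn(4, 768))  # -> (4, 512) predicted visual features
```

Collapsing an iterative sampler into one Euler step trades some generative fidelity for latency, which is what would make interleaving visual prediction directly into the action execution loop practical.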
