BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation

By: Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, Wei Li, Jianyu Chen

Published: 2026-02-10

Subjects: cs.AI

Abstract

Equipping embodied agents with the ability to reason about tasks, foresee physical outcomes, and generate precise actions is essential for general-purpose manipulation. BagelVLA is a unified model that integrates linguistic planning, visual forecasting, and action generation within a single framework to enhance long-horizon manipulation. Initialized from a pretrained unified understanding-and-generation model, BagelVLA interleaves textual reasoning and visual prediction directly into the action execution loop. It introduces Residual Flow Guidance (RFG) for efficient modality coupling, using single-step denoising to produce predictive visual features with minimal latency. Extensive experiments demonstrate that BagelVLA significantly outperforms baselines on simulated and real-world benchmarks, especially on multi-stage reasoning tasks.
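To make the single-step denoising idea concrete, below is a minimal sketch assuming a rectified-flow formulation: a velocity network predicts the flow direction at t = 0, and one Euler step across the full [0, 1] interval maps Gaussian noise to an approximate future visual feature. All names, shapes, and the conditioning scheme (`SingleStepFlowHead`, `velocity_net`, `feat_dim`, `cond_dim`) are illustrative assumptions, not the paper's actual RFG implementation.

```python
import torch
import torch.nn as nn


class SingleStepFlowHead(nn.Module):
    """Hypothetical sketch of single-step flow denoising for
    predictive visual features (names and shapes are assumptions,
    not the paper's RFG module)."""

    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        self.feat_dim = feat_dim
        # Velocity network v(x_t, t | cond): takes the noisy feature,
        # a conditioning vector (e.g. current observation/plan tokens),
        # and the scalar time t.
        self.velocity_net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, 4 * feat_dim),
            nn.SiLU(),
            nn.Linear(4 * feat_dim, feat_dim),
        )

    @torch.no_grad()
    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        b = cond.shape[0]
        # Start from pure Gaussian noise at t = 0.
        x0 = torch.randn(b, self.feat_dim, device=cond.device)
        t = torch.zeros(b, 1, device=cond.device)
        v = self.velocity_net(torch.cat([x0, cond, t], dim=-1))
        # One Euler step across the whole interval: x1 ≈ x0 + (1 - 0) * v.
        return x0 + v


# Usage (illustrative dimensions):
head = SingleStepFlowHead(feat_dim=512, cond_dim=768)
pred_feat = head(torch.randn(4, 768))  # -> (4, 512) predicted visual features
```

Collapsing an iterative sampler into one Euler step trades some generative fidelity for latency, which is what would make interleaving visual prediction directly into the action execution loop practical.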
