DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
By: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu
Published: 2026-01-29
Abstract
DynamicVLA is a novel framework for dynamic object manipulation that addresses the challenges Vision-Language-Action (VLA) models face in scenarios requiring rapid perception and continuous control of moving objects. It features a compact 0.4B-parameter VLA model with a convolutional vision encoder for efficient inference, Continuous Inference for low-latency adaptation, and Latent-aware Action Streaming for temporal alignment. The paper also introduces the Dynamic Object Manipulation (DOM) benchmark, a new dataset for evaluating dynamic manipulation tasks. This framework significantly improves response speed, perception, and generalization, offering a unified solution for robust dynamic object manipulation in robotics.
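To make the architectural idea concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a compact VLA-style policy: a convolutional vision encoder fused with a language embedding that emits a short chunk of continuous actions, which a control loop consumes as a stream and re-plans from fresh observations. All module sizes, names, chunk lengths, and the re-planning interval are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a compact VLA policy with action streaming.
import torch
import torch.nn as nn


class TinyVLAPolicy(nn.Module):
    def __init__(self, action_dim: int = 7, chunk_len: int = 8, lang_dim: int = 256):
        super().__init__()
        # Convolutional vision encoder: cheap per-frame features for fast inference.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B, 64)
        )
        # Fuse visual features with a language-instruction embedding,
        # then predict a short chunk of continuous actions.
        self.fuse = nn.Sequential(
            nn.Linear(64 + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, image: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        feat = torch.cat([self.vision(image), lang], dim=-1)
        return self.fuse(feat).view(-1, self.chunk_len, self.action_dim)


# Streaming-style control loop: re-infer every few steps so new observations of the
# moving object can override the remainder of the previously predicted action chunk.
policy = TinyVLAPolicy()
lang = torch.zeros(1, 256)            # placeholder instruction embedding
for step in range(4):
    frame = torch.rand(1, 3, 96, 96)  # latest camera frame (dummy data here)
    chunk = policy(frame, lang)       # (1, chunk_len, action_dim)
    executed = chunk[0, :2]           # execute only the first actions, then re-infer
    print(step, executed.shape)
```

The design intent illustrated here is the trade-off the abstract highlights: a small convolutional encoder keeps per-frame inference cheap, and executing only part of each predicted chunk before re-inferring lets the policy track a moving target with low latency.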