DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

By: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu

Published: 2026-01-29

#cs.AI

Abstract

DynamicVLA is a framework for dynamic object manipulation that addresses the challenges Vision-Language-Action (VLA) models face when rapid perception and continuous control of moving objects are required. It combines a compact 0.4B-parameter VLA model with a convolutional vision encoder for efficient inference, Continuous Inference for low-latency adaptation, and Latent-aware Action Streaming for temporal alignment. The paper also introduces the Dynamic Object Manipulation (DOM) benchmark for evaluating dynamic manipulation tasks. Together, these components improve response speed, perception, and generalization, providing a unified solution for robust dynamic object manipulation in robotics.
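
To make the low-latency control idea concrete, below is a minimal sketch of how continuous inference and latency-aware action streaming could be organized: one thread repeatedly re-runs the policy on the latest observation while a fixed-rate control loop executes the freshest action chunk, skipping steps that inference latency has already overtaken. All names here (policy_infer, ActionChunk, CONTROL_DT) are illustrative assumptions, not DynamicVLA's actual API or implementation.

```python
import queue
import threading
import time
from dataclasses import dataclass

# Hypothetical sketch: names and numbers are illustrative, not the paper's code.

@dataclass
class ActionChunk:
    start_time: float  # monotonic time of the observation the chunk was computed from
    actions: list      # one low-level action per control tick

CONTROL_DT = 0.05      # assumed 20 Hz control loop

def policy_infer(observation):
    """Stand-in for the VLA forward pass: returns a short action chunk."""
    time.sleep(0.15)   # simulate inference latency longer than one control tick
    return ActionChunk(start_time=observation["t"], actions=[observation["t"]] * 8)

def inference_loop(get_observation, chunk_queue, stop):
    """Continuously re-run inference on the latest observation."""
    while not stop.is_set():
        obs = get_observation()
        chunk = policy_infer(obs)
        # Keep only the freshest chunk; drop any unconsumed stale one.
        try:
            chunk_queue.get_nowait()
        except queue.Empty:
            pass
        chunk_queue.put(chunk)

def control_loop(chunk_queue, send_action, stop):
    """Execute actions at a fixed rate, skipping steps already overtaken by latency."""
    chunk, idx = None, 0
    while not stop.is_set():
        now = time.monotonic()
        try:
            new_chunk = chunk_queue.get_nowait()
            # Index into the new chunk based on how much time passed since its observation.
            elapsed = now - new_chunk.start_time
            idx = min(int(elapsed / CONTROL_DT), len(new_chunk.actions) - 1)
            chunk = new_chunk
        except queue.Empty:
            pass
        if chunk is not None and idx < len(chunk.actions):
            send_action(chunk.actions[idx])
            idx += 1
        time.sleep(CONTROL_DT)

if __name__ == "__main__":
    stop = threading.Event()
    chunks = queue.Queue(maxsize=1)
    get_obs = lambda: {"t": time.monotonic()}
    send = lambda a: print(f"executing action computed at t={a:.2f}")
    threading.Thread(target=inference_loop, args=(get_obs, chunks, stop), daemon=True).start()
    threading.Thread(target=control_loop, args=(chunks, send, stop), daemon=True).start()
    time.sleep(1.0)
    stop.set()
```

The key design point the sketch tries to capture is that perception and action are decoupled: the controller never waits for inference to finish, and each new action chunk is entered at an offset proportional to the inference latency so the executed action stays temporally aligned with the world.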
