DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
By: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu
Published: 2026-01-29
Abstract
DynamicVLA is a novel framework for dynamic object manipulation that addresses the challenges Vision-Language-Action (VLA) models face in scenarios requiring rapid perception and continuous control of moving objects. It features a compact 0.4B-parameter VLA model with a convolutional vision encoder for efficient inference, Continuous Inference for low-latency adaptation, and Latent-aware Action Streaming for temporal alignment. The paper also introduces the Dynamic Object Manipulation (DOM) benchmark, a new dataset for evaluating dynamic manipulation tasks. This framework significantly improves response speed, perception, and generalization, offering a unified solution for robust dynamic object manipulation in robotics.
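To make the architectural idea concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a compact VLA-style policy: a convolutional vision encoder fused with a language embedding that emits a short chunk of continuous actions, which a control loop consumes as a stream and re-plans from fresh observations. All module sizes, names, chunk lengths, and the re-planning interval are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a compact VLA policy with action streaming.
import torch
import torch.nn as nn


class TinyVLAPolicy(nn.Module):
    def __init__(self, action_dim: int = 7, chunk_len: int = 8, lang_dim: int = 256):
        super().__init__()
        # Convolutional vision encoder: cheap per-frame features for fast inference.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B, 64)
        )
        # Fuse visual features with a language-instruction embedding,
        # then predict a short chunk of continuous actions.
        self.fuse = nn.Sequential(
            nn.Linear(64 + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, image: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        feat = torch.cat([self.vision(image), lang], dim=-1)
        return self.fuse(feat).view(-1, self.chunk_len, self.action_dim)


# Streaming-style control loop: re-infer every few steps so new observations of the
# moving object can override the remainder of the previously predicted action chunk.
policy = TinyVLAPolicy()
lang = torch.zeros(1, 256)            # placeholder instruction embedding
for step in range(4):
    frame = torch.rand(1, 3, 96, 96)  # latest camera frame (dummy data here)
    chunk = policy(frame, lang)       # (1, chunk_len, action_dim)
    executed = chunk[0, :2]           # execute only the first actions, then re-infer
    print(step, executed.shape)
```

The design intent illustrated here is the trade-off the abstract highlights: a small convolutional encoder keeps per-frame inference cheap, and executing only part of each predicted chunk before re-inferring lets the policy track a moving target with low latency.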