Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
By: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
Published: 2026-02-13
Abstract
General-purpose robots must be able to understand and execute natural language instructions, yet Vision-Language-Action (VLA) models often produce actions misaligned with the instructions they are given. This paper explores test-time verification as a way to close this 'intention-action gap.' It shows that jointly scaling the number of rephrased instructions and generated actions substantially increases test-time sample diversity, making it more likely that a correct action is recovered. The paper introduces CoVer, a contrastive verifier for VLA alignment, and demonstrates that the framework scales gracefully with compute and data. Compared to scaling policy pre-training, CoVer achieves substantial gains on the SIMPLER benchmark, 22% in-distribution and 13% out-of-distribution, with a further 45% improvement in real-world experiments.
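The test-time procedure the abstract describes, jointly sampling rephrased instructions and candidate actions, then using a verifier to select the best-aligned action, can be sketched as a best-of-N loop. This is a minimal illustration, not the paper's implementation: the function names (`rephrase`, `verify_and_select`), the mean-score aggregation across rephrasings, and the toy stand-ins for the policy and verifier are all assumptions made for the sake of a runnable example.

```python
def rephrase(instruction, n):
    # Hypothetical stand-in for an instruction rephraser (e.g. an LLM);
    # here we simply tag the instruction with a variant index.
    return [f"{instruction} (variant {i})" for i in range(n)]

def sample_actions(policy, instruction, n):
    # Draw n candidate actions from the (possibly stochastic) policy.
    return [policy(instruction) for _ in range(n)]

def verify_and_select(verifier, rephrased, actions):
    # Score every (rephrased instruction, action) pair with the verifier,
    # aggregate per action by mean score, and return the best action.
    best_action, best_score = None, float("-inf")
    for action in actions:
        score = sum(verifier(r, action) for r in rephrased) / len(rephrased)
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score

if __name__ == "__main__":
    # Toy example: actions are scalars, the verifier prefers values near 3.
    candidates = [1, 2, 3, 4]
    variants = rephrase("pick up the cup", 3)
    toy_verifier = lambda instr, a: -abs(a - 3)
    best, score = verify_and_select(toy_verifier, variants, candidates)
    print(best)  # → 3
```

The key point from the abstract is that diversity comes from scaling both axes, rephrasings and actions, so the verifier sees a richer pool of candidates than sampling actions alone would give.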