QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

By: Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-fei, Ehsan Adeli

Published: 2025-12-23

Subjects: cs.AI

Abstract

Vision-Language Models (VLMs) have shown remarkable progress, but their ability to reason about the physical world, which is crucial for real-world applications such as robotics, remains underexplored. This paper introduces QuantiPhy, a quantitative benchmark designed to evaluate the physical reasoning capabilities of VLMs. QuantiPhy assesses how well VLMs understand and predict the outcomes of physical interactions, such as object stability, movement, and collision, from visual input. The benchmark provides a standardized way to measure progress in this critical area, pushing the field toward more robust and intelligent embodied AI systems that can operate effectively in complex physical environments.
