HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
By: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu
Published: 2026-04-13
Subject: cs.AI
Abstract
This paper introduces HiL-Bench, a Human-in-the-Loop Benchmark designed to measure a critical capability of AI agents: help-seeking judgment. It evaluates an agent's ability to recognize when information is missing, ambiguous, or contradictory, and to proactively ask targeted questions rather than acting on unfounded assumptions. The benchmark uses tasks with human-validated blockers that emerge through progressive exploration. Results reveal a significant judgment gap in current frontier models and demonstrate that this selective-escalation skill is trainable.
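To make the escalate-versus-act decision concrete, below is a minimal, hypothetical sketch of how a HiL-Bench-style harness might classify one episode. All names here (`Task`, `AgentResponse`, `score_episode`, the blocker fields) are illustrative assumptions, not the paper's actual interface, which the abstract does not specify.

```python
# Hypothetical sketch of scoring an agent's help-seeking decision.
# None of these names come from the paper; they illustrate the idea of
# checking whether an agent escalates on a human-validated blocker.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    has_blocker: bool          # human-validated: info is missing/ambiguous/contradictory
    blocker_description: str   # what a targeted question should surface

@dataclass
class AgentResponse:
    asked_for_help: bool       # did the agent escalate to a human?
    question: str              # the clarifying question, if any
    final_answer: str          # the agent's answer or action

def score_episode(task: Task, response: AgentResponse) -> str:
    """Classify one episode by the agent's escalation behavior."""
    if task.has_blocker and response.asked_for_help:
        return "correct_escalation"      # recognized the blocker and asked
    if task.has_blocker and not response.asked_for_help:
        return "missed_escalation"       # assumed instead of asking (the judgment gap)
    if not task.has_blocker and response.asked_for_help:
        return "unnecessary_escalation"  # over-asking wastes human attention
    return "correct_autonomy"            # solved the task without help
```

Under this framing, the four outcomes form a confusion matrix over the escalate/act decision, and "selective escalation" means maximizing correct escalations without inflating unnecessary ones.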