HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
By: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu
Published: 2026-04-13
Subject: cs.AI
Abstract
This paper introduces HiL-Bench, a Human-in-the-Loop Benchmark designed to measure a critical capability of AI agents: help-seeking judgment. It evaluates an agent's ability to recognize when information is missing, ambiguous, or contradictory, and to proactively ask targeted questions rather than acting on unfounded assumptions. The benchmark uses tasks with human-validated blockers that emerge through progressive exploration. Results reveal a significant judgment gap in current frontier models and demonstrate that this selective-escalation skill is trainable.
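To make the escalate-versus-act decision concrete, below is a minimal, hypothetical sketch of how a HiL-Bench-style harness might classify one episode. All names here (`Task`, `AgentResponse`, `score_episode`, the blocker fields) are illustrative assumptions, not the paper's actual interface, which the abstract does not specify.

```python
# Hypothetical sketch of scoring an agent's help-seeking decision.
# None of these names come from the paper; they illustrate the idea of
# checking whether an agent escalates on a human-validated blocker.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    has_blocker: bool          # human-validated: info is missing/ambiguous/contradictory
    blocker_description: str   # what a targeted question should surface

@dataclass
class AgentResponse:
    asked_for_help: bool       # did the agent escalate to a human?
    question: str              # the clarifying question, if any
    final_answer: str          # the agent's answer or action

def score_episode(task: Task, response: AgentResponse) -> str:
    """Classify one episode by the agent's escalation behavior."""
    if task.has_blocker and response.asked_for_help:
        return "correct_escalation"      # recognized the blocker and asked
    if task.has_blocker and not response.asked_for_help:
        return "missed_escalation"       # assumed instead of asking (the judgment gap)
    if not task.has_blocker and response.asked_for_help:
        return "unnecessary_escalation"  # over-asking wastes human attention
    return "correct_autonomy"            # solved the task without help
```

Under this framing, the four outcomes form a confusion matrix over the escalate/act decision, and "selective escalation" means maximizing correct escalations without inflating unnecessary ones.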