HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

By: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu

Published: 2026-04-13

#cs.AI

Abstract

This paper introduces HiL-Bench, a Human-in-the-Loop Benchmark designed to measure AI agents' help-seeking judgment: the ability to recognize when information is missing, ambiguous, or contradictory, and to proactively ask targeted questions rather than make assumptions. The benchmark uses tasks with human-validated blockers that emerge through progressive exploration, revealing a significant judgment gap in current frontier models and demonstrating that this selective-escalation skill is trainable.
