Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

By: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah

Published: 2026-03-04

#cs.AI

Abstract

Agentic reasoning models, which leverage external tools for multi-step tasks, hold immense promise but also introduce new safety challenges. A critical aspect of their safe deployment is the ability to intelligently decide when to act and when to refuse an action, especially when faced with uncertain or potentially harmful tool outputs. This paper proposes a novel framework for guarding agentic reasoning models by explicitly training them to learn refusal policies. Our approach integrates a confidence estimation module and a refusal mechanism directly into the agent's decision-making loop. The confidence module assesses the reliability of generated tool calls and intermediate reasoning steps, while the refusal mechanism triggers a safe fallback (e.g., asking for human intervention or re-planning) if confidence drops below a learned threshold. Through extensive experiments on various tool-use benchmarks involving web navigation, API calls, and code execution, we demonstrate that our guarded agent significantly improves safety and robustness, reducing harmful actions and erroneous tool uses by up to 70% while maintaining high task completion rates. This work provides a crucial step towards building more trustworthy and controllable agentic AI systems for real-world applications.
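
To make the described decision loop concrete, below is a minimal Python sketch of one guarded agent step. This is an illustration under assumptions, not the paper's implementation: the names (guarded_step, propose, confidence, execute, fallback) and the 0.7 threshold are hypothetical stand-ins for the learned confidence estimator, the refusal mechanism, and the learned refusal threshold the abstract mentions.

```python
# Hypothetical sketch of a guarded agent step: score a proposed tool call
# with a confidence estimator and refuse (route to a safe fallback) when
# confidence falls below a learned threshold. Names are illustrative only.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    name: str
    args: dict


def guarded_step(
    propose: Callable[[str], ToolCall],            # policy: state -> proposed tool call
    confidence: Callable[[str, ToolCall], float],  # learned confidence estimator
    execute: Callable[[ToolCall], Any],            # runs the tool and returns its output
    fallback: Callable[[str, ToolCall], Any],      # safe fallback, e.g. ask a human or re-plan
    state: str,
    threshold: float = 0.7,                        # learned refusal threshold (assumed value)
) -> Any:
    """Execute one agent step, refusing to act when confidence is too low."""
    call = propose(state)
    score = confidence(state, call)
    if score < threshold:
        # Refusal mechanism: do not execute the uncertain call;
        # hand off to the safe fallback instead.
        return fallback(state, call)
    return execute(call)
```

In this sketch, the fallback stands in for the safe behaviors the abstract lists, such as asking for human intervention or triggering re-planning, while high-confidence calls pass straight through to the tool executor.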
