Consistency of Large Reasoning Models Under Multi-Turn Attacks
By: Yubo Li, Ramayya Krishnan, Rema Padman
Published: 2026-02-13
Abstract
Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness to multi-turn adversarial attacks remains underexplored. This paper evaluates nine frontier reasoning models and finds that while reasoning confers some robustness, every model exhibits distinct vulnerabilities. Trajectory analysis identifies five failure modes, with self-doubt and social conformity together accounting for 50% of failures. The study also finds that Confidence-Aware Response Generation (CARG), an effective defense for standard LLMs, fails for reasoning models because of their overconfidence, suggesting that confidence-based defenses need a fundamental redesign before they can protect reasoning models.
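To make the reported failure mode concrete, below is a minimal sketch of a confidence-gated multi-turn defense in the spirit of CARG. The function names, the 0-to-1 confidence scale, and the threshold rule are illustrative assumptions, not the paper's actual implementation; the point is only that a gate keyed to self-reported confidence cannot fire when a model stays overconfident after capitulating.

```python
# Hypothetical sketch of a CARG-style confidence gate (illustrative only;
# all names, the confidence scale, and the threshold are assumptions,
# not the paper's implementation).

from dataclasses import dataclass


@dataclass
class Turn:
    answer: str
    confidence: float  # model's self-reported confidence in [0, 1]


def carg_defend(initial: Turn, pressured: Turn, threshold: float = 0.7) -> str:
    """Accept the post-pressure answer only if the model still reports
    confidence above `threshold`; otherwise fall back to the initial answer."""
    if pressured.confidence >= threshold:
        return pressured.answer
    return initial.answer


# Failure mode described in the abstract: a reasoning model flips its answer
# under social pressure yet still reports near-maximal confidence, so the
# gate never triggers and the flipped answer passes through unchecked.
initial = Turn(answer="42", confidence=0.95)
pressured = Turn(answer="7 (you're right, I was wrong)", confidence=0.93)
print(carg_defend(initial, pressured))  # prints the flipped answer
```

Under this assumed gating rule, the defense works for a standard LLM whose confidence drops when it capitulates, but for an overconfident reasoning model the threshold is never crossed, which is consistent with the abstract's call to redesign confidence-based defenses rather than re-tune them.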