Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
By: Max Kaufmann, David Lindner, Roland S. Zimmermann, Rohin Shah
Published: 2026-03-31
View on arXiv → (cs.AI)
Abstract
Chain-of-Thought (CoT) monitoring is a promising approach for overseeing AI systems, but training can degrade its 'monitorability' by causing models to hide their reasoning. This paper proposes and empirically validates a conceptual framework for predicting when and why this occurs. It models LLM post-training as an RL environment, classifies reward terms applied to final outputs and to the CoT as 'aligned,' 'orthogonal,' or 'in-conflict,' and finds that training with in-conflict reward terms reduces CoT monitorability.