Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
By: Max Kaufmann, David Lindner, Roland S. Zimmermann, Rohin Shah
Published: 2026-03-31
View on arXiv → (cs.AI)
Abstract
Chain-of-Thought (CoT) monitoring is a promising approach for overseeing AI systems, but training can degrade its 'monitorability' by causing models to hide their reasoning. This paper proposes and empirically validates a conceptual framework for predicting when and why this occurs. It models LLM post-training as an RL environment, classifies reward terms applied to final outputs and to the CoT as 'aligned,' 'orthogonal,' or 'in-conflict,' and finds that training with in-conflict reward terms reduces CoT monitorability.