Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

By: Max Kaufmann, David Lindner, Roland S. Zimmermann, Rohin Shah

Published: 2026-03-31

#cs.AI

Abstract

Chain-of-Thought (CoT) monitoring is a promising approach for overseeing AI systems, but training can degrade its monitorability by causing models to hide their reasoning. This paper proposes and empirically validates a conceptual framework for predicting when and why this occurs. It models LLM post-training as an RL environment and classifies the relationship between reward terms applied to the final output and those applied to the CoT as 'aligned,' 'orthogonal,' or 'in-conflict,' finding that training with 'in-conflict' reward terms reduces CoT monitorability.
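
To make the taxonomy concrete, here is a minimal, hypothetical sketch in Python: it scores sampled rollouts with separate output and CoT reward terms, then classifies their relationship with a correlation heuristic. The function names, the threshold, and the correlation-based test are illustrative assumptions for exposition; the paper's formal definitions of 'aligned,' 'orthogonal,' and 'in-conflict' may differ.

```python
# Illustrative sketch (not the paper's method): classify a CoT reward term
# relative to an output reward term by how their scores co-vary across
# sampled rollouts. Threshold and heuristic are assumptions.
from typing import Callable, List, Tuple
import statistics

Rollout = Tuple[str, str]  # (chain_of_thought, final_output)

def classify_reward_terms(
    rollouts: List[Rollout],
    r_output: Callable[[str], float],  # reward term on the final output
    r_cot: Callable[[str], float],     # reward term on the CoT
    threshold: float = 0.2,            # arbitrary cutoff for illustration
) -> str:
    """Heuristically label r_cot relative to r_output via Pearson
    correlation of their scores over sampled rollouts."""
    out_scores = [r_output(y) for _, y in rollouts]
    cot_scores = [r_cot(c) for c, _ in rollouts]
    corr = statistics.correlation(out_scores, cot_scores)  # Python 3.10+
    if corr > threshold:
        return "aligned"      # raising the CoT reward also raises output reward
    if corr < -threshold:
        return "in-conflict"  # raising the CoT reward lowers output reward
    return "orthogonal"       # the two terms are roughly independent

# Toy usage with length-based rewards: here longer CoTs happen to pair with
# shorter outputs, so the heuristic prints 'in-conflict'.
rollouts = [
    ("short thought", "a long detailed final answer here"),
    ("a much longer chain of thought with many steps", "ok"),
    ("medium-length reasoning trace", "medium answer"),
]
print(classify_reward_terms(rollouts, r_output=len, r_cot=len))
```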
