Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
By: Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt
Published: 2025-12-18
arXiv · cs.AI
Abstract
This paper introduces Predictive Concept Decoders (PCDs), a framework for training scalable, end-to-end interpretability assistants. PCDs provide human-understandable explanations of a model's predictions by mapping its internal activations directly to meaningful concepts, enabling a deeper understanding of complex models. By making model behavior legible in this way, the approach could improve trust and transparency in real-world AI applications.
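The abstract does not specify the decoder's architecture or training objective, but the core idea it describes, a module trained end-to-end to map a model's internal activations onto a vocabulary of human-readable concepts, can be illustrated with a minimal sketch. The following PyTorch snippet is a rough, hypothetical rendering of that idea; the names (ConceptDecoder, hidden_dim, num_concepts), the linear architecture, and the multi-label concept supervision are all assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn as nn

class ConceptDecoder(nn.Module):
    """Hypothetical sketch: maps a model's internal activations
    to logits over a fixed vocabulary of human-readable concepts."""

    def __init__(self, hidden_dim: int, num_concepts: int):
        super().__init__()
        # A single linear probe is the simplest possible decoder;
        # the actual PCD architecture may differ substantially.
        self.proj = nn.Linear(hidden_dim, num_concepts)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, hidden_dim) hidden states captured
        # from the model being interpreted.
        return self.proj(activations)

# Illustrative end-to-end training step. Multi-label concept
# supervision (binary cross-entropy per concept) is assumed here.
decoder = ConceptDecoder(hidden_dim=768, num_concepts=512)
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

activations = torch.randn(32, 768)                        # stand-in for captured activations
concept_labels = torch.randint(0, 2, (32, 512)).float()   # stand-in concept annotations

logits = decoder(activations)
loss = loss_fn(logits, concept_labels)
loss.backward()
optimizer.step()
```

In practice the activations would be hooked out of an intermediate layer of the model under study, and the decoded concepts would then serve as the human-facing explanation of its prediction.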