Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants

By: Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt

Published: 2025-12-18

#cs.AI

Abstract

This paper introduces Predictive Concept Decoders (PCDs), a framework for training scalable end-to-end interpretability assistants. A PCD produces human-understandable explanations of an AI model's predictions by mapping the model's internal activations directly to meaningful concepts, supporting a deeper understanding of complex models. The authors suggest this approach could substantially improve trust and transparency in real-world AI applications.
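To make the core idea concrete, here is a minimal sketch of what "mapping internal activations to concepts" could look like in code. This is a hypothetical illustration, not the paper's method: the module name ConceptDecoder, the MLP architecture, and all dimensions are assumptions; it simply shows a learned map from a model's hidden activations to logits over a concept vocabulary.

```python
import torch
import torch.nn as nn

class ConceptDecoder(nn.Module):
    """Hypothetical sketch: map hidden activations to concept logits."""

    def __init__(self, d_model: int, n_concepts: int, d_hidden: int = 512):
        super().__init__()
        # Small MLP from the interpreted model's activation space to a
        # fixed vocabulary of human-readable concepts (architecture assumed).
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, n_concepts),
        )

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, d_model) hidden states captured from the
        # model being interpreted; returns (batch, n_concepts) logits.
        return self.net(activations)

# Toy usage with random stand-in "activations".
decoder = ConceptDecoder(d_model=768, n_concepts=1000)
acts = torch.randn(4, 768)
concept_logits = decoder(acts)
top_concepts = concept_logits.topk(k=3, dim=-1).indices  # (4, 3) concept ids
```

In practice such a decoder would be trained against some predictive objective, consistent with the "predictive" and "end-to-end" framing in the title, but the paper's actual training setup is not described in this abstract.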
