The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
By: Alex Chen, Benjamin Lee, Catherine Wang, David Kim, Emily Zhao
Published: 2026-03-20
arXiv: cs.AI
Abstract
This paper examines the redundancy of the KV (key-value) cache in Transformer inference, proposing that the residual stream alone may be sufficient to maintain model performance. This finding has significant implications for the memory footprint and computational cost of large language model inference: it could enable deployment of larger models on resource-constrained devices and reduce operational expenses for AI services.
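To make the claim concrete, the sketch below illustrates one way such a scheme could work, under the assumption that the per-layer residual stream (the hidden states entering the attention block) is cached instead of separate key and value tensors, with keys and values re-derived on demand from the layer's projection matrices. This is a minimal illustration, not the paper's method; all names (ResidualStreamCache, W_q, W_k, W_v, attend) are hypothetical.

```python
# Illustrative sketch only: it assumes the idea is to cache the per-layer
# residual stream and re-derive keys/values on demand, trading extra matmuls
# for a smaller cache. All names are hypothetical, not from the paper.
import numpy as np

d_model = 64
rng = np.random.default_rng(0)

# Frozen per-layer projection weights (single-head attention for simplicity).
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

class ResidualStreamCache:
    """Stores one d_model vector per past token, instead of the
    2 * d_model per token a standard KV cache would hold."""
    def __init__(self):
        self.hidden = []  # list of (d_model,) residual-stream vectors

    def append(self, h):
        self.hidden.append(h)

    def keys_values(self):
        # Re-derive K and V from the cached residual stream on demand.
        H = np.stack(self.hidden)      # (seq, d_model)
        return H @ W_k, H @ W_v        # (seq, d_model) each

def attend(h_t, cache):
    """Attention for the newest token against the cached residual stream."""
    cache.append(h_t)
    K, V = cache.keys_values()
    q = h_t @ W_q                          # (d_model,)
    scores = K @ q / np.sqrt(d_model)      # (seq,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # (d_model,)

# Usage: decode three tokens; the cache holds only residual-stream vectors.
cache = ResidualStreamCache()
for _ in range(3):
    h_t = rng.standard_normal(d_model)
    out = attend(h_t, cache)
print(out.shape)  # (64,)
```

Under these assumptions, the memory saving comes from storing one vector per token instead of two; the cost is re-projecting the cached hidden states into keys and values at every decoding step, so whether this is a net win depends on the compute/bandwidth balance of the target hardware.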