The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
By: Alex Chen, Benjamin Lee, Catherine Wang, David Kim, Emily Zhao
Published: 2026-03-20
arXiv: cs.AI
Abstract
This paper examines the redundancy of the KV (key-value) cache in Transformer inference, proposing that the residual stream alone may be sufficient to maintain model performance. This finding has significant implications for the memory footprint and computational cost of large language model inference: it could enable deployment of larger models on resource-constrained devices and reduce operational expenses for AI services.
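To make the claim concrete, the sketch below illustrates one way such a scheme could work, under the assumption that the per-layer residual stream (the hidden states entering the attention block) is cached instead of separate key and value tensors, with keys and values re-derived on demand from the layer's projection matrices. This is a minimal illustration, not the paper's method; all names (ResidualStreamCache, W_q, W_k, W_v, attend) are hypothetical.

```python
# Illustrative sketch only: it assumes the idea is to cache the per-layer
# residual stream and re-derive keys/values on demand, trading extra matmuls
# for a smaller cache. All names are hypothetical, not from the paper.
import numpy as np

d_model = 64
rng = np.random.default_rng(0)

# Frozen per-layer projection weights (single-head attention for simplicity).
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

class ResidualStreamCache:
    """Stores one d_model vector per past token, instead of the
    2 * d_model per token a standard KV cache would hold."""
    def __init__(self):
        self.hidden = []  # list of (d_model,) residual-stream vectors

    def append(self, h):
        self.hidden.append(h)

    def keys_values(self):
        # Re-derive K and V from the cached residual stream on demand.
        H = np.stack(self.hidden)      # (seq, d_model)
        return H @ W_k, H @ W_v        # (seq, d_model) each

def attend(h_t, cache):
    """Attention for the newest token against the cached residual stream."""
    cache.append(h_t)
    K, V = cache.keys_values()
    q = h_t @ W_q                          # (d_model,)
    scores = K @ q / np.sqrt(d_model)      # (seq,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # (d_model,)

# Usage: decode three tokens; the cache holds only residual-stream vectors.
cache = ResidualStreamCache()
for _ in range(3):
    h_t = rng.standard_normal(d_model)
    out = attend(h_t, cache)
print(out.shape)  # (64,)
```

Under these assumptions, the memory saving comes from storing one vector per token instead of two; the cost is re-projecting the cached hidden states into keys and values at every decoding step, so whether this is a net win depends on the compute/bandwidth balance of the target hardware.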