ViT-5: Vision Transformers for the Mid-2020s

By: Feng Wang, Sucheng Ren, Tiezheng Zhang, Predrag Neskovic, Anand Bhattad, Cihang Xie, Alan Yuille

Published: 2026-02-08

#cs.AI

Abstract

This work systematically investigates modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, the architecture is refined component by component, spanning normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens, yielding a new generation of Vision Transformers called ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base achieves 84.2% top-1 accuracy, exceeding DeiT-III-Base. ViT-5 also serves as a stronger backbone for generative modeling, achieving 1.84 FID in an SiT diffusion framework versus 2.06 with a vanilla ViT backbone.
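The abstract does not specify which concrete components ViT-5 adopts, so the sketch below only illustrates what a component-wise refresh of the canonical Attention-FFN block can look like in PyTorch. The particular choices here (RMSNorm in place of LayerNorm, a SwiGLU feed-forward, a sigmoid output gate on the attention branch, and prepended learnable register tokens) are assumptions chosen as typical mid-2020s substitutions, not the configuration reported in the paper; positional encoding (e.g., rotary embeddings) is omitted for brevity.

```python
# Illustrative "modernized" ViT block; component choices are assumptions,
# not the ViT-5 configuration. Requires PyTorch >= 2.4 for nn.RMSNorm.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Feed-forward with a SwiGLU gate instead of the classic GELU MLP (assumed variant)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w12 = nn.Linear(dim, 2 * hidden, bias=False)  # produces gate and value paths
        self.w3 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.w12(x).chunk(2, dim=-1)
        return self.w3(F.silu(gate) * value)


class ModernViTBlock(nn.Module):
    """Canonical Attention-FFN block with refreshed components (illustrative only)."""
    def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)                 # RMSNorm replacing LayerNorm (assumed)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)              # sigmoid output gate on attention (assumed)
        self.norm2 = nn.RMSNorm(dim)
        self.ffn = SwiGLU(dim, int(dim * mlp_ratio))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + torch.sigmoid(self.gate(h)) * attn_out   # gated residual branch
        x = x + self.ffn(self.norm2(x))                  # pre-norm FFN residual
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)                     # (batch, patches, dim)
    registers = nn.Parameter(torch.zeros(1, 4, 768))      # learnable register tokens (assumed)
    x = torch.cat([registers.expand(2, -1, -1), tokens], dim=1)
    print(ModernViTBlock()(x).shape)                      # torch.Size([2, 200, 768])
```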
