Olaf-World: Orienting Latent Actions for Video World Modeling
By: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
Published: 2026-02-10
Abstract
Scaling action-controllable world models is hindered by the scarcity of action labels. While latent action learning aims to extract control interfaces from unlabeled video, the learned latents often entangle scene-specific cues and therefore fail to transfer across contexts. Olaf-World introduces a framework for pretraining action-conditioned video world models from large-scale passive video. It achieves context-invariant and transferable latent actions by aligning them with observable semantic changes through a sequence-level control-effect alignment objective (SeqΔ-REPA), leading to robust zero-shot action transfer and data-efficient adaptation to new control interfaces.
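The abstract does not specify the exact form of the SeqΔ-REPA objective. As a hedged illustration of the underlying idea only, a toy version of "align each latent action with the semantic change it should explain" could be written as a cosine-alignment loss between per-step latent actions and frame-embedding differences (the function name, shapes, and loss form below are assumptions, not the paper's implementation):

```python
import numpy as np

def seq_delta_alignment_loss(latent_actions, frame_embeddings):
    """Toy sequence-level control-effect alignment loss (illustrative sketch).

    latent_actions:   (T, d) array, latent action for each transition t -> t+1
    frame_embeddings: (T+1, d) array, semantic embedding of each frame

    Encourages each latent action to point in the same direction as the
    observed semantic change (embedding difference) over its transition,
    so the latent captures the control effect rather than scene content.
    """
    # Observable semantic changes between consecutive frames: (T, d)
    deltas = frame_embeddings[1:] - frame_embeddings[:-1]

    # L2-normalize both sides (epsilon guards against zero vectors).
    a = latent_actions / (np.linalg.norm(latent_actions, axis=1, keepdims=True) + 1e-8)
    d = deltas / (np.linalg.norm(deltas, axis=1, keepdims=True) + 1e-8)

    # Per-step cosine alignment, averaged over the sequence; 0 = perfect.
    cos = np.sum(a * d, axis=1)
    return float(np.mean(1.0 - cos))
```

In this sketch the loss is zero when every latent action is exactly parallel to its frame-embedding delta, which is one plausible way a "control-effect alignment" objective could discourage latents from encoding scene-specific cues.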