Olaf-World: Orienting Latent Actions for Video World Modeling
By: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
Published: 2026-02-10
Abstract
Scaling action-controllable world models is hindered by the scarcity of action labels. While latent action learning aims to extract control interfaces from unlabeled video, the learned latents often entangle scene-specific cues and therefore fail to transfer across contexts. Olaf-World introduces a framework for pretraining action-conditioned video world models from large-scale passive video. It achieves context-invariant and transferable latent actions by aligning them with observable semantic changes through a sequence-level control-effect alignment objective (SeqΔ-REPA), leading to robust zero-shot action transfer and data-efficient adaptation to new control interfaces.
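The abstract does not specify the exact form of the SeqΔ-REPA objective. As a hedged illustration of the underlying idea only, a toy version of "align each latent action with the semantic change it should explain" could be written as a cosine-alignment loss between per-step latent actions and frame-embedding differences (the function name, shapes, and loss form below are assumptions, not the paper's implementation):

```python
import numpy as np

def seq_delta_alignment_loss(latent_actions, frame_embeddings):
    """Toy sequence-level control-effect alignment loss (illustrative sketch).

    latent_actions:   (T, d) array, latent action for each transition t -> t+1
    frame_embeddings: (T+1, d) array, semantic embedding of each frame

    Encourages each latent action to point in the same direction as the
    observed semantic change (embedding difference) over its transition,
    so the latent captures the control effect rather than scene content.
    """
    # Observable semantic changes between consecutive frames: (T, d)
    deltas = frame_embeddings[1:] - frame_embeddings[:-1]

    # L2-normalize both sides (epsilon guards against zero vectors).
    a = latent_actions / (np.linalg.norm(latent_actions, axis=1, keepdims=True) + 1e-8)
    d = deltas / (np.linalg.norm(deltas, axis=1, keepdims=True) + 1e-8)

    # Per-step cosine alignment, averaged over the sequence; 0 = perfect.
    cos = np.sum(a * d, axis=1)
    return float(np.mean(1.0 - cos))
```

In this sketch the loss is zero when every latent action is exactly parallel to its frame-embedding delta, which is one plausible way a "control-effect alignment" objective could discourage latents from encoding scene-specific cues.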