Olaf-World: Orienting Latent Actions for Video World Modeling

By: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

Published: 2026-02-10

#cs.AI

Abstract

Scaling action-controllable world models is hindered by the scarcity of action labels. While latent action learning aims to extract control interfaces from unlabeled video, learned latents often entangle scene-specific cues and fail to transfer across contexts. Olaf-World introduces a framework for pretraining action-conditioned video world models from large-scale passive video. It achieves context-invariant and transferable latent actions by aligning them with observable semantic changes through a sequence-level control-effect alignment objective (SeqΔ-REPA), leading to robust zero-shot action transfer and data-efficient adaptation to new control interfaces.
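The abstract does not spell out the SeqΔ-REPA objective, but the idea of aligning a latent action with the observable semantic change it should explain can be sketched roughly as follows. Everything here is an assumption for illustration: the function name, the use of per-step feature differences as the "control effect," the linear projection, and the cosine-similarity loss are all hypothetical stand-ins, not the paper's actual formulation.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def seq_control_effect_alignment(latent_actions, frame_feats, proj):
    """Hypothetical sketch of a sequence-level control-effect alignment loss.

    latent_actions: (T-1, d_a) latent action per transition
    frame_feats:    (T, d_f)   semantic features per frame (e.g. from a
                               frozen vision encoder; assumed, not specified)
    proj:           (d_a, d_f) learned projection from action to feature space
    """
    # Observable semantic change between consecutive frames.
    deltas = frame_feats[1:] - frame_feats[:-1]
    # Predicted effect of each latent action in feature space.
    preds = latent_actions @ proj
    # Average alignment over the whole sequence; lower loss = better aligned.
    sims = [cosine(p, d) for p, d in zip(preds, deltas)]
    return 1.0 - float(np.mean(sims))
```

By scoring alignment over the full sequence rather than a single frame pair, a loss of this shape pressures the latent to encode the change an action causes rather than scene-specific appearance, which is the transferability property the abstract targets.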

