ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

Xiong, Zhexiao; Song, Yizhi; Kang, Hao; Yan, Qing; Jiang, Liming; Yang, Jenson; Fu, Zhoujie; Fotiadis, Stathi; Wang, Angtian; Liu, Zichuan; Liu, Bo; Yang, Yiding; Lu, Xin; Jacobs, Nathan

Abstract

From visually explorable worlds to genuinely actionable ones.

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (walk, turn, look around), while interaction with objects in the scene (pick up plates, open doors, trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. We present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation–interaction gap stems from two bottlenecks. The first is a data bottleneck: human–object interaction data with accurate, dense labels is lacking. The second is a memory bottleneck: recency-biased history compression discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a dataset of 100K interaction videos, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control.

Contributions

What makes ActWorld interactive

Closing the navigation–interaction gap from both the data side and the model side.

Hierarchical Action-Aware Memory

A local memory bank routes and amplifies interaction-critical frames inside the sliding window; a persistent bank keeps compact event-update and object-identity tokens that survive beyond the window's eviction horizon — curing the action-forgetting pathology of recency-based memory.

Interaction-Dense Dataset & Annotation

A high-quality 100K-video corpus spanning 40 action categories, in which a chain-of-thought VLM annotates every chunk with a dense caption and an interaction-phase label. Existing navigation-centric datasets lack this supervision.

Unified Real-Time World Model

A single model that jointly supports flexible navigation and rich object interaction, combining action-aware memory with a dual-branch camera-conditioning module — validated on I-Bench, a new long-horizon benchmark that interleaves navigation and interaction.

Method

The ActWorld pipeline

An autoregressive DiT generates each chunk under high-level object-interaction commands and low-level keyboard/mouse controls.

ActWorld pipeline: hierarchical action-aware memory and dual-branch camera conditioning — **Left:** past observations flow through a hierarchical action-aware memory with two channels — a persistent action-aware bank of event/object tokens that survives long navigation gaps, and a local bank of EAFR-routed coarse/mid/fine history tokens. **Right:** inside each DiT block, self-attention amplifies history keys via ACHA, Plücker rays drive a per-token scale-and-shift, and cross-attention ingests the per-chunk caption combined with a symbolic text-camera embedding.

Chain-of-Thought per-chunk annotation

Each 33-frame chunk gets a dedicated description and structured interaction/phase labels; CoT prompting forces the VLM to reason over explicit visual evidence before committing, removing hallucinated interactions.
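As a rough illustration of what a per-chunk record could look like, the sketch below splits a video into 33-frame chunks and attaches one annotation to each. The chunk length comes from the text; the field names, the non-overlapping split, and the dropping of a trailing partial chunk are all assumptions made for this example and may differ from the actual annotation pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple

CHUNK_FRAMES = 33  # chunk length stated in the text


@dataclass
class ChunkAnnotation:
    # Hypothetical schema: the source only says each chunk carries a
    # description plus structured interaction/phase labels.
    start_frame: int
    end_frame: int      # exclusive
    caption: str        # dense per-chunk description
    interaction: str    # e.g. "open door", or "none" for pure navigation
    phase: str          # e.g. "contact", "manipulating", or "none"


def chunk_spans(num_frames: int, chunk: int = CHUNK_FRAMES) -> List[Tuple[int, int]]:
    """Return (start, end) spans of full, non-overlapping chunks.

    Assumption: a trailing partial chunk is dropped.
    """
    return [(s, s + chunk) for s in range(0, num_frames - chunk + 1, chunk)]
```

For a 100-frame clip this yields three chunks covering frames 0 to 98, and the final frame is left unannotated under the stated assumption.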

Dual-branch keyboard/mouse conditioning

A geometric branch routes per-pixel Plücker rays through a shared FiLM module; a symbolic branch maps the 81-entry (keyboard, mouse) vocabulary to a text embedding — keeping fine viewpoint control.
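The geometric branch can be pictured with a small sketch: compute the 6-D Plücker coordinates of the ray through a pixel, then apply a FiLM-style scale-and-shift to a feature vector. The (direction, moment) ordering, the pinhole intrinsics, and the `(1 + gamma) * x + beta` form are common conventions assumed here, not details confirmed by the source. In a real model `gamma` and `beta` would come from a learned projection of the ray features; here they are passed in directly.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(u, v, fx, fy, cx, cy, R, origin):
    """Return (d, o x d) for the ray through pixel (u, v).

    R is an assumed 3x3 camera-to-world rotation (nested lists) and
    origin is the camera center in world coordinates.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)          # pinhole back-projection
    d = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d))
    d = tuple(c / norm for c in d)                        # unit direction
    return d + cross(origin, d)                           # direction, then moment


def film(x, gamma, beta):
    """Feature-wise scale-and-shift; gamma = 0 and beta = 0 leave x unchanged."""
    return [(1.0 + g) * xi + b for xi, g, b in zip(x, gamma, beta)]
```

By construction the moment is orthogonal to the direction, which is a quick sanity check on any Plücker embedding.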

EAFR + ACHA memory routing

Event-Aware Frame Re-assignment replaces recency bucketing with importance-ranked routing, so causally-critical contact / manipulating frames survive compression; Action-Conditioned History Amplification sharpens attention onto the keys that matter for the current action.
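The contrast between recency bucketing and importance-ranked routing, and the effect of amplifying selected keys, can be sketched in a few lines. This is a toy illustration under stated assumptions: the importance scores, the tier budgets, and the log-gain bias used for amplification are all hypothetical stand-ins, since the source does not specify how EAFR scores frames or how ACHA modifies attention.

```python
import math


def route_by_importance(scores, n_fine, n_mid):
    """Assign each history frame to a tier by importance rank.

    scores[i] is a hypothetical importance score for frame i (higher means
    more interaction-critical). Ties go to the more recent frame.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], -i))
    tiers = ["coarse"] * len(scores)
    for rank, i in enumerate(order):
        if rank < n_fine:
            tiers[i] = "fine"
        elif rank < n_fine + n_mid:
            tiers[i] = "mid"
    return tiers


def route_by_recency(num_frames, n_fine, n_mid):
    """Baseline: the newest frames get the finest tier regardless of content."""
    return route_by_importance(list(range(num_frames)), n_fine, n_mid)


def amplified_attention_weights(logits, gains):
    """Softmax over logits + log(gain).

    Each key's unnormalized weight is multiplied by its gain (gain > 0),
    so gain > 1 amplifies a key and gain = 1 leaves it unchanged.
    """
    shifted = [l + math.log(g) for l, g in zip(logits, gains)]
    top = max(shifted)
    exps = [math.exp(s - top) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]
```

With scores `[0.1, 0.9, 0.2, 0.8, 0.3]` and budgets of one fine and two mid slots, the high-scoring frame at index 1 keeps the fine tier under importance routing, whereas recency routing would compress it to coarse.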

Persistent memory bank + few-step distillation

A FIFO bank of event & object-identity tokens (pinned at interaction phases) carries object identity across long navigation gaps; a Helios-style distillation reduces the 50-step teacher to a 3-step generator for real-time use.
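A minimal sketch of such a bank, assuming tokens are written only when a chunk is in an interaction phase and the oldest entry is evicted once capacity is reached. The phase names follow the "contact / manipulating" wording used for the memory routing; the class name, capacity, and token representation are hypothetical.

```python
from collections import deque

# Assumed set of phases that trigger a write; the full phase taxonomy
# is not given in the source.
INTERACTION_PHASES = {"contact", "manipulating"}


class PersistentBank:
    def __init__(self, capacity):
        # A bounded deque drops its oldest entry on append when full (FIFO).
        self.tokens = deque(maxlen=capacity)

    def maybe_write(self, chunk_index, phase, token):
        """Store a token only for chunks in an interaction phase."""
        if phase in INTERACTION_PHASES:
            self.tokens.append((chunk_index, token))
            return True
        return False

    def read(self):
        """Return stored (chunk_index, token) pairs, oldest first."""
        return list(self.tokens)
```

Because navigation chunks never write, a long navigation gap does not push interaction tokens out of the bank; only later interactions can evict earlier ones.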

Quantitative Results

Best or near-best on every axis

Evaluated on the I-Bench test set along three complementary axes: perceptual/consistency quality (VBench), semantic instruction following (VLM-Action-Judge), and geometric controllability (Key-Mouse-Following). Higher is better; bold marks the column best.

VLM-Action-Judge — semantic instruction following

Does the commanded action actually happen? A judge VLM rates each chunk on a 0–3 scale.

Method	IF ↑	Succ. ↑	≥2 ↑
Yume 1.5	1.638	20.12	46.75
HY-World 1.5	0.709	5.19	18.70
Lingbot-World	1.635	19.89	49.91
Matrix-Game 3	0.295	1.88	8.27
Astra	0.949	5.83	19.36
Infinite-World	0.237	1.51	3.95
ActWorld (Ours)	2.557	57.8	84.5

IF: mean instruction-following score (0–3). Succ.: success rate (Level 3). ≥2: partial-or-better rate (Level ≥ 2). ActWorld's Level-3 success rate (57.8) is more than double that of every baseline; the strongest baseline, Yume 1.5, reaches 20.12.
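The three columns can be computed from per-chunk judge ratings as in the sketch below. It assumes each rating is an integer level from 0 to 3 and that the two rates are reported as percentages; the function name is hypothetical.

```python
def judge_metrics(ratings):
    """Return (IF, Succ., >=2) from a list of per-chunk ratings in {0, 1, 2, 3}."""
    n = len(ratings)
    if n == 0:
        raise ValueError("need at least one rating")
    mean_if = sum(ratings) / n                          # mean score on the 0-3 scale
    succ = 100.0 * sum(r == 3 for r in ratings) / n     # share of Level-3 chunks
    partial = 100.0 * sum(r >= 2 for r in ratings) / n  # share of Level >= 2 chunks
    return mean_if, succ, partial
```

For example, ratings of `[3, 3, 2, 0]` give a mean score of 2.0, a success rate of 50.0, and a partial-or-better rate of 75.0. The success rate can never exceed the partial-or-better rate, which matches every row of the table.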

Key-Mouse-Following — geometric controllability

Generate a clip from a ground-truth (keys, mouse) sequence, recover its camera trajectory, and check that each chunk moves as commanded.

Method	Acc_full ↑	Acc_keys ↑	Acc_mouse ↑
Yume 1.5	4.82	31.79	15.71
HY-World 1.5	9.17	42.78	25.56
Lingbot-World	2.67	28.00	13.33
Matrix-Game 3	20.00	45.01	40.83
Astra	11.43	28.57	28.57
Infinite-World	3.00	24.50	15.00
ActWorld (Ours)	20.62	41.02	43.67

Acc_full: joint (keys, mouse) match. Acc_keys / Acc_mouse: per-axis accuracy, from VIPE-recovered SE(3) trajectories.
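Given per-chunk labels, the three accuracies reduce to simple match counts. The sketch assumes the recovered trajectory has already been discretized into one (keys, mouse) label per chunk; that discretization step, the label strings, and the function name are not specified by the source and are hypothetical here.

```python
def key_mouse_accuracy(commanded, recovered):
    """Return (Acc_full, Acc_keys, Acc_mouse) as percentages.

    commanded and recovered are equal-length lists of (keys, mouse) labels,
    one pair per chunk.
    """
    if len(commanded) != len(recovered) or not commanded:
        raise ValueError("need two non-empty lists of equal length")
    n = len(commanded)
    keys_ok = sum(c[0] == r[0] for c, r in zip(commanded, recovered))
    mouse_ok = sum(c[1] == r[1] for c, r in zip(commanded, recovered))
    full_ok = sum(c == r for c, r in zip(commanded, recovered))  # both must match
    return 100.0 * full_ok / n, 100.0 * keys_ok / n, 100.0 * mouse_ok / n
```

Since a joint match requires both axes to match, Acc_full is bounded above by the smaller of Acc_keys and Acc_mouse, which is why the joint column is the lowest in every row.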

VBench — perceptual & consistency metrics

Standard VBench-1.0 / VBench-i2v dimensions used by recent video world-model papers.

Method	SC	BC	MS	AQ	IQ	DD	TF	OC	i2v-S	i2v-B
Yume 1.5	0.743	0.872	0.985	0.432	0.645	1.000	0.962	0.180	0.930	0.943
HY-World 1.5	0.856	0.873	0.993	0.449	0.734	0.933	0.977	0.199	0.952	0.953
Lingbot-World	0.724	0.866	0.979	0.436	0.693	0.983	0.966	0.182	0.917	0.935
Matrix-Game 3	0.662	0.854	0.981	0.377	0.681	1.000	0.954	0.169	0.862	0.889
Astra	0.603	0.841	0.946	0.290	0.465	0.817	0.946	0.152	0.759	0.827
Infinite-World	0.801	0.863	0.987	0.442	0.748	0.967	0.959	0.087	0.716	0.743
ActWorld (Ours)	0.871	0.896	0.991	0.485	0.731	1.000	0.973	0.201	0.954	0.957

SC: subject-consistency · BC: background-consistency · MS: motion-smoothness · AQ: aesthetic-quality · IQ: imaging-quality · DD: dynamic-degree · TF: temporal-flickering · OC: overall-consistency · i2v-S / i2v-B: VBench-i2v subject / background.

BibTeX

If you find ActWorld useful, please cite our work.

@article{xiong2026actworld,
  title   = {ActWorld: From Explorable to Interactive World Model via Action-Aware Memory},
  author  = {Xiong, Zhexiao and Song, Yizhi and Kang, Hao and Yan, Qing and Jiang, Liming
             and Yang, Jenson and Fu, Zhoujie and Fotiadis, Stathi and Wang, Angtian
             and Liu, Zichuan and Liu, Bo and Yang, Yiding and Lu, Xin and Jacobs, Nathan},
  journal = {arXiv preprint arXiv:2606.17730},
  eprint  = {2606.17730},
  archivePrefix = {arXiv},
  year    = {2026}
}