Interactive World Model  ·  2026

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

¹Washington University in St. Louis    ²Intelligent Creation, ByteDance
*Work done during internship at ByteDance    Project Lead
ActWorld teaser: multi-action rollouts with per-frame keyboard/mouse control
ActWorld handles both long-horizon navigation and mid-rollout object interaction within a single rollout, under per-frame keyboard & mouse control (WASD + arrow keys overlaid on each frame; yellow marks the active key). Each row is one continuous trajectory — time flows left→right and wraps to the next row; colored bands separate navigation segments from object-interaction segments (Insert → Pickup, Carry → Place → Wipe), with the action label shown above.

Abstract

From visually explorable worlds to genuinely actionable ones.

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (walk, turn, look around), while interaction with objects in the scene (pick up plates, open doors, trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. We present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue the navigation–interaction gap stems from two bottlenecks: a data bottleneck (the lack of human–object interaction data with accurate, dense labels) and a memory bottleneck (recency-biased history compression discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology). On the data side, we construct a dataset of 100K interaction videos, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control.

Contributions

What makes ActWorld interactive

Closing the navigation–interaction gap from both the data side and the model side.

Hierarchical Action-Aware Memory

A local memory bank routes and amplifies interaction-critical frames inside the sliding window; a persistent bank keeps compact event-update and object-identity tokens that survive beyond the window's eviction horizon — curing the action-forgetting pathology of recency-based memory.

Interaction-Dense Dataset & Annotation

A high-quality corpus of 100K videos spanning 40 action categories. A chain-of-thought VLM annotates every chunk with a dense caption and an interaction-phase label, supervision that existing navigation-centric datasets lack.

Unified Real-Time World Model

A single model that jointly supports flexible navigation and rich object interaction, combining action-aware memory with a dual-branch camera-conditioning module — validated on I-Bench, a new long-horizon benchmark that interleaves navigation and interaction.

Method

The ActWorld pipeline

An autoregressive DiT generates each chunk under high-level object-interaction commands and low-level keyboard/mouse controls.
ActWorld pipeline: hierarchical action-aware memory and dual-branch camera conditioning
Left: past observations flow through a hierarchical action-aware memory with two channels — a persistent action-aware bank of event/object tokens that survives long navigation gaps, and a local bank of EAFR-routed coarse/mid/fine history tokens. Right: inside each DiT block, self-attention amplifies history keys via ACHA, Plücker rays drive a per-token scale-and-shift, and cross-attention ingests the per-chunk caption combined with a symbolic text-camera embedding.
1. Chain-of-Thought per-chunk annotation

Each 33-frame chunk gets a dedicated description and structured interaction/phase labels; CoT prompting forces the VLM to reason over explicit visual evidence before committing, removing hallucinated interactions.
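
The per-chunk bookkeeping can be pictured with a minimal sketch. Only the 33-frame chunk length comes from the text; the `ChunkAnnotation` record, its field names, and the choice of non-overlapping chunks with a dropped short tail are assumptions made for illustration.

```python
from dataclasses import dataclass

CHUNK_FRAMES = 33  # chunk length stated in the text


@dataclass
class ChunkAnnotation:
    """Hypothetical per-chunk record; field names are illustrative only."""
    start: int            # index of the first frame (inclusive)
    end: int              # index one past the last frame (exclusive)
    caption: str = ""     # dense description to be filled in by the VLM
    phase: str = "none"   # interaction-phase label to be filled in by the VLM


def split_into_chunks(num_frames: int, size: int = CHUNK_FRAMES) -> list[ChunkAnnotation]:
    """Cut a clip into consecutive non-overlapping chunks, dropping a short tail.

    Assumption: chunks do not overlap; the source does not say either way.
    """
    return [ChunkAnnotation(s, s + size) for s in range(0, num_frames - size + 1, size)]


chunks = split_into_chunks(100)  # three full chunks; the last frame is left over
```

Each record would then be sent to the annotating VLM, which fills in `caption` and `phase` after its reasoning step.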

2. Dual-branch keyboard/mouse conditioning

A geometric branch routes per-pixel Plücker rays through a shared FiLM module; a symbolic branch maps the 81-entry (keyboard, mouse) vocabulary to a text embedding — keeping fine viewpoint control.
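
To make the geometric branch concrete, here is a small sketch of a per-pixel Plücker ray and a FiLM-style scale-and-shift. It assumes a pinhole camera and the common (direction, origin × direction) ordering of the six coordinates; the function names, and the `(1 + scale)` form of the modulation, are illustrative choices rather than details given by the source. In a real model the scale and shift would come from a learned projection of the ray map.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(u, v, fx, fy, cx, cy, rotation, origin):
    """6-D Plücker coordinates (d, o x d) of the ray through pixel (u, v).

    rotation: 3x3 camera-to-world rotation as nested lists.
    origin:   camera centre in world coordinates.
    """
    # Back-project the pixel to a camera-space direction (pinhole model).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into world space and normalise to unit length.
    d = tuple(sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    norm = math.sqrt(sum(x * x for x in d))
    d = tuple(x / norm for x in d)
    # The moment o x d encodes where the ray sits in space.
    return d + cross(origin, d)


def film(feature, scale, shift):
    """FiLM-style modulation (assumed form): feature * (1 + scale) + shift."""
    return [f * (1.0 + s) + b for f, s, b in zip(feature, scale, shift)]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centre_ray = plucker_ray(32, 24, 50.0, 50.0, 32.0, 24.0, IDENTITY, (1.0, 0.0, 0.0))
```

Because the moment depends on the camera centre, two cameras that look in the same direction from different positions produce different ray maps, which is what lets the conditioning distinguish translation from rotation.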

3. EAFR + ACHA memory routing

Event-Aware Frame Re-assignment replaces recency bucketing with importance-ranked routing, so causally-critical contact / manipulating frames survive compression; Action-Conditioned History Amplification sharpens attention onto the keys that matter for the current action.
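
The two ideas can be illustrated with a toy sketch; it is not the authors' implementation. For the routing part, the phase weights, the recency tie-break, and the `route_frames` name are assumptions. For the amplification part, adding a log-gain to the attention logits is one simple way to boost selected keys, swapped in here because the source does not spell out the mechanism.

```python
import math

# Hypothetical importance weights; the phase names follow the text.
PHASE_WEIGHT = {"contact": 2.0, "manipulating": 2.0, "none": 0.0}


def route_frames(phases, n_fine, n_mid):
    """Assign each history frame (oldest first) to a compression level.

    Frames are ranked by importance instead of age: the top n_fine keep
    fine tokens, the next n_mid get mid tokens, the rest become coarse.
    """
    n = len(phases)
    # Phase weight dominates; a recency term in [0, 1) only breaks ties.
    score = [PHASE_WEIGHT.get(p, 0.0) + i / n for i, p in enumerate(phases)]
    order = sorted(range(n), key=lambda i: score[i], reverse=True)
    levels = ["coarse"] * n
    for rank, i in enumerate(order):
        if rank < n_fine:
            levels[i] = "fine"
        elif rank < n_fine + n_mid:
            levels[i] = "mid"
    return levels


def amplified_attention(logits, gains):
    """Softmax over logits after adding log(gain) per key (gain > 1 boosts)."""
    shifted = [l + math.log(g) for l, g in zip(logits, gains)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


history = ["none", "contact", "none", "none", "none", "none"]
levels = route_frames(history, n_fine=2, n_mid=2)
weights = amplified_attention([0.0, 0.0], [1.0, 3.0])
```

With pure recency bucketing the old contact frame at index 1 would land in the coarse bucket; under importance ranking it keeps fine tokens while the remaining slots still go to the newest frames.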

4. Persistent memory bank + few-step distillation

A FIFO bank of event & object-identity tokens (pinned at interaction phases) carries object identity across long navigation gaps; a Helios-style distillation reduces the 50-step teacher to a 3-step generator for real-time use.
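
A minimal sketch of such a bank, under one reading of the description: tokens are written only during interaction phases, so navigation chunks never push them out, and eviction is first-in-first-out once capacity is reached. The class name, the capacity, and the string stand-ins for tokens are hypothetical.

```python
from collections import deque

# Phase names taken from the text; treating exactly these as "interaction" is an assumption.
INTERACTION_PHASES = {"contact", "manipulating"}


class PersistentBank:
    """Fixed-capacity FIFO store of event / object-identity tokens."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.tokens = deque(maxlen=capacity)

    def update(self, chunk_index, phase, token):
        """Write a token only when the chunk is in an interaction phase."""
        if phase in INTERACTION_PHASES:
            self.tokens.append((chunk_index, token))

    def read(self):
        """Return stored (chunk_index, token) pairs, oldest first."""
        return list(self.tokens)


bank = PersistentBank(capacity=2)
bank.update(0, "contact", "cup-picked-up")
for t in range(1, 51):          # a long navigation gap writes nothing
    bank.update(t, "none", "ignored")
after_gap = bank.read()
```

The token written at chunk 0 is still readable after fifty navigation chunks, whereas a sliding window of recent frames would have evicted it long before.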

Qualitative Comparison

Interactive side-by-side comparison

Each row is a benchmark sample from I-Bench. Six methods are rendered from the same first frame and the same ground-truth keyboard/mouse trajectory; overlays at the bottom-left of every clip show the GT keys (WASD), mouse (arrows), and active verb buttons. ActWorld (Ours) is highlighted.
Quantitative Results

Best or near-best on every axis

Evaluated on the I-Bench test set along three complementary axes: perceptual/consistency quality (VBench), semantic instruction following (VLM-Action-Judge), and geometric controllability (Key-Mouse-Following). Higher is better; bold marks the column best.

VLM-Action-Judge — semantic instruction following

Does the commanded action actually happen? A judge VLM rates each chunk on a 0–3 scale.

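| Method | IF ↑ | Succ. ↑ | ≥2 ↑ |
|---|---|---|---|
| Yume 1.5 | 1.638 | 20.12 | 46.75 |
| HY-World 1.5 | 0.709 | 5.19 | 18.70 |
| Lingbot-World | 1.635 | 19.89 | 49.91 |
| Matrix-Game 3 | 0.295 | 1.88 | 8.27 |
| Astra | 0.949 | 5.83 | 19.36 |
| Infinite-World | 0.237 | 1.51 | 3.95 |
| ActWorld (Ours) | **2.557** | **57.8** | **84.5** |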
IF: mean instruction-following score (0–3). Succ.: success rate (Level 3). ≥2: partial-or-better rate (Level ≥ 2). ActWorld's Level-3 success rate more than doubles every baseline.
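
The three columns follow directly from the per-chunk judge ratings, as this short sketch shows. Reporting the two rates as percentages and pooling all chunks into one list are assumptions; the source does not state how scores are aggregated across clips.

```python
def judge_metrics(scores):
    """Summarise per-chunk judge ratings (integers 0..3).

    Returns (IF, Succ., >=2): the mean score, the share of chunks rated 3,
    and the share rated 2 or higher, the latter two as percentages.
    """
    n = len(scores)
    if_mean = sum(scores) / n
    succ = 100.0 * sum(s == 3 for s in scores) / n
    ge2 = 100.0 * sum(s >= 2 for s in scores) / n
    return if_mean, succ, ge2


example = judge_metrics([3, 3, 2, 1, 0])
```

By construction the ≥2 rate can never fall below the success rate, which is a quick sanity check when reading the table.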

Key-Mouse-Following — geometric controllability

Generate a clip from a GT (keys, mouse) sequence, recover its trajectory, and check each chunk moves as commanded.

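| Method | Acc_full | Acc_keys | Acc_mouse |
|---|---|---|---|
| Yume 1.5 | 4.82 | 31.79 | 15.71 |
| HY-World 1.5 | 9.17 | 42.78 | 25.56 |
| Lingbot-World | 2.67 | 28.00 | 13.33 |
| Matrix-Game 3 | 20.00 | **45.01** | 40.83 |
| Astra | 11.43 | 28.57 | 28.57 |
| Infinite-World | 3.00 | 24.50 | 15.00 |
| ActWorld (Ours) | **20.62** | 41.02 | **43.67** |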
Acc_full: joint (keys, mouse) match. Acc_keys / Acc_mouse: per-axis accuracy, from VIPE-recovered SE(3) trajectories.
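
Once each chunk's recovered motion has been mapped to a discrete (key, mouse) label, the three accuracies reduce to simple counting. The sketch below assumes that mapping has already been done; how the SE(3) trajectories are discretised is not described in the source, and the label strings are placeholders.

```python
def following_accuracy(commanded, recovered):
    """Per-chunk agreement between commanded and recovered (key, mouse) labels.

    Returns (Acc_full, Acc_keys, Acc_mouse) as percentages.
    """
    n = len(commanded)
    pairs = list(zip(commanded, recovered))
    keys = sum(c[0] == r[0] for c, r in pairs)    # key axis only
    mouse = sum(c[1] == r[1] for c, r in pairs)   # mouse axis only
    full = sum(c == r for c, r in pairs)          # both axes must match
    return 100.0 * full / n, 100.0 * keys / n, 100.0 * mouse / n


commanded = [("W", "left"), ("S", "none"), ("A", "up"), ("W", "right")]
recovered = [("W", "left"), ("S", "up"), ("D", "up"), ("W", "right")]
acc_full, acc_keys, acc_mouse = following_accuracy(commanded, recovered)
```

Since a joint match requires both axes to agree, Acc_full is bounded above by the smaller of the two per-axis accuracies, which holds for every row of the table.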

VBench — perceptual & consistency metrics

Standard VBench-1.0 / VBench-i2v dimensions used by recent video world-model papers.

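| Method | SC | BC | MS | AQ | IQ | DD | TF | OC | i2v-S | i2v-B |
|---|---|---|---|---|---|---|---|---|---|---|
| Yume 1.5 | 0.743 | 0.872 | 0.985 | 0.432 | 0.645 | **1.000** | 0.962 | 0.180 | 0.930 | 0.943 |
| HY-World 1.5 | 0.856 | 0.873 | **0.993** | 0.449 | 0.734 | 0.933 | **0.977** | 0.199 | 0.952 | 0.953 |
| Lingbot-World | 0.724 | 0.866 | 0.979 | 0.436 | 0.693 | 0.983 | 0.966 | 0.182 | 0.917 | 0.935 |
| Matrix-Game 3 | 0.662 | 0.854 | 0.981 | 0.377 | 0.681 | **1.000** | 0.954 | 0.169 | 0.862 | 0.889 |
| Astra | 0.603 | 0.841 | 0.946 | 0.290 | 0.465 | 0.817 | 0.946 | 0.152 | 0.759 | 0.827 |
| Infinite-World | 0.801 | 0.863 | 0.987 | 0.442 | **0.748** | 0.967 | 0.959 | 0.087 | 0.716 | 0.743 |
| ActWorld (Ours) | **0.871** | **0.896** | 0.991 | **0.485** | 0.731 | **1.000** | 0.973 | **0.201** | **0.954** | **0.957** |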
SC: subject-consistency · BC: background-consistency · MS: motion-smoothness · AQ: aesthetic-quality · IQ: imaging-quality · DD: dynamic-degree · TF: temporal-flickering · OC: overall-consistency · i2v-S / i2v-B: VBench-i2v subject / background.

Rollouts across diverse scenes

Continuous trajectories that require both free-form navigation and object-centric interaction within a single rollout — preserving viewpoint controllability while producing coherent interaction outcomes.
ActWorld qualitative rollouts: Open/Close and Open/Pickup/Place sequences
Each row shows temporally ordered frames with per-frame keyboard/mouse controls overlaid (active inputs highlighted in yellow). ActWorld preserves the manipulated objects and the commanded action ordering across the full clip.

BibTeX

If you find ActWorld useful, please cite our work.
@article{xiong2026actworld,
  title   = {ActWorld: From Explorable to Interactive World Model via Action-Aware Memory},
  author  = {Xiong, Zhexiao and Song, Yizhi and Kang, Hao and Yan, Qing and Jiang, Liming
             and Yang, Jenson and Fu, Zhoujie and Fotiadis, Stathi and Wang, Angtian
             and Liu, Zichuan and Liu, Bo and Yang, Yiding and Lu, Xin and Jacobs, Nathan},
  journal = {arXiv preprint},
  year    = {2026}
}