Directing the World · Compositional Human-Camera Video Generation

AUTOREGRESSIVE WORLD MODEL ✦ SMPL HUMAN-MOTION CONTROL ✦ Plücker-ray CAMERA TRAJECTORIES ✦ t-guided DYNAMIC PROJECTION ✦ Fast–Slow MEMORY TRAINING ✦ 15–20s LONG-HORIZON ROLLOUTS ✦ 50K CURATED CLIPS ✦ SCENE 01 · TAKE 1 · LIGHTS CAMERA ACTION

01 / Overview

What if you could direct the world?

Simulate controllable real-world dynamics over long horizons — with precise human motion and coherent camera exploration in a single autoregressive world model.

Building interactive world models requires not only generating visually realistic videos but also simulating controllable real-world dynamics over long temporal horizons. Autoregressive video generation provides a scalable foundation for such simulation, yet maintaining temporal stability over extended rollouts is fundamentally hard: small prediction errors accumulate and quickly degrade long-horizon generation.

This challenge intensifies under heterogeneous controls such as human motion and camera trajectories, where signals interfere and destabilize the video prior. We present Directing the World, a fast autoregressive framework that decouples the learning of dynamic control factors while preserving a unified autoregressive prior — enabling stable long-horizon generation, precise human-motion control, and coherent camera-controlled world dynamics.

02 / Framework

Decouple then integrate

A two-stage autoregressive ControlNet-based framework — learn human-motion control under static cameras, then progressively introduce camera-trajectory control on top of the acquired motion prior.

Figure 1. Controllable autoregressive framework: MMPL backbone · SMPL human-motion control with t-guided projection · causal-aligned camera control · Fast–Slow Memory training.

Figure 2. Dataset construction: 20M iStock videos → aesthetic & motion filtering → distributed condition extraction (normal/depth/SMPL) → SfM camera assessment → unified-latent-space alignment → decoupled motion-centric & camera-centric subsets.

🧠

01 MMPL Backbone

A Plan-then-Populate autoregressive backbone that predicts sparse planning frames (including the terminal anchor) to constrain each block, connecting consecutive segments into temporally consistent long videos with stable world memory.
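The plan-then-populate idea can be illustrated with a small scheduling sketch. It only shows how a rollout might be cut into blocks, each with sparse planning frames that always include the block's last frame (the terminal anchor). The even spacing, the function name `plan_then_populate_schedule`, and all parameter values are assumptions made for illustration, not details taken from the MMPL backbone.

```python
def plan_then_populate_schedule(total_frames, block_size, plans_per_block):
    """Hypothetical helper: split a rollout into blocks and, per block, pick
    sparse planning frames (always including the block's last frame) plus the
    remaining frames that would be populated afterwards."""
    if total_frames <= 0 or block_size <= 0 or plans_per_block <= 0:
        raise ValueError("all arguments must be positive")
    schedule = []
    for start in range(0, total_frames, block_size):
        end = min(start + block_size, total_frames)  # exclusive end of block
        length = end - start
        k = min(plans_per_block, length)
        # Evenly spaced offsets (an assumption); the last one is length - 1,
        # i.e. the terminal anchor that ties this block to the next one.
        offsets = [(i + 1) * length // k - 1 for i in range(k)]
        planning = [start + o for o in offsets]
        planned = set(planning)
        populate = [f for f in range(start, end) if f not in planned]
        schedule.append({"planning": planning, "populate": populate})
    return schedule


# Usage: a 20-frame rollout, 8-frame blocks, 2 planning frames per block.
for block in plan_then_populate_schedule(20, 8, 2):
    print(block)
```

In this sketch every block ends on a planning frame, which is one simple way to give the following block a fixed anchor to continue from.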

🤸

02 Human Motion Control

SMPL sequences injected as structured 3D guidance via a t-guided Dynamic Projection that adapts motion conditions to timestep-dependent denoising latents — coarse-to-fine control that also supports simultaneous multi-person motion.
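One possible reading of "coarse-to-fine" timestep guidance is sketched below: a gate that depends on the denoising timestep blends a smoothed (coarse) copy of a motion feature with the raw (fine) one. This is an illustrative toy, not the projection used in Directing the World. The sigmoid gate, the moving-average smoothing, the convention that `t = 1` is pure noise, and the names `timestep_gate` and `project_motion_condition` are all assumptions.

```python
import math


def timestep_gate(t, sharpness=10.0, midpoint=0.5):
    """Hypothetical gate: near 0 at noisy steps (t close to 1),
    near 1 at clean steps (t close to 0)."""
    return 1.0 / (1.0 + math.exp(sharpness * (t - midpoint)))


def project_motion_condition(fine, t, window=3):
    """Blend a moving-average (coarse) copy of a 1D motion feature with the
    raw (fine) feature, weighted by the timestep gate."""
    half = window // 2
    coarse = []
    for i in range(len(fine)):
        lo, hi = max(0, i - half), min(len(fine), i + half + 1)
        coarse.append(sum(fine[lo:hi]) / (hi - lo))
    g = timestep_gate(t)
    # Noisy steps lean on the coarse signal, clean steps on the fine one.
    return [(1.0 - g) * c + g * f for c, f in zip(coarse, fine)]


# Usage: the same motion feature conditioned at a noisy and a clean timestep.
signal = [0.0, 3.0, 0.0]
print(project_motion_condition(signal, t=0.9))
print(project_motion_condition(signal, t=0.1))
```

A real model would apply a learned projection to full SMPL feature maps; the toy only shows how a single condition can be presented differently at different noise levels.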

🎥

03 Causal-Aligned Camera Control

A camera pathway that decouples global Plücker-ray trajectory encoding from block-local feature injection — preserving long-range trajectory understanding while staying temporally aligned with block-wise autoregressive generation.
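For readers unfamiliar with Plücker rays, the sketch below computes the six-dimensional Plücker coordinates of the ray through one pixel of a pinhole camera. It assumes a camera-to-world rotation `R_c2w`, a camera center `origin` in world coordinates, and the (moment, direction) ordering; implementations differ on ordering and normalization, and the page does not state which convention the camera pathway uses. The function names are illustrative.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as lists."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def plucker_ray(u, v, fx, fy, cx, cy, R_c2w, origin):
    """Return [m_x, m_y, m_z, d_x, d_y, d_z] for the ray through pixel (u, v).

    d is the unit ray direction in world coordinates and m = origin x d is
    the moment. Ordering and normalization are assumptions of this sketch.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the direction into the world frame.
    d = [sum(R_c2w[r][c] * d_cam[c] for c in range(3)) for r in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]
    return cross(origin, d) + d


# Usage: identity rotation, camera center shifted one unit along x,
# ray through the principal point.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(plucker_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0, identity, [1.0, 0.0, 0.0]))
```

Evaluating this for every pixel of every frame yields a per-frame ray map for the whole trajectory, which could then be sliced per block for injection.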

⚡

04 Fast–Slow Memory Training

A differential learning-rate strategy: self/cross-attention layers stay slow to preserve the long-video prior, while new control modules adapt fast — stabilizing controllable post-training and reducing signal interference.
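A differential learning-rate setup of this kind is commonly expressed as optimizer parameter groups. The sketch below sorts named parameters into a slow group (attention layers) and a fast group (control modules) and returns a list of dicts in the format PyTorch optimizers accept as parameter groups. The name substrings, the learning-rate values, and the choice to leave unmatched parameters out of training are hypothetical; the page does not specify them.

```python
def build_fast_slow_groups(
    named_params,
    slow_lr=1e-6,   # hypothetical value
    fast_lr=1e-4,   # hypothetical value
    slow_keys=("self_attn", "cross_attn"),
    fast_keys=("control",),
):
    """Hypothetical helper: split (name, parameter) pairs into slow and fast
    learning-rate groups by substring match on the parameter name.

    Parameters matching neither key set are left out, i.e. treated as frozen
    (an assumption of this sketch).
    """
    slow, fast = [], []
    for name, param in named_params:
        if any(key in name for key in slow_keys):
            slow.append(param)
        elif any(key in name for key in fast_keys):
            fast.append(param)
    groups = []
    if slow:
        groups.append({"params": slow, "lr": slow_lr})
    if fast:
        groups.append({"params": fast, "lr": fast_lr})
    return groups


# Usage with placeholder parameters; with a real model one would pass
# model.named_parameters() and hand the result to an optimizer.
demo = [
    ("blocks.0.self_attn.q.weight", "p0"),
    ("blocks.0.cross_attn.k.weight", "p1"),
    ("control.smpl.proj.weight", "p2"),
    ("blocks.0.ffn.weight", "p3"),
]
for group in build_fast_slow_groups(demo):
    print(group)
```

Keeping the split in one helper makes the slow-to-fast ratio a single, easily tuned setting.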

03 / Multi-Agent World

Cue the cast: multi-agent world control

You don't just generate a clip — you direct a world. Directing the World lets you choreograph multiple human agents simultaneously, each driven by its own SMPL motion script, all moving through one persistent, camera-explored scene.

Each clip below shows the rendered SMPL motion script (the actor's blocking) feeding the generated world. The t-guided projection keeps every actor on mark while the world stays coherent.

▲ SMPL script · ▼ Generated world

04 / Evaluation

State-of-the-art under joint control

Evaluated on APRIL-AIGC/UltraVideo-Long under motion-only, camera-only, and joint motion-camera settings. Directing the World achieves the best overall score (3.868) and the lowest camera RRE (0.349), while its temporal consistency (90.25) is second only to FunMotion (91.76).

Method	Refined Control	Motion Control	Camera Control	Overall Score	Temporal Consist.	Temporal Quality	Motion Err.	Camera ATE	Camera RPE	Camera RRE
FunCamera	—	—	✓	3.401	86.59	59.77	0.213	0.601	0.101	0.970
FunMotion	—	✓	—	3.445	91.76	62.52	0.184	—	—	—
WanMove	—	✓	✓	3.401	89.67	64.23	0.194	0.669	0.077	0.685
Uni3C	✓	✓	✓	3.445	88.65	56.07	0.151	0.482	0.063	0.722
★ Ours	✓	✓	✓	3.868	90.25	62.98	0.161	0.532	0.070	0.349

"Refined" = fine-grained controllability beyond coarse trajectory-level guidance. Metrics are omitted where a method does not support the corresponding control.

MON 01 · LIVE

Joint Motion + Camera
Long-horizon world exploration

MON 02 · LIVE

Consistent Subject
Stable identity over 20s rollout

MON 03 · LIVE

Dynamic Scene
Large perspective change + motion

MON 04 · LIVE

World Exploration
Coherent camera-controlled dynamics

05 / Data

A controllable world-model dataset

Built from 20M public iStock videos with synchronized video, text, human-motion (SMPL) and camera-trajectory annotations — filtered and aligned in a unified latent space.

Stage I

Motion-Centric Subset

~20K

High human motion · low camera motion · static/stable backgrounds — ideal for learning SMPL-to-video control.

Stage II

Camera-Centric Subset

~30K

High human motion · high camera motion · vivid viewpoints and large perspective change — for world-exploration camera control.
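The split into the two subsets can be pictured as a simple routing rule on per-clip motion scores. The sketch below is illustrative only: the score scale, the threshold values, and the function name `assign_subset` are hypothetical, and the page does not describe how the actual filtering thresholds were chosen.

```python
def assign_subset(human_motion, camera_motion, human_thresh=0.5, camera_thresh=0.5):
    """Hypothetical routing rule for one clip.

    Scores are assumed to lie in [0, 1]; both thresholds are placeholders.
    Returns "motion_centric", "camera_centric", or None if the clip has too
    little human motion for either subset.
    """
    if human_motion < human_thresh:
        return None  # both subsets call for high human motion
    if camera_motion >= camera_thresh:
        return "camera_centric"  # Stage II: high human and camera motion
    return "motion_centric"      # Stage I: high human motion, stable camera


# Usage on three made-up clips (human score, camera score).
for scores in [(0.9, 0.1), (0.8, 0.7), (0.2, 0.9)]:
    print(scores, assign_subset(*scores))
```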

06 / Citation

Cite this work

directing_the_world.bib

@article{wang2026directing,
  title   = {Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control},
  author  = {Wang, Haoyuan and Chen, Yabo and Huang, Haibin and Zhang, Chi and Li, Xuelong},
  journal = {IEEE Transactions on Multimedia (TMM)},
  volume  = {XX},
  number  = {XX},
  year    = {2026}
}