IEEE Transactions on Multimedia · 2026

Directing the World

Fast Autoregressive Video Generation with Compositional Human-Camera Control

Institute of Artificial Intelligence, China Telecom (TeleAI)
* Equal contribution  ·  Corresponding author: Xuelong Li

Given an input image, an SMPL human-motion sequence, and a prescribed camera trajectory, Directing the World generates stable 15–20s long-horizon world-model videos — consistent human motion and coherent camera-based world exploration.

Autoregressive world model · SMPL human-motion control · Plücker-ray camera trajectories · t-guided Dynamic Projection · Fast–Slow Memory training · 15–20s long-horizon rollouts · 50K curated clips

01 / Overview

What if you could direct the world?

Simulate controllable real-world dynamics over long horizons — with precise human motion and coherent camera exploration in a single autoregressive world model.

Building interactive world models requires not only visually realistic videos, but also simulating controllable real-world dynamics over long temporal horizons. Autoregressive video generation provides a scalable foundation for such simulation, yet maintaining temporal stability over extended rollouts is fundamentally hard — small prediction errors accumulate and quickly degrade long-horizon generation.

This challenge intensifies under heterogeneous controls such as human motion and camera trajectories, where signals interfere and destabilize the video prior. We present Directing the World, a fast autoregressive framework that decouples the learning of dynamic control factors while preserving a unified autoregressive prior — enabling stable long-horizon generation, precise human-motion control, and coherent camera-controlled world dynamics.

02 / Framework

Decouple then integrate

A two-stage autoregressive ControlNet-based framework — learn human-motion control under static cameras, then progressively introduce camera-trajectory control on top of the acquired motion prior.


Figure 1. Controllable autoregressive framework: MMPL backbone · SMPL human-motion control with t-guided projection · causal-aligned camera control · Fast–Slow Memory training.

Figure 2. Dataset construction: 20M iStock videos → aesthetic & motion filtering → distributed condition extraction (normal/depth/SMPL) → SfM camera assessment → unified-latent-space alignment → decoupled motion-centric & camera-centric subsets.

01 MMPL Backbone

A Plan-then-Populate autoregressive backbone that predicts sparse planning frames (including the terminal anchor) to constrain each block, connecting consecutive segments into temporally consistent long videos with stable world memory.
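
As a rough illustration of the plan-then-populate idea, the toy sketch below first picks sparse planning indices for each block (always ending on the block's last frame, the terminal anchor) and then fills in the remaining frames. The scalar "frames", the `plan_indices` and `rollout` helpers, the `planner` callback, and the linear interpolation that stands in for the generative model are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Callable, List


def plan_indices(block_len: int, num_plan: int) -> List[int]:
    """Evenly spaced planning-frame indices; the last one is the terminal anchor."""
    if not 1 <= num_plan <= block_len:
        raise ValueError("num_plan must be in [1, block_len]")
    return [((k + 1) * block_len) // num_plan - 1 for k in range(num_plan)]


def rollout(first_frame: float, num_blocks: int, block_len: int, num_plan: int,
            planner: Callable[[float, int], float]) -> List[float]:
    """Generate num_blocks * block_len toy frames, one block at a time."""
    video: List[float] = []
    prev = first_frame  # input image, then the last frame of the previous block
    for _ in range(num_blocks):
        # Plan: predict sparse planning frames, including the terminal anchor.
        anchors = {i: planner(prev, i + 1) for i in plan_indices(block_len, num_plan)}
        # Populate: fill the frames between consecutive known frames.
        # (Linear interpolation is a stand-in for the real generator.)
        block: List[float] = []
        left_idx, left_val = -1, prev
        for right_idx in sorted(anchors):
            right_val = anchors[right_idx]
            span = right_idx - left_idx
            for i in range(left_idx + 1, right_idx + 1):
                w = (i - left_idx) / span
                block.append((1 - w) * left_val + w * right_val)
            left_idx, left_val = right_idx, right_val
        video.extend(block)
        prev = block[-1]  # the terminal anchor seeds the next block
    return video
```

Because every block ends exactly on its planned terminal anchor, the next block starts from a frame that was fixed in advance rather than from an accumulated guess, which is one way to read the "connecting consecutive segments" claim above.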

02 Human Motion Control

SMPL sequences injected as structured 3D guidance via a t-guided Dynamic Projection that adapts motion conditions to timestep-dependent denoising latents — coarse-to-fine control that also supports simultaneous multi-person motion.
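
The page does not spell out the form of the t-guided Dynamic Projection, so the following is only a minimal sketch of one plausible reading: a motion-condition feature vector is rescaled and shifted by values computed from an embedding of the denoising timestep (a FiLM-style modulation, swapped in here as an assumption). The function names, the embedding frequencies, and the weight matrices `w_scale` and `w_shift` are hypothetical.

```python
import math
from typing import List, Sequence


def timestep_embedding(t: float, dim: int = 4) -> List[float]:
    """Sinusoidal embedding of a denoising timestep t in [0, 1]; dim must be even."""
    half = dim // 2
    freqs = [10.0 ** k for k in range(half)]  # assumed frequency ladder
    return [math.sin(f * t) for f in freqs] + [math.cos(f * t) for f in freqs]


def matvec(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def t_guided_projection(cond: Sequence[float], t: float,
                        w_scale: Sequence[Sequence[float]],
                        w_shift: Sequence[Sequence[float]]) -> List[float]:
    """Modulate a motion-condition vector by timestep-dependent scale and shift.

    w_scale and w_shift have one row per condition channel and one column per
    embedding dimension (hypothetical learned weights).
    """
    emb = timestep_embedding(t, dim=len(w_scale[0]))
    scale = matvec(w_scale, emb)
    shift = matvec(w_shift, emb)
    return [(1.0 + s) * c + b for c, s, b in zip(cond, scale, shift)]
```

With all-zero weights the projection is the identity. Non-zero weights let the same SMPL condition be injected differently at noisy and at nearly clean timesteps, which is the coarse-to-fine behaviour described above.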

03 Causal-Aligned Camera Control

A camera pathway that decouples global Plücker-ray trajectory encoding from block-local feature injection — preserving long-range trajectory understanding while staying temporally aligned with block-wise autoregressive generation.
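
For readers unfamiliar with Plücker rays, the sketch below computes the standard 6-D ray embedding (unit direction `d` and moment `o × d`) for a pinhole camera, encodes a whole trajectory once, and then slices out the frames of one autoregressive block. The ordering `(d, o × d)`, the pinhole intrinsics, and the `encode_trajectory` / `block_view` split are assumptions chosen to mirror the global-versus-block-local description above. The page does not give the actual architecture.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(u: float, v: float, fx: float, fy: float, cx: float, cy: float,
                rot: Sequence[Sequence[float]], origin: Vec3) -> Tuple[float, ...]:
    """6-D Plücker coordinates (d, o x d) of the ray through pixel (u, v).

    rot is a 3x3 camera-to-world rotation; origin is the camera centre in world space.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)  # ray direction in the camera frame
    d = [sum(rot[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    d_unit = (d[0] / norm, d[1] / norm, d[2] / norm)
    return d_unit + cross(origin, d_unit)


def encode_trajectory(poses, fx, fy, cx, cy, pixels) -> List[List[Tuple[float, ...]]]:
    """Global step: encode every frame of the full camera trajectory once."""
    return [[plucker_ray(u, v, fx, fy, cx, cy, rot, origin) for (u, v) in pixels]
            for rot, origin in poses]


def block_view(encoded, block_idx: int, block_len: int):
    """Block-local step: the frames belonging to one autoregressive block."""
    start = block_idx * block_len
    return encoded[start:start + block_len]
```

The moment `o × d` is always orthogonal to `d`, so the six numbers describe a line in space independently of which point on the ray is chosen as its origin.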

04 Fast–Slow Memory Training

A differential learning-rate strategy: self/cross-attention layers stay slow to preserve the long-video prior, while new control modules adapt fast — stabilizing controllable post-training and reducing signal interference.
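
A differential learning-rate setup of this kind is usually expressed as optimizer parameter groups. The sketch below sorts named parameters into a slow group (backbone attention) and a fast group (new control modules) and returns the list-of-dicts layout that PyTorch optimizers such as `torch.optim.AdamW` accept. The name patterns, the two learning-rate values, and the choice to leave all other parameters out of the optimizer are assumptions for illustration only.

```python
from typing import Any, Dict, Iterable, List, Tuple

SLOW_KEYS = ("self_attn", "cross_attn")     # assumed names of backbone attention layers
FAST_KEYS = ("motion_ctrl", "camera_ctrl")  # assumed names of the new control modules


def build_param_groups(named_params: Iterable[Tuple[str, Any]],
                       slow_lr: float = 1e-6,
                       fast_lr: float = 1e-4) -> List[Dict[str, Any]]:
    """Split parameters into a slow (prior-preserving) and a fast (control) group."""
    slow: List[Any] = []
    fast: List[Any] = []
    for name, param in named_params:
        if any(key in name for key in FAST_KEYS):
            fast.append(param)   # control modules adapt quickly
        elif any(key in name for key in SLOW_KEYS):
            slow.append(param)   # attention layers move slowly to keep the video prior
        # Any other parameter is left out, i.e. frozen (an assumption of this sketch).
    return [{"params": slow, "lr": slow_lr}, {"params": fast, "lr": fast_lr}]
```

With a real model, the input would typically be `model.named_parameters()` and the result would be passed straight to the optimizer constructor.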

03 / Multi-Agent World

Cue the cast: multi-agent world control

You don't just generate a clip — you direct a world. Directing the World lets you choreograph multiple human agents simultaneously, each driven by its own SMPL motion script, all moving through one persistent, camera-explored scene.

Each clip pairs the rendered SMPL motion script (the actor's blocking) with the generated world it drives. The t-guided projection keeps every actor on mark while the world stays coherent.

04 / Evaluation

State-of-the-art under joint control

We evaluate on APRIL-AIGC/UltraVideo-Long under motion-only, camera-only, and joint motion-camera settings. Directing the World achieves the best overall score (3.868) with strong temporal stability and camera fidelity.

Method      Overall   Temporal   Temporal   Motion   Camera   Camera   Camera
            Score     Consist.   Quality    Err.     ATE      RPE      RRE
FunCamera   3.401     86.59      59.77      0.213    0.601    0.101    0.970
FunMotion   3.445     91.76      62.52      0.184    -        -        -
WanMove     3.401     89.67      64.23      0.194    0.669    0.077    0.685
Uni3C       3.445     88.65      56.07      0.151    0.482    0.063    0.722
★ Ours      3.868     90.25      62.98      0.161    0.532    0.070    0.349

"Refined", one of the three Control columns (Refined, Motion, Camera), denotes fine-grained controllability beyond coarse trajectory-level guidance. Metrics are omitted (-) where a method does not support the control.

Joint Motion + Camera: long-horizon world exploration
Consistent Subject: stable identity over 20s rollout
Dynamic Scene: large perspective change + motion
World Exploration: coherent camera-controlled dynamics

05 / Data

A controllable world-model dataset

Built from 20M public iStock videos with synchronized video, text, human-motion (SMPL) and camera-trajectory annotations — filtered and aligned in a unified latent space.

Stage I

Motion-Centric Subset

~20K

High human motion · low camera motion · static/stable backgrounds — ideal for learning SMPL-to-video control.

Stage II

Camera-Centric Subset

~30K

High human motion · high camera motion · vivid viewpoints and large perspective change — for world-exploration camera control.
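
The split between the two subsets can be pictured as a simple routing rule on two per-clip scores. In the sketch below, the `human_motion` and `camera_motion` scores, their [0, 1] range, and both thresholds are hypothetical. The page only states that both subsets have high human motion and that they differ in camera motion.

```python
from typing import Optional


def assign_subset(human_motion: float, camera_motion: float,
                  motion_thr: float = 0.5, camera_thr: float = 0.5) -> Optional[str]:
    """Route a clip to the Stage I or Stage II subset, or drop it.

    Scores are assumed to be normalised to [0, 1]; thresholds are illustrative.
    """
    if human_motion < motion_thr:
        return None  # both subsets require high human motion
    if camera_motion >= camera_thr:
        return "camera_centric"  # Stage II: high camera motion
    return "motion_centric"      # Stage I: low camera motion, stable background
```

For example, a clip with strong human motion filmed from a tripod would land in the motion-centric subset, while the same action filmed with a sweeping handheld camera would land in the camera-centric one.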

06 / Citation

Cite this work

directing_the_world.bib
@article{wang2026directing,
  title   = {Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control},
  author  = {Wang, Haoyuan and Chen, Yabo and Huang, Haibin and Zhang, Chi and Li, Xuelong},
  journal = {IEEE Transactions on Multimedia (TMM)},
  volume  = {XX},
  number  = {XX},
  year    = {2026}
}