Video-Mirai

Autoregressive Video Diffusion Models Need Foresight

Video-Mirai closes the representation-level planning gap of causal video generators by letting future segments supervise the current causal state at training time only.

Yonghao Yu¹, Lang Huang², Runyi Li³, Zerun Wang¹, Toshihiko Yamasaki¹
¹The University of Tokyo, ²National Institute of Informatics, ³Peking University
Headline results (Baseline → Video-Mirai)
5 s VBench Total (vs. Causal-Forcing): 83.82 → 84.62
30 s Subject Consistency: 84.93 → 88.47 (+3.54, beyond the training horizon)
30 s Background Consistency: 90.22 → 91.94 (+1.72, beyond the training horizon)
Inference cost: +0 FLOPs, parameters, and KV-cache
§ 1 The Planning Gap

A segment may look correct in isolation while failing to specify what must remain true later.

Present-segment supervision is under-constrained: many hidden states can generate the same plausible current segment, but only some retain identity, layout, and motion cues that future segments will need. We call this the representation-level planning gap.

§ 2 Method

Use foresight as supervision, not as input.

The causal generator rolls out causally under the same mask used at inference. A frozen foresight encoder then reads the completed rollout, including future segments, and produces future-aware feature targets. A lightweight predictor maps each causal hidden state to its fused target via a cosine loss. After training, the foresight encoder and predictor are discarded.

Overview of Video-Mirai (Figure 2 of the paper). We take a three-segment video as an example. The causal DiT denoises X₂ from its noisy version conditioned on X₁ via KV-cache. A frozen foresight encoder processes the causal DiT’s clean rollout {X₁, X₂, X₃}, including the future segment X₃. A predictor maps the causal DiT’s hidden state h₂ into the encoder’s space, where the foresight loss aligns it with the encoder’s fused hidden state H̄₂, which contains foresight information.
Foresight Loss
$$\ell^{F}_i \;=\; 1 - \cos\!\big(\phi_\omega(\mathbf{h}_i^{L}),\; \mathrm{sg}[\bar{\mathbf{H}}_i]\big)$$

The targets are stop-gradient encoder states fused across a small look-ahead window $\Delta = \{0,1\}$. A 3-block DiT predictor $\phi_\omega$ at mid-depth ($\alpha\!=\!0.5$) projects the causal hidden state into the encoder's space.
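To make the loss concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the fusion across the window is assumed to be a simple average (the page does not specify the fusion operator), offsets that run past the end of the clip are assumed to be skipped, and the function names `fuse_targets` and `foresight_loss` are hypothetical.

```python
import math

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def fuse_targets(encoder_states, i, window=(0, 1)):
    """Fuse encoder states H_{i+d} for d in the look-ahead window.

    ASSUMPTION: fusion is an element-wise mean, and offsets beyond the
    last segment are skipped. The source only says the targets are "fused".
    """
    picked = [encoder_states[i + d] for d in window
              if 0 <= i + d < len(encoder_states)]
    dim = len(picked[0])
    return [sum(state[k] for state in picked) / len(picked) for k in range(dim)]

def foresight_loss(predicted, encoder_states, i, window=(0, 1)):
    """1 - cos(predicted, fused target) for segment i.

    `predicted` stands for phi_omega(h_i^L). The target is treated as a
    constant, which plays the role of the stop-gradient sg[.].
    """
    target = fuse_targets(encoder_states, i, window)
    return 1.0 - cosine_similarity(predicted, target)

# Toy usage: three segments with 2-dimensional encoder states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(foresight_loss([1.0, 1.0], states, i=0))
```

In an autodiff framework the constant target would be expressed with the framework's stop-gradient primitive (for example `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX), so that only the generator and predictor receive gradients.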

Key properties
  • The future is never fed into the generator at inference.

  • The encoder is the same Wan-14B already used by the DMD score teacher, so there are no extra parameters at deployment.

  • Drop-in on top of Self-Forcing and Causal-Forcing; frame-wise and chunk-wise.
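For readers unfamiliar with the frame-wise versus chunk-wise distinction, the sketch below builds a generic block-causal attention mask. It is an assumption-laden illustration: the exact mask used by Self-Forcing or Causal-Forcing is not given on this page, and the function name and parameters (`block_causal_mask`, `tokens_per_frame`, `frames_per_chunk`) are hypothetical.

```python
def block_causal_mask(num_frames, tokens_per_frame, frames_per_chunk=1):
    """Return mask[q][k], True when query token q may attend to key token k.

    Tokens attend to every token in their own chunk and in earlier chunks,
    never to later chunks. frames_per_chunk=1 gives a frame-wise mask;
    larger values give a chunk-wise mask.
    """
    def chunk_of(token):
        frame = token // tokens_per_frame
        return frame // frames_per_chunk

    n = num_frames * tokens_per_frame
    return [[chunk_of(k) <= chunk_of(q) for k in range(n)] for q in range(n)]

# Toy usage: 4 frames of 2 tokens each, grouped into chunks of 2 frames.
mask = block_causal_mask(num_frames=4, tokens_per_frame=2, frames_per_chunk=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the same mask is applied during the training rollout and at inference, nothing about the generator's inputs changes when the foresight encoder and predictor are removed.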

§ 3 Interactive Playground

See what the model has internalized about the future.

Two interactive probes. The first looks inside the causal generator and asks: how much of the future can be decoded from its current hidden state? The second varies the look-ahead window the model was trained with and shows how visual quality and consistency move.

Demo · Foresight Readout Probe

Decoding the future from a frozen hidden state

An MLP readout reconstructs future RGB frames from layer-15 features, which are kept frozen at probe time.
Panels: current frame, baseline readout, Video-Mirai readout, future frame. Controls: prompt and look-ahead Δ.
A higher Δ asks the readout to recover frames further in the future. Video-Mirai degrades gracefully; the baseline collapses to the current frame.
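One way to picture the probe setup is as a supervised dataset in which the feature at step t is paired with the frame at step t + Δ. The helper below is a hypothetical sketch of that pairing only; the MLP itself, the feature extraction, and the name `make_probe_pairs` are not taken from the paper.

```python
def make_probe_pairs(features, frames, delta):
    """Pair the frozen feature at step t with the RGB frame at step t + delta.

    features: per-step hidden features (any objects, e.g. flattened vectors)
    frames:   per-step target frames, same length as features
    delta:    non-negative look-ahead; delta=0 reconstructs the current frame
    """
    if delta < 0:
        raise ValueError("delta must be non-negative")
    if len(features) != len(frames):
        raise ValueError("features and frames must have the same length")
    # Steps whose target would fall past the end of the clip are dropped.
    return [(features[t], frames[t + delta]) for t in range(len(frames) - delta)]

# Toy usage with string placeholders instead of real tensors.
feats = ["h0", "h1", "h2", "h3"]
rgb = ["x0", "x1", "x2", "x3"]
print(make_probe_pairs(feats, rgb, delta=1))
```

Only the readout would be trained on these pairs, so any gain in reconstruction quality has to come from information already present in the features.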
Demo · Readout vs. Rollout Average

Video-Mirai internalizes the future distribution

The one-shot readout from the frozen Video-Mirai hidden state closely matches the empirical average of 4 stochastic causal rollouts from the same starting frame, suggesting that the hidden state encodes the marginal future distribution, not just one trajectory.
Panels: current frame, readout, rollouts average, and four stochastic rollouts.
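The comparison between a one-shot readout and the rollout average can be quantified in several ways. The sketch below uses mean squared error on flattened frames, which is an assumption: the page does not state which metric, if any, the authors use, and the names `mean_frame`, `mse`, and `readout_gap` are hypothetical.

```python
def mean_frame(frames):
    """Pixel-wise average of equally sized, flattened frames."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

def mse(a, b):
    """Mean squared error between two flattened frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def readout_gap(readout, rollouts):
    """Return (MSE to the rollout average, mean MSE to individual rollouts)."""
    average = mean_frame(rollouts)
    to_average = mse(readout, average)
    to_each = sum(mse(readout, r) for r in rollouts) / len(rollouts)
    return to_average, to_each

# Toy usage: four 2-pixel "rollouts" and a readout equal to their mean.
rollouts = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
print(readout_gap([0.5, 0.5], rollouts))
```

The first value can never exceed the second, since the mean error to individual rollouts equals the error to their average plus the spread of the rollouts around that average. A readout that sits close to the average while staying far from each individual sample is what one would expect if it summarizes the distribution of futures rather than a single trajectory.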
Demo · Foresight Window Ablation

Train with different look-ahead windows

Each button corresponds to a separately trained model (Table 1 in paper).
The window $\{0,1\}$, i.e. the current segment plus a one-segment look-ahead, gives the best Quality and Total scores.
§ 4 Quantitative Results

Drop-in gains across settings, amplified beyond the training horizon.

Columns: Method, Quality (5 s), Semantic (5 s), Total (5 s), Subject Consistency (30 s), Background Consistency (30 s), Overall (30 s)

Best within each baseline pair in bold. All variants use the same training prompts and budget.

§ 6 Cite

BibTeX

If you find this work useful, please cite:

@article{yu2026videomirai,
  title={Video-Mirai: Autoregressive Video Diffusion Models Need Foresight},
  author={Yu, Yonghao and Huang, Lang and Li, Runyi and Wang, Zerun and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2606.03971},
  year={2026}
}