Video-Mirai · Autoregressive Video Diffusion Models Need Foresight

Baseline → Video-Mirai

5 s VBench Total: 83.82 → 84.62 (vs Causal-Forcing)
30 s Subject Consistency: 84.93 → 88.47 (+3.54 beyond training horizon)
30 s Background Consistency: 90.22 → 91.94 (+1.72 beyond training horizon)
Inference cost: +0 (FLOPs · params · KV-cache)

§ 1 The Planning Gap

A segment may look correct in isolation while failing to specify what must remain true later.

Present-segment supervision is under-constrained: many hidden states can generate the same plausible current segment, but only some retain identity, layout, and motion cues that future segments will need. We call this the representation-level planning gap.

§ 2 Method

Use foresight as supervision, not as input.

The causal generator rolls out under the same causal mask used at inference. A frozen foresight encoder then reads the completed rollout, including future segments, and produces future-aware feature targets. A lightweight predictor maps each causal hidden state into the encoder's feature space, and a cosine loss aligns it with the fused target. After training, the foresight encoder and predictor are discarded.

Figure 2 of the paper: Overview of Video-Mirai, using a three-segment video as an example. The causal DiT denoises X₂ from its noisy version, conditioned on X₁ via the KV-cache. A frozen foresight encoder processes the causal DiT’s clean rollout {X₁, X₂, X₃}, including the future segment X₃. A predictor maps the causal DiT’s hidden state h₂ into the encoder’s space, where the foresight loss aligns it with the encoder’s fused hidden state H̄₂, which contains foresight information.

Foresight Loss

$$\ell^{F}_i \;=\; 1 - \cos\!\big(\phi_\omega(\mathbf{h}_i^{L}),\; \mathrm{sg}[\bar{\mathbf{H}}_i]\big)$$

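Here $\mathrm{sg}[\cdot]$ denotes stop-gradient, and the targets $\bar{\mathbf{H}}_i$ are encoder features fused across a small look-ahead window $\Delta = \{0,1\}$. The predictor $\phi_\omega$, a 3-block DiT attached at mid-depth ($\alpha\!=\!0.5$), projects the causal hidden state $\mathbf{h}_i^{L}$ into the encoder's space.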
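The loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`cosine`, `fuse_targets`, `foresight_loss`) are hypothetical, feature vectors are plain lists of floats, and fusing the window by a simple mean is an assumption, since the page does not state the fusion rule. Stop-gradient has no effect without autograd, so the target is simply treated as a constant.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def fuse_targets(encoder_states, i, window=(0, 1)):
    """Fuse encoder states H_{i+d} for each offset d in the look-ahead window.

    ASSUMPTION: fusion is an element-wise mean. Offsets that run past the
    end of the rollout are skipped, so the last segment uses only d = 0.
    """
    picked = [
        encoder_states[i + d]
        for d in window
        if 0 <= i + d < len(encoder_states)
    ]
    dim = len(picked[0])
    return [sum(h[k] for h in picked) / len(picked) for k in range(dim)]


def foresight_loss(predicted, target):
    """1 - cos(predicted, target); the target is treated as a constant."""
    return 1.0 - cosine(predicted, target)


# Toy rollout of three segments with 2-dimensional encoder features.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_0 = fuse_targets(encoder_states, 0)   # mean of H_0 and H_1
loss_0 = foresight_loss([1.0, 1.0], target_0)
```

In an autograd framework the target would be detached before the loss is computed, so that gradients flow only into the predictor and the causal generator.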

Key properties

The future is never fed into the generator at inference.
The encoder is the same Wan-14B already used by the DMD score teacher, so there are no extra parameters at deployment.
Works as a drop-in on top of Self-Forcing and Causal-Forcing, in both frame-wise and chunk-wise settings.

§ 3 Interactive Playground

See what the model has internalized about the future.

Two interactive probes. The first looks inside the causal generator and asks: how much of the future can be decoded from its current hidden state? The second varies the look-ahead window the model was trained with and shows how visual quality and consistency move.

Demo · Foresight Readout Probe

Decoding the future from a frozen hidden state

An MLP readout reconstructs future RGB from layer-15 features, frozen at probe time.

The look-ahead Δ sets how far ahead the readout must reconstruct. A higher Δ asks the readout to recover frames further in the future. Video-Mirai degrades gracefully; the baseline collapses to the current frame.

Demo · Readout vs. Rollout Average

Video-Mirai internalizes the future distribution

The one-shot readout from the frozen Video-Mirai hidden state closely matches the empirical average of 4 stochastic causal rollouts from the same starting frame, suggesting that the hidden state encodes the marginal future distribution, not just one trajectory.
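The comparison described above can be sketched as follows. This is an illustrative sketch only: `mean_frame` and `mse` are hypothetical helpers, frames are flattened lists of pixel values, and the numbers are toy data rather than results from the demo. The empirical mean of sampled futures estimates the expected future frame, which is the prediction that minimizes mean squared error, so a readout that sits close to this mean is consistent with the hidden state summarizing many futures rather than one.

```python
def mean_frame(rollouts):
    """Pixel-wise average of several rollout frames (flattened lists)."""
    k = len(rollouts)
    return [sum(pixels) / k for pixels in zip(*rollouts)]


def mse(a, b):
    """Mean squared error between two flattened frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy data: 4 stochastic rollouts of a 3-pixel "frame" (made-up values).
rollouts = [
    [0.0, 0.5, 1.0],
    [0.2, 0.5, 0.8],
    [0.0, 0.7, 1.0],
    [0.2, 0.3, 0.8],
]
readout = [0.1, 0.5, 0.9]          # hypothetical one-shot readout

average = mean_frame(rollouts)
gap_to_average = mse(readout, average)
gap_to_single = mse(readout, rollouts[0])
```

On this toy data the readout is closer to the rollout average than to any single rollout, which is the pattern the demo reports.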

Demo · Foresight Window Ablation

Train with different look-ahead windows

Each look-ahead window setting corresponds to a separately trained model (Table 1 in the paper).

$\{0,1\}$, meaning the current segment plus a one-segment look-ahead, gives the best Quality and Total.

§ 4 Quantitative Results

Drop-in gains across settings, amplified beyond the training horizon.

Method	Quality 5s	Semantic 5s	Total 5s	Subj. Cons. 30s	Bg. Cons. 30s	Overall 30s

Best within each baseline pair in bold. All variants use the same training prompts and budget.

§ 5 Gallery

Generation quality, within and beyond the training horizon.

Each card shows a baseline / Video-Mirai pair on the same prompt.

5-second generations

Within the training horizon.

30-second rollouts

Beyond the training horizon.

Video-Mirai composes with Rolling-Sink for minute-level long-video generation. The 30-second clips above are produced by stacking Rolling-Sink on top of the Video-Mirai foresight checkpoint, which was trained with a 5-second horizon.

§ 6 Cite

BibTeX

If you find this work useful, please cite:

@article{yu2026videomirai,
  title={Video-Mirai: Autoregressive Video Diffusion Models Need Foresight},
  author={Yu, Yonghao and Huang, Lang and Li, Runyi and Wang, Zerun and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2606.03971},
  year={2026}
}