Abstract
Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with a next-token likelihood objective. This strictly causal supervision optimizes each step only against its immediate next token, which weakens global coherence and slows convergence.
We ask whether foresight, i.e., training signals that originate from later tokens, can help AR visual generation. Through controlled diagnostics across injection level, spatial layout, and foresight source, we find that aligning foresight with the AR model's internal representations on the 2D image grid is crucial for effective causality modeling.
We instantiate this insight with Mirai (meaning “future” in Japanese), a general framework that injects future information into AR training without architecture changes or inference-time overhead. Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations. Extensive experiments on ImageNet show that Mirai both accelerates convergence and improves generation quality.
Method Overview
Mirai aligns the AR model's internal representations with foresight from either a bidirectional or a unidirectional foresight encoder on the 2D token grid.
Depending on the source of the foresight, Mirai admits two instantiations:
Mirai-E provides explicit, position-indexed foresight from an Exponential Moving Average (EMA) of the unidirectional AR model itself, aligning each internal state to foresight at a small set of nearby future locations.
Mirai-I supplies implicit, context-aggregating foresight by aligning internal states to features from a frozen bidirectional encoder at matched spatial locations (a minimal sketch of the alignment objective follows below).
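In code, the training objective can be summarized as standard next-token cross-entropy plus an auxiliary alignment term on intermediate hidden states. The sketch below is a minimal PyTorch illustration, not the released implementation; the function names, projection head, loss weight, and choice of layer are hypothetical, and the foresight tensor stands for EMA states at future positions (Mirai-E) or frozen bidirectional-encoder features at matched positions (Mirai-I).

import torch
import torch.nn.functional as F

def foresight_alignment_loss(hidden, foresight, proj):
    # hidden:    (B, N, D)  internal states from an intermediate AR layer
    # foresight: (B, N, Dt) target features (EMA future states for Mirai-E,
    #            frozen bidirectional-encoder features for Mirai-I)
    # proj:      small head mapping hidden states into the target space
    pred = F.normalize(proj(hidden), dim=-1)
    tgt = F.normalize(foresight.detach(), dim=-1)   # no gradient into the targets
    return 1.0 - (pred * tgt).sum(dim=-1).mean()    # negative cosine similarity

def training_loss(logits, tokens, hidden, foresight, proj, lam=1.0):
    # Standard AR next-token cross-entropy ...
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    # ... plus the auxiliary foresight term; the projection head is dropped
    # at inference, so sampling cost is unchanged.
    return ce + lam * foresight_alignment_loss(hidden, foresight, proj)

Under this view, Mirai-E and Mirai-I differ only in how the foresight targets are produced; the AR architecture and the inference procedure are unchanged.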
System-Level Comparison
System-Level Comparison on ImageNet 256×256. ↓ and ↑ indicate whether lower or higher values are better, respectively.
FID Comparisons
FID comparisons between Mirai and vanilla LlamaGen across different model sizes and training epochs on ImageNet 256×256.
Internal Representation Visualization
Visualization of layer-8 internal representations on the 2D token grid. Each token’s 2D t-SNE embedding is mapped to a color (with the Color Map at bottom left) and plotted at its original grid location. Smooth color fields indicate 2D-structured representations; the red rectangle in LlamaGen-B highlights abrupt color changes where spatial structure breaks down.
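For reference, the visualization can be reproduced along the following lines. This is a minimal sketch assuming a 16×16 token grid and scikit-learn's t-SNE; the helper name and the exact color mapping are illustrative rather than the paper's exact recipe.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_token_grid(features, grid=16):
    # features: (grid*grid, D) intermediate-layer states of one image's tokens
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    # Normalize the 2D embedding to [0, 1] and reuse it as color channels,
    # so tokens that are close in t-SNE space receive similar colors.
    emb = (emb - emb.min(0)) / (emb.max(0) - emb.min(0) + 1e-8)
    colors = np.stack([emb[:, 0], emb[:, 1], 1.0 - emb[:, 0]], axis=-1)
    plt.imshow(colors.reshape(grid, grid, 3))
    plt.axis("off")
    plt.show()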
ImageNet Results
Generated samples on ImageNet 256×256 from LlamaGen-XL with Mirai-I. Mirai improves global structure, object integrity, and perceptual quality compared to the AR baseline.
BibTeX
@article{yu2026mirai,
title={Mirai: Autoregressive Visual Generation Needs Foresight},
author={Yu, Yonghao and Huang, Lang and Wang, Zerun and Li, Runyi and Yamasaki, Toshihiko},
journal={arXiv preprint arXiv:2601.14671},
year={2026}
}