AnyFlow: Any-Step Video Diffusion Model with
On-Policy Flow Map Distillation

Yuchao Gu1,2, Guian Fang2, Yuxin Jiang2, Weijia Mao2, Song Han1,3, Han Cai1*, Mike Zheng Shou2*

1NVIDIA     2Show Lab, National University of Singapore     3MIT

*Corresponding authors.

arXiv 2026

Abstract

Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. We argue that this limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping (z_t → z_0) to flow-map transition learning (z_t → z_r) over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow achieves performance comparable to or better than consistency-based counterparts in the few-step regime, while supporting flexible and scalable sampling under varying step budgets.

Method Overview

Consistency Distillation vs. Flow Map Distillation

  • Consistency distillation.
    1. Forward training (initialization): learns the endpoint consistency mapping (z_t → z_0).
    2. Consistency backward simulation (on-policy distillation): replaces the original Euler sampling trajectory with a consistency-sampling trajectory, and applies a truncated gradient to reduce rollout cost.
  • Flow map distillation.
    1. Forward training (initialization): learns an arbitrary two-time transition mapping (z_t → z_r).
    2. Flow map backward simulation (on-policy distillation): preserves the original Euler sampling trajectory and decomposes the long trajectory into shortcut segments to reduce rollout cost.
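The any-step property described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `flow_map(z, t, r)` is a hypothetical handle to a distilled model that maps the state at time t directly to time r in a single evaluation, and the toy linear map stands in for a real network.

```python
import numpy as np

def flow_map_sample(flow_map, z_T, timesteps):
    """Sample by chaining flow-map transitions z_t -> z_r along a time grid.

    Because the model is trained on arbitrary two-time transitions, the same
    network serves any step budget: len(timesteps) - 1 NFEs per sample.
    """
    z = z_T
    for t, r in zip(timesteps[:-1], timesteps[1:]):
        z = flow_map(z, t, r)  # one shortcut transition per interval
    return z

# Toy stand-in for a distilled flow map: a linear contraction toward the
# origin whose composition depends only on the interval endpoints.
def toy_flow_map(z, t, r):
    return z * (r / t)

z_T = np.ones(4)                      # state at the noise end, t = 1
grid_4  = np.linspace(1.0, 0.25, 5)   # 4 NFEs
grid_16 = np.linspace(1.0, 0.25, 17)  # 16 NFEs
out_4  = flow_map_sample(toy_flow_map, z_T, grid_4)
out_16 = flow_map_sample(toy_flow_map, z_T, grid_16)
# For this toy map the chained transitions telescope, so every step budget
# lands on the same state -- a caricature of the "any-step" property.
```

A consistency model, by contrast, would be trained only on the endpoint mapping to t = 0, which is why chaining it over intermediate grids deviates from the original Euler trajectory.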

Figure 1. Consistency distillation vs. flow map distillation.

Main Results

  • AnyFlow vs. the flow-matching teacher: AnyFlow lifts performance across the entire sampling trajectory, improving over the flow-matching teacher at every step budget, with especially large gains in the few-step regime.
  • AnyFlow vs. consistency distillation: AnyFlow matches or surpasses consistency-distilled models in the few-step regime, and preserves the test-time scaling behavior of flow matching: quality continues to improve as more sampling steps are allocated.

Figure 2. Quantitative test-time scaling of AnyFlow.

Bidirectional Video Diffusion Model (1.3B)

AnyFlow vs. Flow-Matching Teacher

[Video grid: flow-matching teacher vs. AnyFlow, each at 4, 16, and 32 NFEs.]

AnyFlow vs. Consistency Distillation (rCM)

[Video grid: rCM vs. AnyFlow, each at 4, 16, and 32 NFEs.]

Causal Video Diffusion Model (1.3B)

AnyFlow-FAR vs. Consistency Distillation (Self-Forcing)

[Video grid: Self-Forcing vs. AnyFlow-FAR, each at 4, 16, and 32 NFEs.]

Figure 3. Qualitative test-time scaling of AnyFlow.

1. Causal Video Diffusion Model (14B)

1.1 T2V Comparison

AnyFlow-FAR-Wan2.1-14B achieves better dynamics and quality than community-trained consistency counterparts (Krea-Realtime-Wan2.1-14B, LightX2V-Wan2.1-14B-CausVid, and FastVideo-CausalWan2.2-A14B-Preview).

[Video grid: Krea-Realtime-Wan2.1-14B (4 NFEs), LightX2V-Wan2.1-14B-CausVid (9 NFEs), FastVideo-CausalWan2.2-A14B-Preview (8 NFEs), and AnyFlow-FAR-Wan2.1-14B (Ours, 4 NFEs).]

1.2 I2V Comparison

Within the same model, AnyFlow-FAR-Wan2.1-14B also supports I2V generation, achieving quality comparable to Wan2.1-I2V-14B (50×2 NFEs).

[Video grid: Wan2.1-I2V-14B (50×2 NFEs), FastVideo-CausalWan2.2-A14B-Preview (8 NFEs), and AnyFlow-FAR-Wan2.1-14B (Ours, 4 NFEs).]

1.3 V2V Visualization

Within the same model, AnyFlow-FAR-Wan2.1-14B also supports V2V generation (4 NFEs).

2. Bidirectional Video Diffusion Model (14B)

2.1 T2V Comparison

[Video grid: rCM-Wan2.1-T2V-14B (4 NFEs) vs. AnyFlow-Wan2.1-T2V-14B (Ours, 4 NFEs).]

3. Continued Training on Downstream Dataset

3.1 Pipeline Overview

The flow map formulation preserves a fine-grained instantaneous flow field, so the distilled model can be further trained on downstream data while retaining few-step sampling. The training pipeline is illustrated below.


Figure 4. Continued training pipeline on a downstream dataset.
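The key enabler of this pipeline is that a flow-map model still exposes an instantaneous velocity field, so a standard flow-matching objective can be reused on downstream data. Below is a minimal sketch of one such training step under stated assumptions: `velocity_fn` is a hypothetical handle to the distilled model's instantaneous velocity prediction (e.g., the flow map's small-interval limit), and the linear interpolation path follows the common rectified-flow convention; none of these names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(velocity_fn, x0, eps, t):
    """One flow-matching training objective on a downstream batch.

    Under the rectified-flow convention, the noisy state is a linear
    interpolation z_t = (1 - t) * x0 + t * eps, whose ground-truth
    velocity dz_t/dt = eps - x0 is regressed by the model.
    """
    z_t = (1 - t) * x0 + t * eps   # interpolated state at time t
    target = eps - x0              # ground-truth instantaneous velocity
    pred = velocity_fn(z_t, t)
    return np.mean((pred - target) ** 2)

x0  = rng.normal(size=(8, 4))      # downstream data batch (illustrative)
eps = rng.normal(size=(8, 4))      # Gaussian noise batch
oracle = lambda z_t, t: eps - x0   # perfect predictor, for sanity-checking
loss = flow_matching_loss(oracle, x0, eps, 0.5)
# The oracle attains zero loss; a real model is optimized toward it while
# its flow-map transitions keep few-step sampling available.
```

Because the distilled model keeps this fine-grained field intact (unlike an endpoint-only consistency model), continued training reduces to ordinary flow matching on the new domain.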

3.2 I2V Comparison

After fine-tuning AnyFlow-FAR-Wan2.1-1.3B on a specialized domain dataset, the model shows clear improvements in identity preservation (e.g., robot-arm type) and trajectory accuracy (e.g., moving pedestrians).

[Video grid: condition image, before fine-tuning (4 NFEs), and after fine-tuning (4 NFEs), shown for two samples.]

BibTeX

If you find our work useful, please consider citing:

@article{gu26anyflow,
  title={AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation},
  author={Yuchao Gu and Guian Fang and Yuxin Jiang and Weijia Mao and Song Han and Han Cai and Mike Zheng Shou},
  journal={arXiv preprint},
  year={2026},
}