Fast-dVLM: Efficient Block-Diffusion VLM
via Direct Conversion from Autoregressive VLM

Chengyue Wu1,2*, Shiyi Lan2*, Yonggan Fu2, Sensen Gao4, Jin Wang1,2, Jincheng Yu2,
Jose M. Alvarez2, Pavlo Molchanov2, Ping Luo1, Song Han2,3, Ligeng Zhu2†, Enze Xie2†

1The University of Hong Kong, 2NVIDIA, 3MIT, 4MBZUAI
*Equal contribution    †Co-lead

Real-time throughput comparison between Fast-dVLM-3B and Qwen2.5-VL-3B.


About Fast-dVLM

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized.

We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM.

We introduce a suite of multimodal diffusion adaptations—block-size annealing, causal context attention, auto-truncation masking, and vision-efficient concatenation—that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6× end-to-end inference speedup over the AR baseline.

Overview of Fast-dVLM: (a) accuracy vs. speedup on MMMU-Pro-V, (b) benchmark comparison against Qwen2.5-VL-3B, (c) up to 6.18× end-to-end speedup.

Direct vs. Two-Stage Conversion

We compare two strategies for converting a pretrained AR VLM into a block-diffusion model:

    •   Two-stage path: Stage 1 converts only the LLM backbone via text-only diffusion fine-tuning; Stage 2 attaches the vision encoder and MLP projector and jointly fine-tunes the entire model on multimodal data.

    •   Direct path: The complete AR VLM is directly fine-tuned for block-diffusion on multimodal data in a single stage, yielding a simpler pipeline that leverages the pretrained multimodal alignment.

Two AR-to-diffusion conversion strategies: two-stage vs. direct path.

Under comparable training budgets (both trained on ~2M samples for a single epoch), the direct path achieves a substantially higher average score (73.3 vs. 60.2), outperforming the two-stage path on all 10 benchmarks.

Benchmark comparison between direct and two-stage conversion paths across 10 multimodal tasks.

Training Architecture

Both paths share the same training pipeline. Only response text tokens are corrupted: a noisy stream is built by masking response positions, then concatenated with the clean stream. The attention mask enforces three rules:

    •   N2N: Noisy tokens attend bidirectionally within their block for parallel denoising.

    •   N2C: Noisy tokens attend to clean tokens from preceding blocks, including vision tokens.

    •   C2C: Clean tokens follow token-level causal attention, enabling joint AR loss training and AR decoding at inference.

Training architecture and attention mask with block size B=2.
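The three attention rules above can be sketched as a boolean mask builder. This is a minimal NumPy sketch, not the paper's implementation: the single-sequence layout [prompt | clean response | noisy response], the function name, and the absence of vision-token handling are illustrative assumptions.

```python
import numpy as np

def block_diffusion_mask(n_prompt, n_resp, B):
    """Boolean attention mask (True = query may attend to key) for a sequence
    laid out as [prompt | clean response | noisy response] with block size B.
    Assumed layout is illustrative; vision tokens would sit in the clean prefix."""
    n_clean = n_prompt + n_resp          # prompt + clean response copy
    n_total = n_clean + n_resp           # plus the noisy response copy
    mask = np.zeros((n_total, n_total), dtype=bool)

    # C2C: clean tokens use token-level causal attention
    for i in range(n_clean):
        mask[i, : i + 1] = True

    # Noisy response token j mirrors clean position n_prompt + j
    for j in range(n_resp):
        q = n_clean + j
        blk = j // B
        # N2C: attend to the prompt and clean tokens of *preceding* blocks only
        mask[q, : n_prompt + blk * B] = True
        # N2N: bidirectional attention within the token's own noisy block
        lo = n_clean + blk * B
        hi = n_clean + min((blk + 1) * B, n_resp)
        mask[q, lo:hi] = True
    return mask
```

With `n_prompt=2, n_resp=4, B=2` this reproduces the figure's B=2 pattern: noisy tokens in block 0 see only the prompt plus each other, while block 1 additionally sees the clean tokens of block 0.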

Training Recipe

    •   Block-Size Annealing: A curriculum that progressively increases the block size, allowing the model to learn fine-grained denoising before tackling larger corruption spans.

    •   Auto-Truncation Masking: Automatically truncates each response's last block at the response boundary, preventing cross-turn leakage in multi-turn dialogue while preserving block-parallel denoising.

    •   Vision-Efficient Concatenation: Since vision embeddings are never corrupted, they are included only in the clean stream. On Qwen2.5-VL-3B (H100, context length 2048), this reduces peak memory by 15.0% and training time by 14.2%.

    •   Joint Objective: The total loss combines a diffusion loss with a causal LM loss (α=β=0.5), learning parallel denoising while preserving AR generation capability.
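The joint objective can be sketched in a few lines. This is a hedged NumPy sketch under stated assumptions: a single unbatched sequence, a hypothetical `softmax_ce` helper, and per-position logits from the two modes; the actual loss in the paper may be weighted and normalized differently.

```python
import numpy as np

def softmax_ce(logits, targets):
    """Mean cross-entropy of integer targets under a row-wise softmax."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def joint_loss(diff_logits, ar_logits, targets, noisy_mask, alpha=0.5, beta=0.5):
    """alpha * diffusion loss (predict originals at masked response positions)
       + beta * causal LM loss (next-token prediction on the clean stream).
       Shapes: logits (T, V), targets (T,), noisy_mask (T,) boolean."""
    diff_ce = softmax_ce(diff_logits[noisy_mask], targets[noisy_mask])
    ar_ce = softmax_ce(ar_logits[:-1], targets[1:])   # shift for next-token prediction
    return alpha * diff_ce + beta * ar_ce
```

Setting α=β=0.5 as in the recipe weights denoising and AR prediction equally, which is what preserves the causal decoding mode used later for speculative verification.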

Benchmark Results

We evaluate Fast-dVLM on 11 multimodal benchmarks. On short-answer tasks, Fast-dVLM with speculative decoding achieves an average score of 74.0, matching the AR baseline exactly, with a Tokens/NFE ratio of 2.63. Among diffusion VLMs, Fast-dVLM achieves the best results on 8 of the 11 short-answer benchmarks.

| Model | AI2D | ChartQA | DocVQA | GQA | MMBench | MMMU | POPE | RWQA | SEED2+ | TextVQA | Avg | MMMU-Pro-V | Tok/NFE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Autoregressive Vision-Language Models* |  |  |  |  |  |  |  |  |  |  |  |  |  |
| VILA-1.5-3B | 58.0 | 53.0 | 44.3 | 61.4 | 60.5 | 31.8 | 86.8 | 53.2 | 41.2 | 58.2 | 54.8 | 6.1 | 1.00 |
| MiniCPM-V-2 | 65.0 | 59.2 | 69.8 | 51.7 | 66.3 | 37.9 | 86.5 | 56.3 | 52.5 | 74.4 | 62.0 | 10.3 | 1.00 |
| InternVL2.5-4B | 81.3 | 77.8 | 91.1 | 61.0 | 80.7 | 50.0 | 89.3 | 64.6 | 67.0 | 78.8 | 74.2 | 24.6 | 1.00 |
| Qwen2.5-VL-3B | 80.8 | 84.0 | 93.1 | 59.0 | 76.9 | 47.3 | 86.2 | 65.1 | 68.6 | 79.1 | 74.0 | 26.3 | 1.00 |
| *Diffusion Vision-Language Models* |  |  |  |  |  |  |  |  |  |  |  |  |  |
| LaViDa | 70.0 | 59.0 | 64.6 | 55.5 | 70.5 | 43.3 | 81.4 | 54.5 | 57.7 | 60.3 | 61.7 | 10.5 | 1.00 |
| Dimple | 74.4 | 63.3 | 37.7 | 59.2 | 74.6 | 45.2 | 86.2 | 55.4 | 51.7 | 61.6 | 60.9 | 12.4 | 1.00 |
| LLaDA-V | 77.8 | 78.3 | 83.9 | 53.4 | 82.9 | 48.6 | 81.8 | 63.2 | 68.7 | 64.7 | 70.3 | 18.6 | 1.00 |
| Fast-dVLM (MDM) | 79.7 | 82.8 | 92.1 | 63.0 | 74.2 | 44.6 | 88.6 | 65.1 | 67.2 | 76.1 | 73.3 | 21.4 | 1.95 |
| Fast-dVLM (spec.) | 79.7 | 83.1 | 92.9 | 63.3 | 74.3 | 46.6 | 88.6 | 65.1 | 67.2 | 79.3 | 74.0 | 24.6 | 2.63 |

Among diffusion models, blue bold = best, light blue underline = 2nd best.

Inference Acceleration

Fast-dVLM combines multiple acceleration techniques for production-grade serving:

    •   Confidence-Aware Parallel Decoding: The confidence threshold τ governs the speed–quality tradeoff. At τ=0.9, throughput nearly doubles to 1.95 tokens/step while maintaining accuracy.

    •   Self-Speculative Block Decoding: The diffusion mode drafts all B−1 tokens in one pass and the causal mode verifies them autoregressively, recovering accuracy to 24.6 (close to the AR baseline's 26.3) while achieving 1.98× wall-clock TPS speedup.

    •   SGLang + FP8 Quantization: Integration with SGLang's optimized kernels and CUDA graph, combined with SmoothQuant W8A8 (FP8), yields a total of 6.18× end-to-end speedup.
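Confidence-aware parallel decoding (the first technique above) can be sketched as a single denoising step. This is a minimal illustration under stated assumptions: a hypothetical function name, greedy (argmax) selection, and a fallback that always commits at least one token so the loop makes progress.

```python
import numpy as np

def parallel_decode_step(probs, tokens, mask_id, tau=0.9):
    """One confidence-aware denoising step over a block (sketch).
    probs: (L, V) per-position token distributions from the diffusion mode.
    tokens: (L,) current tokens, with masked positions set to mask_id.
    Commits every masked position whose top-1 probability clears tau."""
    masked = tokens == mask_id
    conf = probs.max(axis=-1)            # top-1 confidence per position
    pred = probs.argmax(axis=-1)         # greedy candidate per position
    accept = masked & (conf >= tau)
    if not accept.any():
        # guarantee progress: commit the single most confident masked position
        idx = np.where(masked)[0]
        accept[idx[conf[idx].argmax()]] = True
    out = tokens.copy()
    out[accept] = pred[accept]
    return out
```

Raising τ trades speed for caution: fewer positions clear the threshold per step, so more steps are needed, which is the speed-quality tradeoff the table below quantifies at τ=0.9.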

| Setting | MMMU-Pro-V | TPS | Speedup |
|---|---|---|---|
| AR baseline | 26.3 | 56.7 | 1.00× |
| Fast-dVLM (MDM, τ=0.9) | 21.4 | 82.2 | 1.45× |
| + Spec. decoding (linear) | 24.6 | 112.7 | 1.98× |
| + SGLang serving | 24.1 | 319.0 | 5.63× |
| + SmoothQuant-W8A8 (FP8) | 23.8 | 350.3 | 6.18× |
Effect of threshold τ on accuracy and tokens per step.
Speculative decoding throughput: linear vs. quadratic variants across block sizes.
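The verification side of self-speculative block decoding follows the standard speculative-decoding acceptance rule: keep the longest draft prefix that agrees with the verifier, then emit the verifier's token at the first disagreement. The sketch below assumes greedy verification on token IDs; the function name is illustrative.

```python
def verify_draft(draft, ar_greedy):
    """Accept the longest prefix of the diffusion-mode draft that matches the
    causal mode's greedy predictions, then append the verifier's correction
    token at the first mismatch (standard greedy speculative-decoding rule)."""
    n = 0
    while n < len(draft) and draft[n] == ar_greedy[n]:
        n += 1
    accepted = list(draft[:n])
    if n < len(ar_greedy):
        accepted.append(ar_greedy[n])   # correction from the verifier
    return accepted
```

Because the verifier scores all draft positions in one forward pass, each fully accepted block yields B tokens per verification step, which is where the wall-clock speedup over token-by-token AR decoding comes from.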

Case Study

We present qualitative examples to illustrate how Fast-dVLM compares with the AR baseline in both response quality and decoding efficiency across diverse visual understanding tasks.

Math Reasoning

On a constrained optimization problem from MMMU-Pro-V, both the AR baseline and Fast-dVLM correctly identify the optimal solution and produce coherent step-by-step reasoning. Notably, Fast-dVLM outputs cleaner, more human-readable mathematical notation. Despite generating a comparable amount of reasoning, Fast-dVLM completes the response at 77.7 tokens/s, a 1.6× speedup over the baseline's 47.4 tokens/s.

Case study: math reasoning on MMMU-Pro-V.

Diverse Visual Understanding

Across art style recognition, celebrity identification, and chart question answering, Fast-dVLM generates detailed, accurate, and fluent responses. It correctly identifies impressionist style and attributes paintings to Claude Monet, recognizes Lionel Messi with biographical context, and comprehensively reads chart data with trend analysis. Decoding throughput remains high across different response lengths, ranging from 63.7 tokens/s for short descriptions to 115.0 tokens/s for long chart analysis, with Tokens/NFE ratios consistently above 1.5.

More case studies: art, celebrity, chart QA.

Physical AI Applications

The inference speedup of Fast-dVLM is particularly impactful for physical AI deployments such as autonomous driving and robotic manipulation, where VLMs serve as core perception-reasoning modules on resource-constrained edge devices. In these scenarios, models typically operate at batch size one with strict real-time latency requirements—exactly the regime where autoregressive decoding is most bottlenecked by memory bandwidth. Fast-dVLM's block-parallel decoding shifts the workload toward a more compute-bound regime, enabling significantly faster responses while maintaining generation quality.

In the autonomous driving example, the model correctly reads highway signage and reasons about lane selection, producing a concise 149-token response at 73.3 tokens/s. In the robotic manipulation example, it generates a detailed 488-token, 8-step guide at 73.0 tokens/s. Both examples achieve a Tokens/step ratio above 1.68, confirming that the block-diffusion speedup generalizes to long-form embodied reasoning tasks.

Physical AI case study: autonomous driving and robotic manipulation.

BibTeX

@misc{wu2026fastdvlmefficientblockdiffusionvlm,
      title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
      author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
      year={2026},
      eprint={2604.06832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06832},
}