Chengyue Wu1,2, Hao Zhang2, Shuchen Xue2, Shizhe Diao2, Yonggan Fu2, Zhijian Liu2, Pavlo Molchanov2, Ping Luo1, Song Han2,3, Enze Xie2
1HKU, 2NVIDIA, 3MIT
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs, marking a significant step toward the practical deployment of fast and accurate LLMs.
The model generates text autoregressively at the block level. Each block is further divided into sub-blocks, which are decoded in parallel to improve efficiency. Because of the block diffusion mechanism, the model naturally supports block-level caching. In addition, we incorporate a sub-block cache using the dual-cache strategy introduced in Fast-dLLM v1, enabling efficient parallel decoding within blocks.
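To make this flow concrete, the sketch below shows a block-wise parallel decoding loop with a confidence-thresholded acceptance rule, written in PyTorch. The interface is hypothetical: `predict_masked`, `block_cache`, `MASK_ID`, and the threshold value are illustrative stand-ins under our reading of the pipeline, not the released implementation.

```python
import torch

MASK_ID = 0          # placeholder id for the [MASK] token (assumption)
BLOCK_SIZE = 32      # tokens per block (illustrative)
SUB_BLOCK_SIZE = 8   # tokens per sub-block decoded in parallel (illustrative)
THRESHOLD = 0.9      # confidence needed to accept a token in parallel (assumption)

def decode_block(predict_masked, prompt_ids, block_cache):
    """Fill one block sub-block by sub-block; tokens inside a sub-block are
    committed in parallel once their confidence exceeds THRESHOLD."""
    block = torch.full((BLOCK_SIZE,), MASK_ID, dtype=torch.long)
    for start in range(0, BLOCK_SIZE, SUB_BLOCK_SIZE):
        sub = slice(start, start + SUB_BLOCK_SIZE)
        while (block[sub] == MASK_ID).any():
            # Hypothetical model call: returns (token ids, confidences), both of
            # shape (BLOCK_SIZE,), conditioned on the cached representations of
            # all previous blocks (block-level cache) and the current block state.
            ids, conf = predict_masked(prompt_ids, block, block_cache)
            undecided = block[sub] == MASK_ID
            accept = undecided & (conf[sub] >= THRESHOLD)
            if not accept.any():
                # Guarantee progress: commit the single most confident
                # still-masked token in this sub-block.
                best = torch.argmax(torch.where(undecided, conf[sub],
                                                torch.tensor(-1.0)))
                accept[best] = True
            block[sub][accept] = ids[sub][accept]
    block_cache.append(block)  # finished blocks are frozen in the block-level cache
    return block
```

In the actual dual-cache strategy inherited from Fast-dLLM v1, attention states for tokens outside the active sub-block are also reused across refinement steps; here that detail is abstracted away inside `predict_masked`.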
To better utilize the autoregressive representations, we adopt a token shift mechanism: each masked token is predicted from the logits at its preceding position, allowing the model to retain its autoregressive characteristics. Meanwhile, our block-wise causal attention mask lets the model attend to all clean tokens from previous blocks as well as the noisy tokens within the current block during training. Additionally, we introduce complementary masks so that the model learns to predict under alternate masking patterns, ensuring that every token position receives a training signal. This design enables bidirectional context modeling within blocks while remaining compatible with the original AR objective, facilitating efficient, parallelizable text generation.
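These three training-time ingredients can be illustrated with a short PyTorch sketch. The function names and the simplified single-stream mask layout below are ours for exposition, not the exact training code.

```python
import torch

def block_causal_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask (True = may attend): a query attends to every position in
    earlier blocks and to all positions inside its own block, giving
    bidirectional context within a block and causality across blocks."""
    blk = torch.arange(seq_len) // block_size
    return blk.unsqueeze(1) >= blk.unsqueeze(0)

def complementary_masks(block_len: int, mask_ratio: float = 0.5):
    """Sample a noise mask over one block and return it with its complement,
    so the two passes jointly cover every token position."""
    m = torch.rand(block_len) < mask_ratio
    return m, ~m

def token_shift(logits: torch.Tensor) -> torch.Tensor:
    """Read the prediction for position t from the output at position t-1,
    mirroring the AR next-token objective; position 0 is zero-padded here
    purely for illustration."""
    pad = torch.zeros_like(logits[..., :1, :])
    return torch.cat([pad, logits[..., :-1, :]], dim=-2)
```

During training, the loss would be computed only at masked positions, with each masked position's target read from the shifted output of its predecessor.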
We compare the inference throughput (tokens per second) and GSM8K accuracy of various language model variants on a single A100 GPU. Fast-dLLM v2 (7B, green) significantly outperforms all baselines in both efficiency and accuracy. In panel (a), Fast-dLLM v2 achieves 2.54× higher throughput than Qwen2.5-7B-Instruct while improving accuracy by 5.2% over Fast-dLLM-LLaDA. In panel (b), Fast-dLLM v2 demonstrates strong scalability: throughput increases from 102.5 tokens/sec at batch size 1 to 217.5 tokens/sec at batch size 4, substantially higher than Qwen2.5 and the earlier Fast-dLLM variants built on Dream and LLaDA. These results highlight the effectiveness of Fast-dLLM v2's parallel decoding optimizations.
@misc{wu2025fastdllmv2efficientblockdiffusion,
  title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
  author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
  year={2025},
  eprint={2509.26328},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.26328},
}