Chengyue Wu1,2, Hao Zhang2, Shuchen Xue2, Shizhe Diao2, Yonggan Fu2, Zhijian Liu2, Pavlo Molchanov2, Ping Luo1, Song Han2,3, Enze Xie2
1HKU, 2NVIDIA, 3MIT
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. However, their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that transforms pretrained AR models into diffusion-style decoders for parallel text generation. Our approach introduces a novel decoding recipe that combines a block diffusion mechanism with a complementary attention mask, together enabling blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further accelerate inference, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a sub-block-level cache that supports efficient parallel decoding within partially generated blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a nearly 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves a state-of-the-art trade-off between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward practical deployment of fast and accurate language models.
The model generates text autoregressively at the block level. Each block is further divided into sub-blocks, which are decoded in parallel to improve efficiency. Leveraging the block diffusion mechanism, the model naturally supports block-level caching. In addition, we incorporate a sub-block cache using the dual-cache strategy introduced in Fast-dLLM v1, enabling efficient parallel decoding within blocks.
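The loop below is a minimal PyTorch sketch of this hierarchical decoding scheme. The `model` interface (`prefill`, `encode_context`, `decode`, `append_block`), the mask-token id, the block and sub-block sizes, and the confidence threshold are all illustrative assumptions, not the released Fast-dLLM v2 API.

```python
import torch

MASK_ID = 0          # id of the special mask token (assumed)
BLOCK_SIZE = 32      # tokens generated per block (assumed)
SUB_BLOCK = 8        # tokens decoded in parallel per sub-block (assumed)

@torch.no_grad()
def generate(model, prompt_ids, num_blocks, threshold=0.9):
    seq = prompt_ids                              # (1, prompt_len) token ids
    block_cache = model.prefill(seq)              # block-level KV cache over the clean context
    for _ in range(num_blocks):
        block = torch.full((1, BLOCK_SIZE), MASK_ID, device=seq.device)
        for s in range(0, BLOCK_SIZE, SUB_BLOCK):
            sub = slice(s, s + SUB_BLOCK)
            # Sub-block (dual) cache: tokens outside the active sub-block are
            # encoded once and reused across refinement steps.
            sub_cache = model.encode_context(block, sub, block_cache)
            while (block[:, sub] == MASK_ID).any():
                logits = model.decode(block[:, sub], sub_cache)   # (1, SUB_BLOCK, vocab)
                conf, pred = logits.softmax(-1).max(-1)
                still_masked = block[:, sub] == MASK_ID
                accept = still_masked & (conf >= threshold)
                if not accept.any():              # always commit at least one token
                    best = conf.masked_fill(~still_masked, -1.0).argmax(-1, keepdim=True)
                    accept = torch.zeros_like(still_masked).scatter_(1, best, True)
                block[:, sub] = torch.where(accept, pred, block[:, sub])
        seq = torch.cat([seq, block], dim=1)
        block_cache = model.append_block(block_cache, block)      # extend block-level cache
    return seq
```

Committing only high-confidence predictions at each refinement step is what lets several positions in a sub-block be finalized in a single forward pass while keeping quality close to sequential decoding.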
To better utilize the autoregressive representations, we adopt a token shift mechanism: each masked token is predicted using the logit of its preceding token, enabling the model to retain autoregressive characteristics. Meanwhile, our block-wise causal attention mask allows the model to access all clean tokens from previous blocks as well as the noisy tokens within the current block during training. Additionally, we introduce complementary masks so that the model learns to predict under alternate masking patterns, ensuring that every token position serves as a prediction target. This design allows bidirectional context modeling within blocks while maintaining compatibility with the original AR objective, facilitating efficient and parallelizable text generation.
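The sketch below illustrates these three ingredients under simplifying assumptions; the function names, the boolean mask convention (True = may attend), and the 50% mask ratio are ours for illustration and are not taken from the released training code.

```python
import torch
import torch.nn.functional as F

def blockwise_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): full attention within a
    block, causal attention across blocks."""
    blk = torch.arange(seq_len) // block_size
    return blk[:, None] >= blk[None, :]            # (seq_len, seq_len)

def complementary_masks(seq_len: int, mask_ratio: float = 0.5):
    """A pair of masking patterns whose union covers every position, so each
    token is a prediction target in one of the two passes."""
    m1 = torch.rand(seq_len) < mask_ratio
    return m1, ~m1

def token_shift_loss(logits, clean_ids, noisy_ids, mask_id):
    """Token shift: the logit at position i-1 predicts the clean token at
    position i, with the loss taken only where position i was masked."""
    pred = logits[:, :-1]                          # (B, L-1, vocab)
    tgt = clean_ids[:, 1:].clone()                 # (B, L-1)
    tgt[noisy_ids[:, 1:] != mask_id] = -100        # ignore positions that were not masked
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), tgt.reshape(-1),
                           ignore_index=-100)
```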
We compare the inference throughput (tokens per second) of different language model variants at two batch sizes (1 and 4) on a single A100 GPU. Fast-dLLM v2 (in green) significantly outperforms all baselines, including the original Qwen2.5-1.5B-Instruct model and the previous Fast-dLLM variants built on Dream and LLaDA. At batch size 1, Fast-dLLM v2 achieves a throughput of 102.5 tokens per second, more than double that of the Qwen2.5 baseline. The advantage becomes even more pronounced at batch size 4, where Fast-dLLM v2 reaches 201.0 tokens per second, nearly 2x faster than Qwen2.5 and approximately 6x faster than Fast-dLLM-LLaDA and Fast-dLLM-Dream. These results demonstrate the efficiency and scalability of the proposed parallel decoding approach in Fast-dLLM v2.
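For context, throughput figures of this kind can be measured with a simple timing harness along the lines of the hypothetical sketch below, where `generate_fn` stands in for either the AR baseline or the Fast-dLLM v2 decoder (an assumed interface, not a released API).

```python
import time
import torch

def throughput(generate_fn, prompt_ids, max_new_tokens=256, batch_size=1,
               warmup=2, iters=5):
    """Measure generated tokens per second for a hypothetical `generate_fn`."""
    batch = prompt_ids.repeat(batch_size, 1).cuda()
    for _ in range(warmup):                        # warm up kernels and caches
        generate_fn(batch, max_new_tokens)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        generate_fn(batch, max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return batch_size * max_new_tokens * iters / elapsed
```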