Fast-dLLM v2: Efficient Block-Diffusion Large Language Model

Chengyue Wu1,2, Hao Zhang2, Shuchen Xue2, Shizhe Diao2, Yonggan Fu2, Zhijian Liu2, Pavlo Molchanov2, Ping Luo1, Song Han2,3, Enze Xie2

1HKU, 2NVIDIA, 3MIT


About Fast-dLLM v2

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. However, their inherently sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that transforms pretrained AR models into diffusion-style decoders for parallel text generation. Our approach introduces a novel training recipe incorporating the block diffusion mechanism and complementary attention masks, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a sub-block-level cache that supports efficient parallel decoding within partially generated blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a nearly 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward the practical deployment of fast and accurate language models.

Generation Process

The model generates text in an autoregressive manner at the block level. Each block is further divided into sub-blocks, which are decoded in parallel to enhance efficiency. Leveraging the block diffusion mechanism, the model naturally supports block-level caching. In addition, we incorporate a sub-block cache using the dual-cache strategy introduced in Fast-dLLM v1, enabling fast and efficient parallel decoding within blocks.
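To make the decoding flow concrete, the sketch below mimics the blockwise generation loop in plain PyTorch. It is a simplified illustration rather than the actual Fast-dLLM v2 implementation: toy_model, BLOCK_SIZE, SUB_BLOCK_SIZE, and CONF_THRESHOLD are hypothetical placeholders, and the block-level and sub-block caches are represented only by the reused prefix tensor instead of real transformer KV caches.

import torch

BLOCK_SIZE = 8         # tokens per block (illustrative value)
SUB_BLOCK_SIZE = 4     # tokens per sub-block (illustrative value)
VOCAB_SIZE = 32        # toy vocabulary size
MASK_ID = VOCAB_SIZE   # id reserved for the [MASK] token
CONF_THRESHOLD = 0.9   # accept a prediction in parallel only above this confidence

def toy_model(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the diffusion LLM: returns random logits for every position."""
    return torch.randn(tokens.shape[0], VOCAB_SIZE)

def decode_block(prefix: torch.Tensor) -> torch.Tensor:
    """Fill one block of [MASK] tokens, sub-block by sub-block, in parallel.
    `prefix` stands in for the block-level cache of previously generated blocks."""
    block = torch.full((BLOCK_SIZE,), MASK_ID, dtype=torch.long)
    for start in range(0, BLOCK_SIZE, SUB_BLOCK_SIZE):
        sub = block[start:start + SUB_BLOCK_SIZE]        # view into the current sub-block
        while (sub == MASK_ID).any():                    # iterate until fully unmasked
            ctx = torch.cat([prefix, block])
            logits = toy_model(ctx)[len(prefix) + start : len(prefix) + start + SUB_BLOCK_SIZE]
            conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
            masked = sub == MASK_ID
            accept = masked & (conf > CONF_THRESHOLD)    # parallel acceptance of confident tokens
            if not accept.any():                         # always unmask at least one token per step
                accept[torch.where(masked)[0][conf[masked].argmax()]] = True
            sub[accept] = pred[accept]
    return block

# Usage: generate two blocks autoregressively after a toy prompt.
prompt = torch.tensor([1, 2, 3])
sequence = prompt
for _ in range(2):
    sequence = torch.cat([sequence, decode_block(sequence)])
print(sequence)

In the real model, the prefix would not be re-encoded on every step; its key/value states are stored in the block-level cache, and the dual sub-block cache similarly avoids recomputing partially decoded positions within the current block.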

Overview

Training Recipe

To better utilize the autoregressive representations, we adopt a token shift mechanism: each masked token is predicted using the logit of its preceding token, enabling the model to retain autoregressive characteristics. Meanwhile, our block-wise causal attention mask allows the model to access all clean tokens from previous blocks as well as the noisy tokens within the current block during training. Additionally, we introduce complementary masks that let the model learn to predict from alternate masking patterns, ensuring that every token position is eventually learned. This design allows for bidirectional context modeling within blocks while maintaining compatibility with original AR objectives, facilitating efficient and parallelizable text generation.
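The sketch below shows, under assumptions, how the block-wise causal attention mask, the complementary token masks, and the token shift could be expressed in PyTorch. The function names, the mask_ratio default, and the -100 ignore-index convention are illustrative choices, not taken from the released training code.

import torch

def blockwise_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if query position i may attend to key j:
    every token in earlier blocks plus every token inside i's own block."""
    blk = torch.arange(seq_len) // block_size            # block index of each position
    return blk.unsqueeze(1) >= blk.unsqueeze(0)          # attend iff key block <= query block

def complementary_masks(seq_len: int, mask_ratio: float = 0.5):
    """Two masking patterns whose union covers every position, so each token
    serves as a prediction target in at least one of the two passes."""
    m1 = torch.rand(seq_len) < mask_ratio
    return m1, ~m1

def shifted_targets(input_ids: torch.Tensor, token_mask: torch.Tensor) -> torch.Tensor:
    """Token shift: the output at position t-1 is trained to predict the masked
    token at position t, keeping the next-token (AR-style) objective."""
    targets = input_ids.clone()
    targets[~token_mask] = -100       # ignore unmasked positions in the loss (assumed convention)
    return targets[:, 1:]             # aligns with logits[:, :-1] in a cross-entropy loss

# Example: 8 positions, block size 4.
attn_mask = blockwise_causal_mask(8, 4)
m1, m2 = complementary_masks(8)
input_ids = torch.randint(0, 32, (1, 8))
targets = shifted_targets(input_ids, m1.unsqueeze(0))
print(attn_mask.int())
print(m1.int(), m2.int())

Within a block the mask is fully bidirectional, while across blocks it remains causal, which is what lets the model keep its AR behavior at the block level while training on masked tokens inside each block.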

Training recipe

Throughput Comparison

We compare the inference throughput (tokens per second) of different language model variants at two batch sizes (1 and 4) on a single A100 GPU. Fast-dLLM v2 (in green) significantly outperforms all baselines, including the original Qwen2.5-1.5B-Instruct model and the previous Fast-dLLM variants built on Dream and LLaDA. At batch size 1, Fast-dLLM v2 reaches a throughput of 102.5 tokens per second, more than double that of the Qwen2.5 baseline. The advantage becomes even more pronounced at batch size 4, where Fast-dLLM v2 reaches 201.0 tokens per second, nearly 2x faster than Qwen2.5 and roughly 6x faster than Fast-dLLM-LLaDA and Fast-dLLM-Dream. These results demonstrate the efficiency and scalability of the parallel decoding approach in Fast-dLLM v2.

Throughput comparison

Benchmark Results

We present a comprehensive evaluation of Fast-dLLM v2 against several baseline models across a range of code, math, and instruction-following benchmarks. Among 1B-scale models, Fast-dLLM v2 consistently outperforms the alternatives, achieving the highest average score of 43.5 and demonstrating strong capabilities on coding tasks (HumanEval, MBPP), math (GSM8K, MATH), and other benchmarks (IFEval, MMLU, GPQA). Notably, Fast-dLLM v2 remains competitive with much larger 7B+ models while using significantly fewer parameters, further validating the efficiency and effectiveness of its block diffusion-based decoding approach.
Benchmark results