Chengyue Wu1,2, Hao Zhang2, Shuchen Xue2, Shizhe Diao2, Yonggan Fu2, Zhijian Liu2, Pavlo Molchanov2, Ping Luo1, Song Han2,3, Enze Xie2
1HKU, 2NVIDIA, 3MIT
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. However, their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that transforms pretrained AR models into diffusion-style decoders for parallel text generation. Our approach introduces a novel decoding recipe that combines a block diffusion mechanism with a complementary attention mask, together enabling blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further accelerate inference, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a sub-block-level cache that supports efficient parallel decoding within partially generated blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a nearly 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves a state-of-the-art trade-off between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward practical deployment of fast and accurate language models.
The model generates text autoregressively at the block level. Each block is further divided into sub-blocks, which are decoded in parallel to improve efficiency. Leveraging the block diffusion mechanism, the model naturally supports block-level caching. In addition, we incorporate a sub-block cache using the dual-cache strategy introduced in Fast-dLLM v1, enabling efficient parallel decoding within blocks.
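The loop below is a minimal PyTorch sketch of this hierarchical decoding scheme. The `model` interface (`prefill`, `encode_context`, `decode`, `append_block`), the mask-token id, the block and sub-block sizes, and the confidence threshold are all illustrative assumptions, not the released Fast-dLLM v2 API.

```python
import torch

MASK_ID = 0          # id of the special mask token (assumed)
BLOCK_SIZE = 32      # tokens generated per block (assumed)
SUB_BLOCK = 8        # tokens decoded in parallel per sub-block (assumed)

@torch.no_grad()
def generate(model, prompt_ids, num_blocks, threshold=0.9):
    seq = prompt_ids                              # (1, prompt_len) token ids
    block_cache = model.prefill(seq)              # block-level KV cache over the clean context
    for _ in range(num_blocks):
        block = torch.full((1, BLOCK_SIZE), MASK_ID, device=seq.device)
        for s in range(0, BLOCK_SIZE, SUB_BLOCK):
            sub = slice(s, s + SUB_BLOCK)
            # Sub-block (dual) cache: tokens outside the active sub-block are
            # encoded once and reused across refinement steps.
            sub_cache = model.encode_context(block, sub, block_cache)
            while (block[:, sub] == MASK_ID).any():
                logits = model.decode(block[:, sub], sub_cache)   # (1, SUB_BLOCK, vocab)
                conf, pred = logits.softmax(-1).max(-1)
                still_masked = block[:, sub] == MASK_ID
                accept = still_masked & (conf >= threshold)
                if not accept.any():              # always commit at least one token
                    best = conf.masked_fill(~still_masked, -1.0).argmax(-1, keepdim=True)
                    accept = torch.zeros_like(still_masked).scatter_(1, best, True)
                block[:, sub] = torch.where(accept, pred, block[:, sub])
        seq = torch.cat([seq, block], dim=1)
        block_cache = model.append_block(block_cache, block)      # extend block-level cache
    return seq
```

Committing only high-confidence predictions at each refinement step is what lets several positions in a sub-block be finalized in a single forward pass while keeping quality close to sequential decoding.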
To better utilize the autoregressive representations, we adopt a token shift mechanism: each masked token is predicted using the logit of its preceding token, enabling the model to retain autoregressive characteristics. Meanwhile, our block-wise causal attention mask allows the model to access all clean tokens from previous blocks as well as the noisy tokens within the current block during training. Additionally, we introduce complementary masks so that the model learns to predict under alternate masking patterns, ensuring that every token position serves as a prediction target. This design allows bidirectional context modeling within blocks while maintaining compatibility with the original AR objective, facilitating efficient and parallelizable text generation.
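The sketch below illustrates these three ingredients under simplifying assumptions; the function names, the boolean mask convention (True = may attend), and the 50% mask ratio are ours for illustration and are not taken from the released training code.

```python
import torch
import torch.nn.functional as F

def blockwise_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): full attention within a
    block, causal attention across blocks."""
    blk = torch.arange(seq_len) // block_size
    return blk[:, None] >= blk[None, :]            # (seq_len, seq_len)

def complementary_masks(seq_len: int, mask_ratio: float = 0.5):
    """A pair of masking patterns whose union covers every position, so each
    token is a prediction target in one of the two passes."""
    m1 = torch.rand(seq_len) < mask_ratio
    return m1, ~m1

def token_shift_loss(logits, clean_ids, noisy_ids, mask_id):
    """Token shift: the logit at position i-1 predicts the clean token at
    position i, with the loss taken only where position i was masked."""
    pred = logits[:, :-1]                          # (B, L-1, vocab)
    tgt = clean_ids[:, 1:].clone()                 # (B, L-1)
    tgt[noisy_ids[:, 1:] != mask_id] = -100        # ignore positions that were not masked
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), tgt.reshape(-1),
                           ignore_index=-100)
```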
We compare the inference throughput (tokens per second) of different language model variants at two batch sizes (1 and 4) on a single A100 GPU. Fast-dLLM v2 (in green) significantly outperforms all baselines, including the original Qwen2.5-1.5B-Instruct model and the previous Fast-dLLM variants built on Dream and LLaDA. At batch size 1, Fast-dLLM v2 achieves a throughput of 102.5 tokens per second, more than double that of the Qwen2.5 baseline. The advantage becomes even more pronounced at batch size 4, where Fast-dLLM v2 reaches 201.0 tokens per second, nearly 2x faster than Qwen2.5 and approximately 6x faster than Fast-dLLM-LLaDA and Fast-dLLM-Dream. These results demonstrate the efficiency and scalability of the proposed parallel decoding approach in Fast-dLLM v2.
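For context, throughput figures of this kind can be measured with a simple timing harness along the lines of the hypothetical sketch below, where `generate_fn` stands in for either the AR baseline or the Fast-dLLM v2 decoder (an assumed interface, not a released API).

```python
import time
import torch

def throughput(generate_fn, prompt_ids, max_new_tokens=256, batch_size=1,
               warmup=2, iters=5):
    """Measure generated tokens per second for a hypothetical `generate_fn`."""
    batch = prompt_ids.repeat(batch_size, 1).cuda()
    for _ in range(warmup):                        # warm up kernels and caches
        generate_fn(batch, max_new_tokens)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        generate_fn(batch, max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return batch_size * max_new_tokens * iters / elapsed
```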