Chengyue Wu1,2*, Hao Zhang2*, Shuchen Xue4, Zhijian Liu2, Shizhe Diao2, Ligeng Zhu2, Ping Luo1, Song Han2,3, Enze Xie2
1HKU, 2NVIDIA, 3MIT, 4Independent Researcher
*Equal contribution
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-source Diffusion LLMs often lags behind autoregressive models due to the lack of a Key-Value (KV) Cache and to quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored to bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to a 27.6x throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
This case study compares the generation process with and without parallel decoding. The left panel shows standard decoding, where tokens are revealed one by one at each step. The right panel demonstrates our confidence-aware parallel decoding, which allows multiple confident tokens to be unmasked in parallel at each step. As shown, parallel decoding significantly accelerates the generation process while maintaining sequence quality. The color legend indicates the generation step for each token.
• Key-Value Cache for Block-Wise Decoding: We propose an efficient KV Cache mechanism for block-wise decoding in Masked Diffusion Models (MDMs). By reusing attention Key-Value activations across multiple steps within each block, our approach avoids redundant computation and significantly accelerates inference. Furthermore, our DualCache extension also caches the masked suffix tokens, enabling even greater speedup with negligible accuracy loss (a minimal caching-loop sketch follows this list).
• Confidence-Aware Parallel Decoding: Instead of decoding tokens sequentially, we introduce a confidence-aware parallel decoding scheme. At each step, only tokens whose confidence exceeds a threshold are unmasked in parallel, while uncertain ones remain masked for future steps. This selective approach effectively balances decoding efficiency and output quality, and is theoretically supported by our parallel decoding theorem for high-confidence predictions (see the thresholding sketch after this list).
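To make the block-wise caching idea concrete, the sketch below walks through the decoding loop under stated assumptions: `model` is a hypothetical masked-diffusion transformer with a HuggingFace-style interface returning `.logits` and `.past_key_values`, batch size is 1, and names such as `mask_id`, `gen_len`, and `block_size` are illustrative placeholders rather than the released implementation.

```python
# Minimal sketch of block-wise decoding with an approximate KV Cache.
# Assumptions (not the authors' released API): `model(ids, use_cache=True)`
# returns `.logits` and HF-style `.past_key_values` as tuples of (K, V)
# tensors shaped [batch, heads, seq, head_dim]; `mask_id` is the model's
# [MASK] token id (value illustrative); batch size is 1.
import torch


@torch.no_grad()
def blockwise_decode(model, prompt_ids, gen_len=256, block_size=32, mask_id=126336):
    device = prompt_ids.device
    masks = torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)
    x = torch.cat([prompt_ids, masks], dim=1)  # prompt followed by an all-mask answer

    for start in range(prompt_ids.shape[1], x.shape[1], block_size):
        end = start + block_size
        # One full bidirectional pass refreshes the cache for positions outside
        # the current block; that cache is then reused, unchanged, for every
        # denoising step inside the block (hence "approximate").
        full = model(x, use_cache=True)
        prefix_kv = [(k[:, :, :start], v[:, :, :start])
                     for k, v in full.past_key_values]

        for _ in range(block_size):            # one token revealed per step here
            block = x[:, start:end]            # view into x: writes propagate
            still_masked = block.eq(mask_id)
            if not still_masked.any():
                break                          # block fully decoded
            # Only the current block is re-encoded; earlier tokens participate
            # through the cached keys/values computed above.
            logits = model(block, past_key_values=prefix_kv).logits
            conf, pred = torch.softmax(logits.float(), dim=-1).max(dim=-1)
            conf = conf.masked_fill(~still_masked, -1.0)
            pick = conf.argmax(dim=-1)         # most confident masked slot
            block[0, pick] = pred[0, pick]     # reveal one token this step
    return x
```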
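The confidence-aware rule itself reduces to a small thresholding step applied to the model's per-position predictions. The following is a hedged illustration rather than the released code; `mask_id` and `threshold` are placeholder values, and the "commit at least one token" rule is included so each step always makes progress.

```python
# Sketch of confidence-aware parallel decoding for one denoising step.
# `logits`: [batch, seq, vocab] model outputs; `tokens`: [batch, seq] current
# sequence with `mask_id` at still-undecoded positions. Values are illustrative.
import torch


def confidence_parallel_unmask(logits, tokens, mask_id=126336, threshold=0.9):
    probs = torch.softmax(logits.float(), dim=-1)
    conf, pred = probs.max(dim=-1)              # top-1 probability and token id
    masked = tokens.eq(mask_id)
    conf = conf.masked_fill(~masked, -1.0)      # ignore already-decoded slots

    # Accept every masked position whose confidence clears the threshold ...
    accept = masked & (conf > threshold)
    # ... and always commit the single most confident token so the step
    # cannot stall when nothing clears the threshold.
    rows = torch.arange(tokens.shape[0], device=tokens.device)
    accept[rows, conf.argmax(dim=-1)] = True
    accept &= masked

    out = tokens.clone()
    out[accept] = pred[accept]                  # unmask all accepted tokens
    return out
```

In the block-wise loop sketched above, this function would replace the single-token reveal inside each step, allowing several tokens to be committed per forward pass; this is what makes the two techniques compound.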
Overall, introducing the KV Cache mechanism yields significant speed improvements across all tasks and sequence lengths, typically achieving a 2x to 3.6x speedup over the vanilla backbone. Applying the parallel decoding strategy on its own provides further acceleration, often reaching 4x-6x speedups in the evaluated settings, particularly as the generation length increases. Combining both techniques makes the improvements even more pronounced. On LLaDA, for example, the combined KV Cache and parallel decoding methods boost throughput by up to 11x (GSM8K, length 512) and 9.2x (MBPP, length 512) over the standard baseline. Similarly, on Dream-Base, the largest throughput gains are observed on MBPP (7.8x at length 512) and GSM8K (5.6x at length 512). These results indicate that our methods are not only effective individually but also highly complementary, with their benefits compounding when applied together.