
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

Kewei Zhang¹*, Jin Wang³,²*, Sensen Gao², Chengyue Wu³,², Yulong Cao²,
Songyang Han², Boris Ivanovic², Langechuan Liu², Marco Pavone², Song Han⁴,²,
Daquan Zhou¹†, Enze Xie²†

¹Peking University ²NVIDIA ³The University of Hong Kong ⁴MIT
*Equal contribution †Co-lead


Abstract

End-to-end autonomous driving with Vision-Language-Action (VLA) models demands a delicate balance between high-fidelity trajectory planning and efficient inference. Existing paradigms fall short in different ways: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality.

We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe, Section-Aware Structured Diffusion (SASD), that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking N stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we suppress prediction variance at a fraction of the cost of N independent runs.

Empirical results show that Fast-dDrive advances the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves state-of-the-art ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs. On nuScenes, it reduces average L2 error to 0.32 m (a 22% improvement over prior diffusion baselines). When integrated with SGLang, our framework delivers an 11.8× throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.

Method Overview

Fast-dDrive combines four innovations that together exploit the structured nature of driving VLA outputs:

• Section Diffusion (SD): Replace fixed-size block boundaries with semantically aligned sections (critical_objects, future_meta_behavior, explanation, trajectory). JSON structural tokens are pre-filled as a frozen scaffold; only value tokens are denoised.

• SASD (Section-Aware Structured Diffusion): A per-section importance-weighted loss (trajectory ×3, future_meta_behavior (FMB) ×2, critical_objects ×1.5, explanation ×1) plus section-adaptive Beta noise schedules. This is a pure training-time technique with zero inference overhead.

• Scaffold Speculative Decoding: For each block, a block-bidirectional masked-diffusion (MDM) pass drafts tokens and a causal AR pass verifies them. Scaffold positions are auto-accepted. This matches Deep Scaffold quality while running 64% faster, making it the fastest single-run inference mode.

• Test-Time Inference Scaling: Run Scaffold Speculative Decoding once for the shared chain-of-thought (CoT) prefix, then fork N stochastic trajectory rollouts from the same KV cache and average them. This reduces variance at low marginal cost.
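As a rough illustration of the first two items, the sketch below lays out a frozen scaffold for the four sections and computes a section-weighted loss over value tokens only. It is a minimal sketch under stated assumptions, not the authors' implementation: the whitespace-free JSON layout, the one-string-per-token simplification, the helper names `build_scaffold` and `weighted_loss`, the example values, and the normalization by the sum of weights are all hypothetical. Only the section names and the loss weights come from the list above.

```python
import json

# Section names and loss weights as listed above. Treating the dict order as
# the generation order is an assumption of this sketch.
SECTION_WEIGHTS = {
    "critical_objects": 1.5,
    "future_meta_behavior": 2.0,
    "explanation": 1.0,
    "trajectory": 3.0,
}


def build_scaffold(values):
    """Lay out one JSON object as a token list.

    Returns (tokens, owners). owners[i] is the section name for a value token
    (the part that would be denoised) and None for a frozen structural token.
    """
    tokens, owners = ["{"], [None]
    for i, name in enumerate(SECTION_WEIGHTS):
        if i > 0:
            tokens.append(",")
            owners.append(None)
        tokens += [json.dumps(name), ":"]
        owners += [None, None]
        for tok in values[name]:
            tokens.append(tok)
            owners.append(name)
    tokens.append("}")
    owners.append(None)
    return tokens, owners


def weighted_loss(per_token_loss, owners):
    """Weighted mean of per-token losses; scaffold tokens are skipped."""
    num = den = 0.0
    for loss, owner in zip(per_token_loss, owners):
        if owner is None:
            continue  # frozen scaffold token: never denoised, never penalized
        weight = SECTION_WEIGHTS[owner]
        num += weight * loss
        den += weight
    return num / den


# Hypothetical example values, one pre-tokenized string per section.
example = {
    "critical_objects": ["[]"],
    "future_meta_behavior": ['"keep_lane"'],
    "explanation": ['"road is clear"'],
    "trajectory": ["[[1.0,0.0],[2.0,0.1]]"],
}
tokens, owners = build_scaffold(example)
decoded = json.loads("".join(tokens))  # the scaffold plus values is valid JSON
```

Because the structural tokens are fixed in advance, any filling of the value slots that is itself valid JSON yields a parseable object, and the loss weights only ever apply to the slots the model actually has to predict.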

Why Block Diffusion for Driving?

Existing VLAs fall into two camps: autoregressive models (Poutine, AutoVLA) have strong reasoning but are sequential and memory-bandwidth-bound at batch-size-1; full-sequence diffusion models (dVLM-AD) gain bidirectional context but suffer from slow iterative denoising and no KV-cache reuse. Fast-dDrive picks the block diffusion middle path: bidirectional refinement within a semantic section, strict causal ordering across sections, plus the ability to do speculative verification with the AR head it inherits from its Qwen2.5-VL backbone.
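The visibility pattern described above (bidirectional within a section, causal across sections) can be written as an attention mask. The following is a minimal sketch, assuming each position carries a non-decreasing integer section id; the function name `block_causal_mask` and the toy section layout are hypothetical, and the mask actually used in training may differ.

```python
def block_causal_mask(section_ids):
    """mask[q][k] is True when query position q may attend to key position k.

    A position sees every position in its own section (bidirectional) and in
    all earlier sections (causal), but nothing in later sections.
    """
    n = len(section_ids)
    return [[section_ids[k] <= section_ids[q] for k in range(n)] for q in range(n)]


# Toy layout: three sections of length 2, 3 and 1.
section_ids = [0, 0, 1, 1, 1, 2]
mask = block_causal_mask(section_ids)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
# xx....
# xx....
# xxxxx.
# xxxxx.
# xxxxx.
# xxxxxx
```

Since earlier sections never attend to later ones, their keys and values do not change once a section is finished, which is what allows KV-cache reuse across sections.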

WOD-E2E Test Set Results

Fast-dDrive achieves the best ADE@3s and ADE@5s among all reported methods on the Waymo Open End-to-End Driving (WOD-E2E) test set while running at 210.4 TPS on a single H100 GPU, over 4× the throughput of the strongest AR baselines (Poutine-Base and AutoVLA, both at 51.2 TPS). Adding inference scaling (N=4 shared-prefix multi-rollout) further improves both ADE metrics with only a 1.8× cost factor (TPS falls from 210.4 to 114.7).

Method	Paradigm	RFS ↑	ADE@5s ↓	ADE@3s ↓	TPS ↑	Tok/Step ↑
Autoregressive VLAs
OpenEMMA*	AR	5.158	12.476	6.684	—	1
LightEMMA*	AR	6.517	3.740	1.705	—	1
NaiveEMMA	AR	7.528	3.018	1.320	—	1
AutoVLA	AR	7.557	2.958	1.351	51.2	1
Poutine-Base	AR	7.909	2.940	1.270	51.2	1
Diffusion VLAs
dVLM-AD	Diffusion	7.633	3.022	1.285	35.2	2.82
Fast-dDrive (Scaffold Spec)	Block Diff.	7.823	2.907	1.254	210.4	4.90
+ Inference scaling (N=4)	Block Diff.	7.827	2.821	1.240	114.7	2.76

*: zero-shot. TPS measured on a single H100.

Inference Efficiency

Fast-dDrive's scaffold speculative decoding shrinks per-sample latency from 7855 ms (AR baseline) to 1919 ms on a single H100, a 4.1× wall-clock speedup at parity or better accuracy. Plugging in SGLang's optimized kernels with FP8 quantization pushes the speedup to 11.8×, hitting 608 TPS.

Method	Decoding	Latency (ms) ↓	TPS ↑	Tok/Step ↑	ADE@5s ↓	RFS ↑
AR Baseline (Qwen2.5-VL-3B)	Autoregressive	7855	51.6	1	2.083	7.931
dVLM-AD (Full-seq MDM)	Iterative Denoise	9575 (0.8×)	35.2	2.82	3.024	7.187
Fast-dDrive (Self-Spec)	Draft+Verify	3714 (2.1×)	109.0	2.41	1.973	7.959
Fast-dDrive (Section Diffusion)	Iterative MDM	3006 (2.6×)	134.4	3.28	2.058	7.928
+ Scaffold Spec	Scaffold + D&V	1919 (4.1×)	210.4	4.90	1.982	7.934
+ SGLang serving	Scaffold + D&V	665 (11.8×)	608.5	4.93	1.995	7.914
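The draft-and-verify rows in the table above rely on an acceptance step that decides how many drafted tokens are committed per verifier pass. The sketch below shows one common greedy acceptance rule with scaffold positions auto-accepted. It is an assumption-laden illustration, not the paper's exact procedure: the function name `accept_block`, the greedy (argmax) verifier, and the toy tokens are hypothetical.

```python
def accept_block(draft, verify, is_scaffold):
    """Return the tokens committed for one block.

    draft:       tokens proposed by the bidirectional diffusion draft pass
    verify:      the causal verifier's greedy token at each position, computed
                 with the draft tokens as left context
    is_scaffold: True where the position is a frozen structural token
    """
    committed = []
    for drafted, verified, frozen in zip(draft, verify, is_scaffold):
        if frozen or drafted == verified:
            committed.append(drafted)  # scaffold token or confirmed draft
        else:
            committed.append(verified)  # take the verifier's correction
            break  # later drafts were conditioned on the rejected token
    return committed


# Toy block: positions 0 and 3 are structural, position 2 is rejected.
draft = ["{", "a", "b", ":", "c"]
verify = ["?", "a", "z", ":", "c"]
is_scaffold = [True, False, False, True, False]
committed = accept_block(draft, verify, is_scaffold)
```

Under this rule every verifier pass commits at least one token, and a fully matching block commits all of its tokens at once, so a longer accepted run corresponds to a higher Tok/Step.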

Generalization: nuScenes

Fast-dDrive transfers strongly to the nuScenes benchmark, reducing average L2 prediction error to 0.32 m and outperforming, on that average, both classical training-based policies (UniAD, VAD, BEV-Planner) and recent reasoning-based VLAs (DriveVLM, AutoVLA, dVLM-AD). This is a 22% relative improvement over dVLM-AD (0.41 m), the prior diffusion-based VLA, and a 20% improvement over DriveVLM (0.40 m), the best prior reasoning-based result.

Method	L2 @1s ↓	L2 @2s ↓	L2 @3s ↓	Avg ↓
Training-based Policy
UniAD	0.20	0.42	0.75	0.46
VAD-Base	0.17	0.34	0.60	0.37
BEV-Planner	0.16	0.32	0.57	0.35
VLMs / VLAs with Reasoning
OpenEMMA*	1.45	3.21	3.76	2.81
DriveVLM	0.18	0.34	0.68	0.40
AutoVLA	0.25	0.46	0.73	0.48
dVLM-AD	0.15	0.40	0.68	0.41
Fast-dDrive (ours)	0.12	0.33	0.50	0.32
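The averages and relative improvements can be recomputed from the per-horizon values in the table. The snippet below is a small worked check; the helper names are hypothetical and the numbers are copied from the rows above.

```python
def average(values):
    return sum(values) / len(values)


def relative_improvement(baseline, ours):
    """Fractional error reduction relative to the baseline."""
    return (baseline - ours) / baseline


# L2 error in metres at 1 s, 2 s and 3 s, copied from the table above.
l2 = {
    "DriveVLM": [0.18, 0.34, 0.68],
    "dVLM-AD": [0.15, 0.40, 0.68],
    "Fast-dDrive": [0.12, 0.33, 0.50],
}
avg = {name: round(average(values), 2) for name, values in l2.items()}

for name in ("DriveVLM", "dVLM-AD"):
    gain = relative_improvement(avg[name], avg["Fast-dDrive"])
    print(f"vs {name}: {gain:.0%}")
# vs DriveVLM: 20%
# vs dVLM-AD: 22%
```

The 22% figure therefore corresponds to the comparison against dVLM-AD (0.41 m average).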

Test-Time Inference Scaling

Single-mode multi-sampling (temperature perturbation) collapses to one answer because the AR verify head is deterministic. We instead generate N stochastic trajectory rollouts from a shared scaffold-spec prefix, applying a non-zero verify temperature only on the trajectory section, and average the waypoints. The first three sections (critical_objects, explanation, future_meta_behavior) are heavily structured by the schema, so we keep the AR verifier greedy there and only enable sampling once decoding enters the trajectory section. On a representative WOD-E2E val sample, the N=4 rollouts disagree most at the late waypoints, while their mean lies right on top of the ground truth. ADE@5s decreases monotonically with N, which is consistent with the variance-of-the-mean argument.
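The variance-of-the-mean argument is the standard one: if the N rollouts are independent with per-waypoint variance σ², their mean has variance σ²/N. The sketch below shows the averaging step and a simple token-count cost model for sharing the prefix. It is illustrative only: the helper names, the hand-picked symmetric rollouts, and the token-count cost model are assumptions, and it does not reproduce any number from the paper.

```python
import math


def average_rollouts(rollouts):
    """Waypoint-wise mean of N trajectories, each a list of (x, y) points."""
    n = len(rollouts)
    horizon = len(rollouts[0])
    return [
        (sum(r[t][0] for r in rollouts) / n, sum(r[t][1] for r in rollouts) / n)
        for t in range(horizon)
    ]


def ade(pred, gt):
    """Average displacement error: mean Euclidean distance per waypoint."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def relative_cost(n, prefix_tokens, trajectory_tokens):
    """Decoded tokens for n rollouts sharing one prefix, relative to one run.

    Hypothetical cost model: the prefix is decoded once, the trajectory n times.
    """
    return (prefix_tokens + n * trajectory_tokens) / (prefix_tokens + trajectory_tokens)


# Contrived example: two rollouts that err symmetrically around the ground
# truth, with the disagreement growing at later waypoints.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
left = [(1.0, 0.1), (2.0, 0.3), (3.0, 0.6)]
right = [(1.0, -0.1), (2.0, -0.3), (3.0, -0.6)]
mean_traj = average_rollouts([left, right])
print(ade(left, gt), ade(mean_traj, gt))
```

In this toy case the errors cancel exactly, which real rollouts will not do. The cost model makes the trade-off explicit: the longer the shared prefix is relative to the trajectory section, the closer the cost of N rollouts stays to that of a single run.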

BibTeX

@misc{zhang2026fastddriveefficientblockdiffusionvlm,
      title={Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving},
      author={Kewei Zhang and Jin Wang and Sensen Gao and Chengyue Wu and Yulong Cao and Songyang Han and Boris Ivanovic and Langechuan Liu and Marco Pavone and Song Han and Daquan Zhou and Enze Xie},
      year={2026},
      eprint={2605.23163},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.23163},
}