Sol-RL: FP4 Explore, BF16 Train

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Unlocking Massive Rollout Scaling for Diffusion RL at a Fraction of the Cost

Yitong Li¹,²*, Junsong Chen¹,²*, Shuchen Xue¹*, Pengcuo Zeren¹, Siyuan Fu¹, Dinghao Yang¹, Yangyang Tang¹,
Junjie Bai¹, Ping Luo², Song Han¹,³, Enze Xie¹

¹NVIDIA ²HKU ³MIT

*Equal contribution

About

RL-based post-training is a promising paradigm for aligning diffusion models with human preferences, and scaling rollout group size yields clear gains — but at prohibitive cost on large models like FLUX.1-12B. We propose Sol-RL (Speed-of-light RL), an FP4-empowered two-stage RL framework that decouples exploration from optimization: high-throughput NVFP4 rollouts generate a massive candidate pool, from which only the most contrastive subset is regenerated in BF16 for policy updates. This algorithm-hardware co-design accelerates rollouts while preserving training integrity. Experiments across SANA, FLUX.1, and SD3.5-L show superior alignment with up to 4.64× faster convergence.

Sol-RL enables efficient and high-fidelity text-to-image alignment.

4.64× Convergence Speedup

~1% Performance Gap vs. BF16

3×4 Foundation Models & Rewards Validated

Qualitative Results

All methods side by side.

Core Design: Decoupled Two-Stage Pipeline

We decouple high-throughput FP4 exploration from selective BF16 training, achieving up to 2.4× rollout acceleration with only about 2% computational overhead while avoiding quantization-induced corruption.

Methodology

Scaling rollouts improves alignment but creates an inference bottleneck, and naive FP4 quantization compromises visual fidelity. Sol-RL introduces a decoupled two-stage architecture that uses FP4 exclusively for exploration while preserving BF16 for optimization.

Rollout Scaling: Promise and Bottleneck

Under selective training, only the best-K and worst-K samples drive gradient updates, so larger rollout groups yield better learning signals at fixed training cost. However, the vast majority of generated candidates are discarded — revealing massive redundancy in computing the full pool at high precision.
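The selection step described above can be illustrated with a minimal sketch. It uses synthetic Gaussian rewards rather than real model outputs, and the helper name `select_contrastive` and the per-side `k` are assumptions made for illustration. Because the smaller groups are prefixes of the same pool, the gap between the mean best-k and mean worst-k reward can only stay the same or widen as the group grows, while the number of samples passed to training stays fixed at 2k.

```python
import random

def select_contrastive(rewards, k):
    """Return (best, worst): indices of the k highest and k lowest rewards."""
    if 2 * k > len(rewards):
        raise ValueError("need at least 2*k samples")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[-k:], order[:k]

k = 4                                  # hypothetical per-side selection size
rng = random.Random(0)
pool = [rng.gauss(0.0, 1.0) for _ in range(96)]   # synthetic rewards

gaps = {}
for n in (24, 48, 96):                 # nested rollout groups of growing size
    rewards = pool[:n]
    best, worst = select_contrastive(rewards, k)
    mean_best = sum(rewards[i] for i in best) / k
    mean_worst = sum(rewards[i] for i in worst) / k
    gaps[n] = mean_best - mean_worst   # contrast seen by the training step
    print(f"group size {n}: best-vs-worst gap = {gaps[n]:.3f}")
```

In every case only 2k samples would reach the gradient update; the rest of the group exists solely to be ranked.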

Key Insight: Proxy Reward Ranking via FP4

In ODE-style diffusion sampling, the semantic outcome is dictated by the initial noise seed. NVFP4 preserves this structure despite pixel-level deviations, so FP4 rollouts serve as reliable proxies for reward ranking — enabling us to identify the most informative seeds cheaply, then regenerate only those in BF16.
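One way to check whether low-precision rewards are usable as ranking proxies is to compare them with full-precision rewards for the same seeds. The sketch below is a toy illustration on synthetic data, not a reproduction of the paper's measurements: the noise scale of 0.1 and the per-side selection size of 12 are assumptions. It computes the Spearman rank correlation and the overlap between the extreme seeds chosen under each reward.

```python
import random

def ranks(values):
    """Rank positions (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation for tie-free data."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def extremes(values, k):
    """Indices of the k lowest and k highest values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return set(order[:k]) | set(order[-k:])

rng = random.Random(0)
# Synthetic stand-ins: one reward per seed, plus a small perturbation
# playing the role of quantization error (assumed scale).
bf16_reward = [rng.gauss(0.0, 1.0) for _ in range(96)]
fp4_reward = [r + rng.gauss(0.0, 0.1) for r in bf16_reward]

rho = spearman(bf16_reward, fp4_reward)
picked_bf16 = extremes(bf16_reward, 12)
picked_fp4 = extremes(fp4_reward, 12)
overlap = len(picked_bf16 & picked_fp4) / len(picked_bf16)
print(f"rank correlation: {rho:.3f}, selection overlap: {overlap:.2f}")
```

A high rank correlation and a high overlap would indicate that the cheap rollouts pick nearly the same seeds as the expensive ones, which is the property the two-stage design relies on.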

Two-Stage Pipeline

Stage 1: FP4 Explore. Sample N=96 initial noise seeds and generate candidates with the NVFP4-quantized solver (up to 4× the TFLOPs of BF16) to compute proxy rewards. Rank the candidates and keep the top/bottom-K most contrastive seeds.
Stage 2: BF16 Train. Regenerate the selected K=24 seeds in BF16 with the full step budget. Optimize the policy on these high-fidelity samples, then re-quantize the weights to NVFP4 for the next iteration.
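The control flow of the two stages can be sketched with toy stand-ins. Everything below is hypothetical scaffolding rather than the released implementation: `toy_generate` replaces the diffusion sampler with a scalar derived from the seed, `toy_reward` is an identity reward, the FP4 perturbation scale is invented, and K=24 is interpreted as the total number of kept seeds (12 per side), which the text does not state explicitly.

```python
import random

N, K = 96, 24  # values from the text; K treated as the total kept (assumption)

def toy_generate(seed, precision):
    """Hypothetical sampler: maps a noise seed to a scalar 'image'."""
    rng = random.Random(seed)
    image = rng.gauss(0.0, 1.0)        # outcome driven by the seed
    if precision == "fp4":
        image += rng.gauss(0.0, 0.1)   # assumed quantization perturbation
    return image

def toy_reward(image):
    """Hypothetical reward model."""
    return image

def sol_rl_iteration(seeds):
    # Stage 1: cheap FP4 rollouts for every seed, used only for ranking.
    proxy = {s: toy_reward(toy_generate(s, "fp4")) for s in seeds}
    ranked = sorted(seeds, key=proxy.get)
    selected = ranked[: K // 2] + ranked[-(K // 2):]
    # Stage 2: regenerate only the selected seeds in BF16; these samples
    # (not the FP4 ones) would feed the policy update.
    return [(s, toy_reward(toy_generate(s, "bf16"))) for s in selected]

samples = sol_rl_iteration(list(range(N)))
print(f"{len(samples)} of {N} seeds regenerated in BF16")
# A real loop would now update the policy on `samples` and re-quantize
# the updated weights to NVFP4 before the next iteration.
```

The point of the structure is that the FP4 outputs never enter the loss: they only decide which seeds are worth regenerating at full precision.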

In summary, Sol-RL harnesses FP4 throughput for massive-scale exploration and reserves BF16 compute strictly for the K samples that dictate the policy update — introducing merely ~2% overhead.

Overall Performance

Main Results on FLUX.1

Base scores without CFG: ImageReward 0.455, CLIPScore 0.2630, PickScore 0.8096, HPSv2 0.2566. Δ is the gain over the base score (higher is better).
Method	ImageReward	Δ (↑)	CLIPScore	Δ (↑)	PickScore	Δ (↑)	HPSv2	Δ (↑)
DanceGRPO	1.4937	+1.0387	0.2898	+0.0268	0.8807	+0.0711	0.3552	+0.0986
FlowGRPO	1.5331	+1.0781	0.2884	+0.0254	0.8743	+0.0647	0.3501	+0.0935
AWM	1.6693	+1.2143	0.3039	+0.0409	0.8842	+0.0746	0.3664	+0.1098
DiffusionNFT	1.6707	+1.2157	0.2991	+0.0361	0.8852	+0.0756	0.3613	+0.1047
Sol-RL (Ours)	1.7636	+1.3086	0.3089	+0.0459	0.8932	+0.0836	0.3688	+0.1122

Learning curves across metrics and models.

Efficiency Comparison (seconds/iteration)

Base Model	Rollout Naive	Rollout Ours	Rollout Speedup	E2E Naive	E2E Ours	E2E Speedup
FLUX.1	184	79	2.33×	274	169	1.62×
SD3.5-Large	451	187	2.41×	691	427	1.61×
SANA	65	46	1.41×	95	76	1.25×
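The table also explains why the end-to-end speedup is smaller than the rollout speedup. Subtracting rollout time from end-to-end time gives the remaining per-iteration cost, which is identical for the naive and accelerated runs in every row (90 s, 240 s, and 30 s), so only the rollout portion shrinks. The short script below recomputes this from the table values; ratios derived from the rounded seconds may differ from the published speedups in the last digit.

```python
# (rollout naive, rollout ours, E2E naive, E2E ours) in seconds per iteration,
# copied from the efficiency table.
timings = {
    "FLUX.1": (184, 79, 274, 169),
    "SD3.5-Large": (451, 187, 691, 427),
    "SANA": (65, 46, 95, 76),
}

rest = {}      # non-rollout time per iteration: (naive, ours)
speedup = {}   # (rollout speedup, end-to-end speedup)
for model, (r_naive, r_ours, e_naive, e_ours) in timings.items():
    rest[model] = (e_naive - r_naive, e_ours - r_ours)
    speedup[model] = (r_naive / r_ours, e_naive / e_ours)
    print(f"{model}: rollout {speedup[model][0]:.2f}x, "
          f"E2E {speedup[model][1]:.2f}x, "
          f"non-rollout time {rest[model][0]} s vs {rest[model][1]} s")
```

Since the non-rollout time is fixed, the end-to-end gain grows with the share of each iteration spent on rollouts.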

BibTeX

@misc{li2026fp4explorebf16train,
  title={FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling},
  author={Yitong Li and Junsong Chen and Shuchen Xue and Pengcuo Zeren and Siyuan Fu and Dinghao Yang and Yangyang Tang and Junjie Bai and Ping Luo and Song Han and Enze Xie},
  year={2026},
  eprint={2604.06916},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2604.06916},
}