GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

NVIDIA
1 Affiliated with HKUST; work done during Shih-Yang's internship at NVIDIA. 2 Project lead.

Why Choose GDPO Over GRPO for Multi-Reward RL Training?

  • We present Group Reward–Decoupled Normalization Policy Optimization (GDPO), a drop-in replacement for GRPO that consistently improves per-reward convergence in multi-reward RL training across tasks.
Figure 1: Reward curves of training Qwen2.5-1.5B-Instruct using GDPO and GRPO on the tool-calling task. GDPO consistently converges to higher correctness and format rewards.

Figure 2: Reward score trends during training of DeepSeek-R1-1.5B to reduce response length using GDPO and GRPO; GDPO converges to higher correctness and length rewards.

Figure 3: Reward curves of training Qwen2.5-1.5B-Instruct using GDPO and GRPO on the math reasoning task to achieve the "aha" moment. GDPO consistently converges to higher correctness, format, and integer rewards.

What's wrong with using GRPO for multi-reward RL training?

Figure 4: Comparison of GRPO and GDPO advantage computation in a two-binary-reward, two-rollout example. GRPO maps different reward combinations into only two distinct advantage groups, whereas GDPO normalizes each reward independently and retains three distinct groups of advantage values. We skip GDPO's batch-wise normalization step here for simplicity, since it does not change the number of distinct advantage groups.
  • We find that the common practice of applying GRPO to multi-reward RL optimization leads to a previously overlooked issue: GRPO inherently collapses reward signals, resulting in information loss in the advantage estimates.
  • Let's start with a simple training setting and then extend it to more general cases. Consider a scenario where we generate two rollouts per question to compute the group-relative advantage, and the task involves two binary rewards 𝑟1, 𝑟2 ∈ {0, 1}. Consequently, the total reward of each rollout can take values in {0, 1, 2}.
  • As shown in the figure above, directly applying GRPO for advantage estimation collapses the distinct reward combinations (0, 1), (0, 2), and (1, 2), where each pair lists the total rewards of the two rollouts, into identical normalized advantages of (−0.7071, 0.7071).
  • Our intuition: this characteristic of GRPO's advantage calculation in multi-reward optimization over-compresses the rich group-wise reward signal, reducing the resolution of the training signal and leading to suboptimal convergence.
  • For example, the reward combination (0, 2) should produce a stronger learning signal than (0, 1), because a total reward of 2 indicates that a rollout satisfies both 𝑟1 and 𝑟2, whereas a total reward of 1 corresponds to satisfying only one of them. Thus, when the other rollout receives zero reward, (0, 2) should yield a larger relative advantage than (0, 1). This limitation can also introduce a risk of training instability due to inaccurate advantage estimates. The short sketch below reproduces this collapse numerically.
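
A minimal PyTorch sketch of GRPO-style group normalization applied to the summed rewards. This is our own illustration rather than code from the GDPO repository; grpo_advantages is a hypothetical helper, and the small epsilon added to the standard deviation in practice is omitted for clarity.

        import torch

        def grpo_advantages(total_rewards: torch.Tensor) -> torch.Tensor:
            # GRPO-style group-wise normalization of the *summed* reward
            # (the small epsilon added to the std in practice is omitted here).
            return (total_rewards - total_rewards.mean()) / total_rewards.std()

        # Two rollouts per question; each entry is one rollout's total reward r1 + r2.
        for group in ([0.0, 1.0], [0.0, 2.0], [1.0, 2.0]):
            adv = grpo_advantages(torch.tensor(group))
            print(group, [round(a, 4) for a in adv.tolist()])
        # All three groups print [-0.7071, 0.7071]: after GRPO normalization,
        # (0, 2) is indistinguishable from (0, 1) or (1, 2).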


GDPO comes to the rescue!

Decoupled group-wise normalization of each reward before aggregation

  • To address the fundamental limitation of GRPO, we propose Group reward-Decoupled normalization Policy Optimization (GDPO), which performs group-wise normalization on each reward independently prior to aggregation. This contrasts with GRPO, which applies group-wise normalization directly to the summed rewards.
  • Figure: Overview of GDPO.

    More Fine-Grained Advantage Estimates

  • By normalizing each reward separately, GDPO alleviates the training-signal collapse present in GRPO's advantage estimation in Figure 4. For example, after GDPO normalization the reward combination (0, 1) becomes (−0.7071, 0.7071) and (0, 2) becomes (−1.4142, 1.4142), which more appropriately reflects that (0, 2) should yield a stronger learning signal than (0, 1).
  • Figure 5: Comparison of the number of distinct advantage groups produced by GRPO and GDPO. As the number of rollouts (left) or rewards (right) grows, GDPO consistently preserves a substantially larger number of distinct advantage groups than GRPO, resulting in advantage estimates that provide more expressive training signals.
  • When generalized to a larger number of rollouts while keeping the number of rewards fixed at two, GDPO consistently produces far more distinct advantage groups than GRPO, with the gap widening as the number of rollouts increases. A similar trend holds when the number of rollouts is fixed at four and the number of rewards gradually increases; the counting sketch below illustrates this trend.
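
    The brute-force counting sketch below is our own illustration, not code from the GDPO repository: the helpers grpo_adv, gdpo_adv, and count_distinct_groups are hypothetical names, reward weights are assumed to be uniform, and the final batch-wise normalization is skipped because it does not change distinctness. It enumerates every assignment of binary rewards within a group and counts the distinct advantage vectors each method can produce. Exact counts depend on rounding and tie conventions, so treat the output as illustrating the trend in Figure 5 rather than reproducing its numbers.

        import itertools
        import torch

        def _normalize(x: torch.Tensor, dim: int) -> torch.Tensor:
            # Group-wise whitening with a zero-variance guard (kept epsilon-free so
            # that theoretically identical groups stay identical after rounding).
            centered = x - x.mean(dim=dim, keepdim=True)
            return centered / x.std(dim=dim, keepdim=True).clamp_min(1e-12)

        def grpo_adv(rewards: torch.Tensor) -> torch.Tensor:
            # rewards: (num_rollouts, num_rewards); GRPO normalizes the summed reward.
            return _normalize(rewards.sum(dim=1), dim=0)

        def gdpo_adv(rewards: torch.Tensor) -> torch.Tensor:
            # GDPO normalizes each reward column independently, then aggregates.
            return _normalize(rewards, dim=0).sum(dim=1)

        def count_distinct_groups(num_rollouts: int, num_rewards: int, adv_fn) -> int:
            # Enumerate all binary reward assignments for one group and count how
            # many distinct advantage vectors the estimator produces.
            seen = set()
            for flat in itertools.product([0.0, 1.0], repeat=num_rollouts * num_rewards):
                rewards = torch.tensor(flat).view(num_rollouts, num_rewards)
                seen.add(tuple(round(a, 4) for a in adv_fn(rewards).tolist()))
            return len(seen)

        for g in (2, 4, 8):  # rollouts per group, with two binary rewards
            print(f"G={g}: GRPO {count_distinct_groups(g, 2, grpo_adv)} "
                  f"vs GDPO {count_distinct_groups(g, 2, gdpo_adv)}")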
  • Switching from GRPO to GDPO is straightforward!

    GDPO can serve as a drop-in replacement for GRPO within verl and TRL, requiring only minor code changes. See NVlabs/GDPO for GDPO implementations based on verl, TRL, and nemo-RL, along with training code to reproduce the reported results.


    TRL Implementation Modification to support GDPO

    Original GRPO implementation based on TRL
    
                # line 1254 in NVlabs/GDPO/trl-GDPO/trl-0.18.0-gdpo/trl/trainer/grpo_trainer.py
                # Gather the reward per function: this part is crucial, because the rewards are normalized per group and the
                # completions may be distributed across processes
                rewards_per_func = gather(rewards_per_func)
                rewards = (rewards_per_func * self.reward_weights.to(device).unsqueeze(0)).nansum(dim=1)
    
                # Compute grouped-wise rewards
                mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
                std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)
                is_std_zero = torch.isclose(std_grouped_rewards, torch.zeros_like(std_grouped_rewards))
    
                # Normalize the rewards to compute the advantages
                mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
                std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
                advantages = rewards - mean_grouped_rewards
                if self.scale_rewards:
                    advantages = advantages / (std_grouped_rewards + 1e-4)
              
    GDPO implementation based on TRL
    
                # line 1222 in NVlabs/GDPO/trl-GDPO/trl-0.18.0-gdpo/trl/trainer/grpo_trainer.py
                # Gather the reward per function: this part is crucial, because the rewards are normalized per group and the
                # completions may be distributed across processes
                rewards_per_func = gather(rewards_per_func)
                ## Make sure every reward contains no NaN values
                rewards_per_func_filter = torch.nan_to_num(rewards_per_func)
    
                all_reward_advantage = []
                ## Calculate the mean and std of each reward group-wise separately
                for i in range(len(self.reward_weights)):
                    reward_i = rewards_per_func_filter[:,i]
                    each_reward_mean_grouped = reward_i.view(-1, self.num_generations).mean(dim=1)
                    each_reward_std_grouped = reward_i.view(-1, self.num_generations).std(dim=1)
    
                    each_reward_mean_grouped = each_reward_mean_grouped.repeat_interleave(self.num_generations, dim=0)
                    each_reward_std_grouped = each_reward_std_grouped.repeat_interleave(self.num_generations, dim=0)
                    each_reward_advantage = reward_i - each_reward_mean_grouped
                    each_reward_advantage = each_reward_advantage / (each_reward_std_grouped + 1e-4)
                    all_reward_advantage.append(each_reward_advantage)
    
                combined_reward_advantage = torch.stack(all_reward_advantage, dim=1)
                pre_bn_advantages = (combined_reward_advantage * self.reward_weights.to(device).unsqueeze(0)).nansum(dim=1)
    
                ## compute batch-wise mean and std
                bn_advantages_mean = pre_bn_advantages.mean()
                bn_advantages_std = pre_bn_advantages.std()
    
                advantages = (pre_bn_advantages - bn_advantages_mean) / (bn_advantages_std + 1e-4)
              
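    To sanity-check the snippet above outside TRL, the following self-contained toy driver (our own sketch, not part of NVlabs/GDPO; it assumes uniform reward weights and a single group, and skips the batch-wise normalization step, which only rescales advantages across the batch) replays the per-reward normalization on the (0, 2) example from Figure 4.

                import torch

                # One question, two rollouts (num_generations = 2), two binary rewards.
                # Rollout A satisfies neither reward; rollout B satisfies both,
                # i.e. the (0, 2) case from Figure 4.
                rewards_per_func = torch.tensor([[0.0, 0.0],
                                                 [1.0, 1.0]])
                num_generations = 2
                reward_weights = torch.tensor([1.0, 1.0])  # assumed uniform weights

                all_reward_advantage = []
                for i in range(len(reward_weights)):
                    reward_i = rewards_per_func[:, i]
                    # Group-wise mean/std of this reward, broadcast back to each rollout.
                    mean_g = reward_i.view(-1, num_generations).mean(dim=1)
                    std_g = reward_i.view(-1, num_generations).std(dim=1)
                    mean_g = mean_g.repeat_interleave(num_generations)
                    std_g = std_g.repeat_interleave(num_generations)
                    all_reward_advantage.append((reward_i - mean_g) / (std_g + 1e-4))

                combined = torch.stack(all_reward_advantage, dim=1)
                pre_bn_advantages = (combined * reward_weights.unsqueeze(0)).sum(dim=1)
                print(pre_bn_advantages)  # ~(-1.4142, 1.4142), matching the (0, 2) example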

    BibTeX

    
               @misc{liu2026gdpogrouprewarddecouplednormalization,
                    title={GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization}, 
                    author={Shih-Yang Liu and Xin Dong and Ximing Lu and Shizhe Diao and Peter Belcak and Mingjie Liu and Min-Hung Chen and Hongxu Yin and Yu-Chiang Frank Wang and Kwang-Ting Cheng and Yejin Choi and Jan Kautz and Pavlo Molchanov},
                    year={2026},
                    eprint={2601.05242},
                    archivePrefix={arXiv},
                    primaryClass={cs.CL},
                    url={https://arxiv.org/abs/2601.05242}, 
              }