GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

NVIDIA
1 Affiliated with HKUST; work done during Shih-Yang's internship at NVIDIA. 2 Project lead.

Why Choose GDPO Over GRPO for Multi-Reward RL Training?

  • We present Group Reward–Decoupled Normalization Policy Optimization (GDPO), a drop-in replacement for GRPO that consistently improves per-reward convergence in multi-reward RL training across tasks.
Figure 1: Reward curves of training Qwen2.5-1.5B-Instruct using GDPO and GRPO on the tool-calling task. GDPO consistently converges to higher correctness and format rewards.

Figure 2: Reward score trends during training of DeepSeek-R1-1.5B to reduce response length using GDPO and GRPO; GDPO converges to higher correctness and length rewards.

Figure 3: Reward curves of training Qwen2.5-1.5B-Instruct using GDPO and GRPO on the math reasoning task to achieve the "aha" moment. GDPO consistently converges to higher correctness, format, and integer rewards.

What's wrong with using GRPO for multi-reward RL training?

Figure 4: Comparison of GRPO and GDPO advantage computation in a two-binary-reward, two-rollout example. GRPO maps different reward combinations into only two distinct advantage groups, whereas GDPO normalizes each reward independently and retains three distinct groups of advantage values. We skip GDPO's batch-wise normalization step here for simplicity, since it does not change the number of distinct advantage groups.
  • We find that the common practice of applying GRPO to multi-reward RL optimization leads to a previously overlooked issue: GRPO inherently collapses reward signals, resulting in information loss in the advantage estimates.
  • Let's start with a simple training setting and then extend it to more general cases. Consider a scenario where we generate two rollouts per question to compute the group-relative advantage, and the task involves two binary rewards 𝑟1, 𝑟2 ∈ {0, 1}. Consequently, the total reward of each rollout can take values in {0, 1, 2}.
  • As shown in the figure above, directly applying GRPO for advantage estimation collapses the distinct reward combinations (0, 1), (0, 2), and (1, 2), where each pair lists the total rewards of the two rollouts, into identical normalized advantages of (−0.7071, 0.7071).
  • Our intuition: this characteristic of GRPO's advantage calculation in multi-reward optimization over-compresses the rich group-wise reward signal, reducing the resolution of the training signal and leading to suboptimal convergence.
  • For example, the reward combination (0, 2) should produce a stronger learning signal than (0, 1), because a total reward of 2 indicates that a rollout satisfies both 𝑟1 and 𝑟2, whereas a total reward of 1 corresponds to satisfying only one of them. Thus, when the other rollout receives zero reward, (0, 2) should yield a larger relative advantage than (0, 1). This limitation can also introduce a risk of training instability due to inaccurate advantage estimates. The short sketch below reproduces this collapse numerically.
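
A minimal PyTorch sketch of GRPO-style group normalization applied to the summed rewards. This is our own illustration rather than code from the GDPO repository; grpo_advantages is a hypothetical helper, and the small epsilon added to the standard deviation in practice is omitted for clarity.

        import torch

        def grpo_advantages(total_rewards: torch.Tensor) -> torch.Tensor:
            # GRPO-style group-wise normalization of the *summed* reward
            # (the small epsilon added to the std in practice is omitted here).
            return (total_rewards - total_rewards.mean()) / total_rewards.std()

        # Two rollouts per question; each entry is one rollout's total reward r1 + r2.
        for group in ([0.0, 1.0], [0.0, 2.0], [1.0, 2.0]):
            adv = grpo_advantages(torch.tensor(group))
            print(group, [round(a, 4) for a in adv.tolist()])
        # All three groups print [-0.7071, 0.7071]: after GRPO normalization,
        # (0, 2) is indistinguishable from (0, 1) or (1, 2).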


GDPO comes to the rescue!

Decoupled group-wise normalization of each reward before aggregation

  • To address the fundamental limitation of GRPO, we propose Group reward-Decoupled normalization Policy Optimization (GDPO), which performs group-wise normalization on each reward independently prior to aggregation. This contrasts with GRPO, which applies group-wise normalization directly to the summed rewards.
  • Figure: Overview of GDPO.

    More Fine-Grained Advantage Estimates

  • By normalizing each reward separately, GDPO alleviates the training-signal collapse present in GRPO's advantage estimation in Figure 4. For example, after GDPO normalization the reward combination (0, 1) becomes (−0.7071, 0.7071) and (0, 2) becomes (−1.4142, 1.4142), which more appropriately reflects that (0, 2) should yield a stronger learning signal than (0, 1).
  • Figure 5: Comparison of the number of distinct advantage groups produced by GRPO and GDPO. As the number of rollouts (left) or rewards (right) grows, GDPO consistently preserves a substantially larger number of distinct advantage groups than GRPO, resulting in advantage estimates that provide more expressive training signals.
  • When generalized to a larger number of rollouts while keeping the number of rewards fixed at two, GDPO consistently produces far more distinct advantage groups than GRPO, with the gap widening as the number of rollouts increases. A similar trend holds when the number of rollouts is fixed at four and the number of rewards gradually increases; the counting sketch below illustrates this trend.
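
    The brute-force counting sketch below is our own illustration, not code from the GDPO repository: the helpers grpo_adv, gdpo_adv, and count_distinct_groups are hypothetical names, reward weights are assumed to be uniform, and the final batch-wise normalization is skipped because it does not change distinctness. It enumerates every assignment of binary rewards within a group and counts the distinct advantage vectors each method can produce. Exact counts depend on rounding and tie conventions, so treat the output as illustrating the trend in Figure 5 rather than reproducing its numbers.

        import itertools
        import torch

        def _normalize(x: torch.Tensor, dim: int) -> torch.Tensor:
            # Group-wise whitening with a zero-variance guard (kept epsilon-free so
            # that theoretically identical groups stay identical after rounding).
            centered = x - x.mean(dim=dim, keepdim=True)
            return centered / x.std(dim=dim, keepdim=True).clamp_min(1e-12)

        def grpo_adv(rewards: torch.Tensor) -> torch.Tensor:
            # rewards: (num_rollouts, num_rewards); GRPO normalizes the summed reward.
            return _normalize(rewards.sum(dim=1), dim=0)

        def gdpo_adv(rewards: torch.Tensor) -> torch.Tensor:
            # GDPO normalizes each reward column independently, then aggregates.
            return _normalize(rewards, dim=0).sum(dim=1)

        def count_distinct_groups(num_rollouts: int, num_rewards: int, adv_fn) -> int:
            # Enumerate all binary reward assignments for one group and count how
            # many distinct advantage vectors the estimator produces.
            seen = set()
            for flat in itertools.product([0.0, 1.0], repeat=num_rollouts * num_rewards):
                rewards = torch.tensor(flat).view(num_rollouts, num_rewards)
                seen.add(tuple(round(a, 4) for a in adv_fn(rewards).tolist()))
            return len(seen)

        for g in (2, 4, 8):  # rollouts per group, with two binary rewards
            print(f"G={g}: GRPO {count_distinct_groups(g, 2, grpo_adv)} "
                  f"vs GDPO {count_distinct_groups(g, 2, gdpo_adv)}")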
  • Switching from GRPO to GDPO is straightforward!

    GDPO can serve as a drop-in replacement for GRPO within verl and TRL, requiring only minor code changes. See NVlabs/GDPO for GDPO implementations based on verl, TRL, and nemo-RL, along with training code to reproduce the reported results.


    TRL Implementation Modification to support GDPO

    Original GRPO implementation based on TRL
    
                # line 1254 in NVlabs/GDPO/trl-GDPO/trl-0.18.0-gdpo/trl/trainer/grpo_trainer.py
                # Gather the reward per function: this part is crucial, because the rewards are normalized per group and the
                # completions may be distributed across processes
                rewards_per_func = gather(rewards_per_func)
                rewards = (rewards_per_func * self.reward_weights.to(device).unsqueeze(0)).nansum(dim=1)
    
                # Compute grouped-wise rewards
                mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
                std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)
                is_std_zero = torch.isclose(std_grouped_rewards, torch.zeros_like(std_grouped_rewards))
    
                # Normalize the rewards to compute the advantages
                mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
                std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
                advantages = rewards - mean_grouped_rewards
                if self.scale_rewards:
                    advantages = advantages / (std_grouped_rewards + 1e-4)
              
    GDPO implementation based on TRL
    
                # line 1222 in NVlabs/GDPO/trl-GDPO/trl-0.18.0-gdpo/trl/trainer/grpo_trainer.py
                # Gather the reward per function: this part is crucial, because the rewards are normalized per group and the
                # completions may be distributed across processes
                rewards_per_func = gather(rewards_per_func)
                ## Make sure every reward contains no NaN values
                rewards_per_func_filter = torch.nan_to_num(rewards_per_func)
    
                all_reward_advantage = []
                ## Calculate the mean and std of each reward group-wise separately
                for i in range(len(self.reward_weights)):
                    reward_i = rewards_per_func_filter[:,i]
                    each_reward_mean_grouped = reward_i.view(-1, self.num_generations).mean(dim=1)
                    each_reward_std_grouped = reward_i.view(-1, self.num_generations).std(dim=1)
    
                    each_reward_mean_grouped = each_reward_mean_grouped.repeat_interleave(self.num_generations, dim=0)
                    each_reward_std_grouped = each_reward_std_grouped.repeat_interleave(self.num_generations, dim=0)
                    each_reward_advantage = reward_i - each_reward_mean_grouped
                    each_reward_advantage = each_reward_advantage / (each_reward_std_grouped + 1e-4)
                    all_reward_advantage.append(each_reward_advantage)
    
                combined_reward_advantage = torch.stack(all_reward_advantage, dim=1)
                pre_bn_advantages = (combined_reward_advantage * self.reward_weights.to(device).unsqueeze(0)).nansum(dim=1)
    
                ## compute batch-wise mean and std
                bn_advantages_mean = pre_bn_advantages.mean()
                bn_advantages_std = pre_bn_advantages.std()
    
                advantages = (pre_bn_advantages - bn_advantages_mean) / (bn_advantages_std + 1e-4)
              
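    To sanity-check the snippet above outside TRL, the following self-contained toy driver (our own sketch, not part of NVlabs/GDPO; it assumes uniform reward weights and a single group, and skips the batch-wise normalization step, which only rescales advantages across the batch) replays the per-reward normalization on the (0, 2) example from Figure 4.

                import torch

                # One question, two rollouts (num_generations = 2), two binary rewards.
                # Rollout A satisfies neither reward; rollout B satisfies both,
                # i.e. the (0, 2) case from Figure 4.
                rewards_per_func = torch.tensor([[0.0, 0.0],
                                                 [1.0, 1.0]])
                num_generations = 2
                reward_weights = torch.tensor([1.0, 1.0])  # assumed uniform weights

                all_reward_advantage = []
                for i in range(len(reward_weights)):
                    reward_i = rewards_per_func[:, i]
                    # Group-wise mean/std of this reward, broadcast back to each rollout.
                    mean_g = reward_i.view(-1, num_generations).mean(dim=1)
                    std_g = reward_i.view(-1, num_generations).std(dim=1)
                    mean_g = mean_g.repeat_interleave(num_generations)
                    std_g = std_g.repeat_interleave(num_generations)
                    all_reward_advantage.append((reward_i - mean_g) / (std_g + 1e-4))

                combined = torch.stack(all_reward_advantage, dim=1)
                pre_bn_advantages = (combined * reward_weights.unsqueeze(0)).sum(dim=1)
                print(pre_bn_advantages)  # ~(-1.4142, 1.4142), matching the (0, 2) example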

    BibTeX

    
               @misc{liu2026gdpogrouprewarddecouplednormalization,
                    title={GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization}, 
                    author={Shih-Yang Liu and Xin Dong and Ximing Lu and Shizhe Diao and Peter Belcak and Mingjie Liu and Min-Hung Chen and Hongxu Yin and Yu-Chiang Frank Wang and Kwang-Ting Cheng and Yejin Choi and Jan Kautz and Pavlo Molchanov},
                    year={2026},
                    eprint={2601.05242},
                    archivePrefix={arXiv},
                    primaryClass={cs.CL},
                    url={https://arxiv.org/abs/2601.05242}, 
              }