Stricter Length Penalty Leads to Higher Reward Variance
More aggressive truncation introduces greater training instability by increasing group-wise advantage variance, which in turn leads to more biased advantage estimates, thus we propose to swap out the group-wise normalization with batch-wise normalization to mitigate this issue.
DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning