DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

NVIDIA
1 Affiliated with HKUST; work done during Shih-Yang's internship at NVIDIA. 2 Project lead.

[Interactive demo: model response examples and time needed to complete a benchmark, comparing Baseline vs. DLER]

What is the project about?

  • This project, Doing Length pEnalty Right (DLER), cuts Chain-of-Thought (CoT) length by ~70% through RL fine-tuning on DeepSeek-R1 and Llama-Nemotron models without loss of accuracy – delivering more intelligence per token.
  • We find that it is not the sophisticated design of the length penalty that determines performance, but rather the choice of RL optimization algorithm: even the simplest length truncation can achieve state-of-the-art accuracy-to-token efficiency when combined with our DLER recipe.
Figure 1: DLER substantially shortens the Chain-of-Thought (CoT) length in reasoning models trained with SFT, RL, or a combination of both methods.

[Figure panels: (a) SFT-trained model, (b) RL-trained model]
Figure 2: (a) DLER achieves state-of-the-art accuracy/length trade-offs, shortening CoT by up to 70% without losing accuracy. (b) On AIME-24, DLER-R1 models enable better test-time scaling.

What goes wrong when applying a length penalty?

Re-examining the Simplest Length Penalty - Truncation

Stricter Length Penalty Leads to Higher Reward Variance

More aggressive truncation introduces greater training instability: it increases group-wise advantage variance, which in turn leads to more biased advantage estimates. To mitigate this, we swap out group-wise normalization for batch-wise normalization.
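
As a minimal sketch of this change (PyTorch-style pseudocode, not the released DLER training code; tensor shapes and the 16-rollout grouping are illustrative assumptions), the only difference is where the normalization statistics come from:

import torch

def groupwise_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [num_prompts, rollouts_per_prompt], e.g. 16 rollouts per prompt.
    # Group-wise (GRPO-style) normalization: with aggressive truncation, many
    # groups are almost entirely zero-reward, so the per-group std is tiny and
    # the resulting advantages become high-variance and biased.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def batchwise_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Batch-wise normalization: statistics are pooled over every rollout in the
    # batch, which keeps advantage estimates stable under a strict length penalty.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([[1.0] + [0.0] * 15,       # mostly-truncated group
                        [1.0] * 8 + [0.0] * 8])   # balanced group
adv = batchwise_advantages(rewards)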

Entropy Collapse Limits Exploration of Reasoning Paths

Clipping the updates of low-probability, high-entropy tokens, which are essential for exploring diverse reasoning paths, can cause an entropy collapse that limits exploration. Increasing the upper clipping threshold preserves these tokens during gradient updates, thereby alleviating entropy collapse.
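
As a sketch of this fix (a PPO/GRPO-style token-level surrogate; the 0.2 / 0.28 thresholds are illustrative values, not necessarily the paper's), raising only the upper bound of the clipping range keeps the gradient for low-probability tokens that a symmetric clip would discard:

import torch

def clipped_surrogate_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           eps_low: float = 0.2,
                           eps_high: float = 0.28) -> torch.Tensor:
    # Token-level importance ratio between the current policy and the rollout policy.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # A larger upper threshold (eps_high > eps_low) lets low-probability,
    # high-entropy tokens receive bigger positive updates instead of being
    # clipped away, which helps prevent entropy collapse.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.min(unclipped, clipped).mean()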

Figure 3: (Left) Word clouds of the most frequent tokens clipped by the upper threshold before it is raised, showing that many are transitional words important for reasoning; clipping them limits exploration during RL training. (Right) Average probability and entropy of tokens that are not clipped, clipped by the higher threshold, or clipped by the lower threshold during RL training. Clipped tokens have much lower probabilities than unclipped ones, and those clipped by the higher threshold consistently show higher entropy, supporting that these are often high-entropy transitional tokens that trigger reasoning steps.

Length Penalty Over-sparsifies the Training Signal

Applying a length penalty makes early training too difficult and leaves later stages dominated by easy samples. We adopt Dynamic Sampling, which filters out overly easy or hard prompts and extreme response lengths, yielding balanced training signals and better length control.
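
The filtering step can be sketched as follows (names, thresholds, and the dictionary layout are illustrative assumptions, not the official implementation):

def keep_prompt(rollout_rewards, rollout_lengths, max_len):
    # Drop prompt groups that carry no useful gradient signal under the
    # truncation penalty: all-zero reward (too hard, or every rollout truncated),
    # all-one reward (too easy), or all rollouts hitting the length cap.
    if all(r == 0.0 for r in rollout_rewards):
        return False
    if all(r == 1.0 for r in rollout_rewards):
        return False
    if all(n >= max_len for n in rollout_lengths):
        return False
    return True

# Hypothetical batch of prompt groups, 16 rollouts each.
batch = [
    {"rewards": [0.0] * 16,             "lengths": [4096] * 16},  # filtered out
    {"rewards": [1.0] * 16,             "lengths": [700] * 16},   # filtered out
    {"rewards": [1.0] * 6 + [0.0] * 10, "lengths": [900] * 16},   # kept
]
kept = [g for g in batch if keep_prompt(g["rewards"], g["lengths"], max_len=4096)]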

Figure 4: (Left) Ratio of training prompts whose 16 rollouts all receive zero reward, including those caused by exceeding the truncation length. Around half of the prompts fall into this category early in training, weakening the training signal and biasing the model toward easier prompts that it already knows how to solve within the target length. (Right) The ratio of training prompts whose 16 rollouts all receive a reward of one steadily increases, while their average response length declines and remains markedly shorter than that of prompts whose rollouts all receive a reward of zero.

Combining All Ingredients: Doing Length pEnalty Right (DLER)

We unify batch-wise reward normalization, a higher policy update clipping threshold, dynamic sampling to remove instances lacking balanced training signals, and a simple length truncation penalty into a comprehensive training recipe, which we term DLER (Doing Length pEnalty Right).
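
The first three ingredients are sketched in the code snippets above; the remaining one, the simple truncation penalty, admits a one-function reading (the binary correctness reward and the 4096-token budget are illustrative assumptions):

def truncation_reward(is_correct: bool, response_len: int, budget: int = 4096) -> float:
    # Simplest length penalty: a response that exceeds the token budget is cut
    # off and scored as zero; otherwise the usual correctness reward applies.
    if response_len > budget:
        return 0.0
    return 1.0 if is_correct else 0.0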

Training code will be released soon!

State-of-the-Art Accuracy/Response-Length Trade-offs

Table 1: Comparison of DLER models and baseline models in terms of Pass@1 accuracy and corresponding average output length (tokens) across benchmarks.

Different Length Penalties No Longer Push the Accuracy–Efficiency Frontier

We show that with DLER, the effect of adopting different length-penalty rewards fundamentally changes: no penalty yields strictly shorter responses with strictly higher accuracy. Instead, a trade-off always exists, and different penalties move performance along the same accuracy-length frontier rather than beyond it.

[Figure 5 panels: (a) MATH, (b) AIME-24]
Figure 5: Accuracy and average response length of DeepSeek-R1-7B trained using DLER with different length penalties on MATH and AIME-24. DLER establishes a new accuracy–length efficiency frontier, with varying length penalties moving performance along the frontier rather than beyond it.

BibTeX


@misc{liu2025dlerdoinglengthpenalty,
  title={DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning},
  author={Shih-Yang Liu and Xin Dong and Ximing Lu and Shizhe Diao and Mingjie Liu and Min-Hung Chen and Hongxu Yin and Yu-Chiang Frank Wang and Kwang-Ting Cheng and Yejin Choi and Jan Kautz and Pavlo Molchanov},
  year={2025},
  eprint={2510.15110},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.15110},
}