DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

NVIDIA
1 Affiliated with HKUST; work done during Shih-Yang's internship at NVIDIA. 2 Project lead.

[Interactive demo: model response examples and time needed to complete a benchmark, comparing Baseline vs. DLER]

What is the project about?

  • This project, Doing Length pEnalty Right (DLER), cuts Chain-of-Thought (CoT) length by ~70% through RL fine-tuning on DeepSeek-R1 and Llama-Nemotron models without loss of accuracy – delivering more intelligence per token.
  • We find that it is not the sophisticated design of the length penalty that determines performance, but rather the choice of RL optimization algorithm: even the simplest length truncation can achieve state-of-the-art accuracy-to-token efficiency when combined with our DLER recipe.
Figure 1: DLER substantially shortens the Chain-of-Thought (CoT) length in reasoning models trained with SFT, RL, or a combination of both methods.

[Figure panels: (a) SFT-trained model, (b) RL-trained model]
Figure 2: (a) DLER achieves state-of-the-art accuracy/length trade-offs, shortening CoT by up to 70% without losing accuracy. (b) On AIME-24, DLER-R1 models enable better test-time scaling.

What goes wrong when applying a length penalty?

Re-examining the Simplest Length Penalty - Truncation

Stricter Length Penalty Leads to Higher Reward Variance

More aggressive truncation introduces greater training instability: it increases group-wise advantage variance, which in turn leads to more biased advantage estimates. To mitigate this, we swap out group-wise normalization for batch-wise normalization.
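
As a minimal sketch of this change (PyTorch-style pseudocode, not the released DLER training code; tensor shapes and the 16-rollout grouping are illustrative assumptions), the only difference is where the normalization statistics come from:

import torch

def groupwise_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [num_prompts, rollouts_per_prompt], e.g. 16 rollouts per prompt.
    # Group-wise (GRPO-style) normalization: with aggressive truncation, many
    # groups are almost entirely zero-reward, so the per-group std is tiny and
    # the resulting advantages become high-variance and biased.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def batchwise_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Batch-wise normalization: statistics are pooled over every rollout in the
    # batch, which keeps advantage estimates stable under a strict length penalty.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([[1.0] + [0.0] * 15,       # mostly-truncated group
                        [1.0] * 8 + [0.0] * 8])   # balanced group
adv = batchwise_advantages(rewards)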

Entropy Collapse Limits Exploration of Reasoning Paths

Clipping the updates of low-probability, high-entropy tokens, which are essential for exploring diverse reasoning paths, can cause an entropy collapse that limits exploration. Increasing the upper clipping threshold preserves these tokens during gradient updates, thereby alleviating entropy collapse.
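
As a sketch of this fix (a PPO/GRPO-style token-level surrogate; the 0.2 / 0.28 thresholds are illustrative values, not necessarily the paper's), raising only the upper bound of the clipping range keeps the gradient for low-probability tokens that a symmetric clip would discard:

import torch

def clipped_surrogate_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           eps_low: float = 0.2,
                           eps_high: float = 0.28) -> torch.Tensor:
    # Token-level importance ratio between the current policy and the rollout policy.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # A larger upper threshold (eps_high > eps_low) lets low-probability,
    # high-entropy tokens receive bigger positive updates instead of being
    # clipped away, which helps prevent entropy collapse.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.min(unclipped, clipped).mean()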

Figure 3: (Left) Word clouds of the most frequent tokens clipped by the upper threshold before it is raised, showing that many are transitional words important for reasoning; clipping them limits exploration during RL training. (Right) Average probability and entropy of tokens that are not clipped, clipped by the higher threshold, or clipped by the lower threshold during RL training. Clipped tokens have much lower probabilities than unclipped ones, and those clipped by the higher threshold consistently show higher entropy, supporting that these are often high-entropy transitional tokens that trigger reasoning steps.

Length Penalty Over-sparsifies the Training Signal

Applying a length penalty makes early training too difficult and leaves later stages dominated by easy samples. We adopt Dynamic Sampling, which filters out overly easy or hard prompts and extreme response lengths, yielding balanced training signals and better length control.
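
The filtering step can be sketched as follows (names, thresholds, and the dictionary layout are illustrative assumptions, not the official implementation):

def keep_prompt(rollout_rewards, rollout_lengths, max_len):
    # Drop prompt groups that carry no useful gradient signal under the
    # truncation penalty: all-zero reward (too hard, or every rollout truncated),
    # all-one reward (too easy), or all rollouts hitting the length cap.
    if all(r == 0.0 for r in rollout_rewards):
        return False
    if all(r == 1.0 for r in rollout_rewards):
        return False
    if all(n >= max_len for n in rollout_lengths):
        return False
    return True

# Hypothetical batch of prompt groups, 16 rollouts each.
batch = [
    {"rewards": [0.0] * 16,             "lengths": [4096] * 16},  # filtered out
    {"rewards": [1.0] * 16,             "lengths": [700] * 16},   # filtered out
    {"rewards": [1.0] * 6 + [0.0] * 10, "lengths": [900] * 16},   # kept
]
kept = [g for g in batch if keep_prompt(g["rewards"], g["lengths"], max_len=4096)]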

Figure 4: (Left) Ratio of training prompts whose 16 rollouts all receive zero reward, including those caused by exceeding the truncation length. Around half of the prompts fall into this category early in training, weakening the training signal and biasing the model toward easier prompts that it already knows how to solve within the target length. (Right) The ratio of training prompts whose 16 rollouts all receive a reward of one steadily increases, while their average response length declines and remains markedly shorter than that of prompts whose rollouts all receive a reward of zero.

Combining All Ingredients: Doing Length pEnalty Right (DLER)

We unify batch-wise reward normalization, a higher policy update clipping threshold, dynamic sampling to remove instances lacking balanced training signals, and a simple length truncation penalty into a comprehensive training recipe, which we term DLER (Doing Length pEnalty Right).
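
The first three ingredients are sketched in the code snippets above; the remaining one, the simple truncation penalty, admits a one-function reading (the binary correctness reward and the 4096-token budget are illustrative assumptions):

def truncation_reward(is_correct: bool, response_len: int, budget: int = 4096) -> float:
    # Simplest length penalty: a response that exceeds the token budget is cut
    # off and scored as zero; otherwise the usual correctness reward applies.
    if response_len > budget:
        return 0.0
    return 1.0 if is_correct else 0.0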

Training code will be released soon!

State-of-the-Art Accuracy/Response-Length Trade-offs

Table 1: Comparison of DLER models and baseline models in terms of Pass@1 accuracy and corresponding average output length (tokens) across benchmarks.

Different Length Penalties No Longer Push the Accuracy–Efficiency Frontier

We show that with DLER, the effect of adopting different length-penalty rewards fundamentally changes: no penalty yields strictly shorter responses with strictly higher accuracy. Instead, a trade-off always exists, and different penalties move performance along the same accuracy-length frontier rather than beyond it.

[Figure 5 panels: (a) MATH, (b) AIME-24]
Figure 5: Accuracy and average response length of DeepSeek-R1-7B trained using DLER with different length penalties on MATH and AIME-24. DLER establishes a new accuracy–length efficiency frontier, with varying length penalties moving performance along the frontier rather than beyond it.

BibTeX


@misc{liu2025dlerdoinglengthpenalty,
  title={DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning},
  author={Shih-Yang Liu and Xin Dong and Ximing Lu and Shizhe Diao and Mingjie Liu and Min-Hung Chen and Hongxu Yin and Yu-Chiang Frank Wang and Kwang-Ting Cheng and Yejin Choi and Jan Kautz and Pavlo Molchanov},
  year={2025},
  eprint={2510.15110},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.15110},
}