Full paper coming soon on arXiv!
We study how improving data diversity helps train a better reasoning model. We ask: (1) What is a good measure of data diversity that actually contributes to model performance? (2) How can we use this insight to generate more diverse reasoning data?
Our answer is twofold:
PrismMath-7B vs. the State of the Art (accuracy, %)
| Model | Dataset Size | Data Generator | MATH500 | AIME24 | AIME25 | AMC23 | MATH² | OlympiadBench | GSM8K Platinum | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Inst | - | - | 83.80 | 14.17 | 9.91 | 72.50 | 57.62 | 44.29 | 96.11 | 54.06 |
| OpenThinker-7B | 114k | R1-671B | 84.20 | 27.50 | 22.50 | 74.06 | 67.62 | 45.93 | 93.05 | 59.27 |
| OpenThinker2-7B | 1.14M | R1-671B | 91.40 | 50.00 | 35.00 | 88.44 | 78.10 | 69.63 | 93.96 | 72.36 |
| OpenR1-Qwen-7B | 94k | R1-671B | 90.60 | 47.91 | 30.41 | 87.19 | 78.10 | 67.06 | 96.69 | 71.14 |
| R1-Distill-Qwen-7B | Unknown | R1-671B | 92.60 | 53.33 | 33.30 | 92.50 | 78.57 | 68.00 | 89.91 | 72.60 |
| PrismMath-7B | 1.0M | R1-32B | 92.40 | 54.16 | 37.91 | 93.75 | 80.95 | 68.30 | 95.95 | 74.78 |
The key behind this improvement is data diversity: more specifically, scaling synthetic data while ensuring that the compiled samples are meaningfully "different" from each other.
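To make "meaningfully different" concrete, one common way to operationalize it is to embed each sample and greedily reject near-duplicates above a cosine-similarity threshold. The sketch below is our own illustration of that general idea, not necessarily the PrismMath pipeline; the function name and threshold are assumptions:

```python
import numpy as np

def greedy_diversity_filter(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedily keep indices whose cosine similarity to every
    previously kept sample stays below `threshold`."""
    # Normalize rows so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        # Keep a sample only if it is not a near-duplicate of anything kept so far.
        if not kept or float(np.max(normed[kept] @ normed[i])) < threshold:
            kept.append(i)
    return kept

# Hypothetical usage: embed each synthetic problem, then keep a mutually
# dissimilar subset. Random vectors stand in for real embeddings here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1_000, 384))
selected = greedy_diversity_filter(embeddings, threshold=0.9)
print(f"kept {len(selected)} of {len(embeddings)} samples")
```

Note that this greedy pass is quadratic in the worst case; at the million-sample scale, approximate nearest-neighbor indexing would typically replace the exhaustive comparison.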
While the importance of data diversity has long been emphasized in the literature, a good measure of diversity remains elusive. Traditional metrics often rely on intrinsically motivated notions of similarity, such as token overlap or semantic embedding distance. These heuristic measures have clear limitations, as shown below:
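As a concrete illustration of the failure mode (a toy example of our own, with made-up helper names), a token-overlap score rates two phrasings of the same problem as maximally diverse:

```python
from itertools import combinations

def ngram_jaccard(a: str, b: str, n: int = 2) -> float:
    """Token-overlap similarity: Jaccard index over word n-grams."""
    def ngrams(text: str) -> set:
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    sa, sb = ngrams(a), ngrams(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def corpus_diversity(samples: list[str]) -> float:
    """Corpus 'diversity' as 1 minus the mean pairwise n-gram overlap."""
    sims = [ngram_jaccard(a, b) for a, b in combinations(samples, 2)]
    return 1.0 - sum(sims) / len(sims)

# Failure mode: two phrasings of the *same* problem share almost no
# bigrams, so a token-overlap metric scores them as highly "diverse".
print(corpus_diversity([
    "Find the sum of the first 100 positive integers.",
    "Compute 1 + 2 + ... + 100.",
]))  # prints 1.0 despite identical underlying content
```

Semantic-embedding variants soften this particular failure but inherit the same issue in reverse: superficially similar problems that exercise genuinely different reasoning skills get scored as redundant.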