Prismatic Synthesis & G-Vendi Score: How Data Diversification Could Make R1-32B a Better Teacher Model than R1-671B

1NVIDIA 2University of Washington 3Stanford University 4University of Southern California
*: Equal contribution

🚀 Full paper coming soon on arXiv!

💡 Project Summary

We study how improving data diversity helps us to train a better reasoning model. We ask: (1) What is a good measure of data diversity that actually contributes to model performance? (2) How can we use this insight to generate more diverse reasoning data?

Our answer is twofold:

  • G-Vendi Score: A data diversity measure that computes the entropy of a dataset in gradient space. G-Vendi strongly correlates with how a model trained on that dataset performs on unseen distributions (R² > 0.8).
  • Prismatic Synthesis: A novel algorithm to generate a large-scale yet diverse set of synthetic data. We use Prismatic Synthesis to create PrismNLI and PrismMath, state-of-the-art datasets for NLI and math reasoning. Despite being generated with Qwen2.5-72B-Instruct and R1-32B, our datasets yield consistently better OOD performance than datasets generated by R1-671B and verified by humans.

Introduction

PrismMath-7B vs. the State of the Art

| Model | Dataset Size | Data Generator | MATH500 | AIME24 | AIME25 | AMC23 | MATH^2 | OlympiadBench | GSM8k-Platinum | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Inst | - | - | 83.80 | 14.17 | 9.91 | 72.50 | 57.62 | 44.29 | 96.11 | 54.06 |
| OpenThinker-7B | 114k | R1-671B | 84.20 | 27.50 | 22.50 | 74.06 | 67.62 | 45.93 | 93.05 | 59.27 |
| OpenThinker2-7B | 1.14M | R1-671B | 91.40 | 50.00 | 35.00 | 88.44 | 78.10 | 69.63 | 93.96 | 72.36 |
| OpenR1-Qwen-7B | 94k | R1-671B | 90.60 | 47.91 | 30.41 | 87.19 | 78.10 | 67.06 | 96.69 | 71.14 |
| R1-Distill-Qwen-7B | Unknown | R1-671B | 92.60 | 53.33 | 33.30 | 92.50 | 78.57 | 68.00 | 89.91 | 72.60 |
| PrismMath-7B | 1.0M | R1-32B | 92.40 | 54.16 | 37.91 | 93.75 | 80.95 | 68.30 | 95.95 | 74.78 |

The table above compares PrismMath-7B against state-of-the-art 7B distilled reasoning models. Our model consistently outperforms the baselines, with a 2% average improvement over R1-Distill-Qwen-7B (R1-7B), which starts from the same base model as ours (Qwen2.5-Math-7B-Inst) but is further trained on unknown proprietary data generated by R1-671B. The results are particularly surprising because:

  • We use R1-32B as the data generator, instead of the substantially stronger R1-671B employed by the baseline models.
  • All problems, solutions, and answers are entirely model-generated, with no human verification involved. This contrasts with the dominant approach of collecting human-written problems and answers from the web and augmenting them with model-generated CoTs.

Data Diversity is Key, when Measured Correctly

The key behind this improvement is data diversity: more specifically, scaling synthetic data while ensuring that we compile a set of samples that are meaningfully "different" from each other.

While the importance of data diversity has long been emphasized in the literature, a good measure of diversity remains an elusive concept. Traditional metrics often rely on intrinsically motivated dimensions of similarity, such as token overlap or semantic similarity. The limitations of such heuristic measures become apparent quickly, as shown below:

Popular similarity measures may not capture the similarity that actually matters. Token overlap is measured by unigram Jaccard similarity; embedding similarity uses gte-Qwen-7B-Instruct, a SOTA embedding model on the MTEB benchmark. Both measures judge Sample C, which has more lexical overlap with Sample A, to be the closer one.
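As a concrete reference, here is a minimal sketch of the two heuristic measures from the figure; the embedding checkpoint name and the whitespace tokenization are illustrative assumptions rather than the exact setup used above.

```python
# Minimal sketch of the two heuristic similarity measures discussed above.
# The embedding checkpoint and whitespace tokenization are illustrative assumptions.
from sentence_transformers import SentenceTransformer

def unigram_jaccard(a: str, b: str) -> float:
    """Token overlap: Jaccard similarity between unigram sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def embedding_similarity(a: str, b: str,
                         model_name: str = "Alibaba-NLP/gte-Qwen2-7B-instruct") -> float:
    """Semantic similarity: cosine similarity between sentence embeddings."""
    model = SentenceTransformer(model_name, trust_remote_code=True)
    ea, eb = model.encode([a, b], normalize_embeddings=True)
    return float(ea @ eb)
```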

Imagine measuring the diversity of math-reasoning datasets. What definition of "similarity" should we employ? We know, for example, that increasing the topical diversity of training samples alone does not guarantee an improvement in model performance. A more desirable measure of diversity should perhaps reflect similarity in reasoning; for math reasoning, this could be measured by the similarity of the equations in each solution. Still, it remains unclear whether such "equation diversity" would transfer to tasks beyond math, where equation extraction is not well defined.



A Data Diversity Measure that Does Predict Model Generalization

The misalignment between (1) what we measure as diversity and (2) what we expect from a diverse dataset (i.e., a stronger model) motivates us to study "how data diversity could actually help us train a better reasoning model".

As the first step, we define the desideratum of a good diversity measure as:

When controlling for the scale and quality of data, a good measure of data diversity should correlate with how the model generalizes to unseen distribution.

G-Vendi Score

Next, we propose G-Vendi, a novel diversity measure that computes the entropy of data samples in gradient space.

Computing G-Vendi consists of three steps: (1) collect the normalized gradient of each sample using an off-the-shelf, instruction-tuned model, (2) reduce the dimension of the gradients with random projection while preserving their dot products, and (3) measure the entropy of the resulting density matrix. We call the exponentiated entropy Gradient-Vendi, or G-Vendi.

The critical difference from embedding-based metrics is that G-Vendi uses gradients to represent each data sample. The intuition is that when we train a model on a dataset, its parameters are updated via gradient descent over the samples; hence the gradient encodes the knowledge attainable by training on each sample.

In fact, using gradients to represent data samples is not entirely new. Prior works such as LESS or BADGE show that gradient features can approximate training data influence, and can thus be used for data selection when we have access to the target benchmark. Building upon these works, G-Vendi serves as a general-purpose diversity measure beyond the data selection or active learning setup. Moreover, G-Vendi significantly simplifies the gradient computation, as it (1) removes the need for warmup training of a gradient proxy model, and (2) does not require the proxy model to be identical to the model being trained. The gradients computed from an off-the-shelf instruction-tuned model are surprisingly effective for diversity measurement!

After collecting the low-dimensional data representations, we aggregate them into a single scalar score by computing the entropy among the gradients. Specifically, we measure the exponentiated entropy of the normalized covariance matrix (a density matrix), i.e., the Vendi Score: if λ₁, ..., λₙ are the eigenvalues of the trace-normalized gradient similarity matrix, the score is exp(−Σᵢ λᵢ log λᵢ). This lets us measure diversity without knowing the underlying distribution of the gradients.
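To make the pipeline concrete, below is a minimal sketch of the three steps under our reading of the description above; the proxy model, projection dimension, and chunked projection are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of G-Vendi: per-sample normalized gradients from an off-the-shelf
# instruct model, a seeded random projection, and the exponentiated entropy (Vendi
# score) of the resulting density matrix. Model and dimensions are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROXY = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small off-the-shelf instruct proxy
tokenizer = AutoTokenizer.from_pretrained(PROXY)
model = AutoModelForCausalLM.from_pretrained(PROXY)

def sample_gradient(prompt: str, response: str) -> torch.Tensor:
    """Flattened, L2-normalized gradient of the LM loss on `response` given `prompt`."""
    enc = tokenizer(prompt + response, return_tensors="pt")
    labels = enc["input_ids"].clone()
    labels[:, : len(tokenizer(prompt)["input_ids"])] = -100  # loss on response tokens only (approximate boundary)
    model.zero_grad()
    model(**enc, labels=labels).loss.backward()
    grad = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
    return grad / (grad.norm() + 1e-8)

def random_project(grad: torch.Tensor, dim: int = 2048, chunk: int = 1 << 15, seed: int = 0) -> torch.Tensor:
    """Seeded Johnson-Lindenstrauss projection, generated chunk-by-chunk to bound memory."""
    gen = torch.Generator().manual_seed(seed)  # same seed -> same projection for every sample
    out = torch.zeros(dim)
    for start in range(0, grad.numel(), chunk):
        block = grad[start : start + chunk]
        out += block @ (torch.randn(block.numel(), dim, generator=gen) / dim ** 0.5)
    return out

def g_vendi(samples: list[tuple[str, str]]) -> float:
    """G-Vendi: exp of the entropy of the trace-normalized gradient Gram (density) matrix."""
    feats = torch.stack([random_project(sample_gradient(p, r)) for p, r in samples])
    feats = feats / feats.norm(dim=1, keepdim=True)
    density = feats @ feats.T / len(samples)          # eigenvalues sum to 1
    eigvals = torch.linalg.eigvalsh(density).clamp(min=1e-12)
    return float(torch.exp(-(eigvals * eigvals.log()).sum()))
```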

Evaluating Diversity Measures

Recall our desideratum for diversity measures:

When controlling for the scale and quality of data, a good measure of data diversity should correlate with how the model generalizes to unseen distribution.

To evaluate this, we train over 300 models on distinct datasets for both NLI and math reasoning, while controlling for data scale and quality. We then compare the models' OOD performance with their training data diversity. Specifically:

  • For both NLI and math reasoning, we generate million-scale datasets by few-shot prompting Qwen2.5-72B-Instruct with seed datasets. Using the same data generator allows us to control for data quality, which could otherwise confound the effect of diversity.
  • We sample many subsets of the generated dataset while controlling for the subset size, and measure their respective diversity. Then we fine-tune a model on each subset, and evaluate its performance averaged across multiple unseen benchmarks.
  • Finally, we check whether the model performances correlate with the training data diversity (a minimal sketch of this check follows below).
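The final step reduces to a simple correlation analysis. Here is a minimal sketch, assuming we already have, for each subset, its G-Vendi score and the trained model's averaged OOD accuracy; the arrays are hypothetical inputs, and the plain linear fit is our simplification.

```python
# Minimal sketch of the correlation check between data diversity and OOD performance.
# `diversity[i]` and `ood_acc[i]` are hypothetical arrays, one entry per trained subset.
import numpy as np
from scipy import stats

def correlation_report(diversity: np.ndarray, ood_acc: np.ndarray) -> dict:
    """R^2 of a simple linear fit plus Spearman's rank correlation."""
    lin = stats.linregress(diversity, ood_acc)
    rho, _ = stats.spearmanr(diversity, ood_acc)
    return {"r_squared": lin.rvalue ** 2, "spearman_rho": rho}

# e.g. correlation_report(np.array([12.3, 45.1, ...]), np.array([0.61, 0.72, ...]))
```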

Evaluation Results

G-Vendi score vs. OOD performance.
(Left) We use MATH + GSM8k as seed datasets and 5-shot prompt Qwen2.5-72B-Instruct to generate a 1.5M data pool. We sample a total of 180 distinct subsets of size 100k, 50k, and 10k, then train Llama-3.2-1B on each subset. OOD performance is measured as the relative accuracy of each model compared to Llama-3.2-1B trained on the full 1.5M pool, averaged across 7 unseen benchmarks. (Right) We use WANLI + MNLI as seed datasets and follow the same process as for math. We train deberta-v3-large on each subset and evaluate on 6 unseen benchmarks.
Correlation analysis between baseline measures and OOD performance on math reasoning. For embeddings we use gte-Qwen-7B-Instruct, one of the top-ranking models on the MTEB benchmark. Embedding Entropy is the Vendi Score computed on the embedding representation of each sample, and Embedding InvSim is the average of (1 − pairwise embedding similarity). Average Perplexity is measured with Qwen2.5-0.5B-Instruct.

We summarize three key insights:

  • Existing measures fail to meet the desideratum. Not surprisingly, popular diversity measures show weak correlation with how the model performs after training.
  • Gradients from a small instruct model carry useful information. Compared to the baselines, G-Vendi is strongly indicative of how the model generalizes, reaching R² > 0.8 and Spearman's ρ ≈ 0.9 in both tasks, across all data scales.
  • Diversity often overrides scale. Training on a smaller dataset with higher G-Vendi diversity can outperform training on a 10× larger dataset, even when both are drawn from the same data pool. However, scale is still a dominant factor for in-distribution performance, improving the model in ways complementary to diversity.

Overall, data diversity as measured by G-Vendi can actually help the model generalize better. Can we leverage this insight to generate even more synthetic data while maintaining the diversity of the generated samples?

Prismatic Synthesis - Amplifying Generalizability with Diversified Synthetic Data

One strength of synthetic data is that we have full control over the data-generating function: once we find a good gauge of data quality, we can immediately modify the pipeline to improve it. We introduce Prismatic Synthesis, an algorithm that improves both data scale and diversity by strategically generating novel synthetic data.

Starting from a seed dataset, Prismatic Synthesis repeats a 3-step process: (1) cluster existing samples in gradient space, (2) generate new samples based on existing ones, and (3) diversify by keeping only the samples that fall into sparse clusters. By repeating this process, we iteratively add only the new samples that are underrepresented in the current dataset, constantly improving both data diversity and scale. It works just like a prism, dispersing light into diverse wavelengths!
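Below is a minimal sketch of this loop; `gradient_feature` (the projected gradient embedding from the G-Vendi pipeline) and `generate_candidates` (the few-shot generator) are hypothetical stand-ins, and the cluster count, number of rounds, and per-round quota are illustrative choices rather than the authors' settings.

```python
# Minimal sketch of the Prismatic Synthesis loop. `gradient_feature` and
# `generate_candidates` are hypothetical stand-ins; hyperparameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def prismatic_synthesis(seed_data, gradient_feature, generate_candidates,
                        n_clusters=1000, rounds=5, keep_per_round=10_000):
    data = list(seed_data)
    for _ in range(rounds):
        # (1) Cluster the current dataset in gradient space.
        feats = np.stack([gradient_feature(x) for x in data])
        km = KMeans(n_clusters=n_clusters, n_init="auto").fit(feats)
        cluster_sizes = np.bincount(km.labels_, minlength=n_clusters)

        # (2) Generate new candidates conditioned on existing samples.
        candidates = generate_candidates(data)

        # (3) Keep only the candidates that fall into sparse (underrepresented) clusters.
        cand_clusters = km.predict(np.stack([gradient_feature(x) for x in candidates]))
        order = np.argsort(cluster_sizes[cand_clusters])   # smallest clusters first
        data.extend(candidates[i] for i in order[:keep_per_round])
    return data
```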

Synthetic Data Saturates without Diversification

Comparison of synthetic data generation methods on math reasoning. We use Qwen2.5-72B-Instruct and R1-Distill-32B to generate problems and solutions respectively, and MATH and GSM8k as seed datasets. For sub-100k sizes, we randomly sample subsets 5 times from the 100k pool. Test accuracy is averaged over MATH500, AIME24, AIME25, AMC23, MATH^2, OlympiadBench, and GSM8k-Platinum.

We test how scaling synthetic data improves model performance while varying the data generation method: vanilla few-shot, persona-guided few-shot, and Prismatic Synthesis. Without Prismatic Synthesis, synthetic data already starts to saturate at the 50k–100k scale, even with heuristic diversification methods based on personas.
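For reference, here is a hedged sketch of what the vanilla few-shot baseline looks like: a problem generator prompted with a handful of seed problems, and a solution generator producing the chain of thought. The prompts, sampling settings, and use of `transformers` pipelines are our assumptions; models of this size would in practice be served with a dedicated inference stack.

```python
# Hedged sketch of the vanilla few-shot baseline (not Prismatic Synthesis itself).
# Prompts and sampling parameters below are illustrative assumptions.
from transformers import pipeline

problem_gen = pipeline("text-generation", model="Qwen/Qwen2.5-72B-Instruct", device_map="auto")
solution_gen = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", device_map="auto")

def generate_sample(seed_problems: list[str]) -> dict:
    # Few-shot prompt the problem generator with seed problems from MATH / GSM8k.
    few_shot = "\n\n".join(f"Problem: {p}" for p in seed_problems)
    prompt = f"{few_shot}\n\nWrite one new, different math problem.\nProblem:"
    problem = problem_gen(prompt, max_new_tokens=256, do_sample=True, temperature=1.0,
                          return_full_text=False)[0]["generated_text"].strip()
    # Ask the reasoning model for a step-by-step solution (CoT) to the new problem.
    solution = solution_gen(f"Solve the problem step by step.\nProblem: {problem}\nSolution:",
                            max_new_tokens=4096, return_full_text=False)[0]["generated_text"]
    return {"problem": problem, "solution": solution}
```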

PrismNLI and PrismMath: Strategically Diversified Synthetic Data

We further scale Prismatic Synthesis beyond the 100k level for both NLI and math reasoning. The resulting datasets, PrismNLI and PrismMath, lead to state-of-the-art models, despite being generated by 32B and 72B LLMs without any manual verification.

PrismMath-7B vs. the State of the Art

| Model | Dataset Size | Data Generator | MATH500 | AIME24 | AIME25 | AMC23 | MATH^2 | OlympiadBench | GSM8k-Platinum | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Inst | - | - | 83.80 | 14.17 | 9.91 | 72.50 | 57.62 | 44.29 | 96.11 | 54.06 |
| OpenThinker-7B | 114k | R1-671B | 84.20 | 27.50 | 22.50 | 74.06 | 67.62 | 45.93 | 93.05 | 59.27 |
| OpenThinker2-7B | 1.14M | R1-671B | 91.40 | 50.00 | 35.00 | 88.44 | 78.10 | 69.63 | 93.96 | 72.36 |
| OpenR1-Qwen-7B | 94k | R1-671B | 90.60 | 47.91 | 30.41 | 87.19 | 78.10 | 67.06 | 96.69 | 71.14 |
| R1-Distill-Qwen-7B | Unknown | R1-671B | 92.60 | 53.33 | 33.30 | 92.50 | 78.57 | 68.00 | 89.91 | 72.60 |
| PrismMath-7B | 1.0M | R1-32B | 92.40 | 54.16 | 37.91 | 93.75 | 80.95 | 68.30 | 95.95 | 74.78 |

PrismNLI vs. the State of the Art

| Dataset | Dataset Size | Data Generator | HANS | WNLI | ANLI-r1 | ANLI-r2 | ANLI-r3 | Diagnostics | BigBench | Control | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MNLI | 393k | Human | 78.47 | 63.03 | 60.10 | 45.30 | 42.58 | 81.88 | 78.22 | 42.16 | 61.47 |
| WANLI | 103k | ChatGPT+Human | 89.25 | 75.21 | 61.30 | 46.50 | 44.90 | 83.68 | 80.81 | 43.93 | 65.70 |
| MNLI + FEVER | 601k | Human | 74.62 | 66.29 | 60.20 | 47.00 | 41.75 | 80.89 | 76.14 | 48.51 | 61.93 |
| MNLI + WANLI + SNLI | 943k | ChatGPT+Human | 80.18 | 69.41 | 65.00 | 50.60 | 45.25 | 83.77 | 84.72 | 50.49 | 66.18 |
| PrismNLI | 515k | Qwen2.5-72B-Inst | 92.44 | 78.47 | 73.70 | 61.90 | 57.00 | 86.13 | 86.32 | 58.25 | 74.28 |

For math, we compare against SOTA 7B reasoning models trained via SFT on traces generated by R1-671B. We use HuggingFace lighteval and math-verify for evaluation; for AIME and AMC, we average results across 8 runs to mitigate high variance. For NLI, we fix the base model to deberta-v3-large and train it on the respective datasets. Our datasets yield better results across both tasks and throughout the benchmarks.


Key Takeaways

  • Diversity matters, but only when measured correctly. Data diversity impacts how the model generalizes, but improving semantic diversity does not guarantee a better model. G-Vendi can more accurately predict how a model will generalize at test time by measuring the entropy of gradients from an off-the-shelf proxy model.
  • Synthetic data is prone to significant duplication and needs intervention for diversity. While data synthesis provides an attractive shortcut for collecting more data, naive scaling does not guarantee a better model and saturates early. Heuristic diversification (such as auxiliary conditioning on personas) can help to some extent, but may not always align with the kind of diversity that matters for the target task. With Prismatic Synthesis, however, strategically diversified data can surpass carefully curated data distilled from frontier models and manually verified by humans.

BibTeX


    @misc{prismatic-synthesis,
      title={Prismatic Synthesis \& G-Vendi Score: Gradient-based Diversification Yields Superior Synthetic Data for Reasoning},
      author={Jaehun Jung and Seungju Han and Ximing Lu and Skyler Hallinan and David Acuna and Shrimai Prabhumoye and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Yejin Choi},
      howpublished={\url{https://nvlabs.github.io/prismatic-synthesis}},
      note={Blog},
      year={2025}
    }