Prismatic Synthesis & G-Vendi Score: How Data Diversification Could Make R1-32B a Better Teacher Model than R1-671B

1NVIDIA 2University of Washington 3Stanford University 4University of Southern California
*: Equal contribution

🚀 Full paper coming soon on arXiv!

💡 Project Summary

We study how improving data diversity helps us to train a better reasoning model. We ask: (1) What is a good measure of data diversity that actually contributes to model performance? (2) How can we use this insight to generate more diverse reasoning data?

Our answer is twofold:

  • G-Vendi Score: A data diversity measure that computes the entropy of a dataset in gradient space. G-Vendi strongly correlates with how a model trained on that dataset performs on unseen distributions (R² > 0.8).
  • Prismatic Synthesis: A novel algorithm to generate a large-scale yet diverse set of synthetic data. We use Prismatic Synthesis to create PrismNLI and PrismMath, state-of-the-art datasets for NLI and math reasoning. Despite being generated with Qwen2.5-72B-Instruct and R1-32B, our datasets yield consistently better OOD performance than datasets generated by R1-671B and verified by humans.

Introduction

PrismMath-7B vs. the State of the Art

| Model | Dataset Size | Data Generator | MATH500 | AIME24 | AIME25 | AMC23 | MATH^2 | OlympiadBench | GSM8k-Platinum | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Inst | - | - | 83.80 | 14.17 | 9.91 | 72.50 | 57.62 | 44.29 | 96.11 | 54.06 |
| OpenThinker-7B | 114k | R1-671B | 84.20 | 27.50 | 22.50 | 74.06 | 67.62 | 45.93 | 93.05 | 59.27 |
| OpenThinker2-7B | 1.14M | R1-671B | 91.40 | 50.00 | 35.00 | 88.44 | 78.10 | 69.63 | 93.96 | 72.36 |
| OpenR1-Qwen-7B | 94k | R1-671B | 90.60 | 47.91 | 30.41 | 87.19 | 78.10 | 67.06 | 96.69 | 71.14 |
| R1-Distill-Qwen-7B | Unknown | R1-671B | 92.60 | 53.33 | 33.30 | 92.50 | 78.57 | 68.00 | 89.91 | 72.60 |
| PrismMath-7B | 1.0M | R1-32B | 92.40 | 54.16 | 37.91 | 93.75 | 80.95 | 68.30 | 95.95 | 74.78 |

The table above compares PrismMath-7B against state-of-the-art 7B distilled reasoning models. Our model consistently outperforms the baselines, with a 2% average improvement over R1-Distill-Qwen-7B (R1-7B), which starts from the same base model as ours (Qwen2.5-Math-7B-Inst) but is further trained on unknown proprietary data generated by R1-671B. The results are particularly surprising because:

  • We use R1-32B as the data generator, instead of the substantially stronger R1-671B employed by the baseline models.
  • All problems, solutions, and answers are entirely model-generated, with no human verification involved. This contrasts with the dominant approach of collecting human-written problems and answers from the web and augmenting them with model-generated CoTs.

Data Diversity is Key, when Measured Correctly

The key behind this improvement is data diversity: more specifically, scaling synthetic data while ensuring that we compile a set of samples that are meaningfully "different" from each other.

While the importance of data diversity has long been emphasized in the literature, a good measure of diversity remains an elusive concept. Traditional metrics often rely on intrinsically motivated dimensions of similarity, such as token overlap or semantic similarity. The limitations of such heuristic measures become apparent quickly, as shown below:

Popular similarity measures may not capture the similarity that actually matters. Token overlap is measured by unigram Jaccard similarity; embedding similarity uses gte-Qwen-7B-Instruct, a SOTA embedding model on the MTEB benchmark. Both measures judge Sample C, which has more lexical overlap with Sample A, to be the closer one.
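As a concrete reference, here is a minimal sketch of the two heuristic measures from the figure; the embedding checkpoint name and the whitespace tokenization are illustrative assumptions rather than the exact setup used above.

```python
# Minimal sketch of the two heuristic similarity measures discussed above.
# The embedding checkpoint and whitespace tokenization are illustrative assumptions.
from sentence_transformers import SentenceTransformer

def unigram_jaccard(a: str, b: str) -> float:
    """Token overlap: Jaccard similarity between unigram sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def embedding_similarity(a: str, b: str,
                         model_name: str = "Alibaba-NLP/gte-Qwen2-7B-instruct") -> float:
    """Semantic similarity: cosine similarity between sentence embeddings."""
    model = SentenceTransformer(model_name, trust_remote_code=True)
    ea, eb = model.encode([a, b], normalize_embeddings=True)
    return float(ea @ eb)
```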

Imagine measuring the diversity of math-reasoning datasets. What definition of "similarity" should we employ? We know, for example, that increasing the topical diversity of training samples alone does not guarantee an improvement in model performance. A more desirable measure of diversity should perhaps reflect similarity in reasoning; for math reasoning, this could be measured by the similarity of the equations in each solution. Still, it remains unclear whether such "equation diversity" would transfer to tasks beyond math, where equation extraction is not well defined.



A Data Diversity Measure that Does Predict Model Generalization

The misalignment between (1) what we measure as diversity and (2) what we expect from a diverse dataset (i.e., a stronger model) motivates us to study "how data diversity could actually help us train a better reasoning model".

As the first step, we define the desideratum of a good diversity measure as:

When controlling for the scale and quality of data, a good measure of data diversity should correlate with how the model generalizes to unseen distribution.

G-Vendi Score

Next, we propose G-Vendi, a novel diversity measure that computes the entropy of data samples in gradient space.

Computing G-Vendi consists of three steps: (1) collect the normalized gradient of each sample using an off-the-shelf, instruction-tuned model, (2) reduce the dimension of the gradients with random projection while preserving their dot products, and (3) measure the entropy of the resulting density matrix. We call the exponentiated entropy Gradient-Vendi, or G-Vendi.

The critical difference from embedding-based metrics is that G-Vendi uses gradients to represent each data sample. The intuition is that when we train a model on a dataset, its parameters are updated via gradient descent over the samples; hence the gradient encodes the knowledge attainable by training on each sample.

In fact, using gradients to represent data samples is not entirely new. Prior works such as LESS or BADGE show that gradient features can approximate training data influence, and can thus be used for data selection when we have access to the target benchmark. Building upon these works, G-Vendi serves as a general-purpose diversity measure beyond the data selection or active learning setup. Moreover, G-Vendi significantly simplifies the gradient computation, as it (1) removes the need for warmup training of a gradient proxy model, and (2) does not require the proxy model to be identical to the model being trained. The gradients computed from an off-the-shelf instruction-tuned model are surprisingly effective for diversity measurement!

After collecting the low-dimensional data representations, we aggregate them into a single scalar score by computing the entropy among the gradients. Specifically, we measure the exponentiated entropy of the normalized covariance matrix (a density matrix), i.e., the Vendi Score: if λ₁, ..., λₙ are the eigenvalues of the trace-normalized gradient similarity matrix, the score is exp(−Σᵢ λᵢ log λᵢ). This lets us measure diversity without knowing the underlying distribution of the gradients.
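To make the pipeline concrete, below is a minimal sketch of the three steps under our reading of the description above; the proxy model, projection dimension, and chunked projection are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of G-Vendi: per-sample normalized gradients from an off-the-shelf
# instruct model, a seeded random projection, and the exponentiated entropy (Vendi
# score) of the resulting density matrix. Model and dimensions are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROXY = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small off-the-shelf instruct proxy
tokenizer = AutoTokenizer.from_pretrained(PROXY)
model = AutoModelForCausalLM.from_pretrained(PROXY)

def sample_gradient(prompt: str, response: str) -> torch.Tensor:
    """Flattened, L2-normalized gradient of the LM loss on `response` given `prompt`."""
    enc = tokenizer(prompt + response, return_tensors="pt")
    labels = enc["input_ids"].clone()
    labels[:, : len(tokenizer(prompt)["input_ids"])] = -100  # loss on response tokens only (approximate boundary)
    model.zero_grad()
    model(**enc, labels=labels).loss.backward()
    grad = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
    return grad / (grad.norm() + 1e-8)

def random_project(grad: torch.Tensor, dim: int = 2048, chunk: int = 1 << 15, seed: int = 0) -> torch.Tensor:
    """Seeded Johnson-Lindenstrauss projection, generated chunk-by-chunk to bound memory."""
    gen = torch.Generator().manual_seed(seed)  # same seed -> same projection for every sample
    out = torch.zeros(dim)
    for start in range(0, grad.numel(), chunk):
        block = grad[start : start + chunk]
        out += block @ (torch.randn(block.numel(), dim, generator=gen) / dim ** 0.5)
    return out

def g_vendi(samples: list[tuple[str, str]]) -> float:
    """G-Vendi: exp of the entropy of the trace-normalized gradient Gram (density) matrix."""
    feats = torch.stack([random_project(sample_gradient(p, r)) for p, r in samples])
    feats = feats / feats.norm(dim=1, keepdim=True)
    density = feats @ feats.T / len(samples)          # eigenvalues sum to 1
    eigvals = torch.linalg.eigvalsh(density).clamp(min=1e-12)
    return float(torch.exp(-(eigvals * eigvals.log()).sum()))
```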

Evaluating Diversity Measures

Recall our desideratum for diversity measures:

When controlling for the scale and quality of data, a good measure of data diversity should correlate with how the model generalizes to unseen distribution.

To evaluate this, we train over 300 models on distinct datasets for both NLI and math reasoning, while controlling for data scale and quality. We then compare the models' OOD performance with their training data diversity. Specifically:

  • For both NLI and math reasoning, we generate million-scale datasets by few-shot prompting Qwen2.5-72B-Instruct with seed datasets. Using the same data generator allows us to control for data quality, which could otherwise confound the effect of diversity.
  • We sample many subsets of the generated dataset while controlling for the subset size, and measure their respective diversity. Then we fine-tune a model on each subset, and evaluate its performance averaged across multiple unseen benchmarks.
  • Finally, we check whether the model performances correlate with the training data diversity (a minimal sketch of this check follows below).
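The final step reduces to a simple correlation analysis. Here is a minimal sketch, assuming we already have, for each subset, its G-Vendi score and the trained model's averaged OOD accuracy; the arrays are hypothetical inputs, and the plain linear fit is our simplification.

```python
# Minimal sketch of the correlation check between data diversity and OOD performance.
# `diversity[i]` and `ood_acc[i]` are hypothetical arrays, one entry per trained subset.
import numpy as np
from scipy import stats

def correlation_report(diversity: np.ndarray, ood_acc: np.ndarray) -> dict:
    """R^2 of a simple linear fit plus Spearman's rank correlation."""
    lin = stats.linregress(diversity, ood_acc)
    rho, _ = stats.spearmanr(diversity, ood_acc)
    return {"r_squared": lin.rvalue ** 2, "spearman_rho": rho}

# e.g. correlation_report(np.array([12.3, 45.1, ...]), np.array([0.61, 0.72, ...]))
```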

Evaluation Results

G-Vendi score vs. OOD performance.
(Left) We use MATH + GSM8k as seed datasets and 5-shot prompt Qwen2.5-72B-Instruct to generate a 1.5M data pool. We sample a total of 180 distinct subsets of size 100k, 50k, and 10k, then train Llama-3.2-1B on each subset. OOD performance is measured as the relative accuracy of each model compared to Llama-3.2-1B trained on the full 1.5M pool, averaged across 7 unseen benchmarks. (Right) We use WANLI + MNLI as seed datasets and follow the same process as for math. We train deberta-v3-large on each subset and evaluate on 6 unseen benchmarks.
Correlation analysis between baseline measures and OOD performance on math reasoning. For embeddings we use gte-Qwen-7B-Instruct, one of the top-ranking models on the MTEB benchmark. Embedding Entropy is the Vendi Score computed on the embedding representation of each sample, and Embedding InvSim is the average of (1 − pairwise embedding similarity). Average Perplexity is measured with Qwen2.5-0.5B-Instruct.

We summarize three key insights:

  • Existing measures fail to meet the desideratum. Not surprisingly, popular diversity measures show weak correlation with how the model performs after training.
  • Gradients from a small instruct model carry useful information. Compared to the baselines, G-Vendi is strongly indicative of how the model generalizes, reaching R² > 0.8 and Spearman's ρ ≈ 0.9 in both tasks, across all data scales.
  • Diversity often overrides scale. Training on a smaller dataset with higher G-Vendi diversity can outperform training on a 10× larger dataset, even when both are drawn from the same data pool. However, scale is still a dominant factor for in-distribution performance, improving the model in ways complementary to diversity.

Overall, data diversity as measured by G-Vendi can actually help the model generalize better. Can we leverage this insight to generate even more synthetic data while maintaining the diversity of the generated samples?

Prismatic Synthesis - Amplifying Generalizability with Diversified Synthetic Data

One strength of synthetic data is that we have full control over the data-generating function: once we find a good gauge of data quality, we can immediately modify the pipeline to improve it. We introduce Prismatic Synthesis, an algorithm that improves both data scale and diversity by strategically generating novel synthetic data.

Starting from a seed dataset, Prismatic Synthesis repeats a 3-step process: (1) cluster existing samples in gradient space, (2) generate new samples based on existing ones, and (3) diversify by keeping only the samples that fall into sparse clusters. By repeating this process, we iteratively add only the new samples that are underrepresented in the current dataset, constantly improving both data diversity and scale. It works just like a prism, dispersing light into diverse wavelengths!
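Below is a minimal sketch of this loop; `gradient_feature` (the projected gradient embedding from the G-Vendi pipeline) and `generate_candidates` (the few-shot generator) are hypothetical stand-ins, and the cluster count, number of rounds, and per-round quota are illustrative choices rather than the authors' settings.

```python
# Minimal sketch of the Prismatic Synthesis loop. `gradient_feature` and
# `generate_candidates` are hypothetical stand-ins; hyperparameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def prismatic_synthesis(seed_data, gradient_feature, generate_candidates,
                        n_clusters=1000, rounds=5, keep_per_round=10_000):
    data = list(seed_data)
    for _ in range(rounds):
        # (1) Cluster the current dataset in gradient space.
        feats = np.stack([gradient_feature(x) for x in data])
        km = KMeans(n_clusters=n_clusters, n_init="auto").fit(feats)
        cluster_sizes = np.bincount(km.labels_, minlength=n_clusters)

        # (2) Generate new candidates conditioned on existing samples.
        candidates = generate_candidates(data)

        # (3) Keep only the candidates that fall into sparse (underrepresented) clusters.
        cand_clusters = km.predict(np.stack([gradient_feature(x) for x in candidates]))
        order = np.argsort(cluster_sizes[cand_clusters])   # smallest clusters first
        data.extend(candidates[i] for i in order[:keep_per_round])
    return data
```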

Synthetic Data Saturates without Diversification

Comparison of synthetic data generation methods on math reasoning. We use Qwen2.5-72B-Instruct and R1-Distill-32B to generate problems and solutions respectively, and MATH and GSM8k as seed datasets. For sub-100k sizes, we randomly sample subsets 5 times from the 100k pool. Test accuracy is averaged over MATH500, AIME24, AIME25, AMC23, MATH^2, OlympiadBench, and GSM8k-Platinum.

We test how scaling synthetic data improves model performance while varying the data generation method: vanilla few-shot, persona-guided few-shot, and Prismatic Synthesis. Without Prismatic Synthesis, synthetic data already starts to saturate at the 50k–100k scale, even with heuristic diversification methods based on personas.
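For reference, here is a hedged sketch of what the vanilla few-shot baseline looks like: a problem generator prompted with a handful of seed problems, and a solution generator producing the chain of thought. The prompts, sampling settings, and use of `transformers` pipelines are our assumptions; models of this size would in practice be served with a dedicated inference stack.

```python
# Hedged sketch of the vanilla few-shot baseline (not Prismatic Synthesis itself).
# Prompts and sampling parameters below are illustrative assumptions.
from transformers import pipeline

problem_gen = pipeline("text-generation", model="Qwen/Qwen2.5-72B-Instruct", device_map="auto")
solution_gen = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", device_map="auto")

def generate_sample(seed_problems: list[str]) -> dict:
    # Few-shot prompt the problem generator with seed problems from MATH / GSM8k.
    few_shot = "\n\n".join(f"Problem: {p}" for p in seed_problems)
    prompt = f"{few_shot}\n\nWrite one new, different math problem.\nProblem:"
    problem = problem_gen(prompt, max_new_tokens=256, do_sample=True, temperature=1.0,
                          return_full_text=False)[0]["generated_text"].strip()
    # Ask the reasoning model for a step-by-step solution (CoT) to the new problem.
    solution = solution_gen(f"Solve the problem step by step.\nProblem: {problem}\nSolution:",
                            max_new_tokens=4096, return_full_text=False)[0]["generated_text"]
    return {"problem": problem, "solution": solution}
```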

PrismNLI and PrismMath: Strategically Diversified Synthetic Data

We further scale Prismatic Synthesis beyond the 100k level for both NLI and math reasoning. The resulting datasets, PrismNLI and PrismMath, lead to state-of-the-art models, despite being generated by 32B and 72B LLMs without any manual verification.

PrismMath-7B vs. the State of the Art

| Model | Dataset Size | Data Generator | MATH500 | AIME24 | AIME25 | AMC23 | MATH^2 | OlympiadBench | GSM8k-Platinum | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Inst | - | - | 83.80 | 14.17 | 9.91 | 72.50 | 57.62 | 44.29 | 96.11 | 54.06 |
| OpenThinker-7B | 114k | R1-671B | 84.20 | 27.50 | 22.50 | 74.06 | 67.62 | 45.93 | 93.05 | 59.27 |
| OpenThinker2-7B | 1.14M | R1-671B | 91.40 | 50.00 | 35.00 | 88.44 | 78.10 | 69.63 | 93.96 | 72.36 |
| OpenR1-Qwen-7B | 94k | R1-671B | 90.60 | 47.91 | 30.41 | 87.19 | 78.10 | 67.06 | 96.69 | 71.14 |
| R1-Distill-Qwen-7B | Unknown | R1-671B | 92.60 | 53.33 | 33.30 | 92.50 | 78.57 | 68.00 | 89.91 | 72.60 |
| PrismMath-7B | 1.0M | R1-32B | 92.40 | 54.16 | 37.91 | 93.75 | 80.95 | 68.30 | 95.95 | 74.78 |

PrismNLI vs. the State of the Art

| Dataset | Dataset Size | Data Generator | HANS | WNLI | ANLI-r1 | ANLI-r2 | ANLI-r3 | Diagnostics | BigBench | Control | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MNLI | 393k | Human | 78.47 | 63.03 | 60.10 | 45.30 | 42.58 | 81.88 | 78.22 | 42.16 | 61.47 |
| WANLI | 103k | ChatGPT+Human | 89.25 | 75.21 | 61.30 | 46.50 | 44.90 | 83.68 | 80.81 | 43.93 | 65.70 |
| MNLI + FEVER | 601k | Human | 74.62 | 66.29 | 60.20 | 47.00 | 41.75 | 80.89 | 76.14 | 48.51 | 61.93 |
| MNLI + WANLI + SNLI | 943k | ChatGPT+Human | 80.18 | 69.41 | 65.00 | 50.60 | 45.25 | 83.77 | 84.72 | 50.49 | 66.18 |
| PrismNLI | 515k | Qwen2.5-72B-Inst | 92.44 | 78.47 | 73.70 | 61.90 | 57.00 | 86.13 | 86.32 | 58.25 | 74.28 |

For math, we compare against SOTA 7B reasoning models trained via SFT on traces generated by R1-671B. We use HuggingFace lighteval and math-verify for evaluation; for AIME and AMC, we average results across 8 runs to mitigate high variance. For NLI, we fix the base model to deberta-v3-large and train it on the respective datasets. Our datasets yield better results across both tasks and throughout the benchmarks.


Key Takeaways

  • Diversity matters, but only when measured correctly. Data diversity impacts how the model generalizes, but improving semantic diversity does not guarantee a better model. G-Vendi can more accurately predict how a model will generalize at test time by measuring the entropy of gradients from an off-the-shelf proxy model.
  • Synthetic data is prone to significant duplication and needs intervention for diversity. While data synthesis provides an attractive shortcut for collecting more data, naive scaling does not guarantee a better model and saturates early. Heuristic diversification (such as auxiliary conditioning on personas) can help to some extent, but may not always align with the kind of diversity that matters for the target task. With Prismatic Synthesis, however, strategically diversified data can surpass carefully curated data distilled from frontier models and manually verified by humans.

BibTeX


    @misc{prismatic-synthesis,
      title={Prismatic Synthesis \& G-Vendi Score: Gradient-based Diversification Yields Superior Synthetic Data for Reasoning},
      author={Jaehun Jung and Seungju Han and Ximing Lu and Skyler Hallinan and David Acuna and Shrimai Prabhumoye and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Yejin Choi},
      howpublished={\url{https://nvlabs.github.io/prismatic-synthesis}},
      note={Blog},
      year={2025}
    }