Long Grounded Thoughts:
Synthesizing Visual Problems and
Reasoning Chains at Scale

David Acuna1* Chao-Han Huck Yang1* Yuntian Deng1,2* Jaehun Jung1* Ximing Lu1* Prithviraj Ammanabrolu1,3 Hyunwoo Kim1 Yuan-Hong Liao4 Yejin Choi1
1NVIDIA  ·  2University of Waterloo  ·  3UC San Diego  ·  4University of Toronto
*Joint First Authors

Abstract

Despite rapid progress, multimodal reasoning still lacks a systematic approach to synthesizing large-scale vision-centric datasets beyond visual math. We introduce a framework that synthesizes vision-centric problems spanning diverse skills and levels of complexity, and release the resulting dataset of over 1M high-quality problems, including reasoning traces, preference data, and instruction prompts supporting SFT, offline RL, and online RL. Our vision-centric synthesis framework uses a two-stage process: (1) generating diverse verifiable questions from existing images at scale, and (2) creating complex compositional visual problems by merging simpler questions. Remarkably, finetuning Qwen2.5-VL-7B on our data outperforms existing open-data baselines across the evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on V*Bench, CV-Bench, and MMStar-V. Notably, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro, +3.7%) and audio reasoning (MMAU, +1.32%). Similarly, despite containing no embodied visual data, we observe notable gains (NiEH, +8.8%) on open-ended embodied QA. Lastly, we use our data to comprehensively analyze the entire VLM post-training pipeline at scale (1M+), showing that (i) SFT on high-quality data with cognitive behaviors in reasoning traces is essential to scale online RL, (ii) offline RL can match online RL's performance while disaggregating compute demands, and (iii) SFT on high-quality data also improves out-of-domain, cross-modality transfer.

Comparison of our visual reasoning dataset with prominent open-source counterparts.

Method

Overview of our two-stage data generation framework. First, we synthesize multiple-choice questions (MCQs) from dense captions and grounded object metadata. Then, we harden these questions by composing them into visual reasoning problems that require decomposition and higher-order cognitive patterns. For each stage, we synthesize reasoning traces by first distilling CoTs from VLMs and then expanding them with reasoning LLMs, yielding traces with greater reasoning depth.
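The two stages described above can be sketched in a few lines of code. This is a minimal toy illustration, not the paper's actual implementation: the data structure and the helper names `synthesize_mcq` and `compose` are assumptions for exposition.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQ:
    """A verifiable multiple-choice question tied to one image."""
    question: str
    choices: list
    answer: str

def synthesize_mcq(obj_name: str, attr_name: str, attr_value: str,
                   distractors: list) -> MCQ:
    """Stage 1: turn grounded object metadata into a simple, verifiable MCQ."""
    choices = [attr_value] + distractors
    random.shuffle(choices)
    return MCQ(
        question=f"What {attr_name} is the {obj_name}?",
        choices=choices,
        answer=attr_value,
    )

def compose(q1: MCQ, q2: MCQ) -> MCQ:
    """Stage 2 (hardening): chain two simple questions so the composed
    problem requires decomposition -- solve q1 before attempting q2."""
    return MCQ(
        question=(f"First answer: '{q1.question}' "
                  f"Then, conditioned on that answer, solve: '{q2.question}'"),
        choices=q2.choices,
        answer=q2.answer,
    )
```

In this sketch, composing keeps the second question's verifiable answer, so the hardened problem remains automatically checkable while demanding a multi-step solution.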

Demonstrations

Reasoning trace comparison between our model (post-SFT and RL) and the vanilla base model. Both models initially fail to identify the dog in the image. The base model terminates with an incorrect answer based on this flawed premise. In contrast, our model demonstrates a non-linear reasoning process; it employs self-verification and backtracking to challenge and self-correct its initial assessment.
Right: The effect of scaling dataset size during online RL. The baseline (blue) exhibits negative scaling: performance peaks at 0.695 (10K samples) and degrades with more data. Our method (green), which includes SFT on our data with complex reasoning traces, allows online RL to continue scaling. Left: A qualitative example from V* Bench. The baseline fails to identify the partially obscured dog. Our model initially expresses confusion but then self-corrects ("Wait, I'm getting conflicting information..."), showcasing multi-step reasoning instilled by our data.
Additional qualitative example of a reasoning trace from the model post-trained on our data versus the base model.
Cross-modal transfer: temporal reasoning improvement on an unseen audio reasoning question. Despite training on vision-only data, the Qwen2.5-Omni model post-trained on our data shows improved reasoning compared to the base model.

Results

Main results on vision-centric reasoning benchmarks.

Scaling Behavior

Scaling behavior of LPT vs. Ours for SFT. Grounded object metadata enables diverse, controlled MCQ generation, scaling beyond 1M examples without saturation.

MCQ Complexity

Complexity estimation via multiple rollouts on synthesized MCQs using Qwen2.5-VL as a policy. Darker green indicates easier problems. Our composition hardening reduces trivially solvable questions from 36.7% to 3.3%.
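The rollout-based filtering step above can be sketched as follows. This is a hedged illustration under simplifying assumptions: `policy` is a stand-in callable for a VLM such as Qwen2.5-VL (not its actual API), and the helper names and the 0.9 pass-rate cutoff are hypothetical.

```python
def estimate_pass_rate(problem: dict, policy, n_rollouts: int = 8) -> float:
    """Fraction of sampled rollouts that answer the MCQ correctly.
    A pass rate near 1.0 marks the problem as trivially solvable."""
    correct = sum(policy(problem) == problem["answer"] for _ in range(n_rollouts))
    return correct / n_rollouts

def drop_trivial(problems: list, policy, n_rollouts: int = 8,
                 max_pass_rate: float = 0.9) -> list:
    """Keep only problems the policy fails on at least some rollouts."""
    return [p for p in problems
            if estimate_pass_rate(p, policy, n_rollouts) < max_pass_rate]
```

With a stochastic policy, each rollout samples a new answer, so the estimated pass rate is a Monte Carlo proxy for problem difficulty; raising `n_rollouts` tightens the estimate at extra inference cost.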

Citation

If you find this work helpful, please consider citing:

@article{acuna2025lgt,
  title={Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale},
  author={Acuna, David and Yang, Chao-Han Huck and Deng, Yuntian and Jung, Jaehun and Lu, Ximing and Ammanabrolu, Prithviraj and Kim, Hyunwoo and Liao, Yuan-Hong and Choi, Yejin},
  journal={arXiv preprint arXiv:2511.05705},
  year={2025}
}