NVIDIA Research · University of Waterloo · UC San Diego · University of Toronto
♣ Joint First Authors
Abstract
Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity, with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale, where imagery and metadata (captions, bounding boxes) are used to generate diverse, verifiable visual questions; and (2) complexity, where a composition-hardening algorithm merges simpler questions from the previous stage into harder, still verifiable visual problems. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT supervision for VLMs that captures the richness and diverse cognitive behaviors found in frontier reasoning models.
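To make the two stages concrete, below is a minimal, hypothetical sketch in Python. The function and field names ("stage1_scale", "stage2_complexity", "VisualQuestion") and the toy metadata schema are illustrative assumptions, not the released code or its API; the point is only that stage 1 derives verifiable answers directly from metadata, and stage 2 composes several such questions about the same image into a harder one whose answer stays checkable.

# Hypothetical sketch of the two-stage synthesis loop (not the released implementation).
# Stage 1 ("scale") turns image metadata into simple, verifiable questions;
# stage 2 ("complexity") composes several of them into a harder, still verifiable one.
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VisualQuestion:
    image_id: str
    question: str
    answer: str  # derived from metadata, so correctness stays checkable


def stage1_scale(metadata: List[Dict]) -> List[VisualQuestion]:
    """Stage 1: generate diverse, verifiable questions from captions/bounding boxes."""
    questions = []
    for m in metadata:
        labels = {box["label"] for box in m["boxes"]}
        for label in labels:
            count = sum(1 for box in m["boxes"] if box["label"] == label)
            questions.append(VisualQuestion(
                image_id=m["image_id"],
                question=f"How many instances of '{label}' are visible?",
                answer=str(count),
            ))
    return questions


def stage2_complexity(questions: List[VisualQuestion], k: int = 2) -> List[VisualQuestion]:
    """Stage 2: composition hardening -- merge k simpler questions about the same
    image into one multi-step question whose answer remains verifiable."""
    by_image: Dict[str, List[VisualQuestion]] = {}
    for q in questions:
        by_image.setdefault(q.image_id, []).append(q)

    hardened = []
    for image_id, qs in by_image.items():
        if len(qs) < k:
            continue
        parts = random.sample(qs, k)
        hardened.append(VisualQuestion(
            image_id=image_id,
            question=" Then: ".join(p.question for p in parts),
            answer=" ; ".join(p.answer for p in parts),
        ))
    return hardened


if __name__ == "__main__":
    toy_metadata = [{"image_id": "img_0",
                     "boxes": [{"label": "dog"}, {"label": "dog"}, {"label": "ball"}]}]
    simple = stage1_scale(toy_metadata)
    print(stage2_complexity(simple)[0])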
Comparison of our visual reasoning dataset with prominent open‑source counterparts.
Method
Overview of our two‑stage visual cognitive‑aware synthesis framework.
Demonstrations
Reasoning trace comparison between our model (post-SFT and RL) and the vanilla base model. Both models initially fail to identify the dog in the image. The base model terminates with an incorrect answer based on this flawed premise. In contrast, our model demonstrates a non-linear reasoning process: it employs self-verification and backtracking to challenge and correct its initial assessment. This correction appears to stem from a trace in which the model uses captioning and grounding as a bridge between language and vision; notably, grounding on the dog triggers the revised path via a second, "self-captioned" verification step. This behavior is notable because captions were not explicitly included in the training traces, suggesting that captioning and grounding as part of the thinking process may be an emergent capability of training on our data.

Quantitative and qualitative comparison of the post-training pipeline on our data vs pure RL on the base model. (Right) The graph illustrates the effect of scaling dataset size during online RL. The baseline (blue line), starting from an off-the-shelf model, exhibits negative scaling: performance peaks at 0.695 (10K samples) and degrades with more data. In contrast, our method (green line), which includes SFT on our high-quality data with complex reasoning traces, allows online RL to scale further. This suggests that without offline "skill teaching" via SFT, online RL fails to effectively utilize larger datasets. (Left) A qualitative example (from V* Bench), using each model's best checkpoint (indicated by a dot on the curve), highlights the resulting difference in reasoning. The baseline model fails to identify the partially obscured dog and answers incorrectly. Our model also initially expresses confusion but then self-corrects ("Wait, I'm getting conflicting information..."), showcasing a multi-step reasoning process that arrives at the correct answer. This self-correction capability, instilled by our data, is not observed in the baseline, indicating that RL alone was insufficient to elicit this behavior. Image brightness was increased for illustration purposes.

Additional qualitative example of a reasoning trace from the model post-trained on our data vs the base model.

Temporal reasoning improvement illustrated by a qualitative example of a reasoning trace from the Qwen-2.5 Omni model post-trained on our data, compared to the base Qwen-2.5 Omni model, on an unseen audio reasoning question involving joint speaking and sound events (audio_00435.wav).
Performance
Main results on vision‑centric reasoning benchmarks.
Scaling Behavior
Scaling Behavior of LPT vs Ours for SFT. Additional metadata (e.g., bounding boxes) enable diverse, controlled MCQ generation, scaling beyond 1M examples.
MCQ Complexity
Complexity estimation via multiple rollouts on synthesized MCQs using Qwen2.5‑VL as a policy. Darker green indicates easier problems.
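As a rough illustration of this estimate, the sketch below scores difficulty as one minus the empirical pass rate over repeated sampled rollouts. The "policy_answer" stub stands in for a sampled answer from the policy VLM (e.g., Qwen2.5-VL served with temperature > 0); it and the function names are assumptions for illustration, not the actual evaluation code.

# Minimal, assumed sketch of rollout-based complexity estimation (not the actual pipeline).
import random
from statistics import mean


def policy_answer(image_path: str, question: str) -> str:
    """Stub: replace with one sampled answer from the policy VLM."""
    return random.choice(["A", "B", "C", "D"])


def estimate_difficulty(image_path: str, question: str, gold: str, n_rollouts: int = 16) -> float:
    """Difficulty = 1 - empirical pass rate over n_rollouts samples.
    Values near 0 mean the policy already solves the MCQ (easy); near 1 means it rarely does."""
    passes = [policy_answer(image_path, question) == gold for _ in range(n_rollouts)]
    return 1.0 - mean(passes)


if __name__ == "__main__":
    print(estimate_difficulty("img_0.jpg", "Which object is closest to the dog?", gold="B"))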
Citation
Please consider citing this work if it helps your research.
@article{lgt2025,
  title={Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale},
  author={David Acuna and Chao-Han Huck Yang and Yuntian Deng and Jaehun Jung and Ximing Lu and Prithviraj Ammanabrolu and Hyunwoo Kim and Yuan-Hong Liao and Yejin Choi},
  journal={arXiv},
  year={2025}
}