Abstract
Turning 3D Detection into a Disciplined Next-Token Problem
To act in the world, a model must name what it sees and know where it is in 3D. LocateAnything3D is a VLM-native recipe that casts 3D detection as next-token prediction via an explicit Chain-of-Sight (CoS) sequence mirroring how humans reason from images: localize an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought before predicting calibrated 3D boxes under an easy-to-hard curriculum. Across objects, a near-to-far ordering reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This interface preserves open-vocabulary grounding and visual prompting without bespoke heads.
On the challenging Omni3D benchmark we achieve 49.89 AP3D, surpassing the previous best by +15.51 points absolute even when that baseline is given ground-truth 2D boxes. LocateAnything3D also generalizes zero-shot to held-out categories with strong robustness, turning 3D detection into a disciplined next-token problem.
Method
LocateAnything3D Chain-of-Sight Decoding
LocateAnything3D is post-trained from the Eagle 2.5 base model. With Chain-of-Sight decoding, monocular 3D perception is formatted as a compact sequence that interleaves 2D and 3D information per instance and is optimized end-to-end as next-token prediction.
Chain-of-Sight Factorization
- Input: Monocular RGB image plus free-form text query (e.g., “detect all cars,” “any chair,” “all pedestrians on the crosswalk”) optionally augmented with a visual prompt (box or click).
- CoS sequence: For each instance, the decoder emits a 2D bounding box, then the 3D box (center, size, rotation), and repeats until EOS (see the sketch after this list).
- Output: Calibrated multi-object 3D boxes in the camera frame with open-vocabulary labels and depth-colored cuboids.
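To make the interface concrete, here is a minimal sketch of how one instance's Chain-of-Sight tuple could be serialized to text. The tag names (`<obj>`, `<2d>`, `<center>`, `<size>`, `<rot>`), field layout, and number formatting are illustrative assumptions, not the model's actual token vocabulary.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    label: str             # open-vocabulary category name
    box2d: List[float]     # [x1, y1, x2, y2], normalized image coordinates
    center3d: List[float]  # [x, y, z] center from camera, meters
    dims3d: List[float]    # [w, h, l] object dimensions, meters
    rotation: List[float]  # object orientation (e.g., yaw/pitch/roll)

def _fmt(values: List[float]) -> str:
    return " ".join(f"{v:.3f}" for v in values)

def serialize_instance(obj: Instance) -> str:
    """Emit the 2D box first (the visual chain-of-thought), then the 3D box
    factorized as center -> dimensions -> rotation."""
    return (f"<obj> {obj.label} "
            f"<2d> {_fmt(obj.box2d)} "
            f"<center> {_fmt(obj.center3d)} "
            f"<size> {_fmt(obj.dims3d)} "
            f"<rot> {_fmt(obj.rotation)} </obj>")
```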
Curricula & Packaging
- Inter-object curriculum: Serialize instances in depth order, near → far, so confident objects appear early and stabilize decoding (see the sketch after this list).
- Intra-object factorization: Within each object, produce center-from-camera → dimensions → rotation to rank information by stability.
- CoS packaging: Canonical multi-box normalization, large-scale text auto-annotation, anti-hallucination negatives, and other CoS-ready supervision tools.
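A small sketch of how the inter-object curriculum could assemble the full decoder target, assuming per-instance strings come from a serializer like the one above, depth is read from the camera-frame z of each center, and `<eos>` is a placeholder terminator.

```python
from typing import List, Tuple

def build_cos_target(per_instance: List[Tuple[float, str]], eos: str = "<eos>") -> str:
    """Sort (depth, cos_string) pairs near-to-far so the least ambiguous objects
    are decoded first, then concatenate the Chain-of-Sight tuples and terminate
    with EOS."""
    ordered = sorted(per_instance, key=lambda pair: pair[0])
    return " ".join([cos for _, cos in ordered] + [eos])

# Example: the nearer car (4.2 m) is serialized before the farther pedestrian (11.7 m).
target = build_cos_target([
    (11.7, "<obj> pedestrian <2d> ... </obj>"),
    (4.2,  "<obj> car <2d> ... </obj>"),
])
```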
Architecture overview with Chain-of-Sight 3D decoding curriculum.
Highlights
Chain-of-Sight 3D Reasoning in a VLM
Chain-of-Sight 3D Reasoning
Casts 3D detection as a short, structured token sequence that mirrors human inference: first localize in 2D, then infer distance, size, and pose.
Joint 2D–3D Interface
Uses 2D bounding boxes as a visual chain-of-thought, tightly coupling 2D grounding and 3D estimation within a single autoregressive VLM decoder that supports text and visual prompts (see the prompt sketch after these highlights).
3D-Aware Curriculum
Orders detections from near to far and factorizes each 3D box into center → size → rotation, aligning supervision with what is easiest and most informative to predict.
Cross-Domain Robustness
Trained on a CoS-ready corpus (~1.74M examples) spanning indoor/outdoor scenes, driving, ARKit, synthetic environments, and more for strong zero-shot generalization.
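A hedged sketch of how a text query with an optional visual prompt might be composed on the input side; the `<box>` template and normalized-coordinate convention are assumptions for illustration, not the released prompt format.

```python
from typing import Optional, Sequence

def build_query(text: str, box_prompt: Optional[Sequence[float]] = None) -> str:
    """Compose a detection query: free-form text, optionally augmented with a
    2D box prompt given as [x1, y1, x2, y2] in normalized image coordinates."""
    if box_prompt is None:
        return text
    coords = " ".join(f"{v:.3f}" for v in box_prompt)
    return f"{text} <box> {coords} </box>"

# Example: ground a specific region with a visual prompt.
query = build_query("detect all chairs", box_prompt=[0.12, 0.30, 0.45, 0.80])
```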
LocateAnything3D Data
CoS-Ready Supervision Across Domains
We curate a camera-centric dataset that presents supervision in exactly the form the model decodes, making Chain-of-Sight learning practical.
Unify Benchmarks
ARKitScenes, SUN-RGBD, Hypersim, Objectron, KITTI, nuScenes, CA-1M, and more are harmonized into a shared schema.
Canonical Multi-Box Normalization
Ensures every cuboid is calibrated in the same camera frame for stable supervision.
Large-Scale Text Auto-Annotation
Maintains open-vocabulary prompting by auto-generating category and spatial prompts.
Anti-Hallucination Negatives
Pairs negative prompts with CoS tuples to teach the model when to abstain (see the sketch below).
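Two illustrative fragments of what this CoS packaging could look like in code, assuming a 4×4 world-to-camera extrinsic per source dataset and a simple dict sample format; neither is the released pipeline.

```python
import numpy as np

def centers_to_camera_frame(centers_world: np.ndarray, world_to_cam: np.ndarray) -> np.ndarray:
    """Canonical normalization sketch: map N x 3 box centers from a
    dataset-specific world frame into the shared camera frame via a 4 x 4
    extrinsic matrix, so every cuboid is supervised in the same frame."""
    homogeneous = np.hstack([centers_world, np.ones((len(centers_world), 1))])
    return (world_to_cam @ homogeneous.T).T[:, :3]

def make_negative_sample(image_id: str, absent_category: str) -> dict:
    """Anti-hallucination negative sketch: a prompt for a category that is not
    present in the image, paired with an empty CoS target so the model learns
    to abstain rather than hallucinate boxes."""
    return {
        "image": image_id,
        "prompt": f"detect all {absent_category}",
        "target": "<eos>",  # nothing to detect
    }
```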
Results
State-of-the-Art 3D Detection & Grounding
LocateAnything3D delivers substantial gains over prior monocular 3D detectors and VLM-based systems while preserving open-vocabulary grounding.
Omni3D Detection.
49.89 AP3D, +15.51 points over DetAny3D even when the baseline has ground-truth 2D boxes.
Indoor 3D Grounding.
Beats Cube-LLM-Large across Objectron, ARKitScenes, and SUN-RGBD using 10× less data.
Novel Category Evaluation.
No external 2D detector required; best zero-shot AP3D across KITTI, SUN, and OmniPark splits.
Data Efficiency
2D→3D Chain-of-Sight Accelerates Learning
The Chain-of-Sight formulation (blue) consistently outperforms a direct 3D baseline (purple) and reaches DetAny3D-level performance with only 10% of the training data. 2D pretraining (green) accelerates convergence relative to training from scratch (orange).
Data efficiency and training dynamics analysis.
Qualitative
Zero-Shot Chain-of-Sight Inference
Depth-colored cuboids stay consistent across in-the-wild scenes, indoor environments, and AR applications.
Zero-shot qualitative visualizations with depth-aware cuboids.
Citation
LocateAnything3D
Please cite LocateAnything3D if you find our Chain-of-Sight interface useful.
@article{man2025locateanything3d,
title = {LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight},
author = {Yunze Man and Shihao Wang and Guowen Zhang and Johan Bjorck and Zhiqi Li and Liang-Yan Gui and Jim Fan and Jan Kautz and Yu-Xiong Wang and Zhiding Yu},
journal = {arXiv preprint arXiv:2511.20648},
year = {2025},
}