OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

* equal contribution † corresponding authors
1Tsinghua University 2Xi'an Jiaotong University 3Shanghai Jiao Tong University 4Galbot
5Peking University 6Shanghai Qi Zhi Institute 7Shanghai AI Laboratory

Abstract

Spatial reasoning is a key aspect of cognitive psychology, yet it remains a major bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, judging near from far, and counting objects, these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct over 1.5K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.

Leaderboard

Accuracy scores (%) on OmniSpatial. Column abbreviations: Manip = manipulation, Motion = motion analysis, Traffic = traffic analysis, Locate = spatial localization, Geospatial = geospatial strategy, Pattern = pattern recognition, Geometric = geometric reasoning, Ego / Allo / Hypo = egocentric / allocentric / hypothetical perspective-taking.

| Model | Avg | Manip | Motion | Traffic | Locate | Geospatial | Pattern | Geometric | Ego | Allo | Hypo |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 o3 | 56.33 | 71.89 | 66.18 | 61.18 | 68.57 | 65.45 | 40.21 | 29.68 | 77.06 | 48.40 | 48.19 |
| 🥈 Gemini-2.5-pro-preview | 55.19 | 67.57 | 71.39 | 62.35 | 75.24 | 64.55 | 43.30 | 34.84 | 74.51 | 38.03 | 37.35 |
| 🥉 Gemini-2.5-flash-thinking | 53.16 | 70.27 | 64.74 | 61.18 | 72.38 | 58.18 | 35.05 | 36.13 | 74.12 | 40.96 | 32.53 |
| o4-mini | 52.77 | 72.97 | 59.83 | 60.00 | 73.33 | 61.82 | 34.02 | 36.77 | 73.53 | 40.69 | 40.96 |
| Gemini-2.5-flash-preview | 52.12 | 67.57 | 62.72 | 68.24 | 73.33 | 60.91 | 38.14 | 34.19 | 75.49 | 35.90 | 33.73 |
| GPT-4.1 | 51.78 | 66.22 | 64.74 | 60.00 | 65.33 | 60.18 | 31.75 | 30.06 | 70.98 | 40.64 | 39.04 |
| o1 | 50.36 | 71.62 | 60.98 | 57.65 | 63.81 | 60.00 | 39.18 | 27.10 | 71.57 | 38.03 | 36.14 |
| InternVL3-78B | 49.33 | 63.78 | 63.12 | 56.24 | 59.24 | 51.45 | 27.63 | 30.19 | 74.51 | 38.46 | 35.90 |
| GPT-4.1-mini | 48.87 | 64.32 | 56.53 | 59.06 | 60.19 | 56.36 | 29.28 | 30.19 | 72.55 | 39.57 | 39.28 |
| Claude-3-7-thinking | 48.62 | 57.21 | 59.73 | 53.73 | 67.94 | 57.27 | 30.24 | 28.17 | 68.63 | 37.94 | 36.95 |
| InternVL3-38B | 48.48 | 63.42 | 63.58 | 54.59 | 58.29 | 50.55 | 29.90 | 28.52 | 72.16 | 36.76 | 33.49 |
| Gemini-2.0-flash-exp | 48.40 | 61.89 | 56.01 | 51.76 | 63.43 | 59.09 | 20.82 | 33.81 | 72.75 | 39.20 | 39.28 |
| Qwen-VL2.5-72B | 47.85 | 58.38 | 60.12 | 50.12 | 59.81 | 53.64 | 26.19 | 33.03 | 71.37 | 36.81 | 36.39 |
| GPT-4o | 47.81 | 65.54 | 57.23 | 56.47 | 52.38 | 54.09 | 26.29 | 25.48 | 75.98 | 39.49 | 39.76 |
| Claude-3-7-sonnet | 47.53 | 57.57 | 55.95 | 56.71 | 63.81 | 59.09 | 29.48 | 28.39 | 72.16 | 36.06 | 36.63 |
| Qwen-VL2.5-32B | 47.36 | 63.06 | 55.09 | 51.76 | 66.29 | 56.91 | 26.39 | 27.48 | 68.04 | 37.50 | 40.24 |
| Claude-3-5-sonnet | 46.86 | 54.05 | 54.57 | 58.12 | 68.38 | 53.09 | 26.60 | 31.74 | 70.00 | 34.79 | 39.52 |
| InternVL3-14B | 45.94 | 54.32 | 60.17 | 50.35 | 51.81 | 51.45 | 28.04 | 28.26 | 68.04 | 35.37 | 34.46 |
| LLaVA-onevision-qwen2-72B | 45.66 | 62.16 | 50.29 | 54.12 | 60.95 | 56.36 | 22.68 | 25.81 | 76.47 | 37.23 | 33.73 |
| SoFar-Qwen2.5-3B | 45.14 | 56.49 | 51.16 | 54.12 | 53.14 | 52.73 | 31.75 | 22.88 | 71.60 | 36.56 | 41.69 |
| Gemma-3-27B | 44.75 | 56.76 | 55.78 | 57.65 | 50.48 | 52.73 | 27.84 | 29.03 | 64.71 | 33.51 | 32.53 |
| Gemini-2.0-flash-lite | 44.03 | 59.19 | 46.71 | 60.24 | 49.52 | 53.27 | 21.65 | 31.23 | 66.47 | 36.81 | 38.80 |
| Gemma-3-12B | 43.71 | 54.05 | 54.91 | 54.12 | 47.62 | 45.45 | 16.49 | 30.32 | 63.73 | 36.70 | 33.73 |
| GPT-4o-mini | 42.64 | 55.95 | 50.29 | 54.59 | 43.43 | 44.91 | 22.47 | 29.42 | 61.57 | 36.76 | 34.22 |
| GPT-4.1-nano | 42.62 | 50.90 | 53.85 | 54.90 | 40.95 | 42.42 | 24.40 | 30.11 | 53.59 | 37.23 | 33.73 |
| InternVL3-8B | 41.60 | 52.43 | 40.87 | 48.94 | 51.05 | 44.77 | 24.95 | 28.63 | 64.20 | 38.62 | 40.96 |
| SpaceThinker-Qwen2.5-3B | 40.42 | 47.84 | 53.06 | 43.29 | 35.43 | 38.73 | 24.33 | 28.00 | 58.04 | 35.11 | 31.08 |
| Qwen-VL2.5-3B | 40.30 | 55.41 | 47.51 | 46.12 | 42.29 | 44.73 | 32.16 | 23.87 | 59.41 | 33.30 | 30.84 |
| SpaceQwen2.5-VL-3B | 40.25 | 58.11 | 39.88 | 41.18 | 40.95 | 40.91 | 29.90 | 25.81 | 63.73 | 38.83 | 39.76 |
| Gemma-3-4B | 39.79 | 41.89 | 49.71 | 56.47 | 27.62 | 36.36 | 23.71 | 24.52 | 59.80 | 36.17 | 38.55 |
| Qwen-VL2.5-7B | 39.18 | 58.38 | 35.09 | 50.12 | 45.33 | 44.00 | 31.13 | 29.42 | 64.51 | 33.19 | 37.35 |
| InternVL3-2B | 37.98 | 50.00 | 40.58 | 43.29 | 40.00 | 40.55 | 21.86 | 28.52 | 55.49 | 35.11 | 33.01 |
| SpaceMantis-13B | 36.36 | 47.03 | 36.59 | 40.94 | 34.86 | 33.09 | 22.27 | 24.39 | 49.22 | 38.25 | 39.28 |
| RoboPoint-vicuna-7B | 35.85 | 57.03 | 28.61 | 34.82 | 37.33 | 40.55 | 29.90 | 22.71 | 50.20 | 38.72 | 40.96 |
| LLaVA-onevision-qwen2-7B | 35.68 | 43.24 | 38.15 | 32.94 | 29.52 | 41.82 | 28.87 | 22.58 | 47.06 | 36.17 | 37.35 |
| SpatialBot-3B | 35.68 | 43.24 | 38.15 | 32.94 | 29.52 | 41.82 | 28.87 | 22.58 | 47.06 | 36.17 | 37.35 |
| LLaVA-1.5-vicuna-7B | 34.97 | 54.46 | 31.23 | 35.29 | 36.19 | 33.94 | 29.01 | 24.18 | 55.60 | 34.66 | 36.14 |
| RoboPoint-vicuna-13B | 34.60 | 55.68 | 28.15 | 42.82 | 32.19 | 32.55 | 24.12 | 27.74 | 49.02 | 37.66 | 33.49 |


Tasks Demonstration

OmniSpatial provides representative examples across its four main categories of spatial reasoning.

Perspective Taking: These tasks assess the ability to understand spatial relationships from different viewpoints, including egocentric (your own view), allocentric (a global view), and hypothetical perspectives.

Dynamic Reasoning: This includes tasks that involve understanding object movement and changes, such as manipulation (operational position selection, movement direction determination, intent recognition) and motion analysis (uniform motion, variable motion, spatial compatibility).

Spatial Interaction: These tasks focus on engaging with spatial environments, including traffic analysis (anomaly detection, sign recognition, action recognition, risk detection, behavior guidance, contextual analysis) and manipulation (UI interaction, object detection, spatial localization, pose estimation, geospatial strategy).

Complex Logic: This category covers higher-level reasoning like pattern recognition (style, quantity, attributes, location) and geometric reasoning (polyhedron unfolding, sections and projections, mental rotation, assembly, analytical geometry).

These examples showcase the diversity and complexity of tasks within the OmniSpatial benchmark, which are inspired by real-life scenarios.


OmniSpatial’s 50 fine-grained tasks span dynamic motion prediction, geometric logic, real-world traffic and object-interaction analysis, map-level navigation planning, and egocentric, allocentric, and hypothetical perspective-taking over counting, size, direction, order, and distance. Together they form a single benchmark that comprehensively probes spatial reasoning, multimodal perception, and decision-making across both 2D and 3D scenes.



PointGraph

PointGraph augments spatial reasoning by injecting an explicit scene graph that encodes instance-level regions and their pairwise geometric relations. Starting from a Segment Anything decomposition, we treat every object mask as a graph node and connect nodes with edges weighted by relative position, scale, and depth cues. This graph is serialized and prepended to the image token sequence, enabling vision-language models to ground higher-order queries, such as topology or occlusion, in structured spatial features rather than raw pixels alone. Experiments on OmniSpatial show consistent improvements, especially on Dynamic Reasoning and Perspective-Taking tasks.
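To make the idea concrete, the sketch below builds such a graph from instance masks and serializes it into text. It is a minimal sketch under stated assumptions, not the released PointGraph implementation: it assumes binary masks from a Segment Anything-style model and a monocular depth map are already available, and the function names (`build_point_graph`, `serialize_graph`) and the exact feature set are illustrative choices.

```python
# Minimal PointGraph-style sketch (illustrative; not the released code).
# Assumes `masks` are binary instance masks (e.g., from a Segment Anything model)
# and `depth` is a per-pixel depth map from any monocular depth estimator.
import numpy as np

def build_point_graph(masks: list, depth: np.ndarray) -> dict:
    """Turn instance masks into nodes with centroid/scale/depth features
    and dense pairwise edges carrying relative geometric cues."""
    nodes = []
    for i, m in enumerate(masks):
        ys, xs = np.nonzero(m)
        nodes.append({
            "id": i,
            "centroid": (float(xs.mean()), float(ys.mean())),  # image-plane position
            "area": int(m.sum()),                               # proxy for object scale
            "depth": float(depth[m > 0].mean()),                # mean depth of the region
        })
    edges = []
    for a in nodes:
        for b in nodes:
            if a["id"] >= b["id"]:
                continue
            edges.append({
                "pair": (a["id"], b["id"]),
                "dx": b["centroid"][0] - a["centroid"][0],      # left/right offset
                "dy": b["centroid"][1] - a["centroid"][1],      # up/down offset
                "scale_ratio": b["area"] / max(a["area"], 1),   # relative size
                "depth_gap": b["depth"] - a["depth"],           # nearer/farther cue
            })
    return {"nodes": nodes, "edges": edges}

def serialize_graph(graph: dict) -> str:
    """Flatten the graph into text that can be prepended to the VLM prompt."""
    lines = [
        f"object {n['id']}: center={n['centroid']}, area={n['area']}, depth={n['depth']:.2f}"
        for n in graph["nodes"]
    ]
    lines += [
        f"object {e['pair'][0]} -> object {e['pair'][1]}: "
        f"dx={e['dx']:.1f}, dy={e['dy']:.1f}, "
        f"scale_ratio={e['scale_ratio']:.2f}, depth_gap={e['depth_gap']:.2f}"
        for e in graph["edges"]
    ]
    return "\n".join(lines)
```

In use, the serialized text would be concatenated with the question (or interleaved with the image tokens) before querying the VLM, so relational queries can attend to explicit geometry instead of inferring it from pixels alone.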

Overview of the PointGraph pipeline.
Comparison of textual Chain-of-Thought and PointGraph on OmniSpatial.

Spatial Chain-of-Thoughts

We further propose a Spatial Chain-of-Thoughts (Spatial CoT) paradigm that mirrors human “mental imagery.” For every query, the model iteratively synthesizes novel viewpoints with InstantMesh, verbalizes intermediate spatial hypotheses, and refines its answer after each visual imagination step. This multi-modal reasoning loop bridges textual deduction with 3D scene reconstruction, yielding marked gains on allocentric and hypothetical perspective questions. The approach is decoder-agnostic and can be plugged into existing VLMs with minimal overhead.
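The sketch below outlines one way such a loop could look, with the viewpoint synthesizer and the VLM call injected as plain callables. It is an illustrative sketch only: `render_novel_view` stands in for an InstantMesh-style image-to-3D reconstruction plus re-rendering step, `vlm_answer` for any chat-style VLM API, and the `NEED_VIEW` reply convention is an assumed prompt protocol rather than part of the released method.

```python
# Illustrative Spatial CoT loop (assumed interfaces; see lead-in above).
from typing import Any, Callable, List

def spatial_cot(
    image: Any,
    question: str,
    vlm_answer: Callable[[List[Any], str], str],     # chat-style VLM: (images, prompt) -> text
    render_novel_view: Callable[[Any, float], Any],  # e.g. InstantMesh-based novel-view synthesis
    max_steps: int = 3,
) -> str:
    views = [image]      # start from the original observation
    hypothesis = "none"  # running verbalized spatial hypothesis
    for _ in range(max_steps):
        prompt = (
            f"{question}\n\nCurrent hypothesis: {hypothesis}\n"
            "Describe the spatial layout, then answer. If a different viewpoint would help, "
            "reply with NEED_VIEW <azimuth_in_degrees> instead of a final answer."
        )
        reply = vlm_answer(views, prompt)
        if "NEED_VIEW" not in reply:
            return reply  # the model committed to an answer
        # Imagine the requested viewpoint and carry the intermediate reasoning forward.
        azimuth = float(reply.split("NEED_VIEW")[-1].split()[0])
        views.append(render_novel_view(image, azimuth))
        hypothesis = reply
    return vlm_answer(views, f"{question}\nGive your final answer now.")
```

Keeping the imagination loop outside the model in this way reflects the decoder-agnostic claim: any VLM that can follow the viewpoint-request convention could, in principle, be dropped in without retraining.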

Spatial Chain-of-Thoughts reasoning loop.
Performance of Spatial CoT on OmniSpatial Perspective-Taking track.

BibTeX

@article{omnispatial25,
  title   = {OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models},
  author  = {Mengdi Jia and Zekun Qi and Shaochen Zhang and Wenyao Zhang and Xinqiang Yu and Jiawei He and He Wang and Li Yi},
  journal = {arXiv preprint arXiv:2502.13143},
  year    = {2025}
}