OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

* equal contribution † corresponding authors
1Tsinghua University 2Xi'an Jiaotong University 3Shanghai Jiao Tong University 4Galbot
5Peking University 6Shanghai Qi Zhi Institute 7Shanghai AI Laboratory

Abstract

Spatial reasoning is a key aspect of cognitive psychology, yet it remains a major bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, judging near from far, and counting objects, these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct over 1.5K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.

Leaderboard

Accuracy scores (%) on OmniSpatial. Column abbreviations: Manip = manipulation, Motion = motion analysis, Traffic = traffic analysis, Locate = spatial localization, Geospatial = geospatial strategy, Pattern = pattern recognition, Geometric = geometric reasoning, Ego / Allo / Hypo = egocentric / allocentric / hypothetical perspective-taking.

| Model | Avg | Manip | Motion | Traffic | Locate | Geospatial | Pattern | Geometric | Ego | Allo | Hypo |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 o3 | 56.33 | 71.89 | 66.18 | 61.18 | 68.57 | 65.45 | 40.21 | 29.68 | 77.06 | 48.40 | 48.19 |
| 🥈 Gemini-2.5-pro-preview | 55.19 | 67.57 | 71.39 | 62.35 | 75.24 | 64.55 | 43.30 | 34.84 | 74.51 | 38.03 | 37.35 |
| 🥉 Gemini-2.5-flash-thinking | 53.16 | 70.27 | 64.74 | 61.18 | 72.38 | 58.18 | 35.05 | 36.13 | 74.12 | 40.96 | 32.53 |
| o4-mini | 52.77 | 72.97 | 59.83 | 60.00 | 73.33 | 61.82 | 34.02 | 36.77 | 73.53 | 40.69 | 40.96 |
| Gemini-2.5-flash-preview | 52.12 | 67.57 | 62.72 | 68.24 | 73.33 | 60.91 | 38.14 | 34.19 | 75.49 | 35.90 | 33.73 |
| GPT-4.1 | 51.78 | 66.22 | 64.74 | 60.00 | 65.33 | 60.18 | 31.75 | 30.06 | 70.98 | 40.64 | 39.04 |
| o1 | 50.36 | 71.62 | 60.98 | 57.65 | 63.81 | 60.00 | 39.18 | 27.10 | 71.57 | 38.03 | 36.14 |
| InternVL3-78B | 49.33 | 63.78 | 63.12 | 56.24 | 59.24 | 51.45 | 27.63 | 30.19 | 74.51 | 38.46 | 35.90 |
| GPT-4.1-mini | 48.87 | 64.32 | 56.53 | 59.06 | 60.19 | 56.36 | 29.28 | 30.19 | 72.55 | 39.57 | 39.28 |
| Claude-3-7-thinking | 48.62 | 57.21 | 59.73 | 53.73 | 67.94 | 57.27 | 30.24 | 28.17 | 68.63 | 37.94 | 36.95 |
| InternVL3-38B | 48.48 | 63.42 | 63.58 | 54.59 | 58.29 | 50.55 | 29.90 | 28.52 | 72.16 | 36.76 | 33.49 |
| Gemini-2.0-flash-exp | 48.40 | 61.89 | 56.01 | 51.76 | 63.43 | 59.09 | 20.82 | 33.81 | 72.75 | 39.20 | 39.28 |
| Qwen-VL2.5-72B | 47.85 | 58.38 | 60.12 | 50.12 | 59.81 | 53.64 | 26.19 | 33.03 | 71.37 | 36.81 | 36.39 |
| GPT-4o | 47.81 | 65.54 | 57.23 | 56.47 | 52.38 | 54.09 | 26.29 | 25.48 | 75.98 | 39.49 | 39.76 |
| Claude-3-7-sonnet | 47.53 | 57.57 | 55.95 | 56.71 | 63.81 | 59.09 | 29.48 | 28.39 | 72.16 | 36.06 | 36.63 |
| Qwen-VL2.5-32B | 47.36 | 63.06 | 55.09 | 51.76 | 66.29 | 56.91 | 26.39 | 27.48 | 68.04 | 37.50 | 40.24 |
| Claude-3-5-sonnet | 46.86 | 54.05 | 54.57 | 58.12 | 68.38 | 53.09 | 26.60 | 31.74 | 70.00 | 34.79 | 39.52 |
| InternVL3-14B | 45.94 | 54.32 | 60.17 | 50.35 | 51.81 | 51.45 | 28.04 | 28.26 | 68.04 | 35.37 | 34.46 |
| LLaVA-onevision-qwen2-72B | 45.66 | 62.16 | 50.29 | 54.12 | 60.95 | 56.36 | 22.68 | 25.81 | 76.47 | 37.23 | 33.73 |
| SoFar-Qwen2.5-3B | 45.14 | 56.49 | 51.16 | 54.12 | 53.14 | 52.73 | 31.75 | 22.88 | 71.60 | 36.56 | 41.69 |
| Gemma-3-27B | 44.75 | 56.76 | 55.78 | 57.65 | 50.48 | 52.73 | 27.84 | 29.03 | 64.71 | 33.51 | 32.53 |
| Gemini-2.0-flash-lite | 44.03 | 59.19 | 46.71 | 60.24 | 49.52 | 53.27 | 21.65 | 31.23 | 66.47 | 36.81 | 38.80 |
| Gemma-3-12B | 43.71 | 54.05 | 54.91 | 54.12 | 47.62 | 45.45 | 16.49 | 30.32 | 63.73 | 36.70 | 33.73 |
| GPT-4o-mini | 42.64 | 55.95 | 50.29 | 54.59 | 43.43 | 44.91 | 22.47 | 29.42 | 61.57 | 36.76 | 34.22 |
| GPT-4.1-nano | 42.62 | 50.90 | 53.85 | 54.90 | 40.95 | 42.42 | 24.40 | 30.11 | 53.59 | 37.23 | 33.73 |
| InternVL3-8B | 41.60 | 52.43 | 40.87 | 48.94 | 51.05 | 44.77 | 24.95 | 28.63 | 64.20 | 38.62 | 40.96 |
| SpaceThinker-Qwen2.5-3B | 40.42 | 47.84 | 53.06 | 43.29 | 35.43 | 38.73 | 24.33 | 28.00 | 58.04 | 35.11 | 31.08 |
| Qwen-VL2.5-3B | 40.30 | 55.41 | 47.51 | 46.12 | 42.29 | 44.73 | 32.16 | 23.87 | 59.41 | 33.30 | 30.84 |
| SpaceQwen2.5-VL-3B | 40.25 | 58.11 | 39.88 | 41.18 | 40.95 | 40.91 | 29.90 | 25.81 | 63.73 | 38.83 | 39.76 |
| Gemma-3-4B | 39.79 | 41.89 | 49.71 | 56.47 | 27.62 | 36.36 | 23.71 | 24.52 | 59.80 | 36.17 | 38.55 |
| Qwen-VL2.5-7B | 39.18 | 58.38 | 35.09 | 50.12 | 45.33 | 44.00 | 31.13 | 29.42 | 64.51 | 33.19 | 37.35 |
| InternVL3-2B | 37.98 | 50.00 | 40.58 | 43.29 | 40.00 | 40.55 | 21.86 | 28.52 | 55.49 | 35.11 | 33.01 |
| SpaceMantis-13B | 36.36 | 47.03 | 36.59 | 40.94 | 34.86 | 33.09 | 22.27 | 24.39 | 49.22 | 38.25 | 39.28 |
| RoboPoint-vicuna-7B | 35.85 | 57.03 | 28.61 | 34.82 | 37.33 | 40.55 | 29.90 | 22.71 | 50.20 | 38.72 | 40.96 |
| LLaVA-onevision-qwen2-7B | 35.68 | 43.24 | 38.15 | 32.94 | 29.52 | 41.82 | 28.87 | 22.58 | 47.06 | 36.17 | 37.35 |
| SpatialBot-3B | 35.68 | 43.24 | 38.15 | 32.94 | 29.52 | 41.82 | 28.87 | 22.58 | 47.06 | 36.17 | 37.35 |
| LLaVA-1.5-vicuna-7B | 34.97 | 54.46 | 31.23 | 35.29 | 36.19 | 33.94 | 29.01 | 24.18 | 55.60 | 34.66 | 36.14 |
| RoboPoint-vicuna-13B | 34.60 | 55.68 | 28.15 | 42.82 | 32.19 | 32.55 | 24.12 | 27.74 | 49.02 | 37.66 | 33.49 |


Tasks Demonstration

OmniSpatial provides representative examples across its four main categories of spatial reasoning.

Perspective Taking: These tasks assess the ability to understand spatial relationships from different viewpoints, including egocentric (your own view), allocentric (a global view), and hypothetical perspectives.

Dynamic Reasoning: This includes tasks that involve understanding object movement and changes, such as manipulation (operational position selection, movement direction determination, intent recognition) and motion analysis (uniform motion, variable motion, spatial compatibility).

Spatial Interaction: These tasks focus on engaging with spatial environments, including traffic analysis (anomaly detection, sign recognition, action recognition, risk detection, behavior guidance, contextual analysis) and manipulation (UI interaction, object detection, spatial localization, pose estimation, geospatial strategy).

Complex Logic: This category covers higher-level reasoning like pattern recognition (style, quantity, attributes, location) and geometric reasoning (polyhedron unfolding, sections and projections, mental rotation, assembly, analytical geometry).

These examples showcase the diversity and complexity of tasks within the OmniSpatial benchmark, which are inspired by real-life scenarios.


OmniSpatial’s 50 fine-grained tasks span dynamic motion prediction, geometric logic, real-world traffic and object-interaction analysis, map-level navigation planning, and egocentric, allocentric, and hypothetical perspective-taking over counting, size, direction, order, and distance. Together they form a single benchmark that comprehensively probes spatial reasoning, multimodal perception, and decision-making across both 2D and 3D scenes.



PointGraph

PointGraph augments spatial reasoning by injecting an explicit scene graph that encodes instance-level regions and their pairwise geometric relations. Starting from a Segment Anything decomposition, we treat every object mask as a graph node and connect nodes with edges weighted by relative position, scale, and depth cues. This graph is serialized and prepended to the image token sequence, enabling vision-language models to ground higher-order queries, such as topology or occlusion, in structured spatial features rather than raw pixels alone. Experiments on OmniSpatial show consistent improvements, especially on Dynamic Reasoning and Perspective-Taking tasks.
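To make the idea concrete, the sketch below builds such a graph from instance masks and serializes it into text. It is a minimal sketch under stated assumptions, not the released PointGraph implementation: it assumes binary masks from a Segment Anything-style model and a monocular depth map are already available, and the function names (`build_point_graph`, `serialize_graph`) and the exact feature set are illustrative choices.

```python
# Minimal PointGraph-style sketch (illustrative; not the released code).
# Assumes `masks` are binary instance masks (e.g., from a Segment Anything model)
# and `depth` is a per-pixel depth map from any monocular depth estimator.
import numpy as np

def build_point_graph(masks: list, depth: np.ndarray) -> dict:
    """Turn instance masks into nodes with centroid/scale/depth features
    and dense pairwise edges carrying relative geometric cues."""
    nodes = []
    for i, m in enumerate(masks):
        ys, xs = np.nonzero(m)
        nodes.append({
            "id": i,
            "centroid": (float(xs.mean()), float(ys.mean())),  # image-plane position
            "area": int(m.sum()),                               # proxy for object scale
            "depth": float(depth[m > 0].mean()),                # mean depth of the region
        })
    edges = []
    for a in nodes:
        for b in nodes:
            if a["id"] >= b["id"]:
                continue
            edges.append({
                "pair": (a["id"], b["id"]),
                "dx": b["centroid"][0] - a["centroid"][0],      # left/right offset
                "dy": b["centroid"][1] - a["centroid"][1],      # up/down offset
                "scale_ratio": b["area"] / max(a["area"], 1),   # relative size
                "depth_gap": b["depth"] - a["depth"],           # nearer/farther cue
            })
    return {"nodes": nodes, "edges": edges}

def serialize_graph(graph: dict) -> str:
    """Flatten the graph into text that can be prepended to the VLM prompt."""
    lines = [
        f"object {n['id']}: center={n['centroid']}, area={n['area']}, depth={n['depth']:.2f}"
        for n in graph["nodes"]
    ]
    lines += [
        f"object {e['pair'][0]} -> object {e['pair'][1]}: "
        f"dx={e['dx']:.1f}, dy={e['dy']:.1f}, "
        f"scale_ratio={e['scale_ratio']:.2f}, depth_gap={e['depth_gap']:.2f}"
        for e in graph["edges"]
    ]
    return "\n".join(lines)
```

In use, the serialized text would be concatenated with the question (or interleaved with the image tokens) before querying the VLM, so relational queries can attend to explicit geometry instead of inferring it from pixels alone.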

Overview of the PointGraph pipeline.
Comparison of textual Chain-of-Thought and PointGraph on OmniSpatial.

Spatial Chain-of-Thoughts

We further propose a Spatial Chain-of-Thoughts (Spatial CoT) paradigm that mirrors human “mental imagery.” For every query, the model iteratively synthesizes novel viewpoints with InstantMesh, verbalizes intermediate spatial hypotheses, and refines its answer after each visual imagination step. This multi-modal reasoning loop bridges textual deduction with 3D scene reconstruction, yielding marked gains on allocentric and hypothetical perspective questions. The approach is decoder-agnostic and can be plugged into existing VLMs with minimal overhead.
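The sketch below outlines one way such a loop could look, with the viewpoint synthesizer and the VLM call injected as plain callables. It is an illustrative sketch only: `render_novel_view` stands in for an InstantMesh-style image-to-3D reconstruction plus re-rendering step, `vlm_answer` for any chat-style VLM API, and the `NEED_VIEW` reply convention is an assumed prompt protocol rather than part of the released method.

```python
# Illustrative Spatial CoT loop (assumed interfaces; see lead-in above).
from typing import Any, Callable, List

def spatial_cot(
    image: Any,
    question: str,
    vlm_answer: Callable[[List[Any], str], str],     # chat-style VLM: (images, prompt) -> text
    render_novel_view: Callable[[Any, float], Any],  # e.g. InstantMesh-based novel-view synthesis
    max_steps: int = 3,
) -> str:
    views = [image]      # start from the original observation
    hypothesis = "none"  # running verbalized spatial hypothesis
    for _ in range(max_steps):
        prompt = (
            f"{question}\n\nCurrent hypothesis: {hypothesis}\n"
            "Describe the spatial layout, then answer. If a different viewpoint would help, "
            "reply with NEED_VIEW <azimuth_in_degrees> instead of a final answer."
        )
        reply = vlm_answer(views, prompt)
        if "NEED_VIEW" not in reply:
            return reply  # the model committed to an answer
        # Imagine the requested viewpoint and carry the intermediate reasoning forward.
        azimuth = float(reply.split("NEED_VIEW")[-1].split()[0])
        views.append(render_novel_view(image, azimuth))
        hypothesis = reply
    return vlm_answer(views, f"{question}\nGive your final answer now.")
```

Keeping the imagination loop outside the model in this way reflects the decoder-agnostic claim: any VLM that can follow the viewpoint-request convention could, in principle, be dropped in without retraining.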

Spatial Chain-of-Thoughts reasoning loop.
Performance of Spatial CoT on OmniSpatial Perspective-Taking track.

BibTeX

@article{omnispatial25,
  title   = {OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models},
  author  = {Mengdi Jia and Zekun Qi and Shaochen Zhang and Wenyao Zhang and Xinqiang Yu and Jiawei He and He Wang and Li Yi},
  journal = {arXiv preprint arXiv:2502.13143},
  year    = {2025}
}