Zekun Qi

I am a first-year PhD student at IIIS, Tsinghua University, under the supervision of Prof. Li Yi. Previously, I obtained my bachelor's and master's degrees from Xi'an Jiaotong University, supervised by Prof. Andrew C. Yao and Prof. Kaisheng Ma. I collaborate closely with Prof. He Wang and Runpei Dong.

I am currently a research intern at GalBot.

My research focuses on Embodied Intelligence, Agentic AI and 3D Computer Vision.

Publications

* indicates equal contribution

Selected Works
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia*, Zekun Qi*, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
International Conference on Learning Representations (ICLR), 2026

Grounded in cognitive psychology, we introduce a comprehensive and challenging spatial reasoning benchmark comprising 50 detailed categories and 1.5K manually labeled QA pairs.

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Conference on Neural Information Processing Systems (NeurIPS), 2025 Spotlight

We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language.

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Conference on Neural Information Processing Systems (NeurIPS), 2025

We recast the vision–language–action model as a perception–prediction–action model and have it explicitly predict compact dynamic, spatial, and high-level semantic information.

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
European Conference on Computer Vision (ECCV), 2024

We present ShapeLLM, the first 3D Multimodal Large Language Model designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language.

DreamLLM: Synergistic Multimodal Comprehension and Creation
International Conference on Learning Representations (ICLR), 2024 Spotlight

We present DreamLLM, the first learning framework to achieve versatile Multimodal Large Language Models empowered by the frequently overlooked synergy between multimodal comprehension and creation.

ReCon
Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
International Conference on Machine Learning (ICML), 2023

We propose contrast guided by reconstruction to mitigate the pattern differences between the two self-supervised paradigms.

ACT
Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?
International Conference on Learning Representations (ICLR), 2023

We propose to use autoencoders as cross-modal teachers to transfer dark knowledge into 3D representation learning.

By Date

2025

GS-Reasoner
Reasoning in Space via Grounding in the World
International Conference on Learning Representations (ICLR), 2026

We view grounding as a chain of thought for spatial reasoning, and based on this we achieve new state-of-the-art performance on VSI-Bench.

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia*, Zekun Qi*, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
International Conference on Learning Representations (ICLR), 2026

Grounded in cognitive psychology, we introduce a comprehensive and challenging spatial reasoning benchmark comprising 50 detailed categories and 1.5K manually labeled QA pairs.

MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning
ArXiv Preprint, 2025

We present MM-Nav, a multi-view VLA system with 360° perception, trained on large-scale expert navigation data collected from multiple reinforcement learning agents.

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Conference on Neural Information Processing Systems (NeurIPS), 2025 Spotlight

We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language.

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Conference on Neural Information Processing Systems (NeurIPS), 2025

We recast the vision–language–action model as a perception–prediction–action model and have it explicitly predict compact dynamic, spatial, and high-level semantic information.

DexVLG: Dexterous Vision-Language-Grasp Model at Scale
International Conference on Computer Vision (ICCV), 2025 Highlight

We introduce DexVLG, a vision-language-grasp model trained on a 170M-pose, 174k-object dataset, which generates instruction-aligned dexterous grasp poses.

Hybrid-Depth
Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
Wenyao Zhang, Hongsi Liu, Bohan Li, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin
International Conference on Computer Vision (ICCV), 2025

We introduce Hybrid-Depth, a self-supervised method that aligns hybrid-grained semantics via language-guided fusion, achieving SOTA accuracy on KITTI.

PPT
Positional Prompt Tuning for Efficient 3D Representation Learning
Shaochen Zhang*, Zekun Qi*, Runpei Dong, Xiuxiu Bai, Xing Wei
ACM International Conference on Multimedia (ACMMM), 2025 Oral

We rethink the role of positional encoding in 3D representation learning, and propose Positional Prompt Tuning for transfer learning.

DreamBench++
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
International Conference on Learning Representations (ICLR), 2025

We collect diverse images and prompts, and utilize GPT-4o for automated evaluation aligned with human preference.

2024

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
European Conference on Computer Vision (ECCV), 2024

We present ShapeLLM, the first 3D Multimodal Large Language Model designed for embodied interaction.

Point-GCC
Point-GCC: Universal Self-supervised 3D Scene Pre-training via Geometry-Color Contrast
ACM International Conference on Multimedia (ACMMM), 2024

We enhance the utilization of color information to improve 3D scene self-supervised learning.

DreamLLM: Synergistic Multimodal Comprehension and Creation
International Conference on Learning Representations (ICLR), 2024 Spotlight

We present DreamLLM, the first learning framework to achieve versatile Multimodal Large Language Models.

2023

VPP
VPP: Efficient Conditional 3D Generation via Voxel-Point Progressive Representation
Conference on Neural Information Processing Systems (NeurIPS), 2023

We achieve rapid, multi-category 3D conditional generation by combining the merits of different representations. VPP can generate 3D shapes in less than 0.2s.

ReCon
Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
International Conference on Machine Learning (ICML), 2023

We propose contrast guided by reconstruction to mitigate the pattern differences between the two self-supervised paradigms.

ACT
Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?
International Conference on Learning Representations (ICLR), 2023

We propose to use autoencoders as cross-modal teachers to transfer dark knowledge into 3D representation learning.

Honors and Awards