We recast the vision–language–action model as a perception–prediction–action model and make the model explicitly predict a compact set of dynamic, spatial, and high-level semantic information, supplying concise yet comprehensive look-ahead cues for planning.
We introduce DexVLG, a vision-language-grasp model trained on a 170M-pose, 174k-object dataset, which generates instruction-aligned dexterous grasp poses and achieves SOTA success rates and part-grasp accuracy.
We introduce Hybrid-depth, a self-supervised method that aligns hybrid semantics via language-guided fusion, achieving SOTA accuracy on KITTI and boosting downstream perception.
We rethink the role of positional encoding in 3D representation learning and propose Positional Prompt Tuning, a simple but efficient method for transfer learning.
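To make the idea concrete, here is a minimal illustrative sketch (not the released implementation) of the parameter-efficient recipe the name suggests: freeze a pre-trained point-cloud transformer and train only its positional-encoding MLP plus a small task head. `TinyPointTransformer` and its attribute names are hypothetical stand-ins, not the paper's actual code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained point-cloud transformer; only the
# overall structure matters for this sketch, not the real architecture.
class TinyPointTransformer(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        # Positional-encoding MLP: maps token-center coordinates to embeddings.
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feats, xyz):
        # feats: (B, N, dim) token features, xyz: (B, N, 3) token centers.
        return self.blocks(feats + self.pos_mlp(xyz)).mean(dim=1)

model = TinyPointTransformer()
# Parameter-efficient transfer: freeze everything except the positional MLP.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("pos_mlp")

head = nn.Linear(384, 40)  # e.g. a 40-way shape-classification head
trainable = [p for p in model.parameters() if p.requires_grad] + list(head.parameters())
optimizer = torch.optim.AdamW(trainable, lr=5e-4, weight_decay=0.05)
```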
Grounded in cognitive psychology, we introduce a comprehensive and challenging spatial reasoning benchmark covering 50 fine-grained categories with 1.5K manually labeled QA pairs.
We present ShapeLLM, the first 3D Multimodal Large Language Model designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language.
Point-GCC: Universal Self-supervised 3D Scene Pre-training via Geometry-Color Contrast
Guofan Fan, Zekun Qi, Wenkai Shi, Kaisheng Ma
ACM International Conference on Multimedia (ACMMM), 2024 [arXiv][Code]
We make better use of color information to improve self-supervised 3D scene pre-training.
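As a rough illustration of the geometry-color contrast idea (a sketch under our own simplifying assumptions, not the paper's exact loss), the snippet below applies a symmetric InfoNCE objective between geometry-branch and color-branch embeddings of the same samples.

```python
import torch
import torch.nn.functional as F

def geometry_color_contrast(geo_feat, col_feat, temperature=0.07):
    """Symmetric InfoNCE between geometry and color embeddings.
    geo_feat, col_feat: (B, D); row i of each tensor comes from the same
    sample, so the diagonal of the similarity matrix holds the positives."""
    geo = F.normalize(geo_feat, dim=-1)
    col = F.normalize(col_feat, dim=-1)
    logits = geo @ col.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(geo.size(0), device=geo.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: random embeddings standing in for the two branch outputs.
loss = geometry_color_contrast(torch.randn(8, 256), torch.randn(8, 256))
```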
We present DreamLLM, a learning framework that achieves the first versatile Multimodal Large Language Models empowered with the frequently overlooked synergy between multimodal comprehension and creation.
2023
VPP⚡: Efficient Conditional 3D Generation via Voxel-Point Progressive Representation
Zekun Qi*, Muzhou Yu*, Runpei Dong, Kaisheng Ma
Conference on Neural Information Processing Systems (NeurIPS), 2023 [arXiv][Code][OpenReview]
We achieve rapid, multi-category conditional 3D generation by combining the merits of different representations. VPP can generate 3D shapes in less than 0.2 s on a single RTX 2080 Ti.