ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Zekun Qi1    Runpei Dong1     Shaochen Zhang1     Haoran Geng2     Chunrui Han3     Zheng Ge3     He Wang2     Li Yi4     Kaisheng Ma4
1Xi'an Jiaotong University    2Peking University    3Megvii Technology    4Tsinghua University
Pipeline
🔥Highlights
  1. ShapeLLM is the first 3D Multimodal Large Language Model designed for embodied interaction.

  2. ShapeLLM supports single-view colored point cloud inputs, which can be effortlessly obtained from RGB-D cameras (see the sketch after this list).

  3. We introduce a robust 3D QA benchmark, 3D MM-Vet, covering evaluation variants such as single-view inputs and noise jitter.

  4. We extend ReCon to ReCon++, a powerful point cloud encoder that achieves state-of-the-art performance across a range of representation learning tasks.
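To make the single-view input concrete, here is a minimal sketch of lifting one RGB-D frame into a colored point cloud. It assumes pinhole intrinsics (fx, fy, cx, cy) and a depth map in meters; it is an illustration, not the project's actual preprocessing code.

import numpy as np

def rgbd_to_colored_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth map (H, W) in meters and an aligned RGB image
    (H, W, 3) into an (N, 6) colored point cloud [x, y, z, r, g, b]."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    valid = depth > 0                               # drop pixels with missing depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx                    # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    xyz = np.stack([x, y, z], axis=-1)
    colors = rgb[valid].astype(np.float32) / 255.0  # normalize colors to [0, 1]
    return np.concatenate([xyz, colors], axis=-1)

The resulting (N, 6) array is exactly the kind of single-view colored point cloud that can be fed to a point cloud encoder after the usual sampling and normalization steps.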

Motivation
What makes a better 3D representation that bridges language models and interaction-oriented 3D object understanding?
  1. 3D Point Clouds as Inputs. Compared to 2D images, 3D point clouds provide a more accurate representation of the physical environment, encapsulating sparse yet highly precise geometric data. Moreover, 3D point clouds are crucial for embodied interaction tasks that require accurate 3D structure, such as 6-DoF object pose estimation.

  2. Selective Multi-View Distillation. Interacting with objects typically requires an intricate 3D understanding that involves knowledge at various levels and granularities. For instance, a whole-part, high-level semantic understanding is needed for interactions like opening a large cabinet, while detailed, high-resolution (i.e., low-level) semantics are crucial for smaller parts, such as manipulating a drawer handle. However, existing works mainly distill single-view, high-resolution object features from 2D foundation models; although complementary, such features cover only a single level of semantics. The potential of multi-view images, which offer abundant multi-level features thanks to view variation and geometric consistency, is often neglected (a simplified distillation sketch follows this list).

  3. 3D Visual Instruction Tuning. Instruction tuning has proven effective in improving LLMs' alignment capability. To support various 3D understanding tasks through a universal language interface, ShapeLLM is trained by instruction-following tuning on constructed language-output data. However, the data-scarcity issue is even more severe than in 2D visual instruction tuning, since no large-scale object-level instruction data is available. To validate the efficacy of ShapeLLM, we first construct ~45K instruction-following samples using GPT-4V on the processed Objaverse dataset, together with 30K embodied part-understanding samples from GAPartNet, for supervised fine-tuning.
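As a rough, simplified sketch of multi-view distillation (not ShapeLLM's exact ReCon++ implementation), the snippet below lets learnable view queries cross-attend to 3D point tokens and distills the result toward pooled per-view features from a 2D foundation model. All module names, shapes, and the cosine loss are illustrative assumptions, and the "selective" step that decides which views align with which queries is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewDistillHead(nn.Module):
    """Toy head: learnable view queries cross-attend to 3D point tokens,
    and the resulting features are distilled toward multi-view 2D teacher features."""
    def __init__(self, dim=768, num_views=4, num_heads=8):
        super().__init__()
        self.view_queries = nn.Parameter(torch.randn(num_views, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, point_tokens, view_feats):
        # point_tokens: (B, N, C) tokens from the 3D point cloud encoder
        # view_feats:   (B, V, C) pooled per-view features from a 2D foundation model
        B = point_tokens.size(0)
        queries = self.view_queries.unsqueeze(0).expand(B, -1, -1)   # (B, V, C)
        pred, _ = self.cross_attn(queries, point_tokens, point_tokens)
        # Cosine-similarity distillation loss, averaged over views and the batch.
        loss = 1.0 - F.cosine_similarity(pred, view_feats, dim=-1).mean()
        return loss

The design intent is that each query absorbs semantics at the level and granularity of its corresponding view, so the 3D encoder learns multi-level features rather than only single-view, single-level ones.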
Gallery
*conversations generated with instructions provided by our users
Single-View Point Cloud Understanding
Scene Understanding
Planning & Task Decomposition
Representation Learning
Embodied Visual Grounding
Precise Referring Dialogue
3D Captioning
Vision Question Answering
ReCon++
ReCon++ is a powerful point cloud encoder architecture that achieves state-of-the-art performance across a range of representation learning tasks, including fine-tuned, few-shot, and zero-shot 3D recognition.
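For intuition, zero-shot 3D recognition with an encoder aligned to a text embedding space can be sketched as nearest-neighbor matching between a point cloud embedding and prompt embeddings such as "a point cloud of a chair". The function below is a hypothetical illustration under that assumption, not ReCon++'s released API.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(point_feat, text_feats, class_names):
    """point_feat: (C,) embedding of one point cloud from the 3D encoder.
    text_feats: (K, C) embeddings of text prompts, one per candidate class.
    Returns the predicted class name by cosine similarity."""
    point_feat = F.normalize(point_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = point_feat @ text_feats.T            # (K,) cosine similarities
    return class_names[int(logits.argmax())]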

3D MM-Vet
3D MM-Vet is the first 3D multimodal comprehension evaluation benchmark, covering tasks at five different capability levels.

Citation
@article{qi2024shapellm,
  author = {Qi, Zekun and Dong, Runpei and Zhang, Shaochen and Geng, Haoran and Han, Chunrui and Ge, Zheng and Wang, He and Yi, Li and Ma, Kaisheng},
  title = {ShapeLLM: Universal 3D Object Understanding for Embodied Interaction},
  journal = {arXiv preprint arXiv:2402.17766},
  year = {2024},
}