ShapeLLM: Universal 3D Object Understanding for Embodied Interaction     ECCV 2024

Zekun Qi1    Runpei Dong1     Shaochen Zhang1     Haoran Geng2     Chunrui Han3     Zheng Ge3     Li Yi4     Kaisheng Ma4
1Xi'an Jiaotong University    2Peking University    3Megvii Technology    4Tsinghua University
  1. ShapeLLM is the first 3D Multimodal Large Language Model designed for embodied interaction.

  2. ShapeLLM supports single-view colored point cloud input, which can be effortlessly obtained from RGBD cameras.

  3. We introduce 3D MM-Vet, a robust 3D QA benchmark that covers multiple test variants, including single-view inputs and noise jitter.

  4. We extend ReCon to ReCon++, a powerful point encoder architecture that achieves state-of-the-art performance across a range of representation learning tasks.
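The single-view colored point clouds mentioned in highlight 2 can be obtained from an RGBD frame by standard pinhole back-projection. Below is a minimal numpy sketch of that step; the intrinsics, array shapes, and function name are illustrative assumptions, not ShapeLLM's actual preprocessing code:

```python
import numpy as np

def rgbd_to_colored_pointcloud(depth, rgb, fx, fy, cx, cy):
    """Back-project an RGBD frame into an (N, 6) colored point cloud.

    depth: (H, W) metric depth in meters; rgb: (H, W, 3) colors in [0, 1].
    (fx, fy, cx, cy) are standard pinhole camera intrinsics.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    valid = xyz[:, 2] > 0  # drop pixels with no depth reading
    return np.concatenate([xyz[valid], colors[valid]], axis=1)
```

A single such frame from a consumer RGBD camera is all the 3D input the model assumes, which is what makes the setting practical for embodied agents.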

What makes better 3D representations that bridge language models and interaction-oriented 3D object understanding?
  1. 3D Point Clouds as Inputs. Compared to 2D images, 3D point clouds provide a more accurate representation of the physical environment, encapsulating sparse yet highly precise geometric data. Moreover, 3D point clouds are crucial for embodied interactions that require accurate 3D structure, such as 6-DoF object pose estimation.

  2. Selective Multi-View Distillation. Interacting with objects typically requires an intricate 3D understanding that spans multiple levels and granularities. For instance, a whole-part high-level semantic understanding is needed for interactions like opening a large cabinet, while detailed, high-resolution (i.e., low-level) semantics are crucial for smaller parts, such as manipulating a drawer handle. However, existing works mainly distill single-view, high-resolution object features from 2D foundation models, which captures only part of this spectrum. The potential of multi-view images, which offer abundant multi-level features thanks to view variation and geometric consistency, is often neglected.

  3. 3D Visual Instruction Tuning. Instruction tuning has proven effective at improving LLMs' alignment capability. To support diverse 3D understanding tasks through a universal language interface, ShapeLLM is trained by instruction-following tuning on constructed language-output data. However, as in 2D visual instruction tuning, data scarcity is a challenge, and it is even more severe in 3D, where no object-level instruction data is available. To this end, we construct ~45K instruction-following samples using the advanced GPT-4V on the processed Objaverse dataset, together with 30K embodied part understanding samples from GAPartNet, for supervised fine-tuning.
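The selective multi-view idea in point 2 can be sketched as point-cloud tokens attending over pooled features from several rendered views, so each token picks the view (and thus the granularity) most relevant to it. The single-head attention and the shapes below are illustrative assumptions, not ReCon++'s exact architecture:

```python
import numpy as np

def selective_multiview_distill(point_tokens, view_feats):
    """Toy selective multi-view distillation step.

    point_tokens: (T, D) query tokens from the 3D encoder.
    view_feats:   (V, D) one pooled 2D feature per rendered view.
    Returns (T, D) per-token distillation targets: each token takes a
    softmax-weighted mixture of the views it attends to most.
    """
    d = point_tokens.shape[1]
    scores = point_tokens @ view_feats.T / np.sqrt(d)      # (T, V)
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = scores / scores.sum(axis=1, keepdims=True)      # softmax over views
    return attn @ view_feats                               # (T, D)
```

The design choice here is that view selection is learned per token rather than fixed, so coarse whole-object queries and fine part-level queries can draw on different views of the same shape.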
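For point 3, an instruction-following sample pairs a point cloud with a multi-turn conversation. The schema below follows the style of 2D visual instruction tuning (LLaVA-like) and is purely illustrative; the field names, placeholder token, and ids are assumptions, not ShapeLLM's released data format:

```python
# One hypothetical supervised fine-tuning sample: a point cloud plus a
# conversation whose answers are generated by GPT-4V (for Objaverse) or
# derived from part annotations (for GAPartNet).
sample = {
    "object_id": "objaverse_000042",        # hypothetical identifier
    "point_cloud": "objaverse_000042.npy",  # (N, 6) array: xyz + rgb
    "conversations": [
        {"from": "human",
         "value": "<point>\nHow would you open this cabinet?"},
        {"from": "gpt",
         "value": "Grasp the handle on the left door and pull it toward "
                  "you while keeping the hinge side fixed."},
    ],
}
```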
*Conversations generated with instructions provided by our users.
Single-View Point Cloud Understanding
Scene Understanding
Planning & Task Decomposition
Representation Learning
Embodied Visual Grounding
Precise Referring Dialogue
3D Captioning
Vision Question Answering
ReCon++ is a powerful point encoder architecture that achieves state-of-the-art performance across a range of representation learning tasks, including fine-tuned, few-shot, and zero-shot 3D recognition.
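The zero-shot setting can be sketched as CLIP-style embedding matching: encode the point cloud once, encode each candidate label as text, and pick the label with the highest cosine similarity. The sketch below shows only the matching step and assumes the two encoders (e.g., ReCon++ and a text tower) are given; it is not ReCon++'s evaluation code:

```python
import numpy as np

def zero_shot_classify(point_feat, text_feats, labels):
    """Pick the label whose text embedding is closest (by cosine
    similarity) to the point-cloud embedding.

    point_feat: (D,) embedding of the input point cloud.
    text_feats: (C, D) embeddings of C candidate label prompts.
    labels:     list of C label strings.
    """
    p = point_feat / np.linalg.norm(point_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = t @ p  # (C,) cosine similarities
    return labels[int(np.argmax(sims))]
```

Because no classifier head is trained, the same encoder transfers to new label sets by only changing the text prompts.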

3D MM-Vet
3D MM-Vet is the first 3D multimodal comprehension evaluation benchmark; it comprises tasks at five different capability levels.

BibTeX
@article{qi2024shapellm,
  author  = {Qi, Zekun and Dong, Runpei and Zhang, Shaochen and Geng, Haoran and Han, Chunrui and Ge, Zheng and Yi, Li and Ma, Kaisheng},
  title   = {ShapeLLM: Universal 3D Object Understanding for Embodied Interaction},
  journal = {arXiv preprint arXiv:2402.17766},
  year    = {2024},
}