ShapeLLM

Universal 3D Object Understanding for Embodied Interaction

ECCV 2024

Zekun Qi1   Runpei Dong1   Shaochen Zhang1   Haoran Geng2   Chunrui Han3   Zheng Ge3   Li Yi4   Kaisheng Ma4
1Xi'an Jiaotong University   2Peking University   3Megvii Technology   4Tsinghua University

Pipeline

Highlights

  1. ShapeLLM is the first 3D Multimodal Large Language Model designed for embodied interaction.
  2. ShapeLLM supports single-view colored point cloud input, which can be readily obtained from RGB-D cameras (see the sketch after this list).
  3. We introduce 3D MM-Vet, a robust 3D QA benchmark that covers challenging variants such as single-view inputs and noise jitter.
  4. We extend the powerful point encoder architecture to ReCon++, achieving state-of-the-art performance across a range of representation learning tasks.
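
As a rough illustration of highlight 2, the sketch below back-projects an RGB-D frame into a single-view colored point cloud using plain NumPy. It is a minimal sketch under an assumed pinhole camera model; the function name, usage paths, and intrinsics are hypothetical and are not part of the ShapeLLM codebase.

import numpy as np

def rgbd_to_colored_point_cloud(depth, rgb, fx, fy, cx, cy, depth_scale=1000.0):
    # depth: (H, W) depth map in millimeters; rgb: (H, W, 3) aligned color image.
    # Returns an (N, 6) array of XYZRGB points in the camera frame.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32) / depth_scale      # millimeters -> meters
    valid = z > 0                                   # keep pixels with measured depth
    x = (u - cx) * z / fx                           # pinhole back-projection
    y = (v - cy) * z / fy
    xyz = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    colors = rgb[valid].astype(np.float32) / 255.0  # normalize RGB to [0, 1]
    return np.concatenate([xyz, colors], axis=-1)

# Usage with hypothetical intrinsics of a 640x480 RGB-D camera:
# points = rgbd_to_colored_point_cloud(depth, rgb, fx=525.0, fy=525.0, cx=319.5, cy=239.5)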

Motivation

What makes better 3D representations that bridge language models and interaction-oriented 3D object understanding?

  1. 3D Point Clouds as Inputs. Compared to 2D images, 3D point clouds provide a more accurate representation of the physical environment, encapsulating sparse yet highly precise geometric data. They are also crucial for embodied interactions that require accurate 3D structure, such as 6-DoF object pose estimation.
  2. Selective Multi-View Distillation. Interacting with objects typically requires 3D understanding at multiple levels and granularities. For instance, whole-part high-level semantics suffice for interactions like opening a large cabinet, whereas detailed, high-resolution (i.e., low-level) semantics are crucial for finer interactions such as manipulating a drawer handle.
  3. 3D Visual Instruction Tuning. Instruction tuning has proven effective at improving LLMs' alignment capability. To support various 3D understanding tasks through a universal language interface, ShapeLLM is trained by instruction-following tuning on constructed language-output data. For supervised fine-tuning, we construct ~45K instruction-following samples with GPT-4V on the processed Objaverse dataset and 30K embodied part understanding samples from GAPartNet (a hypothetical sample format is sketched after this list).
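
To make the instruction-following data format concrete, here is a minimal, hypothetical example of one training sample; the field names, paths, and conversation content are illustrative placeholders, not the released ShapeLLM schema.

# One hypothetical sample for 3D visual instruction tuning (supervised fine-tuning).
sample = {
    "point_cloud": "objaverse/processed/chair_01.npy",  # hypothetical path to an (N, 6) XYZRGB point cloud
    "conversations": [
        {"from": "human", "value": "<point>\nWhich parts of this chair can be grasped to move it?"},
        {"from": "gpt", "value": "The top rail of the backrest and the two armrests offer stable grasps."},
    ],
}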

Gallery

*Conversations are generated with instructions provided by our users.

Single-View Point Cloud Understanding
Scene Understanding
Planning & Task Decomposition
Representation Learning
Embodied Visual Grounding
Precise Referring Dialogue
3D Captioning
Vision Question Answering

ReCon++

ReCon++ is a powerful point encoder architecture that achieves state-of-the-art performance across a range of representation learning tasks: fine-tuned, few-shot, and zero-shot 3D recognition. A minimal zero-shot recognition sketch follows.
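
As an illustration of the zero-shot setting, the sketch below matches a point-cloud feature against text features of candidate category names, CLIP-style. The encoder interfaces (point_encoder, text_encoder) are hypothetical placeholders, not the actual ReCon++ API.

import torch.nn.functional as F

def zero_shot_classify(point_cloud, category_names, point_encoder, text_encoder):
    # point_encoder: hypothetical 3D encoder mapping a point cloud to a (D,) feature.
    # text_encoder: hypothetical aligned text tower mapping prompts to (C, D) features.
    point_feat = point_encoder(point_cloud)                                   # (D,)
    prompts = ["a point cloud of a " + name for name in category_names]
    text_feats = text_encoder(prompts)                                        # (C, D)
    sims = F.cosine_similarity(point_feat.unsqueeze(0), text_feats, dim=-1)   # (C,)
    return category_names[sims.argmax().item()]                               # best-matching label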

3D MM-Vet

3D MM-Vet is the first 3D multimodal comprehension evaluation benchmark, which includes five different levels of tasks.

Citation

@article{qi2024shapellm,
  author = {Qi, Zekun and Dong, Runpei and Zhang, Shaochen and Geng, Haoran and Han, Chunrui and Ge, Zheng and Yi, Li and Ma, Kaisheng},
  title  = {ShapeLLM: Universal 3D Object Understanding for Embodied Interaction},
  journal = {arXiv preprint arXiv:2402.17766},
  year   = {2024},
}