We recast the vision–language–action model as a perception–prediction–action model and make the model explicitly predict a compact set of dynamic, spatial, and high-level semantic information, supplying concise yet comprehensive look-ahead cues for planning.
We introduce DexVLG, a vision-language-grasp model trained on a 170M-pose, 174k-object dataset, which generates instruction-aligned dexterous grasp poses and achieves SOTA success rates and part-grasp accuracy.
We introduce Hybrid-depth, a self-supervised method that aligns hybrid semantics via language-guided fusion, achieving SOTA accuracy on KITTI and boosting downstream perception.
We rethink the role of positional encoding in 3D representation learning and propose Positional Prompt Tuning, a simple but efficient method for transfer learning.
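To make the idea concrete, here is a minimal illustrative sketch (not the released implementation) of the parameter-efficient recipe the name suggests: freeze a pre-trained point-cloud transformer and train only its positional-encoding MLP plus a small task head. `TinyPointTransformer` and its attribute names are hypothetical stand-ins, not the paper's actual code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained point-cloud transformer; only the
# overall structure matters for this sketch, not the real architecture.
class TinyPointTransformer(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        # Positional-encoding MLP: maps token-center coordinates to embeddings.
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feats, xyz):
        # feats: (B, N, dim) token features, xyz: (B, N, 3) token centers.
        return self.blocks(feats + self.pos_mlp(xyz)).mean(dim=1)

model = TinyPointTransformer()
# Parameter-efficient transfer: freeze everything except the positional MLP.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("pos_mlp")

head = nn.Linear(384, 40)  # e.g. a 40-way shape-classification head
trainable = [p for p in model.parameters() if p.requires_grad] + list(head.parameters())
optimizer = torch.optim.AdamW(trainable, lr=5e-4, weight_decay=0.05)
```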
Grounded in cognitive psychology, we introduce a comprehensive and challenging spatial reasoning benchmark covering 50 fine-grained categories with 1.5K manually labeled QA pairs.
We present ShapeLLM, the first 3D Multimodal Large Language Model designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language.
Point-GCC: Universal Self-supervised 3D Scene Pre-training via Geometry-Color Contrast
Guofan Fan, Zekun Qi, Wenkai Shi, Kaisheng Ma
ACM International Conference on Multimedia (ACMMM), 2024 [arXiv][Code]
We make better use of color information to improve self-supervised 3D scene pre-training.
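As a rough illustration of the geometry-color contrast idea (a sketch under our own simplifying assumptions, not the paper's exact loss), the snippet below applies a symmetric InfoNCE objective between geometry-branch and color-branch embeddings of the same samples.

```python
import torch
import torch.nn.functional as F

def geometry_color_contrast(geo_feat, col_feat, temperature=0.07):
    """Symmetric InfoNCE between geometry and color embeddings.
    geo_feat, col_feat: (B, D); row i of each tensor comes from the same
    sample, so the diagonal of the similarity matrix holds the positives."""
    geo = F.normalize(geo_feat, dim=-1)
    col = F.normalize(col_feat, dim=-1)
    logits = geo @ col.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(geo.size(0), device=geo.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: random embeddings standing in for the two branch outputs.
loss = geometry_color_contrast(torch.randn(8, 256), torch.randn(8, 256))
```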
We present DreamLLM, a learning framework that achieves the first versatile Multimodal Large Language Models empowered with the frequently overlooked synergy between multimodal comprehension and creation.
2023
VPP⚡: Efficient Conditional 3D Generation via Voxel-Point Progressive Representation
Zekun Qi*, Muzhou Yu*, Runpei Dong, Kaisheng Ma
Conference on Neural Information Processing Systems (NeurIPS), 2023 [arXiv][Code][OpenReview]
We achieve rapid, multi-category conditional 3D generation by combining the merits of different representations. VPP can generate 3D shapes in less than 0.2 s on a single RTX 2080 Ti.