1. Complex robotic manipulation tasks, such as "uprighting a tilted wine glass" or "plugging a cord into a power strip," are constrained by the understanding of object orientation.
2. We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language, such as the orientation of "top," "handle," or "pouring water."
3. We construct OrienText300K, a large paired dataset of point clouds, text, and orientations, and train PointSO, the first Open-Vocabulary Orientation Model.
4. Based on PointSO, we propose SoFar, the first 6-DoF spatial understanding LLM, which achieves a 13.1% performance improvement on the 6-DoF object rearrangement task and a 47.2% improvement over OpenVLA on the SimplerEnv benchmark.
5. We propose two benchmarks, Open6DOR V2 and 6-DoF SpatialBench, which evaluate 6-DoF rearrangement and 6-DoF spatial understanding capabilities, respectively.
SoFar can perform a variety of complex object manipulation tasks involving spatial relationships and re-orientation, and it generalizes across different embodiments, such as dexterous hands.
@article{sofar25,
author = {Zekun Qi and
Wenyao Zhang and
Yufei Ding and
Runpei Dong and
Xinqiang Yu and
Jingwen Li and
Lingyun Xu and
Baoyu Li and
Xialin He and
Guofan Fan and
Jiazhao Zhang and
Jiawei He and
Jiayuan Gu and
Xin Jin and
Kaisheng Ma and
Zhizheng Zhang and
He Wang and
Li Yi},
title = {SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and
Object Manipulation},
journal = {CoRR},
volume = {abs/2502.13143},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2502.13143},
doi = {10.48550/ARXIV.2502.13143},
eprinttype = {arXiv},
eprint = {2502.13143}
}