SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

* equal contribution † corresponding authors
1Tsinghua University 2Shanghai Jiao Tong University 3Galbot 4Peking University 5UIUC
6ShanghaiTech University 7Eastern Institute of Technology 8Shanghai Qi Zhi Institute 9Shanghai AI Laboratory

Highlight

1. Complex robotic manipulation tasks are often constrained by a limited understanding of object orientation, such as "uprighting a tilted wine glass" or "plugging a cord into a power strip."

2. We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language, such as the orientation of "top," "handle," or "pouring water."

3. We construct OrienText300K, a large-scale paired dataset of point clouds, text, and orientations, and train PointSO, the first open-vocabulary orientation model.

4. Based on PointSO, we propose SoFar, the first 6-DoF spatial understanding LLM, which achieves a 13.1% performance improvement on the 6-DoF object rearrangement task and a 47.2% improvement over OpenVLA on the SimplerEnv benchmark.

5. We propose two benchmarks, Open6DOR V2 and 6-DoF SpatialBench, which evaluate 6-DoF rearrangement capability and 6-DoF spatial understanding capability, respectively.
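To make the idea of semantic orientation concrete, here is a minimal sketch of how a language-conditioned orientation query could be used downstream. The function `predict_semantic_orientation` is a hypothetical stand-in, not the released PointSO API; it returns canned unit vectors purely for illustration.

```python
import numpy as np

def predict_semantic_orientation(points: np.ndarray, query: str) -> np.ndarray:
    """Toy stand-in for an open-vocabulary orientation model:
    maps (point cloud, language query) -> unit direction vector.
    Returns canned directions for a few example queries."""
    canned = {
        "top": np.array([0.0, 0.0, 1.0]),
        "handle": np.array([1.0, 0.0, 0.0]),
        "pouring water": np.array([0.0, -1.0, 0.0]),
    }
    direction = canned.get(query, np.array([0.0, 0.0, 1.0]))
    return direction / np.linalg.norm(direction)

# Example: query the "top" direction of an object point cloud,
# e.g. to check whether a wine glass is upright (top aligned with world-up).
points = np.random.rand(1024, 3)  # stand-in object point cloud
d = predict_semantic_orientation(points, "top")
```

A real model would predict `d` from the point-cloud geometry and the text embedding; the interface (point cloud + phrase in, unit vector out) is the part that matters.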

SoFar Robotic Manipulation Pipeline


Given a language instruction, SoFar prompts the VLM to obtain task-oriented object phrases and semantic orientation descriptions. SoFar then leverages the foundation models Florence-2 and SAM to segment the depth point cloud, and our PointSO to obtain semantic orientations. Summarizing this 3D object-centric information, an orientation-aware scene graph is constructed and encoded into language. The VLM takes the RGB image and the scene graph as input and outputs either the answer to the queried spatial-understanding VQA or the target translation and rotation for manipulation.
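The scene-graph encoding step above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the per-object fields (`name`, `center`, `orientations`) and the text layout are hypothetical, not the paper's exact serialization.

```python
# Sketch of encoding an orientation-aware scene graph into language
# for a VLM. All field names and the text format are assumptions.

def build_scene_graph(objects):
    """Encode per-object centers and semantic orientations as plain
    text lines so a VLM can reason over the 3D scene."""
    lines = []
    for obj in objects:
        x, y, z = obj["center"]
        orients = ", ".join(
            f"{phrase} -> {vec}" for phrase, vec in obj["orientations"].items()
        )
        lines.append(f"{obj['name']}: center=({x:.2f}, {y:.2f}, {z:.2f}), {orients}")
    return "\n".join(lines)

# Example objects as they might come out of segmentation + PointSO.
objects = [
    {"name": "wine glass", "center": (0.40, 0.10, 0.20),
     "orientations": {"top": "(0.00, 0.30, 0.95)"}},
    {"name": "power strip", "center": (0.10, -0.20, 0.00),
     "orientations": {"socket": "(0.00, 0.00, 1.00)"}},
]
graph_text = build_scene_graph(objects)
```

The resulting `graph_text` would be concatenated with the instruction and the RGB image in the VLM prompt, letting the VLM reason about 6-DoF goals in text space.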

Real-World Experiments

We show the quantitative evaluation of zero-shot real-world language-grounded rearrangement with SoFar. We design 60 diverse real-world tasks involving over 100 diverse objects.


Demo

SoFar is capable of performing a variety of complex object manipulation tasks involving spatial relationships and re-orientation, and can generalize across different embodiments, such as dexterous hands.

Pick up the teapot and pour water into the cup.
Take out the test tube with the green solution.
Rotate Loopy to face the yellow dragon doll.
Place the right bottle into the box and arrange it in a 3×3 pattern.
Rotate the flashlight to illuminate Loopy.
Upright the bottle.
Turn the bottle upside down.
Put the chili into the basket.
Upright the fallen wine glass and arrange it neatly in a row with the other wine glasses.
Insert the pen into the pen holder.
Pick the highest box and place it on the right.
Pick the box and place it to the right of the doll.
Pick the baseball and place it in the cart, then turn the cart to the right.
Pull out a tissue.
Pick up the cabbage and place it in the basket.
Pour out chips from the chips cylinder to the plate.
Aim the camera at the toy truck.
Pick up the Lego blocks and place them between the two toy trucks.

Navigation Demo

Semantic orientation can be applied not only to manipulation tasks but also to robotic navigation tasks. This orientation-aware constraint enhances the navigation process by ensuring precise alignment with the desired orientation, improving performance in scenarios where directionality is critical.

Move to face the front of the microwave.
Move to face the third chair's back.

Long Horizon Demo

Our model can complete multiple consecutive tasks, including pick & place, articulated object manipulation, and 6-DoF object rearrangement.

Clean the table.
6-DoF Shelf Rearrangement.

Close-Loop Planning

We demonstrate the closed-loop replanning capabilities of SoFar within SimplerEnv.
In (a), the model accidentally knocks over the Coke can during motion; after replanning, it successfully achieves the grasp.
In (b), the model initially misidentifies the Coke can as a Fanta can; after correction, it re-identifies and locates the correct object.

(a) Pick Coke can.
(b) Pick Coke can.

BibTeX

@article{qi2025sofar,
      author = {Qi, Zekun and Zhang, Wenyao and Ding, Yufei and Dong, Runpei and Yu, Xinqiang and Li, Jingwen and Xu, Lingyun and Li, Baoyu and He, Xialin and Fan, Guofan and Zhang, Jiazhao and He, Jiawei and Gu, Jiayuan and Jin, Xin and Ma, Kaisheng and Zhang, Zhizheng and Wang, He and Yi, Li},
      title = {SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation},
      journal = {arXiv preprint arXiv:2502.13143},
      year = {2025}
    }