Scene representation. The scene point cloud is partitioned into object-centric point clouds (from either ground-truth masks or predicted proposals), which a 3D encoder then processes into object-centric features. We also incorporate an optional 2D branch, where a 2D encoder processes the agent's ego-view observation to obtain ego-centric features.
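The partitioning step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `partition_scene` and `encode_object` are hypothetical names, and the mean-pooling "encoder" is a stand-in for the learned 3D encoder.

```python
from collections import defaultdict

def partition_scene(points, instance_ids):
    """Group scene points into object-centric point clouds by instance id
    (ground-truth masks here; predicted proposals would work the same way)."""
    objects = defaultdict(list)
    for point, oid in zip(points, instance_ids):
        objects[oid].append(point)
    return dict(objects)

def encode_object(point_cloud):
    """Stand-in for the 3D encoder: mean-pool coordinates into a feature.
    (A placeholder; the real encoder is a learned point-cloud network.)"""
    n = len(point_cloud)
    dims = len(point_cloud[0])
    return tuple(sum(p[d] for p in point_cloud) / n for d in range(dims))
```

Each object-centric point cloud then yields one feature (one "3D object token") for the downstream sequence.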
Unified sequence and objective. The sequence begins with a system message that specifies the agent's role and situation. Subsequent 2D image tokens and 3D object tokens convey the perceived scene. Next, an instruction specifies the task or context and prompts for the final response. The learning objective is a simple auto-regressive (next-token prediction) loss.
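A minimal sketch of how such a unified sequence and objective fit together. The function names are hypothetical, the token ids are placeholders for what are really tokenizer outputs and encoder embeddings, and supervising only the response span is an assumption of this sketch rather than a detail stated above.

```python
import math

def build_sequence(system, image_tokens, object_tokens, instruction, response):
    """Concatenate the parts of a LEO-style unified sequence and mark
    which positions contribute to the auto-regressive loss."""
    tokens = system + image_tokens + object_tokens + instruction + response
    # Supervise only the response span (an assumption of this sketch).
    loss_mask = [0] * (len(tokens) - len(response)) + [1] * len(response)
    return tokens, loss_mask

def autoregressive_loss(logprobs, tokens, loss_mask):
    """Mean negative log-likelihood over the supervised targets.
    logprobs[t] maps token id -> log p(token | tokens[:t])."""
    total, count = 0.0, 0
    for t, (tok, supervised) in enumerate(zip(tokens, loss_mask)):
        if supervised:
            total -= logprobs[t][tok]
            count += 1
    return total / max(count, 1)
```

In practice the mask would be implemented by setting ignored label positions in a standard causal-LM cross-entropy loss.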
Two-stage scheme: alignment & instruction tuning. We combine existing datasets and LLM-prompted data to create LEO-align and LEO-instruct, which serve the alignment and instruction-tuning stages, respectively.
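The two-stage schedule can be sketched as a loop over the two datasets in order. This is a deliberately generic illustration; `train` and `model_step` are hypothetical names, and the datasets here are placeholders rather than the real LEO-align/LEO-instruct contents.

```python
def train(model_step, leo_align, leo_instruct, epochs_per_stage=1):
    """Hypothetical two-stage schedule: run the same optimization step
    first on LEO-align (alignment), then on LEO-instruct (instruction
    tuning). Returns which stage each step belonged to, for inspection."""
    stage_log = []
    for stage, dataset in (("align", leo_align), ("instruct", leo_instruct)):
        for _ in range(epochs_per_stage):
            for batch in dataset:
                model_step(batch)  # one gradient step on this batch
                stage_log.append(stage)
    return stage_log
```

The point of the staging is simply that all alignment batches are consumed before any instruction-tuning batch.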
LEO's responses are highlighted with blue shading.
@inproceedings{huang2023embodied,
title={An Embodied Generalist Agent in 3D World},
author={Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
year={2024}
}