An Embodied Generalist Agent in 3D World

ICML 2024

¹Beijing Institute for General Artificial Intelligence (BIGAI)
²Peking University, ³Carnegie Mellon University, ⁴Tsinghua University

✶ indicates equal contribution


Leveraging massive knowledge and learning schemes from large language models (LLMs), recent machine learning models have shown notable success in building generalist agents capable of general-purpose task solving in diverse domains, including natural language processing, computer vision, and robotics. However, a significant challenge remains: these models exhibit limited ability to understand and interact with the 3D world. We argue that this limitation significantly hinders current models from performing real-world tasks and from further progress toward general intelligence. To this end, we introduce an embodied multi-modal and multi-task generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. Our proposed agent, referred to as LEO, is trained with shared LLM-based model architectures, objectives, and weights in two stages: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. To facilitate the training, we meticulously curate and generate an extensive dataset of object-level and scene-level multi-modal tasks, large in scale and complexity, that require a deep understanding of and interaction with the 3D world. Through rigorous experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, embodied navigation, and robotic manipulation. Our ablation results further provide valuable insights for the development of future embodied generalist agents.


Scene representation. The scene point cloud is partitioned into object-centric point clouds (either ground truth or predicted proposals), which are then processed by the 3D encoder to obtain object-centric features. We also incorporate an optional 2D branch, where a 2D encoder processes the agent's ego-view observation to obtain ego-centric features.
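As a rough illustration of the object-centric partitioning described above, the following sketch splits a scene point cloud by per-point instance ids and applies an encoder to each object. This is not LEO's implementation: the function names, the `ignore_id` convention, and the centroid "encoder" are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import numpy as np


def partition_by_instance(points, instance_ids, ignore_id=-1):
    """Split an (N, C) scene point cloud into per-object point clouds.

    points:       (N, C) array, e.g. xyz coordinates plus color.
    instance_ids: (N,) integer array with one object id per point, taken
                  from ground-truth annotations or predicted proposals.
    ignore_id:    id for points that belong to no object (hypothetical).
    """
    points = np.asarray(points)
    instance_ids = np.asarray(instance_ids)
    objects = {}
    for obj_id in np.unique(instance_ids):
        if obj_id == ignore_id:
            continue
        # A boolean mask selects every point assigned to this object.
        objects[int(obj_id)] = points[instance_ids == obj_id]
    return objects


def encode_objects(objects, encoder):
    """Apply a per-object encoder and stack features, ordered by object id."""
    return np.stack([encoder(objects[k]) for k in sorted(objects)])


def centroid_encoder(pts):
    """Stand-in for a learned 3D encoder (assumption): the object's centroid."""
    return pts.mean(axis=0)


# Toy scene: two points on object 0, one on object 1, one unlabeled point.
scene = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 4.0, 0.0], [5.0, 5.0, 5.0]])
ids = np.array([0, 0, 1, -1])
objects = partition_by_instance(scene, ids)
features = encode_objects(objects, centroid_encoder)  # one feature row per object
```

In a real system the centroid function would be replaced by a learned 3D encoder, and each row of `features` would become one 3D object token.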

Unified sequence and objective. The sequence begins with a system message that tells the agent its role and situation. The 2D image tokens and 3D object tokens that follow provide the perceived scene information. Next, an instruction specifies the task or context and prompts for the final response. The learning objective is a simple auto-regressive loss.
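The sequence layout and objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not LEO's training code: the helper names are hypothetical, the tokens are placeholder strings rather than embeddings, and supervising only the response tokens is an assumption made here for the example.

```python
import numpy as np


def build_sequence(system_message, image_tokens, object_tokens, instruction, response):
    """Concatenate the segments in the order described above.

    Returns the flat token list and a boolean mask that is True only on
    response tokens (assumption: only the response is supervised).
    """
    prefix = list(system_message) + list(image_tokens) + list(object_tokens) + list(instruction)
    tokens = prefix + list(response)
    mask = np.array([False] * len(prefix) + [True] * len(response))
    return tokens, mask


def autoregressive_nll(logits, targets, mask):
    """Mean negative log-likelihood of the targets at masked positions.

    logits:  (T, V) array; logits[t] scores the token at position t given
             all earlier positions (the shift is left to the caller).
    targets: (T,) integer token ids.
    mask:    (T,) boolean array selecting the supervised positions.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_ll = log_probs[np.arange(len(targets)), targets]
    return float(-token_ll[mask].mean())


# Placeholder tokens for each segment of the unified sequence.
tokens, mask = build_sequence(
    ["<sys>"], ["<img0>"], ["<obj0>", "<obj1>"], ["<inst>"], ["a", "chair"]
)
# Uniform logits over a toy vocabulary of 5 ids give a loss of log(5).
loss = autoregressive_nll(np.zeros((len(tokens), 5)), np.zeros(len(tokens), dtype=int), mask)
```

Masking the prefix keeps the system message, scene tokens, and instruction as conditioning context only, so the loss measures how well the response is predicted.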


Two-stage scheme: alignment & instruction tuning. We combine existing datasets and LLM-prompted data to create LEO-align and LEO-instruct.


Select 3D VL capabilities

State what you notice at the pointed location in the 3D representation.
              [sep] A donut with pink icing and sprinkles.
Explain the object at the pinpointed location in the 3D setting.
              [sep] A wooden chest of drawers.
Develop a description of the object at the specified position in the 3D backdrop.
              [sep] This is a black keyboard. It is in front of two monitors.
How does the object appear at the marked location in the 3D view?
              [sep] This is a white pillow. It is on the right side of the bed.
Relay a brief summary of this scene's elements.
              [sep] In this scene, there is a tall wardrobe standing on the wooden floor.
              The wardrobe is to the right of a bed and in front of a nightstand.
              It is also behind a chair and to the left of another chair.
              There are pillows lying on the bed, with different sizes and colors.
              Some pillows are close to each other, while others are far away from each other.
              A blanket is lying on the bed, close to one of the pillows.
              The room has white walls attached to the floor, giving it an architectural feel.
              The ceiling is overhead and attached to the walls.
              There is a curtain hanging on one of the walls, adding a touch of elegance to the room.
              Overall, this room provides a cozy and comfortable space for relaxation and sleep.
Can you walk me through the details of this scene?
              [sep] In this scene, there is a white rectangular metal heater connected to the wall.
              It is in front of a curtain and to the right of another curtain.
              There is also a closed glass window attached to the wall.
              The floor is made of wooden material and has a striped texture.
              There are several chairs in the room, with different sizes and colors.
              Some chairs are close to tables, while others are far away from them.
              The walls are flat and white, giving the room an architectural feel.
              The room appears to be a living room or dining area, with comfortable seating options and a cozy atmosphere.
In each example above, the prompt precedes [sep] and LEO's response follows it.

Robotic Manipulation
Embodied Navigation


  title={An Embodied Generalist Agent in 3D World},
  author={Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
  booktitle={Proceedings of the International Conference on Machine Learning (ICML)},