MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

CVPR 2024

1 Shanghai Artificial Intelligence Laboratory; 2 The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen); 3 Beihang University; 4 Tsinghua University; 5 The University of Sydney
Equal Contribution   ✉ Corresponding author   † Project Leader  


Designing an embodied system that can solve long-horizon open-world tasks in human-like ways is a long-standing goal. However, existing approaches usually struggle with the compound difficulties caused by the logic-aware decomposition and context-aware execution of these tasks. To this end, we introduce MP5, an open-ended multimodal embodied system built upon the challenging Minecraft simulator, which can decompose feasible sub-objectives, design sophisticated situation-aware plans, and perform embodied action control, while communicating frequently with a goal-conditioned active perception scheme. Specifically, MP5 is developed on top of recent advances in Multimodal Large Language Models (MLLMs), and the system is organized into functional modules that can be scheduled and can collaborate to solve pre-defined context- and process-dependent tasks. Extensive experiments show that MP5 achieves a 22% success rate on difficult process-dependent tasks and a 91% success rate on tasks that heavily depend on the context. Moreover, MP5 exhibits a remarkable ability to address many open-ended tasks that are entirely novel.

The process of finishing the task "kill a pig with a stone sword during the daytime near the water with grass next to it."

Overview of MP5 architecture.

Module interaction in MP5. After receiving the task instruction, MP5 first uses the Parser to generate a sub-objective list. Once a sub-objective is passed to the Planner, the Planner obtains environment information (Env. Info.) for Perception-aware Planning. The Performer frequently carries out Perception-aware Execution, interacting with the environment through the Patroller. Both Perception-aware Planning and Perception-aware Execution rely on Active Perception between the Percipient and the Patroller. When an execution fails, the Planner re-schedules the action sequence of the current sub-objective. These mechanisms for collaboration and inspection across modules help ensure correctness and robustness when MP5 solves an open-ended embodied task.
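The control flow described in this caption can be sketched roughly as follows. This is a minimal illustration, not the MP5 implementation: the function `run_task`, its callable parameters (`parse`, `plan`, `execute`), and the `max_replans` limit are all hypothetical names introduced here, and the MLLM-backed modules are replaced by plain callables.

```python
from typing import Callable, List


def run_task(
    instruction: str,
    parse: Callable[[str], List[str]],            # stands in for the Parser
    plan: Callable[[str, List[str]], List[str]],  # stands in for the Planner
    execute: Callable[[str, str], bool],          # stands in for Performer + Patroller check
    max_replans: int = 3,                         # assumed limit, not from the paper
) -> bool:
    """Hypothetical loop: parse, plan, execute, and re-plan on failure."""
    for sub_objective in parse(instruction):
        failures: List[str] = []  # actions that failed for this sub-objective
        for _ in range(max_replans + 1):
            actions = plan(sub_objective, failures)
            # Run actions in order and stop at the first one that fails.
            failed = next(
                (a for a in actions if not execute(sub_objective, a)), None
            )
            if failed is None:
                break  # every action succeeded; go to the next sub-objective
            failures.append(failed)  # feed the failure back into re-planning
        else:
            return False  # re-planning budget exhausted
    return True
```

Passing the list of failed actions back to `plan` is one simple way to model "the Planner re-schedules the action sequence"; the real system may carry richer feedback.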

Active Perception

A demonstration of the Active Perception scheme. The Temporary Env. Info. Set stores information collected in the current scenario, so it is reset at the beginning of each Active Perception run. The Performer then invokes the Patroller, which asks the Percipient questions, round by round, based on the description of the sub-objective and the current execution action. The Percipient's responses are saved in the Temporary Env. Info. Set and are also used as context for the next question-answering round. After all necessary questions have been asked, the Patroller checks whether the current execution action is complete by comparing the current sub-objective against the perceived environment information saved in the Temporary Env. Info. Set. In this way, complex Context-Dependent Tasks can be solved smoothly.
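The round-by-round procedure in this caption might look like the sketch below. It is an illustration under stated assumptions: `active_perception`, `next_question`, `answer`, `is_complete`, and `max_rounds` are hypothetical names, and the Patroller and Percipient (MLLM-based in the paper) are represented by ordinary callables.

```python
from typing import Callable, Dict, Optional, Tuple


def active_perception(
    sub_objective: str,
    action: str,
    next_question: Callable[[str, str, Dict[str, str]], Optional[str]],  # Patroller asks
    answer: Callable[[str], str],                                        # Percipient answers
    is_complete: Callable[[str, Dict[str, str]], bool],                  # Patroller checks
    max_rounds: int = 10,                                                # assumed cap
) -> Tuple[bool, Dict[str, str]]:
    """Hypothetical question-answering loop over a temporary info set."""
    env_info: Dict[str, str] = {}  # Temporary Env. Info. Set, reset on every call
    for _ in range(max_rounds):
        # Earlier answers are passed in as context for the next question.
        question = next_question(sub_objective, action, env_info)
        if question is None:
            break  # no further questions are needed
        env_info[question] = answer(question)
    # Compare the sub-objective with everything perceived so far.
    return is_complete(sub_objective, env_info), env_info
```

Creating `env_info` inside the function is what models the reset: nothing perceived in a previous scenario leaks into the current check.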

Experiments of Different Tasks

Process-Dependent Task Definition

Process-Dependent Tasks primarily investigate situation-aware planning and embodied action execution, incorporating contributions from Active Perception and other modules that continuously perceive the environment and dynamically adjust their actions to accomplish long-horizon tasks.
In the table below, we list the names of all tasks in Process-Dependent Tasks, their reasoning steps, object icons, the final recipe, and the required tools/platforms. The reasoning step refers to the number of sub-objectives that need to be completed in order to finish the entire task.

Context-Dependent Task Definition

Context-Dependent Tasks primarily study how Active Perception enables the agent to better perceive low-level context information in the environment.
We first establish 6 aspects of environmental information derived from the Minecraft game environment: [Object, Mob, Ecology, Time, Weather, Brightness]. Each aspect has multiple options. Based on this, we define 16 tasks and organize them into 4 difficulty levels according to the number of information elements that require perception, as shown in the table below.
Easy tasks require the perception of only one element, Mid tasks include 2 perception elements, Hard tasks contain 3 elements, and Complex tasks involve the perception of 4 to 6 elements. Tasks at the same level differ in their environment information content. The number of information elements in each task and the corresponding specific environment information are shown in the table below.
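The level definitions above amount to a simple mapping from the number of perceived elements to a difficulty label. The sketch below encodes that rule; the function name `difficulty` and the example aspect values are illustrative assumptions, not taken from the task table.

```python
from typing import Dict

# The six aspects of environment information listed in the text.
ASPECTS = ("Object", "Mob", "Ecology", "Time", "Weather", "Brightness")


def difficulty(required: Dict[str, str]) -> str:
    """Map a task's required perception elements to its difficulty level."""
    unknown = set(required) - set(ASPECTS)
    if unknown:
        raise ValueError(f"unknown aspects: {sorted(unknown)}")
    count = len(required)
    if count == 0:
        raise ValueError("a task needs at least one perception element")
    if count == 1:
        return "Easy"
    if count == 2:
        return "Mid"
    if count == 3:
        return "Hard"
    return "Complex"  # 4 to 6 elements


# Illustrative values only; the actual options come from the task table.
print(difficulty({"Mob": "pig", "Time": "daytime"}))
```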

Agent in Context-Dependent Task

Agent in Process-Dependent Task

Agent in Open-Ended Task


In this paper, we propose a novel multi-modal embodied system termed MP5, which is driven by frequent ego-centric scene perception for task planning and execution. In practice, it integrates five functional modules to accomplish task planning and execution by actively acquiring essential visual information from the scene. The experimental results suggest that our system is an effective integration of perception, planning, and execution, designed to handle both context- and process-dependent tasks within an open-ended environment.


  title={MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception},
  author={Yiran Qin and Enshen Zhou and Qichang Liu and Zhenfei Yin and Lu Sheng and Ruimao Zhang and Yu Qiao and Jing Shao},
  booktitle={arXiv preprint arxiv:2312.07472},