Highlights
When performing the task "Grab the steak and use the camera to photograph it with 4 embodied agents", collaboration among multiple agents is required: \( a_1 \) grasps the steak, \( a_2 \) and \( a_3 \) lift the camera, and \( a_4 \) presses the shutter to take the photo. However, no agent can focus solely on its own sub-task. We introduce the concept of compositional constraints to ensure safe and efficient collaboration among the agents. Logical constraints prevent incorrect forms of interaction (e.g., \( a_3 \) grabbing the camera lens and damaging it). Spatial constraints avoid catastrophic hardware damage (e.g., collisions between \( a_2 \) and \( a_3 \) during trajectory execution). Temporal constraints prevent inefficient collaboration (e.g., \( a_1 \) waiting unnecessarily for collisions that never actually occur while other agents execute their tasks).
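To make the three constraint categories concrete, the following is a minimal sketch of how they could be represented as data; the class names and fields are illustrative assumptions, not the project's actual definitions.

```python
# Hypothetical representations of the three compositional constraint types.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LogicalConstraint:
    """Restricts HOW an agent may interact with an object (e.g., a3 must not grab the camera lens)."""
    agent: str                # e.g., "a3"
    obj: str                  # e.g., "camera"
    allowed_parts: List[str]  # object parts that may be touched, e.g., ["camera_body"]

@dataclass
class SpatialConstraint:
    """Keeps agents' trajectories apart to avoid collisions (e.g., between a2 and a3)."""
    agents: Tuple[str, str]   # pair of agents whose workspaces may overlap
    min_clearance: float      # minimum allowed distance in meters

@dataclass
class TemporalConstraint:
    """Schedules agents so that none waits on a conflict that does not actually exist."""
    agent: str                # agent whose motion is being scheduled
    wait_for: List[str]       # agents it genuinely must wait for; empty means proceed immediately
```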
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of such systems, existing methods fail to automatically generate safe and efficient training data for them. To address this, we propose the concept of compositional constraints for embodied multi-agent systems, tackling the challenges that arise from collaboration among embodied agents. We design interfaces tailored to each type of constraint, enabling seamless interaction with the physical world. Leveraging compositional constraints and these purpose-built interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on multi-agent tasks of varying difficulty. Furthermore, we explore architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
Overview of RoboFactory. Given the global task description, prior information, and observations, RoboBrain generates the next sub-goal for each agent and outputs textual compositional constraints. It then generates unconstrained trajectory sequences for each agent to achieve the corresponding sub-goals, invoking predefined motion primitives. RoboChecker constructs the corresponding constraint interfaces from the textual compositional constraints and the current multi-agent state, and checks whether the agents violate any constraint while executing the generated trajectories. By constructing constraint interfaces, the framework turns abstract textual constraints into representations that interact directly with agent behaviors, ensuring the generation of safe and efficient collaborative data for embodied multi-agent systems.
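Below is a minimal sketch of the data-collection loop described above. The function and object names (robo_brain, robo_checker, env, and their methods) are placeholders for illustration, not RoboFactory's released interfaces.

```python
# Hedged sketch of the RoboBrain / RoboChecker data-collection loop.
def collect_episode(task_description, prior_info, env, robo_brain, robo_checker, max_steps=50):
    observations = env.reset()
    episode = []
    for _ in range(max_steps):
        # RoboBrain proposes the next sub-goal per agent plus textual compositional constraints.
        sub_goals, text_constraints = robo_brain.plan(task_description, prior_info, observations)
        # Unconstrained trajectories are generated by invoking predefined motion primitives.
        trajectories = {agent: robo_brain.generate_trajectory(agent, goal)
                        for agent, goal in sub_goals.items()}
        # RoboChecker builds constraint interfaces from the text constraints and current state,
        # then verifies the proposed trajectories against them.
        ok, feedback = robo_checker.validate(text_constraints, env.agent_states(), trajectories)
        if not ok:
            # Failed interfaces are fed back to RoboBrain for re-planning.
            prior_info = list(prior_info) + [feedback]
            continue
        observations, done = env.execute(trajectories)
        episode.append((sub_goals, trajectories, observations))
        if done:
            break
    return episode
```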
Different Constraint Interfaces. For \( C_l \), we annotate the interactive points of each object and the interactive direction at each point. For \( C_s \), we model observations to obtain depth maps and combine them with the robotic arm states to construct 3D occupancy representations. For \( C_t \), we build temporal-state representations from each agent's trajectory at every position change and analyze them to schedule the agents.
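The following is an illustrative sketch of what the three constraint interfaces might check; all function names, signatures, and thresholds are assumptions for clarity, not the released RoboFactory code.

```python
import numpy as np

def validate_interaction(contact_point, approach_dir, annotated_points,
                         dist_tol=0.02, angle_tol_deg=30.0):
    """C_l: the contact must land on an annotated interactive point and approach
    it along that point's annotated interactive direction."""
    for point_xyz, allowed_dir in annotated_points:
        if np.linalg.norm(np.asarray(contact_point) - np.asarray(point_xyz)) < dist_tol:
            cos = np.dot(approach_dir, allowed_dir) / (
                np.linalg.norm(approach_dir) * np.linalg.norm(allowed_dir) + 1e-8)
            return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))) < angle_tol_deg
    return False  # contact is not at any annotated interactive point

def validate_spatial_occupancy(trajectory_points, occupancy_grid, origin, voxel_size=0.02):
    """C_s: every waypoint must fall in a free voxel of the 3D occupancy grid
    built from the depth map and the other arms' states."""
    for p in trajectory_points:
        idx = tuple(((np.asarray(p) - np.asarray(origin)) / voxel_size).astype(int))
        if occupancy_grid[idx]:
            return False
    return True

def validate_scheduling(agent_intervals, other_intervals):
    """C_t: an agent only needs to wait if its occupied time-space intervals
    actually overlap another agent's; otherwise it proceeds immediately."""
    for (t0, t1, region) in agent_intervals:
        for (s0, s1, other_region) in other_intervals:
            if t0 < s1 and s0 < t1 and region == other_region:
                return False
    return True
```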
These videos demonstrate that by introducing compositional constraints, RoboFactory can prevent collisions between robotic arms and efficiently complete various collaborative tasks involving multiple embodied agents.
Task: Take Photo.
Other Tasks.
We showcase RoboChecker through the complete execution of the Take Photo task. By analyzing the constraints, RoboChecker generates CheckCode, a composition of multiple interfaces. Specifically, VI stands for Validate Interaction, VD for Validate Direction, VSO for Validate Spatial Occupancy, and VS for Validate Scheduling. CheckCode returns true only when all interfaces pass validation, indicating that the generated motion trajectory adheres to the compositional constraints. Otherwise, CheckCode identifies the failed interfaces and returns the feedback to RoboBrain.
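As a hedged sketch of how such a CheckCode could compose the four interfaces, consider the following; the validator functions passed in (vi, vd, vso, vs) stand in for the interfaces RoboChecker actually constructs and are not the project's API.

```python
# Illustrative composition of the VI / VD / VSO / VS interfaces into one CheckCode.
def check_code(state, trajectories, vi, vd, vso, vs):
    """Return (ok, failed): ok is True only if every interface passes validation."""
    results = {
        "ValidateInteraction": vi(state, trajectories),
        "ValidateDirection": vd(state, trajectories),
        "ValidateSpatialOccupancy": vso(state, trajectories),
        "ValidateScheduling": vs(state, trajectories),
    }
    failed = [name for name, passed in results.items() if not passed]
    # Failed interface names are returned as feedback to RoboBrain for re-planning.
    return len(failed) == 0, failed
```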
We design four multi-embodied-agent imitation learning architectures. For the image input, Global View denotes the observation covering all agents, and Local View denotes each agent's ego-view observation. For policy training, Shared Policy means all agents share a single policy, and Separate Policy means each agent trains an independent policy.
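The four configurations are the cross product of the two view choices and the two policy-sharing choices. The sketch below is purely illustrative; the factory function and configuration names are assumptions, not the benchmark's code.

```python
# Conceptual sketch of the four view x policy imitation-learning configurations.
from itertools import product

VIEWS = ["global", "local"]        # Global View: all agents; Local View: per-agent ego view
POLICIES = ["shared", "separate"]  # Shared: one policy for all agents; Separate: one per agent

def build_agents(agent_ids, view, policy_mode, make_policy):
    """Instantiate policies for a chosen configuration."""
    if policy_mode == "shared":
        shared = make_policy(view)                        # one network used by every agent
        return {aid: shared for aid in agent_ids}
    return {aid: make_policy(view) for aid in agent_ids}  # independent network per agent

configs = list(product(VIEWS, POLICIES))
# [('global', 'shared'), ('global', 'separate'), ('local', 'shared'), ('local', 'separate')]
```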