Highlights
When performing the task "Grab the steak and use the camera to photograph it with 4 embodied agents", collaboration among multiple agents is required: \( a_1 \) grasps the steak, \( a_2 \) and \( a_3 \) lift the camera, and \( a_4 \) presses the shutter to take the photo. However, no agent can focus solely on its own sub-task. We introduce the concept of compositional constraints to ensure safe and efficient collaboration among the agents. Logical constraints prevent incorrect forms of interaction (e.g., \( a_3 \) grabbing the camera lens and damaging it). Spatial constraints avoid catastrophic hardware damage (e.g., collisions between \( a_2 \) and \( a_3 \) during trajectory execution). Temporal constraints prevent inefficient collaboration (e.g., \( a_1 \) waiting unnecessarily for collisions that never actually occur while other agents execute their tasks).
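To make the three constraint categories concrete, the following is a minimal sketch of how they could be represented as data; the class names and fields are illustrative assumptions, not the project's actual definitions.

```python
# Hypothetical representations of the three compositional constraint types.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LogicalConstraint:
    """Restricts HOW an agent may interact with an object (e.g., a3 must not grab the camera lens)."""
    agent: str                # e.g., "a3"
    obj: str                  # e.g., "camera"
    allowed_parts: List[str]  # object parts that may be touched, e.g., ["camera_body"]

@dataclass
class SpatialConstraint:
    """Keeps agents' trajectories apart to avoid collisions (e.g., between a2 and a3)."""
    agents: Tuple[str, str]   # pair of agents whose workspaces may overlap
    min_clearance: float      # minimum allowed distance in meters

@dataclass
class TemporalConstraint:
    """Schedules agents so that none waits on a conflict that does not actually exist."""
    agent: str                # agent whose motion is being scheduled
    wait_for: List[str]       # agents it genuinely must wait for; empty means proceed immediately
```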
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of such systems, existing methods fail to automatically generate safe and efficient training data for them. To address this, we propose the concept of compositional constraints for embodied multi-agent systems, tackling the challenges that arise from collaboration among embodied agents. We design interfaces tailored to each type of constraint, enabling seamless interaction with the physical world. Leveraging compositional constraints and these purpose-built interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on multi-agent tasks of varying difficulty. Furthermore, we explore architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
Overview of RoboFactory. Given the global task description, prior information, and observations, RoboBrain generates the next sub-goal for each agent and outputs textual compositional constraints. It then generates unconstrained trajectory sequences for each agent to achieve the corresponding sub-goals, invoking predefined motion primitives. RoboChecker constructs the corresponding constraint interfaces from the textual compositional constraints and the current multi-agent state, and checks whether the agents violate any constraint while executing the generated trajectories. By constructing constraint interfaces, the framework turns abstract textual constraints into representations that interact directly with agent behaviors, ensuring the generation of safe and efficient collaborative data for embodied multi-agent systems.
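Below is a minimal sketch of the data-collection loop described above. The function and object names (robo_brain, robo_checker, env, and their methods) are placeholders for illustration, not RoboFactory's released interfaces.

```python
# Hedged sketch of the RoboBrain / RoboChecker data-collection loop.
def collect_episode(task_description, prior_info, env, robo_brain, robo_checker, max_steps=50):
    observations = env.reset()
    episode = []
    for _ in range(max_steps):
        # RoboBrain proposes the next sub-goal per agent plus textual compositional constraints.
        sub_goals, text_constraints = robo_brain.plan(task_description, prior_info, observations)
        # Unconstrained trajectories are generated by invoking predefined motion primitives.
        trajectories = {agent: robo_brain.generate_trajectory(agent, goal)
                        for agent, goal in sub_goals.items()}
        # RoboChecker builds constraint interfaces from the text constraints and current state,
        # then verifies the proposed trajectories against them.
        ok, feedback = robo_checker.validate(text_constraints, env.agent_states(), trajectories)
        if not ok:
            # Failed interfaces are fed back to RoboBrain for re-planning.
            prior_info = list(prior_info) + [feedback]
            continue
        observations, done = env.execute(trajectories)
        episode.append((sub_goals, trajectories, observations))
        if done:
            break
    return episode
```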
Different Constraint Interfaces. For \( C_l \), we annotate the interactive points of each object and the interactive direction at each point. For \( C_s \), we model observations to obtain depth maps and combine them with the robotic arm states to construct 3D occupancy representations. For \( C_t \), we build temporal-state representations from each agent's trajectory at every position change and analyze them to schedule the agents.
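The following is an illustrative sketch of what the three constraint interfaces might check; all function names, signatures, and thresholds are assumptions for clarity, not the released RoboFactory code.

```python
import numpy as np

def validate_interaction(contact_point, approach_dir, annotated_points,
                         dist_tol=0.02, angle_tol_deg=30.0):
    """C_l: the contact must land on an annotated interactive point and approach
    it along that point's annotated interactive direction."""
    for point_xyz, allowed_dir in annotated_points:
        if np.linalg.norm(np.asarray(contact_point) - np.asarray(point_xyz)) < dist_tol:
            cos = np.dot(approach_dir, allowed_dir) / (
                np.linalg.norm(approach_dir) * np.linalg.norm(allowed_dir) + 1e-8)
            return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))) < angle_tol_deg
    return False  # contact is not at any annotated interactive point

def validate_spatial_occupancy(trajectory_points, occupancy_grid, origin, voxel_size=0.02):
    """C_s: every waypoint must fall in a free voxel of the 3D occupancy grid
    built from the depth map and the other arms' states."""
    for p in trajectory_points:
        idx = tuple(((np.asarray(p) - np.asarray(origin)) / voxel_size).astype(int))
        if occupancy_grid[idx]:
            return False
    return True

def validate_scheduling(agent_intervals, other_intervals):
    """C_t: an agent only needs to wait if its occupied time-space intervals
    actually overlap another agent's; otherwise it proceeds immediately."""
    for (t0, t1, region) in agent_intervals:
        for (s0, s1, other_region) in other_intervals:
            if t0 < s1 and s0 < t1 and region == other_region:
                return False
    return True
```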
These videos demonstrate that by introducing compositional constraints, RoboFactory can prevent collisions between robotic arms and efficiently complete various collaborative tasks involving multiple embodied agents.
Task: Take Photo.
Other Tasks.
We showcase RoboChecker through the complete execution of the Take Photo task. By analyzing the constraints, RoboChecker generates CheckCode, a composition of multiple interfaces. Specifically, VI stands for Validate Interaction, VD for Validate Direction, VSO for Validate Spatial Occupancy, and VS for Validate Scheduling. CheckCode returns true only when all interfaces pass validation, indicating that the generated motion trajectory adheres to the compositional constraints. Otherwise, CheckCode identifies the failed interfaces and returns the feedback to RoboBrain.
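As a hedged sketch of how such a CheckCode could compose the four interfaces, consider the following; the validator functions passed in (vi, vd, vso, vs) stand in for the interfaces RoboChecker actually constructs and are not the project's API.

```python
# Illustrative composition of the VI / VD / VSO / VS interfaces into one CheckCode.
def check_code(state, trajectories, vi, vd, vso, vs):
    """Return (ok, failed): ok is True only if every interface passes validation."""
    results = {
        "ValidateInteraction": vi(state, trajectories),
        "ValidateDirection": vd(state, trajectories),
        "ValidateSpatialOccupancy": vso(state, trajectories),
        "ValidateScheduling": vs(state, trajectories),
    }
    failed = [name for name, passed in results.items() if not passed]
    # Failed interface names are returned as feedback to RoboBrain for re-planning.
    return len(failed) == 0, failed
```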
We design four multi-embodied-agent imitation learning architectures. For the image input, Global View denotes the observation covering all agents, and Local View denotes each agent's ego-view observation. For policy training, Shared Policy means all agents share a single policy, and Separate Policy means each agent trains an independent policy.
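The four configurations are the cross product of the two view choices and the two policy-sharing choices. The sketch below is purely illustrative; the factory function and configuration names are assumptions, not the benchmark's code.

```python
# Conceptual sketch of the four view x policy imitation-learning configurations.
from itertools import product

VIEWS = ["global", "local"]        # Global View: all agents; Local View: per-agent ego view
POLICIES = ["shared", "separate"]  # Shared: one policy for all agents; Separate: one per agent

def build_agents(agent_ids, view, policy_mode, make_policy):
    """Instantiate policies for a chosen configuration."""
    if policy_mode == "shared":
        shared = make_policy(view)                        # one network used by every agent
        return {aid: shared for aid in agent_ids}
    return {aid: make_policy(view) for aid in agent_ids}  # independent network per agent

configs = list(product(VIEWS, POLICIES))
# [('global', 'shared'), ('global', 'separate'), ('local', 'shared'), ('local', 'separate')]
```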