WorldSimBench: Towards Video Generation Models as World Simulators

1The Chinese University of Hong Kong, Shenzhen; 2Shanghai Artificial Intelligence Laboratory;
3Beihang University; 4The University of Hong Kong;
Equal Contribution   ✉ Corresponding author   † Project Lead  

Abstract

Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks are unable to effectively evaluate higher-capability, highly embodied predictive models from an embodied perspective. In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks, covering three representative embodied scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. In the Explicit Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment dataset based on fine-grained human feedback, which we use to train a Human Preference Evaluator that aligns with human perception and explicitly assesses the visual fidelity of World Simulators. In the Implicit Manipulative Evaluation, we assess the video-action consistency of World Simulators by evaluating whether the generated situation-aware video can be accurately translated into the correct control signals in dynamic environments. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.

Overview of WorldSimBench architecture.





Overview of the hierarchical capabilities of Predictive Models. Models at higher stages demonstrate more advanced capabilities. We take the initial step in evaluating Predictive Generative Models up to the S3 stage, known as World Simulators, by introducing a parallel evaluation framework, WorldSimBench. WorldSimBench assesses models through both Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, focusing on video generation and action transformation across three critical embodied scenarios.

Predictive Model Category Definition

We concretely categorize predictive models based on each model's capabilities and level of embodiment. The detailed categorization stages are illustrated below:


  • Stage S0: At this stage, predictive models can generate corresponding predictions based on instructions and observations but are limited to textual modality. Benchmarks at this stage conduct text-level and task-completion evaluations through output text planning.

  • Stage S1: At this stage, predictive models can generate visual predictions based on instructions and observations, but without incorporating temporal information. Benchmarks at this stage conduct aesthetic evaluation for generated images.

  • Stage S2: At this stage, predictive models can generate corresponding video predictions based on both instructions and observations. Yet, due to limited model capabilities, the evaluation at this level focuses solely on the aesthetic quality of the generated outputs.

  • Stage S3: At this stage, predictive models can generate corresponding video predictions based on instructions and observations, with the predicted video content adhering to physical rules and aligning with the executed actions. These models are known as World Simulators, and WorldSimBench is a benchmark specifically designed to evaluate these World Simulators.
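The four-stage hierarchy above can be summarized as a simple enumeration. This is only an illustrative sketch of the taxonomy; the class and member names are our own, not identifiers from the paper.

```python
from enum import Enum


class PredictiveModelStage(Enum):
    """Capability hierarchy of predictive models (S0-S3), per the list above."""
    S0 = "text-only predictions; text-level and task-completion evaluation"
    S1 = "visual predictions without temporal information; aesthetic evaluation"
    S2 = "video predictions; aesthetic-quality evaluation only"
    S3 = "video predictions obeying physical rules and aligned with actions"


def is_world_simulator(stage: PredictiveModelStage) -> bool:
    # Only S3-stage models qualify as World Simulators, the
    # target of WorldSimBench.
    return stage is PredictiveModelStage.S3
```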

The rapidly evolving field of World Simulators offers exciting opportunities for advancing Artificial General Intelligence, with significant potential to enhance human productivity and creativity, especially in embodied intelligence. Therefore, conducting a comprehensive embodied evaluation of World Simulators is crucial.

Explicit Perceptual Evaluation




Hierarchical Evaluation Dimensions. The dimensions are categorized into three main aspects: Visual Quality for evaluating overall quality, Condition Consistency for evaluating alignment with the input instruction, and Embodiment for evaluating embodiment-related factors such as physical rules.




Overview of Explicit Perceptual Evaluation. (Top) Prompt Generation. We use a large collection of video captions from the internet and our predefined embodied evaluation dimensions. These are expanded using GPT and manually verified to create a corresponding Task Instruction Prompt List for data generation and evaluation. (Bottom) HF-Embodied Dataset Generation. Massive internet-sourced embodied videos with captions are used to train data generation models. Fine-grained Human Feedback Annotation is then applied to the embodied videos according to the corresponding Task Instruction Prompt List, covering multiple embodied dimensions.
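Aligning an evaluator with human perception is commonly framed as learning from pairwise preferences. The sketch below uses the standard Bradley-Terry model for this purpose; this choice, and both function names, are assumptions for illustration, not the paper's stated training objective for the Human Preference Evaluator.

```python
import math


def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Probability that video A is preferred over video B, given scalar
    evaluator scores (standard Bradley-Terry / logistic model)."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def pairwise_loss(score_preferred: float, score_other: float) -> float:
    """Negative log-likelihood that the human-preferred video wins;
    minimizing this pushes the evaluator toward human judgments."""
    return -math.log(bradley_terry_prob(score_preferred, score_other))
```

With fine-grained feedback such as the HF-Embodied Dataset's per-dimension annotations, one such loss could be applied per evaluation dimension rather than to a single overall score.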




Results of Explicit Perceptual Evaluation across three embodied scenarios. Scores in each embodied scenario are normalized to 0-1. See the paper for more experimental details.
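Per-scenario score normalization to 0-1 can be done with simple min-max scaling, sketched below. Min-max scaling is an assumption here; the paper may normalize differently.

```python
def normalize_scores(scores: list[float]) -> list[float]:
    """Min-max normalize raw evaluator scores within one embodied
    scenario to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All models scored identically; map everything to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```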

Implicit Manipulative Evaluation




Overview of Implicit Manipulative Evaluation. Embodied tasks in different scenarios are decomposed into executable sub-tasks. The video generation model generates corresponding predicted videos based on the current instructions and real-time observations. Using a pre-trained IDM or a goal-based policy, the agent executes the generated sequence of actions. After a fixed timestep, the predicted video is refreshed by sampling again from the video generation model, and this process repeats. Finally, the success rates of various embodied tasks are obtained through monitors in the simulation environment.
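The closed loop described above can be sketched as follows. All object interfaces (`video_model`, `idm`, `env`) are hypothetical stand-ins for a video generation model, a pre-trained inverse dynamics model (or goal-based policy), and a simulation environment; the actual APIs in the benchmark differ.

```python
def implicit_manipulative_eval(video_model, idm, env, instruction,
                               refresh_interval=16, max_steps=200):
    """Run one embodied task and report success, a minimal sketch of the
    Implicit Manipulative Evaluation loop."""
    obs = env.reset()
    video = None
    for step in range(max_steps):
        # Refresh the predicted video at a fixed interval by
        # re-sampling from the video generation model, conditioned on
        # the instruction and the current real-time observation.
        if step % refresh_interval == 0:
            video = video_model.generate(instruction, obs)
        # Translate the generated frames into a control signal via the
        # IDM; video-action consistency determines whether this action
        # actually advances the task.
        action = idm.infer(video, obs, step % refresh_interval)
        obs, done, success = env.step(action)
        if done:
            return success
    return False
```

Success rates are then aggregated over many such episodes per task by monitors in the simulation environment.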




Results of Implicit Manipulative Evaluation across three embodied scenarios. See the paper for more experimental details.

Conclusion

In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. We conducted a comprehensive evaluation and analysis of multiple video generation models as World Simulators through both Explicit Perceptual Evaluation and Implicit Manipulative Evaluation processes. We summarize key findings from the evaluation and hope these insights will inspire and guide future research on World Simulators.

Limitations. Although we evaluate physical rules and 3D content from the perspective of embodied intelligence, World Simulators can be applied to scenarios beyond robotics, and different scenarios involve different physical representations. How to effectively evaluate World Simulators in these other scenarios therefore requires further exploration.

BibTeX

@misc{qin2024worldsimbenchvideogenerationmodels,
      title={WorldSimBench: Towards Video Generation Models as World Simulators}, 
      author={Yiran Qin and Zhelun Shi and Jiwen Yu and Xijun Wang and Enshen Zhou and Lijun Li and Zhenfei Yin and Xihui Liu and Lu Sheng and Jing Shao and Lei Bai and Wanli Ouyang and Ruimao Zhang},
      year={2024},
      eprint={2410.18072},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.18072}, 
}