WorldSimBench: Towards Video Generation Models as World Simulators

1The Chinese University of Hong Kong, Shenzhen; 2Shanghai Artificial Intelligence Laboratory;
3Beihang University; 4The University of Hong Kong;
Equal Contribution   ✉ Corresponding author   † Project Lead  

Abstract

Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks are unable to effectively evaluate higher-capability, highly embodied predictive models from an embodied perspective. In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks, covering three representative embodied scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. In the Explicit Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment dataset based on fine-grained human feedback, which we use to train a Human Preference Evaluator that aligns with human perception and explicitly assesses the visual fidelity of World Simulators. In the Implicit Manipulative Evaluation, we assess the video-action consistency of World Simulators by evaluating whether the generated situation-aware video can be accurately translated into the correct control signals in dynamic environments. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.

Overview of WorldSimBench architecture.





Overview of the hierarchical capabilities of Predictive Models. Models at higher stages demonstrate more advanced capabilities. We take the initial step in evaluating Predictive Generative Models up to the S3 stage, known as World Simulators, by introducing a parallel evaluation framework, WorldSimBench. WorldSimBench assesses models through both Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, focusing on video generation and action transformation across three critical embodied scenarios.

Predictive Model Category Definition

We concretely categorize predictive models based on each model's capabilities and level of embodiment. The detailed categorization stages are illustrated below:


  • Stage S0: At this stage, predictive models can generate corresponding predictions based on instructions and observations but are limited to textual modality. Benchmarks at this stage conduct text-level and task-completion evaluations through output text planning.

  • Stage S1: At this stage, predictive models can generate visual predictions based on instructions and observations, but without incorporating temporal information. Benchmarks at this stage conduct aesthetic evaluation for generated images.

  • Stage S2: At this stage, predictive models can generate corresponding video predictions based on both instructions and observations. Yet, due to limited model capabilities, the evaluation at this level focuses solely on the aesthetic quality of the generated outputs.

  • Stage S3: At this stage, predictive models can generate corresponding video predictions based on instructions and observations, with the predicted video content adhering to physical rules and aligning with the executed actions. These models are known as World Simulators, and WorldSimBench is a benchmark specifically designed to evaluate these World Simulators.
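The four-stage hierarchy above can be summarized as a simple enumeration. This is only an illustrative sketch of the taxonomy; the class and member names are our own, not identifiers from the paper.

```python
from enum import Enum


class PredictiveModelStage(Enum):
    """Capability hierarchy of predictive models (S0-S3), per the list above."""
    S0 = "text-only predictions; text-level and task-completion evaluation"
    S1 = "visual predictions without temporal information; aesthetic evaluation"
    S2 = "video predictions; aesthetic-quality evaluation only"
    S3 = "video predictions obeying physical rules and aligned with actions"


def is_world_simulator(stage: PredictiveModelStage) -> bool:
    # Only S3-stage models qualify as World Simulators, the
    # target of WorldSimBench.
    return stage is PredictiveModelStage.S3
```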

The rapidly evolving field of World Simulators offers exciting opportunities for advancing Artificial General Intelligence, with significant potential to enhance human productivity and creativity, especially in embodied intelligence. Therefore, conducting a comprehensive embodied evaluation of World Simulators is crucial.

Explicit Perceptual Evaluation




Hierarchical Evaluation Dimensions. The dimensions are categorized into three main aspects: Visual Quality for evaluating overall quality, Condition Consistency for evaluating alignment with the input instruction, and Embodiment for evaluating embodiment-related factors such as physical rules.




Overview of Explicit Perceptual Evaluation. (Top) Prompt Generation. We use a large collection of video captions from the internet and our predefined embodied evaluation dimensions. These are expanded using GPT and manually verified to create a corresponding Task Instruction Prompt List for data generation and evaluation. (Bottom) HF-Embodied Dataset Generation. Massive internet-sourced embodied videos with captions are used to train data generation models. Fine-grained Human Feedback Annotation is then applied to the embodied videos according to the corresponding Task Instruction Prompt List, covering multiple embodied dimensions.
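Aligning an evaluator with human perception is commonly framed as learning from pairwise preferences. The sketch below uses the standard Bradley-Terry model for this purpose; this choice, and both function names, are assumptions for illustration, not the paper's stated training objective for the Human Preference Evaluator.

```python
import math


def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Probability that video A is preferred over video B, given scalar
    evaluator scores (standard Bradley-Terry / logistic model)."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def pairwise_loss(score_preferred: float, score_other: float) -> float:
    """Negative log-likelihood that the human-preferred video wins;
    minimizing this pushes the evaluator toward human judgments."""
    return -math.log(bradley_terry_prob(score_preferred, score_other))
```

With fine-grained feedback such as the HF-Embodied Dataset's per-dimension annotations, one such loss could be applied per evaluation dimension rather than to a single overall score.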




Results of Explicit Perceptual Evaluation across three embodied scenarios. Scores in each embodied scenario are normalized to 0-1. See the paper for more experimental details.
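Per-scenario score normalization to 0-1 can be done with simple min-max scaling, sketched below. Min-max scaling is an assumption here; the paper may normalize differently.

```python
def normalize_scores(scores: list[float]) -> list[float]:
    """Min-max normalize raw evaluator scores within one embodied
    scenario to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All models scored identically; map everything to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```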

Implicit Manipulative Evaluation




Overview of Implicit Manipulative Evaluation. Embodied tasks in different scenarios are decomposed into executable sub-tasks. The video generation model generates corresponding predicted videos based on the current instructions and real-time observations. Using a pre-trained IDM or a goal-based policy, the agent executes the generated sequence of actions. After a fixed timestep, the predicted video is refreshed by sampling again from the video generation model, and this process repeats. Finally, the success rates of various embodied tasks are obtained through monitors in the simulation environment.
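The closed loop described above can be sketched as follows. All object interfaces (`video_model`, `idm`, `env`) are hypothetical stand-ins for a video generation model, a pre-trained inverse dynamics model (or goal-based policy), and a simulation environment; the actual APIs in the benchmark differ.

```python
def implicit_manipulative_eval(video_model, idm, env, instruction,
                               refresh_interval=16, max_steps=200):
    """Run one embodied task and report success, a minimal sketch of the
    Implicit Manipulative Evaluation loop."""
    obs = env.reset()
    video = None
    for step in range(max_steps):
        # Refresh the predicted video at a fixed interval by
        # re-sampling from the video generation model, conditioned on
        # the instruction and the current real-time observation.
        if step % refresh_interval == 0:
            video = video_model.generate(instruction, obs)
        # Translate the generated frames into a control signal via the
        # IDM; video-action consistency determines whether this action
        # actually advances the task.
        action = idm.infer(video, obs, step % refresh_interval)
        obs, done, success = env.step(action)
        if done:
            return success
    return False
```

Success rates are then aggregated over many such episodes per task by monitors in the simulation environment.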




Results of Implicit Manipulative Evaluation across three embodied scenarios. See the paper for more experimental details.

Conclusion

In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. We conducted a comprehensive evaluation and analysis of multiple video generation models as World Simulators through both Explicit Perceptual Evaluation and Implicit Manipulative Evaluation processes. We summarize key findings from the evaluation and hope these insights will inspire and guide future research on World Simulators.

Limitations. Although we evaluate physical rules and 3D content from the perspective of embodied intelligence, World Simulators can be applied to scenarios beyond robotics, and different scenarios involve different physical representations. How to effectively evaluate World Simulators in these other scenarios therefore requires further exploration.

BibTeX

@misc{qin2024worldsimbenchvideogenerationmodels,
      title={WorldSimBench: Towards Video Generation Models as World Simulators}, 
      author={Yiran Qin and Zhelun Shi and Jiwen Yu and Xijun Wang and Enshen Zhou and Lijun Li and Zhenfei Yin and Xihui Liu and Lu Sheng and Jing Shao and Lei Bai and Wanli Ouyang and Ruimao Zhang},
      year={2024},
      eprint={2410.18072},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.18072}, 
}