Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks are unable to effectively evaluate higher-capability, highly embodied predictive models from an embodied perspective. In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks, covering three representative embodied scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. In the Explicit Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment dataset based on fine-grained human feedback, which we use to train a Human Preference Evaluator that aligns with human perception and explicitly assesses the visual fidelity of World Simulators. In the Implicit Manipulative Evaluation, we assess the video-action consistency of World Simulators by evaluating whether the generated situation-aware video can be accurately translated into the correct control signals in dynamic environments. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
We concretely categorize predictive models based on the model's capabilities and level of embodiment. The detailed categorization stages are illustrated below:
The rapidly evolving field of World Simulators offers exciting opportunities for advancing Artificial General Intelligence, with significant potential to enhance human productivity and creativity, especially in embodied intelligence. Therefore, conducting a comprehensive embodied evaluation of World Simulators is crucial.
In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. We conducted a comprehensive evaluation and analysis of multiple video generation models as World Simulators through both Explicit Perceptual Evaluation and Implicit Manipulative Evaluation processes. We summarize key findings from the evaluation and hope these insights will inspire and guide future research on World Simulators.
Limitations. Although we evaluate physical rules and 3D content from the perspective of embodied intelligence, World Simulators can be applied to more scenarios than robotics, and different scenarios involve additional physical representations. How to effectively evaluate World Simulators in those other scenarios therefore requires further exploration.
@misc{qin2024worldsimbenchvideogenerationmodels,
title={WorldSimBench: Towards Video Generation Models as World Simulators},
author={Yiran Qin and Zhelun Shi and Jiwen Yu and Xijun Wang and Enshen Zhou and Lijun Li and Zhenfei Yin and Xihui Liu and Lu Sheng and Jing Shao and Lei Bai and Wanli Ouyang and Ruimao Zhang},
year={2024},
eprint={2410.18072},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.18072},
}