PhyGenBench

<p align="left"> <a href="#🚀-quick-start"><b>Quick Start</b></a> | <a href="https://phygenbench123.github.io/"><b>HomePage</b></a> | <a href="https://arxiv.org/abs/2410.05363"><b>arXiv</b></a> | <a href="#🖊️-citation"><b>Citation</b></a> <br> </p>

This repository is the official implementation of PhyGenBench.

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng<sup>*</sup>, Jiaqi Liao<sup>*</sup>, Xinyu Tan, Wenqi Shao<sup>#</sup>, Quanfeng Lu, Kaipeng Zhang, Cheng Yu, Dianqi Li, Yu Qiao, Ping Luo<sup>#</sup>
<sup>*</sup> MFQ and LJQ contributed equally.
<sup>#</sup> SWQ (shaowenqi@pjlab.org.cn) and LP are corresponding authors.

💡 News

🎩Introduction

We introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in text-to-video (T2V) generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which together comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure that uses appropriate advanced vision-language models and large language models to assess physical commonsense. With PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense that align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic physical phenomena). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications.

<img src="static/overview.png" alt="overview" style="zoom:80%;" />

📖PhyGenEval

<img src="static/phyeval.png" alt="overview" style="zoom:80%;" />

We design a progressive strategy that starts with key physical phenomena, then checks the order in which several key phenomena occur, and finally evaluates the overall naturalness of the entire video. This hierarchical and refined approach is less difficult than existing methods that directly use VLMs to evaluate physical commonsense, enabling PhyGenEval to achieve results closely aligned with human judgements.

🏆 Leaderboard

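| Model | Size | Mechanics(↑) | Optics(↑) | Thermal(↑) | Material(↑) | Average(↑) | Human(↑) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CogVideoX | 2B | 0.38 | 0.43 | 0.34 | 0.39 | 0.37 | 0.31 |
| CogVideoX | 5B | 0.43 | 0.55 | 0.40 | 0.42 | 0.45 | 0.37 |
| Open-Sora V1.2 | 1.1B | 0.43 | 0.50 | 0.44 | 0.37 | 0.44 | 0.35 |
| Lavie | 860M | 0.40 | 0.44 | 0.38 | 0.32 | 0.36 | 0.30 |
| Vchitect 2.0 | 2B | 0.41 | 0.56 | 0.44 | 0.37 | 0.45 | 0.36 |
| Pika | - | 0.35 | 0.56 | 0.43 | 0.39 | 0.44 | 0.36 |
| Gen-3 | - | 0.45 | 0.57 | 0.49 | 0.51 | 0.51 | 0.48 |
| Kling | - | 0.45 | 0.58 | 0.50 | 0.40 | 0.49 | 0.44 |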

🚀 Quick Start

File Structure

Environment

git clone https://github.com/OpenGVLab/PhyGenBench
cd PhyGenBench

If you only want to test with the closed-source model, you only need to configure the VQAScore environment. If you want to ensemble closed-source and open-source models, you need to configure the VQAScore, LLaVA-Interleave, and InternVideo2 environments and download the corresponding models.

Question Generation

First, we generate the corresponding questions for Key Physical Phenomena Detection, Physics Order Verification, and Overall Naturalness Evaluation. For brevity, we refer to them as the single stage, multi stage, and video stage, based on the VLM used.

# single
python PhyGenEval/single/generate_question.py

# multi
python PhyGenEval/multi/generate_question.py

# video
python PhyGenEval/video/generate_question.py

PhyGenBench/single_question.json, PhyGenBench/multi_question.json, and PhyGenBench/video_question.json contain the questions we generated for each stage.
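Before running the evaluation, it can help to confirm that the three question files are present and readable. The following is a minimal sketch, assuming it is run from the repository root. The file paths come from the text above; the helper name `summarize_json` is hypothetical, and because the internal layout of the files is not described here, the sketch only reports the top-level JSON type and entry count.

```python
import json
from pathlib import Path

# File names are taken from the README. The structure inside each file is not
# documented there, so this helper only reports generic information.
QUESTION_FILES = [
    "PhyGenBench/single_question.json",
    "PhyGenBench/multi_question.json",
    "PhyGenBench/video_question.json",
]


def summarize_json(path):
    """Return (top-level type name, number of entries) for a JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Lists and dicts have a meaningful length; treat any scalar as one entry.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    for name in QUESTION_FILES:
        if not Path(name).exists():
            print(f"{name}: not found (run this from the repository root)")
            continue
        kind, size = summarize_json(name)
        print(f"{name}: top-level {kind} with {size} entries")
```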

Three-tier Evaluation

All of our evaluations run on a single A100-80G GPU. The Python files that need to be run are listed below for each stage. Please write the appropriate launch script for your system (slurm or ...).
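For Slurm users, a launch script can be generated with a few lines of Python. This is a minimal sketch, not part of the repository: the function name `make_sbatch_script` and the default job name are hypothetical, and cluster-specific options such as partition or account are deliberately left out because they depend on your site. It requests one GPU, matching the single-GPU setup described above.

```python
import shlex


def make_sbatch_script(python_file, job_name="phygeneval", workdir="."):
    """Build the text of a Slurm batch script that runs one Python file on one GPU."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --gres=gpu:1",                 # request a single GPU
        f"#SBATCH --output={job_name}_%j.log",  # %j expands to the Slurm job ID
        # Add site-specific directives (partition, account, time limit) here.
        f"cd {shlex.quote(workdir)}",
        f"python {shlex.quote(python_file)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = make_sbatch_script("PhyGenEval/single/vqascore.py")
    with open("run_single.sh", "w", encoding="utf-8") as f:
        f.write(script)
    print(script)
```

The generated file can then be submitted with `sbatch run_single.sh`. Stages that need a different working directory or Python environment would need their own script with the matching `workdir` and environment activation.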

Key Physical Phenomena Detection:

python PhyGenEval/single/vqascore.py

Physics Order Verification:

# the vqascore environment may conflict with the llava-interleave environment,
# so we first retrieve the keyframes and then run the multi-image QA

# first run the retrieval and record the retrieval score
# (this step uses the same environment as vqascore)

python PhyGenEval/multi/multiimage_clip.py

# then do the multi-image qa
# for gpt-4o
python PhyGenEval/multi/GPT4o.py

# for llava
cd PhyGenEval/multi/LLaVA-NeXT-interleave_inference
python llava/eval/model_vqa_multi.py

Overall Naturalness Evaluation:

# for gpt4o
python PhyGenEval/video/GPT4o.py

# for internvideo2
cd PhyGenEval/video/MTScore
python InternVideo_physical.py

Overall Score Calculation

python PhyGenEval/overall.py
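The closed-source path through the three tiers can be chained in one small driver. The sketch below is an illustration rather than a script shipped with the repository: the names `CLOSED_SOURCE_STAGES` and `run_stages` are hypothetical, the script paths are the ones listed above, and it assumes that a single Python environment (the VQAScore one) is active, that the question files already exist, and that it is run from the repository root. The open-source stages (LLaVA-Interleave, InternVideo2) are left out because they need their own environments and working directories.

```python
import subprocess
import sys

# (description, script path relative to the repository root), in run order.
CLOSED_SOURCE_STAGES = [
    ("Key Physical Phenomena Detection", "PhyGenEval/single/vqascore.py"),
    ("Physics Order Verification: keyframe retrieval", "PhyGenEval/multi/multiimage_clip.py"),
    ("Physics Order Verification: multi-image QA (GPT-4o)", "PhyGenEval/multi/GPT4o.py"),
    ("Overall Naturalness Evaluation (GPT-4o)", "PhyGenEval/video/GPT4o.py"),
    ("Overall Score Calculation", "PhyGenEval/overall.py"),
]


def run_stages(stages, dry_run=True):
    """Print each stage command in order and, unless dry_run, execute it.

    Execution stops at the first failing stage because check=True raises
    subprocess.CalledProcessError on a non-zero exit code.
    """
    planned = []
    for description, script in stages:
        command = [sys.executable, script]
        planned.append((description, script))
        print(f"[{description}] {' '.join(command)}")
        if not dry_run:
            subprocess.run(command, check=True)
    return planned


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_stages(CLOSED_SOURCE_STAGES, dry_run=True)
```

Keeping the default as a dry run makes it easy to check the order before committing GPU time; pass `dry_run=False` to actually launch the stages.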

🎬Qualitative Analysis

<img src="static/qualitative.png" alt="overview" style="zoom:80%;" />

📒Note

📧 Contact

If you have any questions, feel free to contact Fanqing Meng at mengfanqing33@gmail.com.