# AIGCBench

:dart::dart: AIGCBench is a novel and comprehensive benchmark designed for evaluating the capabilities of state-of-the-art video generation algorithms. Official code for the paper:

*AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI*, BenchCouncil Transactions on Benchmarks, Standards and Evaluations (TBench).

Fanda Fan, Chunjie Luo, Wanling Gao, Jianfeng Zhan

<a href='https://arxiv.org/abs/2401.01651'><img src='https://img.shields.io/badge/arXiv-2401.01651-red'></a> <a href='https://www.benchcouncil.org/AIGCBench/'><img src='https://img.shields.io/badge/Project-Website-orange'></a> <a href='https://github.com/BenchCouncil/AIGCBench'><img src='https://img.shields.io/badge/Github-Code-green'></a> <a href='https://huggingface.co/datasets/stevenfan/AIGCBench_v1.0'><img src='https://img.shields.io/badge/Huggingface-Dataset-yellow'></a>

<p align="center"> <img src="./source/I2VFramework.jpg" width="1080px"/> </p>

<em>Illustration of AIGCBench, which is divided into three modules: the evaluation dataset, the evaluation metrics, and the video generation models to be assessed.</em>

## Key Features of AIGCBench

- A diverse evaluation dataset of image-text and video-text pairs, drawn from custom-generated samples, WebVid val, and LAION-Aesthetics.
- Evaluation metrics covering four dimensions: control-video alignment, motion effects, temporal consistency, and video quality.
- Benchmarking of state-of-the-art image-to-video models, including VideoCrafter, I2VGen-XL, SVD, Pika, and Gen2.

## :fire: News

## Dataset

:smile: Our dataset is available on [Hugging Face](https://huggingface.co/datasets/stevenfan/AIGCBench_v1.0).

This dataset is intended for the evaluation of video generation tasks. Our dataset includes image-text pairs and video-text pairs. The dataset comprises three parts:

  1. Ours - custom-generated image-text samples.
  2. WebVid val - a subset of 1,000 video samples from the WebVid validation set.
  3. LAION-Aesthetics - a subset of the LAION dataset comprising 925 curated image-text samples.
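
To fetch the dataset locally, a minimal sketch using the standard `huggingface_hub` client is shown below (the repo id comes from the dataset badge above; the download location is up to you):

```python
# Minimal sketch: download the AIGCBench v1.0 evaluation dataset.
# Assumes `pip install huggingface_hub`; the repo id is taken from the
# dataset badge above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="stevenfan/AIGCBench_v1.0",
    repo_type="dataset",  # this is a dataset repo, not a model repo
)
print(f"Dataset downloaded to: {local_dir}")
```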

Below are some images we generated, with the corresponding text:

| Image | Description |
| --- | --- |
| <img src="source/265_Amidst the lush canopy of a deep jungle, a playful panda is brewing a potion, captured with the stark realism of a photo.png" width="200px" /> | Amidst the lush canopy of a deep jungle, a playful panda is brewing a potion, captured with the stark realism of a photo. |
| <img src="source/426_Behold a noble king in the throes of skillfully strumming the guitar surrounded by the tranquil waters of a serene lake, envisioned in the style of an oil painting.png" width="200px" /> | Behold a noble king in the throes of skillfully strumming the guitar surrounded by the tranquil waters of a serene lake, envisioned in the style of an oil painting. |
| <img src="source/619_Amidst a sun-dappled forest, a mischievous fairy is carefully repairing a broken robot, captured in the style of an oil painting.png" width="200px" /> | Amidst a sun-dappled forest, a mischievous fairy is carefully repairing a broken robot, captured in the style of an oil painting. |
| <img src="source/824_Within the realm of the backdrop of an alien planet's red skies, a treasure-seeking pirate cleverly solving a puzzle, each moment immortalized in the style of an oil painting.png" width="200px" /> | Within the realm of the backdrop of an alien planet's red skies, a treasure-seeking pirate cleverly solving a puzzle, each moment immortalized in the style of an oil painting. |

## Metrics

We have encapsulated the evaluation metrics used in our paper in `eval.py`; for more details, please refer to the paper. Before using the code, download the CLIP model weights and replace `'path_to_dir'` with the actual path.

Below is a simple example:

```python
import glob
import os

from eval import compute_video_video_similarity  # metrics are encapsulated in eval.py

# Path to the reference (ground-truth) video to compare against.
ref_video_path = 'path_to_reference_video.mp4'

# Collect all generated videos in the target directory.
batch_video_path = os.path.join('path_to_videos', '*.mp4')
video_path_list = sorted(glob.glob(batch_video_path))

sum_res = 0
cnt = 0
for video_path in video_path_list:
    res = compute_video_video_similarity(ref_video_path, video_path)
    sum_res += res['clip']   # accumulate the CLIP similarity score
    cnt += res['state']      # 'state' flags whether the computation succeeded
print(sum_res / cnt)         # mean CLIP similarity over successfully scored videos
```
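
Note that `cnt` is incremented by `res['state']`, so the printed value is the mean CLIP similarity averaged only over the videos for which the computation succeeded.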

## Evaluation Results

Quantitative analysis for different Image-to-Video algorithms. An upward arrow indicates that higher values are better, while a downward arrow means lower values are preferable.

| Dimensions | Metrics | VideoCrafter | I2VGen-XL | SVD | Pika | Gen2 |
| --- | --- | --- | --- | --- | --- | --- |
| Control-video Alignment | MSE (First) ↓ | 3929.65 | 4491.90 | 640.75 | 155.30 | 235.53 |
| | SSIM (First) ↑ | 0.300 | 0.354 | 0.612 | 0.800 | 0.803 |
| | Image-GenVideo Clip ↑ | 0.830 | 0.832 | 0.919 | 0.930 | 0.939 |
| | GenVideo-Text Clip ↑ | 0.23 | 0.24 | - | 0.271 | 0.270 |
| | GenVideo-RefVideo Clip (Keyframes) ↑ | 0.763 | 0.764 | - | 0.824 | 0.820 |
| Motion Effects | Flow-Square-Mean | 1.24 | 1.80 | 2.52 | 0.281 | 1.18 |
| | GenVideo-RefVideo Clip (Corresponding frames) ↑ | 0.764 | 0.764 | 0.796 | 0.823 | 0.818 |
| Temporal Consistency | GenVideo Clip (Adjacent frames) ↑ | 0.980 | 0.971 | 0.974 | 0.996 | 0.995 |
| | GenVideo-RefVideo Clip (Corresponding frames) ↑ | 0.764 | 0.764 | 0.796 | 0.823 | 0.818 |
| Video Quality | Frame Count ↑ | 16 | 32 | 25 | 72 | 96 |
| | DOVER ↑ | 0.518 | 0.510 | 0.623 | 0.715 | 0.775 |
| | GenVideo-RefVideo SSIM ↑ | 0.367 | 0.304 | 0.507 | 0.560 | 0.504 |
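
As a concrete illustration of the control-video alignment metrics above, here is a sketch of how MSE (First) and SSIM (First) can be computed between the input image and the first generated frame. It uses OpenCV and scikit-image rather than the repository's own `eval.py`, and the file paths are placeholders:

```python
# Sketch: MSE (First) and SSIM (First) between the input image and the
# first frame of a generated video. Illustrative re-implementation using
# OpenCV + scikit-image, not the repository's eval.py.
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def first_frame(video_path):
    """Return the first frame of a video as an RGB numpy array."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError(f"Could not read a frame from {video_path}")
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

# Placeholder paths for the input image and a generated video.
ref_image = cv2.cvtColor(cv2.imread("input_image.png"), cv2.COLOR_BGR2RGB)
gen_frame = first_frame("generated_video.mp4")

# Resize the generated frame to the reference resolution before comparing.
gen_frame = cv2.resize(gen_frame, (ref_image.shape[1], ref_image.shape[0]))

# MSE (First): mean squared pixel error; lower is better.
mse = np.mean((ref_image.astype(np.float64) - gen_frame.astype(np.float64)) ** 2)

# SSIM (First): structural similarity on grayscale frames; higher is better.
ssim = structural_similarity(
    cv2.cvtColor(ref_image, cv2.COLOR_RGB2GRAY),
    cv2.cvtColor(gen_frame, cv2.COLOR_RGB2GRAY),
)
print(f"MSE (First): {mse:.2f}  SSIM (First): {ssim:.3f}")
```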

To validate that our proposed evaluation standards align with human preferences, we conducted a user study. We randomly selected 30 generated results from each of the five methods and asked participants to vote for the best algorithm's outputs along four dimensions: Image Fidelity, Motion Effects, Temporal Consistency, and Video Quality. A total of 42 individuals took part in the voting. The results are presented below:

<img src="source/radar_chart_high_res.jpg" alt="Alt text" width="600">

## Contact Us

:email: If you have any questions, please feel free to contact us via email at fanfanda@ict.ac.cn and jianfengzhan.benchcouncil@gmail.com.

## Citation

If you find our work useful in your research, please consider citing our paper:

```bibtex
@article{fan2024aigcbench,
  title={AIGCBench: Comprehensive evaluation of image-to-video content generated by AI},
  author={Fan, Fanda and Luo, Chunjie and Gao, Wanling and Zhan, Jianfeng},
  journal={BenchCouncil Transactions on Benchmarks, Standards and Evaluations},
  pages={100152},
  year={2024},
  publisher={Elsevier}
}
```