


This repo describes how to reproduce the evaluation results in the FETV paper.
If you only want to evaluate a new text-to-video (T2V) generation model using the FETV benchmark and UMT-based metrics, please refer to the folder UMT.

1. Video Collection

1.1 Collect Generated Videos

We evaluate four T2V models: CogVideo, Text2Video-Zero, ModelScopeT2V and ZeroScope.

The generated videos and processed video frames are provided in this Hugging Face dataset. Download them to datas/videos.

The folder is structured as follows:

	├── cogvideo
	│   ├── videos
	│   └── 16frames_uniform   
	├── cogvideo_fid
	├── text2video-zero
	├── text2video-zero_fid
	├── modelscope-t2v
	├── modelscope-t2v_fid
	├── zeroscope
	└── zeroscope_fid

cogvideo/videos contains 619 videos generated from the prompts in datas/fetv_data.json. cogvideo/16frames_uniform contains uniformly sampled frames of these videos. cogvideo_fid contains 2,055 videos generated from the prompts in datas/sampled_prompts_for_fid_fvd/prompts_gen.json, which are used to compute FID and FVD.
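
After downloading, a quick sanity check of the folder layout can catch incomplete downloads. The following is a minimal sketch, not part of the repo: the helper name `summarize_video_root` is hypothetical, and it assumes every model folder mirrors the `videos` / `16frames_uniform` sub-folders shown for cogvideo.

```python
from pathlib import Path

# Model folder names as listed in the directory tree above.
MODELS = ["cogvideo", "text2video-zero", "modelscope-t2v", "zeroscope"]


def summarize_video_root(root):
    """Count entries under each expected sub-folder; None marks a missing folder.

    Assumption: every model has the same sub-folders as cogvideo.
    """
    root = Path(root)
    summary = {}
    for model in MODELS:
        for sub in (f"{model}/videos", f"{model}/16frames_uniform", f"{model}_fid"):
            folder = root / sub
            summary[sub] = len(list(folder.iterdir())) if folder.is_dir() else None
    return summary


if __name__ == "__main__":
    for name, count in summarize_video_root("datas/videos").items():
        print(f"{name}: {'missing' if count is None else count}")
```

For cogvideo, the description above suggests 619 entries under cogvideo/videos; how the entries inside the other folders are organized is not stated here, so treat the counts as a rough check only.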

<details> <summary>If you want to generate the videos yourself, follow these steps:</summary>

Generate and Process Videos

python utils/video2frames.py \
--video_ext .mp4 \
--frm_num 16 \
--video_root_path $path/to/the/generated/videos$ \
--target_root_path $path/to/the/processed/frames$ \
--sampling_strategy uniform

To compute CLIPScore, BLIPScore and FID, we adopt the "uniform" frame sampling strategy. To compute FVD, we adopt the "offset" frame sampling strategy following stylegan-v. The processed frames are structured as follows:

	├── sent0
	│   ├── frame0.jpg
	│   ├── frame1.jpg
	│   ...
	│   └── frame15.jpg
	├── sent1
	└── sent618
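
To make the "uniform" strategy concrete, the sketch below picks evenly spaced frame indices from a video. It is an illustration under assumptions, not the logic of utils/video2frames.py, which may round or space frames differently; the function name `uniform_frame_indices` is hypothetical.

```python
def uniform_frame_indices(total_frames, num_samples=16):
    """Return num_samples indices spread evenly over [0, total_frames - 1].

    Assumption: first and last frames are always included; the repo's
    script may use a different spacing rule.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


if __name__ == "__main__":
    # A 31-frame video sampled down to 16 frames (frame0.jpg ... frame15.jpg).
    print(uniform_frame_indices(31, 16))
```

When a video has fewer frames than requested, this sketch repeats indices rather than failing.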

1.2 Collect Reference Real Videos

NOTE: You can also contact liuyuanxin@stu.pku.edu.cn to access the reference real videos.

2. Manual Evaluation

The manual evaluation results from three human evaluators can be found in manual_eval_results. To visualize them as radar plots, run python utils/visualization_manual_results.py:

Results of static and temporal video quality

Results of video-text alignment

We also release our manual evaluation instructions, with carefully designed rating-level definitions and examples. We hope they help improve inter-human correlation when evaluating T2V generation models.

3. Automatic Evaluation

3.1 Video-Text Alignment

3.1.1 CLIPScore and BLIPScore

Run the following command to compute CLIPScore and BLIPScore:

  python auto_eval.py \
    --eval_model ViT-B/32 \
    --blip_config BLIP/blip_config.yaml \
    --prompt_file datas/fetv_data.json \
    --gen_path datas/videos/modelscope-t2v/16frames_uniform \
    --t2v_model modelscope-t2v \
    --is_clip_ft false \
    --save_results true
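
As background, frame-based alignment scores of this kind typically compare a text embedding with the embeddings of the sampled frames. The sketch below averages text-frame cosine similarity over frames; it assumes the embeddings were already produced by an encoder such as CLIP or BLIP, and it is not the exact computation in auto_eval.py, which may scale or clip the similarities differently. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def video_text_score(text_embedding, frame_embeddings):
    """Average the text-frame cosine similarity over all sampled frames.

    Assumption: one embedding per frame (e.g. 16 for 16frames_uniform).
    """
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    sims = [cosine_similarity(text_embedding, f) for f in frame_embeddings]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D embeddings: one frame matches the text, one is orthogonal.
    print(video_text_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```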

3.1.2 UMTScore

Please refer to the folder UMT for how to compute the UMTScore.

3.1.3 Correlation between Automatic Metrics and Humans

To compute the correlation between automatic and human judgements of video-text alignment, run

  python auto_human_correlation.py

The script prints the correlation results to the console.
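
For intuition about what such a correlation measures, the sketch below computes Spearman's rank correlation between two score lists in plain Python. Which coefficient auto_human_correlation.py actually reports is not stated here, so Spearman's is an assumption chosen for illustration, and the scores in the example are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    metric_scores = [0.21, 0.35, 0.30, 0.50]  # made-up automatic scores
    human_scores = [1, 3, 2, 5]               # made-up human ratings
    print(spearman(metric_scores, human_scores))
```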

3.2 Video Quality

3.2.1 FID

To compute FID, run

  python compute_fid.py \
    --model modelscope-t2v

The overall and fine-grained (per-category) results will be saved to auto_eval_results/fid_results and auto_eval_results/fid_fg_results, respectively.
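
For reference, FID is the Fréchet distance between two Gaussians fitted to image features of real and generated frames: the squared distance between the means plus Tr(C_r + C_g - 2 (C_r C_g)^{1/2}). The sketch below implements that formula with NumPy, assuming the features have already been extracted; it does not reproduce the feature extraction or any other detail of compute_fid.py, and the function name is hypothetical.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))
    # Tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite covariances (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 4))   # stand-in for real-frame features
    fake = real + 1.0                  # same covariance, mean shifted by 1
    print(frechet_distance(real, fake))
```

Because the estimate depends on fitted means and covariances, it is sensitive to the number of samples, which is why a larger set of generated videos is used for FID and FVD.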

3.2.2 FVD

To compute FVD over the entire FETV benchmark, enter the folder stylegan-v and run

    bash run_fvd_modelscope-t2v.sh

Replace modelscope-t2v in the script name to evaluate a different T2V generation model. The results will be saved to auto_eval_results/fvd_results.

To compute FVD of different categories, run

    python compute_fg_fvd.py \
      --model modelscope-t2v

The results will be saved to auto_eval_results/fvd_fg_results.
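
To cover all four models in one go, the command above can be wrapped in a small loop. This is a convenience sketch rather than a script shipped with the repo: it assumes that `--model` accepts the same names as the folders under datas/videos and that it is launched from the directory containing compute_fg_fvd.py.

```python
import subprocess
import sys

# Assumption: --model accepts the folder names used under datas/videos.
MODELS = ["cogvideo", "text2video-zero", "modelscope-t2v", "zeroscope"]


def fg_fvd_command(model):
    """Build the per-category FVD command for one model."""
    return [sys.executable, "compute_fg_fvd.py", "--model", model]


def run_all(dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    commands = [fg_fvd_command(m) for m in MODELS]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands


if __name__ == "__main__":
    run_all(dry_run=True)
```

Call `run_all(dry_run=False)` to actually launch the evaluations.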

3.2.3 FVD-UMT

Please refer to the folder UMT for how to compute the FVD-UMT.

3.2.4 Correlation between Automatic Metrics and Humans

To visualize the automatic and human rankings of T2V models in terms of video quality, run

    python plot_fid_fvd_human.py

To visualize the fine-grained results in different categories, run

    python plot_fid_fvd_human_fg.py

To visualize the effect of the number of video samples on FID and FVD, run

    python plot_fid_fvd_numvideo.py