

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Junyu Xie<sup>1</sup>, Tengda Han<sup>1</sup>, Max Bain<sup>1</sup>, Arsha Nagrani<sup>1</sup>, Gül Varol<sup>1</sup> <sup>2</sup>, Weidi Xie<sup>1</sup> <sup>3</sup>, Andrew Zisserman<sup>1</sup>

<sup>1</sup> Visual Geometry Group, Department of Engineering Science, University of Oxford <br> <sup>2</sup> LIGM, École des Ponts, Univ Gustave Eiffel, CNRS <br> <sup>3</sup> CMIC, Shanghai Jiao Tong University

<a src="https://img.shields.io/badge/cs.CV-2407.15850-b31b1b?logo=arxiv&logoColor=red" href="https://arxiv.org/abs/2407.15850"> <img src="https://img.shields.io/badge/cs.CV-2407.15850-b31b1b?logo=arxiv&logoColor=red"></a> <a href="https://www.robots.ox.ac.uk/~vgg/research/autoad-zero/" alt="Project page"> <img alt="Project page" src="https://img.shields.io/badge/project_page-autoad--zero-blue"></a> <a href="https://www.robots.ox.ac.uk/~vgg/research/autoad-zero/#tvad" alt="Dataset"> <img alt="Dataset" src="https://img.shields.io/badge/dataset-TV--AD-purple"></a> <br> <br> <p align="center"> <img src="assets/teaser.PNG" width="750"/> </p>



In this work, we evaluate our model on CMD-AD, MAD-Eval, and TV-AD.

Video Frames

Ground Truth AD Annotations


Character Recognition

The pre-computed character recognition results are available in resources/annotations (e.g. resources/annotations/cmdad_anno_with_face_0.2_0.4.csv) and can be fed directly into Stage I (next step).

It is also possible to run the character recognition code from scratch. Please refer to the char_recog folder for more details.
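Before passing a pre-computed annotation file to Stage I, it can help to check which columns it actually contains. The following is a minimal sketch using only the Python standard library. It makes no assumption about the column names, and the helper name `summarize_annotation_csv` is hypothetical (it is not part of this repository).

```python
import csv
from pathlib import Path


def summarize_annotation_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        row_count = sum(1 for _ in reader)  # counts data rows, header excluded
    return columns, row_count


if __name__ == "__main__":
    # Example path taken from the text above; run from the repository root.
    anno_path = "resources/annotations/cmdad_anno_with_face_0.2_0.4.csv"
    if Path(anno_path).exists():
        columns, row_count = summarize_annotation_csv(anno_path)
        print(f"{row_count} rows; columns: {columns}")
    else:
        print(f"{anno_path} not found; run this from the repository root.")
```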


Stage I: VLM-Based Dense Video Description

python stage1/main.py \
--dataset={dataset} \
--video_dir={video_dir} \
--anno_path={anno_path} \
--charbank_path={charbank_path} \
--model_path={videollama2_ckpt_path} \
--output_dir={output_dir}

--dataset: choices are cmdad, madeval, and tvad. <br> --video_dir: directory of the video dataset; example file structures can be found in resources/example_file_structures (the files are empty and for reference only). <br> --anno_path: path to AD annotations (with predicted face IDs and bboxes), available in resources/annotations, e.g. resources/annotations/cmdad_anno_with_face_0.2_0.4.csv. <br> --charbank_path: path to external character banks, available in resources/charbanks, e.g. resources/charbanks/cmdad_charbank.json. <br> --model_path: path to the VideoLLaMA2 checkpoint. <br> --output_dir: directory to save the output csv. <br>
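When Stage I is launched from a script rather than typed by hand, the options above can be assembled and checked programmatically. The following is a minimal sketch under stated assumptions: it uses only the flags documented above, the helper name `build_stage1_command` is hypothetical, and the `path/to/...` values are placeholders to replace with your own paths.

```python
import shlex
import sys

# Dataset choices as listed for --dataset above.
DATASETS = ("cmdad", "madeval", "tvad")


def build_stage1_command(dataset, video_dir, anno_path, charbank_path,
                         model_path, output_dir):
    """Build the argument list for stage1/main.py from the documented flags."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    return [
        sys.executable, "stage1/main.py",
        f"--dataset={dataset}",
        f"--video_dir={video_dir}",
        f"--anno_path={anno_path}",
        f"--charbank_path={charbank_path}",
        f"--model_path={model_path}",
        f"--output_dir={output_dir}",
    ]


if __name__ == "__main__":
    cmd = build_stage1_command(
        dataset="cmdad",
        video_dir="path/to/videos",                # placeholder
        anno_path="resources/annotations/cmdad_anno_with_face_0.2_0.4.csv",
        charbank_path="resources/charbanks/cmdad_charbank.json",
        model_path="path/to/videollama2_ckpt",     # placeholder
        output_dir="path/to/output",               # placeholder
    )
    # Print the command for inspection; pass `cmd` to subprocess.run(cmd, check=True)
    # from the repository root to actually launch it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems when paths contain spaces.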

Stage II: LLM-Based AD Summary

python stage2/main.py \
--dataset={dataset} \
--pred_path={pred_path}

--dataset: choices are cmdad, madeval, and tvad. <br> --pred_path: path to the csv file saved by Stage I.
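Since Stage II consumes the csv written by Stage I, a small guard can catch a missing file before the LLM is loaded. The following is a minimal sketch, assuming only the two flags documented above; the helper name `build_stage2_command` is hypothetical, and the name of the Stage I csv is not specified here, so it is left as a parameter.

```python
import sys
from pathlib import Path


def build_stage2_command(dataset, pred_path):
    """Build the argument list for stage2/main.py after checking the input csv."""
    pred = Path(pred_path)
    if not pred.is_file():
        raise FileNotFoundError(f"Stage I output not found: {pred}")
    return [
        sys.executable, "stage2/main.py",
        f"--dataset={dataset}",
        f"--pred_path={pred}",
    ]


if __name__ == "__main__":
    try:
        # Placeholder path: point this at the csv saved by Stage I.
        print(build_stage2_command("cmdad", "path/to/stage1_output.csv"))
    except FileNotFoundError as err:
        print(err)
```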

Inference with GPT-4o via OpenAI API

Note: Before starting, insert your OpenAI API key into the corresponding main.py file. <br> Note: This setup was not officially tested or reported in the original paper. You may want to adjust the text prompts to obtain better or more robust outputs.
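As an alternative to pasting the key into the source file, the key can be read from an environment variable so that it is never committed by accident. The following is a minimal sketch of that pattern, not how this repository's main.py is written: the helper name `get_openai_api_key` is hypothetical, and wiring its return value into main.py is left to the reader. `OPENAI_API_KEY` is the variable name the official OpenAI Python client looks for by default.

```python
import os


def get_openai_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running the GPT-4o stages."
        )
    return key
```

The key would then be set once per shell session (for example with `export OPENAI_API_KEY=...`) rather than stored in the repository.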

Stage I: VLM-Based Dense Video Description

python stage1_gpt/main.py \
--dataset={dataset} \
--video_dir={video_dir} \
--anno_path={anno_path} \
--charbank_path={charbank_path}

For example, --dataset=cmdad, --anno_path=resources/annotations/cmdad_anno_with_face_0.2_0.4.csv, and --charbank_path=resources/charbanks/cmdad_charbank.json.

Stage II: LLM-Based AD Summary

python stage2_gpt/main.py \
--dataset={dataset}

For example, --dataset=cmdad.


If you find this repository helpful, please consider citing our work:

	title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
	author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G\"ul Varol and Weidi Xie and Andrew Zisserman},


VideoLLaMA2: https://github.com/DAMO-NLP-SG/VideoLLaMA2 <br> LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct