

Audio-Visual Class-Incremental Learning

We introduce <b>audio-visual class-incremental learning</b>, a class-incremental learning scenario for audio-visual video recognition, and propose a method <b>AV-CIL</b>. [paper]

<div align="center"> <img width="100%" alt="AV-CIL" src="images/model.jpg"> </div>


Our experiments use Python 3.8.13 and PyTorch 1.13.0.

To set up the environment, run

pip install -r requirements.txt
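After installation, it may help to confirm that the installed PyTorch matches the version reported above. The following is a minimal sketch, not part of this repository; the helper names (`version_tuple`, `installed_version`, `matches`) are hypothetical, and it assumes the package is installed under the distribution name `torch`.

```python
from importlib import metadata
from itertools import takewhile

# Version reported in this README; other versions are untested here.
EXPECTED = {"torch": "1.13.0"}


def version_tuple(version):
    """Turn '1.13.0+cu117' into (1, 13, 0), dropping any local build suffix."""
    core = version.split("+")[0]
    numbers = []
    for piece in core.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, expected):
    """True when the installed numeric release equals the expected one."""
    return installed is not None and version_tuple(installed) == version_tuple(expected)
```

For example, `matches(installed_version("torch"), EXPECTED["torch"])` reports whether the local PyTorch release is the one used in our experiments.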



The original AVE dataset can be downloaded through link.

Please put the downloaded AVE videos in ./raw_data/AVE/videos/.


The original Kinetics dataset can be downloaded through link. After downloading the Kinetics dataset, please apply our provided video id list (here) to extract the Kinetics-Sounds dataset used in our experiments.

Please put the downloaded videos in ./raw_data/kinetics-sounds/videos/.


The original VGGSound dataset can be downloaded through link. After downloading the VGGSound dataset, please apply our provided video id list (here) to extract the VGGSound100 dataset used in our experiments.

Please put the downloaded videos in ./raw_data/VGGSound/videos/.
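The step of applying a video id list to a downloaded dataset can be sketched as follows. This is only an illustration under assumptions that should be checked against the provided lists: it assumes the list is a plain-text file with one video ID per line, and that each video file name without its extension equals the ID. The function names are hypothetical.

```python
from pathlib import Path


def load_video_ids(id_list_path):
    """Read one video ID per line, skipping blank lines (assumed file format)."""
    with open(id_list_path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def select_videos(video_dir, video_ids):
    """Return the files in video_dir whose name without extension is a listed ID."""
    return sorted(
        path
        for path in Path(video_dir).iterdir()
        if path.is_file() and path.stem in video_ids
    )
```

The selected files can then be copied or moved into the corresponding ./raw_data/ folder. If the file names carry extra fields (for example, start and end times), the `path.stem in video_ids` comparison would need to be adapted.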

Extract audio and frames

After downloading the datasets into the folders above, please run the following command to extract the audio and frames

sh extract_audios_frames.sh 'dataset'

where 'dataset' is one of [AVE, ksounds, VGGSound_100].
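To run the extraction for several datasets in a row, a small driver can build the command above and validate the dataset name first. This is a sketch only; `extraction_command` and `extract_all` are hypothetical helpers, and the script is assumed to be invoked from the repository root.

```python
import subprocess

# Dataset names accepted by extract_audios_frames.sh, as listed above.
DATASETS = ("AVE", "ksounds", "VGGSound_100")


def extraction_command(dataset):
    """Build the shell command for one dataset, rejecting unknown names."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    return ["sh", "extract_audios_frames.sh", dataset]


def extract_all(datasets=DATASETS, dry_run=True):
    """Return the commands; run them one after another only when dry_run is False."""
    commands = [extraction_command(name) for name in datasets]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # stop at the first failure
    return commands
```

Calling `extract_all()` with the default `dry_run=True` only lists the commands, which is a cheap way to check the arguments before starting a long extraction.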

Pre-trained models

For the audio encoder, please download the pre-trained AudioMAE and put it in ./model/pretrained/.

Feature extraction

To extract the pre-trained audio features, please run

sh extract_pretrained_features.sh 'dataset'

where 'dataset' is one of [AVE, ksounds, VGGSound_100].

For the AudioMAE running environment, we follow the official implementation and use timm==0.3.2, which needs a fix to work with PyTorch 1.8.1+.
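The fix itself is not reproduced in this README. As far as we are aware, the commonly applied workaround replaces the `torch._six` import in timm's `models/layers/helpers.py` with the standard-library `collections.abc`; please treat that as an assumption and verify it against the fix referenced by the official implementation. A minimal sketch of applying such a patch, with hypothetical helper names:

```python
from pathlib import Path

# Assumed old and new import lines; verify against the official fix.
OLD_IMPORT = "from torch._six import container_abcs"
NEW_IMPORT = "import collections.abc as container_abcs"


def patch_helpers_source(source):
    """Return the source text with the old import line swapped for the new one."""
    return source.replace(OLD_IMPORT, NEW_IMPORT)


def patch_file(path):
    """Patch the file in place; return True only if its content changed."""
    target = Path(path)
    original = target.read_text(encoding="utf-8")
    patched = patch_helpers_source(original)
    if patched == original:
        return False
    target.write_text(patched, encoding="utf-8")
    return True
```

The path to pass to `patch_file` depends on where timm is installed in your environment. Running the patch twice is harmless, since the second call finds nothing to replace.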

(Optional) Use our extracted features directly

We have also released the pre-trained features, which you can use directly instead of pre-processing the raw data and extracting them yourself: AVE, Kinetics-Sounds [part-1, part-2, part-3], VGGSound100[part-1, part-2, part-3, part-4, part-5, part-6].

For Kinetics-Sounds and VGGSound100, please download all the parts and concatenate them before unzipping.
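Concatenating the downloaded parts amounts to appending their bytes in order into a single archive. A minimal sketch is shown below; the function name and the file names in the usage note are hypothetical, so please substitute the actual names of the downloaded parts.

```python
import shutil
from pathlib import Path


def concatenate_parts(part_paths, output_path):
    """Append each part, in the given order, to a single output file."""
    with open(output_path, "wb") as merged:
        for part in part_paths:
            with open(part, "rb") as source:
                # Stream in chunks so large parts are not loaded into memory.
                shutil.copyfileobj(source, merged)
    return Path(output_path)
```

For example, `concatenate_parts(["part-1", "part-2", "part-3"], "features.zip")` would rebuild one archive that can then be unzipped. The order of the parts matters, so pass them from part-1 upward.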

After obtaining the pre-trained audio and visual features, please place them in ./data/'dataset'/audio_pretrained_feature/ and ./data/'dataset'/visual_pretrained_feature/.
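Before training, a quick check that both feature folders exist can save a failed run. The sketch below only encodes the two paths named above; the helper names are hypothetical, and it does not inspect the feature files themselves.

```python
from pathlib import Path


def feature_dirs(dataset, root="."):
    """Return the expected audio and visual feature folders for a dataset."""
    base = Path(root) / "data" / dataset
    return {
        "audio": base / "audio_pretrained_feature",
        "visual": base / "visual_pretrained_feature",
    }


def missing_feature_dirs(dataset, root="."):
    """List the modalities whose feature folder does not exist yet."""
    return [
        modality
        for modality, path in feature_dirs(dataset, root).items()
        if not path.is_dir()
    ]
```

An empty list from `missing_feature_dirs("AVE")` means both folders are in place for AVE.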

Training & Evaluation

For the vanilla fine-tuning strategy, please run

sh run_incremental_fine_tuning.sh 'dataset' 'modality'

where 'dataset' is one of [AVE, ksounds, VGGSound_100], and 'modality' is one of [audio, visual, audio-visual].

For the upper bound, please run

sh run_incremental_upper_bound.sh 'dataset' 'modality'

For LwF, please run

sh run_incremental_lwf.sh 'dataset' 'modality'

For iCaRL, please run

sh run_incremental_icarl.sh 'dataset' 'modality' 'classifier'

where 'classifier' is one of [NME, FC].

For SS-IL, please run

sh run_incremental_ssil.sh 'dataset' 'modality'

For AFC, please run

sh run_incremental_afc.sh 'dataset' 'modality' 'classifier'

where 'classifier' is one of [NME, LSC].

For our AV-CIL, please run

sh run_incremental_ours.sh 'dataset'
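The commands in this section differ only in the script name and in which arguments they take, so they can be assembled programmatically when sweeping over methods. The following is a sketch with hypothetical names (`METHODS`, `build_command`). It assumes that the upper bound, LwF, and SS-IL scripts accept the same dataset and modality values as fine-tuning, and that the iCaRL script is named `run_incremental_icarl.sh`; please check both against the repository.

```python
DATASETS = ("AVE", "ksounds", "VGGSound_100")
MODALITIES = ("audio", "visual", "audio-visual")

# method -> (script, takes a modality argument, allowed classifiers or None)
METHODS = {
    "fine_tuning": ("run_incremental_fine_tuning.sh", True, None),
    "upper_bound": ("run_incremental_upper_bound.sh", True, None),
    "lwf": ("run_incremental_lwf.sh", True, None),
    "icarl": ("run_incremental_icarl.sh", True, ("NME", "FC")),  # assumed name
    "ssil": ("run_incremental_ssil.sh", True, None),
    "afc": ("run_incremental_afc.sh", True, ("NME", "LSC")),
    "ours": ("run_incremental_ours.sh", False, None),
}


def build_command(method, dataset, modality=None, classifier=None):
    """Return the argument list for one run, validating each choice."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}")
    script, takes_modality, classifiers = METHODS[method]
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}")
    command = ["sh", script, dataset]
    if takes_modality:
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality {modality!r}")
        command.append(modality)
    if classifiers is not None:
        if classifier not in classifiers:
            raise ValueError(f"{method} expects a classifier in {classifiers}")
        command.append(classifier)
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root. Invalid combinations, such as AFC with the FC classifier, raise a `ValueError` before anything is launched.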


If you find this work useful, please consider citing it.

  title={Audio-Visual Class-Incremental Learning},
  author={Pian, Weiguo and Mo, Shentong and Guo, Yunhui and Tian, Yapeng},
  booktitle={IEEE/CVF International Conference on Computer Vision},