Awesome
Audio-Visual Class-Incremental Learning
We introduce <b>audio-visual class-incremental learning</b>, a class-incremental learning scenario for audio-visual video recognition, and propose a method <b>AV-CIL</b>. [paper]
<div align="center"> <img width="100%" alt="AV-CIL" src="images/model.jpg"> </div>Environment
We conduct experiments with Python 3.8.13 and Pytorch 1.13.0.
To setup the environment, please simply run
pip install -r requirements.txt
Datasets
AVE
The original AVE dataset can be downloaded through link.
Please put the downloaded AVE videos in ./raw_data/AVE/videos/.
Kinetics-Sounds
The original Kinetics dataset can be downloaded through link. After downloading the Kinetics dataset, please apply our provided video id list (here) to extract the Kinetics-Sounds dataset used in our experiments.
Please put the downloaded videos in ./raw_data/kinetics-sounds/videos/.
VGGSound100
The original VGGSound dataset can be downloaded through link. After downloading the VGGSound dataset, please apply our provided video id list (here) to extract the Kinetics-Sounds dataset used in our experiments.
Please put the downloaded videos in ./raw_data/VGGSound/videos/.
Extract audio and frames
After downloading the datasets to the folds, please run the following command to extract the audios and frames
sh extract_audios_frames.sh 'dataset'
where the 'dataset' should be in [AVE, ksounds, VGGSound_100].
Pre-trained models
For the audio encoder, please download the pre-trained AudioMAE and put it in ./model/pretrained/.
Feature extraction
For the pre-trained audio features extraction, please run
sh extract_pretrained_features 'dataset'
where the 'dataset' should be in [AVE, ksounds, VGGSound_100].
For the running environment of the AudioMAE, we follow the official implementation and use timm==0.3.2, for which a fix is needed to work with Pytorch 1.8.1+.
(option) Use our extracted features directly
We also released the pre-trained features, you can use them directly instead of pre-processing and extracting them from the raw data: AVE, Kinetics-Sounds [part-1, part-2, part-3], VGGSound100[part-1, part-2, part-3, part-4, part-5, part-6].
For Kinetics-Sounds and VGGSound100, please download all the parts and concatenate them before unzipping.
After obtaining the pre-trained audio and visual features, please put them to ./data/'dataset'/audio_pretrained_feature/ and ./data/'dataset'/visual_pretrained_feature/.
Training & Evaluation
For vanilla fine-tuning strategy, please run
sh run_incremental_fine_tuning.sh 'dataset' 'modality'
where the 'dataset' should be in [AVE, ksounds, VGGSound_100], and the 'modality' should be in [audio, visual, audio-visual].
For the upper bound, please run
sh run_incremental_upper_bound.sh 'dataset' 'modality'
For LwF, please run
sh run_incremental_lwf.sh 'dataset' 'modality'
For iCaRL, please run
sh run_incremental_lwf.sh 'dataset' 'modality' 'classifier'
where the 'classifier' should be in [NME, FC].
For SS-IL, please run
sh run_incremental_ssil.sh 'dataset' 'modality'
For AFC, please run
sh run_incremental_afc.sh 'dataset' 'modality' 'classifier'
where the 'classifier' should be in [NME, LSC].
For our AV-CIL, please run
sh run_incremental_ours.sh 'dataset'
Citation
If you find this work useful, please consider citing it.
@inproceedings{pian2023audiovisual,
title={Audio-Visual Class-Incremental Learning},
author={Pian, Weiguo and Mo, Shentong and Guo, Yunhui and Tian, Yapeng},
booktitle={IEEE/CVF International Conference on Computer Vision},
year={2023}
}