<p align="center"> <img width="320" height="320" src="figures/COMBO_logo.png"> <h1 align="center">COMBO-AVS</h1> </p>

Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan and Shiming Xiang

This repository provides the PyTorch implementation for the paper "Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation" accepted by CVPR 2024 (Highlight).

## 🔥 What's New

## 🪵 TODO List

## 📚 Method

<p align="center"> <img src="figures/Architecture.png"> <h6 align="center">Overview of the proposed COMBO.</h6> </p>

## 🛠️ Getting Started

### 1. Environments

```shell
# recommended
pip install -r requirements.txt
pip install soundfile

# build MSDeformAttention
cd models/modeling/pixel_decoder/ops
sh make.sh
# may raise ValueError("Cannot match one checkpoint key to multiple keys in the model.")

# Semantic-SAM
pip install git+https://github.com/cocodataset/panopticapi.git
git clone https://github.com/UX-Decoder/Semantic-SAM
cd Semantic-SAM
python -m pip install -r requirements.txt
```

For more details, see the Semantic-SAM repository.
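After installation, a quick sanity check can confirm that the key dependencies resolve. This is a minimal sketch; the module names checked below are assumptions based on the setup steps above, so adjust them to your environment:

```python
import importlib.util

def check_modules(names):
    """Return a dict mapping each module name to whether it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Modules assumed by the setup steps above (adjust as needed).
status = check_modules(["torch", "detectron2", "soundfile"])
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```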

### 2. Datasets

Please refer to the link AVSBenchmark to download the datasets. You can place the data under the data folder or use a folder of your own; remember to update the paths in the config files accordingly. The data directory is organized as below:

```
|--AVS_dataset
   |--AVSBench_semantic/
   |--AVSBench_object/Multi-sources/
   |--AVSBench_object/Single-source/
```
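To catch path mistakes early, a small script can verify the layout before training. This is a sketch, not part of the repository; the expected subfolders follow the tree above, and the dataset root passed in is a placeholder:

```python
from pathlib import Path

# Subdirectories expected under the dataset root (from the tree above).
EXPECTED = [
    "AVSBench_semantic",
    "AVSBench_object/Multi-sources",
    "AVSBench_object/Single-source",
]

def missing_dirs(root):
    """Return the expected subdirectories that do not exist under root."""
    root = Path(root)
    return [sub for sub in EXPECTED if not (root / sub).is_dir()]

gaps = missing_dirs("data/AVS_dataset")  # placeholder path; adjust to your data root
print("layout OK" if not gaps else f"missing: {gaps}")
```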

Preprocess the AVSS dataset for efficient training:

```shell
python3 avs_tools/preprocess_avss_audio.py
python3 avs_tools/process_avssimg2fixsize.py
```

### 3. Download Pre-Trained Models

Download the pre-trained backbone and audio-encoder weights and organize them as below:

```
|--pretrained
   |--detectron2/R-50.pkl
   |--detectron2/d2_pvt_v2_b5.pkl
   |--vggish-10086976.pth
   |--vggish_pca_params-970ea276.pth
```
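A short script can fetch most of these weights. This is a sketch, and the URLs below are assumptions based on the standard detectron2 model zoo and torchvggish v0.1 release, so verify them before use; the detectron2-converted PVTv2-B5 checkpoint has no standard download URL and must be obtained separately:

```python
import urllib.request
from pathlib import Path

# Assumed sources (verify before use): detectron2 model zoo and torchvggish v0.1 release.
WEIGHTS = {
    "pretrained/detectron2/R-50.pkl":
        "https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-50.pkl",
    "pretrained/vggish-10086976.pth":
        "https://github.com/harritaylor/torchvggish/releases/download/v0.1/vggish-10086976.pth",
    "pretrained/vggish_pca_params-970ea276.pth":
        "https://github.com/harritaylor/torchvggish/releases/download/v0.1/vggish_pca_params-970ea276.pth",
}
# Note: d2_pvt_v2_b5.pkl is a detectron2-converted PVTv2-B5 checkpoint; obtain it separately.

def fetch_all(weights=WEIGHTS):
    """Download each weight file to its destination path, skipping existing files."""
    for dest, url in weights.items():
        path = Path(dest)
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            print(f"downloading {url} -> {path}")
            urllib.request.urlretrieve(url, path)
```

Call `fetch_all()` to run the downloads.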

### 4. Maskiges Pre-generation

```shell
# Generate class-agnostic masks with Semantic-SAM (s4 shown; use ms3 or avss for the other subsets)
sh avs_tools/pre_mask/pre_mask_semantic_sam_s4.sh train
sh avs_tools/pre_mask/pre_mask_semantic_sam_s4.sh val
sh avs_tools/pre_mask/pre_mask_semantic_sam_s4.sh test

# Convert the masks to RGB Maskiges (s4 shown; use ms3 or avss for the other subsets)
python3 avs_tools/pre_mask2rgb/mask_precess_s4.py --split train
python3 avs_tools/pre_mask2rgb/mask_precess_s4.py --split val
python3 avs_tools/pre_mask2rgb/mask_precess_s4.py --split test
```

The generated Maskiges are stored as below:

```
|--AVS_dataset
    |--AVSBench_semantic/pre_SAM_mask/
    |--AVSBench_object/Multi-sources/ms3_data/pre_SAM_mask/
    |--AVSBench_object/Single-source/s4_data/pre_SAM_mask/
```
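Conceptually, a Maskige renders a class-agnostic mask as an RGB image so the visual branch can consume it directly. The toy sketch below illustrates the idea of a palette mapping; the palette values are placeholders, not the ones used by the provided scripts:

```python
# Toy palette: class index -> RGB triple (placeholder values, for illustration only).
PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0), 3: (0, 0, 255)}

def mask_to_maskige(mask, palette=PALETTE):
    """Map a 2-D class-index mask (list of lists) to an RGB image of the same shape."""
    return [[palette[idx] for idx in row] for row in mask]

rgb = mask_to_maskige([[0, 1], [2, 3]])
```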

### 5. Train

```shell
# ResNet-50
sh scripts/res_train_avs4.sh # or ms3, avss
# PVTv2
sh scripts/pvt_train_avs4.sh # or ms3, avss
```

### 6. Test

```shell
# ResNet-50
sh scripts/res_test_avs4.sh # or ms3, avss
# PVTv2
sh scripts/pvt_test_avs4.sh # or ms3, avss
```

### 7. Results and Download Links

We provide checkpoints for the S4 subset at YannQi/COMBO-AVS-checkpoints on Hugging Face.

| Method | Backbone | Subset | Config | mIoU | F-score |
| --- | --- | --- | --- | --- | --- |
| COMBO-R50 | ResNet-50 | S4 | config | 81.7 | 90.1 |
| COMBO-PVTv2 | PVTv2-B5 | S4 | config | 84.7 | 91.9 |
| COMBO-R50 | ResNet-50 | MS3 | config | 54.5 | 66.6 |
| COMBO-PVTv2 | PVTv2-B5 | MS3 | config | 59.2 | 71.2 |
| COMBO-R50 | ResNet-50 | AVSS | config | 33.3 | 37.3 |
| COMBO-PVTv2 | PVTv2-B5 | AVSS | config | 42.1 | 46.1 |

## 🤝 Citing COMBO

```bibtex
@misc{yang2023cooperation,
      title={Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation},
      author={Qi Yang and Xing Nie and Tong Li and Pengfei Gao and Ying Guo and Cheng Zhen and Pengfei Yan and Shiming Xiang},
      year={2023},
      eprint={2312.06462},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```