Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

Official implementation of "Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval", BMVC 2022.
Additional Links: arXiv | Video & Poster

Our framework retains semantically relevant modality-specific features by learning a fused representation space, while bypassing the expensive cross-attention computation at test-time via cross-modal knowledge distillation.

Model Diagram

Environment Setup

This project is implemented in PyTorch. A conda environment with all required dependencies can be created as follows:

  1. Clone the project repository:
     git clone https://github.com/abhrac/xmodal-vit.git
     cd xmodal-vit
  2. Create and activate the conda environment:
     conda env create -f environment.yml
     conda activate xmodal-vit

Experimentation

To run the whole train-test pipeline end-to-end, run:

./run_expt.sh

Training

To train individual components from scratch, run the following:

python src/train_teacher.py --dataset=DatasetName
python src/train_photo_student.py --dataset=DatasetName
python src/train_sketch_student.py --dataset=DatasetName

where DatasetName is one of ShoeV2, ChairV2 or Sketchy.
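The three training commands can also be chained from a small Python driver. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `train_all` are hypothetical, and it assumes the stages should run in the order listed above (teacher first, then the two students) from the repository root.

```python
import subprocess
import sys

# Script paths and the --dataset flag are copied from the commands above.
STAGES = [
    "src/train_teacher.py",
    "src/train_photo_student.py",
    "src/train_sketch_student.py",
]
DATASETS = ("ShoeV2", "ChairV2", "Sketchy")


def build_commands(dataset):
    """Return one command (as an argument list) per training stage."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    return [[sys.executable, script, f"--dataset={dataset}"] for script in STAGES]


def train_all(dataset):
    """Run the stages in the listed order, stopping at the first failure."""
    for cmd in build_commands(dataset):
        # check=True raises CalledProcessError if a stage exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling `train_all("ShoeV2")` would then launch the three scripts one after another. Using `sys.executable` keeps the child processes in the currently active conda environment.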

Evaluation

Pre-trained models are available here. To evaluate a trained model, run:

python src/test.py --dataset=DatasetName

Results

| Method | Shoe-V2 Acc@1 | Shoe-V2 Acc@10 | Chair-V2 Acc@1 | Chair-V2 Acc@10 |
|---|---|---|---|---|
| Yang et al., ICCV '21 | 32.33 | 79.63 | 52.89 | 94.88 |
| Sain et al., CVPR '21 | 36.47 | 81.83 | 62.86 | 91.14 |
| Bhunia et al., CVPR '21 | 39.10 | 87.50 | 62.20 | 90.80 |
| Chowdhury et al., CVPR '22 | 39.90 | 82.90 | - | - |
| Bhunia et al., CVPR '22 | 43.70 | - | 64.80 | - |
| Ours (XModalViT) | 45.05 | 90.23 | 63.48 | 95.02 |

| Method | Sketchy Acc@1 | Sketchy Acc@10 |
|---|---|---|
| Human (Sangkloy et al., SIGGRAPH '16) | 54.27 | - |
| Pang et al., BMVC '17 | 50.14 | - |
| Wang et al., PR '20 (S+I) | 40.16 | 92.00 |
| Wang et al., PR '20 (S+I+D) | 46.20 | 96.49 |
| Ours (XModalViT) | 56.15 | 96.86 |
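Acc@k is the percentage of query sketches whose matching photo appears among the top k retrieved photos. The sketch below illustrates that metric on toy embeddings; it is not the repository's evaluation code. The function name `acc_at_k`, the use of Euclidean distance, and the toy data are all assumptions made for illustration, and `src/test.py` may rank photos differently.

```python
import math


def acc_at_k(sketch_embs, photo_embs, true_idx, k):
    """Percentage of sketches whose true photo is among the k nearest photos."""
    hits = 0
    for sketch, target in zip(sketch_embs, true_idx):
        # Distance from this sketch to every gallery photo (assumed metric).
        dists = [math.dist(sketch, photo) for photo in photo_embs]
        # Rank of the true photo = number of photos strictly closer than it.
        rank = sum(d < dists[target] for d in dists)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sketch_embs)


# Toy data, made up for illustration: 3 gallery photos and 2 query sketches.
photos = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
sketches = [[0.1, 0.0], [0.9, 0.1]]
targets = [0, 2]  # index of the matching photo for each sketch

print(acc_at_k(sketches, photos, targets, k=1))  # 50.0
print(acc_at_k(sketches, photos, targets, k=3))  # 100.0
```

Here the first sketch retrieves its photo at rank 1, while the second sketch's photo is the farthest of the three, so it only counts as a hit once k reaches 3.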

Citation

@inproceedings{Chaudhuri2022XModalViT,
 author = {Abhra Chaudhuri and Massimiliano Mancini and Yanbei Chen and Zeynep Akata and Anjan Dutta},
 booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
 title = {Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval},
 year = {2022}
}