DIME-FM

Implementation of "DIME-FM: DIstilling Multimodal and Efficient Foundation Models" (ICCV 2023)

Abstract

Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency and fixed architectures. Unfortunately, recent work shows training a small custom VLFM for resource-limited applications is currently very difficult using public and smaller-scale data. In this paper, we introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, with only 40M public images and 28.4M unpaired public sentences. The resulting model “Distill-ViT-B/32” rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar results in terms of zero-shot and linear-probing performance on both ImageNet and the ELEVATER (20 image classification tasks) benchmarks. It also displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet.

Links: arXiv / Project Page / Poster / Slides

Please cite our work if you find it helpful for your research.

@article{sun2023dime,
  title={DIME-FM: DIstilling Multimodal and Efficient Foundation Models},
  author={Sun, Ximeng and Zhang, Pengchuan and Zhang, Peizhao and Shah, Hardik and Saenko, Kate and Xia, Xide},
  journal={arXiv preprint arXiv:2303.18232},
  year={2023}
}

Release TODO List

Checkpoints

| Model | Image Training Set | Text Training Set | ZS on IN-1K | ZS on ELEVATER | LP on ELEVATER | Robustness | Download |
|---|---|---|---|---|---|---|---|
| ViT-B/32 | IN-21K + GCC-15M + YFCC-14M | Filtered Roberta NLP Corpus | 66.5% | 56.4% | 79.2% | 50.2% | ckpt |
| ViT-B/32 | IN-21K + GCC-15M + YFCC-14M | IN-21K Prompts + GCC-15M + YFCC-14M + Downstream Tasks' Prompts | 66.1% | 57.7% | 79.4% | - | ckpt |

Evaluation

Our evaluation is based on the ELEVATER benchmark (please refer to the README in the Evaluation folder). We extend the ELEVATER benchmark to include the ImageNet variants for robustness evaluation. The download links for these datasets can be found HERE.
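
For reference, here is a minimal sketch of the CLIP-style zero-shot classification protocol that underlies the ZS numbers above, using the OpenAI CLIP package installed in the Environment section. The class list, the single "a photo of a {c}" prompt template and the checkpoint name are illustrative placeholders; ELEVATER's own prompt sets and metrics may differ.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a CLIP-style model; swap in the distilled ViT-B/32 checkpoint as appropriate.
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "car"]  # placeholder class list
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def classify(image):
    """Return the predicted class index for a single PIL image."""
    x = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(x)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        logits = 100.0 * img_feat @ text_feat.T  # scaled cosine similarity
    return logits.argmax(dim=-1).item()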

Training

We provide the training code in the Training folder.

Environment

# PyTorch with CUDA 11.8
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Vision backbone library
pip install timm
# CLIP dependencies and the OpenAI CLIP package
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
# Text encoders and config utilities
pip install transformers
pip install yacs
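
Optionally, a quick sanity check that the environment is set up correctly (assumes only the packages installed above):

import torch
import clip
import timm
import transformers

print("CUDA available:", torch.cuda.is_available())
# Downloads the OpenAI ViT-B/32 weights on first use.
model, preprocess = clip.load("ViT-B/32")
print("CLIP ViT-B/32 parameters:", sum(p.numel() for p in model.parameters()))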

Prepare the Data

For faster data loading, we pack all data into TSV files following this repo. We extract image and text features using the CLIP-ViT-L/14 model. All features are also stored in TSV format.
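
Below is a hedged sketch of extracting CLIP-ViT-L/14 features with the OpenAI CLIP package and writing them to TSV. The output file names, the image/text lists, and the "key, tab, base64-encoded float32 feature" record layout are assumptions for illustration; the exact TSV schema expected by the training code follows the referenced repo.

import base64
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def encode_feature(feat):
    # Serialize a 1-D float32 feature vector as base64 text for a TSV cell.
    return base64.b64encode(feat.cpu().numpy().astype(np.float32).tobytes()).decode("utf-8")

image_paths = ["img_0.jpg", "img_1.jpg"]         # placeholder image list
sentences = ["a dog on the grass", "a red car"]  # placeholder unpaired sentences

with torch.no_grad(), open("image_feat.tsv", "w") as img_out, open("text_feat.tsv", "w") as txt_out:
    for i, path in enumerate(image_paths):
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        img_out.write(f"{i}\t{encode_feature(model.encode_image(x).squeeze(0))}\n")
    for i, sent in enumerate(sentences):
        tokens = clip.tokenize([sent], truncate=True).to(device)
        txt_out.write(f"{i}\t{encode_feature(model.encode_text(tokens).squeeze(0))}\n")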

We provide examples of images, texts, image features and text features below.

| Data | Download Link |
|---|---|
| Image | tsv / lineidx |
| Text | tsv / lineidx |
| Image Feature | tsv / lineidx |
| Text Feature | tsv / lineidx |
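
The lineidx files store the byte offset of every TSV row so that a record can be fetched with a single seek. A minimal sketch of building and using such an index is shown below; the TSV reader classes in the referenced repo may use different names.

def build_lineidx(tsv_path, lineidx_path):
    # Write the starting byte offset of every line in the TSV file.
    with open(tsv_path, "rb") as fin, open(lineidx_path, "w") as fout:
        offset = 0
        for line in fin:
            fout.write(f"{offset}\n")
            offset += len(line)

def read_row(tsv_path, lineidx_path, idx):
    # Random-access one TSV row by its line index.
    with open(lineidx_path) as f:
        offsets = [int(x) for x in f]
    with open(tsv_path, "rb") as f:
        f.seek(offsets[idx])
        return f.readline().decode("utf-8").rstrip("\n").split("\t")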

Training command

cd Training
python train_amp.py --amp --dataroot <your_dataroot> --tsv_file_list configs/datalists/cc3m/image.list \
    configs/datalists/cc3m/image_feat.list configs/datalists/cc3m/text.list configs/datalists/cc3m/text_feat.list \
    --batch_size <your_batch_size> --use_pvl_loss

Please refer to the Data Parallel Example to extend training to multiple GPUs and nodes.
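
For reference, here is a generic PyTorch DistributedDataParallel sketch (not necessarily the exact launcher used by this repo); setup_ddp is a hypothetical helper, and the training script would be started with torchrun so that LOCAL_RANK is set for each process.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model):
    # Example launch: torchrun --nproc_per_node=8 train_amp.py <args>
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # Gradients are synchronized across ranks automatically during backward().
    return DDP(model, device_ids=[local_rank])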