MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

This is the official repository of MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training (CVPR 2024) by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, and Oncel Tuzel. The repository contains code for inference, training, and evaluation of MobileCLIP models trained on DataCompDR datasets.

<p align="center"> <img src="docs/fig_accuracy_latency.png" alt="Accuracy vs latency figure." width="400"/> </p>

Highlights

Examples

Getting Started

Setup

conda create -n clipenv python=3.10
conda activate clipenv
pip install -e .

To download the pretrained checkpoints, follow the code snippet below:

source get_pretrained_models.sh   # Files will be downloaded to `checkpoints` directory.

Usage Example

To use the models from the official repo, follow the code snippet below:

import torch
from PIL import Image
import mobileclip

model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s0', pretrained='/path/to/mobileclip_s0.pt')
tokenizer = mobileclip.get_tokenizer('mobileclip_s0')

image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

For an example of loading the data from HuggingFace, see hf_dataset_example.py.
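For a quick look without running that script, the sketch below streams a few samples with the Hugging Face `datasets` library; the dataset id `apple/DataCompDR-12M` and the field names are assumptions here, so adjust them to the DataCompDR variant you actually use.

from datasets import load_dataset

# Stream a handful of samples without downloading the full dataset.
# NOTE: the dataset id is an assumption; swap in the DataCompDR variant you need.
ds = load_dataset("apple/DataCompDR-12M", split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample.keys())  # e.g. image/caption fields plus the reinforced (synthetic caption / teacher) targets
    if i == 2:
        break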

OpenCLIP Support

Our models are now natively supported in OpenCLIP. To use MobileCLIP models in OpenCLIP, set up your environment as shown below:

conda create -n clipenv python=3.10
conda activate clipenv

pip install git+https://github.com/mlfoundations/open_clip
pip install git+https://github.com/huggingface/pytorch-image-models

To run inference, see the example below:

import open_clip
from mobileclip.modules.common.mobileone import reparameterize_model
 
model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP-S2', pretrained='datacompdr')
tokenizer = open_clip.get_tokenizer('MobileCLIP-S2')

# For inference/model exporting purposes, please reparameterize first
model.eval() 
model = reparameterize_model(model)

# ... follow examples in open_clip repo ...

Variants currently available in OpenCLIP: [('MobileCLIP-S1', 'datacompdr'), ('MobileCLIP-S2', 'datacompdr'), ('MobileCLIP-B', 'datacompdr'), ('MobileCLIP-B', 'datacompdr_lt')]
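To check which MobileCLIP tags your installed OpenCLIP version actually registers, you can filter `open_clip.list_pretrained()`; a small sketch:

import open_clip

# Print every (architecture, pretrained_tag) pair OpenCLIP knows for MobileCLIP.
print([pair for pair in open_clip.list_pretrained() if "MobileCLIP" in pair[0]])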

Evaluation

Please find the detailed evaluation results here. To reproduce results, we provide a script that performs zero-shot evaluation on the ImageNet-1k dataset. To evaluate on all 38 datasets, please follow the instructions in datacomp.

# Run evaluation with single GPU
python eval/zeroshot_imagenet.py --model-arch mobileclip_s0 --model-path /path/to/mobileclip_s0.pt
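Under the hood, zero-shot evaluation builds a text classifier by encoding one prompt per class and comparing normalized image embeddings against it. The sketch below illustrates the idea with placeholder class names and a single prompt template; it is not the repo script, which uses the full ImageNet-1k class list and prompt ensemble.

import torch
import mobileclip

# Illustrative zero-shot classifier; class names and the single prompt template
# are placeholders, not the ImageNet-1k class list / prompt ensemble used by the script.
model, _, preprocess = mobileclip.create_model_and_transforms(
    'mobileclip_s0', pretrained='/path/to/mobileclip_s0.pt')
tokenizer = mobileclip.get_tokenizer('mobileclip_s0')
model.eval()

classnames = ["golden retriever", "tabby cat", "goldfish"]
text = tokenizer([f"a photo of a {c}" for c in classnames])

with torch.no_grad():
    classifier = model.encode_text(text)
    classifier /= classifier.norm(dim=-1, keepdim=True)

def zero_shot_predict(image_batch):
    """Return the predicted class index for a preprocessed image batch."""
    with torch.no_grad():
        feats = model.encode_image(image_batch)
        feats /= feats.norm(dim=-1, keepdim=True)
        return (100.0 * feats @ classifier.T).argmax(dim=-1)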

Please refer to Open CLIP Results to compare with other models.

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets | Pytorch Checkpoint (url) |
| --- | --- | --- | --- | --- | --- | --- |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 | mobileclip_s0.pt |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 | mobileclip_s1.pt |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 | mobileclip_s2.pt |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 | mobileclip_b.pt |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 | mobileclip_blt.pt |

Note: MobileCLIP-B (LT) is trained for 300k iterations with a constant learning rate schedule and 300k iterations with a cosine learning rate schedule.

Citation

If you found this code useful, please cite the following paper:

@InProceedings{mobileclip2024,
  author = {Pavan Kumar Anasosalu Vasu and Hadi Pouransari and Fartash Faghri and Raviteja Vemulapalli and Oncel Tuzel},
  title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2024},
}

Acknowledgements

Our codebase is built using multiple open-source contributions; please see ACKNOWLEDGEMENTS for more details.