
ViTamin: Designing Scalable Vision Models in the Vision-language Era

šŸ”„ Officially supported by timm and OpenCLIP. Thanks @rwightman!

One line of code to call ViTamin:

model = timm.create_model('vitamin_xlarge_384')
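
Beyond model creation, here is a minimal usage sketch (not from the original README): it assumes the 'vitamin_xlarge_384' config registered in timm, the 384Ɨ384 input resolution implied by the model name, and that pretrained=True pulls the released weights from the Hugging Face Hub.

import timm
import torch

# Build the ViTamin-XLarge backbone; num_classes=0 returns pooled features
# instead of classification logits (standard timm behavior).
model = timm.create_model('vitamin_xlarge_384', pretrained=True, num_classes=0).eval()

# Dummy forward pass at the 384x384 resolution implied by the model name.
x = torch.randn(1, 3, 384, 384)
with torch.no_grad():
    features = model(x)
print(features.shape)  # (1, feature_dim)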

ViTamin-XL, with only 436M parameters and trained on the public DataComp-1B dataset, achieves an impressive 82.9% zero-shot ImageNet accuracy.

ViTamin-L sets a new state of the art across seven open-vocabulary segmentation benchmarks, and also significantly pushes forward the capabilities of large multi-modal models (e.g., LLaVA).

šŸ¤— The HuggingFace collection of ViTamin model cards has been released! Check out the model cards!

<p> <img src="image0.png" alt="teaser" width=90% height=90%> </p>

Get Started

This repository currently includes code and models for the following tasks:

ViTamin Pre-training: See ./ViTamin/README.md for a quick start, which includes CLIP pre-training / fine-tuning pipelines and zero-shot evaluation pipelines.

Open-vocabulary Detection and Segmentation: See ViTamin for Open-vocab Detection and ViTamin for Open-vocab Segmentation.

Large Multi-Modal Models: See ViTamin for Large Multi-Modal Models.

We also provide ViTamin as the Hugging Face model jienengchen/ViTamin-XL-384px, which can be used as follows:

import torch
import open_clip
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the ViTamin-XL image/text model from the Hugging Face Hub.
model = AutoModel.from_pretrained(
    'jienengchen/ViTamin-XL-384px',
    trust_remote_code=True).to(device).eval()

# Preprocess the input image with the matching CLIP image processor.
image = Image.open('./image.png').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('jienengchen/ViTamin-XL-384px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).to(device)

# Tokenize the candidate text prompts with the OpenCLIP tokenizer.
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
text = tokenizer(["a photo of vitamin", "a dog", "a cat"]).to(device)

# Encode image and text, then compute zero-shot classification probabilities.
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features, text_features, logit_scale = model(pixel_values, text)
    text_probs = (100.0 * image_features @ text_features.to(torch.float).T).softmax(dim=-1)

print("Label probs:", text_probs)
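
In this example, text_probs is a 1Ɨ3 tensor of probabilities over the three prompts. The fixed 100.0 multiplier plays the role of CLIP's exponentiated temperature (the logit_scale the model also returns), and the dot products act as cosine similarities, assuming the returned image and text embeddings are L2-normalized as in standard OpenCLIP models.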

Main Results with CLIP Pre-training on DataComp-1B

We will provide 61 trained VLMs (48 benchmarked + 13 best-performing) on Hugging Face for community use. Stay tuned!

| image encoder | šŸ¤— HuggingFace | image size | num patches | text encoder depth/width | seen samples (B) | trainable params Image+Text (M) | MACs Image+Text (G) | ImageNet Acc. | avg. 38 datasets | ImageNet dist. shift | VTAB | retrieval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViTamin-L | Link | 224 | 196 | 12/768 | 12.8 | 333.3+123.7 | 72.6+6.6 | 80.8 | 66.7 | 69.8 | 65.3 | 60.3 |
| ViTamin-L | Link | 256 | 256 | 12/768 | 12.8+0.2 | 333.4+123.7 | 94.8+6.6 | 81.2 | 67.0 | 71.1 | 65.3 | 61.2 |
| ViTamin-L | Link | 336 | 441 | 12/768 | 12.8+0.2 | 333.6+123.7 | 163.4+6.6 | 81.6 | 67.0 | 72.1 | 64.4 | 61.6 |
| ViTamin-L | Link | 384 | 576 | 12/768 | 12.8+0.2 | 333.7+123.7 | 213.4+6.6 | 81.8 | 67.2 | 72.4 | 64.7 | 61.8 |
| ViTamin-L2 | Link | 224 | 196 | 24/1024 | 12.8 | 333.6+354.0 | 72.6+23.3 | 80.9 | 66.4 | 70.6 | 63.4 | 61.5 |
| ViTamin-L2 | Link | 256 | 256 | 24/1024 | 12.8+0.5 | 333.6+354.0 | 94.8+23.3 | 81.5 | 67.4 | 71.9 | 64.1 | 63.1 |
| ViTamin-L2 | Link | 336 | 441 | 24/1024 | 12.8+0.5 | 333.8+354.0 | 163.4+23.3 | 81.8 | 67.8 | 73.0 | 64.5 | 63.6 |
| ViTamin-L2 | Link | 384 | 576 | 24/1024 | 12.8+0.5 | 334.0+354.0 | 213.4+23.3 | 82.1 | 68.1 | 73.4 | 64.8 | 63.7 |
| ViTamin-XL | Link | 256 | 256 | 27/1152 | 12.8+0.5 | 436.1+488.7 | 125.3+33.1 | 82.1 | 67.6 | 72.3 | 65.4 | 62.7 |
| ViTamin-XL | Link | 384 | 576 | 27/1152 | 12.8+0.5 | 436.1+488.7 | 281.9+33.1 | 82.6 | 68.1 | 73.6 | 65.6 | 63.8 |
| ViTamin-XL | Link | 256 | 256 | 27/1152 | 40 | 436.1+488.7 | 125.3+33.1 | 82.3 | 67.5 | 72.8 | 64.0 | 62.1 |
| ViTamin-XL | Link | 336 | 441 | 27/1152 | 40+1 | 436.1+488.7 | 215.9+33.1 | 82.7 | 68.0 | 73.9 | 64.1 | 62.6 |
| ViTamin-XL | Link | 384 | 576 | 27/1152 | 40+1 | 436.1+488.7 | 281.9+33.1 | 82.9 | 68.1 | 74.1 | 64.0 | 62.5 |

Main Results on Downstream Tasks

Open-Vocab Detection

| image encoder | detector | OV-COCO (AP<sub>50</sub><sup>novel</sup>) | OV-LVIS (AP<sub>r</sub>) |
|---|---|---|---|
| ViT-L/14 | Sliding F-ViT | 36.1 | 32.5 |
| ViTamin-L | Sliding F-ViT | 37.5 | 35.6 |

Open-Vocab Segmentation

| image encoder | segmentor | ADE | Cityscapes | MV | A-150 | A-847 | PC-459 | PC-59 | PAS-21 |
|---|---|---|---|---|---|---|---|---|---|
| ViT-L/14 | Sliding FC-CLIP | 24.6 | 40.7 | 16.5 | 31.8 | 14.3 | 18.3 | 55.1 | 81.5 |
| ViTamin-L | Sliding FC-CLIP | 27.3 | 44.0 | 18.2 | 35.6 | 16.1 | 20.4 | 58.4 | 83.4 |

Note: the panoptic datasets (ADE, Cityscapes, MV) are reported with the PQ metric, while the semantic datasets (A-150, A-847, PC-459, PC-59, PAS-21) are reported with the mIoU metric.

Large Multi-modal Models

| image encoder | image size | VQAv2 | GQA | VizWiz | SQA | T-VQA | POPE | MME | MM-Bench | MM-B-CN | SEED | LLaVA-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViTamin-L | 336 | 78.4 | 61.6 | 51.1 | 66.9 | 58.7 | 84.6 | 1421 | 65.4 | 58.4 | 57.7 | 64.5 | 33.6 |
| ViTamin-L | 384 | 78.9 | 61.6 | 55.4 | 67.6 | 59.8 | 85.5 | 1447 | 64.5 | 58.3 | 57.9 | 66.1 | 33.6 |

Citing ViTamin

@inproceedings{chen2024vitamin,
  title={ViTamin: Designing Scalable Vision Models in the Vision-language Era},
  author={Chen, Jieneng and Yu, Qihang and Shen, Xiaohui and Yuille, Alan and Chen, Liang-Chieh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}