CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation

[Paper][🤗Ckpts]

Abstract

State space models and Mamba-based models have been increasingly applied across various domains and achieve state-of-the-art performance. This technical report presents the first attempt to train a transferable Mamba model with contrastive language-image pretraining (CLIP). We train Mamba models of varying sizes and comprehensively evaluate them on 26 zero-shot classification datasets and 16 out-of-distribution (OOD) datasets. Our findings reveal that a Mamba model with 67 million parameters is on par with a 307 million-parameter Vision Transformer (ViT) in zero-shot classification, highlighting the parameter efficiency of Mamba models. In OOD generalization tests, Mamba-based models perform exceptionally well under OOD image contrast and high-pass filtering. However, a Hessian analysis indicates that Mamba models have a sharper and more non-convex loss landscape than ViT-based models, making them harder to train.
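For context, CLIP trains its encoders with a symmetric contrastive (InfoNCE) objective over a batch of aligned image-text pairs. Below is a minimal PyTorch sketch of that objective, assuming the Mamba vision backbone produces `image_features` and a text encoder produces `text_features`; the exact recipe used here (which builds on A-CLIP) may differ.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # Normalize embeddings so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits for N aligned image-text pairs;
    # logit_scale is CLIP's learned temperature (exp of a parameter).
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching caption for image i sits at column i.
    targets = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric InfoNCE: pick the right caption per image and vice versa.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return 0.5 * (loss_i + loss_t)
```

The Hessian findings above concern loss-landscape sharpness, commonly summarized by the largest Hessian eigenvalue. The following is a hypothetical power-iteration sketch using Hessian-vector products; the helper name and its arguments are illustrative and not taken from this repository.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    # `loss` must come from a forward pass with gradients enabled.
    # First-order grads with create_graph=True so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        # Normalize the current direction.
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        # Hessian-vector product: differentiate (grads . v) w.r.t. params.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v estimates the top eigenvalue.
        eig = sum((h * u).sum() for h, u in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig
```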

Main results

Zero-shot classification accuracy (%) of different architectures trained with CLIP

| Methods | Food-101 | CIFAR-10 | CIFAR-100 | CUB | SUN397 | Cars | Aircraft | DTD | Pets | Caltech-101 | Flowers | MNIST | FER-2013 | STL-10 | EuroSAT | RESISC45 | GTSRB | KITTI | Country211 | PCAM | UCF101 | Kinetics700 | CLEVR | HatefulMemes | SST2 | ImageNet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VMamba-B (89M) | 48.5 | 58.0 | 29.9 | 36.5 | 50.4 | 5.8 | 8.5 | 26.5 | 30.2 | 64.7 | 52.8 | 9.7 | 19.6 | 91.9 | 16.0 | 30.4 | 7.9 | 40.2 | 10.2 | 59.9 | 35.2 | 25.6 | 12.6 | 51.6 | 50.1 | 38.3 |
| VMamba-S (50M) | 49.4 | 70.3 | 34.3 | 39.1 | 53.9 | 6.9 | 8.4 | 26.0 | 31.3 | 68.7 | 54.1 | 10.1 | 9.8 | 92.8 | 17.6 | 31.4 | 6.9 | 23.5 | 10.9 | 54.2 | 38.4 | 27.1 | 13.2 | 50.5 | 50.0 | 40.0 |
| VMamba-T220 (30M) | 46.5 | 50.9 | 22.9 | 35.6 | 51.1 | 5.7 | 6.8 | 25.1 | 31.0 | 64.9 | 54.0 | 10.1 | 12.5 | 91.6 | 13.9 | 25.4 | 10.7 | 32.3 | 9.9 | 55.0 | 34.0 | 25.1 | 12.7 | 53.9 | 50.6 | 38.7 |
| SiMBA-L (66.6M) | 52.7 | 67.4 | 31.0 | 39.1 | 52.7 | 6.9 | 9.1 | 27.8 | 33.4 | 68.9 | 55.9 | 8.0 | 16.0 | 93.9 | 17.4 | 32.3 | 8.9 | 41.5 | 11.1 | 58.1 | 35.7 | 27.9 | 12.1 | 54.9 | 50.1 | 41.6 |
| ViT-B (84M) | 50.6 | 66.0 | 34.5 | 38.8 | 51.1 | 4.0 | 5.4 | 21.2 | 28.5 | 60.9 | 53.3 | 8.4 | 17.3 | 90.5 | 30.2 | 21.5 | 6.1 | 35.1 | 10.5 | 53.5 | 28.5 | 22.1 | 10.8 | 52.4 | 50.7 | 37.6 |
| ViT-L (307M) | 59.5 | 72.9 | 41.5 | 40.3 | 53.6 | 6.9 | 6.4 | 20.6 | 27.9 | 65.4 | 55.0 | 10.3 | 34.5 | 94.2 | 22.7 | 28.8 | 5.8 | 41.4 | 12.5 | 54.9 | 34.3 | 24.0 | 12.9 | 54.3 | 50.1 | 40.4 |
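For readers who want to reproduce numbers like these: CLIP-style zero-shot classification embeds each class name in a text prompt, then predicts the class whose prompt embedding has the highest cosine similarity with the image embedding. A minimal sketch, assuming an OpenAI-CLIP/open_clip-style interface with `encode_image` and `encode_text` (the released checkpoints may expose a different API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(model, tokenizer, loader, class_names, device="cuda"):
    # Build one text embedding per class from a simple prompt template.
    prompts = [f"a photo of a {name}" for name in class_names]
    text = tokenizer(prompts).to(device)
    text_features = F.normalize(model.encode_text(text), dim=-1)

    correct = total = 0
    for images, labels in loader:
        image_features = F.normalize(model.encode_image(images.to(device)), dim=-1)
        # Cosine similarity against every class prompt; argmax is the prediction.
        pred = (image_features @ text_features.t()).argmax(dim=-1)
        correct += (pred.cpu() == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total
```

Evaluations of this kind often ensemble several prompt templates per class; the single template above is only for illustration.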

Acknowledgment

This project builds on A-CLIP (paper, code), VMamba (paper, code), and SiMBA (paper, code); we thank the authors for their excellent work.