ViT-Lens


TL;DR: We present ViT-Lens, an approach for advancing omni-modal representation learning by leveraging a pretrained ViT with modality Lenses to comprehend diverse modalities.

<p align="center"> <img src="assets/vitlens-teaser.png" alt="vit-lens-omni-modal" width="400" /> </p> <p align="center"> <img src="assets/vitlens-sc.png" alt="vit-lens-capabilities" width="600" /> </p>

πŸ“’ News


πŸ“ Todo

πŸ”¨ Installation

```bash
conda create -n vit-lens python=3.8.8 -y
conda activate vit-lens

# Install pytorch>=1.9.0
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch -y

# Install ViT-Lens
git clone https://github.com/TencentARC/ViT-Lens.git
cd ViT-Lens/
pip install -e vitlens/
pip install -r vitlens/requirements-training.txt
```
<details>
<summary>Training/Inference on OpenShape Triplets (3D point clouds): environment setup (click to expand)</summary>

```bash
conda create -n vit-lens python=3.8.8 -y
conda activate vit-lens
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch -y
conda install -c dglteam/label/cu113 dgl -y

# Install ViT-Lens
git clone https://github.com/TencentARC/ViT-Lens.git
cd ViT-Lens/
pip install -e vitlens/
pip install -r vitlens/requirements-training.txt
```

</details>
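After either setup, a quick sanity check helps confirm that the installed PyTorch build sees the GPU and matches the CUDA toolkit. The snippet below is a generic check, not part of the repository.

```python
# Generic environment check (not part of the repo): verify the PyTorch build
# and that CUDA is visible, matching the cudatoolkit=11.3 install above.
import torch

print("torch:", torch.__version__)                  # expect 1.11.x or 1.12.x
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```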

πŸ” ViT-Lens Model

| Model | MN40 | SUN.D | NYU.D | Audioset | VGGSound | ESC50 | Clotho | AudioCaps | TAG.M | IN.EEG | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageBind (Huge) | - | 35.1 | 54.0 | 17.6 | 27.8 | 66.9 | 6.0/28.4 | 9.3/42.3 | - | - | - |
| ViT-Lens-L | 80.6 | 52.2 | 68.5 | 26.7 | 31.7 | 75.9 | 8.1/31.2 | 14.4/54.9 | 65.8 | 42.7 | vitlensL |

We release a one-stop ViT-Lens-L model (based on a Large ViT) and report its performance on ModelNet40 (MN40, top-1 accuracy), SUN RGBD depth-only (SUN.D, top-1 accuracy), NYUv2 depth-only (NYU.D, top-1 accuracy), Audioset (mAP), VGGSound (top-1 accuracy), ESC50 (top-1 accuracy), Clotho (R@1/R@10), AudioCaps (R@1/R@10), Touch-and-Go Material (TAG.M, top-1 accuracy), and ImageNet EEG (IN.EEG, top-1 accuracy). ViT-Lens consistently outperforms ImageBind.

For more model checkpoints (trained on different data or with better performance), please refer to MODEL_ZOO.md.
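The zero-shot classification and retrieval numbers above follow the standard CLIP-style protocol: each input (point cloud, depth map, audio clip, ...) is embedded into the shared ViT feature space and matched against text embeddings of the class names or captions by cosine similarity. Below is a minimal, self-contained sketch of that protocol using random placeholder embeddings (all shapes are assumptions); see TRAIN_INFERENCE.md for producing real embeddings with the released checkpoints.

```python
# Minimal sketch of CLIP-style zero-shot classification with placeholder
# embeddings; real embeddings would come from the ViT-Lens-L encoders.
import torch
import torch.nn.functional as F

num_classes, embed_dim, batch = 40, 768, 4  # e.g. a ModelNet40-sized label set (assumed dims)

# Stand-ins for text embeddings of class prompts and per-input modality embeddings.
text_emb  = F.normalize(torch.randn(num_classes, embed_dim), dim=-1)
input_emb = F.normalize(torch.randn(batch, embed_dim), dim=-1)

# Zero-shot prediction: cosine similarity against every class prompt, take the argmax.
logits = input_emb @ text_emb.T             # (batch, num_classes)
pred = logits.argmax(dim=-1)
print(pred)                                 # predicted class index per input
```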

πŸ“š Usage

πŸ“¦ Datasets

Please refer to DATASETS.md for dataset preparation.

πŸš€ Training & Inference

Please refer to TRAIN_INFERENCE.md for details.

🧩 Model Zoo

Please refer to MODEL_ZOO.md for details.

πŸ‘€ Visualization of Demo

<details open><summary>[ Plug ViT-Lens into SEED: Video Demo ]</summary><img src="./assets/vid_seed.gif" alt="vitlens-seed.video" style="width: 80%; height: auto;"></details>
<details close><summary>[ Plug ViT-Lens into SEED: enabling compound Any-to-Image Generation ]</summary><img src="./assets/seed_integrated.png" alt="vitlens-seed" style="width: 70%; height: auto;"></details>
<details open><summary>[ Plug ViT-Lens into InstructBLIP: Video Demo ]</summary><img src="./assets/insblip.gif" alt="insblip.video" style="width: 80%; height: auto;"></details>
<details close><summary>[ Plug ViT-Lens into InstructBLIP: enabling Any instruction following ]</summary><img src="./assets/insblip_2inp.png" alt="vitlens.instblip2" style="width: 70%; height: auto;"></details>
<details close><summary>[ Plug ViT-Lens into InstructBLIP: enabling Any instruction following ]</summary><img src="./assets/insblip_3inp.png" alt="mmvitlens.instblip3" style="width: 70%; height: auto;"></details>
<details close><summary>[ Example: Plug 3D lens to LLM ]</summary><img src="./assets/e_3d_plant.png" alt="plant" style="width: 60%; height: auto;"></details>
<details close><summary>[ Example: Plug 3D lens to LLM ]</summary><img src="./assets/e_3d_piano.png" alt="piano" style="width: 60%; height: auto;"></details>

πŸŽ“ Citation

If you find our work helpful, please give us a star 🌟 and consider citing:

```bibtex
@InProceedings{Lei_2024_CVPR,
    author    = {Lei, Weixian and Ge, Yixiao and Yi, Kun and Zhang, Jianfeng and Gao, Difei and Sun, Dylan and Ge, Yuying and Shan, Ying and Shou, Mike Zheng},
    title     = {ViT-Lens: Towards Omni-modal Representations},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26647-26657}
}
```

βœ‰οΈ Contact

Questions and discussions are welcome via leiwx52@gmail.com or by opening an issue.

πŸ™ Acknowledgement

This codebase is based on open_clip, ULIP, OpenShape and LAVIS. Big thanks to the authors for their awesome contributions!