✌️ VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Jitesh Jain, Jianwei Yang, Humphrey Shi

[Project Page] [COST Dataset] [arXiv] [pdf] [Video] [BibTeX]

This repo contains the code for our paper VCoder: Versatile Vision Encoders for Multimodal Large Language Models.

<p align="center"> <img src="images/features.svg" width="100%" class="center"/> </p> <p align="center"> <img src="images/vcoder.svg" width="100%" class="center"/> </p>

Contents

  1. Installation Instructions
  2. Demo
  3. Dataset Preparation
  4. Getting Started
  5. Results
  6. Citation

News

Installation Instructions

We use Python 3.10 and PyTorch 2.0.1 (CUDA 11.7 build) on Ubuntu 20.04.3 LTS.
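
For reference, a minimal environment setup could look like the following sketch; the environment name is arbitrary, and the editable install assumes the repository ships a standard pip-installable package, so defer to the repo's own requirements if they differ:

# create and activate a fresh Python 3.10 environment (the name "vcoder" is arbitrary)
conda create -n vcoder python=3.10 -y
conda activate vcoder

# install PyTorch 2.0.1 built against CUDA 11.7
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu117

# from the repository root, install VCoder and its remaining dependencies
pip install -e .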

Demo

You can try VCoder online in the HuggingFace Space, or interact with VCoder LLaVA-1.5 locally using either the CLI or the Gradio interface.

Note: You can obtain the segmentation map from the OneFormer Demo and the depth map from DINOv2.

Gradio Interface

Run the following command:

CUDA_VISIBLE_DEVICES=0 python -m vcoder_llava.serve.gradio_app --model-path shi-labs/vcoder_ds_llava-v1.5-13b

CLI Inference

Run the following command:

CUDA_VISIBLE_DEVICES=0 python -m vcoder_llava.serve.cli \
    --model-path shi-labs/vcoder_ds_llava-v1.5-13b \
    --image-file "vcoder_llava/serve/examples/suits.jpg" \
    --seg-image-file "vcoder_llava/serve/examples/suits_pan.png" \
    --depth-image-file "vcoder_llava/serve/examples/suits_depth.jpeg" \
    --load-4bit

Note: --seg-image-file and --depth-image-file are optional, but a segmentation map is required whenever a depth map is provided. --load-4bit is also optional; you may use --load-8bit instead.

Getting Started

Please see Getting Started with VCoder for training and evaluation commands.

Results

Note that we do not finetune any parameters in the original LLaVA-1.5 models, so VCoder's performance on general question-answering benchmarks is the same as LLaVA-1.5's.

Benchmarking on COST

| Model | Semantic CS(↑)/HS(↓) | Instance CS(↑)/HS(↓) | Panoptic CS(↑)/HS(↓) | Depth DS(↓) | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| VCoder LLaVA-1.5-7b | 88.6/10.4 | 71.1/26.9 | 86.0/12.8 | - | HF Hub |
| VCoder LLaVA-1.5-13b | 89.0/10.0 | 73.3/25.0 | 87.2/11.6 | - | HF Hub |
| VCoder-DS LLaVA-1.5-7b | 87.8/11.5 | 69.9/28.5 | 86.8/12.4 | 65.9 | HF Hub |
| VCoder-DS LLaVA-1.5-13b | 88.5/10.9 | 71.7/26.3 | 88.5/10.8 | 63.3 | HF Hub |

CS: Count Score, HS: Hallucination Score, DS: Depth Score.

We release the model responses used for benchmarking here.

Citation

If you found VCoder useful in your research, please consider starring ⭐ us on GitHub and citing 📚 our work!

@article{jain2023vcoder,
    title={{VCoder: Versatile Vision Encoders for Multimodal Large Language Models}},
    author={Jitesh Jain and Jianwei Yang and Humphrey Shi},
    journal={arXiv},
    year={2023}
}

Acknowledgement

We thank the authors of LLaVA, OneFormer, and DINOv2 for open-sourcing their codebases and checkpoints. We are also grateful to the authors of CHAIR for releasing their synonym word mapping.