S<sup>2</sup>-Wrapper

This repo contains the PyTorch implementation of S<sup>2</sup>-Wrapper, a simple mechanism that enables multi-scale feature extraction on any vision model.

<div align="center"> <image src="assets/s2_wrapper_2.png" width="840px" /> <p></p> </div>

Read our paper about when scaling on image scales is better than scaling on model size.

When Do We Not Need Larger Vision Models?<br> Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell<br> UC Berkeley, Microsoft Research<br>

Paper: https://arxiv.org/abs/2403.13043

Quickstart

Step 1. Install s2wrapper through pip.

pip install git+https://github.com/bfshi/scaling_on_scales.git

Step 2. Extract multi-scale features from any vision model with one line of code.

Assume you have a function (could be model, model.forward, etc.) that takes in BxCxHxW images and outputs BxNxC features.

For example, suppose you have a model (e.g., ViT-B) that extracts features by

feature = model(x)   # e.g., x: 32*3*224*224, feature: 32*196*768

Then extract multi-scale features (e.g., scales of 1 and 2) by

from s2wrapper import forward as multiscale_forward
multiscale_feature = multiscale_forward(model, x, scales=[1, 2])   # x: 32*3*224*224, feature: 32*196*1536

Above we assume the input is 224x224, and s2wrapper interpolates it to 448x448. If the original 448x448 image is already available, interpolating from the 448x448 image instead of the 224x224 image gives better performance. In this case, extract features at scales of 224x224 and 448x448 by

from s2wrapper import forward as multiscale_forward
# x: 32*3*448*448, feature: 32*196*1536
# Note that `max_split_size=224` is needed so that the 448 image is split into 4 sub-images.
multiscale_feature = multiscale_forward(model, x, scales=[0.5, 1], max_split_size=224)
# Alternatively, set `img_sizes` instead of `scales`:
# multiscale_feature = multiscale_forward(model, x, img_sizes=[224, 448], max_split_size=224)
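
The shape comments in the two examples above follow a simple pattern, which the sketch below writes out as plain arithmetic. It is not part of s2wrapper: the helper names `scaled_image_sizes` and `multiscale_feature_shape` are hypothetical, and the sketch assumes that each scale's image size is the input size multiplied by the scale and that the per-scale features are concatenated along the channel dimension (as the 768 to 1536 example suggests).

```python
def scaled_image_sizes(input_size, scales):
    """Image side length at each scale, assuming size = input_size * scale."""
    return [int(input_size * s) for s in scales]


def multiscale_feature_shape(single_scale_shape, num_scales):
    """Expected BxNxC shape when `num_scales` feature maps are concatenated on C."""
    b, n, c = single_scale_shape
    return (b, n, c * num_scales)


# 224x224 input, upsampled to 448x448
print(scaled_image_sizes(224, [1, 2]))               # [224, 448]
# 448x448 input, downsampled to 224x224
print(scaled_image_sizes(448, [0.5, 1]))             # [224, 448]
# ViT-B features (32*196*768) at two scales
print(multiscale_feature_shape((32, 196, 768), 2))   # (32, 196, 1536)
```

Both calls produce the same pair of image sizes; the difference is only whether the larger image is interpolated up from 224x224 or taken from the original 448x448 input.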

Usage

s2wrapper.forward(
    model,
    input,
    scales=None,
    img_sizes=None,
    max_split_size=None,
    resize_output_to_idx=0,
    num_prefix_token=0,
    output_shape='bnc',
    split_forward=False,
)

model: Your vision model or any function that takes in a BxCxHxW image tensor and outputs a BxNxC feature tensor.

input: Input image tensor with shape BxCxHxW.

scales: A list of scales to extract features on. For example, scales=[1, 2] will extract features at the 224<sup>2</sup> and 448<sup>2</sup> scales if the default size is 224<sup>2</sup>.

img_sizes: Alternatively, instead of assigning scales, you can assign the image size for each scale. For example, img_sizes=[224, 448] will yield the same results as scales=[1, 2] for a default size of 224<sup>2</sup>.

max_split_size: The maximum size of the sub-images split from the large image. For each scale, the image will be split into ceil(img_size_at_that_scale / max_split_size)**2 sub-images. If None, it defaults to the size of input.

resize_output_to_idx: Which scale to resize the final feature map to. Default is the first scale in scales or img_sizes.

num_prefix_token: Number of prefix tokens in the feature map. For example, if the feature map returned by model contains 1 [CLS] token and other spatial tokens, set num_prefix_token=1. Default is 0.

output_shape: Shape of the output features. Must be either bnc (e.g., ViT) or bchw (e.g., ConvNet). Default is bnc.

split_forward: Whether to run the model on each sub-image separately or batch all sub-images into a single run. Setting it to True can reduce memory usage (roughly the same GPU memory usage as single-scale inference). Default is False.
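
To make the `max_split_size` formula above concrete, here is a minimal sketch that only evaluates ceil(img_size / max_split_size)**2 for a few image sizes. The helper name `num_sub_images` is hypothetical and is not a function exported by s2wrapper.

```python
import math


def num_sub_images(img_size, max_split_size):
    """Sub-images per scale: ceil(img_size / max_split_size) ** 2."""
    splits_per_side = math.ceil(img_size / max_split_size)
    return splits_per_side ** 2


for size in (224, 448, 672):
    print(size, num_sub_images(size, max_split_size=224))
# 224 1
# 448 4
# 672 9
```

With `max_split_size=224`, the 448 image yields 4 sub-images, which matches the Quickstart example.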

Example: LLaVA with S<sup>2</sup>-Wrapper

S<sup>2</sup>-Wrapper is officially integrated into LLaVA (see the PR here). To use LLaVA with S<sup>2</sup>-Wrapper, simply install this repo and the latest version of the LLaVA repo, then download the checkpoints listed below. We've released the checkpoints of LLaVA-1.5-7B and LLaVA-1.5-13B with S<sup>2</sup>-Wrapper.

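| Model | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 7B | full_ft-1e | liuhaotian/llava-v1.5-7b | 78.5 | 50.0 | 58.2 | 36.2 | 25.2 | 64.3 | 65.7 | 31.1 |
| LLaVA-1.5 | 7B | lora-1e | liuhaotian/llava-v1.5-7b-lora | 79.1 | 47.8 | 58.2 | - | - | 66.1 | - | 30.2 |
| LLaVA-1.5-S2 | 7B | lora-1e | bfshi/llava-v1.5-7b-s2-lora | 80.0 | 50.1 | 61.0 | 37.7 | 25.3 | 66.2 | 67.9 | 32.4 |
| LLaVA-1.5 | 13B | full_ft-1e | liuhaotian/llava-v1.5-13b | 80.0 | 53.6 | 61.3 | 36.4 | 27.6 | 67.7 | 68.2 | 36.1 |
| LLaVA-1.5 | 13B | lora-1e | liuhaotian/llava-v1.5-13b-lora | 80.0 | 58.9 | 60.2 | - | - | 68.5 | - | 38.3 |
| LLaVA-1.5-S2 | 13B | lora-1e | bfshi/llava-v1.5-13b-s2-lora | 80.9 | 56.0 | 63.1 | 37.4 | 27.8 | 67.9 | 68.9 | 36.4 |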

An example script of model inference with LLaVA-1.5-S2:

python3 -m llava.eval.run_llava \
    --model-path bfshi/llava-v1.5-7b-s2-lora \
    --model-base lmsys/vicuna-7b-v1.5 \
    --image-file <image> \
    --query <query> \
    --conv-mode vicuna_v1

Training. The current LLaVA repo only supports evaluation with S<sup>2</sup>. To train LLaVA with S<sup>2</sup>-Wrapper, additionally apply the changes here to your LLaVA repo and you are good to go!

Training configurations should be the same as training a regular LLaVA without anyres (i.e., image_aspect_ratio="resize" and mm_patch_merge_type="flat"), except for two new model configs:

Example: NVIDIA VILA with S<sup>2</sup>-Wrapper

S<sup>2</sup>-Wrapper is officially integrated into NVIDIA VILA. VILA is a multi-modal LLM that supports multi-image understanding and video understanding with superb results on multiple benchmarks (e.g., ranked #1 on MMMU among all open-source models). VILA comes with several model sizes: 3B, 8B, 13B, and 40B, each also with a quantized version (AWQ).

Currently we've released the checkpoints of VILA-3B with S<sup>2</sup>-Wrapper, which is your go-to choice for running an MLLM on edge devices. Checkpoints of other model sizes are on the way! Meanwhile, you are welcome to check out more details here.

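| Model | Prec. | VQAv2 | GQA | VizWiz | SQA-I | VQA-T | POPE | MME | MMB | MMB-CN | SEED | SEED-I | MMMU (val) | MMMU (test) | llava-bench | MM-Vet | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VILA1.5-3B | fp16 | 80.4 | 61.5 | 53.5 | 69.0 | 60.4 | 85.9 | 1442.44 | 63.4 | 52.7 | 60.9 | 67.9 | 33.3 | 30.8 | 75.9 | 35.4 | 60.2 |
| VILA1.5-3B-S2 | fp16 | 79.8 | 61.4 | 61.3 | 69.6 | 63.4 | 85.3 | 1431.65 | 62.8 | 52.2 | 60.0 | 66.4 | 32.8 | 31.3 | 76.7 | 38.6 | 60.9 |
| VILA1.5-3B-AWQ | int4 | 80.0 | 61.1 | 53.8 | 67.8 | 60.4 | 85.9 | 1437.34 | 63.3 | 51.4 | 59.8 | 66.6 | 32.7 | 31.1 | 75.0 | 37.3 | 59.9 |
| VILA1.5-3B-S2-AWQ | int4 | 79.4 | 61.3 | 62.3 | 69.2 | 63.0 | 85.8 | 1417.06 | 61.6 | 51.5 | 59.1 | 65.7 | 33.4 | 30.4 | 77.1 | 36.7 | 60.5 |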

Please refer to the original repo of VILA for checkpoints as well as guidance on training, evaluation, and deployment.

Example: HuggingFace CLIP with S<sup>2</sup>-Wrapper

Regular feature extraction using HuggingFace CLIP vision model (reference: official example):

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt").pixel_values

# model.forward returns an object that contains "last_hidden_state" which is the feature map we need
outputs = model(inputs).last_hidden_state
print(outputs.shape)  # 1*50*768

Making it multi-scale:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModel
from s2wrapper import forward as multiscale_forward

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt").pixel_values

# wrap the feature extraction process into a single function that
# takes image tensor as input and outputs feature tensor
def forward_features(inputs):
    return model(inputs).last_hidden_state

# extracting features with scales=[1, 2]. Note the output has one [CLS] token
# so setting num_prefix_token=1.
outputs = multiscale_forward(forward_features, inputs, scales=[1, 2], num_prefix_token=1)
print(outputs.shape)  # 1*50*1536
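
The token counts in these shape comments (50 for CLIP ViT-B/32, 196 in the Quickstart ViT-B example) can be checked with a little arithmetic. The sketch below is illustrative only: `vit_token_count` is a hypothetical helper, and it assumes a 224x224 input, non-overlapping square patches, and a patch size of 16 for the Quickstart ViT-B.

```python
def vit_token_count(img_size, patch_size, num_prefix_token=0):
    """Spatial tokens of a ViT (one per patch) plus any prefix tokens such as [CLS]."""
    patches_per_side = img_size // patch_size
    return patches_per_side ** 2 + num_prefix_token


# CLIP ViT-B/32 at 224x224: 7*7 spatial tokens plus one [CLS] token
print(vit_token_count(224, 32, num_prefix_token=1))  # 50
# ViT-B with 16x16 patches and no prefix token, as in the Quickstart shapes
print(vit_token_count(224, 16))                      # 196
```

The extra token in the CLIP case is the [CLS] token, which is why the call above passes `num_prefix_token=1`.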

Citation

@article{shi2024we,
  title={When Do We Not Need Larger Vision Models?},
  author={Shi, Baifeng and Wu, Ziyang and Mao, Maolin and Wang, Xin and Darrell, Trevor},
  journal={arXiv preprint arXiv:2403.13043},
  year={2024}
}