Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
Introduction
In this work, we investigate whether multimodal LLMs can perceive small details as well as large details in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining by up to 46% as the subject becomes smaller. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate this sensitivity. Inspired by the usefulness of human cropping, we then propose three automatic visual cropping methods as inference-time mechanisms to improve the zero-shot performance of multimodal LLMs. We study their effectiveness on four popular VQA datasets and a subset of the VQAv2 dataset tailored towards fine visual details. Our findings suggest that multimodal LLMs should be used with caution in detail-sensitive VQA applications, and that visual cropping is a promising direction to improve their zero-shot performance.
Testing Cropping Methods and Seeing How Cropping Helps BLIP-2 Answer Questions Better
(optional) Create a conda environment and activate it.
conda create -n visual_crop_zsvqa python=3.8
conda activate visual_crop_zsvqa
Clone the repository
git clone https://github.com/saccharomycetes/visual_crop_zsvqa.git
cd visual_crop_zsvqa
Since we have modified the original LAVIS library, please install our modified version with the following commands.
cd LAVIS
pip install -e .
Then install the rest of the dependencies.
cd ..
pip install -r requirements.txt
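To verify the editable LAVIS install, you can run a quick sanity check in Python (a minimal sketch; blip2_t5 with pretrain_flant5xl is a stock LAVIS configuration and may differ from the exact variant used in the notebook; loading it will download the BLIP-2 weights on first run):

# Sanity check: the modified LAVIS library imports and a BLIP-2 model loads.
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)
print("LAVIS BLIP-2 loaded on", device)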
Download the model checkpoints
The SAM checkpoint and the YOLO checkpoint can be downloaded with the following commands:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8x.pt
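Once downloaded, the two checkpoints can be loaded through their official libraries. The sketch below uses the standard segment-anything and ultralytics entry points with the file names from the wget commands above; whether the notebook uses these exact entry points is an assumption.

# Load the SAM and YOLOv8 checkpoints downloaded above.
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from ultralytics import YOLO

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

yolo = YOLO("yolov8x.pt")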
Now you are ready to run crop.ipynb to see how cropping helps BLIP-2 answer questions better.
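For a sense of what the notebook demonstrates before opening it, the sketch below mimics the basic idea: answer a question on the full image, then crop to a detected region and ask again. This is only an illustration of cropping-based inference, not the notebook's exact pipeline; the image path, question, and the choice of the highest-confidence YOLO box are placeholder assumptions.

# Illustration: ask BLIP-2 a visual question before and after cropping.
import torch
from PIL import Image
from ultralytics import YOLO
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image
question = "What color is the small sign?"            # placeholder question
prompt = f"Question: {question} Answer:"

def answer(img):
    image = vis_processors["eval"](img).unsqueeze(0).to(device)
    return model.generate({"image": image, "prompt": prompt})[0]

print("Full image:", answer(raw_image))

# Crop to the highest-confidence YOLO detection and ask again.
boxes = YOLO("yolov8x.pt")(raw_image)[0].boxes
if len(boxes) > 0:
    x1, y1, x2, y2 = map(int, boxes.xyxy[boxes.conf.argmax()].tolist())
    print("Cropped:", answer(raw_image.crop((x1, y1, x2, y2))))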
Citation
If you find our research useful or insightful, please consider citing our paper:
@article{zhang2023visual,
title={Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models},
author={Zhang, Jiarui and Khayatkhoei, Mahyar and Chhikara, Prateek and Ilievski, Filip},
journal={arXiv preprint arXiv:2310.16033},
year={2023}
}
Contact
jrzhang [AT] isi.edu