GroundVLP

GroundVLP: A simple yet effective zero-shot method that harnesses the visual grounding ability of existing models trained on image-text pairs and pure object detection data

<p align="center"> <img src='docs/introduction3.png' align="center" height="400px"> </p>

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection
Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin
AAAI 2024 (arXiv 2312.15043)

Installation

pip install -r requirements.txt

Example commands for setting up the environment:

# create a new environment
conda create --name groundvlp python=3.8
conda activate groundvlp

git clone https://github.com/om-ai-lab/GroundVLP.git
cd GroundVLP

# install pytorch
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

# install detectron2
python -m pip install detectron2 -f \
  https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html

# install requirements
pip install -r requirements.txt
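
After installing, a quick sanity check can confirm that the key packages are importable. This is a minimal sketch (not part of the repo); the package names follow the install commands above:

```python
import importlib.util

def installed(pkg: str) -> bool:
    """Return True if `pkg` can be imported in the current environment."""
    return importlib.util.find_spec(pkg) is not None

# Core dependencies from the install steps above.
for pkg in ("torch", "torchvision", "detectron2"):
    print(f"{pkg}: {'ok' if installed(pkg) else 'MISSING'}")
```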

Download

Checkpoints

Download the following checkpoints and place them in `checkpoints/`:

JSON files

Download the JSON files from the following link and unzip them into `data/`:

Images

Download the COCO train2014 images and unzip them into `images/train2014`:

The final folder tree should look like this:

GroundVLP
├── checkpoints
│   ├── ALBEF.pth
│   └── Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
├── data
│   ├── refcoco_val_info.json
│   ├── ...
│   └── refcocog_val_info.json
└── images
    └── train2014
        └── COCO_train2014_xxx.jpg
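
Before running evaluation, a small helper can verify that the expected files from the tree above are in place. This is a hypothetical convenience script, not part of the repo; the file names follow the tree shown above:

```python
from pathlib import Path

# Expected checkpoints, annotation files, and image folder,
# matching the folder tree above.
EXPECTED = [
    "checkpoints/ALBEF.pth",
    "checkpoints/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth",
    "data/refcoco_val_info.json",
    "data/refcocog_val_info.json",
    "images/train2014",
]

def missing_paths(root: str = ".") -> list:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]

if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected files found.")
```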

Run

Results of RefCOCO/+/g

Run this command to evaluate GroundVLP on REC datasets using the ground-truth category:

python eval_rec.py \
  --image_folder="./images/train2014" \
  --eval_data="refcoco_val,refcoco_testA,refcoco_testB,refcoco+_val,refcoco+_testA,refcoco+_testB,refcocog_val,refcocog_test" \
  --model_id="ALBEF" \
  --use_gt_category

Currently, the released code supports only the ALBEF and TCL models; we will continue updating it to support more models.

If you want the results using the predicted category instead, first extract the agent (head noun) of each query and map it to a COCO label:

python utils/map_to_coco_label.py
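
The mapping step can be sketched as follows. This is a minimal, hypothetical illustration (not the repo's actual `utils/map_to_coco_label.py`): it maps an already-extracted agent word to a COCO category via a small synonym table, whereas the real script's extraction and vocabulary are more complete.

```python
# Hypothetical synonym table: agent word -> COCO category.
SYNONYM_TO_COCO = {
    "boy": "person", "girl": "person", "man": "person", "woman": "person",
    "guy": "person", "kid": "person",
    "puppy": "dog", "kitten": "cat",
    "bike": "bicycle", "automobile": "car",
}

def map_agent_to_coco(agent: str) -> str:
    """Map an extracted agent word to a COCO label (identity if unknown)."""
    agent = agent.lower().strip()
    return SYNONYM_TO_COCO.get(agent, agent)

print(map_agent_to_coco("Boy"))   # -> person
print(map_agent_to_coco("bike"))  # -> bicycle
```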

Then run this command:

python eval_rec.py \
  --image_folder="./images/train2014" \
  --eval_data="refcoco_val,refcoco_testA,refcoco_testB,refcoco+_val,refcoco+_testA,refcoco+_testB,refcocog_val,refcocog_test" \
  --model_id="ALBEF"

Demo

Run this command to evaluate GroundVLP on a single image-query pair:

python demo.py \
  --image_path="./docs/demo.jpg" \
  --query="boy with white hair"

If everything is set up correctly, the output image at `output/demo.jpg` should look like:

<p align="center"> <img src='docs/demo_output.jpg' align="center" width="400px"> </p>

Citations

If you find this project useful for your research, please use the following BibTeX entry.

@article{shen2023groundvlp,
  title={GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection},
  author={Shen, Haozhan and Zhao, Tiancheng and Zhu, Mingwei and Yin, Jianwei},
  journal={arXiv preprint arXiv:2312.15043},
  year={2023}
}