LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Welcome to the official repository for the method presented in "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation."

(Figure: overview of the LAVT pipeline.)

Code in this repository is written in PyTorch. Unless stated otherwise, the commands and paths below assume that the working directory is the root directory of this repository.

Updates

April 13<sup>th</sup>, 2023. Using the Dice loss instead of the cross-entropy loss can improve results. We will add code and release the corresponding weights when we get a chance. (A generic sketch of a multi-class Dice loss is included at the end of this section for reference.)

June 21<sup>st</sup>, 2022. Uploaded the training logs and trained model weights of lavt_one.

June 9<sup>th</sup>, 2022. Added a more efficient implementation of LAVT.
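
The April 2023 update above mentions that training with a Dice loss instead of cross-entropy can improve results. The snippet below is a minimal sketch of a multi-class (softmax) Dice loss in PyTorch; the helper name `multiclass_dice_loss` is hypothetical, and this is an illustration of the general technique, not necessarily the exact formulation used for the released lavt_one weights.

```python
import torch
import torch.nn.functional as F


def multiclass_dice_loss(logits, targets, eps=1e-6):
    """Soft Dice loss averaged over classes (hypothetical helper, for illustration).

    logits:  (N, C, H, W) raw scores from the segmentation head.
    targets: (N, H, W) integer class labels in [0, C).
    """
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)                         # (N, C, H, W)
    one_hot = F.one_hot(targets, num_classes)                # (N, H, W, C)
    one_hot = one_hot.permute(0, 3, 1, 2).float()            # (N, C, H, W)

    dims = (0, 2, 3)  # sum over batch and spatial dimensions
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)  # per-class soft Dice
    return 1.0 - dice.mean()


# Example: referring segmentation is effectively 2-class (background vs. referred object).
logits = torch.randn(2, 2, 480, 480)
targets = torch.randint(0, 2, (2, 480, 480))
print(multiclass_dice_loss(logits, targets).item())
```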

Setting Up

Preliminaries

The code has been verified to work with PyTorch v1.7.1 and Python 3.7.

  1. Clone this repository.
  2. Change directory to root of this repository.

Package Dependencies

  1. Create a new Conda environment with Python 3.7, then activate it:
conda create -n lavt python==3.7
conda activate lavt
  2. Install PyTorch v1.7.1 with a CUDA version that works on your cluster/machine (CUDA 10.2 is used in this example):
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch
  3. Install the packages in requirements.txt via pip (a quick environment check follows these steps):
pip install -r requirements.txt
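
After installing the dependencies, a quick sanity check (a minimal sketch, not part of the repository) can confirm that the intended PyTorch build and the CUDA runtime are visible:

```python
import torch
import torchvision

# Expect 1.7.1 / 0.8.2 if the commands above were used.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPUs:", torch.cuda.device_count(), torch.cuda.get_device_name(0))
```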

Datasets

  1. Follow the instructions in the ./refer directory to set up subdirectories and download annotations. This directory is a git clone (minus two data files that we do not need) of the refer public API.

  2. Download images from COCO. Please use the first download link, 2014 Train images [83K/13GB], and extract the downloaded train2014.zip file to ./refer/data/images/mscoco/images (an optional sanity check is sketched below).
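
If the annotations and images are in place, a short check with the refer API should run without errors. The snippet below is an optional sketch based on the REFER class shipped in ./refer; the argument names (data_root, dataset, splitBy) follow that API as we understand it, so adjust them if your copy differs.

```python
import sys

# Make refer.py (the refer API cloned into ./refer) importable from the repository root.
sys.path.insert(0, './refer')
from refer import REFER

# 'unc' is the standard splitBy for RefCOCO/RefCOCO+; use 'umd' or 'google' for G-Ref.
refer = REFER(data_root='./refer/data', dataset='refcoco', splitBy='unc')
ref_ids = refer.getRefIds(split='train')
print('Loaded', len(ref_ids), 'training referring expressions.')
```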

The Initialization Weights for Training

  1. Create the ./pretrained_weights directory where we will be storing the weights.
mkdir ./pretrained_weights
  2. Download pre-trained classification weights of the Swin Transformer (e.g., swin_base_patch4_window12_384_22k.pth, as referenced in the training commands below) and put the .pth file in ./pretrained_weights. These weights are needed to initialize the model for training; an optional way to inspect them is sketched below.
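
To verify that the downloaded Swin weights are readable before launching training, you can inspect the checkpoint with plain torch.load. This is an optional sketch; the exact key layout depends on the checkpoint you downloaded (official Swin classification checkpoints typically nest the weights under a 'model' key).

```python
import torch

ckpt_path = './pretrained_weights/swin_base_patch4_window12_384_22k.pth'
ckpt = torch.load(ckpt_path, map_location='cpu')

# Official Swin classification checkpoints usually store the state dict under 'model';
# fall back to the object itself otherwise.
state_dict = ckpt['model'] if isinstance(ckpt, dict) and 'model' in ckpt else ckpt
print('parameter tensors:', len(state_dict))
print('example keys:', list(state_dict)[:5])
```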

Trained Weights of LAVT for Testing

  1. Create the ./checkpoints directory where we will be storing the weights.
mkdir ./checkpoints
  2. Download LAVT model weights (which are stored on Google Drive) using the links below and put them in ./checkpoints:
RefCOCO | RefCOCO+ | G-Ref (UMD) | G-Ref (Google)
  3. Model weights and training logs of the new lavt_one implementation are below:

| RefCOCO | RefCOCO+ | G-Ref (UMD) | G-Ref (Google) |
|---|---|---|---|
| log \| weights | log \| weights | log \| weights | log \| weights |

Training

We use DistributedDataParallel from PyTorch. The released lavt weights were trained using 4 x 32G V100 cards (max memory on each card was about 26G). The released lavt_one weights were trained using 8 x 32G V100 cards (max memory on each card was about 13G). More cards were used only to accelerate training. To run on 4 GPUs (with IDs 0, 1, 2, and 3) on a single node, use the commands below; a simplified sketch of the underlying DistributedDataParallel pattern follows them.

mkdir ./models

mkdir ./models/refcoco
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py --model lavt --dataset refcoco --model_id refcoco --batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base --pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/refcoco/output

mkdir ./models/refcoco+
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py --model lavt --dataset refcoco+ --model_id refcoco+ --batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base --pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/refcoco+/output

mkdir ./models/gref_umd
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py --model lavt --dataset refcocog --splitBy umd --model_id gref_umd --batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base --pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/gref_umd/output

mkdir ./models/gref_google
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py --model lavt --dataset refcocog --splitBy google --model_id gref_google --batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base --pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/gref_google/output
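
For reference, the general pattern that torch.distributed.launch expects from a training script looks roughly like the sketch below. This is a generic, simplified outline of DistributedDataParallel usage under PyTorch 1.7-style launching (which passes --local_rank to each process), with a dummy dataset and model; it is not the repository's actual train.py.

```python
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

# One process per GPU: bind this process to its GPU and join the process group.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

# Dummy data and model; the real script builds the referring-expression dataset and LAVT here.
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
sampler = DistributedSampler(dataset)  # shards the data across processes
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

model = torch.nn.Linear(8, 1).cuda()
model = DDP(model, device_ids=[args.local_rank])
# lr/wd mirror the values used in the commands above.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-2)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for x, y in loader:
        loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```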

Testing

For RefCOCO/RefCOCO+, run one of

python test.py --model lavt --swin_type base --dataset refcoco --split val --resume ./checkpoints/refcoco.pth --workers 4 --ddp_trained_weights --window12 --img_size 480
python test.py --model lavt --swin_type base --dataset refcoco+ --split val --resume ./checkpoints/refcoco+.pth --workers 4 --ddp_trained_weights --window12 --img_size 480

For G-Ref (UMD)/G-Ref (Google), run one of

python test.py --model lavt --swin_type base --dataset refcocog --splitBy umd --split val --resume ./checkpoints/gref_umd.pth --workers 4 --ddp_trained_weights --window12 --img_size 480
python test.py --model lavt --swin_type base --dataset refcocog --splitBy google --split val --resume ./checkpoints/gref_google.pth --workers 4 --ddp_trained_weights --window12 --img_size 480

Results

  1. The evaluation results (those reported in the paper) of LAVT trained with a cross-entropy loss and based on our original implementation are summarized as follows:

| Dataset | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | Overall IoU | Mean IoU |
|---|---|---|---|---|---|---|---|
| RefCOCO val | 84.46 | 80.90 | 75.28 | 64.71 | 34.30 | 72.73 | 74.46 |
| RefCOCO test A | 88.07 | 85.17 | 79.90 | 68.52 | 35.69 | 75.82 | 76.89 |
| RefCOCO test B | 79.12 | 74.94 | 69.17 | 59.37 | 34.45 | 68.79 | 70.94 |
| RefCOCO+ val | 74.44 | 70.91 | 65.58 | 56.34 | 30.23 | 62.14 | 65.81 |
| RefCOCO+ test A | 80.68 | 77.96 | 72.90 | 62.21 | 32.36 | 68.38 | 70.97 |
| RefCOCO+ test B | 65.66 | 61.85 | 55.94 | 47.56 | 27.24 | 55.10 | 59.23 |
| G-Ref val (UMD) | 70.81 | 65.28 | 58.60 | 47.49 | 22.73 | 61.24 | 63.34 |
| G-Ref test (UMD) | 71.54 | 66.38 | 59.00 | 48.21 | 23.10 | 62.09 | 63.62 |
| G-Ref val (Goog.) | 71.16 | 67.21 | 61.76 | 51.98 | 27.30 | 60.50 | 63.66 |

       - We have validated LAVT on RefCOCO with multiple runs. The overall IoU on the val set generally lies in the range of 72.73±0.5%.

  2. In the following, we report the results of LAVT trained with a multi-class Dice loss and based on the new implementation (lavt_one); a short description of the metrics follows the table.

| Dataset | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | Overall IoU | Mean IoU |
|---|---|---|---|---|---|---|---|
| RefCOCO val | 85.87 | 82.13 | 76.64 | 65.45 | 35.30 | 73.50 | 75.41 |
| RefCOCO test A | 88.47 | 85.63 | 80.57 | 68.84 | 35.71 | 75.97 | 77.31 |
| RefCOCO test B | 80.20 | 76.49 | 70.34 | 60.12 | 34.94 | 69.33 | 71.86 |
| RefCOCO+ val | 76.19 | 72.27 | 66.82 | 56.87 | 30.15 | 63.79 | 67.65 |
| RefCOCO+ test A | 82.50 | 79.44 | 74.00 | 63.27 | 31.99 | 69.79 | 72.53 |
| RefCOCO+ test B | 68.03 | 63.35 | 57.29 | 47.92 | 26.98 | 56.49 | 61.22 |
| G-Ref val (UMD) | 75.82 | 71.06 | 63.99 | 52.98 | 27.31 | 64.02 | 67.41 |
| G-Ref test (UMD) | 76.12 | 71.13 | 64.58 | 53.62 | 28.03 | 64.49 | 67.45 |
| G-Ref val (Goog.) | 72.57 | 68.65 | 63.09 | 53.33 | 28.14 | 61.31 | 64.84 |
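
The columns above are the standard referring-segmentation metrics: P@X is the fraction of referring expressions whose predicted mask reaches an IoU of at least X with the ground truth, overall IoU accumulates intersection and union over the whole split before dividing, and mean IoU averages the per-expression IoU. The sketch below (with a hypothetical helper name, not the repository's evaluation code) shows how these quantities are typically computed from binary masks:

```python
import torch


def referring_segmentation_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """preds, gts: lists of boolean (H, W) masks, one pair per referring expression."""
    ious, total_inter, total_union = [], 0, 0
    for pred, gt in zip(preds, gts):
        inter = (pred & gt).sum().item()
        union = (pred | gt).sum().item()
        ious.append(inter / union if union > 0 else 1.0)  # both empty counts as a perfect match
        total_inter += inter
        total_union += union

    precision_at = {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    overall_iou = total_inter / total_union  # cumulative I over cumulative U
    mean_iou = sum(ious) / len(ious)         # average of per-expression IoU
    return precision_at, overall_iou, mean_iou


# Tiny example with two random "expressions".
preds = [torch.rand(480, 480) > 0.5 for _ in range(2)]
gts = [torch.rand(480, 480) > 0.5 for _ in range(2)]
print(referring_segmentation_metrics(preds, gts))
```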

Demo: Try LAVT on Your Own Image-Text Pairs

You can run inference on any image-text pair and visualize the result by running the script ./demo_inference.py. Have fun!

Citing LAVT

@inproceedings{yang2022lavt,
  title={LAVT: Language-Aware Vision Transformer for Referring Image Segmentation},
  author={Yang, Zhao and Wang, Jiaqi and Tang, Yansong and Chen, Kai and Zhao, Hengshuang and Torr, Philip HS},
  booktitle={CVPR},
  year={2022}
}

Contributing

We appreciate all contributions. Reporting issues and opening pull requests both help the project.

Acknowledgements

Code in this repository is built upon several public repositories, including the refer API and the official Swin Transformer code. Some of these repositories in turn adapt code from OpenMMLab and TorchVision. We'd like to thank the authors and organizations of these repositories for open-sourcing their projects.

License

GNU GPLv3