LAVT: Language-Aware Vision Transformer for Referring Segmentation

Zhao Yang*, Jiaqi Wang*, Xubing Ye*, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H.S. Torr

* Equal Contribution.

LAVT has been officially accepted by TPAMI 2024! 🎉🎉🎉

Welcome to the repository for the method presented in "Language-Aware Vision Transformer for Referring Segmentation." The code in this repository is written in PyTorch. All commands below assume that the working directory is the root directory of this repository.

Setting Up

Preliminaries

The code has been verified to work with PyTorch v1.7.1/v1.8.1 and Python 3.7.

  1. Clone this repository.
  2. Change directory to root of this repository.

Package Dependencies

  1. Create a new Conda environment with Python 3.7, then activate it:
conda create -n lavt python==3.7
conda activate lavt
  2. Install PyTorch v1.7.1 with a CUDA version that works on your cluster/machine (CUDA 10.2 is used in this example):
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch
  3. Install the packages in requirements.txt via pip:
pip install -r requirements.txt
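
After installation, a quick sanity check of the environment can save time later. The snippet below is an optional check, not part of the repository; it just prints the installed versions and confirms that CUDA is visible:

```python
import torch
import torchvision

# Optional sanity check that the environment matches the verified setup
# (PyTorch 1.7.1/1.8.1 with a working CUDA build).
print("PyTorch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
```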

Datasets

Image

  1. Follow instructions in the ./refer directory to set up subdirectories and download annotations. This directory is a git clone (minus two data files that we do not need) from the refer public API.

  2. Download images from COCO. Please use the first download link, 2014 Train images [83K/13GB], and extract the downloaded train_2014.zip file to ./refer/data/images/mscoco/images.
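
Optionally, you can verify that the images landed where the code expects them. The snippet below is a hypothetical check: the train2014 folder name is assumed to be what the zip extracts to, and the expected count (roughly 83K images) is the size of the COCO 2014 training set.

```python
import os

# Hypothetical sanity check for the image dataset layout described above.
# The 'train2014' folder name is assumed to be what train_2014.zip extracts to.
image_dir = "./refer/data/images/mscoco/images/train2014"
assert os.path.isdir(image_dir), f"missing directory: {image_dir}"
num_images = sum(f.endswith(".jpg") for f in os.listdir(image_dir))
print(f"found {num_images} COCO train2014 images")  # roughly 83K expected
```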

Video

Data directories have the following structure:

lavt_video/
└── data/
    β”œβ”€β”€ A2D/
    β”‚   └── Release/
    β”‚       β”œβ”€β”€ a2d_annotation.txt
    β”‚       β”œβ”€β”€ a2d_missed_videos.txt
    β”‚       β”œβ”€β”€ videoset.csv
    β”‚       β”œβ”€β”€ a2d_annotation_with_instances/  # ls -l | wc -l gives 3756
    β”‚       β”‚   └── */  (video folders)
    β”‚       β”‚       └── *.h5  (mask annotation files) 
    β”‚       β”œβ”€β”€ Annotations/
    β”‚       β”‚   β”œβ”€β”€ col  # ls -l | wc -l gives 3783
    β”‚       β”‚   β”‚   └── */ (video folders)
    β”‚       β”‚   β”‚       └── *.png  (masks in png format) 
    β”‚       β”‚   └── mat  # ls -l | wc -l gives 3783
    β”‚       β”‚       └── */ (video folders)
    β”‚       β”‚           └── *.mat  (masks stored as matrices)
    β”‚       β”œβ”€β”€ pngs320H/  # ls -l | wc -l gives 3783
    β”‚       β”‚   └── */ (video folders)
    β”‚       β”‚       └── *.png  (frame images; index starts at 00001)
    β”‚       └── clips320H/  # ls -l | wc -l gives 3783
    β”‚           └── *.mp4  (raw MP4 videos)
    β”‚
    β”‚
    └── ReferringYouTubeVOS2021/
        β”œβ”€β”€ train/
        β”‚    β”œβ”€β”€ Annotations/  # ls -l | wc -l gives 3472
        β”‚    β”‚   └── */  (video folders)
        β”‚    β”‚       └── *.png  (mask images)
        β”‚    β”œβ”€β”€ JPEGImages/  # ls -l | wc -l gives 3472
        β”‚    β”‚   └── */  (video folders)
        β”‚    β”‚       └── *.jpg  (frame images)
        β”‚    └── meta.json  # (this is 2019 training set meta file; has no expressions)
        β”‚
        β”œβ”€β”€ valid/
        β”‚   └── JPEGImages/  # ls -l | wc -l gives 203
        β”‚       └── */  (video folders)
        β”‚           └── *.jpg  (frame images)
        β”œβ”€β”€ test/ 
        β”‚   └── JPEGImages/  # ls -l | wc -l gives 306
        β”‚       └── */  (video folders)
        β”‚           └── *.jpg  (frame images)
        └── meta_expressions/
            β”œβ”€β”€ train/
            β”‚   └── meta_expressions.json  (video meta info with expressions)
            β”œβ”€β”€ valid/
            β”‚   └── meta_expressions.json  (video meta info with expressions)
            └── test/
                └── meta_expressions.json  (video meta info with expressions)
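
If you want to double-check the layout, the sketch below mirrors the `ls -l | wc -l` counts quoted in the tree above (note that `ls -l` prints an extra "total" line, so the number of directory entries should be the quoted count minus one). The `lavt_video` root path is taken from the tree and may need adjusting on your machine.

```python
import os

# Compare directory entry counts against the `ls -l | wc -l` values quoted above.
# `ls -l` adds a "total" line, so we expect (quoted count - 1) entries per directory.
EXPECTED_LS_COUNTS = {
    "data/A2D/Release/a2d_annotation_with_instances": 3756,
    "data/A2D/Release/Annotations/col": 3783,
    "data/A2D/Release/Annotations/mat": 3783,
    "data/A2D/Release/pngs320H": 3783,
    "data/A2D/Release/clips320H": 3783,
    "data/ReferringYouTubeVOS2021/train/Annotations": 3472,
    "data/ReferringYouTubeVOS2021/train/JPEGImages": 3472,
    "data/ReferringYouTubeVOS2021/valid/JPEGImages": 203,
    "data/ReferringYouTubeVOS2021/test/JPEGImages": 306,
}

root = "lavt_video"  # adjust if your data lives elsewhere
for rel_path, ls_count in EXPECTED_LS_COUNTS.items():
    path = os.path.join(root, rel_path)
    n = len(os.listdir(path)) if os.path.isdir(path) else -1
    status = "OK" if n == ls_count - 1 else "CHECK"
    print(f"{status:5s} {rel_path}: {n} entries (expected {ls_count - 1})")
```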

Weights for Training

  1. Create the ./pretrained_weights directory where we will store the weights.
mkdir ./pretrained_weights
  2. The original Swin Transformer: download the pre-trained classification weights of the Swin Transformer, swin_base_patch4_window12_384_22k.pth, into ./pretrained_weights. These weights are needed during training to initialize the model.
  3. The Video Swin Transformer: download swin_tiny_patch244_window877_kinetics400_1k.pth, swin_small_patch244_window877_kinetics400_1k.pth, swin_base_patch244_window877_kinetics400_1k.pth, swin_base_patch244_window877_kinetics400_22k.pth, swin_base_patch244_window877_kinetics600_22k.pth, and swin_base_patch244_window1677_sthv2.pth into ./pretrained_weights.
  4. Create the ./checkpoints directory where the program will save the weights during training. (This applies only to the image-version LAVT; the video version saves the 10 currently best checkpoints in ./models/[args.model_id].)

mkdir ./checkpoints
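
If you want to confirm that a downloaded file is a valid checkpoint before training, a quick inspection like the one below works. This is only an illustrative sketch; official Swin classification checkpoints typically keep their parameters under a 'model' key, but that is an assumption worth verifying for each file.

```python
import torch

# Illustrative check of a downloaded pretrained checkpoint (not part of the repo).
ckpt_path = "./pretrained_weights/swin_base_patch4_window12_384_22k.pth"
ckpt = torch.load(ckpt_path, map_location="cpu")
print("top-level keys:", list(ckpt.keys()))

# Swin checkpoints usually store the weights under a 'model' key (assumption).
state_dict = ckpt.get("model", ckpt)
print("number of parameter tensors:", len(state_dict))
```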

Training

We use DistributedDataParallel from PyTorch. The released lavt weights were trained using 4 x 32G V100 cards (max memory on each card was about 26G). The released lavt_one weights were trained using 8 x 32G V100 cards (max memory on each card was about 13G). The released lavt_video weights were trained using 8 x 32G V100 cards (max memory on each card was about 13G). The extra cards were used only to accelerate training.
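
For orientation, the commands below rely on torch.distributed.launch, which starts one process per GPU and passes a --local_rank argument to each. The sketch below shows the general DDP pattern (process-group init, DistributedSampler, DDP wrapping) with toy stand-ins for the model and data; it is not the repository's train.py.

```python
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Generic DDP skeleton for torch.distributed.launch (toy model and data, not LAVT).
# Each launched process receives its own --local_rank.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
sampler = DistributedSampler(dataset)              # shards data across processes
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

model = torch.nn.Linear(10, 2).cuda(args.local_rank)
model = DDP(model, device_ids=[args.local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-2)

for epoch in range(2):
    sampler.set_epoch(epoch)                       # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(args.local_rank), y.cuda(args.local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

With this launch pattern the DataLoader batch size is per process, so the effective global batch size is typically nproc_per_node times the per-GPU value.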

To run on 4 GPUs (with IDs 0, 1, 2, and 3) on a single node for RIS:

mkdir ./models

mkdir ./models/refcoco
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py --model lavt --dataset refcoco --model_id refcoco --batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base --pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/refcoco/output

mkdir ./models/refcoco+
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py --model lavt --dataset refcoco+ --model_id refcoco+ --batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base --pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/refcoco+/output

mkdir ./models/gref_umd
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py --model lavt --dataset refcocog --splitBy umd --model_id gref_umd --batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base --pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/gref_umd/output

mkdir ./models/gref_google
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py --model lavt --dataset refcocog --splitBy google --model_id gref_google --batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base --pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/gref_google/output
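
Since the four RIS commands differ only in the dataset, split, and model_id, you could also generate them with a small helper like the following. This is purely a convenience sketch; the flags mirror the commands listed above.

```python
# Convenience sketch: print the four RIS training commands listed above.
CONFIGS = [
    ("refcoco",  None,     "refcoco"),
    ("refcoco+", None,     "refcoco+"),
    ("refcocog", "umd",    "gref_umd"),
    ("refcocog", "google", "gref_google"),
]

for dataset, split_by, model_id in CONFIGS:
    split_flag = f"--splitBy {split_by} " if split_by else ""
    print(
        "CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch "
        "--nproc_per_node 4 --master_port 12345 train.py --model lavt "
        f"--dataset {dataset} {split_flag}--model_id {model_id} "
        "--batch-size 8 --lr 0.00005 --wd 1e-2 --swin_type base "
        "--pretrained_swin_weights ./pretrained_weights/swin_base_patch4_window12_384_22k.pth "
        f"--epochs 40 --img_size 480 2>&1 | tee ./models/{model_id}/output"
    )
```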

To run on 8 GPUs (with IDs 0, 1, 2, 3, 4, 5, 6, 7) on a single node for RVOS:

mkdir ./models

mkdir ./models/a2d
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 train.py --model lavt_video --dataset a2d --model_id a2d --batch-size 4 --lr 0.00006 --wd 1e-2 --swin_type tiny --sep_t_pwam --conv3d_kernel_size_t 3-3-3 --conv3d_kernel_size_s 1-1-1 --w_t3x3_s1x1 --mm_t3x3_s1x1 --pretrained_swin_weights ./pretrained_weights/swin_tiny_patch244_window877_kinetics400_1k.pth --epochs 40 --img_size 480 2>&1 | tee ./models/a2d/output

mkdir ./models/ytvos
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 train.py --model lavt_video --dataset ytvos --model_id ytvos --batch-size 1 --lr 0.00005 --wd 1e-2 --swin_type tiny --sep_t_pwam --conv3d_kernel_size_t 3-3-3 --conv3d_kernel_size_s 1-1-1 --w_t3x3_s1x1 --mm_t3x3_s1x1 --pretrained_swin_weights ./pretrained_weights/swin_tiny_patch244_window877_kinetics400_1k.pth --epochs 30 --img_size 480 2>&1 | tee ./models/ytvos/output

Testing

For RefCOCO/RefCOCO+, run one of

python test.py --model lavt --swin_type base --dataset refcoco --split val --resume ./checkpoints/refcoco.pth --workers 4 --ddp_trained_weights --window12 --img_size 480
python test.py --model lavt --swin_type base --dataset refcoco+ --split val --resume ./checkpoints/refcoco+.pth --workers 4 --ddp_trained_weights --window12 --img_size 480

For G-Ref (UMD)/G-Ref (Google), run one of

python test.py --model lavt --swin_type base --dataset refcocog --splitBy umd --split val --resume ./checkpoints/gref_umd.pth --workers 4 --ddp_trained_weights --window12 --img_size 480
python test.py --model lavt --swin_type base --dataset refcocog --splitBy google --split val --resume ./checkpoints/gref_google.pth --workers 4 --ddp_trained_weights --window12 --img_size 480

For A2D, run

python test.py --model lavt_video --swin_type tiny --dataset a2d --conv3d_kernel_size_t 3-3-3 --conv3d_kernel_size_s 1-1-1 --w_t3x3_s1x1 --mm_t3x3_s1x1 --num_frames 8 --split val --resume ./checkpoints/a2d.pth --sample_3 --img_size 480 --clip_length 16

For YTVOS, run

python test_ytvos.py 1 --model lavt_video --sep_t_pwam --conv3d_kernel_size_t 3-3-3 --conv3d_kernel_size_s 1-1-1 --w_t3x3_s1x1 --mm_t3x3_s1x1 --swin_type tiny --dataset ytvos --split valid --resume ./models/ytvos.pth --img_size 480
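
A note on the --ddp_trained_weights flag used in the image-testing commands: checkpoints saved from a DistributedDataParallel-wrapped model typically store parameters under a module. prefix. The sketch below shows how such a state dict can be remapped for an unwrapped model; it illustrates the idea only and is not the repository's exact loading code.

```python
import torch

# Hedged sketch: strip the 'module.' prefix that DistributedDataParallel adds
# to parameter names, so the checkpoint loads into an unwrapped model.
ckpt = torch.load("./checkpoints/refcoco.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # assumes weights may sit under a 'model' key
cleaned = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in state_dict.items()
}
# model.load_state_dict(cleaned)  # 'model' here would be an instantiated LAVT model
print(f"{len(cleaned)} parameter tensors ready to load")
```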

Results and weights

Image

The complete test results of the released LAVT models are summarized below. We report the results of LAVT trained with a multi-class Dice loss and based on the new implementation (lavt_one).

| Dataset | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | Overall IoU | Mean IoU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RefCOCO val | 85.87 | 82.13 | 76.64 | 65.45 | 35.30 | 73.50 | 75.41 |
| RefCOCO test A | 88.47 | 85.63 | 80.57 | 68.84 | 35.71 | 75.97 | 77.31 |
| RefCOCO test B | 80.20 | 76.49 | 70.34 | 60.12 | 34.94 | 69.33 | 71.86 |
| RefCOCO+ val | 76.19 | 72.27 | 66.82 | 56.87 | 30.15 | 63.79 | 67.65 |
| RefCOCO+ test A | 82.50 | 79.44 | 74.00 | 63.27 | 31.99 | 69.79 | 72.53 |
| RefCOCO+ test B | 68.03 | 63.35 | 57.29 | 47.92 | 26.98 | 56.49 | 61.22 |
| G-Ref val (UMD) | 75.82 | 71.06 | 63.99 | 52.98 | 27.31 | 64.02 | 67.41 |
| G-Ref test (UMD) | 76.12 | 71.13 | 64.58 | 53.62 | 28.03 | 64.49 | 67.45 |
| G-Ref val (Goog.) | 72.57 | 68.65 | 63.09 | 53.33 | 28.14 | 61.31 | 64.84 |
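
For reference, P@X is the percentage of expressions whose predicted mask attains an IoU above X with the ground truth, overall IoU accumulates intersection and union over the whole split before dividing, and mean IoU averages the per-expression IoU. The snippet below is a simplified sketch of these definitions, not the repository's evaluation code.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def summarize(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Simplified sketch of P@X, overall IoU, and mean IoU (in percent)."""
    ious, total_inter, total_union = [], 0, 0
    for pred, gt in zip(preds, gts):
        ious.append(mask_iou(pred, gt))
        total_inter += np.logical_and(pred, gt).sum()
        total_union += np.logical_or(pred, gt).sum()
    results = {f"P@{t}": 100.0 * float(np.mean([i > t for i in ious])) for t in thresholds}
    results["Overall IoU"] = 100.0 * total_inter / total_union
    results["Mean IoU"] = 100.0 * float(np.mean(ious))
    return results
```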

To test with trained image LAVT weights, either your own or the released ones below, follow these steps:

  1. Create the ./checkpoints directory where we will store the weights.
mkdir ./checkpoints
  2. Download the LAVT model weights (stored on Google Drive) using the links below and put them in ./checkpoints:
RefCOCO | RefCOCO+ | G-Ref (UMD) | G-Ref (Google)
  3. Model weights and training logs of the new lavt_one implementation are below:
RefCOCO: log | weights
RefCOCO+: log | weights
G-Ref (UMD): log | weights
G-Ref (Google): log | weights

Video

Results on the Refer-YouTube-VOS dataset under the "train-from-scratch" training setting with different backbone networks employed.

| Backbone | J&F | J | F |
| --- | --- | --- | --- |
| Video Swin-T | 57.04 | 55.39 | 58.69 |
| Video Swin-S | 58.79 | 57.10 | 60.49 |
| Video Swin-B | 60.45 | 58.49 | 62.42 |

Results on the Refer-YouTube-VOS dataset under the "pretrain-then-finetune" training setting with different backbone networks employed.

| Backbone | J&F | J | F |
| --- | --- | --- | --- |
| Video Swin-T | 60.91 | 59.37 | 62.45 |
| Video Swin-S | 62.96 | 60.35 | 65.56 |
| Video Swin-B | 64.90 | 62.22 | 67.58 |
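
J is the region similarity (mask IoU) and F is the contour accuracy (a boundary F-measure); J&F is their average. The official Refer-YouTube-VOS numbers come from the benchmark's evaluation server, but a simplified per-frame sketch of the two quantities looks roughly like this:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def region_j(pred, gt):
    """Region similarity J: plain mask IoU."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def boundary_f(pred, gt, tol=3):
    """Simplified contour accuracy F with a pixel tolerance (not the official code)."""
    def boundary(mask):
        return mask & ~binary_erosion(mask)
    pb, gb = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    precision = (pb & binary_dilation(gb, iterations=tol)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, iterations=tol)).sum() / max(gb.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```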

Results on the A2D-Sentences dataset under the "train-from-scratch" training setting with different backbone networks employed.

| Backbone | oIoU | mIoU |
| --- | --- | --- |
| Video Swin-T | 74.4 | 65.9 |
| Video Swin-S | 75.5 | 67.7 |
| Video Swin-B | 77.0 | 68.7 |

Results on the A2D-Sentences dataset under the "pretrain-then-finetune" training setting with different backbone networks employed.

| Backbone | oIoU | mIoU |
| --- | --- | --- |
| Video Swin-T | 77.9 | 70.0 |
| Video Swin-S | 79.1 | 70.4 |
| Video Swin-B | 80.7 | 71.9 |

You can download the video LAVT model weights (stored on the Tsinghua cloud disk) using the links below and put them in ./checkpoints.

| Refer-YouTube-VOS | A2D-Sentences |
| --- | --- |
| Refcoco_pretrain | Refcoco_pretrain |
| YTVOS_finetune | A2D_finetune |
| YTVOS_scratch | A2D_scratch |
| 3D_PWAM_ablation | - |
| CM-FPN_ablation | - |

Contributing

We appreciate all contributions. It helps the project if you could report issues you encounter and open pull requests with fixes or improvements.

Acknowledgements

Code in this repository is built upon several public repositories; for example, the ./refer directory is adapted from the refer public API.

Some of these repositories in turn adapt code from OpenMMLab and TorchVision. We'd like to thank the authors and organizations behind these repositories for open-sourcing their projects.