LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Where we are?

There is still a gap of about 1% from the original paper, but the results already surpass many SOTA methods.

| ckpt_448_epoch_25.pth | mIoU | Overall IoU |
| --- | --- | --- |
| RefCOCO val | 70.743 | 71.671 |
| RefCOCO testA | 73.679 | 74.772 |
| RefCOCO testB | 67.582 | 67.339 |
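
The two columns follow the standard referring-segmentation metrics: mIoU averages the per-sample IoU, while Overall IoU divides the accumulated intersection by the accumulated union over the whole split, so larger objects weigh more. A minimal sketch of both, assuming boolean mask tensors (not code from this repo):

```python
import torch

def compute_ious(preds, gts):
    """preds, gts: lists of boolean segmentation masks (torch.Tensor).

    Returns (mIoU, Overall IoU): mIoU averages per-sample IoU;
    Overall IoU is total intersection over total union for the split.
    """
    per_sample_iou = []
    total_inter, total_union = 0, 0
    for pred, gt in zip(preds, gts):
        inter = (pred & gt).sum().item()
        union = (pred | gt).sum().item()
        per_sample_iou.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    miou = sum(per_sample_iou) / len(per_sample_iou)
    overall_iou = total_inter / total_union
    return miou, overall_iou
```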

A reproduction of the original paper.

Paper: https://arxiv.org/abs/2112.02244

Official implementation: https://github.com/yz93/LAVT-RIS

Architecture
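
The core idea of LAVT is to fuse linguistic features into the Swin Transformer encoder itself, stage by stage, through pixel-word attention followed by a learnable language gate. The sketch below illustrates that fusion pattern only; the module name, shapes, and single-head attention are illustrative simplifications, not the classes used in this repo:

```python
import torch
import torch.nn as nn

class PixelWordFusion(nn.Module):
    """Illustrative single-head pixel-word attention + language gate.

    Pixel features attend over word features; the gated result is added
    back to the visual stream (the idea behind LAVT's language-aware
    fusion, greatly simplified here).
    """
    def __init__(self, vis_dim, lang_dim):
        super().__init__()
        self.q = nn.Linear(vis_dim, vis_dim)   # queries from pixels
        self.k = nn.Linear(lang_dim, vis_dim)  # keys from words
        self.v = nn.Linear(lang_dim, vis_dim)  # values from words
        self.gate = nn.Sequential(             # learnable language gate
            nn.Linear(vis_dim, vis_dim), nn.Tanh()
        )

    def forward(self, vis, lang):
        # vis: (B, HW, C_v) flattened pixel features from a Swin stage
        # lang: (B, T, C_l) word features from the language encoder
        attn = torch.softmax(
            self.q(vis) @ self.k(lang).transpose(1, 2) / vis.size(-1) ** 0.5,
            dim=-1,
        )                                       # (B, HW, T) pixel-to-word weights
        fused = attn @ self.v(lang)             # language-aware pixel features
        return vis + self.gate(fused) * fused   # gated residual fusion
```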

Features

Usage

See args.py for the detailed argument settings.
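
As a quick reference, the flags used in the commands below roughly correspond to an argparse setup like this (a sketch only; defaults here are guesses, and args.py is authoritative):

```python
import argparse

# Flags taken from the commands in this README; defaults are guesses.
parser = argparse.ArgumentParser("LAVT reproduction")
parser.add_argument("--batch_size", type=int, default=2)   # per-GPU batch size
parser.add_argument("--size", type=int, default=448)       # input resolution
parser.add_argument("--cfg_file", type=str,                # Swin backbone config
                    default="configs/swin_base_patch4_window7_224.yaml")
parser.add_argument("--resume", action="store_true")       # load a checkpoint
parser.add_argument("--pretrain", type=str, default="")    # checkpoint filename
parser.add_argument("--eval", action="store_true")         # evaluation only
parser.add_argument("--type", type=str, default="val")     # val / testA / testB
parser.add_argument("--eval_mode", type=str, default="cat")  # sentence handling
args = parser.parse_args()
```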

for training

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448

for evaluation

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node 4 --master_port 23458 main.py --size 448 --batch_size 1 --resume --eval --type val --eval_mode cat --pretrain ckpt_448_epoch_20.pth --cfg_file configs/swin_base_patch4_window7_224.yaml

Put all *.pth checkpoint files under ./checkpoint.

for resuming from a checkpoint

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12346 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448 --resume --pretrain ckpt_448_epoch_10.pth

Areas to improve

The official code had not been released when I wrote this reproduction, so some of the detailed settings may differ from the official implementation.

Results

See inference.ipynb for details.
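
For orientation, a single-image inference call might look roughly like the following. This is a sketch with guessed interfaces: the model's forward signature and the preprocessing are assumptions, and the notebook is authoritative.

```python
# Hypothetical inference sketch; the real pipeline is in inference.ipynb.
# `model` is the trained LAVT network loaded elsewhere; its forward
# signature (image, input_ids, attention_mask) is a guess.
import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer  # LAVT uses a BERT language encoder

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),  # matches --size 448
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segment(model, image_path, sentence):
    """Return a predicted binary mask for one image/sentence pair."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    tokens = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(image, tokens["input_ids"], tokens["attention_mask"])
    return logits.argmax(dim=1).squeeze(0)  # (H, W) mask of {0, 1}
```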

input sentences

  1. right girl
  2. closest girl on right

results