
Referring Image Segmentation Using Text Supervision

Official PyTorch implementation of TRIS, from the following paper:

Referring Image Segmentation Using Text Supervision. ICCV 2023.
Fang Liu*, Yuhao Liu*, Yuqiu Kong, Ke Xu, Lihe Zhang, Baocai Yin, Gerhard Hancke, Rynson Lau

Environment

We recommend running the code with PyTorch 1.13.1 or higher.
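To confirm that the installed PyTorch meets this minimum before running anything, a small check like the one below can help. It is a minimal sketch, not part of the repository. The helper name meets_min_version is hypothetical, and the parsing assumes version strings of the usual form "1.13.1" or "1.13.1+cu117".

```python
import re


def meets_min_version(version, minimum=(1, 13, 1)):
    """Return True if a version string such as '1.13.1+cu117' is >= minimum.

    Hypothetical helper: only the leading numeric release part is compared.
    """
    # Keep the numeric release part and drop local tags such as '+cu117'.
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch number (e.g. '2.1') is treated as 0.
    release = tuple(int(part or 0) for part in match.groups())
    return release >= tuple(minimum)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed.")
    else:
        ok = meets_min_version(torch.__version__)
        status = "OK" if ok else "older than 1.13.1"
        print(f"PyTorch {torch.__version__}: {status}")
```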

Dataset

RefCOCO/+/g

Download the refer annotations from refer.
Download the train2014 images from COCO.

├── data/
|   ├── train2014
|   ├── refer
|   |   ├── refcocog
|   |   |   ├── instances.json
|   |   |   ├── refs(google).p
|   |   |   ├── refs(umd).p
|   |   ├── refcoco

ReferIt

Download the parsed annotations from our link.
Download the saiapr_tc-12 images from ReferIt.

├── data/
|   ├── referit
|   |   ├── annotations
|   |   |   ├── train.pickle
|   |   |   ├── test.pickle
|   |   ├── images
|   |   ├── masks

To generate the ReferIt annotations yourself, see MG for details.
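Before training or evaluation, it can save time to confirm that the data matches the two directory trees above. The sketch below only checks that the listed paths exist relative to the repository root. The EXPECTED list is transcribed from the trees in this README. The contents of data/refer/refcoco and of the image and mask folders are not listed here, so they are checked only as directories. The function name missing_paths is hypothetical.

```python
from pathlib import Path

# Paths transcribed from the RefCOCO/+/g and ReferIt trees in this README,
# relative to the repository root. Folder contents are not checked.
EXPECTED = [
    "data/train2014",
    "data/refer/refcocog/instances.json",
    "data/refer/refcocog/refs(google).p",
    "data/refer/refcocog/refs(umd).p",
    "data/refer/refcoco",
    "data/referit/annotations/train.pickle",
    "data/referit/annotations/test.pickle",
    "data/referit/images",
    "data/referit/masks",
]


def missing_paths(root="."):
    """Return the expected dataset paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("Dataset layout looks complete.")
```

Running it from the repository root prints any path that still needs to be downloaded or moved.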

Evaluation

We use mIoU to evaluate the accuracy of the generated masks.
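For reference, a minimal NumPy sketch of mIoU is shown below. It assumes the usual referring-segmentation convention: compute the IoU between each binary predicted mask and its ground-truth mask, then average over all referring expressions. This is an illustration of the metric, not the repository's evaluation code. In particular, how validate.py handles empty masks or thresholds soft predictions is not covered here, and treating two empty masks as IoU 1.0 is a convention chosen for this sketch.

```python
import numpy as np


def mask_iou(pred, gt):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: counted as a perfect match (sketch convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(preds, gts):
    """Average of per-sample IoUs over paired predicted / ground-truth masks."""
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return float(np.mean(ious))


if __name__ == "__main__":
    pred = np.array([[1, 1], [0, 0]])
    gt = np.array([[1, 0], [0, 0]])
    print(mask_iou(pred, gt))  # 1 overlapping pixel / 2 pixels in the union = 0.5
```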

Create the ./weights directory:

mkdir ./weights

Download the model weights from the GitHub links in the table below and put them in ./weights.

Stage	ReferIt	RefCOCO	RefCOCO+	G-Ref (Google)	G-Ref (UMD)
Step-1	weight	weight	weight	weight	weight
Step-2	weight	weight	weight	weight	weight

The following script evaluates on G-Ref (UMD). To evaluate on RefCOCO, replace refcocog with refcoco and umd with unc.

bash scripts/validate_stage1.sh
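The substitution described above can be scripted. The sketch below applies the two string replacements to a command or script text. It is a hypothetical helper (to_refcoco is not part of the repository), and it uses plain substring replacement, so any other occurrence of "umd" in the script would also be changed. It also assumes that renamed paths such as stage1_refcoco_unc.pth match the downloaded RefCOCO weight names, which we have not verified. Review the output before running it.

```python
import sys


def to_refcoco(text):
    """Rewrite G-Ref (UMD) command text for RefCOCO evaluation.

    Hypothetical helper following the README rule: refcocog -> refcoco and
    umd -> unc. Plain substring replacement; review the result.
    """
    return text.replace("refcocog", "refcoco").replace("umd", "unc")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python to_refcoco.py scripts/validate_stage1.sh > validate_refcoco.sh
    with open(sys.argv[1]) as f:
        sys.stdout.write(to_refcoco(f.read()))
```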

Demo

The output of the demo is saved in ./figs/.

python demo.py  --img figs/demo.png  --text 'man on the right'
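To try several referring expressions on the same image, the demo command above can be generated in a loop. This is a sketch that only uses the --img and --text flags shown above. The helper names are hypothetical, the second expression is just an example, and we have not checked whether demo.py gives each expression its own output file in ./figs/, so successive runs may overwrite each other. By default it only prints the commands (dry run).

```python
import shlex
import subprocess


def demo_command(image, text):
    """Build the demo.py invocation for one image / expression pair."""
    return ["python", "demo.py", "--img", image, "--text", text]


def run_demos(image, expressions, dry_run=True):
    """Print (and optionally run) one demo command per expression."""
    for text in expressions:
        cmd = demo_command(image, text)
        print(shlex.join(cmd))  # shell-quoted form, e.g. --text 'man on the right'
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_demos("figs/demo.png", ["man on the right", "man on the left"])
```

Passing a list to subprocess.run avoids shell quoting problems with multi-word expressions.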

Training

Train the Step-1 network on the G-Ref (UMD) dataset.

bash scripts/train_stage1.sh

Run validation on the G-Ref (UMD) train set and generate response maps with the proposed PRMS strategy (--prms). The response maps are saved to ./output/refcocog_umd/cam/, as specified by the --cam_save_dir argument.

## path to save response maps and pseudo labels
dir=./output

python validate.py   --batch_size 1   --size 320   \
    --dataset refcocog   --splitBy umd   --test_split train   \
    --max_query_len 20   --output ./weights/   --resume \
    --pretrain  stage1_refcocog_umd.pth   --cam_save_dir $dir/refcocog_umd/cam/   \
    --name_save_dir $dir/refcocog_umd  --eval --prms  --save_cam
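These response maps are later consumed by IRNet, whose command passes --cam_eval_thres 0.15. In the original IRNet code, this threshold is used when evaluating CAMs: it acts as a constant background score, and each pixel takes the argmax over background and foreground. The sketch below shows that idea for a single response map. It assumes the map is non-negative after clipping and is max-normalized to [0, 1]. Whether run_sample_refer.py uses the threshold in exactly this way, and what array format --save_cam writes to disk, are assumptions we have not verified. response_to_mask is a hypothetical name.

```python
import numpy as np


def response_to_mask(response, thres=0.15):
    """Binarize one response map in the style of IRNet's CAM evaluation.

    The map is clipped at 0, max-normalized to [0, 1], and stacked under a
    constant background plane equal to `thres`. The per-pixel argmax then
    marks a pixel as foreground only where the normalized response > thres.
    """
    response = np.maximum(np.asarray(response, dtype=np.float32), 0)
    peak = response.max()
    if peak > 0:
        response = response / peak
    stacked = np.stack([np.full_like(response, thres), response])  # [bg, fg]
    return stacked.argmax(axis=0).astype(bool)


if __name__ == "__main__":
    print(response_to_mask([[0.0, 0.1], [0.2, 1.0]]))
```

Ties resolve to background because argmax returns the first maximum, so a pixel exactly at the threshold is not foreground.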

Train IRNet and generate pseudo masks.

cd IRNet

dir=../output
## single GPU
CUDA_VISIBLE_DEVICES=0 python run_sample_refer.py \
    --voc12_root ../data/train2014 \
    --cam_out_dir $dir/refcocog_umd/cam \
    --ir_label_out_dir $dir/refcocog_umd/ir_label \
    --ins_seg_out_dir $dir/refcocog_umd/ins_seg \
    --cam_eval_thres 0.15 \
    --work_space output_refer/refcocog_umd \
    --train_list $dir/refcocog_umd/refcocog_train_names.json \
    --num_workers 2 \
    --irn_batch_size 24 \
    --cam_to_ir_label_pass True \
    --train_irn_pass True \
    --make_ins_seg_pass True

## the code runs faster with more GPUs, e.g. four:
# CUDA_VISIBLE_DEVICES=0,1,2,3 python run_sample_refer.py \
#     --cam_out_dir $dir/refcocog_umd/cam \
#     --ir_label_out_dir $dir/refcocog_umd/ir_label \
#     --ins_seg_out_dir $dir/refcocog_umd/ins_seg \
#     --train_list $dir/refcocog_umd/refcocog_train_names.json \
#     --cam_eval_thres 0.15 \
#     --work_space output_refer/refcocog_umd \
#     --num_workers 8 \
#     --irn_batch_size 96 \
#     --cam_to_ir_label_pass True \
#     --train_irn_pass True \
#     --make_ins_seg_pass True
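The single-GPU and four-GPU commands keep the same per-GPU load: 24 samples and 2 data-loader workers per GPU (24 and 2 on one GPU, 96 and 8 on four). The sketch below extends that ratio to other GPU counts. Linear scaling is a rule of thumb inferred from these two commands only. It has not been validated for other GPU counts, and a larger --irn_batch_size may change training behavior. irnet_settings is a hypothetical helper.

```python
def irnet_settings(num_gpus, per_gpu_batch=24, per_gpu_workers=2):
    """Scale IRNet's batch size and worker count linearly with the GPU count.

    Defaults follow the commands above: 24 samples and 2 workers per GPU.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in range(num_gpus)),
        "--irn_batch_size": per_gpu_batch * num_gpus,
        "--num_workers": per_gpu_workers * num_gpus,
    }


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, irnet_settings(n))
```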

Train the Step-2 network on the pseudo masks generated in output/refcocog_umd/ins_seg, as specified by the --pseudo_path argument.

cd ../
bash scripts/train_stage2.sh

## python train_stage2.py  --batch_size 48  --size 320  --dataset refcocog  --splitBy umd  --test_split val  --bert_tokenizer clip  --backbone clip-RN50  --max_query_len 20  --epoch 15  --pseudo_path output/refcocog_umd/ins_seg  --output ./weights/stage2/pseudo_refcocog_umd
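If the IRNet step failed silently, Step-2 training would start without any pseudo masks. A quick pre-flight check of the --pseudo_path directory can catch this. The sketch counts regular files only and makes no assumption about the mask file format. count_pseudo_masks is a hypothetical helper. Call it from the repository root, for example count_pseudo_masks("output/refcocog_umd/ins_seg").

```python
from pathlib import Path


def count_pseudo_masks(pseudo_path):
    """Count files in the pseudo-mask directory; raise if it is missing or empty."""
    path = Path(pseudo_path)
    if not path.is_dir():
        raise FileNotFoundError(f"{path} does not exist; run the IRNet step first.")
    n = sum(1 for p in path.iterdir() if p.is_file())
    if n == 0:
        raise RuntimeError(f"{path} is empty; IRNet did not write any pseudo masks.")
    return n
```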

Acknowledgement

This repository is built on LAVT, WWbL, CLIMS, and IRNet.

Citation

If you find this repository helpful, please consider citing:

@inproceedings{liu2023referring,
  title={Referring Image Segmentation Using Text Supervision},
  author={Liu, Fang and Liu, Yuhao and Kong, Yuqiu and Xu, Ke and Zhang, Lihe and Yin, Baocai and Hancke, Gerhard and Lau, Rynson},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={22124--22134},
  year={2023}
}

Contact

If you have any questions, please feel free to reach out at fawnliu2333@gmail.com.