# Referring Image Segmentation Using Text Supervision

Official PyTorch implementation of TRIS, from the following paper:

Referring Image Segmentation Using Text Supervision. ICCV 2023.
Fang Liu*, Yuhao Liu*, Yuqiu Kong, Ke Xu, Lihe Zhang, Baocai Yin, Gerhard Hancke, Rynson Lau


<p align="left"> <img src="figs/pipeline.png" class="center"> </p>

## Environment

We recommend running the code with <b>PyTorch 1.13.1</b> or a later version.

<!--
```bash
conda env create -f environment.yml
```
-->
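If you prefer to set the environment up manually, here is a minimal sketch; only the PyTorch version comes from the note above, while the Python version and CUDA wheel index are assumptions to adjust for your system:

```bash
# Minimal manual setup (sketch); match the CUDA wheel to your local driver.
conda create -n tris python=3.9 -y
conda activate tris
pip install torch==1.13.1 torchvision==0.14.1 \
    --extra-index-url https://download.pytorch.org/whl/cu117
```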

## Dataset

### RefCOCO/+/g

  1. Download the annotations from refer.
  2. Download the train2014 images from COCO (a download sketch follows the layout below).
Organize the data as follows:

```
├── data/
|   ├── train2014
|   ├── refer
|   |   ├── refcocog
|   |   |   ├── instances.json
|   |   |   ├── refs(google).p
|   |   |   ├── refs(umd).p
|   |   ├── refcoco
```
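The train2014 images (step 2) can be fetched directly from the standard COCO mirror, for example:

```bash
# Download COCO train2014 into data/ (roughly 13 GB).
mkdir -p data && cd data
wget http://images.cocodataset.org/zips/train2014.zip
unzip -q train2014.zip && rm train2014.zip
cd ..
```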

### ReferIt

  1. Download the parsed annotations from our link (a quick sanity check is sketched below).
  2. Download the saiapr_tc-12 images from referit.
The expected layout:

```
├── data/
|   ├── referit
|   |   ├── annotations
|   |   |   ├── train.pickle
|   |   |   ├── test.pickle
|   |   ├── images
|   |   ├── masks
```
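As a quick sanity check after downloading, you can peek at the parsed annotations. The snippet below only assumes they are standard pickle files; their exact schema is not documented here:

```python
import pickle

# Load the parsed ReferIt annotations and report the top-level structure.
with open("data/referit/annotations/train.pickle", "rb") as f:
    anns = pickle.load(f)
print(type(anns), len(anns))
```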

If you want to generate the ReferIt annotations yourself, refer to MG for more details.

## Evaluation

Note that we use <b>mIoU</b> to evaluate the accuracy of the generated masks.
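For reference, mIoU here is the per-sample intersection-over-union between the predicted and ground-truth masks, averaged over the dataset. A minimal NumPy sketch (illustrative only, not the repository's exact implementation):

```python
import numpy as np

def mean_iou(pred_masks, gt_masks):
    """Mean IoU over paired binary masks (illustrative sketch)."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```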

1. Create the `./weights` directory:

    ```bash
    mkdir ./weights
    ```

2. Download the model weights using the GitHub links below and put them in `./weights`:

    |        | ReferIt | RefCOCO | RefCOCO+ | G-Ref (Google) | G-Ref (UMD) |
    | ------ | ------- | ------- | -------- | -------------- | ----------- |
    | Step-1 | weight  | weight  | weight   | weight         | weight      |
    | Step-2 | weight  | weight  | weight   | weight         | weight      |

3. Run the script below to evaluate on G-Ref (UMD). To evaluate on RefCOCO, replace `refcocog` with `refcoco` and `umd` with `unc`.

    ```bash
    bash scripts/validate_stage1.sh
    ```
<!--
```bash
python validate.py --batch_size 1 --size 320 --dataset refcocog --splitBy umd \
    --test_split val --max_query_len 20 --dataset_root ./data --output weights/ \
    --resume --pretrain stage1_refcocog_umd.pth --eval
```
For the ReferIt dataset:
```bash
python validate_referit.py --batch_size 1 --size 320 --dataset referit \
    --test_split test --backbone clip-RN50 --max_query_len 20 \
    --dataset_root ./data/referit/ --output weights/ --resume \
    --pretrain stage1_referit.pth --eval
```
-->

## Demo

The output of the demo is saved in `./figs/`.

```bash
python demo.py --img figs/demo.png --text 'man on the right'
```
<p align="left"> <img src="figs/demo.png" style="width: 200px; height: auto; "> <img src="figs/demo_(man on the right).png" style="width: 200px; height: auto;"> </p>
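To try several referring expressions on the same image, the demo can simply be run in a loop (a usage sketch; the second phrase is an example):

```bash
# Run the demo once per referring expression; each result is written to ./figs/.
for text in 'man on the right' 'man on the left'; do
    python demo.py --img figs/demo.png --text "$text"
done
```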

## Training

1. Train the Step-1 network on the G-Ref (UMD) dataset:

    ```bash
    bash scripts/train_stage1.sh
    ```

    <!--
    ```bash
    python train_stage1.py --batch_size 48 --size 320 --dataset refcocog --splitBy umd \
        --test_split val --epoch 15 --backbone clip-RN50 --max_query_len 20 \
        --negative_samples 3 --output ./weights/refcocog_umd --board_folder ./output/board
    ```
    -->
2. Validate and generate response maps on the G-Ref (UMD) train set using the proposed PRMS strategy (`--prms`). The response maps are saved in `./output/refcocog_umd/cam/`, as specified by the `--cam_save_dir` argument:

    ```bash
    ## path to save response maps and pseudo labels
    dir=./output

    python validate.py --batch_size 1 --size 320 \
        --dataset refcocog --splitBy umd --test_split train \
        --max_query_len 20 --output ./weights/ --resume \
        --pretrain stage1_refcocog_umd.pth --cam_save_dir $dir/refcocog_umd/cam/ \
        --name_save_dir $dir/refcocog_umd --eval --prms --save_cam
    ```
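    To eyeball a generated response map, a small visualization sketch follows. It assumes the maps are stored as NumPy arrays; the file names under `cam/` and the example image are hypothetical, so adapt them to what the run actually produces:

    ```python
    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image

    # Hypothetical file names -- list ./output/refcocog_umd/cam/ for real ones.
    cam = np.load("output/refcocog_umd/cam/example.npy")
    img = Image.open("data/train2014/COCO_train2014_000000000009.jpg")

    # Normalize the map to [0, 1], rescale to the image size, and overlay it.
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    heat = Image.fromarray((255 * cam).astype("uint8")).resize(img.size)
    plt.imshow(img)
    plt.imshow(np.asarray(heat), cmap="jet", alpha=0.5)
    plt.axis("off")
    plt.savefig("cam_overlay.png", bbox_inches="tight")
    ```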
3. Train IRNet and generate pseudo masks:

    ```bash
    cd IRNet

    dir=../output
    ## single GPU
    ## (set --voc12_root to your local train2014 image directory)
    CUDA_VISIBLE_DEVICES=0 python run_sample_refer.py \
        --voc12_root ../../../work/datasets/train2014 \
        --cam_out_dir $dir/refcocog_umd/cam \
        --ir_label_out_dir $dir/refcocog_umd/ir_label \
        --ins_seg_out_dir $dir/refcocog_umd/ins_seg \
        --cam_eval_thres 0.15 \
        --work_space output_refer/refcocog_umd \
        --train_list $dir/refcocog_umd/refcocog_train_names.json \
        --num_workers 2 \
        --irn_batch_size 24 \
        --cam_to_ir_label_pass True \
        --train_irn_pass True \
        --make_ins_seg_pass True

    ## the code can run faster if more GPUs are available
    # CUDA_VISIBLE_DEVICES=0,1,2,3 python run_sample_refer.py \
    #     --cam_out_dir $dir/refcocog_umd/cam \
    #     --ir_label_out_dir $dir/refcocog_umd/ir_label \
    #     --ins_seg_out_dir $dir/refcocog_umd/ins_seg \
    #     --train_list $dir/refcocog_umd/refcocog_train_names.json \
    #     --cam_eval_thres 0.15 \
    #     --work_space output_refer/refcocog_umd \
    #     --num_workers 8 \
    #     --irn_batch_size 96 \
    #     --cam_to_ir_label_pass True \
    #     --train_irn_pass True \
    #     --make_ins_seg_pass True
    ```
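    For intuition, `--cam_eval_thres 0.15` sets where a normalized response map is cut into foreground seeds before IRNet refines them into instance masks. A minimal sketch of this kind of thresholding (illustrative only, not IRNet's exact logic):

    ```python
    import numpy as np

    def cam_to_seed(cam: np.ndarray, thres: float = 0.15) -> np.ndarray:
        """Binarize a response map into a foreground seed mask (sketch)."""
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
        return (cam > thres).astype(np.uint8)
    ```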
4. Train the Step-2 network using the pseudo masks generated in `output/refcocog_umd/ins_seg`, as specified by the `--pseudo_path` argument:

    ```bash
    cd ../
    bash scripts/train_stage2.sh

    ## python train_stage2.py --batch_size 48 --size 320 --dataset refcocog --splitBy umd \
    ##     --test_split val --bert_tokenizer clip --backbone clip-RN50 --max_query_len 20 \
    ##     --epoch 15 --pseudo_path output/refcocog_umd/ins_seg \
    ##     --output ./weights/stage2/pseudo_refcocog_umd
    ```
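To evaluate the resulting Step-2 checkpoint, the Step-1 validation command can presumably be reused with the new output directory; the checkpoint filename below is an assumption, so check what `train_stage2.py` actually saves:

```bash
# Hypothetical Step-2 evaluation; the --pretrain filename is an assumption.
python validate.py --batch_size 1 --size 320 --dataset refcocog --splitBy umd \
    --test_split val --max_query_len 20 --dataset_root ./data \
    --output ./weights/stage2/pseudo_refcocog_umd --resume \
    --pretrain model_best.pth --eval
```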

## Acknowledgement

This repository builds upon LAVT, WWbL, CLIMS, and IRNet.

## Citation

If you find this repository helpful, please consider citing:

```bibtex
@inproceedings{liu2023referring,
  title={Referring Image Segmentation Using Text Supervision},
  author={Liu, Fang and Liu, Yuhao and Kong, Yuqiu and Xu, Ke and Zhang, Lihe and Yin, Baocai and Hancke, Gerhard and Lau, Rynson},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={22124--22134},
  year={2023}
}
```

## Contact

If you have any questions, please feel free to reach out at fawnliu2333@gmail.com.