
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval


Official PyTorch implementation of the paper Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval. (CVPR 2023) arXiv

Updates

Highlights

The goal of this work is to enhance global text-to-image person retrieval performance without requiring any additional supervision or inference cost. To achieve this, we utilize the full CLIP model as our feature extraction backbone. Additionally, we propose a novel cross-modal matching loss (Similarity Distribution Matching, SDM) and an Implicit Relation Reasoning module to mine fine-grained image-text relationships, enabling IRRA to learn more discriminative global image-text representations.
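The core idea of SDM can be sketched as matching a predicted image-to-text similarity distribution against the ground-truth matching distribution with a KL divergence. Below is a minimal NumPy sketch of that idea, assuming a cosine-similarity matrix and a binary matching matrix; the function name, temperature value, and epsilon are illustrative, and the actual implementation in this repo is in PyTorch with further details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdm_loss(sim, labels, tau=0.02, eps=1e-8):
    """Sketch of a Similarity Distribution Matching loss.

    sim:    (B, B) image-to-text similarity matrix
    labels: (B, B) binary matrix, 1 where image i and text j match
    """
    # Predicted matching distribution from temperature-scaled similarities.
    p = softmax(sim / tau, axis=1)
    # Ground-truth matching distribution (rows normalized to sum to 1).
    q = labels / labels.sum(axis=1, keepdims=True)
    # KL(p || q), averaged over the batch; eps avoids log(0).
    return np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=1))
```

When the largest similarities sit on matched pairs, the loss is near zero; mismatched similarities drive it up, which is the signal used to align the two modalities.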

Usage

Requirements

We use a single RTX 3090 (24 GB) GPU for training and evaluation.

pytorch 1.9.0
torchvision 0.10.0
prettytable
easydict
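One possible environment setup matching the versions above (the conda environment name `irra` and the Python version are illustrative, not prescribed by this repo):

```shell
# Hypothetical setup; package versions follow the requirements list above.
conda create -n irra python=3.8 -y
conda activate irra
pip install torch==1.9.0 torchvision==0.10.0
pip install prettytable easydict
```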

Prepare Datasets

Download the CUHK-PEDES dataset from here, the ICFG-PEDES dataset from here, and the RSTPReid dataset from here.

Organize them in your dataset root dir folder as follows:

|-- your dataset root dir/
|   |-- <CUHK-PEDES>/
|       |-- imgs
|            |-- cam_a
|            |-- cam_b
|            |-- ...
|       |-- reid_raw.json
|
|   |-- <ICFG-PEDES>/
|       |-- imgs
|            |-- test
|            |-- train 
|       |-- ICFG_PEDES.json
|
|   |-- <RSTPReid>/
|       |-- imgs
|       |-- data_captions.json

Training

python train.py \
--name iira \
--img_aug \
--batch_size 64 \
--MLM \
--loss_names 'sdm+mlm+id' \
--dataset_name 'CUHK-PEDES' \
--root_dir 'your dataset root dir' \
--num_epoch 60

Testing

python test.py --config_file 'path/to/model_dir/configs.yaml'

IRRA on Text-to-Image Person Retrieval Results

CUHK-PEDES dataset

| Method | Backbone | Rank-1 | Rank-5 | Rank-10 | mAP | mINP |
|---|---|---|---|---|---|---|
| CMPM/C | RN50/LSTM | 49.37 | - | 79.27 | - | - |
| DSSL | RN50/BERT | 59.98 | 80.41 | 87.56 | - | - |
| SSAN | RN50/LSTM | 61.37 | 80.15 | 86.73 | - | - |
| Han et al. | RN101/Xformer | 64.08 | 81.73 | 88.19 | 60.08 | - |
| LGUR | DeiT-Small/BERT | 65.25 | 83.12 | 89.00 | - | - |
| IVT | ViT-B-16/BERT | 65.59 | 83.11 | 89.21 | - | - |
| CFine | ViT-B-16/BERT | 69.57 | 85.93 | 91.15 | - | - |
| CLIP | ViT-B-16/Xformer | 68.19 | 86.47 | 91.47 | 61.12 | 44.86 |
| IRRA (ours) | ViT-B-16/Xformer | 73.38 | 89.93 | 93.71 | 66.13 | 50.24 |

Model & log for CUHK-PEDES

ICFG-PEDES dataset

| Method | Rank-1 | Rank-5 | Rank-10 | mAP | mINP |
|---|---|---|---|---|---|
| CMPM/C | 43.51 | 65.44 | 74.26 | - | - |
| SSAN | 54.23 | 72.63 | 79.53 | - | - |
| IVT | 56.04 | 73.60 | 80.22 | - | - |
| CFine | 60.83 | 76.55 | 82.42 | - | - |
| CLIP | 56.74 | 75.72 | 82.26 | 31.84 | 5.03 |
| IRRA (ours) | 63.46 | 80.24 | 85.82 | 38.05 | 7.92 |

Model & log for ICFG-PEDES

RSTPReid dataset

| Method | Rank-1 | Rank-5 | Rank-10 | mAP | mINP |
|---|---|---|---|---|---|
| DSSL | 39.05 | 62.60 | 73.95 | - | - |
| SSAN | 43.50 | 67.80 | 77.15 | - | - |
| IVT | 46.70 | 70.00 | 78.80 | - | - |
| CFine | 50.55 | 72.50 | 81.60 | - | - |
| CLIP | 54.05 | 80.70 | 88.00 | 43.41 | 22.31 |
| IRRA (ours) | 60.20 | 81.30 | 88.20 | 47.17 | 25.28 |

Model & log for RSTPReid

Acknowledgments

Some components of this code implementation are adapted from CLIP, TextReID, and TransReID. We sincerely appreciate their contributions.

Citation

If you find this code useful for your research, please cite our paper.

@inproceedings{cvpr23crossmodal,
  title={Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval},
  author={Jiang, Ding and Ye, Mang},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023},
}

Contact

If you have any questions, please feel free to contact us. E-mail: jiangding@whu.edu.cn, yemang@whu.edu.cn.