PLIP

PLIP is a novel Language-Image Pre-training framework for generic person representation learning that benefits a range of downstream person-centric tasks.

We also present a large-scale person dataset named SYNTH-PEDES to verify its effectiveness, for which the Stylish Pedestrian Attributes-union Captioning (SPAC) method is proposed to synthesize diverse textual descriptions.

Experiments show that our model not only significantly improves existing methods on downstream tasks, but also performs strongly in few-shot and domain generalization settings. More details can be found in our paper PLIP: Language-Image Pre-training for Person Representation Learning.

<div align="center"><img src="assets/abstract.png" width="600"></div>

SYNTH-PEDES

SYNTH-PEDES is by far the largest person dataset with textual descriptions, built without any human annotation effort. Every person image has 2 or 3 different textual descriptions and 6 attribute annotations. The dataset is released at Baidu Yun.

Note that SYNTH-PEDES may only be used for research; any commercial usage is forbidden.

Below is a comparison of SYNTH-PEDES with other popular datasets.

<div align="center"><img src="assets/SYNTH-PEDES.png" width="900"></div>

These are some examples of our SYNTH-PEDES dataset.

<div align="center"><img src="assets/examples.png" width="900"></div>

Annotation format:

{
    "id": 7,
    "file_path": "Part1/7/1.jpg",
    "attributes": [
        "man,black hair,black shirt,pink shorts,black shoes,unknown"
    ],
    "captions": [
        "A man in his mid-twenties with short black hair is wearing a black t-shirt over light pink trousers. He is also wearing black shoes.",
        "The man with short black hair is wearing a black shirt and salmon pink shorts. He is also wearing black shoes."
    ],
    "prompt_caption": [
        "A man with black hair is wearing a black shirt with pink shorts and a pair of black shoes."
    ]
}
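
For reference, a minimal loading sketch is shown below. It assumes the released synthpedes_dataset.json is a list of records in the format above; the data root path is a placeholder to adapt to your own layout.

```python
import json
import os

# Minimal sketch: iterate over SYNTH-PEDES annotations, assuming the released
# JSON file is a list of records in the format shown above.
DATA_ROOT = "data/SYNTH-PEDES"  # placeholder; adjust to your layout

with open(os.path.join(DATA_ROOT, "synthpedes_dataset.json")) as f:
    records = json.load(f)

for rec in records[:3]:
    image_path = os.path.join(DATA_ROOT, rec["file_path"])
    attributes = rec["attributes"][0].split(",")  # e.g. ["man", "black hair", ...]
    captions = rec["captions"] + rec.get("prompt_caption", [])
    print(rec["id"], image_path, attributes, len(captions), "captions")
```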

Models

We utilize ResNet50 and BERT as our encoders. After pre-training, we fine-tune and evaluate the performance on three downstream tasks. The checkpoints have been released at Baidu Yun and Google Drive.
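
For illustration only, the sketch below builds the two encoders with torchvision and HuggingFace Transformers and extracts one feature per modality. The input resolution, the removal of the classification head, and the omitted projection layers are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torchvision
from transformers import BertModel, BertTokenizer

# Illustrative dual-encoder sketch (ResNet-50 image encoder + BERT text encoder).
# The projection heads that map both modalities into a shared space are omitted.
image_encoder = torchvision.models.resnet50(weights=None)
image_encoder.fc = nn.Identity()  # expose the 2048-d pooled feature
text_encoder = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

image = torch.randn(1, 3, 384, 128)  # assumed person-crop resolution
tokens = tokenizer("A man with black hair is wearing a black shirt.",
                   return_tensors="pt")

with torch.no_grad():
    img_feat = image_encoder(image)                             # (1, 2048)
    txt_feat = text_encoder(**tokens).last_hidden_state[:, 0]   # (1, 768), [CLS]
print(img_feat.shape, txt_feat.shape)
```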

CUHK-PEDES dataset (Text Re-ID R@1/R@10)

| Pre-train | CMPM/C | SSAN | LGUR |
|---|---|---|---|
| IN sup | 54.81/83.22 | 61.37/86.73 | 64.21/87.93 |
| IN unsup | 55.34/83.76 | 61.97/86.63 | 65.33/88.47 |
| CLIP | 55.67/83.82 | 62.09/86.89 | 64.70/88.76 |
| LUP | 57.21/84.68 | 63.91/88.36 | 65.42/89.36 |
| LUP-NL | 57.35/84.77 | 63.71/87.46 | 64.68/88.69 |
| PLIP (ours) | 69.23/91.16 | 64.91/88.39 | 67.22/89.49 |

ICFG-PEDES dataset (Text Re-ID R@1/R@10)

| Pre-train | CMPM/C | SSAN | LGUR |
|---|---|---|---|
| IN sup | 47.61/75.48 | 54.23/79.53 | 57.42/81.45 |
| IN unsup | 48.34/75.66 | 55.27/79.64 | 59.90/82.94 |
| CLIP | 48.12/75.51 | 53.58/78.96 | 58.35/82.02 |
| LUP | 50.12/76.23 | 56.51/80.41 | 60.33/83.06 |
| LUP-NL | 49.64/76.15 | 55.59/79.78 | 60.25/82.84 |
| PLIP (ours) | 64.25/86.32 | 60.12/82.84 | 62.27/83.96 |

Market1501 & DukeMTMC (Image Re-ID mAP/Rank-1)

| Methods | Market1501 | DukeMTMC |
|---|---|---|
| BOT | 85.9/94.5 | 76.4/86.4 |
| BDB | 86.7/95.3 | 76.0/89.0 |
| MGN | 87.5/95.1 | 79.4/89.0 |
| ABDNet | 88.3/95.6 | 78.6/89.0 |
| PLIP+BOT | 88.0/95.1 | 77.0/86.5 |
| PLIP+BDB | 88.4/95.7 | 78.2/89.8 |
| PLIP+MGN | 90.6/96.3 | 81.7/90.3 |
| PLIP+ABDNet | 91.2/96.7 | 81.6/90.9 |

PETA & PA-100K & RAP (Person Attribute Recognition mA/F1)

| Methods | PETA | PA-100K | RAP |
|---|---|---|---|
| DeepMAR | 80.14/83.56 | 78.28/84.32 | 76.81/78.94 |
| Rethink | 83.96/86.35 | 80.21/87.40 | 79.27/79.95 |
| VTB | 84.12/86.63 | 81.02/87.31 | 81.43/80.63 |
| Label2Label | 84.08/86.57 | 82.24/87.08 | 81.82/80.93 |
| PLIP+DeepMAR | 82.46/85.87 | 80.33/87.24 | 78.96/80.12 |
| PLIP+Rethink | 85.56/87.63 | 82.09/88.12 | 81.87/81.53 |
| PLIP+VTB | 86.03/88.14 | 83.24/88.57 | 83.64/81.78 |
| PLIP+Label2Label | 86.12/88.08 | 84.36/88.63 | 83.77/81.49 |

Usage

Install Requirements

We use 4 RTX 3090 24 GB GPUs for training and evaluation.

Create the conda environment:

conda create --name PLIP --file requirements.txt
conda activate PLIP

Dataset Preparation

Download the CUHK-PEDES dataset from here and the ICFG-PEDES dataset from here.

Organize them in the data folder as follows:

|-- data/
|   |-- <CUHK-PEDES>/
|       |-- imgs
|            |-- cam_a
|            |-- cam_b
|            |-- ...
|       |-- reid_raw.json
|
|   |-- <ICFG-PEDES>/
|       |-- imgs
|            |-- test
|            |-- train 
|       |-- ICFG_PEDES.json
|
|   |-- <SYNTH-PEDES>/
|       |-- Part1
|       |-- ...
|       |-- Part11
|       |-- synthpedes_dataset.json
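
A quick way to sanity-check the layout is the snippet below; the paths simply mirror the tree above and should be adjusted if your folder names differ.

```python
import os

# Check that the datasets are organized as in the tree above.
expected = [
    "data/CUHK-PEDES/imgs",
    "data/CUHK-PEDES/reid_raw.json",
    "data/ICFG-PEDES/imgs",
    "data/ICFG-PEDES/ICFG_PEDES.json",
    "data/SYNTH-PEDES/Part1",
    "data/SYNTH-PEDES/synthpedes_dataset.json",
]
for path in expected:
    status = "OK  " if os.path.exists(path) else "MISS"
    print(status, path)
```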

Zero-shot Inference

Our pre-trained model can be directly transferred to downstream tasks, especially text-based Re-ID.

  1. Run the following script to generate the train/test/valid json files:
python dataset_split.py
  2. Then evaluate by running:
python zs_inference.py

Fine-tuning and Inference

Almost all existing downstream person-centric methods can be improved by replacing the backbone with our pre-trained model. Taking CMPM/C as an example:

  1. Go to the CMPM/C root:
cd Downstreams/CMPM-C
  2. Run the following to train and test. Note that you can modify the code yourself for single-GPU training:
python dataset_split.py
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py
python test.py
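
On newer PyTorch versions where torch.distributed.launch is deprecated, `CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train.py` should be the equivalent launch command, assuming train.py reads the local rank from the LOCAL_RANK environment variable.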

Evaluate on Other Methods and Tasks

By simply replacing the visual backbone with our pre-trained model, almost all existing methods on downstream tasks achieve significant improvements. For example, you can try the following repositories (a rough sketch of the backbone swap is given after the list):

Text-based Re-ID: SSAN, LGUR

Image-based Re-ID: BOT, MGN, ABD-Net

Person Attribute Recognition: Rethink, Label2Label, VTB
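
As a rough sketch of what the backbone replacement might look like, the snippet below loads a PLIP checkpoint into a torchvision ResNet-50 before handing it to a downstream method. The checkpoint filename and the key prefixes being stripped are assumptions; adapt them to the file you actually download and to the backbone definition of the specific repository.

```python
import torch
import torchvision

# Hypothetical sketch: initialize a downstream method's ResNet-50 backbone with
# PLIP pre-trained weights. Checkpoint path and key prefixes are assumptions.
backbone = torchvision.models.resnet50(weights=None)

state = torch.load("checkpoints/plip_resnet50.pth", map_location="cpu")
state = state.get("state_dict", state)  # some checkpoints nest weights (assumed)
state = {k.replace("module.", "").replace("visual.", ""): v for k, v in state.items()}

missing, unexpected = backbone.load_state_dict(state, strict=False)
print(len(missing), "missing keys,", len(unexpected), "unexpected keys")
# `backbone` can now replace the ImageNet-initialized backbone in BOT, MGN,
# SSAN, etc., following each repository's own configuration.
```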

Reference

If you use PLIP in your research, please cite it with the following BibTeX entry:

@inproceedings{zuo2024plip,
  title={PLIP: Language-Image Pre-training for Person Representation Learning},
  author={Jialong Zuo and Jiahao Hong and Feng Zhang and Changqian Yu and Hanyu Zhou and Changxin Gao and Nong Sang and Jingdong Wang},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}