Masked Vision-Language Transformer in Fashion
- Authors: Ge-Peng Ji^, Mingchen Zhuge^, Dehong Gao, Deng-Ping Fan#, Christos Sakaridis, and Luc Van Gool
- Accepted by Machine Intelligence Research 2023
- Link: arXiv Paper
- This project is still a work in progress, and we invite everyone to help make it more accessible and useful. If you have any questions, please feel free to e-mail us (gepengai.ji@gmail.com), report them in an issue, or push a PR.
- Your star is our motivation, let's enjoy it!
- Welcome to our WeChat Group (QR Code)
Dataset Preparation
This project conducts several experiments on the public Fashion-Gen dataset, which contains 260,480 text-image pairs for training and 35,528 text-image pairs for inference. The MVLT model directly processes the original images and text without any feature-engineering pre-processing. However, the data must be reorganized into a storage form that suits the torch dataloader.
Please download the reorganized dataset from Google Drive (9.72GB).
Preliminaries
Install the basic libraries: Python 3.6, PyTorch 1.8, and CUDA 10.1 on Ubuntu 18.04. We have not validated other versions of the libraries or other systems, but adapting to them should require only minor changes.
- Create the environment via
conda create -n MVLT python=3.6
- Install PyTorch via
~/miniconda3/envs/MVLT/bin/python3.6 -m pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
- Install the auxiliary libraries by running
~/miniconda3/envs/MVLT/bin/python3.6 -m pip install -r requirements.txt
- The checkpoint of PVT-tiny for pre-training is prepared at ./preweights/pvt_v1/pvt_tiny.pth. You can also try other PVT-v1 and PVT-v2 variants (download link) to further boost the performance if enough GPU memory is available.
- Download the checkpoint from Google Drive (689.4MB) and move it into ./checkpoints/. Note that this tar.gz file contains two weights: checkpoint_retrieval.pth and checkpoint_recognition.pth.
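Before launching any script, it can help to confirm that the weight files listed above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_weights` is hypothetical, and it assumes the two downstream weights sit directly inside ./checkpoints/ once the tar.gz archive is extracted.

```python
from pathlib import Path
from typing import List

# Paths named in this README, relative to the repository root.
# Assumption: the two downstream weights sit directly in ./checkpoints/
# after the tar.gz archive has been extracted.
EXPECTED_WEIGHTS = (
    "preweights/pvt_v1/pvt_tiny.pth",
    "checkpoints/checkpoint_retrieval.pth",
    "checkpoints/checkpoint_recognition.pth",
)


def missing_weights(repo_root: str = ".") -> List[str]:
    """Return the expected weight files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_WEIGHTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weight files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected weight files were found.")
```

Run it from the repository root; an empty result means all three files were found at the assumed locations.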
Training
Please note that we only use PVT-Tiny to learn multi-modal features; a stronger backbone, such as SwinTransformer or PVTv2, would likely further improve the representation ability.
- Please revise your data path (--data-path parameter) in ./scripts_dws/dws_mvlt_exp21.sh or ./scripts_dws/dws_mvlt_ft_exp48.sh
- Just run bash ./scripts_dws/dws_mvlt_exp21.sh for pre-training
- Just run bash ./scripts_dws/dws_mvlt_ft_exp48.sh for fine-tuning
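Editing the data path by hand is usually simplest, but the change can also be scripted. The sketch below is illustrative only: `set_data_path` is a hypothetical helper, and it assumes the script passes the value as `--data-path <value>` or `--data-path=<value>` with an unquoted path that contains no spaces. Check the actual script contents before relying on it.

```python
import re

# Matches "--data-path" followed by whitespace or "=", then the value.
# Assumption: the value is a single unquoted token without spaces.
_DATA_PATH_PATTERN = re.compile(r"(--data-path(?:\s+|=))(\S+)")


def set_data_path(script_text: str, new_path: str) -> str:
    """Return script_text with every --data-path value replaced by new_path."""
    # A function is used as the replacement so that backslashes in
    # new_path are inserted literally rather than parsed as escapes.
    return _DATA_PATH_PATTERN.sub(lambda m: m.group(1) + new_path, script_text)
```

To apply it, read the script text (for example from ./scripts_dws/dws_mvlt_exp21.sh), pass it through `set_data_path`, and write the result back.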
Inference
- Downstream retrieval tasks
  - We report zero-shot retrieval performance, so the trained weights can be used directly for the retrieval tasks without any fine-tuning.
  - Just run bash downstream_retrieval.sh to get the prediction results of Image-Text Retrieval (ITR) and Text-Image Retrieval (TIR).
    - Text-Image Retrieval (TIR): acc@1: 0.346, acc@5: 0.780, acc@10: 0.895
    - Image-Text Retrieval (ITR): acc@1: 0.331, acc@5: 0.772, acc@10: 0.911
- Downstream recognition tasks
  - This task requires fine-tuning because our pre-trained model is not equipped with a classification head.
  - Just run bash downstream_recognition.sh to get the prediction results of Main-Category Recognition (M-CR) and Sub-Category Recognition (S-CR).
    - Main-category recognition (M-CR): accuracy (0.9825996064928677) macro_f1 (0.8954719842489123)
    - Sub-category recognition (S-CR): accuracy (0.9356554353172651) macro_f1 (0.8285927576055913)
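For readers unfamiliar with the metrics reported above, here is a minimal sketch of how acc@k and macro F1 are commonly defined. It is not taken from this repository: the function names are hypothetical, and the evaluation scripts may differ in details such as the candidate set per query or how ties and absent classes are handled.

```python
from typing import Sequence


def acc_at_k(scores: Sequence[Sequence[float]], targets: Sequence[int], k: int) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    scores[i][j] is the similarity between query i and candidate j;
    targets[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over classes seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined here as 0 when the denominator is 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because macro F1 averages classes without weighting them by frequency, it can sit well below accuracy when rare categories are harder to recognize.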
Citation
@article{ji2023masked,
title={Masked Vision-Language Transformer in Fashion},
author={Ji, Ge-Peng and Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Sakaridis, Christos and Van Gool, Luc},
journal={Machine Intelligence Research},
year={2023}
}
Here are two concurrent works from Alibaba ICBU Team.
@inproceedings{zhuge2021kaleido,
title={Kaleido-bert: Vision-language pre-training on fashion domain},
author={Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Jin, Linbo and Chen, Ben and Zhou, Haoming and Qiu, Minghui and Shao, Ling},
booktitle={CVPR},
pages={12647--12657},
year={2021}
}
@inproceedings{10.1145/3397271.3401430,
author = {Gao, Dehong and Jin, Linbo and Chen, Ben and Qiu, Minghui and Li, Peng and Wei, Yi and Hu, Yi and Wang, Hao},
title = {FashionBERT: Text and Image Matching with Adaptive Loss for Cross-Modal Retrieval},
year = {2020},
publisher = {Association for Computing Machinery},
booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2251--2260},
numpages = {10},
location = {Virtual Event, China},
series = {SIGIR '20}
}
Acknowledgement
Thanks to the Alibaba ICBU Search Team and Wenhai Wang (PVT) for their technical support.