Masked Vision-Language Transformer in Fashion

Dataset Preparation

This project conducts several experiments on the public Fashion-Gen dataset, which contains 260,480 text-image pairs for training and 35,528 text-image pairs for inference. The MVLT model can process raw images and text directly, without any feature-engineering pre-processing of the data. However, the data needs to be reorganized on disk to suit a PyTorch dataloader:

Please download the reorganized dataset from Google Drive (9.72GB).
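To illustrate what "reorganized for a PyTorch dataloader" can look like, here is a minimal map-style dataset sketch. The on-disk layout it assumes (an `images/` folder plus a `captions.json` mapping file names to product descriptions) is hypothetical, and the class name `FashionGenPairs` is my own; the actual archive from Google Drive may be organized differently.

```python
import json
from pathlib import Path


class FashionGenPairs:
    """Map-style dataset of (image_path, caption) pairs.

    Hypothetical layout (the real reorganized archive may differ):
        root/
          images/       # product photos
          captions.json # {"<image file name>": "<product description>", ...}

    Because it implements __len__ and __getitem__, it can be wrapped
    directly by torch.utils.data.DataLoader; no feature engineering is
    done here, matching how MVLT consumes raw images and text.
    """

    def __init__(self, root):
        root = Path(root)
        with open(root / "captions.json", encoding="utf-8") as f:
            captions = json.load(f)
        # Sort for a deterministic sample order across runs.
        self.samples = [(root / "images" / name, text)
                        for name, text in sorted(captions.items())]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, text = self.samples[idx]
        # Return the raw image path and caption; image decoding and
        # tokenization are left to the model-side pipeline.
        return path, text
```

With PyTorch installed, batching is then just `DataLoader(FashionGenPairs("fashion_gen/train"), batch_size=64, shuffle=True)` with a suitable `collate_fn` for the string captions.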

Preliminaries

Install the basic libraries: Python 3.6, PyTorch 1.8, and CUDA 10.1 on Ubuntu 18.04. I did not validate other versions of these libraries or other systems, but I think adapting to them requires only minor changes.

Training

Please note that we only use PVT-Tiny to learn multi-modal features; stronger backbones, such as Swin Transformer or PVTv2, would further improve the representation ability.

Inference

Citation

@article{ji2023masked,
  title={Masked Vision-Language Transformer in Fashion},
  author={Ji, Ge-Peng and Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Sakaridis, Christos and Van Gool, Luc},
  journal={Machine Intelligence Research},
  year={2023}
}

Here are two concurrent works from the Alibaba ICBU team.

@inproceedings{zhuge2021kaleido,
  title={Kaleido-bert: Vision-language pre-training on fashion domain},
  author={Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Jin, Linbo and Chen, Ben and Zhou, Haoming and Qiu, Minghui and Shao, Ling},
  booktitle={CVPR},
  pages={12647--12657},
  year={2021}
}

@inproceedings{10.1145/3397271.3401430,
  author = {Gao, Dehong and Jin, Linbo and Chen, Ben and Qiu, Minghui and Li, Peng and Wei, Yi and Hu, Yi and Wang, Hao},
  title = {FashionBERT: Text and Image Matching with Adaptive Loss for Cross-Modal Retrieval},
  year = {2020},
  publisher = {Association for Computing Machinery},
  booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages = {2251--2260},
  numpages = {10},
  location = {Virtual Event, China},
  series = {SIGIR '20}
}

Acknowledgement

Thanks to the Alibaba ICBU Search Team and Wenhai Wang (PVT) for their technical support.