# TagAlign - Official PyTorch Implementation

<div align="center">
  <img src="figs/pipeline.png" width="100%">
</div>

**TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification**<br>
Qinying Liu, Kecheng Zheng, Wei Wu, Zhan Tong, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen<br>
## 🚀 News

- **[2023/12/25]** The paper and project page are released!
## 💡 Highlights
- 🔥 **3.65% mIoU improvement** on a broad suite of semantic segmentation benchmarks (VOC: PASCAL VOC, Context: PASCAL Context, Object: COCO-Object, IN: ImageNet-S, Stuff: COCO-Stuff, City: Cityscapes, ADE: ADE20K).
- 🔥 A stronger CLIP encoder, trained with the help of a parsing pipeline that is fully automatic and thus enjoys good scalability.
## 👨‍💻 Todo
- Meta-files of TagAlign
- Checkpoints of TagAlign
- Web demo and local demo of TagAlign
- Training and evaluation code for TagAlign
## 🛠️ Usage
### Installation
- apex==0.1
- clip==1.0
- mmcv-full==1.4.7
- mmsegmentation==0.21.1
- torch==1.11.0
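
The original instructions list only the pinned versions above; the commands below are one plausible way to set them up, assuming a CUDA 11.3 / Python 3.8 environment (package sources and wheel indexes may need adjusting for your platform):

```bash
# Optional: isolated environment.
conda create -n tagalign python=3.8 -y
conda activate tagalign

# PyTorch 1.11.0 (pick the build matching your CUDA toolkit; cu113 assumed here).
pip install torch==1.11.0 torchvision==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113

# mmcv-full wheels are keyed to the torch/CUDA pair.
pip install mmcv-full==1.4.7 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.11.0/index.html
pip install mmsegmentation==0.21.1

# OpenAI CLIP (installs as clip==1.0).
pip install git+https://github.com/openai/CLIP.git

# NVIDIA apex (reports version 0.1); plain Python-only build shown.
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir ./apex
```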
### Data Preparation
For the training phase, we use the CC12M dataset. You can obtain CC12M either directly from its source or with the img2dataset tool (a sample invocation is sketched after the file tree). The dataset should adhere to the following file structure:
```
CC12M
├── 000002a0c848e78c7b9d53584e2d36ab0ac14785.jpg
├── 000002ca5e5eab763d95fa8ac0df7a11f24519e5.jpg
├── 00000440ca9fe337152041e26c37f619ec4c55b2.jpg
└── ...
```
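
A minimal img2dataset invocation might look like the following; the metadata file name (`cc12m.tsv`) and its column names are assumptions based on the standard CC12M release, so adapt them to your copy:

```bash
# Fetch CC12M images; img2dataset shards its output, so the downloaded files
# may need to be reorganized into the flat layout shown above.
img2dataset \
  --url_list cc12m.tsv \
  --input_format tsv \
  --url_col url \
  --caption_col caption \
  --output_format files \
  --output_folder CC12M \
  --processes_count 16 \
  --thread_count 64 \
  --image_size 256
```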
In addition, we provide the captions of the images in meta_file(TODO).
For evaluation, refer to GroupViT to properly prepare the datasets. Make sure to update the image directories in `segmentation/configs/_base_/datasets/*.py` as necessary.
### Train and Evaluate
- Modify `tagalign.yml`. We provide the processed tag_file (TODO) and label_file (TODO).

- Train the TagAlign model by running:

  ```
  torchrun --rdzv_endpoint=localhost:6000 --nproc_per_node=auto main.py --cfg configs/tagalign.yml
  ```

- Evaluate the TagAlign model by running the command below:

  ```
  torchrun --rdzv_endpoint=localhost:6000 --nproc_per_node=auto main.py --cfg configs/eval.yml --eval --resume $WEIGHT
  ```

  `$WEIGHT` is the path to the pre-trained checkpoint. We provide our pre-trained weights in weights (TODO).
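
For instance, once the checkpoints are released, an evaluation run might look like this (the path `weights/tagalign.pth` is purely illustrative):

```bash
# Illustrative checkpoint path; point this at wherever you saved the released weights.
WEIGHT=weights/tagalign.pth
torchrun --rdzv_endpoint=localhost:6000 --nproc_per_node=auto main.py --cfg configs/eval.yml --eval --resume $WEIGHT
```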
## ✏️ Citation

If you find our work useful for your research, please consider citing:
```
@article{liu2023tagalign,
  title={TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification},
  author={Liu, Qinying and Zheng, Kecheng and Wu, Wei and Tong, Zhan and Liu, Yu and Chen, Wei and Wang, Zilei and Shen, Yujun},
  journal={arXiv preprint arXiv:2312.14149},
  year={2023}
}
```
## ❤️ Acknowledgements
- TCL: The codebase we built upon. Thanks for their wonderful work.
- CLIP_Surgery: An effective training-free strategy for enhancing the fine-grained localization capabilities of CLIP.