<div align="center"> <h1>UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers</h1> </div> <p align="center"> <a href="https://github.com/sdc17/UPop/actions/workflows/build.yml"> <img alt="Build" src="https://github.com/sdc17/UPop/actions/workflows/build.yml/badge.svg" /> </a> <a href="https://proceedings.mlr.press/v202/shi23e/shi23e.pdf"> <img alt="Paper" src="https://img.shields.io/badge/paper-link-blue?logo=quicklook" /> </a> <a href="https://arxiv.org/abs/2301.13741"> <img alt="Paper" src="https://img.shields.io/badge/arXiv-2301.13741-B31B1B?logo=arxiv" /> </a> <a href="https://github.com/sdc17/UPop"> <img alt="Code" src="https://img.shields.io/badge/code-link-181717?logo=github" /> </a> <a href="https://dachuanshi.com/UPop-Project/"> <img alt="Webiste" src="https://img.shields.io/badge/website-link-4285F4?logo=googleearth" /> </a> <a href="https://dachuanshi.medium.com/compressing-multimodal-and-unimodal-transformers-via-upop-466c11680ac0"> <img alt="Blog" src="https://img.shields.io/badge/blog-English-FFA500?logo=rss" /> </a> <a href="https://zhuanlan.zhihu.com/p/640634482"> <img alt="Blog" src="https://img.shields.io/badge/blog-δΈ­ζ–‡-FFA500?logo=rss" /> </a><br> <a href="https://pytorch.org/get-started/previous-versions/"> <img alt="Pytorch" src="https://img.shields.io/badge/pytorch-v1.11.0-EE4C2C?logo=pytorch" /> </a> <a href="https://www.python.org/downloads/release/python-3811/"> <img alt="Pytorch" src="https://img.shields.io/badge/python-v3.8.11-3776AB?logo=python" /> </a> <a href="https://github.com/sdc17/UPop/blob/main/LICENSE"> <img alt="License" src="https://img.shields.io/badge/license-BSD 3--Clause-F96702?logo=cloudera&logoColor=c0c0c0" /> </a> </p> <!-- <img src="UPop.png" width="800"> -->

🧐 A Quick Look

πŸ₯³ What's New

πŸƒ Installation

The code is tested with PyTorch 1.11.0, CUDA 11.3.1, and Python 3.8.13. Dependencies can be installed with:

conda env create -f environment.yml

The status of installing dependencies is tracked by the build workflow (see the Build badge above).
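
To confirm that the created environment matches the tested configuration, a quick sanity check such as the following can be run inside the activated environment (a minimal verification sketch only, not part of the UPop codebase):

import platform
import torch

print("Python:", platform.python_version())         # expected: 3.8.x
print("PyTorch:", torch.__version__)                 # expected: 1.11.0
print("CUDA (build):", torch.version.cuda)           # expected: 11.3
print("GPU available:", torch.cuda.is_available())   # True if a compatible GPU and driver are present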

<!-- ### Supported Tasks, Models, and Datasets Type | Supported Tasks | Supported Models | Supported Datasets | --- | --- | :---: | :---: Multi-modal | [Visual Reasoning](https://github.com/sdc17/UPop#visual-reasoning-on-the-nlvr2-dataset) | [BLIP](https://github.com/salesforce/BLIP) ([instructions](https://github.com/sdc17/UPop#visual-reasoning-on-the-nlvr2-dataset)) | [NLVR2](https://lil.nlp.cornell.edu/nlvr/) Multi-modal |[Image Caption](https://github.com/sdc17/UPop#image-caption-on-the-coco-caption-dataset) | [BLIP](https://github.com/salesforce/BLIP) ([instructions](https://github.com/sdc17/UPop#image-caption-on-the-coco-caption-dataset)) | [COCO Caption](https://cocodataset.org/#home) Multi-modal |[Visual Question Answer](https://github.com/sdc17/UPop#visual-question-answer-on-the-vqav2-dataset) | [BLIP](https://github.com/salesforce/BLIP) ([instructions](https://github.com/sdc17/UPop#visual-question-answer-on-the-vqav2-dataset)) | [VQAv2](https://visualqa.org/) Multi-modal |[Image-Text Retrieval](https://github.com/sdc17/UPop#image-text-and-text-image-retrieval-on-the-coco-dataset) | [CLIP](https://github.com/openai/CLIP) ([instructions](https://github.com/sdc17/UPop#image-text-and-text-image-retrieval-on-the-coco-dataset-with-clip)), [BLIP](https://github.com/salesforce/BLIP) ([instructions](https://github.com/sdc17/UPop#image-text-and-text-image-retrieval-on-the-coco-dataset)) | [COCO](https://cocodataset.org/#home), [Flickr30k](https://shannon.cs.illinois.edu/DenotationGraph/) Multi-modal |[Text-Image Retrieval](https://github.com/sdc17/UPop#image-text-and-text-image-retrieval-on-the-coco-dataset) | [CLIP](https://github.com/openai/CLIP) ([instructions](https://github.com/sdc17/UPop#image-text-and-text-image-retrieval-on-the-flickr30k-dataset-with-clip)), [BLIP](https://github.com/salesforce/BLIP) ([instructions](https://github.com/sdc17/UPop#image-text-and-text-image-retrieval-on-the-flickr30k-dataset)) | [COCO](https://cocodataset.org/#home), [Flickr30k](https://shannon.cs.illinois.edu/DenotationGraph/) Uni-modal |[Image Classification](https://github.com/sdc17/UPop#image-classification-on-the-imagenet-dataset) | [DeiT](https://github.com/facebookresearch/deit) ([instructions](https://github.com/sdc17/UPop#image-classification-on-the-imagenet-dataset)) | [ImageNet](https://www.image-net.org/) Uni-modal |[Image Segmentation](https://github.com/sdc17/UPop#image-segmentation-on-the-ade20k-dataset) | [Segmenter](https://github.com/rstrudel/segmenter) ([instructions](https://github.com/sdc17/UPop#image-segmentation-on-the-ade20k-dataset)) | [Ade20k](https://groups.csail.mit.edu/vision/datasets/ADE20K/) -->

πŸš€ Visual Reasoning on the NLVR2 Dataset

πŸš€ Image Caption on the COCO Caption Dataset

πŸš€ Visual Question Answer on the VQAv2 Dataset

πŸš€ Image-Text and Text-Image Retrieval on the COCO Dataset

πŸš€ Image-Text and Text-Image Retrieval on the Flickr30K Dataset

πŸš€ Image-Text and Text-Image Retrieval on the COCO Dataset with CLIP

πŸš€ Image-Text and Text-Image Retrieval on the Flickr30K Dataset with CLIP

πŸš€ Image Classification on the ImageNet Dataset

πŸš€ Image Segmentation on the Ade20k Dataset

πŸ“‘ Common Issues

1. Evaluation with a single GPU

2. Compression with a single GPU

3. Out of memory during evaluation

4. Out of memory during compression

🌲 Expected Folder Structures

β”œβ”€β”€ annotation
β”‚Β Β  β”œβ”€β”€ answer_list.json
β”‚Β Β  β”œβ”€β”€ coco_gt
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ coco_karpathy_test_gt.json
β”‚Β Β  β”‚Β Β  └── coco_karpathy_val_gt.json
β”‚Β Β  β”œβ”€β”€ ...
β”œβ”€β”€ clip                                               
β”œβ”€β”€ compress_caption.py       
β”œβ”€β”€ compress_deit.py        
β”œβ”€β”€ compress_nlvr.py                  
β”œβ”€β”€ compress ...    
β”œβ”€β”€ configs                                             
β”œβ”€β”€ data                                        
β”œβ”€β”€ datasets
β”‚Β Β  └── vision
β”‚Β Β      β”œβ”€β”€ coco
β”‚Β Β      β”œβ”€β”€ flickr
β”‚Β Β      β”œβ”€β”€ NLVR2     
β”‚Β Β      β”œβ”€β”€ ...                                                                              
β”œβ”€β”€ deit   
β”œβ”€β”€ log                                     
β”œβ”€β”€ models            
β”œβ”€β”€ output                                    
β”œβ”€β”€ pretrained
β”‚   β”œβ”€β”€ bert-base-uncased
β”‚   β”œβ”€β”€ clip_large_retrieval_coco.pth
β”‚   β”œβ”€β”€ clip_large_retrieval_flickr.pth
β”‚   β”œβ”€β”€ ...       
β”œβ”€β”€ segm                                                                                   
β”œβ”€β”€ transform                                                                           
└── utils.py                                

πŸ’¬ Acknowledgments

This code is built upon <a href="https://github.com/salesforce/BLIP">BLIP</a>, <a href="https://github.com/openai/CLIP">CLIP</a>, <a href="https://github.com/facebookresearch/deit">DeiT</a>, <a href="https://github.com/rstrudel/segmenter">Segmenter</a>, and <a href="https://github.com/huggingface/pytorch-image-models/tree/main/timm">timm</a>. Thanks to the authors of these awesome open-source projects!

✨ Citation

If you find our work or this code useful, please consider citing the corresponding paper:

@InProceedings{pmlr-v202-shi23e,
  title = {{UP}op: Unified and Progressive Pruning for Compressing Vision-Language Transformers},
  author = {Shi, Dachuan and Tao, Chaofan and Jin, Ying and Yang, Zhendong and Yuan, Chun and Wang, Jiaqi},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {31292--31311},
  year = {2023},
  volume = {202},
  publisher = {PMLR}
}