MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

<p align="center"> <a href="https://arxiv.org/pdf/2403.02991.pdf" target="_blank">[Paper]</a> <a href="https://arxiv.org/abs/2403.02991" target="_blank">[ArXiv]</a> <a href="https://github.com/double125/MADTP" target="_blank">[Code]</a> <img src="MADTP.png" width="800"> </p>

Official implementation of MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer.

What's New 🄳

Installation

The code has been tested with PyTorch 1.11.0, CUDA 11.3.1, and Python 3.8.13. The dependencies can be installed with:

conda env create -f environment.yml
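Since the dependency pins are strict, a quick interpreter check before creating the environment can save a failed build. This is an optional sketch, not part of the repository; the 3.8 requirement comes from the tested configuration above:

```python
import sys

# MADTP was tested on Python 3.8.13; other versions may work but are untested.
TESTED = (3, 8)

def version_matches(tested=TESTED, actual=None):
    """Return True when the interpreter's major.minor equals the tested pair."""
    actual = tuple(actual or sys.version_info[:2])
    return actual == tuple(tested)

if not version_matches():
    print(f"Warning: tested on Python {TESTED[0]}.{TESTED[1]}, "
          f"running {sys.version_info.major}.{sys.version_info.minor}")
```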

Supported Tasks, Models, and Datasets

| Type | Supported Tasks | Supported Models | Supported Datasets |
| --- | --- | --- | --- |
| Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2 |
| Multi-modal | Image Caption | BLIP (instructions) | COCO Caption |
| Multi-modal | Visual Question Answer | BLIP (instructions) | VQAv2 |
| Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
| Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |

Visual Reasoning on the NLVR2 Dataset

Image Caption on the COCO Caption Dataset

<!-- * Resources Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script --- | :---: | :---: | :---: | :---: | :---: 0.5 | <a href="https://drive.google.com/uc?export=download&id=1qW_0DpQsDc6u9g3fSfTI4g_VXYsMA5s8">Download</a> | [Link](./scripts/compress_caption_coco_p0.5.sh) | <a href="*****r">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_caption_coco_p0.5_compressed.sh) 0.75 | <a href="https://drive.google.com/uc?export=download&id=1qW_0DpQsDc6u9g3fSfTI4g_VXYsMA5s8">Download</a> | [Link](./scripts/compress_caption_coco_p0.75.sh)| <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_caption_coco_p0.75_compressed.sh) -->

Visual Question Answer on the VQAv2 Dataset

<!-- * Resources Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script --- | :---: | :---: | :---: | :---: | :---: 0.5 | <a href="https://drive.google.com/uc?export=download&id=18Ihg2NA_puj3_92uVszqonSusLFgmID-">Download</a> | [Link](./scripts/compress_vqa_vqa2_p0.5.sh) | <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_vqa_vqa2_p0.5_compressed.sh) 0.75 | <a href="https://drive.google.com/uc?export=download&id=18Ihg2NA_puj3_92uVszqonSusLFgmID-">Download</a> | [Link](./scripts/compress_vqa_vqa2_p0.75.sh)| <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_vqa_vqa2_p0.75_compressed.sh) -->

Image-Text and Text-Image Retrieval on the COCO Dataset

<!-- * Resources Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script --- | :---: | :---: | :---: | :---: | :---: 0.5 | <a href="https://drive.google.com/uc?export=download&id=19nxvphpnIH2kbV4unL0MDAM_2zlBnruq">Download</a> | [Link](./scripts/compress_retrieval_coco_p0.5.sh) | <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_retrieval_coco_p0.5_compressed.sh) 0.75 | <a href="https://drive.google.com/uc?export=download&id=19nxvphpnIH2kbV4unL0MDAM_2zlBnruq">Download</a> | [Link](./scripts/compress_retrieval_coco_p0.75.sh)| <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_retrieval_coco_p0.75_compressed.sh) -->

Image-Text and Text-Image Retrieval on the Flickr30K Dataset

<!-- * Resources Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script --- | :---: | :---: | :---: | :---: | :---: 0.5 | <a href="https://drive.google.com/uc?export=download&id=1mrd7unZMFMC77Qb_3DAx7MhpZJv4Ptbw">Download</a> | [Link](./scripts/compress_retrieval_flickr_p0.5.sh) | <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_retrieval_flickr_p0.5_compressed.sh) 0.75 | <a href="https://drive.google.com/uc?export=download&id=1mrd7unZMFMC77Qb_3DAx7MhpZJv4Ptbw">Download</a> | [Link](./scripts/compress_retrieval_flickr_p0.75.sh)| <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_retrieval_flickr_p0.75_compressed.sh) -->

Image-Text and Text-Image Retrieval on the COCO Dataset with CLIP

<!-- * Resources Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script --- | :---: | :---: | :---: | :---: | :---: 0.5 | <a href="https://drive.google.com/uc?export=download&id=10p1oPdiMUqo0MfPul5hCb_h9mCaNCh6q">Download</a> | [Link](./scripts/compress_retrieval_coco_clip_p0.5.sh) | <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_retrieval_coco_clip_p0.5_compressed.sh) 0.75 | <a href="https://drive.google.com/uc?export=download&id=10p1oPdiMUqo0MfPul5hCb_h9mCaNCh6q">Download</a> | [Link](./scripts/compress_retrieval_coco_clip_p0.75.sh)| <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_retrieval_coco_clip_p0.75_compressed.sh) -->

Image-Text and Text-Image Retrieval on the Flickr30K Dataset with CLIP

<!-- * Resources Reduce Ratio | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script --- | :---: | :---: | :---: | :---: | :---: 0.5 | <a href="https://drive.google.com/uc?export=download&id=1-MZP6xQRnmLZr1_pqUK4TvOA8Ic7XCoI">Download</a> | [Link](./scripts/compress_retrieval_flickr_clip_p0.5.sh) | <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_retrieval_flickr_clip_p0.5_compressed.sh) 0.75 | <a href="https://drive.google.com/uc?export=download&id=1-MZP6xQRnmLZr1_pqUK4TvOA8Ic7XCoI">Download</a> | [Link](./scripts/compress_retrieval_flickr_clip_p0.75.sh)| <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_retrieval_flickr_clip_p0.75_compressed.sh) -->

Common Issues

1. Evaluation with single GPU

2. Compress with single GPU

3. Other issues

Please post them on the GitHub Issues page.

Expected Folder Structures

├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   ├── ...
├── clip
├── compress_caption_dtp.py
├── compress_nlvr_dtp.py
├── compress ...
├── configs
├── data
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2
│       ├── ...
├── log
├── models
├── output
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   ├── ...
├── transform
└── utils.py
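Before launching a compression or evaluation script, it can help to confirm that the layout above is in place. This is a minimal stdlib sketch, not part of the repository; the listed paths are a representative subset of the tree above, and `REQUIRED` can be extended as needed:

```python
from pathlib import Path

# A few representative paths from the expected folder structure above.
REQUIRED = [
    "annotation/answer_list.json",
    "annotation/coco_gt/coco_karpathy_test_gt.json",
    "datasets/vision/coco",
    "pretrained/bert-base-uncased",
]

def missing_paths(root="."):
    """Return the REQUIRED entries that do not exist under root."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Folder structure looks complete.")
```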

Acknowledgments

This code is built upon <a href="https://github.com/salesforce/BLIP">BLIP</a>, <a href="https://github.com/openai/CLIP">CLIP</a>, <a href="https://github.com/sdc17/UPop">UPop</a>, and <a href="https://github.com/huggingface/pytorch-image-models/tree/main/timm">timm</a>. We thank the original authors for their open-source work.

Citation

If you find this work useful, please consider citing the corresponding paper:

@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}