R-MeeTo: Rebuild Your Faster Vision Mamba in Minutes

The official implementation of "Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training".

Mingjia Shi<sup>*</sup>, Yuhao Zhou<sup>*</sup>, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Tanmay Rajpurohit, Ramakrishna Vedantam, Wangbo Zhao<sup>†</sup>, Kai Wang<sup>†</sup>, Yang You

(*: equal contribution, †: corresponding authors)

🌟🌟 Mingjia, Ruiji, Zekai, and Zhiyuan are looking for Ph.D. positions; many thanks for considering their applications.

[Paper](https://arxiv.org/abs/2412.12496) | Project Page

TL;DR

The answer to all of this is key knowledge loss.

https://github.com/user-attachments/assets/4239a2df-85cb-4721-ba0c-a39832832bb8

Key knowledge loss is the main cause of the heavier performance drop observed after applying token reduction to Mamba. R-MeeTo is thus proposed to quickly rebuild this key knowledge and thereby recover performance.

R-MeeTo is simple and effective, with only two main modules: merging and re-training. Merging reduces knowledge loss, while re-training quickly recovers the knowledge structure of Mamba.
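For intuition, the merging step can be sketched as ToMe-style bipartite soft matching: split the tokens into two sets, pair each token with its most similar counterpart, and average the most similar pairs. The snippet below is a minimal illustration under that assumption, not the exact implementation in this repo.

```python
# Minimal sketch of similarity-based token merging (illustrative, not the repo's exact code).
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (B, N, C) token features; r: number of tokens removed by merging."""
    a, b = x[:, ::2], x[:, 1::2]                                   # alternating bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    score, match = sim.max(dim=-1)                                 # best partner in b for each token in a
    order = score.argsort(dim=-1, descending=True)                 # most similar pairs are merged first
    src, dst = order[..., :r], order[..., r:]                      # r tokens to merge, the rest are kept
    c = x.size(-1)
    b = b.scatter_reduce(                                          # average merged tokens into their partners
        1,
        match.gather(1, src).unsqueeze(-1).expand(-1, -1, c),
        a.gather(1, src.unsqueeze(-1).expand(-1, -1, c)),
        reduce="mean",
    )
    a_kept = a.gather(1, dst.unsqueeze(-1).expand(-1, -1, c))
    return torch.cat([a_kept, b], dim=1)                           # (B, N - r, C)

tokens = torch.randn(2, 196, 192)                                  # e.g. 14x14 patches at Vim-Ti width
print(merge_tokens(tokens, r=32).shape)                            # torch.Size([2, 164, 192])
```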

https://github.com/user-attachments/assets/b276534e-394c-473b-8420-11ad168796cf

Overview

<p align="center"> <img src="./fig/R_MeeTo.png" width="100%" height="45%" class="center"> </p>

Figure: Sketch of the analysis: Mamba is sensitive to token reduction. Experiments on i. token reduction are conducted with DeiT-S (Transformer) and Vim-S (Mamba) on ImageNet-1K. The reduction ratios in the experiment on ii. shuffled tokens are 0.14 for Vim-Ti and 0.31 for Vim-S/Vim-B. The shuffle strategy is an odd-even shuffle: [0,1,2,3] → [0,2] + [1,3] → [0,2,1,3]. The empirical results for I(X;Y), the mutual information between the inputs X and outputs Y of the Attention block and the SSM, are measured by MINE on the middle layers of DeiT-S and Vim-S (the 7th of 12 layers and the 14th of 24 layers, respectively). See this implementation repo of MINE.
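For clarity, the odd-even shuffle in the caption simply places the even-indexed tokens before the odd-indexed ones; a short illustration:

```python
# Odd-even shuffle from the figure caption: even-indexed tokens first, then odd-indexed.
import torch

tokens = torch.arange(4)                           # [0, 1, 2, 3]
shuffled = torch.cat([tokens[::2], tokens[1::2]])  # [0, 2] + [1, 3] -> [0, 2, 1, 3]
print(shuffled.tolist())                           # [0, 2, 1, 3]
```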

Abstract: Vision Mamba (e.g., Vim) has been successfully integrated into computer vision, and token reduction has yielded promising outcomes in Vision Transformers (ViTs). However, token reduction performs less effectively on Vision Mamba than on ViTs. Pruning informative tokens in Mamba leads to a severe loss of key knowledge and a drop in performance, making it a poor solution for enhancing efficiency. Token merging, which preserves more token information than pruning, has demonstrated commendable performance in ViTs, but vanilla merging also degrades as the reduction ratio increases, failing to maintain the key knowledge and performance of Mamba. Re-training the token-merged model effectively rebuilds the key knowledge and restores Mamba's performance. Empirically, pruned Vims recovered on ImageNet-1K by our proposed framework R-MeeTo drop at most 0.9% accuracy in our main evaluation. We show how simply and effectively fast recovery can be achieved at the minute level; in particular, Vim-Ti gains a 35.9% accuracy spike over 3 epochs of re-training. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy with a 1.2 $\times$ (up to 1.5 $\times$) speed-up in inference.

🚀 News

โšก๏ธ Faster Vision Mamba is Rebuilt in Minutes

| Hardware | Vim-Ti | Vim-S | Vim-B |
| --- | --- | --- | --- |
| 1 x 8 x H100 (single machine) | 16.2 mins | 25.2 mins | 57.6 mins |
| 2 x 8 x H100 (InfiniBand) | 8.1 mins | 12.9 mins | 30.6 mins |
| 4 x 8 x H100 (InfiniBand) | 4.2 mins | 6.8 mins | 16.9 mins |

Wall time in minutes of re-training Vim-Ti, Vim-S, and Vim-B for 3 epochs with R-MeeTo on three hardware setups. Give us minutes, and we give you back a faster Mamba.

🛠 Dataset Preparation

🛠 Installation

1. Clone the repository

git clone https://github.com/NUS-HPC-AI-Lab/R-MeeTo

2. Create a new Conda environment

conda env create -f environment.yml

or install the necessary packages by requirement.txt

conda create -n R_MeeTo python=3.10.12
pip install -r requirements.txt

3. Install Mamba package manually

git clone https://github.com/hustvl/Vim
cd Vim
# build the causal_conv1d and mamba forks shipped with Vim
pip install -e causal_conv1d
pip install -e mamba-1p1p1
cd ..
git clone https://github.com/OpenGVLab/VideoMamba
cd VideoMamba
# build the causal_conv1d and mamba forks shipped with VideoMamba
pip install -e causal_conv1d
pip install -e mamba
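After installation, a quick import check along the following lines can confirm that the CUDA kernels built correctly. This is a generic sketch: it assumes the standard `causal_conv1d` and `mamba_ssm` module names from the upstream packages, which the Vim/VideoMamba forks may extend.

```python
# Sanity check (assumes the packages above built successfully and a CUDA GPU is available).
import torch
from causal_conv1d import causal_conv1d_fn       # CUDA causal-conv kernel (import check only)
from mamba_ssm import Mamba                      # selective-SSM block

x = torch.randn(1, 196, 192, device="cuda")      # (batch, tokens, dim); 192 matches Vim-Ti width
block = Mamba(d_model=192).cuda()
print(block(x).shape)                            # expected: torch.Size([1, 196, 192])
```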

4. Download the pretrained baseline models from their official sources

See PRETRAINED for downloading the pretrained model of our baseline.

โš™๏ธ Usage

๐Ÿ› ๏ธ Reproduce our results

For image task:

bash ./image_task/exp_sh/tab2/vim_tiny.sh

For video task:

bash ./video_task/exp_sh/tab13/videomamba_tiny.sh

Checkpoints:

See CKPT to find our reproduced checkpoints and logs of the main results.

โฑ๏ธ Measure inference speed

<p align="center"> <img src="./fig/speed.png" width="100%" height="45%" class="center"> </p>

R-MeeTo effectively improves inference speed and is adaptable to consumer-level, enterprise-level, and other high-performance devices. See this example for measuring FLOPs (G) and throughput (im/s).
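If you prefer a standalone measurement, a throughput loop along these lines works for most image models; it is a generic sketch (the `model` argument, batch size, and iteration counts are placeholders), not the exact script linked above.

```python
# Generic throughput sketch: images per second for a model on a single GPU.
import time
import torch

@torch.no_grad()
def throughput(model, batch_size=128, img_size=224, iters=30, warmup=10):
    model.eval().cuda()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(warmup):                 # warm up CUDA kernels before timing
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()                # wait for all kernels before stopping the clock
    return batch_size * iters / (time.time() - start)

# Usage: im_per_s = throughput(my_retrained_vim)
```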

๐Ÿ–ผ๏ธ Visualization

<p align="center"> <img src="./fig/vis_exp.png" width="100%" height="45%" class="center"> </p>

See this example for a visualization of merged tokens on the ImageNet-1K val set using a re-trained Vim-S.
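If you want to render a similar figure yourself, the sketch below colors each image patch by the merged token it ends up in. It assumes you recorded a per-patch group id (the hypothetical `group_ids` array) while merging; it is an illustration, not the repo's visualization script.

```python
# Sketch: overlay each 16x16 patch with the color of the merged token it was assigned to.
import numpy as np
import matplotlib.pyplot as plt

def show_merged_tokens(image: np.ndarray, group_ids, patch: int = 16):
    """image: (H, W, 3) floats in [0, 1]; group_ids: one merged-token id per patch, row-major."""
    h, w = image.shape[0] // patch, image.shape[1] // patch
    grid = np.asarray(group_ids).reshape(h, w)
    colors = np.random.default_rng(0).random((grid.max() + 1, 3))  # one color per merged token
    overlay = np.kron(colors[grid], np.ones((patch, patch, 1)))    # expand patch grid to pixels
    plt.imshow(0.6 * image + 0.4 * overlay)                        # blend image with token colors
    plt.axis("off")
    plt.show()
```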

Citation

If you find our work useful, please consider citing us.

@misc{shi2024faster,
      title={Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training},
      author={Shi, Mingjia and Zhou, Yuhao and Yu, Ruiji and Li, Zekai and Liang, Zhiyuan and Zhao, Xuanlei and
       Peng, Xiaojiang and Rajpurohit, Tanmay and Vedantam, Ramakrishna and
       Zhao, Wangbo and Wang, Kai and You, Yang},
      year={2024},
      eprint={2412.12496},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2412.12496},
}

Acknowledgement

This repo is built in part on ToMe, Vision Mamba, and VideoMamba. We are grateful for their generous contributions to open source.