Strip-MLP
This is the official code (based on the PyTorch framework) for the paper "Strip-MLP: Efficient Token Interaction for Vision MLP". Strip-MLP is a general vision MLP backbone with significant advantages in both performance and computational complexity.
Updates
17/04/2024
Our new work "MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection" has been accepted to IJCAI 2024. Strip-MLP is extended to serve as a strong backbone for the downstream task of object detection. The code of MLP-DINO will be available at MLP-DINO.
15/09/2023
Initial commits: we release the source codes and the checkpoints on ImageNet-1K.
14/07/2023
News: Strip-MLP has been accepted to ICCV 2023!
Introduction
We present Strip-MLP to enrich the token interaction power of deep MLP-based models in three ways.
- Firstly, we introduce a new MLP paradigm called the Strip MLP layer, which allows tokens to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregation in adjacent but different strips of rows (or columns).
- Secondly, a Cascade Group Strip Mixing Module (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature sizes. The module allows tokens to interact more effectively in both within-patch and cross-patch manners, independent of the spatial feature size.
- Finally, based on the Strip MLP layer, we propose a novel Local Strip Mixing Module (LSMM) to boost the token interaction power in the local region.

Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100.
Methods
MLP-based models suffer from a token interaction dilemma: the spatial feature resolution is down-sampled to a small size while the number of channels grows, which means the feature pattern of each token is concentrated mainly on the channel dimension rather than the spatial one. Interacting tokens along the spatial dimension by sharing the weights among all channels seriously ignores the feature pattern differences among channels, which may degrade the token interaction power, especially in deep layers with small spatial feature resolution.
To address these challenges, we propose a new efficient Strip MLP model, dubbed Strip-MLP, to enrich the power of the token interaction layer in three ways. At the level of a single MLP layer, inspired by the cross-block normalization scheme of HOG and the sparse connections between biological neurons, we design a Strip MLP layer that allows tokens to interact with other tokens in a cross-strip manner, enabling each row or column of tokens to contribute differently to other rows or columns. At the level of the token interaction module, we develop the channel-wise group mixing of CGSMM to tackle the problem that the token interaction power decreases in deep layers, where the spatial feature size is significantly reduced but the number of channels multiplies. Finally, considering that existing methods interact tokens mainly over the long range of a row (or column), which may not aggregate tokens well in the local region, we propose a new Local Strip Mixing Module (LSMM) with a small Strip MLP unit to strengthen local token interactions.
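To make the strip-mixing idea concrete, here is a minimal, simplified sketch. It is not the repository's actual Strip MLP layer; the class name, strip size, and reshaping scheme below are illustrative assumptions, intended only to show how a shared linear layer applied over strips of adjacent rows lets tokens interact across rows as well as along the width.

```python
import torch
import torch.nn as nn

class StripMixSketch(nn.Module):
    """Illustrative strip-wise token mixing (NOT the official Strip MLP layer)."""

    def __init__(self, width: int, strip: int = 2):
        super().__init__()
        self.strip = strip
        # One shared Linear mixes all tokens inside a strip of `strip` adjacent rows.
        self.mix = nn.Linear(strip * width, strip * width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H is assumed to be divisible by `strip`.
        B, C, H, W = x.shape
        s = self.strip
        # Group every `s` adjacent rows into one strip and flatten its tokens.
        x = x.reshape(B, C, H // s, s * W)
        # Tokens within the same strip (across its rows and the full width) interact here.
        x = self.mix(x)
        return x.reshape(B, C, H, W)


# Example: mix a 4x8 feature map with 16 channels using strips of 2 rows.
layer = StripMixSketch(width=8, strip=2)
print(layer(torch.randn(1, 16, 4, 8)).shape)  # torch.Size([1, 16, 4, 8])
```

Column-wise mixing follows by transposing the two spatial axes; the actual CGSMM and LSMM described above build on the Strip MLP layer at the module and local levels, respectively.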
Main Results on ImageNet
Results on ImageNet-1K
Model | Dataset | Resolution | Acc@1 | Acc@5 | #Params | FLOPs | FPS | Checkpoint |
---|---|---|---|---|---|---|---|---|
Strip-MLP-T* | ImageNet-1K | 224x224 | 81.2 | 95.6 | 18M | 2.5G | 814 | Baidu |
Strip-MLP-T | ImageNet-1K | 224x224 | 82.2 | 96.1 | 25M | 3.7G | 597 | Baidu |
Strip-MLP-S | ImageNet-1K | 224x224 | 83.3 | 96.6 | 43M | 6.8G | 381 | Baidu |
Strip-MLP-B | ImageNet-1K | 224x224 | 83.6 | 96.5 | 57M | 9.2G | 300 | Baidu |
Citing Strip-MLP
@article{cao2023strip,
title={Strip-MLP: Efficient Token Interaction for Vision MLP},
author={Cao, Guiping and Luo, Shengda and Huang, Wenjian and Lan, Xiangyuan and Jiang, Dongmei and Wang, Yaowei and Zhang, Jianguo},
journal={International Conference on Computer Vision (ICCV)},
year={2023}
}
Data preparation of ImageNet-1K
Download and extract ImageNet train and val images from http://image-net.org/.
The directory structure is the standard layout expected by torchvision's datasets.ImageFolder:
│path/to/imagenet/
├──train/
│ ├── n01530575
│ │ ├── n01530575_188.JPEG
│ │ ├── n01530575_190.JPEG
│ │ ├── ...
│ ├── ...
├──val/
│ ├── n01514668
│ │ ├── ILSVRC2012_val_00011403.JPEG
│ │ ├── ILSVRC2012_val_00012484.JPEG
│ │ ├── ...
│ ├── ...
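With this layout in place, the data can be loaded with torchvision's standard ImageFolder API. The snippet below is a minimal, standalone sketch; the path and transform values are placeholders, not the repository's actual data pipeline, which is driven by the config files.

```python
import torch
from torchvision import datasets, transforms

data_root = "path/to/imagenet"  # placeholder: replace with your local ImageNet root

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# ImageFolder expects exactly the class-subfolder layout shown above.
train_set = datasets.ImageFolder(f"{data_root}/train", transform=transform)
val_set = datasets.ImageFolder(f"{data_root}/val", transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=8)
```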
The CIFAR-10 and CIFAR-100 datasets can be downloaded automatically by torchvision (torchvision.datasets.CIFAR10 and torchvision.datasets.CIFAR100), as shown below.
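A minimal sketch of the automatic download (the root path is a placeholder):

```python
from torchvision import datasets

# download=True fetches and extracts the archives on first use.
cifar10 = datasets.CIFAR10(root="./data", train=True, download=True)
cifar100 = datasets.CIFAR100(root="./data", train=True, download=True)
```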
Data preparation of Caltech-101
Download the dataset from Caltech-101. We randomly split the raw data into a training set and a testing set. We provide our dataset-splitting code (see the file categories101_train_val_split.py), which uses 80% of each class as the training set and the remaining data as the testing set. A fixed number, 1024, is set as the random seed in the program so that the split can be reproduced.
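For illustration, a minimal sketch of the described per-class 80/20 split with a fixed seed is shown below. It is a hypothetical stand-in, not the repository's categories101_train_val_split.py, and it assumes the raw data is arranged as one subfolder per class.

```python
import os
import random

def split_per_class(root: str, train_ratio: float = 0.8, seed: int = 1024):
    """Hypothetical per-class split: `train_ratio` of each class goes to training."""
    random.seed(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for cls in sorted(os.listdir(root)):
        files = sorted(os.listdir(os.path.join(root, cls)))
        random.shuffle(files)
        cut = int(len(files) * train_ratio)
        train += [(os.path.join(cls, f), cls) for f in files[:cut]]
        test += [(os.path.join(cls, f), cls) for f in files[cut:]]
    return train, test
```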
Train Strip-MLP
To train the Strip-MLP-Base on ImageNet-1K:
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --cfg configs/smlp_base_alpha3_patch4_224_imagenet1k.yaml --data-path <imagenet-path> --batch-size 128
Evaluate Strip-MLP
To evaluate the Strip-MLP-Base on ImageNet-1K:
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main_eval.py --cfg configs/smlp_base_alpha3_patch4_224_imagenet1k.yaml --data-path <imagenet-path> --batch-size 128
License
Acknowledgement
Our source code is built on top of Swin-Transformer and SPACH.