Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges
The Mamba-360 repository collects State Space Models (SSMs) across various domains.
Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequences
- Transformers have dominated sequence modeling tasks such as machine translation and named entity recognition (NER), but they suffer from $O(N^2)$ attention complexity and face challenges in handling inductive bias for long sequences (a minimal sketch contrasting attention with an SSM-style recurrence follows this overview).
- State Space Models (SSMs) have emerged as promising alternatives for sequence modeling, especially with the advent of S4 and its variants, such as S4ND, HiPPO, Hyena, Diagonal State Spaces (DSS), Gated State Spaces (GSS), the Linear Recurrent Unit (LRU), Liquid-S4, Long-Conv, Mega, HGRN, TNN, S5, Mamba, MambaFormer, Mamba-MoE, etc.
- In this survey, we categorize foundational SSMs based on three paradigms: Structural architectures, Gating architectures, and Recurrent architectures.
- The survey highlights crucial applications of SSMs for long-sequence contexts across various domains, including:
- Natural Language Processing (NLP)
- Computer Vision
- Medical Domain (including genomics, drug design)
- Video Domain
- Audio and Speech
- Multi-Modal Domain
- Tabular Domain
- Recommendation Systems
- Graph Domain
- Time Series Forecasting
- The work consolidates the performance of SSMs on benchmark datasets in:
- Text Domain: Long Range Arena (LRA), WikiText, GLUE, Pile
- Vision Domain: ImageNet-1K
- Video Datasets: Kinetics-400, SSv2, Breakfast, COIN, LVU
- Time Series Datasets: Electricity, Weather, Traffic, ETTh1, ETTh2, ETTm1, and ETTm2.
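As a rough illustration of the complexity gap mentioned above, here is a toy NumPy sketch (illustrative shapes and random matrices only, not any model from this survey) contrasting dense self-attention, which materializes an $N \times N$ score matrix, with an SSM-style linear recurrence that carries only a fixed-size hidden state.

```python
# Toy comparison: dense attention is O(N^2 * d) in time/memory,
# while an SSM-style recurrence is O(N * d * n) with a fixed-size state.
import numpy as np

N, d, n = 2048, 64, 16            # sequence length, feature dim, toy state size
rng = np.random.default_rng(0)
x = rng.standard_normal((N, d))

# --- Dense self-attention (single head, no projections, for brevity) ---
scores = x @ x.T / np.sqrt(d)                       # (N, N): quadratic in N
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)                 # row-wise softmax
y_attn = attn @ x                                   # (N, d)

# --- SSM-style linear recurrence: one state update per time step ---
A = 0.9 * np.eye(n)                                 # toy stable state transition
B = 0.1 * rng.standard_normal((n, d))
C = 0.1 * rng.standard_normal((d, n))
h = np.zeros(n)
y_ssm = np.empty_like(x)
for t in range(N):                                  # no N x N matrix is ever formed
    h = A @ h + B @ x[t]
    y_ssm[t] = C @ h

print(y_attn.shape, y_ssm.shape)                    # both (N, d)
```

Practical SSM layers (S4, S5, Mamba) replace this naive Python loop with convolutional or parallel-scan formulations, but the asymptotic picture is the same.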
Badri N. Patro and Vijay S. Agneeswaran (Microsoft)
@article{patro2024mamba,
title={Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges},
author={Patro, Badri Narayana and Agneeswaran, Vijay Srinivas},
journal={arXiv preprint arXiv:2404.16112},
year={2024}
}
Advanced State Space Models
SSMs for Various Applications
Architectural Evolution
Basics of SSMs
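As a brief refresher (standard formulation, not specific to any single paper listed here): S4-style models share the linear state space system

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t),$$

which, with step size $\Delta$ and a zero-order-hold (ZOH) discretization as used in Mamba, becomes

$$\bar{A} = e^{\Delta A}, \qquad \bar{B} = (\Delta A)^{-1}\big(e^{\Delta A} - I\big)\,\Delta B,$$

$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k + D\,x_k.$$

Here $h$ is the hidden state and the $D$ term acts as a skip connection. S4, DSS/S4D, S5, and Mamba differ mainly in how $A$ is structured (e.g., HiPPO initialization, diagonal parameterization) and in whether $\Delta$, $B$, and $C$ are made input-dependent (selective), as in Mamba.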
SSM SOTA on the ImageNet-1K dataset with image size 224 x 224.
Table: SSM SOTA on ImageNet-1K. This table shows the performance of various SSM models on image recognition using the ImageNet-1K dataset (Deng et al., 2009). Models are grouped into three categories based on their GFLOPs. This table is adapted from the respective original papers.
Method | Image Size | #Param. | FLOPs | Top-1 acc. |
---|---|---|---|---|
HyenaViT-B | 224^2 | 88M | - | 78.5 |
S4ND-ViT-B | 224^2 | 89M | - | 80.4 |
TNN-T | - | 6.4M | - | 72.29 |
TNN-S | - | 23.4M | - | 79.20 |
Vim-Ti | 224^2 | 7M | - | 76.1 |
Vim-S | 224^2 | 26M | - | 80.5 |
HGRN-T | - | 6.1M | - | 74.40 |
HGRN-S | - | 23.7M | - | 80.09 |
PlainMamba-L1 | 224^2 | 7M | 3.0G | 77.9 |
PlainMamba-L2 | 224^2 | 25M | 8.1G | 81.6 |
PlainMamba-L3 | 224^2 | 50M | 14.4G | 82.3 |
Mamba-2D-S | 224^2 | 24M | - | 81.7 |
Mamba-2D-B | 224^2 | 92M | - | 83.0 |
VMamba-T | 224^2 | 22M | 5.6G | 82.2 |
VMamba-S | 224^2 | 44M | 11.2G | 83.5 |
VMamba-B | 224^2 | 75M | 18.0G | 83.2 |
LocalVMamba-T | 224^2 | 26M | 5.7G | 82.7 |
LocalVMamba-S | 224^2 | 50M | 11.4G | 83.7 |
SiMBA-S(Monarch) | 224^2 | 18.5M | 3.6G | 81.1 |
SiMBA-B(Monarch) | 224^2 | 26.9M | 6.3G | 82.6 |
SiMBA-L(Monarch) | 224^2 | 42M | 10.7G | 83.8 |
ViM2-T | 224^2 | 20M | - | 82.7 |
ViM2-S | 224^2 | 43M | - | 83.7 |
ViM2-B | 224^2 | 74M | - | 83.9 |
SiMBA-S(EinFFT) | 224^2 | 15.3M | 2.4G | 81.7 |
SiMBA-B(EinFFT) | 224^2 | 22.8M | 5.2G | 83.5 |
SiMBA-L(EinFFT) | 224^2 | 36.6M | 9.6G | 84.4 |
SiMBA-S(MLP) | 224^2 | 26.5M | 5.0G | 84.0 |
SiMBA-B(MLP) | 224^2 | 40.0M | 9.0G | 84.7 |
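For context on how the "#Param." and "Image Size" columns above are typically obtained, here is a minimal sketch using PyTorch/torchvision. ResNet-50 is purely a stand-in (the SSM backbones listed above are not all packaged in torchvision), and GFLOPs are usually measured with a separate profiler (e.g., fvcore or ptflops), which is omitted here.

```python
# Hypothetical illustration of reporting parameter counts at the 224x224
# ImageNet-1K benchmark resolution; the model choice is a stand-in only.
import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()

# Parameter count, reported in millions (the "#Param." column).
n_params = sum(p.numel() for p in model.parameters())
print(f"#Param.: {n_params / 1e6:.1f}M")  # ~25.6M for ResNet-50

# One forward pass at the benchmark resolution (3 x 224 x 224).
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000]) -> ImageNet-1K classes
```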
State-of-the-art results of various vision models (ConvNets, Transformers, SSMs) on the ImageNet-1K dataset with image size 224 x 224.
Table: SOTA on ImageNet-1K. The table shows the performance of various vision backbones on the ImageNet-1K dataset for image recognition. $\star$ indicates models additionally trained with Token Labeling for patch encoding. The vision models are grouped into three categories based on their GFLOPs: Small (GFLOPs $<$ 5), Base (5 $\leq$ GFLOPs $<$ 10), and Large (10 $\leq$ GFLOPs $<$ 30). This table is adapted from the SiMBA paper.
Method | Image Size | #Param. | FLOPs | Top-1 acc. |
---|---|---|---|---|
Convnets | ||||
ResNet-101 | $224^2$ | 45M | - | 77.4 |
RegNetY-8G | $224^2$ | 39M | 8.0G | 81.7 |
ResNet-152 | $224^2$ | 60M | - | 78.3 |
RegNetY-16G | $224^2$ | 84M | 16.0G | 82.9 |
Transformers | ||||
DeiT-S | $224^2$ | 22M | 4.6G | 79.8 |
Swin-T | $224^2$ | 29M | 4.5G | 81.3 |
EffNet-B4 | $380^2$ | 19M | 4.2G | 82.9 |
WaveViT-H-S$^\star$ | $224^2$ | 22.7M | 4.1G | 82.9 |
SpectFormer-H-S$^\star$ | $224^2$ | 22.2M | 3.9G | 84.3 |
SVT-H-S$^\star$ | $224^2$ | 22M | 3.9G | 84.2 |
SCT-H-S$^\star$ | $224^2$ | 21.7M | 4.1G | 84.5 |
EffNet-B5 | $456^2$ | 30M | 9.9G | 83.6 |
Swin-S | $224^2$ | 50M | 8.7G | 83.0 |
CMT-B | $224^2$ | 45M | 9.3G | 84.5 |
MaxViT-S | $224^2$ | 69M | 11.7G | 84.5 |
iFormer-B | $224^2$ | 48M | 9.4G | 84.6 |
Wave-ViT-B$^\star$ | $224^2$ | 33M | 7.2G | 84.8 |
SpectFormer-H-B$^\star$ | $224^2$ | 33.1M | 6.3G | 85.1 |
SVT-H-B$^\star$ | $224^2$ | 32.8M | 6.3G | 85.2 |
SCT-H-B$^\star$ | $224^2$ | 32.5M | 6.5G | 85.2 |
M2-ViT-b | $224^2$ | 45M | - | 79.5 |
DeiT-B | $224^2$ | 86M | 17.5G | 81.8 |
Swin-B | $224^2$ | 88M | 15.4G | 83.5 |
M2-Swin-B | $224^2$ | 50M | - | 83.5 |
EffNet-B6 | $528^2$ | 43M | 19.0G | 84.0 |
MaxViT-B | $224^2$ | 120M | 23.4G | 85.0 |
VOLO-D3$^\star$ | $224^2$ | 86M | 20.6G | 85.4 |
Wave-ViT-L$^\star$ | $224^2$ | 57M | 14.8G | 85.5 |
SpectFormer-H-L$^\star$ | $224^2$ | 54.7M | 12.7G | 85.7 |
SVT-H-L$^\star$ | $224^2$ | 54.0M | 12.7G | 85.7 |
SCT-H-L$^\star$ | $224^2$ | 54.1M | 13.4G | 85.9 |
SSMs | ||||
Vim-Ti | $224^2$ | 7M | - | 76.1 |
PlainMamba-L1 | $224^2$ | 7M | 3.0G | 77.9 |
VMamba-T | $224^2$ | 22M | 5.6G | 82.2 |
SiMBA-S(Monarch) | $224^2$ | 18.5M | 3.6G | 81.1 |
Mamba-2D-S | $224^2$ | 24M | - | 81.7 |
SiMBA-S(EinFFT) | $224^2$ | 15.3M | 2.4G | 81.7 |
LocalVMamba-T | $224^2$ | 26M | 5.7G | 82.7 |
ViM2-T | $224^2$ | 20M | - | 82.7 |
SiMBA-S(MLP) | $224^2$ | 26.5M | 5.0G | 84.0 |
Vim-S | $224^2$ | 26M | - | 80.5 |
PlainMamba-L2 | $224^2$ | 25M | 8.1G | 81.6 |
SiMBA-B(Monarch) | $224^2$ | 26.9M | 6.3G | 82.6 |
Mamba-2D-B | $224^2$ | 92M | - | 83.0 |
SiMBA-B(EinFFT) | $224^2$ | 22.8M | 5.2G | 83.5 |
VMamba-S | $224^2$ | 44M | 11.2G | 83.5 |
LocalVMamba-S | $224^2$ | 50M | 11.4G | 83.7 |
ViM2-S | $224^2$ | 43M | - | 83.7 |
SiMBA-B(MLP) | $224^2$ | 40.0M | 9.0G | 84.7 |
HyenaViT-B | $224^2$ | 88M | - | 78.5 |
S4ND-ViT-B | $224^2$ | 89M | - | 80.4 |
PlainMamba-L3 | $224^2$ | 50M | 14.4G | 82.3 |
VMamba-B | $224^2$ | 75M | 18.0G | 83.2 |
SiMBA-L(Monarch) | $224^2$ | 42M | 10.7G | 83.8 |
ViM2-B | $224^2$ | 74M | - | 83.9 |
SiMBA-L(EinFFT) | $224^2$ | 36.6M | 9.6G | 84.4 |
State-of-the-art results on the LRA benchmark tasks (Tay et al., 2020).
Table: Test accuracy on the LRA benchmark tasks (Tay et al., 2020). "✗" indicates the model did not exceed random guessing. The results for models ranging from Transformer to Performer are sourced from Tay et al. (2020). We compiled this table using data from the HGRN paper by Qin et al. (2023) and the S5 paper by Smith et al. (2022), consolidating the results into a unified presentation below.
Model | ListOps | Text | Retrieval | Image | Pathfinder | Path-X | Avg. |
---|---|---|---|---|---|---|---|
Transformer | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | ✗ | 53.66 |
Local Attention | 15.82 | 52.98 | 53.39 | 41.46 | 66.63 | ✗ | 46.71 |
Sparse Trans. | 17.07 | 63.58 | 59.59 | 44.24 | 71.71 | ✗ | 51.03 |
Longformer | 35.63 | 62.85 | 56.89 | 42.22 | 69.71 | ✗ | 52.88 |
Linformer | 35.70 | 53.94 | 52.27 | 38.56 | 76.34 | ✗ | 51.14 |
Reformer | 37.27 | 56.10 | 53.40 | 38.07 | 68.50 | ✗ | 50.56 |
Sinkhorn Trans. | 33.67 | 61.20 | 53.83 | 41.23 | 67.45 | ✗ | 51.23 |
Synthesizer | 36.99 | 61.68 | 54.67 | 41.61 | 69.45 | ✗ | 52.40 |
BigBird | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | ✗ | 54.17 |
Linear Trans. | 16.13 | 65.90 | 53.09 | 42.34 | 75.30 | ✗ | 50.46 |
Performer | 18.01 | 65.40 | 53.82 | 42.77 | 77.05 | ✗ | 51.18 |
cosFormer | 36.50 | 67.70 | 83.15 | 51.23 | 71.96 | - | 51.76 |
FLASH | 38.70 | 64.10 | 86.10 | 47.40 | 70.25 | - | 51.09 |
FNet | 35.33 | 65.11 | 59.61 | 38.67 | 77.80 | ✗ | 54.42 |
Nyströmformer | 37.15 | 65.52 | 79.56 | 41.58 | 70.94 | ✗ | 57.46 |
Luna-256 | 37.25 | 64.57 | 79.29 | 47.38 | 77.72 | ✗ | 59.37 |
H-Transformer-1D | 49.53 | 78.69 | 63.99 | 46.05 | 68.78 | ✗ | 61.41 |
CCNN | 43.60 | 84.08 | ✗ | 88.90 | 91.51 | ✗ | 68.02 |
S4 | 58.35 | 76.02 | 87.09 | 87.26 | 86.05 | 88.10 | 80.48 |
DSS(EXP) | 59.70 | 84.60 | 87.60 | 84.90 | 84.70 | 85.60 | 81.18 |
DSS(SOFTMAX) | 60.60 | 84.80 | 87.80 | 85.70 | 84.60 | 87.80 | 81.88 |
S4D-LegS | 60.47 | 86.18 | 89.46 | 88.19 | 93.06 | 91.95 | 84.89 |
Mega-chunk | 58.76 | 90.19 | 90.97 | 85.80 | 94.41 | 93.81 | 85.66 |
S4-LegS | 59.60 | 86.82 | 90.90 | 88.65 | 94.20 | 96.35 | 86.09 |
TNN | 61.04 | 87.90 | 90.97 | 88.24 | 93.00 | 96.10 | 86.21 |
LRU | 60.20 | 89.40 | 89.90 | 89.00 | 95.10 | 94.20 | 86.30 |
HGRN | 59.95 | 88.14 | 94.23 | 88.69 | 92.92 | 97.50 | 86.91 |
SGConv | 61.45 | 89.2 | 91.11 | 87.97 | 95.46 | 97.83 | 87.17 |
Liquid-S4 | 62.75 | 89.02 | 91.20 | 89.50 | 94.8 | 96.66 | 87.32 |
S5 | 62.15 | 89.31 | 91.40 | 88.00 | 95.33 | 98.58 | 87.46 |
Mega | 63.14 | 90.43 | 91.25 | 90.44 | 96.01 | 97.98 | 88.21 |
Multivariate Time Series Benchmark Datasets
Table: Multivariate long-term forecasting results. Prediction lengths $T \in \{96, 192, 336, 720\}$ are used for all datasets with a look-back window of 96; lower MSE/MAE is better. This table is adapted from the SiMBA paper (Patro and Agneeswaran, 2024).
Dataset | T | SiMBA MSE | SiMBA MAE | TimesNet MSE | TimesNet MAE | Crossformer MSE | Crossformer MAE | PatchTST MSE | PatchTST MAE | ETSFormer MSE | ETSFormer MAE | DLinear MSE | DLinear MAE | FEDFormer MSE | FEDFormer MAE | Autoformer MSE | Autoformer MAE | Pyraformer MSE | Pyraformer MAE | MTGNN MSE | MTGNN MAE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ETTm1 | 96 | 0.324 | 0.360 | 0.338 | 0.375 | 0.349 | 0.395 | 0.339 | 0.377 | 0.375 | 0.398 | 0.345 | 0.372 | 0.379 | 0.419 | 0.505 | 0.475 | 0.543 | 0.510 | 0.379 | 0.446 |
ETTm1 | 192 | 0.363 | 0.382 | 0.374 | 0.387 | 0.405 | 0.411 | 0.376 | 0.392 | 0.408 | 0.410 | 0.380 | 0.389 | 0.426 | 0.441 | 0.553 | 0.496 | 0.557 | 0.537 | 0.470 | 0.428 |
ETTm1 | 336 | 0.395 | 0.405 | 0.410 | 0.411 | 0.432 | 0.431 | 0.408 | 0.417 | 0.435 | 0.428 | 0.413 | 0.413 | 0.445 | 0.459 | 0.621 | 0.537 | 0.754 | 0.655 | 0.473 | 0.430 |
ETTm1 | 720 | 0.451 | 0.437 | 0.478 | 0.450 | 0.487 | 0.463 | 0.499 | 0.461 | 0.499 | 0.462 | 0.474 | 0.453 | 0.543 | 0.490 | 0.671 | 0.561 | 0.908 | 0.724 | 0.553 | 0.479 |
ETTm2 | 96 | 0.177 | 0.263 | 0.187 | 0.267 | 0.208 | 0.292 | 0.192 | 0.273 | 0.189 | 0.280 | 0.193 | 0.292 | 0.203 | 0.287 | 0.255 | 0.339 | 0.435 | 0.507 | 0.203 | 0.299 |
ETTm2 | 192 | 0.245 | 0.306 | 0.249 | 0.309 | 0.263 | 0.332 | 0.252 | 0.314 | 0.253 | 0.319 | 0.284 | 0.362 | 0.269 | 0.328 | 0.281 | 0.340 | 0.730 | 0.673 | 0.265 | 0.328 |
ETTm2 | 336 | 0.304 | 0.343 | 0.321 | 0.351 | 0.337 | 0.369 | 0.318 | 0.357 | 0.314 | 0.357 | 0.369 | 0.427 | 0.325 | 0.366 | 0.339 | 0.372 | 1.201 | 0.845 | 0.365 | 0.374 |
ETTm2 | 720 | 0.400 | 0.399 | 0.408 | 0.403 | 0.429 | 0.430 | 0.413 | 0.416 | 0.414 | 0.413 | 0.554 | 0.522 | 0.421 | 0.415 | 0.433 | 0.432 | 3.625 | 1.451 | 0.461 | 0.459 |
ETTh1 | 96 | 0.379 | 0.395 | 0.384 | 0.402 | 0.384 | 0.428 | 0.385 | 0.408 | 0.494 | 0.479 | 0.386 | 0.400 | 0.376 | 0.419 | 0.449 | 0.459 | 0.664 | 0.612 | 0.515 | 0.517 |
ETTh1 | 192 | 0.432 | 0.424 | 0.436 | 0.429 | 0.438 | 0.452 | 0.431 | 0.432 | 0.538 | 0.504 | 0.437 | 0.432 | 0.420 | 0.448 | 0.500 | 0.482 | 0.790 | 0.681 | 0.553 | 0.522 |
ETTh1 | 336 | 0.473 | 0.443 | 0.491 | 0.469 | 0.495 | 0.483 | 0.485 | 0.462 | 0.574 | 0.521 | 0.481 | 0.459 | 0.459 | 0.465 | 0.521 | 0.496 | 0.891 | 0.738 | 0.612 | 0.577 |
ETTh1 | 720 | 0.483 | 0.469 | 0.521 | 0.500 | 0.522 | 0.501 | 0.497 | 0.483 | 0.562 | 0.535 | 0.519 | 0.516 | 0.506 | 0.507 | 0.514 | 0.512 | 0.963 | 0.782 | 0.609 | 0.597 |
ETTh2 | 96 | 0.290 | 0.339 | 0.340 | 0.374 | 0.347 | 0.391 | 0.343 | 0.376 | 0.340 | 0.391 | 0.333 | 0.387 | 0.358 | 0.397 | 0.346 | 0.388 | 0.645 | 0.597 | 0.354 | 0.454 |
ETTh2 | 192 | 0.373 | 0.390 | 0.402 | 0.414 | 0.419 | 0.427 | 0.405 | 0.417 | 0.430 | 0.439 | 0.477 | 0.476 | 0.429 | 0.439 | 0.456 | 0.452 | 0.788 | 0.683 | 0.457 | 0.464 |
ETTh2 | 336 | 0.376 | 0.406 | 0.452 | 0.452 | 0.449 | 0.465 | 0.448 | 0.453 | 0.485 | 0.479 | 0.594 | 0.541 | 0.496 | 0.487 | 0.482 | 0.486 | 0.907 | 0.747 | 0.515 | 0.540 |
ETTh2 | 720 | 0.407 | 0.431 | 0.462 | 0.468 | 0.479 | 0.505 | 0.464 | 0.483 | 0.500 | 0.497 | 0.831 | 0.657 | 0.463 | 0.474 | 0.515 | 0.511 | 0.963 | 0.783 | 0.532 | 0.576 |
Electricity | 96 | 0.165 | 0.253 | 0.168 | 0.272 | 0.185 | 0.288 | 0.159 | 0.268 | 0.187 | 0.304 | 0.197 | 0.282 | 0.193 | 0.308 | 0.201 | 0.317 | 0.386 | 0.449 | 0.217 | 0.318 |
Electricity | 192 | 0.173 | 0.262 | 0.198 | 0.300 | 0.211 | 0.312 | 0.195 | 0.296 | 0.212 | 0.329 | 0.209 | 0.301 | 0.214 | 0.329 | 0.231 | 0.338 | 0.376 | 0.443 | 0.260 | 0.348 |
Electricity | 336 | 0.188 | 0.277 | 0.198 | 0.300 | 0.211 | 0.312 | 0.195 | 0.296 | 0.212 | 0.329 | 0.209 | 0.301 | 0.214 | 0.329 | 0.231 | 0.338 | 0.376 | 0.443 | 0.260 | 0.348 |
Electricity | 720 | 0.214 | 0.305 | 0.220 | 0.320 | 0.223 | 0.335 | 0.215 | 0.317 | 0.233 | 0.345 | 0.245 | 0.333 | 0.246 | 0.355 | 0.254 | 0.361 | 0.376 | 0.445 | 0.290 | 0.369 |
Traffic | 96 | 0.468 | 0.268 | 0.593 | 0.321 | 0.591 | 0.329 | 0.583 | 0.319 | 0.607 | 0.392 | 0.650 | 0.396 | 0.587 | 0.366 | 0.613 | 0.388 | 0.867 | 0.468 | 0.660 | 0.437 |
Traffic | 192 | 0.413 | 0.317 | 0.617 | 0.336 | 0.607 | 0.345 | 0.591 | 0.331 | 0.621 | 0.399 | 0.598 | 0.370 | 0.604 | 0.373 | 0.616 | 0.382 | 0.869 | 0.467 | 0.649 | 0.438 |
Traffic | 336 | 0.529 | 0.284 | 0.629 | 0.336 | 0.613 | 0.339 | 0.599 | 0.332 | 0.622 | 0.396 | 0.605 | 0.373 | 0.621 | 0.383 | 0.622 | 0.337 | 0.881 | 0.469 | 0.653 | 0.472 |
Traffic | 720 | 0.564 | 0.297 | 0.640 | 0.350 | 0.620 | 0.348 | 0.601 | 0.341 | 0.632 | 0.396 | 0.645 | 0.394 | 0.626 | 0.382 | 0.660 | 0.408 | 0.896 | 0.473 | 0.639 | 0.437 |
Weather | 96 | 0.176 | 0.219 | 0.172 | 0.220 | 0.191 | 0.251 | 0.171 | 0.230 | 0.197 | 0.281 | 0.196 | 0.255 | 0.217 | 0.296 | 0.266 | 0.336 | 0.622 | 0.556 | 0.230 | 0.329 |
Weather | 192 | 0.222 | 0.260 | 0.219 | 0.261 | 0.219 | 0.279 | 0.219 | 0.271 | 0.237 | 0.312 | 0.237 | 0.296 | 0.276 | 0.336 | 0.307 | 0.367 | 0.739 | 0.624 | 0.263 | 0.322 |
Weather | 336 | 0.275 | 0.297 | 0.280 | 0.306 | 0.287 | 0.332 | 0.277 | 0.321 | 0.298 | 0.353 | 0.283 | 0.335 | 0.339 | 0.380 | 0.359 | 0.395 | 1.004 | 0.753 | 0.354 | 0.396 |
Weather | 720 | 0.350 | 0.349 | 0.365 | 0.359 | 0.368 | 0.378 | 0.365 | 0.367 | 0.352 | 0.288 | 0.345 | 0.381 | 0.403 | 0.428 | 0.419 | 0.428 | 1.420 | 0.934 | 0.409 | 0.371 |
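For reference, the two metrics reported above are the standard mean squared error and mean absolute error over the prediction horizon $T$ (averaged over all variables and test windows; exact averaging conventions vary slightly across papers), with lower values being better:

$$\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2, \qquad \mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left\lvert \hat{y}_t - y_t \right\rvert.$$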
Comparison with SoTA methods on 8 benchmark datasets for multimodal applications
Table: Comparison with State-of-the-Art (SoTA) methods on 8 benchmarks. Benchmark names are abbreviated due to space limits: VQA-v2 (Goyal et al., 2017); GQA (Hudson and Manning, 2019); SQA-I: ScienceQA-IMG (Lu et al., 2022); VQA-T: TextVQA (Singh et al., 2019); POPE (Li et al., 2023); MME (Yin et al., 2023); MMB: MMBench (Liu et al., 2023); MM-Vet (Yu et al., 2023). PT and IT indicate the number of samples in the pretraining and instruction tuning stages, respectively. This table is adapted from the VL-Mamba paper (Qiao et al., 2024).
Method | LLM | PT | IT | VQA-v2 | GQA | SQA-I | VQA-T | POPE | MME | MMB | MM-Vet |
---|---|---|---|---|---|---|---|---|---|---|---|
BLIP-2 | Vicuna-13B | 129M | - | 41.0 | 41.0 | 61.0 | 42.5 | 85.3 | 1293.8 | -- | 22.4 |
MiniGPT-4 | Vicuna-7B | 5M | 5K | - | 32.2 | - | - | - | 581.7 | 23.0 | - |
InstructBLIP | Vicuna-7B | 129M | 1.2M | -- | 49.2 | 60.5 | 50.1 | -- | -- | 36 | 26.2 |
InstructBLIP | Vicuna-13B | 129M | 1.2M | -- | 49.5 | 63.1 | 50.7 | 78.9 | 1212.8 | -- | 25.6 |
Shikra | Vicuna-13B | 600K | 5.5M | 77.4 | -- | -- | -- | -- | -- | 58.8 | -- |
Otter | LLaMA-7B | - | - | - | - | - | - | - | 1292.3 | 48.3 | 24.6 |
mPLUG-Owl | LLaMA-7B | 2.1M | 102K | - | - | - | - | - | 967.3 | 49.4 | - |
IDEFICS-9B | LLaMA-7B | 353M | 1M | 50.9 | 38.4 | -- | 25.9 | -- | -- | 48.2 | -- |
IDEFICS-80B | LLaMA-65B | 353M | 1M | 60.0 | 45.2 | -- | 30.9 | -- | -- | 54.5 | -- |
Qwen-VL | Qwen-7B | 1.4B | 50M | 78.8 | 59.3 | 67.1 | 63.8 | -- | -- | 38.2 | -- |
Qwen-VL-Chat | Qwen-7B | 1.4B | 50M | 78.2 | 57.5 | 68.2 | 61.5 | -- | 1487.5 | 60.6 | -- |
LLaVA-1.5 | Vicuna-7B | 558K | 665K | 78.5 | 62.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 30.5 |
LLaVA-1.5 | Vicuna-13B | 558K | 665K | 80.0 | 63.3 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 35.4 |
LLaVA-Phi | Phi-2-2.7B | 558K | 665K | 71.4 | - | 68.4 | 48.6 | 85.0 | 1335.1 | 59.8 | 28.9 |
MobileVLM-3B | MobileLLaMA-2.7B | 558K | 665K | - | 59.0 | 61.2 | 47.5 | 84.9 | 1288.9 | 59.6 | - |
Cobra | Mamba-2.8B | - | - | 75.9 | 58.5 | - | 46.0 | 88.0 | - | - | - |
VL-Mamba | Mamba LLM-2.8B | 558K | 665K | 76.6 | 56.2 | 65.4 | 48.9 | 84.4 | 1369.6 | 57.0 | 32.6 |