
Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

The Mamba-360 framework is a collection of state space models across various domains.


Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequences

- In this survey, we categorize foundational SSMs based on three paradigms: Structural architectures, Gating architectures, and Recurrent architectures.


A Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

Badri N. Patro and Vijay S. Agneeswaran, Microsoft

```bibtex
@article{patro2024mamba,
  title={Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges},
  author={Patro, Badri Narayana and Agneeswaran, Vijay Srinivas},
  journal={arXiv preprint arXiv:2404.16112},
  year={2024}
}
```

Advanced State Space Models

(Figure: Mamba-360 overview of advanced state space models.)

SSMs for Various Applications

(Figure: SSM applications across various domains.)

Architectural Evolution

(Figure: architectural evolution of SSMs.)

Basics of SSMs

(Figure: the basic state space model formulation.)

SSM SOTA on the ImageNet-1K dataset with image size 224×224.

Table: SSM SOTA on ImageNet-1K. This table shows the performance of various SSM models for image recognition on the ImageNet-1K dataset (Deng et al., 2009). Models are grouped into three categories based on their GFLOPs. This table is adapted from the original source.

| Method | Image Size | #Param. | FLOPs | Top-1 acc. |
|---|---|---|---|---|
| HyenaViT-B | $224^2$ | 88M | - | 78.5 |
| S4ND-ViT-B | $224^2$ | 89M | - | 80.4 |
| TNN-T | - | 6.4M | - | 72.29 |
| TNN-S | - | 23.4M | - | 79.20 |
| Vim-Ti | $224^2$ | 7M | - | 76.1 |
| Vim-S | $224^2$ | 26M | - | 80.5 |
| HGRN-T | - | 6.1M | - | 74.40 |
| HGRN-S | - | 23.7M | - | 80.09 |
| PlainMamba-L1 | $224^2$ | 7M | 3.0G | 77.9 |
| PlainMamba-L2 | $224^2$ | 25M | 8.1G | 81.6 |
| PlainMamba-L3 | $224^2$ | 50M | 14.4G | 82.3 |
| Mamba-2D-S | $224^2$ | 24M | - | 81.7 |
| Mamba-2D-B | $224^2$ | 92M | - | 83.0 |
| VMamba-T | $224^2$ | 22M | 5.6G | 82.2 |
| VMamba-S | $224^2$ | 44M | 11.2G | 83.5 |
| VMamba-B | $224^2$ | 75M | 18.0G | 83.2 |
| LocalVMamba-T | $224^2$ | 26M | 5.7G | 82.7 |
| LocalVMamba-S | $224^2$ | 50M | 11.4G | 83.7 |
| SiMBA-S(Monarch) | $224^2$ | 18.5M | 3.6G | 81.1 |
| SiMBA-B(Monarch) | $224^2$ | 26.9M | 6.3G | 82.6 |
| SiMBA-L(Monarch) | $224^2$ | 42M | 10.7G | 83.8 |
| ViM2-T | $224^2$ | 20M | - | 82.7 |
| ViM2-S | $224^2$ | 43M | - | 83.7 |
| ViM2-B | $224^2$ | 74M | - | 83.9 |
| SiMBA-S(EinFFT) | $224^2$ | 15.3M | 2.4G | 81.7 |
| SiMBA-B(EinFFT) | $224^2$ | 22.8M | 5.2G | 83.5 |
| SiMBA-L(EinFFT) | $224^2$ | 36.6M | 9.6G | 84.4 |
| SiMBA-S(MLP) | $224^2$ | 26.5M | 5.0G | 84.0 |
| SiMBA-B(MLP) | $224^2$ | 40.0M | 9.0G | 84.7 |
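The #Param. and FLOPs columns above are measured at the $224^2$ input resolution. As a hedged illustration of how such numbers are typically reproduced (assuming PyTorch with the fvcore profiler, which is not tooling prescribed by the survey; `resnet18` is just a stand-in for any vision backbone):

```python
import torch
from torchvision.models import resnet18
from fvcore.nn import FlopCountAnalysis

model = resnet18().eval()  # stand-in; any nn.Module backbone works here
params_m = sum(p.numel() for p in model.parameters()) / 1e6
with torch.no_grad():
    # fvcore counts fused multiply-adds, the convention most of the
    # papers above follow when reporting "FLOPs"
    flops = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total()
print(f"{params_m:.1f}M params, {flops / 1e9:.1f}G FLOPs")
```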

State-of-the-art results of various vision models (ConvNets, Transformers, SSMs) on the ImageNet-1K dataset with image size 224×224.

Table: SOTA on ImageNet-1K. The table shows the performance of various vision backbones on the ImageNet-1K dataset for image recognition tasks. $\star$ indicates the model is additionally trained with Token Labeling for patch encoding. The vision models are grouped into three categories based on their GFLOPs: Small (GFLOPs $<$ 5), Base (5 $\leq$ GFLOPs $<$ 10), and Large (10 $\leq$ GFLOPs $<$ 30). This table is adapted from the SiMBA paper.

| Method | Image Size | #Param. | FLOPs | Top-1 acc. |
|---|---|---|---|---|
| **ConvNets** | | | | |
| ResNet-101 | $224^2$ | 45M | - | 77.4 |
| RegNetY-8G | $224^2$ | 39M | 8.0G | 81.7 |
| ResNet-152 | $224^2$ | 60M | - | 78.3 |
| RegNetY-16G | $224^2$ | 84M | 16.0G | 82.9 |
| **Transformers** | | | | |
| DeiT-S | $224^2$ | 22M | 4.6G | 79.8 |
| Swin-T | $224^2$ | 29M | 4.5G | 81.3 |
| EffNet-B4 | $380^2$ | 19M | 4.2G | 82.9 |
| WaveViT-H-S$^\star$ | $224^2$ | 22.7M | 4.1G | 82.9 |
| SpectFormer-H-S$^\star$ | $224^2$ | 22.2M | 3.9G | 84.3 |
| SVT-H-S$^\star$ | $224^2$ | 22M | 3.9G | 84.2 |
| SCT-H-S$^\star$ | $224^2$ | 21.7M | 4.1G | 84.5 |
| EffNet-B5 | $456^2$ | 30M | 9.9G | 83.6 |
| Swin-S | $224^2$ | 50M | 8.7G | 83.0 |
| CMT-B | $224^2$ | 45M | 9.3G | 84.5 |
| MaxViT-S | $224^2$ | 69M | 11.7G | 84.5 |
| iFormer-B | $224^2$ | 48M | 9.4G | 84.6 |
| Wave-ViT-B$^\star$ | $224^2$ | 33M | 7.2G | 84.8 |
| SpectFormer-H-B$^\star$ | $224^2$ | 33.1M | 6.3G | 85.1 |
| SVT-H-B$^\star$ | $224^2$ | 32.8M | 6.3G | 85.2 |
| SCT-H-B$^\star$ | $224^2$ | 32.5M | 6.5G | 85.2 |
| M2-ViT-b | $224^2$ | 45M | - | 79.5 |
| DeiT-B | $224^2$ | 86M | 17.5G | 81.8 |
| Swin-B | $224^2$ | 88M | 15.4G | 83.5 |
| M2-Swin-B | $224^2$ | 50M | - | 83.5 |
| EffNet-B6 | $528^2$ | 43M | 19.0G | 84.0 |
| MaxViT-B | $224^2$ | 120M | 23.4G | 85.0 |
| VOLO-D3$^\star$ | $224^2$ | 86M | 20.6G | 85.4 |
| Wave-ViT-L$^\star$ | $224^2$ | 57M | 14.8G | 85.5 |
| SpectFormer-H-L$^\star$ | $224^2$ | 54.7M | 12.7G | 85.7 |
| SVT-H-L$^\star$ | $224^2$ | 54.0M | 12.7G | 85.7 |
| SCT-H-L$^\star$ | $224^2$ | 54.1M | 13.4G | 85.9 |
| **SSMs** | | | | |
| Vim-Ti | $224^2$ | 7M | - | 76.1 |
| PlainMamba-L1 | $224^2$ | 7M | 3.0G | 77.9 |
| VMamba-T | $224^2$ | 22M | 5.6G | 82.2 |
| SiMBA-S(Monarch) | $224^2$ | 18.5M | 3.6G | 81.1 |
| Mamba-2D-S | $224^2$ | 24M | - | 81.7 |
| SiMBA-S(EinFFT) | $224^2$ | 15.3M | 2.4G | 81.7 |
| LocalVMamba-T | $224^2$ | 26M | 5.7G | 82.7 |
| ViM2-T | $224^2$ | 20M | - | 82.7 |
| SiMBA-S(MLP) | $224^2$ | 26.5M | 5.0G | 84.0 |
| Vim-S | $224^2$ | 26M | - | 80.5 |
| PlainMamba-L2 | $224^2$ | 25M | 8.1G | 81.6 |
| SiMBA-B(Monarch) | $224^2$ | 26.9M | 6.3G | 82.6 |
| Mamba-2D-B | $224^2$ | 92M | - | 83.0 |
| SiMBA-B(EinFFT) | $224^2$ | 22.8M | 5.2G | 83.5 |
| VMamba-S | $224^2$ | 44M | 11.2G | 83.5 |
| LocalVMamba-S | $224^2$ | 50M | 11.4G | 83.7 |
| ViM2-S | $224^2$ | 43M | - | 83.7 |
| SiMBA-B(MLP) | $224^2$ | 40.0M | 9.0G | 84.7 |
| HyenaViT-B | $224^2$ | 88M | - | 78.5 |
| S4ND-ViT-B | $224^2$ | 89M | - | 80.4 |
| PlainMamba-L3 | $224^2$ | 50M | 14.4G | 82.3 |
| VMamba-B | $224^2$ | 75M | 18.0G | 83.2 |
| SiMBA-L(Monarch) | $224^2$ | 42M | 10.7G | 83.8 |
| ViM2-B | $224^2$ | 74M | - | 83.9 |
| SiMBA-L(EinFFT) | $224^2$ | 36.6M | 9.6G | 84.4 |

State-of-the-art results on the LRA benchmark tasks (Tay et al., 2020).

Table: Test accuracy on the LRA benchmark tasks (Tay et al., 2020). "✗" indicates the model did not exceed random guessing. The results for models ranging from Transformer to Performer are sourced from Tay et al. (2020). We compiled this table using data from the HGRN paper by Qin et al. (2023) and the S5 paper by Smith et al. (2022), consolidating the results into a unified presentation below.

| Model | ListOps | Text | Retrieval | Image | Pathfinder | Path-X | Avg. |
|---|---|---|---|---|---|---|---|
| Transformer | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | ✗ | 53.66 |
| Local Attention | 15.82 | 52.98 | 53.39 | 41.46 | 66.63 | ✗ | 46.71 |
| Sparse Trans. | 17.07 | 63.58 | 59.59 | 44.24 | 71.71 | ✗ | 51.03 |
| Longformer | 35.63 | 62.85 | 56.89 | 42.22 | 69.71 | ✗ | 52.88 |
| Linformer | 35.70 | 53.94 | 52.27 | 38.56 | 76.34 | ✗ | 51.14 |
| Reformer | 37.27 | 56.10 | 53.40 | 38.07 | 68.50 | ✗ | 50.56 |
| Sinkhorn Trans. | 33.67 | 61.20 | 53.83 | 41.23 | 67.45 | ✗ | 51.23 |
| Synthesizer | 36.99 | 61.68 | 54.67 | 41.61 | 69.45 | ✗ | 52.40 |
| BigBird | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | ✗ | 54.17 |
| Linear Trans. | 16.13 | 65.90 | 53.09 | 42.34 | 75.30 | ✗ | 50.46 |
| Performer | 18.01 | 65.40 | 53.82 | 42.77 | 77.05 | ✗ | 51.18 |
| cosFormer | 36.50 | 67.70 | 83.15 | 51.23 | 71.96 | - | 51.76 |
| FLASH | 38.70 | 64.10 | 86.10 | 47.40 | 70.25 | - | 51.09 |
| FNet | 35.33 | 65.11 | 59.61 | 38.67 | 77.80 | ✗ | 54.42 |
| Nyströmformer | 37.15 | 65.52 | 79.56 | 41.58 | 70.94 | ✗ | 57.46 |
| Luna-256 | 37.25 | 64.57 | 79.29 | 47.38 | 77.72 | ✗ | 59.37 |
| H-Transformer-1D | 49.53 | 78.69 | 63.99 | 46.05 | 68.78 | ✗ | 61.41 |
| CCNN | 43.60 | 84.08 | - | 88.90 | 91.51 | ✗ | 68.02 |
| S4 | 58.35 | 76.02 | 87.09 | 87.26 | 86.05 | 88.10 | 80.48 |
| DSS(EXP) | 59.70 | 84.60 | 87.60 | 84.90 | 84.70 | 85.60 | 81.18 |
| DSS(SOFTMAX) | 60.60 | 84.80 | 87.80 | 85.70 | 84.60 | 87.80 | 81.88 |
| S4D-LegS | 60.47 | 86.18 | 89.46 | 88.19 | 93.06 | 91.95 | 84.89 |
| Mega-chunk | 58.76 | 90.19 | 90.97 | 85.80 | 94.41 | 93.81 | 85.66 |
| S4-LegS | 59.60 | 86.82 | 90.90 | 88.65 | 94.20 | 96.35 | 86.09 |
| TNN | 61.04 | 87.90 | 90.97 | 88.24 | 93.00 | 96.10 | 86.21 |
| LRU | 60.20 | 89.40 | 89.90 | 89.00 | 95.10 | 94.20 | 86.30 |
| HGRN | 59.95 | 88.14 | 94.23 | 88.69 | 92.92 | 97.50 | 86.91 |
| SGConv | 61.45 | 89.20 | 91.11 | 87.97 | 95.46 | 97.83 | 87.17 |
| Liquid-S4 | 62.75 | 89.02 | 91.20 | 89.50 | 94.80 | 96.66 | 87.32 |
| S5 | 62.15 | 89.31 | 91.40 | 88.00 | 95.33 | 98.58 | 87.46 |
| Mega | 63.14 | 90.43 | 91.25 | 90.44 | 96.01 | 97.98 | 88.21 |

Multivariate Time Series Benchmark Datasets

Table: Multivariate long-term forecasting results. Prediction lengths $T \in \{96, 192, 336, 720\}$ are used for all datasets with a look-back window of 96; each cell reports MSE/MAE. In the source table, the best result is in bold and the second best is <ins>underlined</ins>. This table is adapted from the SiMBA paper (Patro and Agneeswaran, 2024).

| Dataset | T | SiMBA | TimesNet | Crossformer | PatchTST | ETSFormer | DLinear | FEDFormer | Autoformer | Pyraformer | MTGNN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ETTm1 | 96 | 0.324/0.360 | 0.338/0.375 | 0.349/0.395 | 0.339/0.377 | 0.375/0.398 | 0.345/0.372 | 0.379/0.419 | 0.505/0.475 | 0.543/0.510 | 0.379/0.446 |
| ETTm1 | 192 | 0.363/0.382 | 0.374/0.387 | 0.405/0.411 | 0.376/0.392 | 0.408/0.410 | 0.380/0.389 | 0.426/0.441 | 0.553/0.496 | 0.557/0.537 | 0.470/0.428 |
| ETTm1 | 336 | 0.395/0.405 | 0.410/0.411 | 0.432/0.431 | 0.408/0.417 | 0.435/0.428 | 0.413/0.413 | 0.445/0.459 | 0.621/0.537 | 0.754/0.655 | 0.473/0.430 |
| ETTm1 | 720 | 0.451/0.437 | 0.478/0.450 | 0.487/0.463 | 0.499/0.461 | 0.499/0.462 | 0.474/0.453 | 0.543/0.490 | 0.671/0.561 | 0.908/0.724 | 0.553/0.479 |
| ETTm2 | 96 | 0.177/0.263 | 0.187/0.267 | 0.208/0.292 | 0.192/0.273 | 0.189/0.280 | 0.193/0.292 | 0.203/0.287 | 0.255/0.339 | 0.435/0.507 | 0.203/0.299 |
| ETTm2 | 192 | 0.245/0.306 | 0.249/0.309 | 0.263/0.332 | 0.252/0.314 | 0.253/0.319 | 0.284/0.362 | 0.269/0.328 | 0.281/0.340 | 0.730/0.673 | 0.265/0.328 |
| ETTm2 | 336 | 0.304/0.343 | 0.321/0.351 | 0.337/0.369 | 0.318/0.357 | 0.314/0.357 | 0.369/0.427 | 0.325/0.366 | 0.339/0.372 | 1.201/0.845 | 0.365/0.374 |
| ETTm2 | 720 | 0.400/0.399 | 0.408/0.403 | 0.429/0.430 | 0.413/0.416 | 0.414/0.413 | 0.554/0.522 | 0.421/0.415 | 0.433/0.432 | 3.625/1.451 | 0.461/0.459 |
| ETTh1 | 96 | 0.379/0.395 | 0.384/0.402 | 0.384/0.428 | 0.385/0.408 | 0.494/0.479 | 0.386/0.400 | 0.376/0.419 | 0.449/0.459 | 0.664/0.612 | 0.515/0.517 |
| ETTh1 | 192 | 0.432/0.424 | 0.436/0.429 | 0.438/0.452 | 0.431/0.432 | 0.538/0.504 | 0.437/0.432 | 0.420/0.448 | 0.500/0.482 | 0.790/0.681 | 0.553/0.522 |
| ETTh1 | 336 | 0.473/0.443 | 0.491/0.469 | 0.495/0.483 | 0.485/0.462 | 0.574/0.521 | 0.481/0.459 | 0.459/0.465 | 0.521/0.496 | 0.891/0.738 | 0.612/0.577 |
| ETTh1 | 720 | 0.483/0.469 | 0.521/0.500 | 0.522/0.501 | 0.497/0.483 | 0.562/0.535 | 0.519/0.516 | 0.506/0.507 | 0.514/0.512 | 0.963/0.782 | 0.609/0.597 |
| ETTh2 | 96 | 0.290/0.339 | 0.340/0.374 | 0.347/0.391 | 0.343/0.376 | 0.340/0.391 | 0.333/0.387 | 0.358/0.397 | 0.346/0.388 | 0.645/0.597 | 0.354/0.454 |
| ETTh2 | 192 | 0.373/0.390 | 0.402/0.414 | 0.419/0.427 | 0.405/0.417 | 0.430/0.439 | 0.477/0.476 | 0.429/0.439 | 0.456/0.452 | 0.788/0.683 | 0.457/0.464 |
| ETTh2 | 336 | 0.376/0.406 | 0.452/0.452 | 0.449/0.465 | 0.448/0.453 | 0.485/0.479 | 0.594/0.541 | 0.496/0.487 | 0.482/0.486 | 0.907/0.747 | 0.515/0.540 |
| ETTh2 | 720 | 0.407/0.431 | 0.462/0.468 | 0.479/0.505 | 0.464/0.483 | 0.500/0.497 | 0.831/0.657 | 0.463/0.474 | 0.515/0.511 | 0.963/0.783 | 0.532/0.576 |
| Electricity | 96 | 0.165/0.253 | 0.168/0.272 | 0.185/0.288 | 0.159/0.268 | 0.187/0.304 | 0.197/0.282 | 0.193/0.308 | 0.201/0.317 | 0.386/0.449 | 0.217/0.318 |
| Electricity | 192 | 0.173/0.262 | 0.198/0.300 | 0.211/0.312 | 0.195/0.296 | 0.212/0.329 | 0.209/0.301 | 0.214/0.329 | 0.231/0.338 | 0.376/0.443 | 0.260/0.348 |
| Electricity | 336 | 0.188/0.277 | 0.198/0.300 | 0.211/0.312 | 0.195/0.296 | 0.212/0.329 | 0.209/0.301 | 0.214/0.329 | 0.231/0.338 | 0.376/0.443 | 0.260/0.348 |
| Electricity | 720 | 0.214/0.305 | 0.220/0.320 | 0.223/0.335 | 0.215/0.317 | 0.233/0.345 | 0.245/0.333 | 0.246/0.355 | 0.254/0.361 | 0.376/0.445 | 0.290/0.369 |
| Traffic | 96 | 0.468/0.268 | 0.593/0.321 | 0.591/0.329 | 0.583/0.319 | 0.607/0.392 | 0.650/0.396 | 0.587/0.366 | 0.613/0.388 | 0.867/0.468 | 0.660/0.437 |
| Traffic | 192 | 0.413/0.317 | 0.617/0.336 | 0.607/0.345 | 0.591/0.331 | 0.621/0.399 | 0.598/0.370 | 0.604/0.373 | 0.616/0.382 | 0.869/0.467 | 0.649/0.438 |
| Traffic | 336 | 0.529/0.284 | 0.629/0.336 | 0.613/0.339 | 0.599/0.332 | 0.622/0.396 | 0.605/0.373 | 0.621/0.383 | 0.622/0.337 | 0.881/0.469 | 0.653/0.472 |
| Traffic | 720 | 0.564/0.297 | 0.640/0.350 | 0.620/0.348 | 0.601/0.341 | 0.632/0.396 | 0.645/0.394 | 0.626/0.382 | 0.660/0.408 | 0.896/0.473 | 0.639/0.437 |
| Weather | 96 | 0.176/0.219 | 0.172/0.220 | 0.191/0.251 | 0.171/0.230 | 0.197/0.281 | 0.196/0.255 | 0.217/0.296 | 0.266/0.336 | 0.622/0.556 | 0.230/0.329 |
| Weather | 192 | 0.222/0.260 | 0.219/0.261 | 0.219/0.279 | 0.219/0.271 | 0.237/0.312 | 0.237/0.296 | 0.276/0.336 | 0.307/0.367 | 0.739/0.624 | 0.263/0.322 |
| Weather | 336 | 0.275/0.297 | 0.280/0.306 | 0.287/0.332 | 0.277/0.321 | 0.298/0.353 | 0.283/0.335 | 0.339/0.380 | 0.359/0.395 | 1.004/0.753 | 0.354/0.396 |
| Weather | 720 | 0.350/0.349 | 0.365/0.359 | 0.368/0.378 | 0.365/0.367 | 0.352/0.288 | 0.345/0.381 | 0.403/0.428 | 0.419/0.428 | 1.420/0.934 | 0.409/0.371 |
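For context on the metrics (a minimal sketch, not code from the SiMBA paper or any cited baseline): MSE and MAE are averaged over the batch, the prediction horizon $T$, and all variates, so lower is better for every dataset/horizon pair.

```python
import numpy as np

def mse(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean squared error averaged over all axes of the forecast tensor."""
    return float(np.mean((pred - true) ** 2))

def mae(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean absolute error averaged over all axes of the forecast tensor."""
    return float(np.mean(np.abs(pred - true)))

# Toy usage with shape (batch, T, variates), e.g. T=96 on ETTm1's 7 variates.
pred = np.zeros((32, 96, 7))
true = np.ones((32, 96, 7))
print(mse(pred, true), mae(pred, true))  # 1.0 1.0
```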

Comparison with SoTA methods on 8 benchmark datasets for Multimodal Applications

Benchmark names are abbreviated due to space limits. VQA-v2 (Goyal et al., 2017); GQA (Hudson and Manning, 2019); SQA-I: ScienceQA-IMG (Lu et al., 2022); VQA-T: TextVQA (Singh et al., 2019); POPE (Li et al., 2023); MME (Yin et al., 2023); MMB: MMBench (Liu et al., 2023); MM-Vet (Yu et al., 2023). PT and IT indicate the number of samples in the pretraining and instruction tuning stages, respectively.

Table: Comparison with State-of-the-Art (SoTA) methods on 8 benchmarks. Benchmark names are abbreviated for space considerations. PT and IT indicate the number of samples in the pretraining and instruction tuning stages, respectively. This table is adapted from the VL-Mamba paper (Qiao et al., 2024).

| Method | LLM | PT | IT | VQA-v2 | GQA | SQA-I | VQA-T | POPE | MME | MMB | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP-2 | Vicuna-13B | 129M | - | 41.0 | 41.0 | 61.0 | 42.5 | 85.3 | 1293.8 | - | 22.4 |
| MiniGPT-4 | Vicuna-7B | 5M | 5K | - | 32.2 | - | - | - | 581.7 | 23.0 | - |
| InstructBLIP | Vicuna-7B | 129M | 1.2M | - | 49.2 | 60.5 | 50.1 | - | - | 36.0 | 26.2 |
| InstructBLIP | Vicuna-13B | 129M | 1.2M | - | 49.5 | 63.1 | 50.7 | 78.9 | 1212.8 | - | 25.6 |
| Shikra | Vicuna-13B | 600K | 5.5M | 77.4 | - | - | - | - | - | 58.8 | - |
| Otter | LLaMA-7B | - | - | - | - | - | - | - | 1292.3 | 48.3 | 24.6 |
| mPLUG-Owl | LLaMA-7B | 2.1M | 102K | - | - | - | - | - | 967.3 | 49.4 | - |
| IDEFICS-9B | LLaMA-7B | 353M | 1M | 50.9 | 38.4 | - | 25.9 | - | - | 48.2 | - |
| IDEFICS-80B | LLaMA-65B | 353M | 1M | 60.0 | 45.2 | - | 30.9 | - | - | 54.5 | - |
| Qwen-VL | Qwen-7B | 1.4B | 50M | 78.8 | 59.3 | 67.1 | 63.8 | - | - | 38.2 | - |
| Qwen-VL-Chat | Qwen-7B | 1.4B | 50M | 78.2 | 57.5 | 68.2 | 61.5 | - | 1487.5 | 60.6 | - |
| LLaVA-1.5 | Vicuna-7B | 558K | 665K | 78.5 | 62.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 30.5 |
| LLaVA-1.5 | Vicuna-13B | 558K | 665K | 80.0 | 63.3 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 35.4 |
| LLaVA-Phi | Phi-2-2.7B | 558K | 665K | 71.4 | - | 68.4 | 48.6 | 85.0 | 1335.1 | 59.8 | 28.9 |
| MobileVLM-3B | MobileLLaMA-2.7B | 558K | 665K | - | 59.0 | 61.2 | 47.5 | 84.9 | 1288.9 | 59.6 | - |
| Cobra | Mamba-2.8B | - | - | 75.9 | 58.5 | - | 46.0 | 88.0 | - | - | - |
| VL-Mamba | Mamba LLM-2.8B | 558K | 665K | 76.6 | 56.2 | 65.4 | 48.9 | 84.4 | 1369.6 | 57.0 | 32.6 |