<div align="center"> <h1>Mamba-YOLO-World</h1> <h3>Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection</h3> Haoxuan Wang, Qingdong He, Jinlong Peng, Hao Yang, Mingmin Chi, Yabiao Wang </div>

Abstract
Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well-suited for scenarios prioritizing speed and efficiency. However, its performance is hindered by its neck feature fusion mechanism, which incurs quadratic complexity and limited guided receptive fields. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm with linear complexity and globally guided receptive fields. It leverages multi-modal input sequences and Mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
✨ News
- 2024-10-30: 🤗 We provide the Model Weights and Visualization Results on HuggingFace.
- 2024-09-24: 🚀 We provide all the Model Weights for the community.
- 2024-09-14: 💎 We provide the Mamba-YOLO-World source code for the community.
- 2024-09-12: We provide the Visualization Results of ZERO-SHOT Inference on LVIS generated by Mamba-YOLO-World and YOLO-World for comparison.
Introduction
This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for Mamba-YOLO-World.
- We present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion-PAN as its neck architecture.
- We introduce a State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm, with O(N+1) complexity and globally guided receptive fields (a conceptual sketch of the guided scanning idea follows this list).
- Experiments demonstrate that our model outperforms the original YOLO-World while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
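The snippet below is a minimal conceptual sketch, not the repository's MambaFusion-PAN implementation: it only illustrates how laying out text and image tokens as a single sequence lets a selective scan's hidden state carry textual guidance into the image positions. The `mamba_ssm.Mamba` block, tensor shapes, and variable names are illustrative assumptions.

```python
# Conceptual sketch only: text tokens are scanned before image tokens so that the
# SSM hidden state accumulated over the text conditions the scan over the image.
# Shapes and the use of mamba_ssm's Mamba block are illustrative assumptions.
import torch
from mamba_ssm import Mamba

dim = 256
fusion = Mamba(d_model=dim).cuda()                 # selective state space block

img_tokens = torch.randn(1, 400, dim).cuda()       # flattened 20x20 image feature map (assumed)
txt_tokens = torch.randn(1, 80, dim).cuda()        # text embeddings for the vocabulary (assumed)

seq = torch.cat([txt_tokens, img_tokens], dim=1)   # text first, then image
fused = fusion(seq)[:, txt_tokens.shape[1]:, :]    # keep only the image positions
print(fused.shape)                                 # torch.Size([1, 400, 256])
```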
📷 Visualization Results
- We adopt the pre-trained Mamba-YOLO-World-S, Mamba-YOLO-World-M, Mamba-YOLO-World-L, YOLO-World-v2-S, YOLO-World-v2-M, and YOLO-World-v2-L models and conduct zero-shot inference on LVIS-val2017 (COCO-val2017 images with the LVIS vocabulary). Specifically, the LVIS vocabulary contains 1203 categories.
- All visualization results are available at: https://pan.quark.cn/s/450070c03c58 (if you use Quark) and https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main/zeroshot_pictures_COCO_Comparison (if you use HuggingFace). You are welcome to download them and compare our Mamba-YOLO-World with the original YOLO-World across the small (S), medium (M), and large (L) size variants.
- The visualization results demonstrate that our Mamba-YOLO-World significantly outperforms YOLO-World (even YOLO-World-v2, the latest version of YOLO-World) in terms of accuracy and generalization across all size variants.
Model Zoo
Zero-shot Evaluation on LVIS-minival dataset
model | Pre-train Data | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | weights on Quark | weights on HuggingFace |
---|---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 27.7 | 19.5 | 27.0 | 29.9 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 32.8 | 27.0 | 31.9 | 34.8 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 35.0 | 29.3 | 34.2 | 36.8 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Zero-shot Evaluation on COCO dataset
model | Pre-train Data | AP | AP<sub>50</sub> | AP<sub>75</sub> | weights on Quark | weights on HuggingFace |
---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 38.0 | 52.9 | 41.0 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 43.2 | 58.8 | 46.6 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 45.4 | 61.3 | 49.4 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Fine-tuning Evaluation on COCO dataset
model | Pre-train Data | AP | AP<sub>50</sub> | AP<sub>75</sub> | weights on Quark | weights on HuggingFace |
---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 46.4 | 62.5 | 50.5 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 51.4 | 68.2 | 56.1 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 54.1 | 71.1 | 59.0 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Getting started
1. Installation
Mamba-YOLO-World is developed based on `torch==2.0.0`, `mamba-ssm==2.1.0`, `triton==2.1.0`, `supervision==0.20.0`, `mmcv==2.0.1`, `mmyolo==0.6.0`, and `mmdetection==3.3.0`.
You need to link `mmyolo` under the `third_party` directory.
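The repository does not spell out the full environment setup; the commands below are a minimal sketch assuming a CUDA-enabled Python environment. Using `openmim` for the OpenMMLab packages and symlinking the installed `mmyolo` package are assumptions, not instructions from the authors.

```bash
# Minimal environment sketch; package sources and the symlink step are assumptions.
pip install torch==2.0.0 triton==2.1.0 supervision==0.20.0
pip install mamba-ssm==2.1.0
pip install openmim
mim install mmcv==2.0.1 mmdet==3.3.0 mmyolo==0.6.0

# Link mmyolo under third_party (cloning the mmyolo repo there would also work).
mkdir -p third_party
ln -s "$(python -c 'import mmyolo, os; print(os.path.dirname(mmyolo.__file__))')" third_party/mmyolo
```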
2. Preparing Data
We provide the details about the pre-training data in `docs/data`.
Evaluation
```bash
./tools/dist_test.sh configs/mamba2_yolo_world_s.py CHECKPOINT_FILEPATH num_gpus_per_node
./tools/dist_test.sh configs/mamba2_yolo_world_m.py CHECKPOINT_FILEPATH num_gpus_per_node
./tools/dist_test.sh configs/mamba2_yolo_world_l.py CHECKPOINT_FILEPATH num_gpus_per_node
```
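`CHECKPOINT_FILEPATH` and `num_gpus_per_node` are placeholders. For example (the checkpoint filename and GPU count below are assumptions; use whichever weights file you downloaded):

```bash
# Evaluate the small model on 8 GPUs; the checkpoint path is illustrative.
./tools/dist_test.sh configs/mamba2_yolo_world_s.py weights/mamba_yolo_world_s.pth 8
```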
Pre-training
```bash
./tools/dist_train.sh configs/mamba2_yolo_world_s.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_m.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_l.py num_gpus_per_node --amp
```
Fine-tuning
```bash
./tools/dist_train.sh configs/mamba2_yolo_world_s_mask-refine_finetune_coco.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_m_mask-refine_finetune_coco.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_l_mask-refine_finetune_coco.py num_gpus_per_node --amp
```
Demo
- `image_demo.py`: inference with images or a directory of images.
- `video_demo.py`: inference on videos.
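A hypothetical invocation is shown below; the positional arguments and flags are assumptions based on typical mmyolo/YOLO-World demo scripts, so check `python image_demo.py --help` for the actual interface.

```bash
# Hypothetical demo run; argument names and paths are assumptions, verify with --help.
python image_demo.py configs/mamba2_yolo_world_l.py weights/mamba_yolo_world_l.pth \
    demo_images/ --texts "person,dog,backpack"
```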
Acknowledgement
We sincerely thank mmyolo, mmdetection, YOLO-World, Mamba and VMamba for providing their wonderful code to the community!