Vision Longformer for Object Detection

This project provides the source code for the object detection part of the Vision Longformer paper. It is built on top of detectron2.

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

The classification part of the code and checkpoints can be found here.

Updates

Usage

Here is an example command for evaluating a pretrained ViL-Small model on COCO:

```bash
python -m pip install -e .

ln -s /mnt/data_storage datasets

DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 1 --eval-only \
    --config configs/msvit_maskrcnn_fpn_1x_small_sparse.yaml \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
    SOLVER.AMP.ENABLED True \
    MODEL.WEIGHTS /mnt/model_storage/msvit_det/visionlongformer/vilsmall/maskrcnn1x/model_final.pth
```
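
The MODEL.TRANSFORMER.MSVIT.ARCH value encodes the multi-scale backbone as one underscore-separated block per stage, each block being a comma-separated list of single-letter keys followed by integer values. The snippet below is only an illustration of how such a string can be decoded; the per-key interpretation in the comments is a rough guide based on the paper, not taken from this repository's parser.

```python
# Illustration only: decode an MSVIT.ARCH string into per-stage dictionaries.
# Rough guide to the keys (based on the paper, not on this repo's code):
#   l: stage index, h: attention heads, d: embedding dim, n: number of blocks,
#   p: patch/downsampling size, g: global tokens, f: local window size,
#   s / a: sparse-attention and absolute-position flags.
ARCH = ("l1,h3,d96,n1,s1,g1,p4,f7,a0_"
        "l2,h3,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h6,d384,n8,s1,g1,p2,f7,a0_"
        "l4,h12,d768,n1,s1,g0,p2,f7,a0")

def parse_arch(arch):
    """Split the string into one {key: int_value} dict per stage."""
    return [
        {field[0]: int(field[1:]) for field in stage.split(",")}
        for stage in arch.split("_")
    ]

for stage_cfg in parse_arch(ARCH):
    print(stage_cfg)
# first stage -> {'l': 1, 'h': 3, 'd': 96, 'n': 1, 's': 1, 'g': 1, 'p': 4, 'f': 7, 'a': 0}
```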

Here is an example command for training the ViL-Small model on COCO:

```bash
python -m pip install -e .

ln -s /mnt/data_storage datasets

# convert the classification checkpoint into a detection checkpoint for initialization
python3 converter.py --source_model "/mnt/model_storage/msvit/visionlongformer/small1281_relative/model_best.pth" \
    --output_model msvit_pretrain.pth --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0"
```
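
converter.py re-packages the ImageNet classification checkpoint so that detectron2 can load it as backbone initialization. The following is only a minimal sketch of that idea, assuming the classification checkpoint stores its weights under a "model" key and uses a "head." prefix for the classifier; the actual script may filter or re-key the weights differently.

```python
# Sketch of a classification -> detection checkpoint conversion (not the real converter.py).
# Assumptions: weights may live under a "model" key, and the ImageNet classifier
# head uses a "head." prefix; adjust to the actual checkpoint layout if different.
import torch

src = torch.load("model_best.pth", map_location="cpu")
state_dict = src["model"] if isinstance(src, dict) and "model" in src else src

converted = {
    name: tensor
    for name, tensor in state_dict.items()
    if not name.startswith("head.")  # drop the ImageNet classifier head
}

# "matching_heuristics" lets detectron2's DetectionCheckpointer match keys loosely.
torch.save({"model": converted, "matching_heuristics": True}, "msvit_pretrain.pth")
```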

```bash
# train with the converted detection checkpoint as initialization
DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 8 --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
    MODEL.WEIGHTS msvit_pretrain.pth MODEL.TRANSFORMER.DROP_PATH 0.2 \
    MODEL.TRANSFORMER.MSVIT.ATTN_TYPE longformerhand \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
    SOLVER.AMP.ENABLED True SOLVER.BASE_LR 1e-4 SOLVER.WEIGHT_DECAY 0.1 \
    TEST.EVAL_PERIOD 7330 SOLVER.IMS_PER_BATCH 16
```
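
Two of the overrides above follow detectron2's usual conventions: SOLVER.IMS_PER_BATCH counts images across all GPUs, and TEST.EVAL_PERIOD is measured in iterations. The small worked example below (our own bookkeeping, not part of the repo) shows how the numbers relate.

```python
# Rough bookkeeping for the overrides above (detectron2 conventions assumed).
total_batch = 16          # SOLVER.IMS_PER_BATCH counts images across all GPUs
num_gpus = 8
print(total_batch // num_gpus, "images per GPU")   # -> 2

coco_train_images = 117_266   # approx. COCO train2017 size after filtering empty annotations
print(coco_train_images / total_batch)             # ~7329 iterations per epoch,
                                                   # matching TEST.EVAL_PERIOD 7330
```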

Model Zoo on COCO

Vision Longformer with relative positional bias

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | checkpoints | log |
|---|---|---|---|---|---|---|---|---|---|---|
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 41.4 | 38.1 | 26.9M | 145.6G | ckpt, config | log |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.1 | 3x | 44.2 | 40.6 | 26.9M | 145.6G | ckpt, config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 1x | 44.9 | 41.1 | 45.0M | 218.3G | ckpt, config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 3x | 47.1 | 42.7 | 45.0M | 218.3G | ckpt, config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.2 | 1x | 47.6 | 43.0 | 60.1M | 293.8G | ckpt, config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 48.9 | 44.2 | 60.1M | 293.8G | ckpt, config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 1x | 48.6 | 43.6 | 76.1M | 384.4G | ckpt, config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 49.6 | 44.5 | 76.1M | 384.4G | ckpt, config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.05 | 1x | 40.8 | -- | 16.64M | 182.7G | ckpt, config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.1 | 3x | 43.6 | -- | 16.64M | 182.7G | ckpt, config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.1 | 1x | 44.2 | -- | 35.68M | 254.8G | ckpt, config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.2 | 3x | 45.9 | -- | 35.68M | 254.8G | ckpt, config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.2 | 1x | 46.8 | -- | 50.77M | 330.4G | ckpt, config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 47.9 | -- | 50.77M | 330.4G | ckpt, config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 1x | 47.8 | -- | 66.74M | 420.9G | ckpt, config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 48.6 | -- | 66.74M | 420.9G | ckpt, config | log |

See more fine-grained results in Table 6 and Table 7 of the Vision Longformer paper. We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2, 0.3].

Comparison of various efficient attention mechanisms with absolute positional embedding (Small size)

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | Memory | checkpoints | log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| srformer/64 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.4 | 34.6 | 73.3M | 224.1G | 7.1G | ckpt, config | log |
| srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 39.9 | 37.3 | 51.5M | 268.3G | 13.6G | ckpt, config | log |
| Partial srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.4 | 39.0 | 46.8M | 352.1G | 22.6G | ckpt, config | log |
| global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 34.8 | 33.4 | 45.2M | 226.4G | 7.6G | ckpt, config | log |
| Partial global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.5 | 39.2 | 45.1M | 326.5G | 20.1G | ckpt, config | log |
| performer | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.1 | 34.3 | 45.0M | 251.5G | 8.4G | ckpt, config | log |
| Partial performer | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 42.3 | 39.1 | 45.0M | 343.7G | 20.0G | ckpt, config | log |
| ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.9 | 39.6 | 45.0M | 218.3G | 7.4G | ckpt, config | log |
| Partial ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 43.3 | 39.8 | 45.0M | 326.8G | 19.5G | ckpt, config | log |

We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2].