Awesome
Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images
This is the offical PyTorch code for paper "Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images"
Contents
OPT-RSVG Dataset
The dataset contains 25,452 RS images and 48,952 image-query pairs. Training, validation, and test sample numbers for OPT-RSVG datasets.
No. | Class Name | OPT-RSVG dataset | ||
---|---|---|---|---|
Training | Validation | Test | ||
C01 | airplane | 979 | 230 | 1142 |
C02 | ground track field | 1600 | 365 | 2066 |
C03 | tennis court | 1093 | 284 | 1313 |
C04 | bridge | 1699 | 452 | 2212 |
C05 | basketball court | 1036 | 263 | 1385 |
C06 | storage tank | 1050 | 271 | 1264 |
C07 | ship | 1084 | 243 | 1241 |
C08 | baseball diamond | 1477 | 361 | 1744 |
C09 | T junction | 1663 | 425 | 2055 |
C10 | crossroad | 1670 | 405 | 2088 |
C11 | parking lot | 1049 | 268 | 1368 |
C12 | harbor | 758 | 209 | 953 |
C13 | vehicle | 3294 | 811 | 4083 |
C14 | swimming pool | 1128 | 308 | 1563 |
- | Total | 19580 | 4895 | 24477 |
The dataset is open source: Google Drive, Baidu Netdisk 提取码: 92yk
LPVA Framework
The above line introduces the proposed framework of LPVA. It consists of five components: (1) Linguistic Backbone, which extracts linguistic features from referring expressions, (2) Progressive Attention module, which generates dynamic weights and biases for visual backbone conditioned on specific expressions, (3) Visual Backbone, which extracts visual features from raw images and its attention can be modified by language-adaptive weights, (4) Multi-Level Feature Enhancement Decoder, which aggregates visual contextual information to enhance the uniqueness, and (5) Localization Module, which predicts the bounding box.
Performance Comparison
Comparison with the SOTA methods for LPVA on the test set of OPT-RSVG
Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cmuIoU |
---|---|---|---|---|---|---|---|---|---|---|
One-stage: | ||||||||||
ZSGNet | ICCV'19 | ResNet-50 | BiLSTM | 48.64 | 47.32 | 43.85 | 27.69 | 6.33 | 43.01 | 47.71 |
FAOA | ICCV'19 | DarkNet-53 | BERT | 68.13 | 64.30 | 57.15 | 41.83 | \textcolor{blue}{15.33} | 58.79 | 65.20 |
ReSC | ECCV'20 | DarkNet-53 | BERT | 69.12 | 64.63 | 58.20 | 43.01 | 14.85 | 60.18 | 65.84 |
LBYL-Net | CVPR'21 | DarkNet-53 | BERT | 70.22 | 65.39 | 58.65 | 37.54 | 9.46 | 60.57 | 70.28 |
Transformer-based: | ||||||||||
TransVG | CVPR'21 | ResNet-50 | BERT | 69.96 | 64.17 | 54.68 | 38.01 | 12.75 | 59.80 | 69.31 |
QRNet | CVPR'22 | Swin | BERT | 72.03 | 65.94 | 56.90 | 40.70 | 13.35 | 60.82 | 75.39 |
VLTVG | CVPR'22 | ResNet-50 | BERT | 71.84 | 66.54 | 57.79 | 41.63 | 14.62 | 60.78 | 70.69 |
VLTVG | CVPR'22 | ResNet-101 | BERT | 73.50 | 68.13 | 59.93 | 43.45 | 15.31 | 62.48 | 73.86 |
MGVLF | TGRS'23 | ResNet-50 | BERT | 72.19 | 66.86 | 58.02 | 42.51 | 15.30 | 61.51 | 71.80 |
Ours: | ||||||||||
LPVA | - | ResNet-50 | BERT | 78.03 | 73.32 | 62.22 | 49.60 | 25.61 | 66.20 | 76.30 |
Comparison with the SOTA methods for LPVA on the test set of DIOR-RSVG
Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cmuIoU |
---|---|---|---|---|---|---|---|---|---|---|
One-stage: | ||||||||||
ZSGNet | ICCV'19 | ResNet-50 | BiLSTM | 51.67 | 48.13 | 42.30 | 32.41 | 10.15 | 44.12 | 51.65 |
FAOA | ICCV'19 | DarkNet-53 | BERT | 67.21 | 64.18 | 59.23 | 50.87 | 34.44 | 59.76 | 63.14 |
ReSC | ECCV'20 | DarkNet-53 | BERT | 72.71 | 68.92 | 63.01 | 53.70 | 33.37 | 64.24 | 68.10 |
LBYL-Net | CVPR'21 | DarkNet-53 | BERT | 73.78 | 69.22 | 65.56 | 47.89 | 15.69 | 65.92 | 76.37 |
Transformer-based: | ||||||||||
TransVG | CVPR'21 | ResNet-50 | BERT | 72.41 | 67.38 | 60.05 | 49.10 | 27.84 | 63.56 | 76.27 |
QRNet | CVPR'22 | Swin | BERT | 75.84 | 70.82 | 62.27 | 49.63 | 25.69 | 66.80 | 83.02 |
VLTVG | CVPR'22 | ResNet-50 | BERT | 69.41 | 65.16 | 58.44 | 46.56 | 24.37 | 59.96 | 71.97 |
VLTVG | CVPR'22 | ResNet-101 | BERT | 75.79 | 72.22 | 66.33 | 55.17 | 33.11 | 66.32 | 77.85 |
MGVLF | TGRS'23 | ResNet-50 | BERT | 75.98 | 72.06 | 65.23 | 54.89 | 35.65 | 67.48 | 78.63 |
Ours: | ||||||||||
LPVA | - | ResNet-50 | BERT | 82.27 | 77.44 | 72.25 | 60.98 | 39.55 | 72.35 | 85.11 |