
Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images

This is the official PyTorch implementation of the paper "Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images" (LPVA).

Contents

- OPT-RSVG Dataset
- LPVA Framework
- Performance Comparison

OPT-RSVG Dataset

The OPT-RSVG dataset contains 25,452 RS images and 48,952 image-query pairs. The table below lists the number of training, validation, and test samples per class.

| No. | Class Name | Training | Validation | Test |
| --- | --- | --- | --- | --- |
| C01 | airplane | 979 | 230 | 1142 |
| C02 | ground track field | 1600 | 365 | 2066 |
| C03 | tennis court | 1093 | 284 | 1313 |
| C04 | bridge | 1699 | 452 | 2212 |
| C05 | basketball court | 1036 | 263 | 1385 |
| C06 | storage tank | 1050 | 271 | 1264 |
| C07 | ship | 1084 | 243 | 1241 |
| C08 | baseball diamond | 1477 | 361 | 1744 |
| C09 | T junction | 1663 | 425 | 2055 |
| C10 | crossroad | 1670 | 405 | 2088 |
| C11 | parking lot | 1049 | 268 | 1368 |
| C12 | harbor | 758 | 209 | 953 |
| C13 | vehicle | 3294 | 811 | 4083 |
| C14 | swimming pool | 1128 | 308 | 1563 |
| - | Total | 19580 | 4895 | 24477 |

The dataset is publicly available via Google Drive and Baidu Netdisk (extraction code: 92yk).
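
As a quick sanity check, the per-class counts in the table above can be summed to recover the split totals. The minimal sketch below simply hard-codes the numbers reported in the table (it does not use any dataset-loading API from this repository):

```python
# Per-class (training, validation, test) counts copied from the OPT-RSVG table above.
SPLIT_COUNTS = {
    "airplane": (979, 230, 1142),
    "ground track field": (1600, 365, 2066),
    "tennis court": (1093, 284, 1313),
    "bridge": (1699, 452, 2212),
    "basketball court": (1036, 263, 1385),
    "storage tank": (1050, 271, 1264),
    "ship": (1084, 243, 1241),
    "baseball diamond": (1477, 361, 1744),
    "T junction": (1663, 425, 2055),
    "crossroad": (1670, 405, 2088),
    "parking lot": (1049, 268, 1368),
    "harbor": (758, 209, 953),
    "vehicle": (3294, 811, 4083),
    "swimming pool": (1128, 308, 1563),
}

# Column-wise sums reproduce the split totals and the 48,952 image-query pairs.
train, val, test = (sum(col) for col in zip(*SPLIT_COUNTS.values()))
print(train, val, test, train + val + test)  # 19580 4895 24477 48952
```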

LPVA Framework

The figure above illustrates the proposed LPVA framework. It consists of five components: (1) a Linguistic Backbone, which extracts linguistic features from referring expressions; (2) a Progressive Attention module, which generates dynamic weights and biases for the visual backbone conditioned on the specific expression; (3) a Visual Backbone, which extracts visual features from raw images and whose attention can be modulated by the language-adaptive weights; (4) a Multi-Level Feature Enhancement Decoder, which aggregates visual contextual information to enhance the distinctiveness of the target features; and (5) a Localization Module, which predicts the bounding box.
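
For orientation, here is a minimal, hypothetical PyTorch sketch of how these five components could be wired together in a forward pass. The class and module names (`LPVASketch`, `linguistic_backbone`, `progressive_attention`, `visual_backbone`, `multi_level_decoder`, `localization_head`) and their interfaces are illustrative placeholders, not the actual classes in this repository:

```python
import torch.nn as nn


class LPVASketch(nn.Module):
    """Illustrative wiring of the five LPVA components (placeholder modules)."""

    def __init__(self, linguistic_backbone, progressive_attention,
                 visual_backbone, multi_level_decoder, localization_head):
        super().__init__()
        self.linguistic_backbone = linguistic_backbone      # (1) text encoder, e.g. BERT
        self.progressive_attention = progressive_attention  # (2) predicts language-adaptive weights/biases
        self.visual_backbone = visual_backbone              # (3) image encoder modulated by those parameters
        self.multi_level_decoder = multi_level_decoder      # (4) aggregates multi-level visual context
        self.localization_head = localization_head          # (5) regresses the bounding box

    def forward(self, image, token_ids, token_mask):
        # (1) Encode the referring expression.
        text_feats = self.linguistic_backbone(token_ids, token_mask)

        # (2) Generate dynamic weights and biases conditioned on the expression.
        dyn_weights, dyn_biases = self.progressive_attention(text_feats)

        # (3) Extract visual features, with attention modulated by the language-adaptive parameters.
        visual_feats = self.visual_backbone(image, dyn_weights, dyn_biases)

        # (4) Enhance the target features with aggregated multi-level visual context.
        fused = self.multi_level_decoder(visual_feats, text_feats)

        # (5) Predict the box, e.g. as normalized (cx, cy, w, h).
        return self.localization_head(fused)
```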

Performance Comparison

Comparison of LPVA with state-of-the-art (SOTA) methods on the OPT-RSVG test set (all metrics in %).

| Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cmuIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *One-stage:* | | | | | | | | | | |
| ZSGNet | ICCV'19 | ResNet-50 | BiLSTM | 48.64 | 47.32 | 43.85 | 27.69 | 6.33 | 43.01 | 47.71 |
| FAOA | ICCV'19 | DarkNet-53 | BERT | 68.13 | 64.30 | 57.15 | 41.83 | 15.33 | 58.79 | 65.20 |
| ReSC | ECCV'20 | DarkNet-53 | BERT | 69.12 | 64.63 | 58.20 | 43.01 | 14.85 | 60.18 | 65.84 |
| LBYL-Net | CVPR'21 | DarkNet-53 | BERT | 70.22 | 65.39 | 58.65 | 37.54 | 9.46 | 60.57 | 70.28 |
| *Transformer-based:* | | | | | | | | | | |
| TransVG | CVPR'21 | ResNet-50 | BERT | 69.96 | 64.17 | 54.68 | 38.01 | 12.75 | 59.80 | 69.31 |
| QRNet | CVPR'22 | Swin | BERT | 72.03 | 65.94 | 56.90 | 40.70 | 13.35 | 60.82 | 75.39 |
| VLTVG | CVPR'22 | ResNet-50 | BERT | 71.84 | 66.54 | 57.79 | 41.63 | 14.62 | 60.78 | 70.69 |
| VLTVG | CVPR'22 | ResNet-101 | BERT | 73.50 | 68.13 | 59.93 | 43.45 | 15.31 | 62.48 | 73.86 |
| MGVLF | TGRS'23 | ResNet-50 | BERT | 72.19 | 66.86 | 58.02 | 42.51 | 15.30 | 61.51 | 71.80 |
| *Ours:* | | | | | | | | | | |
| LPVA | - | ResNet-50 | BERT | 78.03 | 73.32 | 62.22 | 49.60 | 25.61 | 66.20 | 76.30 |
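
For reference, Pr@τ is the percentage of test expressions whose predicted box reaches an IoU of at least τ with the ground-truth box, meanIoU averages the per-sample IoU, and cmuIoU is the cumulative IoU (total intersection area divided by total union area over all samples). Below is a minimal sketch of these metrics, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the function names are illustrative and not taken from this repository:

```python
import numpy as np


def box_iou_terms(pred, gt):
    """Return intersection and union areas of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter, area_pred + area_gt - inter


def grounding_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute Pr@thr, meanIoU, and cumulative IoU for paired predicted/ground-truth boxes."""
    inters, unions, ious = [], [], []
    for pred, gt in zip(preds, gts):
        inter, union = box_iou_terms(pred, gt)
        inters.append(inter)
        unions.append(union)
        ious.append(inter / union if union > 0 else 0.0)
    ious = np.asarray(ious)
    metrics = {f"Pr@{t}": 100.0 * float((ious >= t).mean()) for t in thresholds}
    metrics["meanIoU"] = 100.0 * float(ious.mean())
    metrics["cmuIoU"] = 100.0 * float(sum(inters) / sum(unions))
    return metrics
```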

Comparison of LPVA with SOTA methods on the DIOR-RSVG test set (all metrics in %).

| Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cmuIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *One-stage:* | | | | | | | | | | |
| ZSGNet | ICCV'19 | ResNet-50 | BiLSTM | 51.67 | 48.13 | 42.30 | 32.41 | 10.15 | 44.12 | 51.65 |
| FAOA | ICCV'19 | DarkNet-53 | BERT | 67.21 | 64.18 | 59.23 | 50.87 | 34.44 | 59.76 | 63.14 |
| ReSC | ECCV'20 | DarkNet-53 | BERT | 72.71 | 68.92 | 63.01 | 53.70 | 33.37 | 64.24 | 68.10 |
| LBYL-Net | CVPR'21 | DarkNet-53 | BERT | 73.78 | 69.22 | 65.56 | 47.89 | 15.69 | 65.92 | 76.37 |
| *Transformer-based:* | | | | | | | | | | |
| TransVG | CVPR'21 | ResNet-50 | BERT | 72.41 | 67.38 | 60.05 | 49.10 | 27.84 | 63.56 | 76.27 |
| QRNet | CVPR'22 | Swin | BERT | 75.84 | 70.82 | 62.27 | 49.63 | 25.69 | 66.80 | 83.02 |
| VLTVG | CVPR'22 | ResNet-50 | BERT | 69.41 | 65.16 | 58.44 | 46.56 | 24.37 | 59.96 | 71.97 |
| VLTVG | CVPR'22 | ResNet-101 | BERT | 75.79 | 72.22 | 66.33 | 55.17 | 33.11 | 66.32 | 77.85 |
| MGVLF | TGRS'23 | ResNet-50 | BERT | 75.98 | 72.06 | 65.23 | 54.89 | 35.65 | 67.48 | 78.63 |
| *Ours:* | | | | | | | | | | |
| LPVA | - | ResNet-50 | BERT | 82.27 | 77.44 | 72.25 | 60.98 | 39.55 | 72.35 | 85.11 |