awesome-remote-image-captioning

The project is currently under construction

Templates

Papers
| [paper name](paper url)<br/>*authors*<br/>> affiliations<br/>> journal year<br/>> tags<br/>> Cited by (cites num)<br/>> [[code](code url)]<br/>> [[demo](demo url)]<br/>> [[page](page url)] | <div align="center"><img src="main image url" width="300"></div><br/>description |

Datasets
| [dataset name](paper url)<br/>> affiliation<br/>> language<br/>> year<br/>> (categories num) categories<br/>> (total images num) images<br/>> resolution | <div align="center"><img src="main image url" width="300"></div><br/>dataset description |

Popular Implementations
| [repository name](repository url) | [paper name](paper url) | framework |

Blogs
| [blog title](blog url) | author | one-sentence summary |

Papers

Lite Version

Full Version

| Paper info | Description |
| --- | --- |
| NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning<br/>Qimin Cheng; Haiyan Huang; Yuan Xu; Yuzhuo Zhou; Huanying Li; Zhongyuan Wang<br/>> Huazhong University of Science and Technology<br/>> TGRS 2022<br/>> Contextual attention, NWPU-Captions<br/>> Cited by 42 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9866055/cheng2-3201474-large.gif" width="300"></div><br/>The paper proposes a novel encoder–decoder architecture, the multilevel and contextual attention network (MLCA-Net), which improves the flexibility and diversity of the generated captions while keeping their accuracy and conciseness. |
| Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset<br/>Chenyang Liu; Rui Zhao; Hao Chen; Zhengxia Zou; Zhenwei Shi<br/>> Beihang University<br/>> TGRS 2022<br/>> Change captioning (CC), change detection (CD), Transformer<br/>> Cited by 52 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9934924/shi5abc-3218921-large.gif" width="300"></div><br/>The paper proposes a novel Transformer-based RSICC model (RSICCformer), which consists of a CNN-based feature extractor, a dual-branch Transformer encoder (DTE) and a caption decoder. |
| A Joint-Training Two-Stage Method For Remote Sensing Image Captioning<br/>Xiutiao Ye; Shuang Wang; Yu Gu; Jihui Wang; Ruixuan Wang; Biao Hou; Fausto Giunchiglia; Licheng Jiao<br/>> Xidian University<br/>> TGRS 2022<br/>> joint training, multilabel attributes<br/>> Cited by 23 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9961235/wang1-3224244-large.gif" width="300"></div><br/>A novel joint-training two-stage (JTTS) method improves remote sensing image captioning by integrating multilabel classification for prior information, utilizing differentiable sampling, and employing an attribute-guided decoder. |
| Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning<br/>Zhengyuan Zhang; Wenkai Zhang; Menglong Yan; Xin Gao; Kun Fu; Xian Sun<br/>> Chinese Academy of Sciences<br/>> TGRS 2022<br/>> Attention mechanism<br/>> Cited by 44 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9632558/sun2-3132095-large.gif" width="300"></div><br/>This article proposes a global visual feature-guided attention mechanism for remote sensing image captioning, which introduces global visual features and filters out redundant components. |
| Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis<br/>Genc Hoxha; Seloua Chouaf; Farid Melgani; Youcef Smara<br/>> University of Trento<br/>> TGRS 2022<br/>> change detection (CD), support vector machines (SVMs)<br/>> Cited by 33 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9847254/melga1ab-3195692-large.gif" width="300"></div><br/>This article proposes change captioning systems that generate coherent sentence descriptions of the changes that have occurred in remote sensing images, using convolutional neural networks to extract features and recurrent neural networks or support vector machines to generate the change descriptions. |
| Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning<br/>Yunpeng Li; Xiangrong Zhang; Jing Gu; Chen Li; Xin Wang; Xu Tang; Licheng Jiao<br/>> Xidian University<br/>> TGRS 2022<br/>> Attention mechanism, encoder-decoder<br/>> Cited by 45 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9515452/zhang2abcd-3102590-large.gif" width="300"></div><br/>This article introduces a novel RASG framework for remote sensing image captioning, which uses competitive visual features and a recurrent attention mechanism to generate improved context vectors and enhance word representations. |
| High-Resolution Remote Sensing Image Captioning Based on Structured Attention<br/>Rui Zhao; Zhenwei Shi; Zhengxia Zou<br/>> Beihang University, Beijing<br/>> TGRS 2022<br/>> structured attention<br/>> Cited by 85 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9400386/shi2-3070383-large.gif" width="300"></div><br/>A fine-grained, structured attention-based method is proposed for generating language descriptions of high-resolution remote sensing images. It exploits the structural characteristics of semantic contents and can generate pixelwise segmentation masks without requiring pixelwise annotations. |
| A Novel SVM-Based Decoder for Remote Sensing Image Captioning<br/>Genc Hoxha; Farid Melgani<br/>> University of Trento<br/>> TGRS 2022<br/>> support vector machines (SVMs)<br/>> Cited by 46 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9521989/melga1-3105004-large.gif" width="300"></div><br/>This article introduces a novel remote sensing image captioning system that uses a network of support vector machines (SVMs) instead of recurrent neural networks (RNNs). |
| Meta captioning: A meta learning based remote sensing image captioning framework<br/>Qiaoqiao Yang; Zihao Ni; Peng Ren<br/>> China University of Petroleum (East China)<br/>> ISPRS 2022<br/>> Meta learning<br/>> Cited by 8<br/>> [Code] | <div align="center"><img src="https://ars.els-cdn.com/content/image/1-s2.0-S0924271622000351-gr1.jpg" width="300"></div><br/>The paper presents a meta captioning framework that uses meta learning to address the limitations of remote sensing image captioning, transferring meta features extracted from natural image classification and remote sensing image classification tasks to improve captioning performance with a relatively small amount of caption-labeled training data. |
| Generating the captions for remote sensing images: A spatial-channel attention based memory-guided transformer approach<br/>Gaurav O. Gajbhiye; Abhijeet V. Nandedkar<br/>> SGGS Institute of Engineering and Technology<br/>> ISPRS 2022<br/>> Spatial and channel-wise visual attention, Transformer, Memory guided decoder<br/>> Cited by 4<br/>> [Code] | <div align="center"><img src="https://ars.els-cdn.com/content/image/1-s2.0-S0952197622002317-gr2.jpg" width="300"></div><br/>A novel fully-attentive CNN-Transformer approach is proposed for automatic caption generation in remote sensing images, integrating a multi-attentive visual encoder and a memory-guided Transformer-based linguistic decoder, with a statistical index to measure the model's ability to generate reliable captions across datasets. |
| Multi-label semantic feature fusion for remote sensing image captioning<br/>Usman Zia; M. Mohsin Riaz; Abdul Ghafoor<br/>> National University of Sciences and Technology (NUST), Pakistan<br/>> IJAEOG 2022<br/>> Remote sensing image retrieval, Multi-modal domains<br/>> Cited by 0 | <div align="center"><img src="https://ars.els-cdn.com/content/image/1-s2.0-S0303243422000678-gr2.jpg" width="300"></div><br/>This paper proposes a model for generating novel captions for remote sensing images by utilizing multi-scale features and an adaptive attention-based decoder with topic-sensitive word embedding. |
| Truncation Cross Entropy Loss for Remote Sensing Image Captioning<br/>Xuelong Li; Xueting Zhang; Wei Huang; Qi Wang<br/>> Northwestern Polytechnical University<br/>> TGRS 2021<br/>> overfitting, truncation cross entropy (TCE) loss<br/>> Cited by 75 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9438983/9153154/wang2-3010106-large.gif" width="300"></div><br/>This article explores the limitations of cross entropy (CE) loss in remote sensing image captioning (RSIC) and proposes a truncation cross entropy (TCE) loss to alleviate the overfitting problem. |
| SD-RSIC: Summarization-Driven Deep Remote Sensing Image Captioning<br/>Gencer Sumbul; Sonali Nayak; Begüm Demir<br/>> Technische Universität Berlin<br/>> TGRS 2021<br/>> Caption summarization<br/>> Cited by 35<br/>> [Code] | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9492173/9239371/sumbu1-3031111-large.gif" width="300"></div><br/>The novel SD-RSIC approach addresses the issue of redundant information in remote sensing image captioning. It uses summarization techniques, adaptive weighting, and a combination of CNNs and LSTM networks to improve the mapping from the image domain to the language domain. |
| Word–Sentence Framework for Remote Sensing Image Captioning<br/>Qi Wang; Wei Huang; Xueting Zhang; Xuelong Li<br/>> Northwestern Polytechnical University<br/>> TGRS 2021<br/>> word–sentence framework<br/>> Cited by 73 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9624468/9308980/wang1-3044054-large.gif" width="300"></div><br/>This article introduces a new explainable word–sentence framework for remote sensing image captioning (RSIC), consisting of a word extractor and a sentence generator. |
| Denoising-Based Multiscale Feature Fusion for Remote Sensing Image Captioning<br/>Wei Huang; Qi Wang; Xuelong Li<br/>> Northwestern Polytechnical University<br/>> GRSL 2020<br/>> encoder–decoder, feature fusion, multiscale<br/>> Cited by 65 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8859/9363032/9057472/wang2-2980933-large.gif" width="300"></div><br/>This paper proposes a denoising-based multi-scale feature fusion (DMSFF) mechanism for remote sensing image captioning, which improves caption quality by addressing the limitations caused by large-scale variation in remote sensing images. Experimental results on two public datasets validate the effectiveness of the proposed method. |
| A Multi-Level Attention Model for Remote Sensing Image Captions<br/>Y Li; S Fang; L Jiao; R Liu; R Shang<br/>> Xidian University<br/>> Remote Sensing 2019<br/>> attention, encoder-decoder<br/>> Cited by 42 | <div align="center"><img src="https://pub.mdpi-res.com/remotesensing/remotesensing-12-00939/article_deploy/html/images/remotesensing-12-00939-ag-550.jpg" width="300"></div><br/>This paper proposes a multi-level attention model for remote sensing image captioning, which mimics human attention mechanisms by incorporating three attention structures (for different areas of the image, for words, and for vision and semantics) and achieves superior results compared to previous methods. |
| Sound Active Attention Framework for Remote Sensing Image Captioning<br/>Xiaoqiang Lu; Binqiang Wang; Xiangtao Zheng<br/>> Chinese Academy of Sciences<br/>> TGRS 2019<br/>> sound active attention<br/>> Cited by 74 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9014411/8931249/lu2-2951636-large.gif" width="300"></div><br/>The paper proposes a novel sound active attention framework that generates more specific captions according to the interest of the observer. |
| Semantic Descriptions of High-Resolution Remote Sensing Images<br/>Binqiang Wang; Xiaoqiang Lu; Xiangtao Zheng; Xuelong Li<br/>> University of Chinese Academy of Sciences<br/>> GRSL 2019<br/>> semantic embedding, sentence representation<br/>> Cited by 103 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8859/8768247/8633358/lu2-2893772-large.gif" width="300"></div><br/>This paper proposes a framework that uses semantic embedding to measure the image representation and the sentence representation. |
| VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning<br/>Zhengyuan Zhang; Wenkai Zhang; Wenhui Diao; Menglong Yan; Xin Gao; Xian Sun<br/>> University of Chinese Academy of Sciences<br/>> IEEE Access 2019<br/>> Visual Aligning Attention Model<br/>> Cited by 35 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/6287639/8600701/8843891/zhang1-2942154-large.gif" width="300"></div><br/>This paper proposes a Visual Aligning Attention (VAA) model, which ensures that attention layers accurately focus on regions of interest. |
| Multi-Scale Cropping Mechanism for Remote Sensing Image Captioning<br/>Xueting Zhang; Qi Wang; Shangdong Chen; Xuelong Li<br/>> Northwestern Polytechnical University<br/>> IGARSS 2019<br/>> multi-scale cropping<br/>> Cited by 36 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8891871/8897702/8900503/zhang1-p4-zhang-large.gif" width="300"></div><br/>This paper proposes a multi-scale cropping training mechanism for remote sensing image captioning, which can extract more fine-grained information from remote sensing images and enhance the generalization performance of the base model. |
| Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning<br/>Zhenghang Yuan; Xuelong Li; Qi Wang<br/>> Northwestern Polytechnical University<br/>> IEEE Access 2019<br/>> Multi-level attention, Graph convolutional networks<br/>> Cited by 43 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/6287639/8948470/8943170/wang2-2962195-large.gif" width="300"></div><br/>This paper proposes a remote sensing image captioning framework based on multi-level attention and multi-label attribute graph convolution to improve performance from two aspects. |
| Intensive Positioning Network for Remote Sensing Image Captioning<br/>Shengsheng Wang; Jiawei Chen; Guangyao Wang<br/>> Jilin University<br/>> Intelligence Science and Big Data Engineering 2018<br/>> Intensive positioning network (IPN)<br/>> Cited by 7 | <div align="center"><img src="https://media.springernature.com/full/springer-static/image/chp%3A10.1007%2F978-3-030-02698-1_49/MediaObjects/475114_1_En_49_Fig2_HTML.png?as=webp" width="300"></div><br/>The paper proposes a new network, the intensive positioning network (IPN), which can predict regions containing important information in the picture and output multiple region description blocks around these regions. |
| Exploring Models and Data for Remote Sensing Image Caption Generation<br/>Xiaoqiang Lu; Binqiang Wang; Xiangtao Zheng; Xuelong Li<br/>> University of Chinese Academy of Sciences<br/>> TGRS 2017<br/>> RSICD<br/>> Cited by 436 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/8323433/8240966/lu2-2776321-large.gif" width="300"></div><br/>The paper constructs a remote sensing image captioning dataset (RSICD) and evaluates different caption methods based on handcrafted representations and convolutional features on different datasets. |
| Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?<br/>Zhenwei Shi; Zhengxia Zou<br/>> Beihang University<br/>> TGRS 2017<br/>> Fully convolutional networks<br/>> Cited by 182 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/7932278/7891049/shi4-2677464-large.gif" width="300"></div><br/>The paper investigates whether a machine can automatically generate humanlike language descriptions of a remote sensing image and proposes a remote sensing image captioning framework. Experimental results on Google Earth and GF-2 images demonstrate the superiority and transfer ability of the proposed method. |
| Deep semantic understanding of high resolution remote sensing image<br/>Bo Qu; Xuelong Li; Dacheng Tao; Xiaoqiang Lu<br/>> University of the Chinese Academy of Sciences<br/>> CITS 2016<br/>> HSR image<br/>> Cited by 218 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/7536293/7546377/7546397/7546397-fig-1-source-large.gif" width="300"></div><br/>This paper proposes a deep multimodal neural network model for understanding HSR remote sensing images at the semantic level and finds that the 19-layer VGG network combined with LSTMs is the best combination for HSR remote sensing image caption generation. |

Datasets

| Dataset info | Description |
| --- | --- |
| NWPU-Captions<br/>> Huazhong University of Science and Technology<br/>> English<br/>> 2022<br/>> 45 categories<br/>> 31500 images<br/>> 30 m - 0.2 m | <div align="center"><img src="assets/NWPU-Captions.jpg" width="300"></div><br/>The NWPU-Captions dataset is a larger and more challenging benchmark dataset for remote sensing image captioning, containing 157,500 manually annotated sentences and 31,500 images, and offering greater data volume, category variety, and description richness, as well as wider coverage of complex scenes and vocabulary. |
| LEVIR-CC<br/>> Beihang University<br/>> English<br/>> 2022<br/>> 10 categories<br/>> 10077 images<br/>> 0.5 m - 0.5 m | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9934924/shi1-3218921-large.gif" width="300"></div><br/>The LEVIR-CC dataset is a large-scale dataset designed for the RSICC task, consisting of 10077 pairs of RS images and 50385 corresponding sentences describing image differences. |
| RSICD<br/>> University of Chinese Academy of Sciences<br/>> English<br/>> 2018<br/>> 30 categories<br/>> 10921 images | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/8323433/8240966/lu1-2776321-large.gif" width="300"></div><br/>It contains more than ten thousand remote sensing images collected from Google Earth, Baidu Map, MapABC and Tianditu. |
| Sydney captions<br/>> University of the Chinese Academy of Sciences<br/>> English<br/>> 2016<br/>> 7 categories<br/>> 613 images<br/>> 0.5m | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/7536293/7546377/7546397/7546397-fig-4-source-large.gif" width="300"></div><br/>It contains 7 different scene categories and a total of 613 HSR images. |
| UCM captions<br/>> University of the Chinese Academy of Sciences<br/>> English<br/>> 2016<br/>> 21 categories<br/>> 2100 images<br/>> 0.3048m | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/7536293/7546377/7546397/7546397-fig-4-source-large.gif" width="300"></div><br/>It is based on the UC Merced Land Use Dataset and has 2100 HSR images, which are divided into 21 challenging scene categories. |

Popular Implementations

| Code | Paper | Framework |
| --- | --- | --- |
| a-PyTorch-Tutorial-to-Image-Captioning | Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | PyTorch |
| ImageCaptioning.pytorch | Self-critical Sequence Training for Image Captioning | PyTorch |
| stylenet | StyleNet: Generating Attractive Visual Captions with Styles | PyTorch |
| image-captioning-bottom-up-top-down | Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | PyTorch |
| knowing-when-to-look-adaptive-attention | Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning | PyTorch |
| show-control-and-tell | Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions | PyTorch |
| Multitask_Image_Captioning | Multitask Learning for Cross-Domain Image Captioning | PyTorch |
| NeuralBabyTalk | Neural Baby Talk | PyTorch |
| Recurrent_Fusion_Network | Recurrent Fusion Network for Image Captioning | PyTorch |
| Stack-Captioning | Stack-Captioning: Coarse-to-Fine Learning for Image Captioning | PyTorch |
| image_captioning | Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | TensorFlow |
| densecap | DenseCap: Fully Convolutional Localization Networks for Dense Captioning | Torch |
| AdaptiveAttention | Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning | Torch |

Benchmarks

DataSet: UCM-CAPTIONS

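| Method | Year | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VLAD-LSTM | 2018 | 0.7016 | 0.6085 | 0.5496 | 0.5030 | 0.3464 | 0.6520 | 2.3131 | -- |
| SIFT-LSTM | 2018 | 0.5517 | 0.4166 | 0.3489 | 0.3040 | 0.2432 | 0.5235 | 1.3603 | -- |
| CSMLF | 2019 | 0.4361 | 0.2728 | 0.1855 | 0.1210 | 0.1320 | 0.3927 | 0.2227 | -- |
| FC-ATT+LSTM | 2019 | 0.8135 | 0.7502 | 0.6849 | 0.6352 | 0.4173 | 0.7504 | 2.9958 | -- |
| SM-ATT+LSTM | 2019 | 0.8154 | 0.7575 | 0.6936 | 0.6458 | 0.4240 | 0.7632 | 3.1864 | -- |
| Soft Attention | 2019 | 0.7454 | 0.6545 | 0.5855 | 0.5250 | 0.3886 | 0.7237 | 2.6124 | -- |
| Hard Attention | 2019 | 0.8157 | 0.7312 | 0.6702 | 0.6182 | 0.4263 | 0.7698 | 2.9947 | -- |
| Sound-a-a | 2020 | 0.7484 | 0.6837 | 0.6310 | 0.5896 | 0.3623 | 0.6579 | 2.7281 | 0.3907 |
| SAT(LAM) | 2019 | 0.8195 | 0.7764 | 0.7485 | 0.7161 | 0.4837 | 0.7908 | 3.6171 | 0.5024 |
| ADAPTIVE(LAM) | 2019 | 0.8170 | 0.7510 | 0.6990 | 0.6540 | 0.4480 | 0.7870 | 3.2800 | 0.5030 |
| TCE loss-based | 2020 | 0.8210 | 0.7622 | 0.7140 | 0.6700 | 0.4775 | 0.7567 | 2.8547 | -- |
| Recurrent-ATT | 2021 | 0.8518 | 0.7925 | 0.7432 | 0.6976 | 0.4571 | 0.8072 | 3.3887 | 0.4891 |
| GVFGA+LSGA | 2022 | 0.8319 | 0.7657 | 0.7103 | 0.6596 | 0.4436 | 0.7845 | 3.3270 | 0.4853 |
| SVM-D BOW | 2022 | 0.7635 | 0.6664 | 0.5869 | 0.5195 | 0.3654 | 0.6877 | 2.7142 | -- |
| SVM-D CONC | 2022 | 0.7653 | 0.6947 | 0.6417 | 0.5942 | 0.3702 | 0.6877 | 2.9228 | -- |
| MLCA-Net | 2022 | 0.8260 | 0.770 | 0.7170 | 0.6680 | 0.4350 | 0.7720 | 3.240 | 0.4730 |
| Structured attention | 2022 | 0.8538 | 0.8035 | 0.7572 | 0.7149 | 0.4632 | 0.8141 | 3.3489 | -- |
| AJJTTSM | 2022 | 0.8696 | 0.8224 | 0.7788 | 0.7376 | 0.4906 | 0.8364 | 3.7102 | 0.5231 |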

DataSet: SYDNEY-CAPTIONS

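| Method | Year | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VLAD-LSTM | 2018 | 0.4913 | 0.3472 | 0.2760 | 0.2314 | 0.1930 | 0.4201 | 0.9164 | -- |
| SIFT-LSTM | 2018 | 0.5793 | 0.4774 | 0.4183 | 0.3740 | 0.2707 | 0.5366 | 0.9873 | -- |
| CSMLF | 2019 | 0.5998 | 0.4583 | 0.3869 | 0.3433 | 0.2475 | 0.5018 | 0.9378 | -- |
| FC-ATT+LSTM | 2019 | 0.8076 | 0.7160 | 0.6276 | 0.5544 | 0.4099 | 0.7114 | 2.2033 | -- |
| SM-ATT+LSTM | 2019 | 0.8143 | 0.7351 | 0.6586 | 0.5806 | 0.4111 | 0.7195 | 2.3021 | -- |
| Soft Attention | 2019 | 0.7322 | 0.6674 | 0.6223 | 0.5820 | 0.3942 | 0.7127 | 2.4993 | -- |
| Hard Attention | 2019 | 0.7591 | 0.6610 | 0.5889 | 0.5258 | 0.3898 | 0.7189 | 2.1819 | -- |
| Sound-a-a | 2020 | 0.7093 | 0.6228 | 0.5393 | 0.4602 | 0.3121 | 0.5974 | 1.7477 | 0.3837 |
| SAT(LAM) | 2019 | 0.7405 | 0.6550 | 0.5904 | 0.5304 | 0.3689 | 0.6814 | 2.3519 | 0.4038 |
| ADAPTIVE(LAM) | 2019 | 0.7323 | 0.6316 | 0.5629 | 0.5074 | 0.3613 | 0.6775 | 2.3455 | 0.4243 |
| TCE loss-based | 2020 | 0.7937 | 0.7304 | 0.6717 | 0.6193 | 0.4430 | 0.7130 | 2.4042 | -- |
| Recurrent-ATT | 2021 | 0.8000 | 0.7217 | 0.6531 | 0.5909 | 0.3908 | 0.7218 | 2.6311 | 0.4301 |
| GVFGA+LSGA | 2022 | 0.7681 | 0.6846 | 0.6145 | 0.5504 | 0.3866 | 0.7030 | 2.4522 | 0.4532 |
| SVM-D BOW | 2022 | 0.7787 | 0.6835 | 0.6023 | 0.5305 | 0.3797 | 0.6992 | 2.2722 | -- |
| SVM-D CONC | 2022 | 0.7547 | 0.6711 | 0.5970 | 0.5308 | 0.3643 | 0.6746 | 2.2222 | -- |
| MLCA-Net | 2022 | 0.8310 | 0.7420 | 0.6590 | 0.5800 | 0.3900 | 0.7110 | 2.3240 | 0.4090 |
| Structured attention | 2022 | 0.7795 | 0.7019 | 0.6392 | 0.5861 | 0.3954 | 0.7299 | 2.3791 | -- |
| AJJTTSM | 2022 | 0.8492 | 0.7797 | 0.7137 | 0.6496 | 0.4457 | 0.7660 | 2.8010 | 0.4679 |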

DataSet: RSICD

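| Method | Year | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VLAD-LSTM | 2018 | 0.5004 | 0.3195 | 0.2319 | 0.1778 | 0.2046 | 0.4334 | 1.1801 | -- |
| SIFT-LSTM | 2018 | 0.4859 | 0.3033 | 0.2186 | 0.1678 | 0.1966 | 0.4174 | 1.0528 | -- |
| CSMLF | 2019 | 0.5759 | 0.3959 | 0.2832 | 0.2217 | 0.2128 | 0.4455 | 0.5297 | -- |
| FC-ATT+LSTM | 2019 | 0.7459 | 0.6250 | 0.5338 | 0.4574 | 0.3395 | 0.6333 | 2.3664 | -- |
| SM-ATT+LSTM | 2019 | 0.7571 | 0.6336 | 0.5385 | 0.4612 | 0.3513 | 0.6458 | 2.3563 | -- |
| Soft Attention | 2019 | 0.6753 | 0.5308 | 0.4333 | 0.3617 | 0.3255 | 0.6109 | 1.9643 | -- |
| Hard Attention | 2019 | 0.6669 | 0.5182 | 0.4164 | 0.3407 | 0.3201 | 0.6084 | 1.7925 | -- |
| Sound-a-a | 2020 | 0.6196 | 0.4819 | 0.3902 | 0.3195 | 0.2733 | 0.5143 | 1.6386 | 0.3598 |
| SAT(LAM) | 2019 | 0.6753 | 0.5537 | 0.4686 | 0.4026 | 0.3254 | 0.5823 | 2.5850 | 0.4636 |
| ADAPTIVE(LAM) | 2019 | 0.6664 | 0.5486 | 0.4676 | 0.4070 | 0.3230 | 0.5843 | 2.6055 | 0.4673 |
| TCE loss-based | 2020 | 0.7608 | 0.6358 | 0.5471 | 0.4791 | 0.3425 | 0.6687 | 2.4665 | -- |
| Recurrent-ATT | 2021 | 0.7729 | 0.6651 | 0.5782 | 0.5062 | 0.3626 | 0.6691 | 2.7549 | 0.4719 |
| GVFGA+LSGA | 2022 | 0.6779 | 0.5600 | 0.4781 | 0.4165 | 0.3285 | 0.5929 | 2.6012 | 0.4683 |
| SVM-D BOW | 2022 | 0.6112 | 0.4277 | 0.3153 | 0.2411 | 0.2303 | 0.4588 | 0.6825 | -- |
| SVM-D CONC | 2022 | 0.5999 | 0.4347 | 0.3550 | 0.2689 | 0.2299 | 0.4557 | 0.6854 | -- |
| MLCA-Net | 2022 | 0.7500 | 0.6310 | 0.5380 | 0.4590 | 0.3420 | 0.6380 | 2.4180 | 0.4440 |
| Structured attention | 2022 | 0.7016 | 0.5614 | 0.4648 | 0.3934 | 0.3291 | 0.5706 | 1.7031 | -- |
| AJJTTSM | 2022 | 0.7893 | 0.6795 | 0.5893 | 0.5135 | 0.3773 | 0.6823 | 2.7958 | 0.4877 |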

DataSet: NWPU-CAPTIONS

| Metric | SOTA | Method | Paper |
| --- | --- | --- | --- |
| BLEU-1 | 0.745 | MLCA-Net | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning |
| BLEU-2 | 0.624 | MLCA-Net | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning |
| BLEU-3 | 0.541 | MLCA-Net | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning |
| BLEU-4 | 0.478 | MLCA-Net | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning |
| METEOR | 0.337 | MLCA-Net | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning |
| ROUGE | 0.601 | MLCA-Net | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning |
| CIDEr | 1.264 | MLCA-Net | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning |

DataSet: LEVIR-CC

| Metric | SOTA | Method | Paper |
| --- | --- | --- | --- |
| BLEU-1 | 0.8481 | RSICCformer | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset |
| BLEU-2 | 0.7639 | RSICCformer | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset |
| BLEU-3 | 0.6914 | RSICCformer | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset |
| BLEU-4 | 0.6307 | RSICCformer | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset |
| METEOR | 0.3961 | RSICCformer | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset |
| ROUGE | 0.7418 | RSICCformer | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset |
| CIDEr | 1.3468 | RSICCformer | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset |
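
The benchmark tables report BLEU-1 to BLEU-4 alongside other metrics. As a rough illustration of what those columns measure, here is a simplified sentence-level BLEU sketch: clipped n-gram precision with uniform weights, a brevity penalty, and no smoothing. It is not the evaluation code of any listed paper; published scores are corpus-level and come from each paper's own evaluation setup, so this sketch will not reproduce them. The function names `ngrams` and `bleu`, the whitespace tokenization, and the example captions are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU-max_n (uniform weights, no smoothing)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision gives zero BLEU
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min((len(r) for r in refs),
                  key=lambda n_ref: (abs(n_ref - len(cand)), n_ref))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    refs = ["many planes are parked near a terminal"]
    print(bleu("many planes are parked near a terminal", refs))
    print(bleu("a plane", ["a plane is parked"], max_n=1))
```

In this sketch, BLEU-1 corresponds to `max_n=1` and BLEU-4 to `max_n=4`; a candidate that is much shorter than its references is penalized even when every word matches.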

Blogs

| Title | Author | Overview |
| --- | --- | --- |
| illustrated-transformer | Jay Alammar | Visualizes the principles of the Transformer |
| visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention | Jay Alammar | Visualizes the principles of neural machine translation |