NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning<br/>Qimin Cheng; Haiyan Huang; Yuan Xu; Yuzhuo Zhou; Huanying Li; Zhongyuan Wang<br/>> Huazhong University of Science and Technology<br/>> TGRS 2022<br/>> Contextual attention , NWPU-Captions <br/>> Cited by 66 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9866055/cheng2-3201474-large.gif" width="300"></div><br/>The paper proposes a novel encoder–decoder architecture, the multilevel and contextual attention network (MLCA-Net), which improves the flexibility and diversity of the generated captions while keeping their accuracy and conciseness.<br/> |
Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset<br/>Chenyang Liu; Rui Zhao; Hao Chen; Zhengxia Zou; Zhenwei Shi<br/>> Beihang University<br/>> TGRS 2022<br/>> Change captioning (CC) , change detection (CD) , Transformer <br/>> Cited by 83 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9934924/shi5abc-3218921-large.gif" width="300"></div><br/>The paper proposes a novel Transformer-based RSICC (RSICCformer) model, which consists of a CNN-based feature extractor, a dual-branch Transformer encoder (DTE) and a caption decoder.<br/> |
A Joint-Training Two-Stage Method For Remote Sensing Image Captioning<br/>Xiutiao Ye; Shuang Wang; Yu Gu; Jihui Wang; Ruixuan Wang; Biao Hou; Fausto Giunchiglia; Licheng Jiao<br/>> Xidian University<br/>> TGRS 2022<br/>> joint training , multilabel attributes <br/>> Cited by 37 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9961235/wang1-3224244-large.gif" width="300"></div><br/>A novel joint-training two-stage (JTTS) method improves remote sensing image captioning by integrating multilabel classification for prior information, utilizing differentiable sampling, and employing an attribute-guided decoder.<br/> |
Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning<br/>Zhengyuan Zhang;Wenkai Zhang;Menglong Yan;Xin Gao;Kun Fu;Xian Sun<br/>> Chinese Academy of Sciences<br/>> TGRS 2022<br/>> Attention mechanism <br/>> Cited by 62 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9632558/sun2-3132095-large.gif" width="300"></div><br/>This article proposes a global visual feature-guided attention mechanism for remote sensing image captioning, which introduces global visual features and filters out redundant components.<br/> |
Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis<br/>Genc Hoxha;Seloua Chouaf;Farid Melgani;Youcef Smara<br/>> University of Trento<br/>> TGRS 2022<br/>> change detection (CD) , support vector machines (SVMs) <br/>> Cited by 50 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9847254/melga1ab-3195692-large.gif" width="300"></div><br/>This article proposes change captioning systems that generate coherent sentence descriptions of occurred changes in remote sensing, which utilize convolutional neural networks to extract features and recurrent neural networks or support vector machines to generate change descriptions.<br/> |
Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning<br/>Yunpeng Li;Xiangrong Zhang;Jing Gu;Chen Li;Xin Wang;Xu Tang;Licheng Jiao<br/>> Xidian University<br/>> TGRS 2022<br/>> Attention mechanism , encoder-decoder <br/>> Cited by 64 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9515452/zhang2abcd-3102590-large.gif" width="300"></div><br/>This article introduces a novel RASG framework for remote sensing image captioning, and it utilizes competitive visual features and a recurrent attention mechanism to generate improved context vectors and enhance word representations.<br/> |
High-Resolution Remote Sensing Image Captioning Based on Structured Attention<br/>Rui Zhao;Zhenwei Shi;Zhengxia Zou<br/>> Beihang University, Beijing<br/>> TGRS 2022<br/>> structured attention <br/>> Cited by 116 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9400386/shi2-3070383-large.gif" width="300"></div><br/>A fine-grained, structured attention-based method is proposed for generating language descriptions of high-resolution remote sensing images. It utilizes the structural characteristics of semantic contents and can generate pixelwise segmentation masks without requiring pixelwise annotations.<br/> |
A Novel SVM-Based Decoder for Remote Sensing Image Captioning<br/>Genc Hoxha;Farid Melgani<br/>> University of Trento<br/>> TGRS 2022<br/>> support vector machines (SVMs) <br/>> Cited by 55 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9633014/9521989/melga1-3105004-large.gif" width="300"></div><br/>This article introduces a novel remote sensing image captioning system by using a network of support vector machines (SVMs) instead of recurrent neural networks (RNNs).<br/> |
Meta captioning: A meta learning based remote sensing image captioning framework<br/>Qiaoqiao Yang; Zihao Ni; Peng Ren<br/>> China University of Petroleum (East China)<br/>> ISPRS 2022<br/>> Meta learning <br/>> Cited by 8<br/>> [Code] | <div align="center"><img src="https://ars.els-cdn.com/content/image/1-s2.0-S0924271622000351-gr1.jpg" width="300"></div><br/>The paper presents a meta captioning framework that utilizes meta learning to address the limitations of remote sensing image captioning, transferring meta features extracted from natural image classification and remote sensing image classification tasks to improve captioning performance with a relatively small amount of caption-labeled training data. |
Generating the captions for remote sensing images: A spatial-channel attention based memory-guided transformer approach<br/>Gaurav O. Gajbhiye;Abhijeet V. Nandedkar<br/>> SGGS Institute of Engineering and Technology<br/>> EAAI 2022<br/>> Spatial and channel-wise visual attention , Transformer , Memory guided decoder <br/>> Cited by 4<br/>> [Code] | <div align="center"><img src="https://ars.els-cdn.com/content/image/1-s2.0-S0952197622002317-gr2.jpg" width="300"></div><br/>A novel fully-attentive CNN-Transformer approach is proposed for automatic caption generation in remote sensing images, integrating a multi-attentive visual encoder and a memory-guided Transformer-based linguistic decoder, with a statistical index to measure the model's ability to generate reliable captions across datasets. |
Multi-label semantic feature fusion for remote sensing image captioning<br/>Usman Zia; M. Mohsin Riaz; Abdul Ghafoor<br/>> National University of Sciences and Technology (NUST), Pakistan<br/>> IJAEOG 2022<br/>> Remote sensing image retrieval , Multi-modal domains <br/>> Cited by 0<br/> | <div align="center"><img src="https://ars.els-cdn.com/content/image/1-s2.0-S0303243422000678-gr2.jpg" width="300"></div><br/>This paper proposes a model for generating novel captions for remote sensing images by utilizing multi-scale features and an adaptive attention-based decoder with topic-sensitive word embedding. |
Truncation Cross Entropy Loss for Remote Sensing Image Captioning<br/>Xuelong Li;Xueting Zhang;Wei Huang;Qi Wang<br/>> Northwestern Polytechnical University<br/>> TGRS 2021<br/>> overfitting , truncation cross entropy (TCE) loss <br/>> Cited by 86 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9438983/9153154/wang2-3010106-large.gif" width="300"></div><br/>This article explores the limitations of cross entropy (CE) loss in remote sensing image captioning (RSIC) and proposes a truncation cross entropy (TCE) loss to alleviate the overfitting problem.<br/> |
SD-RSIC: Summarization-Driven Deep Remote Sensing Image Captioning<br/>Gencer Sumbul;Sonali Nayak;Begüm Demir<br/>> Technische Universität Berlin<br/>> TGRS 2021<br/>> Caption summarization <br/>> Cited by 35<br/>> [Code] | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9492173/9239371/sumbu1-3031111-large.gif" width="300"></div><br/>The novel SD-RSIC approach addresses the issue of redundant information in remote sensing image captioning. It utilizes summarization techniques, adaptive weighting, and a combination of CNNs and LSTM networks to improve the mapping from the image domain to the language domain.<br/> |
Word–Sentence Framework for Remote Sensing Image Captioning<br/>Qi Wang;Wei Huang;Xueting Zhang;Xuelong Li<br/>> Northwestern Polytechnical University<br/>> TGRS 2021<br/>> word–sentence framework <br/>> Cited by 88 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9624468/9308980/wang1-3044054-large.gif" width="300"></div><br/>This article introduces a new explainable word-sentence framework for remote sensing image captioning (RSIC), consisting of a word extractor and a sentence generator.<br/> |
Denoising-Based Multiscale Feature Fusion for Remote Sensing Image Captioning<br/>Wei Huang; Qi Wang; Xuelong Li<br/>> Northwestern Polytechnical University<br/>> GRSL 2020<br/>> encoder–decoder , feature fusion , multiscale <br/>> Cited by 77 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8859/9363032/9057472/wang2-2980933-large.gif" width="300"></div><br/>This paper proposes a denoising-based multi-scale feature fusion (DMSFF) mechanism for remote sensing image captioning, which improves caption quality by addressing the limitations caused by large-scale variation in remote sensing images. Experimental results on two public datasets validate the effectiveness of the proposed method. |
A Multi-Level Attention Model for Remote Sensing Image Captions<br/>Y Li; S Fang; L Jiao; R Liu; R Shang<br/>> Xidian University<br/>> Remote Sensing 2019<br/>> attention , encoder-decoder <br/>> Cited by 52 | <div align="center"><img src="https://pub.mdpi-res.com/remotesensing/remotesensing-12-00939/article_deploy/html/images/remotesensing-12-00939-ag-550.jpg" width="300"></div><br/>This paper proposes a multi-level attention model for remote sensing image captioning, which mimics human attention mechanisms by incorporating three attention structures for different areas of the image, words, and vision and semantics, achieving superior results compared to previous methods. |
Sound Active Attention Framework for Remote Sensing Image Captioning<br/>Xiaoqiang Lu; Binqiang Wang; Xiangtao Zheng<br/>> Chinese Academy of Sciences<br/>> TGRS 2019<br/>> sound active attention <br/>> Cited by 86 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/9014411/8931249/lu2-2951636-large.gif" width="300"></div><br/>This paper proposes a novel sound active attention framework that generates more specific captions according to the interest of the observer. |
Semantic Descriptions of High-Resolution Remote Sensing Images<br/>Binqiang Wang; Xiaoqiang Lu; Xiangtao Zheng; Xuelong Li<br/>> University of Chinese Academy of Sciences<br/>> GRSL 2019<br/>> semantic embedding , sentence representation <br/>> Cited by 118 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8859/8768247/8633358/lu2-2893772-large.gif" width="300"></div><br/>This paper proposes a framework that uses semantic embedding to measure the image representation and the sentence representation. |
VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning<br/>Zhengyuan Zhang; Wenkai Zhang; Wenhui Diao; Menglong Yan; Xin Gao; Xian Sun<br/>> University of Chinese Academy of Sciences<br/>> IEEE Access 2019<br/>> Visual Aligning Attention Model <br/>> Cited by 38 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/6287639/8600701/8843891/zhang1-2942154-large.gif" width="300"></div><br/>This paper proposes a Visual Aligning Attention (VAA) model that ensures the attention layers accurately focus on regions of interest. |
Multi-Scale Cropping Mechanism for Remote Sensing Image Captioning<br/>Xueting Zhang; Qi Wang; Shangdong Chen; Xuelong Li<br/>> Northwestern Polytechnical University<br/>> IGARSS 2019<br/>> multi-scale cropping <br/>> Cited by 43 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8891871/8897702/8900503/zhang1-p4-zhang-large.gif" width="300"></div><br/>This paper proposes a multi-scale cropping training mechanism for remote sensing image captioning, which can extract more fine-grained information from remote sensing images and enhance the generalization performance of the base model. |
Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning<br/>Zhenghang Yuan; Xuelong Li; Qi Wang<br/>> Northwestern Polytechnical University<br/>> IEEE Access 2019<br/>> Multi-level attention , Graph convolutional networks <br/>> Cited by 51 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/6287639/8948470/8943170/wang2-2962195-large.gif" width="300"></div><br/>This paper proposes a remote sensing image captioning framework based on multi-level attention and multi-label attribute graph convolution, improving performance from these two aspects. |
Intensive Positioning Network for Remote Sensing Image Captioning<br/>Shengsheng Wang; Jiawei Chen; Guangyao Wang<br/>> Jilin University<br/>> Intelligence Science and Big Data Engineering 2018<br/>> Intensive positioning network (IPN) <br/>> Cited by 7 | <div align="center"><img src="https://media.springernature.com/full/springer-static/image/chp%3A10.1007%2F978-3-030-02698-1_49/MediaObjects/475114_1_En_49_Fig2_HTML.png?as=webp" width="300"></div><br/>The paper proposes a new network, the intensive positioning network (IPN), which can predict regions containing important information in the picture and output multiple region description blocks around these regions. |
Exploring Models and Data for Remote Sensing Image Caption Generation <br/>Xiaoqiang Lu; Binqiang Wang; Xiangtao Zheng; Xuelong Li<br/>> University of Chinese Academy of Sciences<br/>> TGRS 2017<br/>> RSICD <br/>> Cited by 510 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/8323433/8240966/lu2-2776321-large.gif" width="300"></div><br/>The paper constructs a remote sensing image captioning dataset (RSICD) and evaluates different caption methods based on handcrafted representations and convolutional features on different datasets. |
Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?<br/>Zhenwei Shi; Zhengxia Zou<br/>> Beihang University<br/>> TGRS 2017<br/>> Fully convolutional networks <br/>> Cited by 197 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/36/7932278/7891049/shi4-2677464-large.gif" width="300"></div><br/>The paper investigates whether a machine can automatically generate humanlike language descriptions of a remote sensing image and proposes a remote sensing image captioning framework. Experimental results on Google Earth and GF-2 images demonstrate the superiority and transfer ability of the proposed method. |
Deep semantic understanding of high resolution remote sensing image<br/>Bo Qu; Xuelong Li; Dacheng Tao; Xiaoqiang Lu<br/>> University of the Chinese Academy of Sciences<br/>> CITS 2016<br/>> HSR image <br/>> Cited by 265 | <div align="center"><img src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/7536293/7546377/7546397/7546397-fig-1-source-large.gif" width="300"></div><br/>This paper proposes a deep multimodal neural network model for understanding HSR remote sensing images at the semantic level, and finds that the VGG19 network combined with LSTMs is the best combination for HSR remote sensing image caption generation. |