
Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning

PyTorch implementation of the paper Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning, published at AAAI 2023.

Xian Zhong, Zipeng Li, Shuqin Chen, Kui Jiang, Chen Chen, Mang Ye.

Paper: https://ojs.aaai.org/index.php/AAAI/article/view/25484

Main Contribution

Focuses on token imbalance to enhance refined semantics for video captioning.
Learns infrequent tokens through diffusion to alleviate the long-tailed problem.
Balances tokens of different frequencies by leveraging their distinctive semantics.
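
To make the token-imbalance point concrete, the sketch below counts token frequencies in a toy caption set and splits the vocabulary into frequent and infrequent tokens. The captions, the `split_by_frequency` helper, and the `min_count` threshold are illustrative assumptions and are not part of the RSFD code base.

```python
from collections import Counter


def split_by_frequency(captions, min_count=2):
    """Split the vocabulary into frequent and infrequent tokens.

    `min_count` is an arbitrary illustrative threshold, not a value from the paper.
    """
    counts = Counter(
        token for caption in captions for token in caption.lower().split()
    )
    frequent = {token for token, count in counts.items() if count >= min_count}
    infrequent = set(counts) - frequent
    return counts, frequent, infrequent


# Toy captions, made up for illustration only
captions = [
    "a man is playing a guitar",
    "a man is cooking",
    "a woman is playing a violin",
]
counts, frequent, infrequent = split_by_frequency(captions)
print("frequent:  ", sorted(frequent))
print("infrequent:", sorted(infrequent))
```

Even in this tiny example, function words dominate the counts while most content words appear only once, which is the long-tailed pattern the contributions above target.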

Methodology

As shown in Fig. 2, the overall framework follows an encoder-decoder structure. During training, Frequency-Aware Diffusion (FAD) adds noise to low-frequency tokens so that the model learns their semantics. The diffused token features are then fused with the corresponding visual features through cross-attention. At the head of the decoder, the Divergent Semantic Supervisor (DSS) obtains distinctive semantic features by updating gradients adapted to each token. In the testing phase, only the original Transformer architecture is retained to generate captions.

<img src="RSFD.png" alt="RSFD" style="zoom:100%;" />

<div style="color:orange; display: inline-block; color: black; ">Figure 2. Overview of the proposed RSFD architecture. It mainly consists of the encoder in the top-left box and the decoder with the FAD and DSS modules in the other box. In the training phase, FAD helps the model comprehend the refined information by mapping the ground-truth caption to the semantic space and fusing it through frequency diffusion. DSS supervises the central word to obtain its distinctive semantics. In the testing phase, only the Transformer-based parts are used for sentence generation.</div>
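
The following is a rough sketch of the general idea of perturbing low-frequency token embeddings during training only. It is not the authors' FAD module: the function name `add_low_frequency_noise`, the single-step Gaussian noise, and the `threshold`, `sigma`, and `seed` parameters are all assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import random


def add_low_frequency_noise(embeddings, tokens, counts,
                            threshold=2, sigma=0.1, training=True, seed=0):
    """Return a copy of `embeddings` with Gaussian noise on rare-token rows.

    embeddings: list of per-token vectors (lists of floats)
    tokens:     tokens aligned with `embeddings`
    counts:     mapping token -> corpus frequency
    threshold, sigma, seed: hypothetical hyperparameters for this sketch
    """
    rng = random.Random(seed)
    noisy = []
    for token, vector in zip(tokens, embeddings):
        is_rare = counts.get(token, 0) < threshold
        if training and is_rare:
            # Perturb only low-frequency tokens, and only while training
            noisy.append([x + rng.gauss(0.0, sigma) for x in vector])
        else:
            noisy.append(list(vector))
    return noisy


# Made-up tokens, frequencies, and embeddings
tokens = ["a", "man", "plays", "ukulele"]
counts = {"a": 500, "man": 120, "plays": 40, "ukulele": 1}
embeddings = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.2, 0.8]]

noisy_train = add_low_frequency_noise(embeddings, tokens, counts, training=True)
noisy_test = add_low_frequency_noise(embeddings, tokens, counts, training=False)
```

With `training=False` the embeddings pass through unchanged, mirroring the statement that only the Transformer parts are kept at test time.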

Environment

conda create -n RSFD python==3.7
conda activate RSFD
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
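
After installation, a quick sanity check along the following lines may help confirm the environment. The `check_environment` helper is a hypothetical convenience, not part of this repository; it only tests whether the interpreter version matches and whether the packages can be found, without importing them.

```python
import importlib.util
import sys


def check_environment(expected_python=(3, 7), packages=("torch", "torchvision")):
    """Report whether the interpreter and packages match the setup above."""
    report = {"python_ok": sys.version_info[:2] == expected_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(check_environment())
```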

Installation

git clone https://github.com/lzp870/RSFD.git
cd RSFD

Download datasets

Download the corpora and extracted features from BaiduYun (extraction code: RSFD) and organize them under VC_data/ as follows:

VC_data
├── MSRVTT
│   ├── feats
│   │   ├── image_resnet101_imagenet_fps_max60.hdf5
│   │   └── motion_resnext101_kinetics_duration16_overlap8.hdf5
│   ├── info_corpus.pkl
│   └── refs.pkl
└── Youtube2Text
    ├── feats
    │   ├── image_resnet101_imagenet_fps_max60.hdf
    │   └── motion_resnext101_kinetics_duration16_overlap8.hdf5
    ├── info_corpus.pkl
    └── refs.pkl
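
Before training, it can be useful to verify that the files are where the tree above places them. The sketch below is a hypothetical helper, not part of this repository; the file names are copied verbatim from the tree, and the default `VC_data` root is assumed to be relative to the current working directory.

```python
from pathlib import Path

# Relative paths copied verbatim from the directory tree above
EXPECTED_FILES = {
    "MSRVTT": [
        "feats/image_resnet101_imagenet_fps_max60.hdf5",
        "feats/motion_resnext101_kinetics_duration16_overlap8.hdf5",
        "info_corpus.pkl",
        "refs.pkl",
    ],
    "Youtube2Text": [
        "feats/image_resnet101_imagenet_fps_max60.hdf",
        "feats/motion_resnext101_kinetics_duration16_overlap8.hdf5",
        "info_corpus.pkl",
        "refs.pkl",
    ],
}


def missing_files(root="VC_data"):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [
        f"{dataset}/{rel}"
        for dataset, files in EXPECTED_FILES.items()
        for rel in files
        if not (root / dataset / rel).is_file()
    ]


for path in missing_files():
    print("missing:", path)
```

An empty result means every listed file was found.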

Training

python train.py --default --dataset MSRVTT --method ARB
python train.py --default --dataset MSVD --method ARB

Testing

python translate.py --default --dataset MSRVTT --method ARB
python translate.py --default --dataset MSVD --method ARB
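
To run training and testing for both datasets in sequence, a small driver script such as the one below could be used. It is a sketch, not part of this repository: it only reuses the scripts and flags shown above, and it assumes that `translate.py` picks up the checkpoint written by the matching `train.py` run. By default it prints the commands without executing them.

```python
import subprocess
import sys

DATASETS = ("MSRVTT", "MSVD")


def build_commands(datasets=DATASETS, method="ARB"):
    """Build the train and test commands listed above for each dataset."""
    commands = []
    for dataset in datasets:
        for script in ("train.py", "translate.py"):
            commands.append([
                sys.executable, script,
                "--default", "--dataset", dataset, "--method", method,
            ])
    return commands


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence if a step fails
            subprocess.run(command, check=True)


run_all(dry_run=True)
```

Call `run_all(dry_run=False)` from the repository root to actually launch the jobs.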

Citation

If our research and this repository are helpful to your work, please star this repo and cite the paper:

@inproceedings{DBLP:conf/aaai/ZhongLCJ0Y23,
  author       = {Xian Zhong and
                  Zipeng Li and
                  Shuqin Chen and
                  Kui Jiang and
                  Chen Chen and
                  Mang Ye},
  editor       = {Brian Williams and
                  Yiling Chen and
                  Jennifer Neville},
  title        = {Refined Semantic Enhancement towards Frequency Diffusion for Video
                  Captioning},
  booktitle    = {Thirty-Seventh {AAAI} Conference on Artificial Intelligence, {AAAI}
                  2023, Thirty-Fifth Conference on Innovative Applications of Artificial
                  Intelligence, {IAAI} 2023, Thirteenth Symposium on Educational Advances
                  in Artificial Intelligence, {EAAI} 2023, Washington, DC, USA, February
                  7-14, 2023},
  pages        = {3724--3732},
  publisher    = {{AAAI} Press},
  year         = {2023},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/25484},
  timestamp    = {Sun, 30 Jul 2023 19:22:30 +0200},
  biburl       = {https://dblp.org/rec/conf/aaai/ZhongLCJ0Y23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Acknowledgements

The encoder code is based on yangbang18/Non-Autoregressive-Video-Captioning.