PTSN: Progressive Tree-Structured Prototype Network

This repository contains the reference code for the ACM MM 2022 paper "Progressive Tree-Structured Prototype Network for End-to-End Image Captioning".

<img src="images/framework.png" alt="Progressive Tree-Structured Prototype Network" width="850"/> <a href="https://dl.acm.org/doi/abs/10.1145/3503161.3548024" target="_blank">Paper Access </a> |<a href="https://competitions.codalab.org/competitions/3221#results"> MSCOCO Leaderboard (TeamName:CMG) </a> | <a href="https://pan.baidu.com/s/1SSIj7HnFH79GzgERm9aDcA" target="_blank">Baidu Disk </a>

Environment setup

Python 3.6
PyTorch 1.7.1 (strongly recommended)
NumPy 1.19

Data preparation

To run this code, download the pre-trained vision backbones, the raw MSCOCO images, and the annotations. First create the following directories:

mkdir -p $DataPath/coco_caption/
mkdir -p $DataPath/resume_model/
mkdir -p $DataPath/saved_models/
mkdir -p PTSN/saved_transformer_models/

Pre-trained vision backbones:

Please download SwinT-B/16_22k_224x224 (password: swin) and put it in $DataPath/resume_model/. Other backbones (e.g. SwinT-L 384x384) can be downloaded from their official links.

Raw data:

Please download train2014.zip, val2014.zip and test2014.zip, then unzip them into $DataPath/coco_caption/IMAGE_COCO/.

Annotations:

Please download the annotations and put them in $DataPath/coco_caption/annotations/.

Other data:

Please download trained_models (password: ptsn). It includes word_embeds.pth, hyper_protos.pth, trained checkpoints and training logs. Put word_embeds.pth and hyper_protos.pth in PTSN/, and put the checkpoints in $DataPath/saved_models/.
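
Before running anything, it can help to confirm that the directories described above are in place. The following is a minimal sketch, not part of the repository: the function name `check_ptsn_layout` is hypothetical, and it only checks that the four data directories named in this section exist, not that their contents are complete.

```shell
# Hypothetical helper (not part of the PTSN repository).
# Reports which of the data directories described in this README are missing
# under the given data root. Returns 0 if all exist, 1 otherwise.
check_ptsn_layout() {
    data_path="$1"
    missing=0
    for dir in \
        "$data_path/coco_caption/IMAGE_COCO" \
        "$data_path/coco_caption/annotations" \
        "$data_path/resume_model" \
        "$data_path/saved_models"
    do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Calling `check_ptsn_layout "$DataPath"` prints one line per missing directory and prints nothing when the layout is complete.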

Inference procedure

To reproduce the results of our paper, do the following two steps:

Replace /path/to/data in ./test_ptsn.sh with $DataPath.
Run the commands below:

cd ./PTSN
sh test_ptsn.sh
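
The path substitution in the first step can also be scripted instead of edited by hand. This is a sketch under assumptions: the helper name `set_data_path` is hypothetical, and it assumes the placeholder appears literally as `/path/to/data` in the script and that the data path contains no `|` or `&` characters (both are special in the sed expression used here).

```shell
# Hypothetical helper (not part of the PTSN repository).
# Replaces every literal /path/to/data in a script with the given data path.
# It writes to a temporary file and moves it back, which avoids the
# differences between GNU and BSD "sed -i".
set_data_path() {
    script="$1"
    data_path="$2"
    sed "s|/path/to/data|$data_path|g" "$script" > "$script.tmp" \
        && mv "$script.tmp" "$script"
}
```

For example, `set_data_path ./PTSN/test_ptsn.sh "$DataPath"` rewrites the inference script in place; the same helper would apply to train_ptsn.sh.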

Training procedure

To train a Swin-B version of our PTSN model, do the following two steps:

Replace /path/to/data in ./train_ptsn.sh with $DataPath.
Run the commands below:

cd ./PTSN
sh train_ptsn.sh

Note that training this model takes around 50 hours on 4 V100 GPUs.
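
Since the reported setup uses 4 GPUs, a quick pre-flight check of the visible GPU count may save a failed launch. The sketch below is an assumption-laden illustration: the function names are hypothetical, it relies on `nvidia-smi -L` printing one `GPU <index>: ...` line per device, and whether train_ptsn.sh strictly requires 4 GPUs is not stated here.

```shell
# Hypothetical helpers (not part of the PTSN repository).
# count_gpus prints the number of GPUs listed by "nvidia-smi -L",
# or 0 if nvidia-smi is not installed.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # grep -c prints 0 and exits non-zero when nothing matches,
        # so "|| true" keeps the function's exit status at 0.
        nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
    else
        echo 0
    fi
}

# require_gpus N fails with a message if fewer than N GPUs are visible.
require_gpus() {
    needed="$1"
    found=$(count_gpus)
    if [ "$found" -lt "$needed" ]; then
        echo "found $found GPU(s), expected at least $needed" >&2
        return 1
    fi
}
```

A possible use is `require_gpus 4 && sh train_ptsn.sh`, which starts training only when at least four GPUs are visible.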

Citation

To cite our paper, please use the following BibTeX entry:

@inproceedings{PTSN,
  author    = {Pengpeng Zeng and
               Jinkuan Zhu and
               Jingkuan Song and
               Lianli Gao},
  title     = {Progressive Tree-Structured Prototype Network for End-to-End Image
               Captioning},
  booktitle = {ACM MM},
  pages     = {5210--5218},
  year      = {2022},
}