<div align="center"> <h1>PMA-Net: Prototypical Memory Attention Network<br>(ICCV 2023)</h1> </div>

This repository contains the reference code for the paper *With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning*.

Please cite with the following BibTeX:

```bibtex
@inproceedings{sarto2023positive,
  title={{With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning}},
  author={Barraco, Manuele and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}
```
<p align="center"> <img src="model_pma_net.png" alt="PMA-Net" width="820" /> </p>

## Environment Setup

Clone the repository and create the `pma-net` conda environment using the `environment.yml` file:

```bash
conda env create -f environment.yml
conda activate pma-net
```

**Note:** Python 3.9 is required to run our code.
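
As an optional sanity check (our suggestion, not an official step), you can confirm that the environment resolved to the expected interpreter and that PyTorch, which `torchrun` relies on below, imports correctly:

```bash
# Optional check that the pma-net environment is usable (not an official step).
python --version                                    # should report Python 3.9.x
python -c "import torch; print(torch.__version__)"  # torch is required by torchrun below
```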

## Data Preparation

### Checkpoints

XE and SCST checkpoints are available at the following links:

| Model | Checkpoint |
| --- | --- |
| PMA-Net XE | `pma-net_xe.tar` |
| PMA-Net SCST | `pma-net_scst.tar` |

Download and extract them into a folder of your choice. This path will be passed later as the `{CHECKPOINT_FOLDER}` argument.
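
For example, assuming both archives have already been downloaded to the current directory (the destination path below is only a placeholder):

```bash
# Extract both checkpoints into a single folder; this folder is later used as {CHECKPOINT_FOLDER}.
mkdir -p /path/to/checkpoints
tar -xf pma-net_xe.tar -C /path/to/checkpoints
tar -xf pma-net_scst.tar -C /path/to/checkpoints
```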

### Dataset

To run the code, the COCO dataset annotations are needed. Please download the zip file containing the annotations (`annotations.zip`), extract it, and place its contents under the `datasets/annotations` folder.
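
For instance (assuming the archive unpacks directly into the annotation files; adjust the target folder if it already contains a top-level `annotations/` directory):

```bash
# Place the COCO annotations where the code expects them.
mkdir -p datasets/annotations
unzip annotations.zip -d datasets/annotations
```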

To train and test our model, download the tar files containing the COCO image features already extracted with CLIP ViT-L/14, available at the following links:

| Split | File |
| --- | --- |
| COCO Training (chunk 0) | `coco_training_CLIP-ViT-L14_cached_0.tar` |
| COCO Training (chunk 1) | `coco_training_CLIP-ViT-L14_cached_1.tar` |
| COCO Training (chunk 2) | `coco_training_CLIP-ViT-L14_cached_2.tar` |
| COCO Training (chunk 3) | `coco_training_CLIP-ViT-L14_cached_3.tar` |
| COCO Training (chunk 4) | `coco_training_CLIP-ViT-L14_cached_4.tar` |
| COCO Training (chunk 5) | `coco_training_CLIP-ViT-L14_cached_5.tar` |
| COCO Training for SCST | `coco_training_dict_CLIP-ViT-L14_cached.tar` |
| COCO Validation | `coco_validation_dict_CLIP-ViT-L14_cached.tar` |
| COCO Test | `coco_test_dict_CLIP-ViT-L14_cached.tar` |

Once the files have been downloaded and extracted into a single folder, set the corresponding path in `configs/datasets/datasets.json`.
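
A possible way to do this (the destination path is a placeholder, and the glob assumes all archives sit in the current directory):

```bash
# Extract every feature archive into one folder, then point configs/datasets/datasets.json to it.
mkdir -p /path/to/coco_features
for f in coco_*_CLIP-ViT-L14_cached*.tar; do
  tar -xf "$f" -C /path/to/coco_features
done
```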

The corresponding dataset names (e.g., `coco_test_dict_CLIP-ViT-L14_cached`) will be passed as arguments later.

## Evaluation

To evaluate our best model, use:

```bash
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 --per_device_eval_batch_size {EVAL_BATCH_SIZE} --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --resume_from_checkpoint {CHECKPOINT_FOLDER}
```
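
For reference, a possible single-GPU invocation with the placeholders filled in (the port, batch size, and paths below are illustrative values, not the ones used in the paper):

```bash
# Single-GPU evaluation with example values substituted for the placeholders.
torchrun --nproc_per_node 1 --master_port 29500 main.py \
  --do_eval --do_predict --predict_with_generate \
  --output_dir ./outputs/pma-net_eval \
  --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached \
  --test_datasets coco_test_dict_CLIP-ViT-L14_cached \
  --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 \
  --per_device_eval_batch_size 16 \
  --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 \
  --deque_iters 1500 --window 0.25 \
  --resume_from_checkpoint /path/to/checkpoints
```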

## Training Procedure

To train our best model with the parameters used in our experiments, use:

```bash
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --custom_lr_scheduler CustomScheduler --steps_min 15000 --start_decreasing_steps 10000 --learning_rate 2.5e-4 --warmup_steps 1000 --lr_min 1e-5 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_lamb_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25
```
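
The effective batch size of this run is the product of the number of GPUs, the per-device batch size, and `--gradient_accumulation_steps`. As a quick check, with purely illustrative placeholder values (ours, not the paper's exact setup):

```bash
# Illustrative placeholder values only.
N_GPUS=4
TRAIN_BATCH_SIZE=8
GRAD_ACCUM=8   # matches --gradient_accumulation_steps in the command above
echo $(( N_GPUS * TRAIN_BATCH_SIZE * GRAD_ACCUM ))   # effective batch size: 256
```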

After XE pre-training, for the SCST step use:

```bash
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_training_dict_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --steps_min 15000 --learning_rate 5e-6 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_adam_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --scst --resume_from_checkpoint {CHECKPOINT_FOLDER}
```
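
If the XE run saves its checkpoints with the usual `checkpoint-<step>` naming (an assumption on our side about the trainer's output layout), `--resume_from_checkpoint` can point directly at the best XE checkpoint, for example:

```bash
# Hypothetical example: resume SCST fine-tuning from a checkpoint saved by the XE run above.
# The checkpoint-<step> directory naming is an assumption about the trainer's output layout.
XE_BEST={OUTPUT_DIR}/checkpoint-20000   # e.g. the step with the best validation score
# Then pass it to the SCST command above as: --resume_from_checkpoint "$XE_BEST"
```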

## Custom Arguments

The complete list of custom arguments supported by our code:

| Argument | Description |
| --- | --- |
| `--encoder` | Add a BERT encoder. |
| `--n_layer` | Number of layers. |
| `--n_embd` | Embedding dimension. |
| `--n_head` | Number of heads. |
| `--custom_checkpoint_keeper` | How many checkpoints to keep on disk, default is 5. |
| `--scst` | Use the SCST phase. |
| `--train_datasets` | Training datasets, default is `coco_training`. |
| `--validation_datasets` | Validation datasets, default is `coco_validation_dict`. |
| `--test_datasets` | Test datasets, default is `coco_test_dict`. |
| `--scst_datasets` | SCST datasets, default is `coco_training_dict`. |
| `--custom_lr_scheduler` | Which custom scheduler to use (`CustomScheduler`, `TransformerScheduler`), default is None. |
| `--lr_multiplier` | Learning rate multiplier, default is 1.0. |
| `--steps_min` | Only used with `CustomScheduler`. |
| `--lr_min` | Only used with `CustomScheduler`. |
| `--start_decreasing_steps` | Only used with `CustomScheduler`. |
| `--add_memory_slots_selfattn` | Add memory slots in the self-attention blocks. |
| `--n_memory_slots` | Number of memory slots, default is 64. |
| `--freeze_memory` | Freeze the memories. |
| `--kmeans_memory` | Compute the memories using k-means. |
| `--deque_iters` | Maximum number of iterations of data kept in the deque, default is 10. |
| `--window` | Overlap window of new data, default is None. |
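
For example, a hypothetical lighter-memory variant of the XE training run above would only change the memory-related flags (the values below are illustrative and are not configurations from the paper):

```bash
# Hypothetical lighter-memory training configuration (illustrative values only):
# fewer memory prototypes and a shorter activation deque than the best model.
MEMORY_FLAGS="--encoder --kmeans_memory --add_memory_slots_selfattn \
  --n_memory_slots 256 --deque_iters 500 --window 0.25"

# Substitute $MEMORY_FLAGS for the corresponding flags in the XE training command above, e.g.:
#   torchrun ... main.py --do_train ... $MEMORY_FLAGS
```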