<div align="center">
  <h1>PMA-Net: Prototypical Memory Attention Network<br>(ICCV 2023)</h1>
</div>

This repository contains the reference code for the paper *With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning*.
Please cite with the following BibTeX:
```
@inproceedings{sarto2023positive,
  title={{With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning}},
  author={Barraco, Manuele and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}
```
<p align="center">
<img src="model_pma_net.png" alt="PMA-Net" width="820" />
</p>
## Environment Setup

Clone the repository and create the `pma-net` conda environment using the `environment.yml` file:

```
conda env create -f environment.yml
conda activate pma-net
```

**Note:** Python 3.9 is required to run our code.
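Once the environment is active, a quick sanity check can confirm the interpreter version and that PyTorch sees your GPUs (a minimal sketch; the exact package versions are pinned by `environment.yml`):

```
# Should report Python 3.9.x
python --version
# Should print the installed PyTorch version and True if a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```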
## Data Preparation

### Checkpoints

XE and SCST checkpoints are available at the following links:

| Model | Checkpoint |
| --- | --- |
| PMA-Net XE | `pma-net_xe.tar` |
| PMA-Net SCST | `pma-net_scst.tar` |
Download the archives, extract them, and place them in a folder of your choice. This path will be passed later as the `{CHECKPOINT_FOLDER}` argument.
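A possible way to unpack them (an illustrative sketch; `checkpoints/` is an arbitrary folder name, and the archives may contain their own subdirectories, so adjust paths as needed):

```
mkdir -p checkpoints
tar -xf pma-net_xe.tar -C checkpoints
tar -xf pma-net_scst.tar -C checkpoints
```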
### Dataset

To run the code, annotations for the COCO dataset are needed. Please download the zip file containing the annotations (`annotations.zip`), extract it, and place its contents under the `datasets/annotations` folder.
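For example (assuming `annotations.zip` has been downloaded to the repository root; depending on how the archive is laid out, you may need to move its contents so the annotation files sit directly under `datasets/annotations`):

```
mkdir -p datasets/annotations
unzip annotations.zip -d datasets/annotations
```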
To train and test our model, download the tar files containing the COCO image features, already extracted with CLIP ViT-L/14, at the following links:

| Split | File |
| --- | --- |
| COCO Training (chunk 0) | `coco_training_CLIP-ViT-L14_cached_0.tar` |
| COCO Training (chunk 1) | `coco_training_CLIP-ViT-L14_cached_1.tar` |
| COCO Training (chunk 2) | `coco_training_CLIP-ViT-L14_cached_2.tar` |
| COCO Training (chunk 3) | `coco_training_CLIP-ViT-L14_cached_3.tar` |
| COCO Training (chunk 4) | `coco_training_CLIP-ViT-L14_cached_4.tar` |
| COCO Training (chunk 5) | `coco_training_CLIP-ViT-L14_cached_5.tar` |
| COCO Training for SCST | `coco_training_dict_CLIP-ViT-L14_cached.tar` |
| COCO Validation | `coco_validation_dict_CLIP-ViT-L14_cached.tar` |
| COCO Test | `coco_test_dict_CLIP-ViT-L14_cached.tar` |
Once the files have been downloaded and extracted into a single folder, set the corresponding paths in `configs/datasets/datasets.json`. These paths will be passed as arguments later.
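One possible way to unpack all feature archives into a single folder (illustrative only; `datasets/features` is an arbitrary choice, what matters is that `configs/datasets/datasets.json` points to the directory you actually use):

```
mkdir -p datasets/features
for f in coco_*_CLIP-ViT-L14_cached*.tar; do
    tar -xf "$f" -C datasets/features
done
```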
## Evaluation

To evaluate our best model, use:

```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 --per_device_eval_batch_size {EVAL_BATCH_SIZE} --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --resume_from_checkpoint {CHECKPOINT_FOLDER}
```
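For reference, a single-node run with the placeholders filled in could look as follows (the GPU count, port, batch size, and paths are illustrative values only; adjust them to your setup):

```
torchrun --nproc_per_node 1 --master_port 29500 main.py --do_eval --do_predict --predict_with_generate --output_dir ./outputs/pma-net_eval --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 --per_device_eval_batch_size 16 --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --resume_from_checkpoint ./checkpoints/pma-net_scst
```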
## Training Procedure

To train our best model with the parameters used in our experiments, use:

```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --custom_lr_scheduler CustomScheduler --steps_min 15000 --start_decreasing_steps 10000 --learning_rate 2.5e-4 --warmup_steps 1000 --lr_min 1e-5 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_lamb_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25
```
After XE pre-training, for the SCST step use:
```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_training_dict_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --steps_min 15000 --learning_rate 5e-6 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_adam_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --scst --resume_from_checkpoint {CHECKPOINT_FOLDER}
```
## Custom Arguments

The complete list of custom arguments supported by our code:

| Argument | Description |
| --- | --- |
| `--encoder` | Add a BERT encoder. |
| `--n_layer` | Number of layers. |
| `--n_embd` | Embedding dimension. |
| `--n_head` | Number of attention heads. |
| `--custom_checkpoint_keeper` | How many checkpoints to keep on disk, default is `5`. |
| `--scst` | Use the SCST phase. |
| `--train_datasets` | Training datasets, default is `coco_training`. |
| `--validation_datasets` | Validation datasets, default is `coco_validation_dict`. |
| `--test_datasets` | Test datasets, default is `coco_test_dict`. |
| `--scst_datasets` | SCST datasets, default is `coco_training_dict`. |
| `--custom_lr_scheduler` | Which custom scheduler to use (`CustomScheduler`, `TransformerScheduler`), default is `None`. |
| `--lr_multiplier` | Learning rate multiplier, default is `1.0`. |
| `--steps_min` | Only used with `CustomScheduler`. |
| `--lr_min` | Only used with `CustomScheduler`. |
| `--start_decreasing_steps` | Only used with `CustomScheduler`. |
| `--add_memory_slots_selfattn` | Add memory slots in the self-attention blocks. |
| `--n_memory_slots` | Number of memory slots, default is `64`. |
| `--freeze_memory` | Freeze the memories. |
| `--kmeans_memory` | Compute the memories using k-means. |
| `--deque_iters` | Maximum number of iterations for which data is kept in the deque, default is `10`. |
| `--window` | Overlap window of new data, default is `None`. |
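The remaining flags in the commands above (e.g. `--output_dir`, `--learning_rate`, `--per_device_train_batch_size`) are standard HuggingFace `TrainingArguments`. Assuming the script follows the usual argparse/`HfArgumentParser` convention, the full list of options, custom and standard, can be printed with:

```
python main.py --help
```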