<div align="center">
  <h1>PMA-Net: Prototypical Memory Attention Network<br>(ICCV 2023)</h1>
</div>

This repository contains the reference code for the paper *With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning*.
Please cite with the following BibTeX:
```
@inproceedings{sarto2023positive,
  title={{With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning}},
  author={Barraco, Manuele and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}
```
<p align="center">
<img src="model_pma_net.png" alt="PMA-Net" width="820" />
</p>
## Environment Setup

Clone the repository and create the `pma-net` conda environment using the `environment.yml` file:

```
conda env create -f environment.yml
conda activate pma-net
```

**Note:** Python 3.9 is required to run our code.
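Once the environment is active, a quick sanity check can confirm the interpreter version and that PyTorch sees your GPUs (a minimal sketch; the exact package versions are pinned by `environment.yml`):

```
# Should report Python 3.9.x
python --version
# Should print the installed PyTorch version and True if a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```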
## Data Preparation

### Checkpoints

XE and SCST checkpoints are available at the following links:

| Model | Checkpoint |
| --- | --- |
| PMA-Net XE | `pma-net_xe.tar` |
| PMA-Net SCST | `pma-net_scst.tar` |
Download the archives, extract them, and place them in a folder of your choice. This path will be passed later as the `{CHECKPOINT_FOLDER}` argument.
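A possible way to unpack them (an illustrative sketch; `checkpoints/` is an arbitrary folder name, and the archives may contain their own subdirectories, so adjust paths as needed):

```
mkdir -p checkpoints
tar -xf pma-net_xe.tar -C checkpoints
tar -xf pma-net_scst.tar -C checkpoints
```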
### Dataset

To run the code, annotations for the COCO dataset are needed. Please download the zip file containing the annotations (`annotations.zip`), extract it, and place its contents under the `datasets/annotations` folder.
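For example (assuming `annotations.zip` has been downloaded to the repository root; depending on how the archive is laid out, you may need to move its contents so the annotation files sit directly under `datasets/annotations`):

```
mkdir -p datasets/annotations
unzip annotations.zip -d datasets/annotations
```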
To train and test our model, download the tar files containing the COCO image features, already extracted with CLIP ViT-L/14, at the following links:

| Split | File |
| --- | --- |
| COCO Training (chunk 0) | `coco_training_CLIP-ViT-L14_cached_0.tar` |
| COCO Training (chunk 1) | `coco_training_CLIP-ViT-L14_cached_1.tar` |
| COCO Training (chunk 2) | `coco_training_CLIP-ViT-L14_cached_2.tar` |
| COCO Training (chunk 3) | `coco_training_CLIP-ViT-L14_cached_3.tar` |
| COCO Training (chunk 4) | `coco_training_CLIP-ViT-L14_cached_4.tar` |
| COCO Training (chunk 5) | `coco_training_CLIP-ViT-L14_cached_5.tar` |
| COCO Training for SCST | `coco_training_dict_CLIP-ViT-L14_cached.tar` |
| COCO Validation | `coco_validation_dict_CLIP-ViT-L14_cached.tar` |
| COCO Test | `coco_test_dict_CLIP-ViT-L14_cached.tar` |
Once the files have been downloaded and extracted into a single folder, set the corresponding paths in `configs/datasets/datasets.json`. These paths will be passed as arguments later.
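One possible way to unpack all feature archives into a single folder (illustrative only; `datasets/features` is an arbitrary choice, what matters is that `configs/datasets/datasets.json` points to the directory you actually use):

```
mkdir -p datasets/features
for f in coco_*_CLIP-ViT-L14_cached*.tar; do
    tar -xf "$f" -C datasets/features
done
```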
## Evaluation

To evaluate our best model, use:

```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 --per_device_eval_batch_size {EVAL_BATCH_SIZE} --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --resume_from_checkpoint {CHECKPOINT_FOLDER}
```
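For reference, a single-node run with the placeholders filled in could look as follows (the GPU count, port, batch size, and paths are illustrative values only; adjust them to your setup):

```
torchrun --nproc_per_node 1 --master_port 29500 main.py --do_eval --do_predict --predict_with_generate --output_dir ./outputs/pma-net_eval --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 --per_device_eval_batch_size 16 --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --resume_from_checkpoint ./checkpoints/pma-net_scst
```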
## Training Procedure

To train our best model with the parameters used in our experiments, use:

```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --custom_lr_scheduler CustomScheduler --steps_min 15000 --start_decreasing_steps 10000 --learning_rate 2.5e-4 --warmup_steps 1000 --lr_min 1e-5 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_lamb_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25
```
After XE pre-training, for the SCST step use:
```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_training_dict_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --steps_min 15000 --learning_rate 5e-6 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_adam_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --scst --resume_from_checkpoint {CHECKPOINT_FOLDER}
```
## Custom Arguments

The complete list of custom arguments supported by our code:

| Argument | Description |
| --- | --- |
| `--encoder` | Add a BERT encoder. |
| `--n_layer` | Number of layers. |
| `--n_embd` | Embedding dimension. |
| `--n_head` | Number of attention heads. |
| `--custom_checkpoint_keeper` | How many checkpoints to keep on disk, default is `5`. |
| `--scst` | Use the SCST phase. |
| `--train_datasets` | Training datasets, default is `coco_training`. |
| `--validation_datasets` | Validation datasets, default is `coco_validation_dict`. |
| `--test_datasets` | Test datasets, default is `coco_test_dict`. |
| `--scst_datasets` | SCST datasets, default is `coco_training_dict`. |
| `--custom_lr_scheduler` | Which custom scheduler to use (`CustomScheduler`, `TransformerScheduler`), default is `None`. |
| `--lr_multiplier` | Learning rate multiplier, default is `1.0`. |
| `--steps_min` | Only used with `CustomScheduler`. |
| `--lr_min` | Only used with `CustomScheduler`. |
| `--start_decreasing_steps` | Only used with `CustomScheduler`. |
| `--add_memory_slots_selfattn` | Add memory slots in the self-attention blocks. |
| `--n_memory_slots` | Number of memory slots, default is `64`. |
| `--freeze_memory` | Freeze the memories. |
| `--kmeans_memory` | Compute the memories using k-means. |
| `--deque_iters` | Maximum number of iterations for which data is kept in the deque, default is `10`. |
| `--window` | Overlap window of new data, default is `None`. |
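The remaining flags in the commands above (e.g. `--output_dir`, `--learning_rate`, `--per_device_train_batch_size`) are standard HuggingFace `TrainingArguments`. Assuming the script follows the usual argparse/`HfArgumentParser` convention, the full list of options, custom and standard, can be printed with:

```
python main.py --help
```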