A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

In this repo, we provide the code and pretrained models for the paper "A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval", accepted for presentation at the 30th ACM International Conference on Multimedia (ACM MM 2022).

Figure: Overview of the proposed multimodal data augmentation technique working on latent representations.

Python environment

Requirements: python 3, allennlp 2.8.0, h5py 3.6.0, pandas 1.3.5, spacy 2.3.5, torch 1.7.0 (also tested with 1.8), numpy 1.23.1, tensorboard, tqdm
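
The dependencies can be installed with pip, for example as follows (a minimal sketch, assuming a fresh Python 3 virtual environment; the environment name below is just an example):

# create and activate a virtual environment (the name is arbitrary)
python3 -m venv fsmmda_env
source fsmmda_env/bin/activate
# install the versions listed above
pip install allennlp==2.8.0 h5py==3.6.0 pandas==1.3.5 spacy==2.3.5 torch==1.7.0 numpy==1.23.1 tensorboard tqdm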

# clone the repository, then move into it
cd FSMMDA_VideoRetrieval
# add the repository root to PYTHONPATH so that the t2vretrieval package can be imported
export PYTHONPATH=$(pwd):${PYTHONPATH}

Data

Training

To launch a training run, first select a configuration file (e.g., prepare_mlmatch_configs_EK100_TBN_augmented_VidTxtLate.py) and execute the following:

python t2vretrieval/driver/configs/prepare_mlmatch_configs_EK100_TBN_augmented_VidTxtLate.py .

This returns a folder name (where the config, models, logs, etc. will be saved). Let that folder be $resdir. Then execute the following to start the training:

python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --is_train --load_video_first --resume_file glove_checkpoint_path
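
For convenience, the two steps can also be chained in the shell. The snippet below is only a sketch: it assumes the prepare script prints the results folder path as its last output line (otherwise, copy the printed folder name manually), and glove_checkpoint_path is the same placeholder as above.

# run the prepare script and keep its last printed line as the results folder
# (assumption: the folder path is printed last; otherwise set resdir by hand)
resdir=$(python t2vretrieval/driver/configs/prepare_mlmatch_configs_EK100_TBN_augmented_VidTxtLate.py . | tail -n 1)

# start training from the GloVe-initialized checkpoint (placeholder path as above)
python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json \
  --is_train --load_video_first --resume_file glove_checkpoint_path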

About the config files

Config files define the details of the model and the paths to the annotations, features, etc. Running a "prepare_*" script creates a folder containing two .json files: model.json (model details) and path.json (data paths).
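
As an illustration, after running a prepare script and a training run, the results folder looks roughly as follows (a sketch: the model/ subfolder matches the checkpoint paths used below, while the log/ folder name is an assumption):

$resdir/
├── model.json   # model configuration
├── path.json    # paths to annotations, features, etc.
├── model/       # checkpoints saved during training, e.g. epoch.42.th
└── log/         # training logs (folder name is an assumption)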

Evaluating

To automatically look for the best checkpoint after a training run and evaluate it on the test set:

python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --eval_set tst

To evaluate one of the provided checkpoints:

python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --eval_set tst --resume_file checkpoint.th

For instance, unzipping the archive of the augmented HGR model trained on EPIC-Kitchens-100 yields the following folder:

results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep/

Therefore, the evaluation can be done by running the following:

python t2vretrieval/driver/multilevel_match.py \
  results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep/model.json \
  results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep/path.json \
  --eval_set tst \
  --resume_file results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep/model/epoch.42.th
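
Since the experiment folder name is long, it can be convenient to store it in a shell variable first; the following is equivalent to the command above:

# store the experiment folder in a variable to avoid repeating the long path
expdir=results/RET.released/mlmatch/ek100_TBN_aug0.5_VcTLate_thrPos0.15_mPos0.2_m0.2.vis.TBN.pth.txt.bigru.16role.gcn.1L.attn.1024.loss.bi.af.embed.4.glove.init.50ep

python t2vretrieval/driver/multilevel_match.py $expdir/model.json $expdir/path.json \
  --eval_set tst \
  --resume_file $expdir/model/epoch.42.th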

Pretrained models

On EPIC-Kitchens-100:

On YouCook2:

Acknowledgements

We thank the authors of Chen et al. (CVPR 2020), Wray et al. (ICCV 2019), Wray et al. (CVPR 2021), and Falcon et al. (ICIAP 2022) for publicly releasing their codebases.

Citations

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@inproceedings{falcon2022fsmmda,
  title={A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval},
  author={Falcon, Alex and Serra, Giuseppe and Lanz, Oswald},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022}
}

License

MIT License