HANet: Hierarchical Alignment Networks for Video-Text Retrieval (ACMMM 2021)

This repository is the PyTorch implementation of our paper HANet.

Overview of HANet Model

Prerequisites

Python 3 and PyTorch 1.6.

# clone the repository
git clone git@github.com:Roc-Ng/HANet.git
cd HANet
export PYTHONPATH=$(pwd):${PYTHONPATH}

Datasets

The provided annotations and pretrained features for the MSR-VTT and VATEX video captioning datasets can be downloaded from OneDrive (code: hanet) or Baidu Netdisk (code: d2gr).

Annotations

  1. noun_gt.json; verb_gt.json; noun_gt_all.json; verb_gt_all.json: ground-truth concept labels for nouns and verbs
  2. ref_captions.json: dict, {videoname: [sent]}
  3. sent2rolegraph.augment.json: {sent: (graph_nodes, graph_edges)}
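For reference, a minimal sketch of how these annotation files can be inspected (paths are placeholders, assuming the dict layouts described above):

import json

# ref_captions.json: {videoname: [sent, ...]}
with open('ref_captions.json') as f:
    ref_captions = json.load(f)
videoname, sents = next(iter(ref_captions.items()))
print(videoname, sents[0])

# sent2rolegraph.augment.json: {sent: (graph_nodes, graph_edges)}
with open('sent2rolegraph.augment.json') as f:
    sent2rolegraph = json.load(f)
graph_nodes, graph_edges = sent2rolegraph[sents[0]]
print(len(graph_nodes), 'nodes,', len(graph_edges), 'edges')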

Features

ResNet-152 and I3D features are used for MSR-VTT and VATEX, respectively. These features were extracted by the authors of HGR and VATEX; thanks for their wonderful work!

Feature files come in two formats:

  1. np array, shape=(num_fts, dim_ft), with rows in the same order as the names in the data_split file
  2. hdf5 file, {name: ft}, with ft.shape=(num_frames, dim_ft)
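As a rough loading sketch (the file names below are placeholders, not the released file names):

import numpy as np
import h5py

# np array format: one matrix whose rows follow the order of the data_split name file
fts = np.load('video_fts.npy')            # shape (num_fts, dim_ft)

# hdf5 format: one dataset per video name
with h5py.File('video_fts.hdf5', 'r') as h5:
    ft = h5['some_video_name'][...]       # shape (num_frames, dim_ft)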

Training & Inference

Concept Vocabulary

We provide the concept vocabulary. If you want to generate concept vocabularies for new datasets, follow the steps below (a toy sketch of the procedure appears after the commands).

  1. generate concepts and compute frequencies:
cd data
python concept_frequency.py ref_caption_file trn_name_file
  2. generate concept labels:
python make_gt.py trn_name_file
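For intuition, here is a toy sketch of the two steps above (whitespace tokenization, the top-k cutoff, and the trn_names.json file format are assumptions; the released scripts restrict concepts to nouns/verbs and use their own file formats):

import json
from collections import Counter

ref_captions = json.load(open('ref_captions.json'))   # {videoname: [sent]}
trn_names = json.load(open('trn_names.json'))         # assumed: list of training video names

# step 1: concept frequencies over the training captions
counter = Counter()
for name in trn_names:
    for sent in ref_captions.get(name, []):
        counter.update(sent.lower().split())
concepts = [w for w, _ in counter.most_common(512)]   # vocabulary size is an assumption
concept2idx = {w: i for i, w in enumerate(concepts)}

# step 2: per-video concept labels (indices of concepts mentioned in any caption)
labels = {}
for name in trn_names:
    hit = {concept2idx[w] for sent in ref_captions.get(name, [])
           for w in sent.lower().split() if w in concept2idx}
    labels[name] = sorted(hit)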

Semantic Graph Construction

We provide the constructed role graph annotations. If you want to generate role graphs for new datasets, follow the steps below (an illustrative sketch appears after the commands).

  1. semantic role labeling:
python misc/semantic_role_labeling.py ref_caption_file out_file --cuda_device 0
  2. convert sentences into role graphs:
cd misc
jupyter notebook
# open parse_sent_to_role_graph.ipynb
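For step 1, a minimal AllenNLP-style sketch is shown below; the exact model, arguments, and output handling used by misc/semantic_role_labeling.py may differ, so treat this only as an illustration:

from allennlp.predictors.predictor import Predictor

# load a pretrained SRL model (URL is an example public model, not necessarily the one used here)
predictor = Predictor.from_path(
    'https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz',
    cuda_device=0)

result = predictor.predict(sentence='a man is slicing a tomato on a cutting board')
for verb in result['verbs']:
    print(verb['verb'], verb['description'])   # per-predicate roles in BIO-tagged form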

Training and Evaluation

MSR-VTT

# setup config files
# modify the data paths in data/msrvtt/model.json and data/msrvtt/path.json


cd t2vretrieval/driver

# training
python multilevel_match.py ../../data/msrvtt/model.json ../../data/msrvtt/path.json --load_video_first --is_train --resume_file ../../data/msrvtt/word_embeds.glove32b.th

# inference
python multilevel_match.py ../../data/msrvtt/model.json ../../data/msrvtt/path.json --load_video_first --eval_set tst

VATEX

# setup config files
# modify the data paths in data/vatex/model.json and data/vatex/path.json


cd t2vretrieval/driver

# training
python multilevel_match.py ../../data/vatex/model.json ../../data/vatex/path.json --load_video_first --is_train --resume_file ../../data/vatex/word_embeds.glove42b.th

# inference
python multilevel_match.py ../../data/vatex/model.json ../../data/vatex/path.json --load_video_first --eval_set tst

Citations

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@article{wu2021hanet,
  title={HANet: Hierarchical Alignment Networks for Video-Text Retrieval},
  author={Wu, Peng and He, Xiangteng and Tang, Mingqian and Lv, Yiliang and Liu, Jing},
  journal={arXiv preprint arXiv:2107.12059},
  year={2021}
}

Acknowledgements

Our code is based on the implementation of HGR (CVPR 2020).