Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning
Source code of our paper "Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning" (ACM Multimedia 2022).
Table of Contents
- Environments
- Required Data
- NRCCR on VATEX
- NRCCR on MSR-VTT-CN
- NRCCR on Multi-30K
- NRCCR on MSCOCO
- Reference
Environments
- CUDA 11.3
- Python 3.8.5
- PyTorch 1.10.2
We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install the required packages.
conda create --name nrccr_env python=3.8.5
conda activate nrccr_env
git clone https://github.com/LiJiaBei-7/nrccr.git
cd nrccr
pip install -r requirements.txt
conda deactivate
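With nrccr_env activated, a quick sanity check (a minimal sketch of ours, not part of the repo) confirms the versions listed above:

```python
# Verify that the interpreter, PyTorch build, and CUDA runtime match the
# versions listed above (Python 3.8.5, PyTorch 1.10.2, CUDA 11.3).
import sys
import torch

print("Python :", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA   :", torch.version.cuda, "| available:", torch.cuda.is_available())
```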
Required Data
We use three public datasets: VATEX, MSR-VTT-CN, and Multi-30K. The extracted features are placed in $HOME/VisualSearch/.
For Multi-30K, we provide translated versions (from Google Translate) of Task 1 and Task 2. (Task 1 is the translation task; Task 2 is the captioning task.)
In addition, we also provide the MSCOCO dataset (captions linked in the table below), with the corresponding performance reported further down. The Japanese validation and test sets come from STAIR Captions, and the Chinese ones from COCO-CN.
Training set:
source(en) + translation(en2xx) + back-translation(en2xx2en)
Validation set and test set:
target(xx) + translation(xx2en)
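To make this composition concrete, here is a purely illustrative sketch of how one training sample lines up its three text views (the `translate` function is a hypothetical stand-in for the Google Translate calls used to build the released caption files):

```python
# Illustrative only: the three text views of one training sample.
# `translate(text, src, tgt)` is a hypothetical stand-in for the
# Google Translate calls used to produce the released caption files.

def build_training_sample(en_caption, translate, target_lang="zh"):
    translation = translate(en_caption, src="en", tgt=target_lang)        # en2xx
    back_translation = translate(translation, src=target_lang, tgt="en")  # en2xx2en
    return {
        "source": en_caption,                  # source(en)
        "translation": translation,            # translation(en2xx)
        "back_translation": back_translation,  # back-translation(en2xx2en)
    }
```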
<table> <tr align="center"> <th>Dataset</th><th>Feature</th><th>Caption</th> </tr> <tr align='center'> <td>VATEX</td> <td><a href='https://pan.baidu.com/s/1lg23K93lVwgdYs5qnTuMFg?pwd=p3p0'>vatex-i3d.tar.gz, pwd:p3p0</a></td> <td><a href='https://www.aliyundrive.com/s/xDrzCDNEHWP'>vatex_caption, pwd:oy27</a></td> </tr> <tr align="center"> <td>MSR-VTT-CN</td> <td><a href='https://pan.baidu.com/s/1lg23K93lVwgdYs5qnTuMFg?pwd=p3p0'>msrvtt10k-resnext101_resnet152.tar.gz, pwd:p3p0</a></td> <td><a href='https://www.aliyundrive.com/s/3sBNJqfTxcp'>cn_caption, pwd:oy27</a></td> </tr> <tr align="center"> <td>Multi-30K</td> <td><a href='https://pan.baidu.com/s/1AzTN6rFyabirACVkVEVKCQ'>multi30k-resnet152.tar.gz, pwd:5khe</a></td> <td><a href='https://www.aliyundrive.com/s/zGEbQAvqHGy'>multi30k_caption, pwd:oy27</a></td> </tr> <tr align="center"> <td>MSCOCO</td> <td></td> <td><a href='https://www.aliyundrive.com/s/PxToUYryguz'>mscoco_caption, pwd:13dc</a></td> </tr> </table>

ROOTPATH=$HOME/VisualSearch
mkdir -p $ROOTPATH && cd $ROOTPATH
Organize these files as follows (a sketch for loading a FeatureData directory comes after the listing):
# download the data of VATEX[English, Chinese]
VisualSearch/VATEX/
FeatureData/
i3d_kinetics/
feature.bin
id.txt
shape.txt
video2frames.txt
TextData/
xx.txt
# download the data of MSR-VTT-CN[English, Chinese]
VisualSearch/msrvttcn/
FeatureData/
resnext101-resnet152/
feature.bin
id.txt
shape.txt
video2frames.txt
TextData/
xx.txt
# download the data of Multi-30K[English, German, French, Czech]
# For Task 2, the training set was translated from Flickr30K, which contains five captions per image; for Task 1, each image has a single caption.
# The French and Czech validation and test sets are the same in both tasks.
VisualSearch/multi30k/
FeatureData/
train_id.txt
val_id.txt
test_id_2016.txt
resnet_152[optional]/
train-resnet_152-avgpool.npy
val-resnet_152-avgpool.npy
test_2016_flickr-resnet_152-avgpool.npy
TextData/
xx.txt
flickr30k-images/
xx.jpg
# download the data of MSCOCO[English, Chinese, Japanese]
VisualSearch/mscoco/
FeatureData/
train_id.txt
ja_val_id.txt
zh_val_id.txt
ja_test_id.txt
zh_test_id.txt
TextData/
xx.txt
all_pics/
xx.jpg
image_ids.txt
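The FeatureData layout above can be read as follows. This is a minimal sketch that assumes feature.bin stores raw float32 vectors with the counts given in shape.txt (our reading of how the released features are packaged; adjust if your copy differs):

```python
# Minimal sketch: load a FeatureData directory (feature.bin / id.txt / shape.txt).
# Assumes feature.bin is raw float32 with the dimensions listed in shape.txt.
import os
import numpy as np

def load_features(feat_dir):
    with open(os.path.join(feat_dir, "shape.txt")) as f:
        n, dim = map(int, f.read().split())          # number of items, feature dim
    with open(os.path.join(feat_dir, "id.txt")) as f:
        ids = f.read().split()                       # one id per feature row
    feats = np.fromfile(os.path.join(feat_dir, "feature.bin"),
                        dtype=np.float32).reshape(n, dim)
    return ids, feats

ids, feats = load_features(
    os.path.expanduser("~/VisualSearch/VATEX/FeatureData/i3d_kinetics"))
print(len(ids), feats.shape)
```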
NRCCR on VATEX
Model Training and Evaluation
Run the following script to train and evaluate the NRCCR network. It trains the network and selects the checkpoint that performs best on the validation set as the final model. Note that we only save the best-performing checkpoint, to save disk space.
ROOTPATH=$HOME/VisualSearch
conda activate nrccr_env
# To train the model on VATEX (the released feature is I3D)
# Template:
./do_all_vatex.sh $ROOTPATH <gpu-id>
# Example:
# Train NRCCR
./do_all_vatex.sh $ROOTPATH 0
<gpu-id> is the index of the GPU to train on.
Evaluation using Provided Checkpoints
Download the trained checkpoint on VATEX from Baidu pan (url, pwd: ise6) and run the following script to evaluate it.
ROOTPATH=$HOME/VisualSearch/
tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH
./do_test_vatex.sh $ROOTPATH $MODELDIR <gpu-id>
# $MODELDIR is the checkpoint directory, e.g., $ROOTPATH/.../runs_0
Expected Performance
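All result tables report R@K (recall at rank K, in %), MedR (median rank of the ground-truth item), mAP, and SumR, the sum of the six R@1/5/10 scores over both retrieval directions. A minimal sketch of how the rank-based metrics can be computed, assuming a square similarity matrix whose diagonal holds the matching pairs (mAP generalizes this to multiple relevant items per query and is omitted here):

```python
# Minimal sketch of the rank-based metrics, assuming a square similarity
# matrix `sim` where sim[i, j] scores query i against item j and the
# ground-truth match of query i is item i (single-positive case).
import numpy as np

def recall_metrics(sim):
    order = np.argsort(-sim, axis=1)                  # best match first
    ranks = np.array([np.where(order[i] == i)[0][0]   # 0-based rank of the gt item
                      for i in range(sim.shape[0])])
    recalls = {k: 100.0 * np.mean(ranks < k) for k in (1, 5, 10)}
    medr = float(np.median(ranks)) + 1                # MedR is 1-based
    return recalls, medr

def sum_r(sim):
    t2v, _ = recall_metrics(sim)     # text-to-video direction
    v2t, _ = recall_metrics(sim.T)   # video-to-text direction
    return sum(t2v.values()) + sum(v2t.values())
```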
<table> <tr align="center"> <th rowspan='2'>Type</th><th colspan='5'>Text-to-Video Retrieval</th><th colspan='5'>Video-to-Text Retrieval</th> <th rowspan='2'>SumR</th> </tr> <tr align="center"> <th> R@1 </th> <th> R@5 </th> <th> R@10 </th> <th> MedR </th> <th> mAP </th> <th> R@1 </th> <th> R@5 </th> <th> R@10 </th> <th> MedR </th> <th> mAP </th> </tr> <tr align="center"> <td>en2cn</td><td>30.8</td><td>64.4</td><td>74.6</td><td>3.0</td><td>45.78</td> <td>43.1</td><td>72.3</td><td>81.4</td><td>2.0</td><td>32.57</td><td>366.5</td> </tr> </table>

NRCCR on MSR-VTT-CN
Model Training and Evaluation
Run the following script to train and evaluate the NRCCR network on MSR-VTT-CN.
ROOTPATH=$HOME/VisualSearch
conda activate nrccr_env
# To train the model on MSR-VTT-CN
./do_all_msrvttcn.sh $ROOTPATH <gpu-id>
Evaluation using Provided Checkpoints
Download the trained checkpoint on MSR-VTT-CN from Baidu pan (url, pwd: ise6) and run the following script to evaluate it.
ROOTPATH=$HOME/VisualSearch/
tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH
./do_test_msrvttcn.sh $ROOTPATH $MODELDIR <gpu-id>
# $MODELDIR is the checkpoint directory, e.g., $ROOTPATH/.../runs_0
Expected Performance
<table> <tr align="center"> <th rowspan='2'>Type</th><th colspan='5'>Text-to-Video Retrieval</th><th colspan='5'>Video-to-Text Retrieval</th> <th rowspan='2'>SumR</th> </tr> <tr align="center"> <th> R@1 </th> <th> R@5 </th> <th> R@10 </th> <th> MedR </th> <th> mAP </th> <th> R@1 </th> <th> R@5 </th> <th> R@10 </th> <th> MedR </th> <th> mAP </th> </tr> <tr align="center"> <td>en2cn</td><td>28.9</td><td>56.3</td><td>67.3</td><td>4.0</td><td>41.28</td> <td>28.9</td><td>57.6</td><td>69.0</td><td>4.0</td><td>42.02</td><td>308</td> </tr> </table>

NRCCR on Multi-30K
Model Training and Evaluation
Run the following script to train and evaluate the NRCCR network on Multi-30K. In addition, if you want to train with CLIP as the backbone, you need to download the raw Flickr30K images from here.
ROOTPATH=$HOME/VisualSearch
conda activate nrccr_env
# To train the model on the Multi-30K
./do_all_multi30k.sh $ROOTPATH <task> <gpu-id>
Evaluation using Provided Checkpoints
Download the trained checkpoint on Multi-30K from Baidu pan (url, pwd: ise6) and run the following script to evaluate it.
ROOTPATH=$HOME/VisualSearch/
tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH
./do_test_multi30k.sh $ROOTPATH $MODELDIR $image_path <gpu-id>
# $MODELDIR is the checkpoint directory, e.g., $ROOTPATH/.../runs_0
# $image_path is the directory of raw Flickr30K images; if you use the frozen ResNet-152 features, just set it to None.
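If you use the frozen ResNet-152 setting and need to produce the *-resnet_152-avgpool.npy files yourself, the sketch below follows the standard torchvision recipe (the exact preprocessing used for the released features is an assumption on our part):

```python
# Hedged sketch: extract 2048-d ResNet-152 avgpool features for a list of
# images, in the style of "train-resnet_152-avgpool.npy". The preprocessing
# is the standard torchvision recipe, assumed to match the released features.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet152(pretrained=True)
model.fc = torch.nn.Identity()   # expose the 2048-d avgpool output
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(image_paths):
    feats = [model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
             for p in image_paths]
    return torch.cat(feats).numpy()

# np.save("train-resnet_152-avgpool.npy", extract(train_image_paths))
```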
Expected Performance
Task 1:
<table> <tr align="center"> <th rowspan='2'>Type</th><th colspan='5'>Text-to-Image Retrieval</th><th colspan='5'>Image-to-Text Retrieval</th> <th rowspan='2'>SumR</th> </tr> <tr align="center"> <th> R@1 </th> <th> R@5 </th> <th> R@10 </th> <th> MedR </th> <th> mAP </th> <th> R@1 </th> <th> R@5 </th> <th> R@10 </th> <th> MedR </th> <th> mAP </th> </tr> <tr align="center"> <td>en2de_clip</td><td>53.8</td><td>81.8</td><td>88.3</td><td>1.0</td><td>66.60</td> <td>53.8</td><td>82.7</td><td>90.3</td><td>1.0</td><td>66.66</td><td>450.7</td> </tr> <tr align="center"> <td>en2fr_clip</td><td>54.7</td><td>81.7</td><td>89.2</td><td>1.0</td><td>67.05</td> <td>54.9</td><td>82.7</td><td>89.7</td><td>1.0</td><td>67.29</td><td>452.9</td> </tr> <tr align="center"> <td>en2cs_clip</td><td>52.6</td><td>79.4</td><td>87.9</td><td>1.0</td><td>65.26</td> <td>52.3</td><td>78.7</td><td>87.8</td><td>1.0</td><td>64.68</td><td>438.7</td> </tr> <tr align="center"> <td>en2cs_resnet152</td><td>29.5</td><td>56.0</td><td>68.1</td><td>4.0</td><td>41.89</td><td>27.5</td><td>55.1</td><td>67.4</td><td>4.0</td><td>40.59</td><td>303.6</td> </tr> </table>

Task 2 (with CLIP):
<table> <tr align="center"> <th> en2de_SumR </th> <th> en2fr_SumR </th> <th> en2cs_SumR </th> </tr> <tr align="center"> <td>480.9</td> <td>482.1</td> <td>467.1</td> </tr> </table>

NRCCR on MSCOCO
Model Training and Evaluation
Run the following script to train and evaluate the NRCCR network on MSCOCO.
ROOTPATH=$HOME/VisualSearch
conda activate nrccr_env
# To train the model on MSCOCO
./do_all_mscoco.sh $ROOTPATH <gpu-id>
Expected Performance
(with CLIP)
<table> <tr align="center"> <th> en2cn_SumR </th> <th> en2ja_SumR </th> </tr> <tr align="center"> <td>512.4</td> <td>507.0</td> </tr> </table>

Reference
If you find the package useful, please consider citing our paper:
@inproceedings{wang2022cross,
  title={Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning},
  author={Yabing Wang and Jianfeng Dong and Tianxiang Liang and Minsong Zhang and Rui Cai and Xun Wang},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022}
}