MIL-NCE End-to-End HowTo100M training on GPUs with PyTorch
This repo contains open source PyTorch distributed training code for the CVPR'20 paper: End-to-End Learning of Visual Representations from Uncurated Instructional Videos [1]. The original codebase from [1] relies on Google and DeepMind's internal tools as well as TPU v3 accelerators, which makes it challenging to release as is.
Instead, this repository provides an implementation of [1] using PyTorch / ffmpeg with a reasonable number of GPUs.
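For reference, here is a minimal sketch of the MIL-NCE objective trained by this code, assuming clip embeddings of shape (B, D) and K candidate narration embeddings per clip of shape (B, K, D) (cf. the --num_candidates flag below). It is an illustration only; the exact loss implementation in this repo may differ, e.g. in how negatives are gathered across the batch.

import torch

def milnce_loss(video_embd, text_embd):
    """Minimal MIL-NCE sketch: video_embd is (B, D), text_embd is (B, K, D)
    with K candidate narrations per clip. Positives are the K narrations of
    the same clip; negatives are all narrations paired with other clips."""
    B, K, D = text_embd.shape
    sim = video_embd @ text_embd.reshape(B * K, D).t()   # (B, B*K) dot products
    sim = sim.view(B, B, K)                              # sim[i, j, k] = v_i . t_jk
    pos = sim[torch.arange(B), torch.arange(B)]          # (B, K) positive scores
    nominator = torch.logsumexp(pos, dim=1)              # pool over the K positives
    denominator = torch.logsumexp(sim.reshape(B, -1), dim=1)
    return (denominator - nominator).mean()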
The training code was run on the French public AI cluster Jean-Zay (see Acknowledgements below). It was specifically designed to run on a SLURM-based cluster for multi-node distributed training but can be easily modified for any other cluster management system.
This open source PyTorch implementation of the paper has several minor differences, such as:
- The use of a cosine learning rate decay instead of the stepwise decay described in [1] (see the sketch after this list).
- No sharing of the batch normalization statistics across GPUs and nodes, as such an operation is much slower to perform on GPUs than on TPUs.
- The use of a slightly different spatio-temporal resolution for the input video clips during training.
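To illustrate the first point, below is a minimal sketch of a cosine schedule with linear warmup; the exact warmup shape used in this repo is an assumption, and --lr and --warmup_steps are the corresponding flags of the training command further down.

import math

def get_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup followed by a cosine decay to zero (a sketch)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Applied to the optimizer at every iteration, e.g.:
# for group in optimizer.param_groups:
#     group['lr'] = get_lr(global_step, args.lr, args.warmup_steps, total_steps)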
If you only plan to reuse the pretrained S3D model from [1], please instead visit the following repo. If you use this code, we would appreciate it if you could cite both [1] and [2].
Requirements
- Python 3
- PyTorch (>= 1.0)
- python-ffmpeg with ffmpeg (see the clip decoding sketch after this list)
- pandas
- tqdm (for evaluation only)
- scikit-learn (for linear evaluation only)
- SLURM cluster management for distributed training (can be easily modified for other cluster management systems)
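The clip decoding performed by the data loader looks roughly like the sketch below, assuming the ffmpeg-python package (imported as ffmpeg); the actual loader may resize and crop the frames differently.

import ffmpeg  # provided by the ffmpeg-python package
import numpy as np
import torch

def load_clip(video_path, start, duration, fps=10, size=224):
    """Decode a clip into a (C, T, H, W) float tensor (illustration only)."""
    out, _ = (
        ffmpeg
        .input(video_path, ss=start, t=duration)
        .filter('fps', fps=fps)
        .filter('scale', size, size)
        .output('pipe:', format='rawvideo', pix_fmt='rgb24')
        .run(capture_stdout=True, quiet=True)
    )
    frames = np.frombuffer(out, np.uint8).reshape(-1, size, size, 3)
    video = torch.from_numpy(frames.copy()).float() / 255.0
    return video.permute(3, 0, 1, 2)  # channels, time, height, width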
Getting preliminary word2vec and HowTo100M data for training
You will first need to download the word2vec matrix and dictionary and unzip the archive into the data folder, in the same directory as the code.
wget https://www.rocq.inria.fr/cluster-willow/amiech/word2vec.zip
unzip word2vec.zip
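These files provide the vocabulary and the word embedding matrix used to encode the narrations. A rough usage sketch follows; the file names below are hypothetical, so check the actual contents of the archive.

import numpy as np
import torch

# Hypothetical file names; adapt them to the actual contents of word2vec.zip.
words = np.load('data/dict.npy', allow_pickle=True)   # vocabulary array
vectors = np.load('data/word2vec.npy')                # (vocab_size, dim) embedding matrix
word_to_idx = {w: i for i, w in enumerate(words)}

def embed_sentence(sentence, max_words=20):
    """Map a narration to a fixed-size sequence of word vectors (a sketch;
    out-of-vocabulary words are dropped and short sentences are zero-padded)."""
    ids = [word_to_idx[w] for w in sentence.lower().split() if w in word_to_idx]
    vecs = np.zeros((max_words, vectors.shape[1]), dtype=np.float32)
    if ids:
        vecs[:len(ids[:max_words])] = vectors[ids[:max_words]]
    return torch.from_numpy(vecs)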
Then you will need to download the preprocessed HowTo100M captions and unzip the csv files somewhere.
To download the csv files:
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/howto100m_captions.zip
Finally, the preprocessed HowTo100M videos (12 TB in total) can be downloaded by filling in this Google form: https://forms.gle/hztrfnFQUJWBtiki8. We advise saving the HowTo100M videos as well as the caption files on fast-access disks such as SSDs to significantly speed up training.
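As a quick sanity check of the downloaded data, the caption files can be read per video with pandas, as sketched below; the column names ('start', 'end', 'text') and the video file extension are assumptions, so adjust them to the downloaded files if they differ.

import os
import random
import pandas as pd

def sample_clip_and_narration(caption_root, video_root, video_id):
    """Pick a random (clip interval, narration) pair for one HowTo100M video.
    Assumes one csv per video with 'start', 'end' and 'text' columns."""
    captions = pd.read_csv(os.path.join(caption_root, video_id + '.csv'))
    row = captions.iloc[random.randrange(len(captions))]
    video_path = os.path.join(video_root, video_id + '.mp4')  # extension may differ
    return video_path, float(row['start']), float(row['end']), str(row['text'])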
Training MIL-NCE on HowTo100M
The following command trains the S3D model on a single node, using all of its GPUs, and checkpoints the model in the directory checkpoint/milnce; the logs are written in the log directory. Do not forget to replace path_to_howto_csv with the path to the HowTo100M caption csv files and path_to_howto_videos with the path where the HowTo100M videos were downloaded.
python main_distributed.py --n_display=1 \
--multiprocessing-distributed --batch_size=256 \
--num_thread_reader=40 --cudnn_benchmark=1 --pin_memory \
--checkpoint_dir=milnce --num_candidates=4 --resume --lr=0.001 \
--warmup_steps=10000 --epochs=300 --caption_root=path_to_howto_csv --video_path=path_to_howto_videos
You can also monitor the evaluation on the zero-shot YouCook2 retrieval task by specifying the argument --evaluate as well as --eval_video_root=path_to_youcook2_video.
- Note 1: The batch size value set here is the total batch size for the node, so if batch size is 256 and there are 4 GPUs, the batch size for each GPU will be 64.
- Note 2: An epoch here is equivalent to processing 1,238,911 video-text training samples, which is the number of distinct videos in HowTo100M. It is not the same as the number of distinct training video clips, as there are more than 100M clips.
- Note 3: For multi-node training, the training code should be distributed over multiple SLURM tasks (see the sketch below).
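For Note 3, a minimal sketch of how the process group can be initialized from SLURM environment variables when one task is launched per GPU; this is an illustration, not the exact code of main_distributed.py, and it assumes MASTER_ADDR / MASTER_PORT are exported in the job script.

import os
import torch
import torch.distributed as dist

def init_distributed_from_slurm(backend='nccl'):
    """Derive rank / world size from SLURM variables (one task per GPU)."""
    rank = int(os.environ['SLURM_PROCID'])
    world_size = int(os.environ['SLURM_NTASKS'])
    local_rank = int(os.environ['SLURM_LOCALID'])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    return rank, local_rank, world_size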
Linear evaluation of representation on HMDB-51 action recognition dataset
Download the videos
Please download the original HMDB videos at: https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#Downloads
Evaluation
This evaluation extracts video features on HMDB-51 using a model checkpoint (replace the_path_to_the_checkpoint below with its path) and trains a linear SVM on these features using scikit-learn.
To run the evaluation:
python eval_hmdb.py --batch_size=16 --num_thread_reader=20 --num_windows_test=10 \
--eval_video_root=path_to_the_videos --pretrain_cnn_path=the_path_to_the_checkpoint
You will need to replace path_to_the_videos with the root folder containing the downloaded HMDB videos.
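The linear probe itself amounts to fitting a linear SVM on the frozen, extracted features; a minimal sketch with scikit-learn follows (the exact regularization and preprocessing used by eval_hmdb.py may differ).

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels, C=1.0):
    """Fit a linear SVM on frozen video features and report top-1 accuracy."""
    clf = LinearSVC(C=C, max_iter=10000)
    clf.fit(normalize(train_feats), train_labels)
    preds = clf.predict(normalize(test_feats))
    return float(np.mean(preds == test_labels))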
This table compares the results of the linear evaluation of the representation on HMDB-51 between the original implementation and this one, under various numbers of training epochs and training batch sizes.
<table><tbody>
<!-- TABLE HEADER -->
<tr><th valign="bottom">Implementation</th> <th valign="bottom">Epochs</th> <th valign="bottom">Total batch size</th> <th valign="bottom">Accelerator</th> <th valign="bottom">CPU cores</th> <th valign="bottom">Training input size</th> <th valign="bottom">Top-1 accuracy</th></tr>
<!-- TABLE BODY -->
<tr><td align="left">TPU Tensorflow [1]</td> <td align="center">2645</td> <td align="center">8192</td> <td align="center">64 x Cloud TPU v3 128 GB</td> <td align="center">64 x N.A.</td> <td align="center">32 frames at 200x200</td> <td align="center">53.1</td></tr>
<tr><td align="left">TPU Tensorflow [1]</td> <td align="center">206</td> <td align="center">512</td> <td align="center">4 x Cloud TPU v3 128 GB</td> <td align="center">4 x N.A.</td> <td align="center">16 frames at 200x200</td> <td align="center">54.2</td></tr>
<tr><td align="left">This implementation</td> <td align="center">150</td> <td align="center">512</td> <td align="center">2 x 4 Tesla V100 32 GB</td> <td align="center">2 x 40</td> <td align="center">16 frames at 224x224</td> <td align="center">53.5</td></tr>
<tr><td align="left">This implementation</td> <td align="center">300</td> <td align="center">1024</td> <td align="center">4 x 4 Tesla V100 32 GB</td> <td align="center">4 x 40</td> <td align="center">16 frames at 224x224</td> <td align="center">54.3</td></tr>
</tbody></table>
Zero-Shot evaluation retrieval on MSR-VTT and YouCook2
Download the MSR-VTT videos
A mirror link of the MSR-VTT testing videos can be found at: https://www.mediafire.com/folder/h14iarbs62e7p/shared
Download the YouCook2 videos
A mirror link of our downloaded YouCook2 testing videos can be found at: https://www.rocq.inria.fr/cluster-willow/amiech/Youcook2_val.zip
Evaluation on MSR-VTT
This evaluation runs the zero-shot text-video retrieval on the subset of the MSR-VTT test set used in [1]. You will need to replace the_path_to_the_checkpoint with your model checkpoint path and path_to_the_msrvtt_videos with the root folder containing the downloaded MSR-VTT testing videos.
python eval_msrvtt.py --batch_size=16 --num_thread_reader=20 --num_windows_test=10 \
--eval_video_root=path_to_the_msrvtt_videos --pretrain_cnn_path=the_path_to_the_checkpoint
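The reported R@k and Median Rank are the standard retrieval metrics, computed from a text-to-video similarity matrix as sketched below (illustration only, not the exact code of eval_msrvtt.py).

import numpy as np

def retrieval_metrics(sim):
    """sim is a (num_captions, num_videos) similarity matrix where entry
    [i, i] corresponds to the ground-truth caption-video pair."""
    ranks = np.empty(sim.shape[0])
    for i, scores in enumerate(sim):
        order = np.argsort(-scores)                 # best matching video first
        ranks[i] = np.where(order == i)[0][0] + 1   # 1-based rank of ground truth
    return {
        'R@1': float(np.mean(ranks <= 1)),
        'R@5': float(np.mean(ranks <= 5)),
        'R@10': float(np.mean(ranks <= 10)),
        'Median Rank': float(np.median(ranks)),
    }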
This table compares the retrieval results of the original implementation and this one, under various numbers of training epochs and training batch sizes.
<table><tbody>
<!-- TABLE HEADER -->
<tr><th valign="bottom">Implementation</th> <th valign="bottom">Epochs</th> <th valign="bottom">Total batch size</th> <th valign="bottom">Accelerator</th> <th valign="bottom">CPU cores</th> <th valign="bottom">Training input size</th> <th valign="bottom">R@1</th> <th valign="bottom">R@5</th> <th valign="bottom">R@10</th> <th valign="bottom">Median Rank</th></tr>
<!-- TABLE BODY -->
<tr><td align="left">TPU Tensorflow [1]</td> <td align="center">2645</td> <td align="center">8192</td> <td align="center">64 x Cloud TPU v3 128 GB</td> <td align="center">64 x N.A.</td> <td align="center">32 frames at 200x200</td> <td align="center">9.9</td> <td align="center">24.0</td> <td align="center">32.4</td> <td align="center">29.5</td></tr>
<tr><td align="left">TPU Tensorflow [1]</td> <td align="center">206</td> <td align="center">512</td> <td align="center">4 x Cloud TPU v3 128 GB</td> <td align="center">4 x N.A.</td> <td align="center">16 frames at 200x200</td> <td align="center">8.6</td> <td align="center">21.5</td> <td align="center">28.6</td> <td align="center">36</td></tr>
<tr><td align="left">This implementation</td> <td align="center">150</td> <td align="center">512</td> <td align="center">2 x 4 Tesla V100 32 GB</td> <td align="center">2 x 40</td> <td align="center">16 frames at 224x224</td> <td align="center">7.7</td> <td align="center">19.0</td> <td align="center">27.3</td> <td align="center">37</td></tr>
<tr><td align="left">This implementation</td> <td align="center">300</td> <td align="center">1024</td> <td align="center">4 x 4 Tesla V100 32 GB</td> <td align="center">4 x 40</td> <td align="center">16 frames at 224x224</td> <td align="center">8.2</td> <td align="center">21.5</td> <td align="center">29.5</td> <td align="center">40</td></tr>
</tbody></table>
Evaluation on YouCook2
This evaluation runs the zero-shot text-video retrieval on the YouCook2 validation videos. Please replace the_path_to_the_checkpoint with your model checkpoint path and path_to_the_youcook_videos with the root folder containing the downloaded YouCook2 testing videos.
python eval_youcook.py --batch_size=16 --num_thread_reader=20 --num_windows_test=10 \
--eval_video_root=path_to_the_youcook_videos --pretrain_cnn_path=the_path_to_the_checkpoint
This table compares the retrieval results of the original implementation and this one, under various numbers of training epochs and training batch sizes. Note that, as opposed to MSR-VTT and HMDB-51, we were not able to download the same set of YouCook2 cooking videos, so the DeepMind implementation and this PyTorch implementation are evaluated on slightly different numbers of validation videos (the DeepMind evaluation uses slightly fewer).
<table><tbody>
<!-- TABLE HEADER -->
<tr><th valign="bottom">Implementation</th> <th valign="bottom">Epochs</th> <th valign="bottom">Total batch size</th> <th valign="bottom">Accelerator</th> <th valign="bottom">CPU cores</th> <th valign="bottom">Training input size</th> <th valign="bottom">R@1</th> <th valign="bottom">R@5</th> <th valign="bottom">R@10</th> <th valign="bottom">Median Rank</th></tr>
<!-- TABLE BODY -->
<tr><td align="left">TPU Tensorflow [1]</td> <td align="center">2645</td> <td align="center">8192</td> <td align="center">64 x Cloud TPU v3 128 GB</td> <td align="center">64 x N.A.</td> <td align="center">32 frames at 200x200</td> <td align="center">15.1</td> <td align="center">38.0</td> <td align="center">51.2</td> <td align="center">10</td></tr>
<tr><td align="left">TPU Tensorflow [1]</td> <td align="center">206</td> <td align="center">512</td> <td align="center">4 x Cloud TPU v3 128 GB</td> <td align="center">4 x N.A.</td> <td align="center">16 frames at 200x200</td> <td align="center">8.2</td> <td align="center">24.6</td> <td align="center">36.2</td> <td align="center">23</td></tr>
<tr><td align="left">This implementation</td> <td align="center">150</td> <td align="center">512</td> <td align="center">2 x 4 Tesla V100 32 GB</td> <td align="center">2 x 40</td> <td align="center">16 frames at 224x224</td> <td align="center">7.4</td> <td align="center">21.5</td> <td align="center">32.2</td> <td align="center">29</td></tr>
<tr><td align="left">This implementation</td> <td align="center">300</td> <td align="center">1024</td> <td align="center">4 x 4 Tesla V100 32 GB</td> <td align="center">4 x 40</td> <td align="center">16 frames at 224x224</td> <td align="center">8.8</td> <td align="center">24.3</td> <td align="center">34.6</td> <td align="center">23</td></tr>
</tbody></table>
References
[1] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic and A. Zisserman, End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Presented at CVPR 2020
[2] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev and J. Sivic, HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Presented at ICCV 2019
Bibtex:
@inproceedings{miech19howto100m,
title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
booktitle={ICCV},
year={2019},
}
@inproceedings{miech19endtoend,
title={{E}nd-to-{E}nd {L}earning of {V}isual {R}epresentations from {U}ncurated {I}nstructional {V}ideos},
author={Miech, Antoine and Alayrac, Jean-Baptiste and Smaira, Lucas and Laptev, Ivan and Sivic, Josef and Zisserman, Andrew},
booktitle={CVPR},
year={2020},
}
Acknowledgements
This work was granted access to the HPC resources of IDRIS under the allocation 2020-AD011011325 made by GENCI.
The SLURM distributed training code was also mainly inspired by this great repo.