This repo provides code:

Requirements: The code assumes PyTorch 1.4 and Python 3.7 (other versions may work, but have not been tested). See the section on dependencies towards the end of this file for specific package requirements.

TeachText

TeachText diagram

TeachText results on MSRVTT Benchmark

| Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | Full | t2v | <sub><sup>11.0<sub>(0.0)</sub></sup></sub> | <sub><sup>30.8<sub>(0.1)</sub></sup></sub> | <sub><sup>43.3<sub>(0.3)</sub></sup></sub> | <sub><sup>73.1<sub>(0.2)</sub></sup></sub> | <sub><sup>15.0<sub>(0.0)</sub></sup></sub> | <sub><sup>81.8<sub>(0.2)</sub></sup></sub> | <sub><sup>24.4<sub>(0.1)</sub></sup></sub> | config_TT, model_TT, log_TT |
| CE+ | Full | t2v | <sub><sup>13.8<sub>(0.1)</sub></sup></sub> | <sub><sup>36.5<sub>(0.2)</sub></sup></sub> | <sub><sup>49.4<sub>(0.4)</sub></sup></sub> | <sub><sup>77.6<sub>(0.2)</sub></sup></sub> | <sub><sup>11.0<sub>(0.0)</sub></sup></sub> | <sub><sup>69.4<sub>(0.8)</sub></sup></sub> | <sub><sup>29.2<sub>(0.2)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE | Full | t2v | <sub><sup>11.8<sub>(0.1)</sub></sup></sub> | <sub><sup>32.7<sub>(0.2)</sub></sup></sub> | <sub><sup>45.3<sub>(0.2)</sub></sup></sub> | <sub><sup>74.9<sub>(0.1)</sub></sup></sub> | <sub><sup>13.0<sub>(0.0)</sub></sup></sub> | <sub><sup>74.9<sub>(0.4)</sub></sup></sub> | <sub><sup>25.9<sub>(0.1)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE+ | Full | t2v | <sub><sup>14.6<sub>(0.0)</sub></sup></sub> | <sub><sup>37.9<sub>(0.1)</sub></sup></sub> | <sub><sup>50.9<sub>(0.2)</sub></sup></sub> | <sub><sup>78.9<sub>(0.0)</sub></sup></sub> | <sub><sup>10.0<sub>(0.0)</sub></sup></sub> | <sub><sup>63.1<sub>(0.2)</sub></sup></sub> | <sub><sup>30.4<sub>(0.0)</sub></sup></sub> | config_TT, model_TT, log_TT |

Please note that these numbers are higher than those reported for the original CE model because compression artefacts have been corrected.
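The Geom column in these tables appears to be the geometric mean of R@1, R@5 and R@10. That reading is an assumption inferred from the reported numbers, not something stated in this section. A minimal sketch of the calculation (the function name is hypothetical):

```python
def geometric_mean_recall(r1, r5, r10):
    """Geometric mean of three recall values, each given in percent."""
    return (r1 * r5 * r10) ** (1.0 / 3.0)


# Mean recalls of the CE row in the MSRVTT table (R@1, R@5, R@10).
print(round(geometric_mean_recall(11.0, 30.8, 43.3), 1))
```

For the CE row this gives about 24.5, close to the reported 24.4. A small gap of this kind would be expected if the metric is computed for each run before averaging, which is again an assumption.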

Denoising results on MSRVTT

| Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE+ | Full | t2v | <sub><sup>14.4<sub>(0.1)</sub></sup></sub> | <sub><sup>37.4<sub>(0.2)</sub></sup></sub> | <sub><sup>50.2<sub>(0.1)</sub></sup></sub> | <sub><sup>77.9<sub>(0.1)</sub></sup></sub> | <sub><sup>10.0<sub>(0.0)</sub></sup></sub> | <sub><sup>70.8<sub>(0.1)</sub></sup></sub> | <sub><sup>30.0<sub>(0.1)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE+ | Full | t2v | <sub><sup>14.9<sub>(0.1)</sub></sup></sub> | <sub><sup>38.3<sub>(0.1)</sub></sup></sub> | <sub><sup>51.5<sub>(0.1)</sub></sup></sub> | <sub><sup>79.2<sub>(0.1)</sub></sup></sub> | <sub><sup>10.0<sub>(0.0)</sub></sup></sub> | <sub><sup>62.5<sub>(0.5)</sub></sup></sub> | <sub><sup>30.9<sub>(0.1)</sub></sup></sub> | config_TT, model_TT, log_TT |

TeachText results on MSVD Benchmark

| Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | Full | t2v | <sub><sup>21.5<sub>(0.6)</sub></sup></sub> | <sub><sup>52.3<sub>(0.9)</sub></sup></sub> | <sub><sup>67.5<sub>(0.8)</sub></sup></sub> | <sub><sup>90.7<sub>(0.0)</sub></sup></sub> | <sub><sup>5.0<sub>(0.0)</sub></sup></sub> | <sub><sup>20.4<sub>(0.0)</sub></sup></sub> | <sub><sup>42.3<sub>(0.6)</sub></sup></sub> | config_TT, model_TT, log_TT |
| CE+ | Full | t2v | <sub><sup>25.1<sub>(0.9)</sub></sup></sub> | <sub><sup>56.5<sub>(1.4)</sub></sup></sub> | <sub><sup>70.9<sub>(1.6)</sub></sup></sub> | <sub><sup>92.4<sub>(0.5)</sub></sup></sub> | <sub><sup>4.0<sub>(0.0)</sub></sup></sub> | <sub><sup>17.8<sub>(0.6)</sub></sup></sub> | <sub><sup>46.5<sub>(1.0)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE | Full | t2v | <sub><sup>22.1<sub>(0.5)</sub></sup></sub> | <sub><sup>52.2<sub>(0.6)</sub></sup></sub> | <sub><sup>67.2<sub>(0.8)</sub></sup></sub> | <sub><sup>91.2<sub>(0.5)</sub></sup></sub> | <sub><sup>5.0<sub>(0.0)</sub></sup></sub> | <sub><sup>19.6<sub>(0.5)</sub></sup></sub> | <sub><sup>42.6<sub>(0.4)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE+ | Full | t2v | <sub><sup>25.1<sub>(0.6)</sub></sup></sub> | <sub><sup>56.8<sub>(0.6)</sub></sup></sub> | <sub><sup>71.2<sub>(0.6)</sub></sup></sub> | <sub><sup>92.7<sub>(0.3)</sub></sup></sub> | <sub><sup>4.0<sub>(0.0)</sub></sup></sub> | <sub><sup>16.8<sub>(0.3)</sub></sup></sub> | <sub><sup>46.6<sub>(0.5)</sub></sup></sub> | config_TT, model_TT, log_TT |

Denoising results on MSVD

| Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE+ | Full | t2v | <sub><sup>26.2<sub>(0.5)</sub></sup></sub> | <sub><sup>57.7<sub>(1.0)</sub></sup></sub> | <sub><sup>72.2<sub>(1.2)</sub></sup></sub> | <sub><sup>92.2<sub>(0.4)</sub></sup></sub> | <sub><sup>4.0<sub>(0.0)</sub></sup></sub> | <sub><sup>17.9<sub>(0.5)</sub></sup></sub> | <sub><sup>47.8<sub>(0.6)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE+ | Full | t2v | <sub><sup>25.4<sub>(0.4)</sub></sup></sub> | <sub><sup>56.9<sub>(0.5)</sub></sup></sub> | <sub><sup>71.3<sub>(0.3)</sub></sup></sub> | <sub><sup>92.8<sub>(0.2)</sub></sup></sub> | <sub><sup>4.0<sub>(0.0)</sub></sup></sub> | <sub><sup>16.7<sub>(0.2)</sub></sup></sub> | <sub><sup>46.9<sub>(0.3)</sub></sup></sub> | config_TT, model_TT, log_TT |

TeachText results on DiDeMo Benchmark

| Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | Full | t2v | <sub><sup>17.1<sub>(0.9)</sub></sup></sub> | <sub><sup>41.9<sub>(0.2)</sub></sup></sub> | <sub><sup>56.0<sub>(0.5)</sub></sup></sub> | <sub><sup>83.4<sub>(0.9)</sub></sup></sub> | <sub><sup>8.0<sub>(0.0)</sub></sup></sub> | <sub><sup>42.8<sub>(2.8)</sub></sup></sub> | <sub><sup>34.2<sub>(0.4)</sub></sup></sub> | config_TT, model_TT, log_TT |
| CE+ | Full | t2v | <sub><sup>18.2<sub>(0.3)</sub></sup></sub> | <sub><sup>43.9<sub>(1.1)</sub></sup></sub> | <sub><sup>57.1<sub>(0.9)</sub></sup></sub> | <sub><sup>84.0<sub>(1.6)</sub></sup></sub> | <sub><sup>7.9<sub>(0.1)</sub></sup></sub> | <sub><sup>38.5<sub>(3.4)</sub></sup></sub> | <sub><sup>35.8<sub>(0.4)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE | Full | t2v | <sub><sup>21.0<sub>(0.7)</sub></sup></sub> | <sub><sup>47.5<sub>(1.1)</sub></sup></sub> | <sub><sup>61.9<sub>(0.6)</sub></sup></sub> | <sub><sup>86.4<sub>(1.0)</sub></sup></sub> | <sub><sup>6.0<sub>(0.0)</sub></sup></sub> | <sub><sup>35.1<sub>(1.0)</sub></sup></sub> | <sub><sup>39.5<sub>(0.5)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE+ | Full | t2v | <sub><sup>21.6<sub>(0.8)</sub></sup></sub> | <sub><sup>48.6<sub>(0.5)</sub></sup></sub> | <sub><sup>62.9<sub>(0.7)</sub></sup></sub> | <sub><sup>86.8<sub>(0.3)</sub></sup></sub> | <sub><sup>6.0<sub>(0.0)</sub></sup></sub> | <sub><sup>31.5<sub>(0.8)</sub></sup></sub> | <sub><sup>40.4<sub>(0.4)</sub></sup></sub> | config_TT, model_TT, log_TT |

TeachText results on LSMDC Benchmark

| Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | Full | t2v | <sub><sup>12.4<sub>(0.7)</sub></sup></sub> | <sub><sup>28.5<sub>(0.8)</sub></sup></sub> | <sub><sup>37.9<sub>(0.6)</sub></sup></sub> | <sub><sup>64.5<sub>(0.8)</sub></sup></sub> | <sub><sup>21.7<sub>(0.6)</sub></sup></sub> | <sub><sup>88.0<sub>(4.8)</sub></sup></sub> | <sub><sup>23.7<sub>(0.3)</sub></sup></sub> | config_TT, model_TT, log_TT |
| CE+ | Full | t2v | <sub><sup>14.9<sub>(0.7)</sub></sup></sub> | <sub><sup>33.7<sub>(0.2)</sub></sup></sub> | <sub><sup>44.1<sub>(0.7)</sub></sup></sub> | <sub><sup>67.3<sub>(0.8)</sub></sup></sub> | <sub><sup>15.3<sub>(0.6)</sub></sup></sub> | <sub><sup>77.8<sub>(6.7)</sub></sup></sub> | <sub><sup>28.1<sub>(0.3)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE | Full | t2v | <sub><sup>13.7<sub>(0.9)</sub></sup></sub> | <sub><sup>30.2<sub>(0.4)</sub></sup></sub> | <sub><sup>40.1<sub>(0.4)</sub></sup></sub> | <sub><sup>66.0<sub>(0.6)</sub></sup></sub> | <sub><sup>19.8<sub>(1.3)</sub></sup></sub> | <sub><sup>84.0<sub>(1.8)</sub></sup></sub> | <sub><sup>25.5<sub>(0.5)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE+ | Full | t2v | <sub><sup>17.2<sub>(0.5)</sub></sup></sub> | <sub><sup>36.5<sub>(0.7)</sub></sup></sub> | <sub><sup>46.3<sub>(0.4)</sub></sup></sub> | <sub><sup>68.8<sub>(0.4)</sub></sup></sub> | <sub><sup>13.7<sub>(0.6)</sub></sup></sub> | <sub><sup>72.3<sub>(0.1)</sub></sup></sub> | <sub><sup>30.7<sub>(0.3)</sub></sup></sub> | config_TT, model_TT, log_TT |

TeachText results on Activity-Net Benchmark

| Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | Full | t2v | <sub><sup>19.9<sub>(0.4)</sub></sup></sub> | <sub><sup>50.1<sub>(0.8)</sub></sup></sub> | <sub><sup>66.1<sub>(0.6)</sub></sup></sub> | <sub><sup>92.2<sub>(0.7)</sub></sup></sub> | <sub><sup>5.3<sub>(0.6)</sub></sup></sub> | <sub><sup>21.3<sub>(1.1)</sub></sup></sub> | <sub><sup>40.4<sub>(0.3)</sub></sup></sub> | config_TT, model_TT, log_TT |
| CE+ | Full | t2v | <sub><sup>19.4<sub>(0.2)</sub></sup></sub> | <sub><sup>49.3<sub>(0.5)</sub></sup></sub> | <sub><sup>65.4<sub>(0.4)</sub></sup></sub> | <sub><sup>92.1<sub>(0.2)</sub></sup></sub> | <sub><sup>6.0<sub>(0.0)</sub></sup></sub> | <sub><sup>22.5<sub>(0.4)</sub></sup></sub> | <sub><sup>39.7<sub>(0.0)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE | Full | t2v | <sub><sup>22.7<sub>(0.8)</sub></sup></sub> | <sub><sup>56.2<sub>(0.1)</sub></sup></sub> | <sub><sup>71.6<sub>(0.8)</sub></sup></sub> | <sub><sup>95.3<sub>(0.1)</sub></sup></sub> | <sub><sup>4.0<sub>(0.0)</sub></sup></sub> | <sub><sup>15.8<sub>(0.1)</sub></sup></sub> | <sub><sup>45.0<sub>(0.6)</sub></sup></sub> | config_TT, model_TT, log_TT |
| TeachText - CE+ | Full | t2v | <sub><sup>23.5<sub>(0.2)</sub></sup></sub> | <sub><sup>57.2<sub>(0.6)</sub></sup></sub> | <sub><sup>73.6<sub>(0.2)</sub></sup></sub> | <sub><sup>96.1<sub>(0.1)</sub></sup></sub> | <sub><sup>4.0<sub>(0.0)</sub></sup></sub> | <sub><sup>13.7<sub>(0.1)</sub></sup></sub> | <sub><sup>46.3<sub>(0.2)</sub></sup></sub> | config_TT, model_TT, log_TT |

You can download the high quality features used for TeachText from:

For MSRVTT:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-MSRVTT-experts.tar.gz
sha1sum: 734650c3b98509996da75cdedc12101836624917

For MSVD:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-MSVD-experts.tar.gz
sha1sum: c8eba8c5291dd6bb501757ed0cc327cd22217965

For DiDeMo:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-DiDeMo-experts.tar.gz
sha1sum: 8e128309f12cf3260fe538f82578b5ad91a46bd0

For ActivityNet:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-activity-net-experts.tar.gz
sha1sum: 2f3c7c2fe86bd6d0c6230464a940c429291a4012

Collaborative Experts

CE diagram

High-level Overview: The Collaborative Experts framework aims to achieve robustness through two mechanisms:

  1. The use of information from a wide range of modalities, including those that are typically always available in video (such as RGB) as well as more "specific" clues which may only occasionally be present (such as overlaid text).
  2. A module that aims to combine these modalities into a fixed-size representation in a manner that is robust to noise.
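
To illustrate why occasionally missing modalities need care, here is a deliberately simplified sketch: per-expert similarity scores are combined with weights that are renormalised over the experts actually present for a given video. This is an illustrative assumption, not the CE module itself (see [1] for the actual design), and the function name and example weights are hypothetical.

```python
def combine_expert_scores(scores, weights):
    """Weighted average of per-expert similarities, skipping missing experts.

    scores:  dict mapping expert name -> similarity, or None if that
             expert is unavailable for this video (e.g. no overlaid text).
    weights: dict mapping expert name -> non-negative importance weight.
    """
    available = {name: s for name, s in scores.items() if s is not None}
    if not available:
        raise ValueError("at least one expert must be available")
    total = sum(weights[name] for name in available)
    if total <= 0:
        raise ValueError("available experts must have positive total weight")
    # Renormalise so the weights of the available experts sum to one.
    return sum(weights[name] / total * s for name, s in available.items())


# OCR is missing for this video, so its weight is shared among the others.
example = combine_expert_scores(
    {"rgb": 0.8, "ocr": None, "audio": 0.4},
    {"rgb": 0.5, "ocr": 0.3, "audio": 0.2},
)
```

Without the renormalisation, a video lacking overlaid text would be penalised simply because one term is absent from the sum.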

Important: A note on the updated results: A previous version of the codebase (and paper) reported results on the retrieval benchmarks that were affected by a significant software bug, which led to an overestimate of performance. We are extremely grateful to Valentin Gabeur, who discovered this bug (it has been corrected in the current codebase).

CVPR 2020: Pentathlon challenge

<p align="center"> <img width="300" alt="logo" src="figs/logo-centre.png"> </p>

We are hosting a video retrieval challenge as part of the Video Pentathlon Workshop. Find out how to participate here!

Pretrained video embeddings

We provide pretrained models for each dataset to reproduce the results reported in the paper [1] (references follow at the end of this README). Each model is accompanied by training and evaluation logs. Performance is evaluated for retrieval in both directions, text-to-video (t2v) and video-to-text (v2t); joint embeddings can be used for either of these two tasks.

In the results reported below, the same model is used for both the t2v and v2t evaluations. Each metric is reported as the mean and standard deviation (in parentheses) across three training runs.
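
For readers unfamiliar with the column names, the sketch below shows how R@k (recall at rank k, in percent), MdR (median rank) and MnR (mean rank) can be computed from a query-by-candidate similarity matrix, and how a metric is summarised across runs. It is a dependency-free illustration, not the repository's evaluation code. It assumes one correct candidate per query, located at the same index, and resolves ties in favour of the correct candidate. All function names are hypothetical.

```python
import statistics


def retrieval_ranks(sims):
    """sims[i][j] is the similarity of query i to candidate j.

    Candidate i is assumed to be the correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sims):
        target = row[i]
        # Rank 1 means no candidate scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def summarise(ranks, ks=(1, 5, 10, 50)):
    """Compute R@k (percent), median rank and mean rank from a list of ranks."""
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MdR"] = statistics.median(ranks)
    metrics["MnR"] = statistics.mean(ranks)
    return metrics


def mean_and_std(values):
    """Mean and sample standard deviation of one metric across runs."""
    return statistics.mean(values), statistics.stdev(values)


sims = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1], [0.2, 0.4, 0.5]]
metrics = summarise(retrieval_ranks(sims))
mean, std = mean_and_std([8, 8, 9])  # e.g. MdR from three training runs
```

The use of the sample standard deviation is an inference: entries such as 8.3 (0.6) are what three runs with median ranks 8, 8 and 9 would give under that definition, but the section does not state which estimator is used.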

MSRVTT Benchmark

| Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | Full | t2v | <sub><sup>10.0<sub>(0.1)</sub></sup></sub> | <sub><sup>29.0<sub>(0.3)</sub></sup></sub> | <sub><sup>41.2<sub>(0.2)</sub></sup></sub> | <sub><sup>71.4<sub>(0.1)</sub></sup></sub> | <sub><sup>16.0<sub>(0.0)</sub></sup></sub> | <sub><sup>86.8<sub>(0.3)</sub></sup></sub> | config, model, log |
| CE | 1k-A | t2v | <sub><sup>20.9<sub>(1.2)</sub></sup></sub> | <sub><sup>48.8<sub>(0.6)</sub></sup></sub> | <sub><sup>62.4<sub>(0.8)</sub></sup></sub> | <sub><sup>89.1<sub>(0.4)</sub></sup></sub> | <sub><sup>6.0<sub>(0.0)</sub></sup></sub> | <sub><sup>28.2<sub>(0.8)</sub></sup></sub> | config, model, log |
| CE | 1k-B | t2v | <sub><sup>18.2<sub>(0.7)</sub></sup></sub> | <sub><sup>46.0<sub>(0.4)</sub></sup></sub> | <sub><sup>60.7<sub>(0.2)</sub></sup></sub> | <sub><sup>86.6<sub>(0.5)</sub></sup></sub> | <sub><sup>7.0<sub>(0.0)</sub></sup></sub> | <sub><sup>35.3<sub>(1.1)</sub></sup></sub> | config, model, log |
| MoEE* | 1k-B | t2v | <sub><sup>15.0<sub>(0.7)</sub></sup></sub> | <sub><sup>39.7<sub>(1.0)</sub></sup></sub> | <sub><sup>54.5<sub>(1.1)</sub></sup></sub> | <sub><sup>82.7<sub>(0.6)</sub></sup></sub> | <sub><sup>8.3<sub>(0.6)</sub></sup></sub> | <sub><sup>43.7<sub>(0.7)</sub></sup></sub> | config, model, log |
| CE | Full | v2t | <sub><sup>15.6<sub>(0.3)</sub></sup></sub> | <sub><sup>40.9<sub>(1.4)</sub></sup></sub> | <sub><sup>55.2<sub>(1.0)</sub></sup></sub> | <sub><sup>84.0<sub>(0.1)</sub></sup></sub> | <sub><sup>8.3<sub>(0.6)</sub></sup></sub> | <sub><sup>38.1<sub>(1.8)</sub></sup></sub> | config, model, log |
| CE | 1k-A | v2t | <sub><sup>20.6<sub>(0.6)</sub></sup></sub> | <sub><sup>50.3<sub>(0.5)</sub></sup></sub> | <sub><sup>64.0<sub>(0.2)</sub></sup></sub> | <sub><sup>89.9<sub>(0.3)</sub></sup></sub> | <sub><sup>5.3<sub>(0.6)</sub></sup></sub> | <sub><sup>25.1<sub>(0.8)</sub></sup></sub> | config, model, log |
| CE | 1k-B | v2t | <sub><sup>18.0<sub>(0.8)</sub></sup></sub> | <sub><sup>46.0<sub>(0.5)</sub></sup></sub> | <sub><sup>60.3<sub>(0.5)</sub></sup></sub> | <sub><sup>86.4<sub>(0.3)</sub></sup></sub> | <sub><sup>6.5<sub>(0.5)</sub></sup></sub> | <sub><sup>30.6<sub>(1.2)</sub></sup></sub> | config, model, log |
| MoEE* | 1k-B | v2t | <sub><sup>14.5<sub>(0.8)</sub></sup></sub> | <sub><sup>40.4<sub>(0.8)</sub></sup></sub> | <sub><sup>54.9<sub>(1.0)</sub></sup></sub> | <sub><sup>83.8<sub>(0.5)</sub></sup></sub> | <sub><sup>8.8<sub>(0.4)</sub></sup></sub> | <sub><sup>38.7<sub>(0.9)</sub></sup></sub> | config, model, log |

Models marked with * use the features made available with the MoEE model of [2] (without OCR, speech and scene features). Unstarred models on the 1k-B and Full splits make use of OCR, speech and scene features, as well as slightly stronger text encodings (GPT rather than word2vec; see [1] for details). The MoEE model is implemented as a sanity check that our codebase approximately reproduces [2] (the MoEE paper).

See the MSRVTT README for links to the train/val/test lists of each split.

MSVD Benchmark

| Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | t2v | <sub><sup>19.8<sub>(0.3)</sub></sup></sub> | <sub><sup>49.0<sub>(0.3)</sub></sup></sub> | <sub><sup>63.8<sub>(0.1)</sub></sup></sub> | <sub><sup>89.0<sub>(0.2)</sub></sup></sub> | <sub><sup>6.0<sub>(0.0)</sub></sup></sub> | <sub><sup>23.1<sub>(0.3)</sub></sup></sub> | config, model, log |
| CE | v2t | <sub><sup>23.9<sub>(1.4)</sub></sup></sub> | <sub><sup>50.2<sub>(0.8)</sub></sup></sub> | <sub><sup>59.6<sub>(1.2)</sub></sup></sub> | <sub><sup>82.3<sub>(0.7)</sub></sup></sub> | <sub><sup>5.6<sub>(0.5)</sub></sup></sub> | <sub><sup>41.2<sub>(3.4)</sub></sup></sub> | config, model, log |

See the MSVD README for descriptions of the train/test splits. Note that the videos in the MSVD dataset do not have soundtracks.

DiDeMo Benchmark

| Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | t2v | <sub><sup>16.1<sub>(1.4)</sub></sup></sub> | <sub><sup>41.1<sub>(0.4)</sub></sup></sub> | <sub><sup>54.4<sub>(0.8)</sub></sup></sub> | <sub><sup>82.7<sub>(0.3)</sub></sup></sub> | <sub><sup>8.3<sub>(0.6)</sub></sup></sub> | <sub><sup>43.7<sub>(3.6)</sub></sup></sub> | config, model, log |
| CE | v2t | <sub><sup>15.6<sub>(1.3)</sub></sup></sub> | <sub><sup>40.9<sub>(0.4)</sub></sup></sub> | <sub><sup>55.2<sub>(0.5)</sub></sup></sub> | <sub><sup>82.2<sub>(1.3)</sub></sup></sub> | <sub><sup>8.2<sub>(0.3)</sub></sup></sub> | <sub><sup>42.4<sub>(3.3)</sub></sup></sub> | config, model, log |

See the DiDeMo README for descriptions of the train/val/test splits.

ActivityNet Benchmark

| Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | t2v | <sub><sup>18.2<sub>(0.3)</sub></sup></sub> | <sub><sup>47.7<sub>(0.6)</sub></sup></sub> | <sub><sup>63.9<sub>(0.5)</sub></sup></sub> | <sub><sup>91.4<sub>(0.4)</sub></sup></sub> | <sub><sup>6.0<sub>(0.0)</sub></sup></sub> | <sub><sup>23.1<sub>(0.5)</sub></sup></sub> | config, model, log |
| CE | v2t | <sub><sup>17.7<sub>(0.6)</sub></sup></sub> | <sub><sup>46.6<sub>(0.7)</sub></sup></sub> | <sub><sup>62.8<sub>(0.4)</sub></sup></sub> | <sub><sup>90.9<sub>(0.2)</sub></sup></sub> | <sub><sup>6.0<sub>(0.0)</sub></sup></sub> | <sub><sup>24.4<sub>(0.5)</sub></sup></sub> | config, model, log |

See the ActivityNet README for descriptions of the train/test splits.

LSMDC Benchmark

| Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | t2v | <sub><sup>11.2<sub>(0.4)</sub></sup></sub> | <sub><sup>26.9<sub>(1.1)</sub></sup></sub> | <sub><sup>34.8<sub>(2.0)</sub></sup></sub> | <sub><sup>62.1<sub>(1.5)</sub></sup></sub> | <sub><sup>25.3<sub>(3.1)</sub></sup></sub> | <sub><sup>96.8<sub>(5.0)</sub></sup></sub> | config, model, log |
| CE | v2t | <sub><sup>11.7<sub>(0.5)</sub></sup></sub> | <sub><sup>25.8<sub>(1.5)</sub></sup></sub> | <sub><sup>34.4<sub>(1.7)</sub></sup></sub> | <sub><sup>61.4<sub>(0.7)</sub></sup></sub> | <sub><sup>28.0<sub>(2.6)</sub></sup></sub> | <sub><sup>97.6<sub>(2.8)</sub></sup></sub> | config, model, log |

See the LSMDC README for descriptions of the train/test splits. Please note that to obtain the features and descriptions for this dataset, you must obtain permission from MPII to use the data (this process is described here). Once you have done so, please request that a member of the LSMDC team contacts us to confirm approval (via albanie at robots dot ox dot ac dot uk); we can then provide you with a link to the features.

Ablation studies on MSRVTT

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSRVTT dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | <sub><sup>0.0<sub>(0.0)</sub></sup></sub> | <sub><sup>0.0<sub>(0.0)</sub></sup></sub> | <sub><sup>0.0<sub>(0.0)</sub></sup></sub> | <sub><sup>1495.5<sub>(0.0)</sub></sup></sub> | 369.72k | config, model, log |
| CE - MW,P,CG | t2v | <sub><sup>8.5<sub>(0.1)</sub></sup></sub> | <sub><sup>25.9<sub>(0.3)</sub></sup></sub> | <sub><sup>37.6<sub>(0.2)</sub></sup></sub> | <sub><sup>19.0<sub>(0.0)</sub></sup></sub> | 246.22M | config, model, log |
| CE - P,CG | t2v | <sub><sup>9.6<sub>(0.1)</sub></sup></sub> | <sub><sup>28.0<sub>(0.2)</sub></sup></sub> | <sub><sup>39.7<sub>(0.2)</sub></sup></sub> | <sub><sup>17.7<sub>(0.6)</sub></sup></sub> | 400.41M | config, model, log |
| CE - CG | t2v | <sub><sup>9.7<sub>(0.1)</sub></sup></sub> | <sub><sup>28.1<sub>(0.2)</sub></sup></sub> | <sub><sup>40.2<sub>(0.1)</sub></sup></sub> | <sub><sup>17.0<sub>(0.0)</sub></sup></sub> | 181.07M | config, model, log |
| CE | t2v | <sub><sup>10.0<sub>(0.1)</sub></sup></sub> | <sub><sup>29.0<sub>(0.3)</sub></sup></sub> | <sub><sup>41.2<sub>(0.2)</sub></sup></sub> | <sub><sup>16.0<sub>(0.0)</sub></sup></sub> | 183.45M | config, model, log |
| Concat | v2t | <sub><sup>0.0<sub>(0.0)</sub></sup></sub> | <sub><sup>0.0<sub>(0.0)</sub></sup></sub> | <sub><sup>0.0<sub>(0.0)</sub></sup></sub> | <sub><sup>29897.5<sub>(0.0)</sub></sup></sub> | 369.72k | config, model, log |
| CE - MW,P,CG | v2t | <sub><sup>13.7<sub>(0.4)</sub></sup></sub> | <sub><sup>38.8<sub>(1.2)</sub></sup></sub> | <sub><sup>53.1<sub>(1.1)</sub></sup></sub> | <sub><sup>9.2<sub>(0.8)</sub></sup></sub> | 246.22M | config, model, log |
| CE - P,CG | v2t | <sub><sup>14.1<sub>(0.2)</sub></sup></sub> | <sub><sup>39.5<sub>(1.0)</sub></sup></sub> | <sub><sup>53.2<sub>(0.3)</sub></sup></sub> | <sub><sup>9.0<sub>(0.0)</sub></sup></sub> | 400.41M | config, model, log |
| CE - CG | v2t | <sub><sup>15.1<sub>(0.3)</sub></sup></sub> | <sub><sup>40.3<sub>(0.5)</sub></sup></sub> | <sub><sup>54.3<sub>(0.7)</sub></sup></sub> | <sub><sup>8.8<sub>(0.3)</sub></sup></sub> | 181.07M | config, model, log |
| CE | v2t | <sub><sup>15.6<sub>(0.3)</sub></sup></sub> | <sub><sup>40.9<sub>(1.4)</sub></sup></sub> | <sub><sup>55.2<sub>(1.0)</sub></sup></sub> | <sub><sup>8.3<sub>(0.6)</sub></sup></sub> | 183.45M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

Note that in the table above some metrics have been removed to allow the number of parameters to be displayed; these additional metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. Since not all experts are available in every video, we pair each expert with scene features, to give an approximation of their usefulness.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | <sub><sup>4.0<sub>(0.1)</sub></sup></sub> | <sub><sup>14.1<sub>(0.1)</sub></sup></sub> | <sub><sup>22.4<sub>(0.3)</sub></sup></sub> | <sub><sup>50.0<sub>(1.0)</sub></sup></sub> | 19.46M | config, model, log |
| Scene + Inst. | t2v | <sub><sup>7.2<sub>(0.1)</sub></sup></sub> | <sub><sup>22.3<sub>(0.3)</sub></sup></sub> | <sub><sup>33.0<sub>(0.2)</sub></sup></sub> | <sub><sup>25.3<sub>(0.6)</sub></sup></sub> | 41.12M | config, model, log |
| Scene + r2p1d | t2v | <sub><sup>6.8<sub>(0.1)</sub></sup></sub> | <sub><sup>21.7<sub>(0.1)</sub></sup></sub> | <sub><sup>32.4<sub>(0.1)</sub></sup></sub> | <sub><sup>25.7<sub>(0.6)</sub></sup></sub> | 39.95M | config, model, log |
| Scene + RGB | t2v | <sub><sup>5.0<sub>(0.2)</sub></sup></sub> | <sub><sup>16.6<sub>(0.7)</sub></sup></sub> | <sub><sup>25.5<sub>(1.0)</sub></sup></sub> | <sub><sup>40.7<sub>(2.1)</sub></sup></sub> | 41.12M | config, model, log |
| Scene + Flow | t2v | <sub><sup>5.3<sub>(0.3)</sub></sup></sub> | <sub><sup>17.6<sub>(0.8)</sub></sup></sub> | <sub><sup>27.1<sub>(0.9)</sub></sup></sub> | <sub><sup>36.0<sub>(1.7)</sub></sup></sub> | 40.34M | config, model, log |
| Scene + Audio | t2v | <sub><sup>5.6<sub>(0.0)</sub></sup></sub> | <sub><sup>18.7<sub>(0.1)</sub></sup></sub> | <sub><sup>28.2<sub>(0.1)</sub></sup></sub> | <sub><sup>33.7<sub>(0.6)</sub></sup></sub> | 40.34M | config, model, log |
| Scene + OCR | t2v | <sub><sup>4.1<sub>(0.1)</sub></sup></sub> | <sub><sup>14.1<sub>(0.1)</sub></sup></sub> | <sub><sup>22.2<sub>(0.2)</sub></sup></sub> | <sub><sup>50.3<sub>(1.2)</sub></sup></sub> | 49.49M | config, model, log |
| Scene + Speech | t2v | <sub><sup>4.6<sub>(0.1)</sub></sup></sub> | <sub><sup>15.5<sub>(0.2)</sub></sup></sub> | <sub><sup>24.4<sub>(0.2)</sub></sup></sub> | <sub><sup>44.7<sub>(1.2)</sub></sup></sub> | 43.94M | config, model, log |
| Scene + Face | t2v | <sub><sup>4.1<sub>(0.1)</sub></sup></sub> | <sub><sup>14.2<sub>(0.3)</sub></sup></sub> | <sub><sup>22.4<sub>(0.4)</sub></sup></sub> | <sub><sup>49.7<sub>(0.6)</sub></sup></sub> | 39.95M | config, model, log |
| Scene | v2t | <sub><sup>5.6<sub>(0.6)</sub></sup></sub> | <sub><sup>18.2<sub>(0.6)</sub></sup></sub> | <sub><sup>27.7<sub>(0.3)</sub></sup></sub> | <sub><sup>39.0<sub>(0.0)</sub></sup></sub> | 19.46M | config, model, log |
| Scene + Inst. | v2t | <sub><sup>10.1<sub>(0.3)</sub></sup></sub> | <sub><sup>29.7<sub>(0.5)</sub></sup></sub> | <sub><sup>41.9<sub>(0.7)</sub></sup></sub> | <sub><sup>15.2<sub>(0.9)</sub></sup></sub> | 41.12M | config, model, log |
| Scene + r2p1d | v2t | <sub><sup>9.4<sub>(0.3)</sub></sup></sub> | <sub><sup>27.8<sub>(0.6)</sub></sup></sub> | <sub><sup>40.1<sub>(1.1)</sub></sup></sub> | <sub><sup>17.2<sub>(1.1)</sub></sup></sub> | 39.95M | config, model, log |
| Scene + RGB | v2t | <sub><sup>6.9<sub>(0.5)</sub></sup></sub> | <sub><sup>21.2<sub>(0.9)</sub></sup></sub> | <sub><sup>31.1<sub>(1.9)</sub></sup></sub> | <sub><sup>28.7<sub>(3.8)</sub></sup></sub> | 41.12M | config, model, log |
| Scene + Flow | v2t | <sub><sup>7.3<sub>(0.6)</sub></sup></sub> | <sub><sup>22.3<sub>(1.4)</sub></sup></sub> | <sub><sup>33.4<sub>(1.7)</sub></sup></sub> | <sub><sup>25.2<sub>(2.0)</sub></sup></sub> | 40.34M | config, model, log |
| Scene + Audio | v2t | <sub><sup>8.2<sub>(0.4)</sub></sup></sub> | <sub><sup>24.8<sub>(0.4)</sub></sup></sub> | <sub><sup>36.0<sub>(0.1)</sub></sup></sub> | <sub><sup>21.7<sub>(0.6)</sub></sup></sub> | 40.34M | config, model, log |
| Scene + OCR | v2t | <sub><sup>5.4<sub>(0.5)</sub></sup></sub> | <sub><sup>18.6<sub>(1.2)</sub></sup></sub> | <sub><sup>26.6<sub>(1.2)</sub></sup></sub> | <sub><sup>40.0<sub>(1.0)</sub></sup></sub> | 49.49M | config, model, log |
| Scene + Speech | v2t | <sub><sup>6.0<sub>(0.2)</sub></sup></sub> | <sub><sup>20.4<sub>(0.5)</sub></sup></sub> | <sub><sup>30.3<sub>(1.0)</sub></sup></sub> | <sub><sup>33.0<sub>(2.0)</sub></sup></sub> | 43.94M | config, model, log |
| Scene + Face | v2t | <sub><sup>5.6<sub>(1.0)</sub></sup></sub> | <sub><sup>17.9<sub>(0.7)</sub></sup></sub> | <sub><sup>26.7<sub>(0.8)</sub></sup></sub> | <sub><sup>39.1<sub>(2.6)</sub></sup></sub> | 39.95M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | <sub><sup>4.0<sub>(0.1)</sub></sup></sub> | <sub><sup>14.1<sub>(0.1)</sub></sup></sub> | <sub><sup>22.4<sub>(0.3)</sub></sup></sub> | <sub><sup>50.0<sub>(1.0)</sub></sup></sub> | 19.46M | config, model, log |
| Prev. + Speech | t2v | <sub><sup>4.6<sub>(0.1)</sub></sup></sub> | <sub><sup>15.5<sub>(0.2)</sub></sup></sub> | <sub><sup>24.4<sub>(0.2)</sub></sup></sub> | <sub><sup>44.7<sub>(1.2)</sub></sup></sub> | 43.94M | config, model, log |
| Prev. + Audio | t2v | <sub><sup>5.8<sub>(0.1)</sub></sup></sub> | <sub><sup>19.0<sub>(0.3)</sub></sup></sub> | <sub><sup>28.8<sub>(0.2)</sub></sup></sub> | <sub><sup>32.3<sub>(0.6)</sub></sup></sub> | 62.45M | config, model, log |
| Prev. + Flow | t2v | <sub><sup>6.7<sub>(0.2)</sub></sup></sub> | <sub><sup>21.8<sub>(0.4)</sub></sup></sub> | <sub><sup>32.5<sub>(0.5)</sub></sup></sub> | <sub><sup>25.3<sub>(0.6)</sub></sup></sub> | 80.96M | config, model, log |
| Prev. + RGB | t2v | <sub><sup>7.5<sub>(0.1)</sub></sup></sub> | <sub><sup>23.4<sub>(0.0)</sub></sup></sub> | <sub><sup>34.1<sub>(0.2)</sub></sup></sub> | <sub><sup>23.7<sub>(0.6)</sub></sup></sub> | 100.26M | config, model, log |
| Prev. + Inst. | t2v | <sub><sup>9.5<sub>(0.2)</sub></sup></sub> | <sub><sup>27.7<sub>(0.1)</sub></sup></sub> | <sub><sup>39.4<sub>(0.1)</sub></sup></sub> | <sub><sup>18.0<sub>(0.0)</sub></sup></sub> | 119.56M | config, model, log |
| Prev. + R2P1D | t2v | <sub><sup>9.9<sub>(0.1)</sub></sup></sub> | <sub><sup>28.6<sub>(0.3)</sub></sup></sub> | <sub><sup>40.7<sub>(0.1)</sub></sup></sub> | <sub><sup>17.0<sub>(0.0)</sub></sup></sub> | 137.67M | config, model, log |
| Prev. + OCR | t2v | <sub><sup>10.0<sub>(0.1)</sub></sup></sub> | <sub><sup>28.8<sub>(0.2)</sub></sup></sub> | <sub><sup>40.9<sub>(0.2)</sub></sup></sub> | <sub><sup>16.7<sub>(0.6)</sub></sup></sub> | 165.33M | config, model, log |
| Prev. + Face | t2v | <sub><sup>10.0<sub>(0.1)</sub></sup></sub> | <sub><sup>29.0<sub>(0.3)</sub></sup></sub> | <sub><sup>41.2<sub>(0.2)</sub></sup></sub> | <sub><sup>16.0<sub>(0.0)</sub></sup></sub> | 183.45M | config, model, log |
| Scene | v2t | <sub><sup>5.6<sub>(0.6)</sub></sup></sub> | <sub><sup>18.2<sub>(0.6)</sub></sup></sub> | <sub><sup>27.7<sub>(0.3)</sub></sup></sub> | <sub><sup>39.0<sub>(0.0)</sub></sup></sub> | 19.46M | config, model, log |
| Prev. + Speech | v2t | <sub><sup>6.0<sub>(0.2)</sub></sup></sub> | <sub><sup>20.4<sub>(0.5)</sub></sup></sub> | <sub><sup>30.3<sub>(1.0)</sub></sup></sub> | <sub><sup>33.0<sub>(2.0)</sub></sup></sub> | 43.94M | config, model, log |
| Prev. + Audio | v2t | <sub><sup>8.6<sub>(0.2)</sub></sup></sub> | <sub><sup>26.1<sub>(0.6)</sub></sup></sub> | <sub><sup>37.8<sub>(0.8)</sub></sup></sub> | <sub><sup>19.8<sub>(0.8)</sub></sup></sub> | 62.45M | config, model, log |
| Prev. + Flow | v2t | <sub><sup>9.9<sub>(0.4)</sub></sup></sub> | <sub><sup>28.6<sub>(0.7)</sub></sup></sub> | <sub><sup>41.7<sub>(0.8)</sub></sup></sub> | <sub><sup>15.7<sub>(0.6)</sub></sup></sub> | 80.96M | config, model, log |
| Prev. + RGB | v2t | <sub><sup>11.2<sub>(0.3)</sub></sup></sub> | <sub><sup>32.1<sub>(0.8)</sub></sup></sub> | <sub><sup>45.4<sub>(0.6)</sub></sup></sub> | <sub><sup>13.7<sub>(0.6)</sub></sup></sub> | 100.26M | config, model, log |
| Prev. + Inst. | v2t | <sub><sup>14.7<sub>(0.6)</sub></sup></sub> | <sub><sup>38.9<sub>(0.8)</sub></sup></sub> | <sub><sup>53.1<sub>(1.0)</sub></sup></sub> | <sub><sup>9.3<sub>(0.6)</sub></sup></sub> | 119.56M | config, model, log |
| Prev. + R2P1D | v2t | <sub><sup>15.5<sub>(0.6)</sub></sup></sub> | <sub><sup>40.1<sub>(1.2)</sub></sup></sub> | <sub><sup>54.4<sub>(1.3)</sub></sup></sub> | <sub><sup>8.7<sub>(0.6)</sub></sup></sub> | 137.67M | config, model, log |
| Prev. + OCR | v2t | <sub><sup>15.2<sub>(0.1)</sub></sup></sub> | <sub><sup>41.1<sub>(0.6)</sub></sup></sub> | <sub><sup>54.6<sub>(0.7)</sub></sup></sub> | <sub><sup>8.5<sub>(0.5)</sub></sup></sub> | 165.33M | config, model, log |
| Prev. + Face | v2t | <sub><sup>15.6<sub>(0.3)</sub></sup></sub> | <sub><sup>40.9<sub>(1.4)</sub></sup></sub> | <sub><sup>55.2<sub>(1.0)</sub></sup></sub> | <sub><sup>8.3<sub>(0.6)</sub></sup></sub> | 183.45M | config, model, log |

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | <sub><sup>9.4<sub>(0.2)</sub></sup></sub> | <sub><sup>27.8<sub>(0.4)</sub></sup></sub> | <sub><sup>39.8<sub>(0.4)</sub></sup></sub> | <sub><sup>17.7<sub>(0.6)</sub></sup></sub> | 88.62M | config, model, log |
| 512 | t2v | <sub><sup>9.8<sub>(0.3)</sub></sup></sub> | <sub><sup>28.6<sub>(0.4)</sub></sup></sub> | <sub><sup>40.6<sub>(0.4)</sub></sup></sub> | <sub><sup>17.0<sub>(0.0)</sub></sup></sub> | 119.51M | config, model, log |
| 640 | t2v | <sub><sup>10.1<sub>(0.1)</sub></sup></sub> | <sub><sup>28.8<sub>(0.1)</sub></sup></sub> | <sub><sup>40.9<sub>(0.2)</sub></sup></sub> | <sub><sup>16.7<sub>(0.6)</sub></sup></sub> | 151.12M | config, model, log |
| 768 | t2v | <sub><sup>10.0<sub>(0.1)</sub></sup></sub> | <sub><sup>29.0<sub>(0.3)</sub></sup></sub> | <sub><sup>41.2<sub>(0.2)</sub></sup></sub> | <sub><sup>16.0<sub>(0.0)</sub></sup></sub> | 183.45M | config, model, log |
| 1024 | t2v | <sub><sup>9.9<sub>(0.1)</sub></sup></sub> | <sub><sup>28.6<sub>(0.3)</sub></sup></sub> | <sub><sup>40.7<sub>(0.4)</sub></sup></sub> | <sub><sup>17.0<sub>(0.0)</sub></sup></sub> | 250.27M | config, model, log |
| 384 | v2t | <sub><sup>14.0<sub>(0.5)</sub></sup></sub> | <sub><sup>38.7<sub>(0.5)</sub></sup></sub> | <sub><sup>52.7<sub>(1.4)</sub></sup></sub> | <sub><sup>9.3<sub>(0.6)</sub></sup></sub> | 88.62M | config, model, log |
| 512 | v2t | <sub><sup>14.8<sub>(0.4)</sub></sup></sub> | <sub><sup>40.4<sub>(0.6)</sub></sup></sub> | <sub><sup>53.9<sub>(0.4)</sub></sup></sub> | <sub><sup>8.8<sub>(0.3)</sub></sup></sub> | 119.51M | config, model, log |
| 640 | v2t | <sub><sup>15.6<sub>(0.6)</sub></sup></sub> | <sub><sup>41.3<sub>(0.7)</sub></sup></sub> | <sub><sup>55.0<sub>(0.5)</sub></sup></sub> | <sub><sup>8.3<sub>(0.6)</sub></sup></sub> | 151.12M | config, model, log |
| 768 | v2t | <sub><sup>15.6<sub>(0.3)</sub></sup></sub> | <sub><sup>40.9<sub>(1.4)</sub></sup></sub> | <sub><sup>55.2<sub>(1.0)</sub></sup></sub> | <sub><sup>8.3<sub>(0.6)</sub></sup></sub> | 183.45M | config, model, log |
| 1024 | v2t | <sub><sup>14.7<sub>(0.4)</sub></sup></sub> | <sub><sup>40.7<sub>(0.8)</sub></sup></sub> | <sub><sup>54.4<sub>(0.3)</sub></sup></sub> | <sub><sup>8.5<sub>(0.5)</sub></sup></sub> | 250.27M | config, model, log |

Training with more captions: Rather than varying the number of experts, we can also investigate how performance changes as we change the number of training captions available per video.

| Experts | Caps. | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RGB | 1 | t2v | <sub><sup>2.6<sub>(0.1)</sub></sup></sub> | <sub><sup>9.3<sub>(0.4)</sub></sup></sub> | <sub><sup>15.0<sub>(0.7)</sub></sup></sub> | <sub><sup>101.3<sub>(15.5)</sub></sup></sub> | 56.7M | config, model, log |
| RGB | 20 | t2v | <sub><sup>4.9<sub>(0.1)</sub></sup></sub> | <sub><sup>16.5<sub>(0.2)</sub></sup></sub> | <sub><sup>25.3<sub>(0.4)</sub></sup></sub> | <sub><sup>40.7<sub>(1.2)</sub></sup></sub> | 58.05M | config, model, log |
| All | 1 | t2v | <sub><sup>4.8<sub>(0.2)</sub></sup></sub> | <sub><sup>16.2<sub>(0.5)</sub></sup></sub> | <sub><sup>25.0<sub>(0.7)</sub></sup></sub> | <sub><sup>43.3<sub>(4.0)</sub></sup></sub> | 183.45M | config, model, log |
| All | 20 | t2v | <sub><sup>10.0<sub>(0.1)</sub></sup></sub> | <sub><sup>29.0<sub>(0.3)</sub></sup></sub> | <sub><sup>41.2<sub>(0.2)</sub></sup></sub> | <sub><sup>16.0<sub>(0.0)</sub></sup></sub> | 183.45M | config, model, log |
| RGB | 1 | v2t | <sub><sup>3.7<sub>(0.3)</sub></sup></sub> | <sub><sup>13.5<sub>(0.6)</sub></sup></sub> | <sub><sup>20.8<sub>(0.4)</sub></sup></sub> | <sub><sup>60.0<sub>(2.0)</sub></sup></sub> | 56.7M | config, model, log |
| RGB | 20 | v2t | <sub><sup>6.9<sub>(0.6)</sub></sup></sub> | <sub><sup>21.0<sub>(0.3)</sub></sup></sub> | <sub><sup>31.3<sub>(0.3)</sub></sup></sub> | <sub><sup>30.0<sub>(1.7)</sub></sup></sub> | 58.05M | config, model, log |
| All | 1 | v2t | <sub><sup>8.4<sub>(0.5)</sub></sup></sub> | <sub><sup>25.6<sub>(0.7)</sub></sup></sub> | <sub><sup>37.1<sub>(0.2)</sub></sup></sub> | <sub><sup>20.3<sub>(0.6)</sub></sup></sub> | 183.45M | config, model, log |
| All | 20 | v2t | <sub><sup>15.6<sub>(0.3)</sub></sup></sub> | <sub><sup>40.9<sub>(1.4)</sub></sup></sub> | <sub><sup>55.2<sub>(1.0)</sub></sup></sub> | <sub><sup>8.3<sub>(0.6)</sub></sup></sub> | 183.45M | config, model, log |

Similar ablation studies for the remaining datasets can be found here.

Expert Zoo

For each dataset, the Collaborative Experts model makes use of a collection of pretrained "expert" feature extractors (see [1] for more precise descriptions). Some experts have been obtained from other sources (described where applicable), rather than extracted by us. To reproduce the experiments listed above, the experts for each dataset have been bundled into compressed tar files. These can be downloaded and unpacked with a utility script (recommended -- see example usage below), which will store them in the locations expected by the training code. Each set of experts has a brief README, which also provides a link from which they can be downloaded directly.

| Dataset | Experts | Details and links | Archive size | sha1sum |
| --- | --- | --- | --- | --- |
| MSRVTT | audio, face, flow, ocr, rgb, scene, speech | README | 19.6 GiB | <sup><sub><sup><sub>959bda588793ef05f348d16de26da84200c5a469</sub></sup></sub></sup> |
| LSMDC | audio, face, flow, ocr, rgb, scene | README | 6.1 GiB | <sup><sub><sup><sub>7ce018e981752db9e793e449c2ba5bc88217373d</sub></sup></sub></sup> |
| MSVD | face, flow, ocr, rgb, scene | README | 2.1 GiB | <sup><sub><sup><sub>6071827257c14de455b3a13fe1e885c2a7887c9e</sub></sup></sub></sup> |
| DiDeMo | audio, face, flow, ocr, rgb, scene, speech | README | 2.3 GiB | <sup><sub><sup><sub>6fd4bcc68c1611052de2499fd8ab3f488c7c195b</sub></sup></sub></sup> |
| ActivityNet | audio, face, flow, ocr, rgb, scene, speech | README | 3.8 GiB | <sup><sub><sup><sub>b16685576c97cdec2783fb89ea30ca7d17abb021</sub></sup></sub></sup> |
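
If an archive is downloaded directly rather than through the utility script, its integrity can be checked against the sha1sum column. The sketch below uses only the Python standard library; the helper names `sha1_of_file` and `verify_archive` and the archive path in the usage comment are hypothetical and not part of this repository.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha1):
    """Return True if the file's SHA-1 matches the expected hex digest."""
    return sha1_of_file(path) == expected_sha1.strip().lower()


# Usage (the archive path is a placeholder; the digest is the MSVD entry above):
# ok = verify_archive("path/to/msvd-archive.tar.gz",
#                     "6071827257c14de455b3a13fe1e885c2a7887c9e")
```

Reading in chunks keeps memory use flat, which matters for the larger archives.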

QuerYD

MODEL study on QUERYD

Importance of the model:

| Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HowTo100m S3D | t2v | <sub><sup>13.5<sub>(0.0)</sub></sup></sub> | <sub><sup>27.5<sub>(0.0)</sub></sup></sub> | <sub><sup>34.5<sub>(0.0)</sub></sup></sub> | <sub><sup>57.0<sub>(0.0)</sub></sup></sub> | <sub><sup>35.0<sub>(0.0)</sub></sup></sub> | <sub><sup>72.5<sub>(0.0)</sub></sup></sub> | <sub><sup>23.4<sub>(0.0)</sub></sup></sub> | 1 | config, model, log |
| CE - P,CG | t2v | <sub><sup>11.6<sub>(1.3)</sub></sup></sub> | <sub><sup>30.2<sub>(3.0)</sub></sup></sub> | <sub><sup>43.2<sub>(3.1)</sub></sup></sub> | <sub><sup>74.8<sub>(1.7)</sub></sup></sub> | <sub><sup>14.2<sub>(1.6)</sub></sup></sub> | <sub><sup>42.7<sub>(2.6)</sub></sup></sub> | <sub><sup>24.7<sub>(1.9)</sub></sup></sub> | 57.75M | config, model, log |
| CE | t2v | <sub><sup>13.9<sub>(0.8)</sub></sup></sub> | <sub><sup>37.6<sub>(1.2)</sub></sup></sub> | <sub><sup>48.3<sub>(1.4)</sub></sup></sub> | <sub><sup>78.8<sub>(0.7)</sub></sup></sub> | <sub><sup>11.3<sub>(0.6)</sub></sup></sub> | <sub><sup>35.1<sub>(1.6)</sub></sup></sub> | <sub><sup>29.3<sub>(0.8)</sub></sup></sub> | 30.82M | config, model, log |
| HowTo100m S3D | v2t | <sub><sup>12.4<sub>(0.0)</sub></sup></sub> | <sub><sup>23.8<sub>(0.0)</sub></sup></sub> | <sub><sup>30.8<sub>(0.0)</sub></sup></sub> | <sub><sup>57.0<sub>(0.0)</sub></sup></sub> | <sub><sup>33.0<sub>(0.0)</sub></sup></sub> | <sub><sup>73.4<sub>(0.0)</sub></sup></sub> | <sub><sup>20.9<sub>(0.0)</sub></sup></sub> | 1 | config, model, log |
| CE - P,CG | v2t | <sub><sup>13.0<sub>(3.1)</sub></sup></sub> | <sub><sup>30.9<sub>(2.0)</sub></sup></sub> | <sub><sup>43.0<sub>(2.8)</sub></sup></sub> | <sub><sup>73.2<sub>(0.1)</sub></sup></sub> | <sub><sup>14.5<sub>(1.8)</sub></sup></sub> | <sub><sup>42.6<sub>(1.5)</sub></sup></sub> | <sub><sup>25.7<sub>(2.3)</sub></sup></sub> | 57.75M | config, model, log |
| CE | v2t | <sub><sup>13.7<sub>(0.7)</sub></sup></sub> | <sub><sup>35.2<sub>(2.7)</sub></sup></sub> | <sub><sup>46.9<sub>(3.2)</sub></sup></sub> | <sub><sup>78.3<sub>(2.8)</sub></sup></sub> | <sub><sup>12.3<sub>(1.5)</sub></sup></sub> | <sub><sup>35.8<sub>(2.4)</sub></sup></sub> | <sub><sup>28.3<sub>(1.5)</sub></sup></sub> | 30.82M | config, model, log |

This study examines how different pretrained experts influence the performance of the CE model trained on QuerYD. It reports the individual value and the cumulative effect of experts for scene classification (SCENE), ambient sound classification (AUDIO), image classification (OBJECT), and action recognition (ACTION). PREV. denotes the experts used in the previous row.

Ablation studies on QuerYD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the QuerYD dataset.

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | <sub><sup>8.7<sub>(0.4)</sub></sup></sub> | <sub><sup>26.3<sub>(1.1)</sub></sup></sub> | <sub><sup>37.1<sub>(0.7)</sub></sup></sub> | <sub><sup>68.5<sub>(2.2)</sub></sup></sub> | <sub><sup>22.2<sub>(1.6)</sub></sup></sub> | <sub><sup>52.3<sub>(3.0)</sub></sup></sub> | <sub><sup>20.4<sub>(0.1)</sub></sup></sub> | 7.51M | config, model, log |
| Scene + Inst. | t2v | <sub><sup>11.7<sub>(1.4)</sub></sup></sub> | <sub><sup>31.6<sub>(0.9)</sub></sup></sub> | <sub><sup>43.4<sub>(1.3)</sub></sup></sub> | <sub><sup>74.5<sub>(0.9)</sub></sup></sub> | <sub><sup>14.0<sub>(1.0)</sub></sup></sub> | <sub><sup>41.1<sub>(2.1)</sub></sup></sub> | <sub><sup>25.2<sub>(0.8)</sub></sup></sub> | 17.25M | config, model, log |
| Scene + r2p1d | t2v | <sub><sup>11.7<sub>(2.1)</sub></sup></sub> | <sub><sup>32.1<sub>(3.0)</sub></sup></sub> | <sub><sup>45.3<sub>(3.3)</sub></sup></sub> | <sub><sup>74.6<sub>(0.4)</sub></sup></sub> | <sub><sup>13.7<sub>(1.9)</sub></sup></sub> | <sub><sup>42.9<sub>(2.2)</sub></sup></sub> | <sub><sup>25.7<sub>(2.4)</sub></sup></sub> | 16.07M | config, model, log |
| Scene + Audio | t2v | <sub><sup>7.6<sub>(2.7)</sub></sup></sub> | <sub><sup>27.4<sub>(1.4)</sub></sup></sub> | <sub><sup>40.4<sub>(0.9)</sub></sup></sub> | <sub><sup>69.1<sub>(0.9)</sub></sup></sub> | <sub><sup>17.0<sub>(1.7)</sub></sup></sub> | <sub><sup>49.0<sub>(1.9)</sub></sup></sub> | <sub><sup>20.2<sub>(2.3)</sub></sup></sub> | 17.25M | config, model, log |
| Scene | v2t | <sub><sup>9.1<sub>(0.8)</sub></sup></sub> | <sub><sup>25.4<sub>(0.9)</sub></sup></sub> | <sub><sup>35.3<sub>(1.5)</sub></sup></sub> | <sub><sup>68.2<sub>(2.2)</sub></sup></sub> | <sub><sup>23.2<sub>(0.3)</sub></sup></sub> | <sub><sup>52.6<sub>(2.6)</sub></sup></sub> | <sub><sup>20.1<sub>(0.5)</sub></sup></sub> | 7.51M | config, model, log |
| Scene + Inst. | v2t | <sub><sup>11.9<sub>(0.5)</sub></sup></sub> | <sub><sup>31.0<sub>(3.6)</sub></sup></sub> | <sub><sup>43.5<sub>(2.7)</sub></sup></sub> | <sub><sup>74.8<sub>(1.8)</sub></sup></sub> | <sub><sup>14.5<sub>(0.9)</sub></sup></sub> | <sub><sup>40.8<sub>(2.1)</sub></sup></sub> | <sub><sup>25.2<sub>(1.1)</sub></sup></sub> | 17.25M | config, model, log |
| Scene + r2p1d | v2t | <sub><sup>12.7<sub>(1.4)</sub></sup></sub> | <sub><sup>30.9<sub>(2.8)</sub></sup></sub> | <sub><sup>44.0<sub>(1.8)</sub></sup></sub> | <sub><sup>74.3<sub>(1.2)</sub></sup></sub> | <sub><sup>14.3<sub>(1.2)</sub></sup></sub> | <sub><sup>42.8<sub>(1.7)</sub></sup></sub> | <sub><sup>25.8<sub>(1.7)</sub></sup></sub> | 16.07M | config, model, log |
| Scene + Audio | v2t | <sub><sup>10.1<sub>(1.2)</sub></sup></sub> | <sub><sup>25.7<sub>(1.5)</sub></sup></sub> | <sub><sup>37.5<sub>(1.2)</sub></sup></sub> | <sub><sup>69.8<sub>(1.6)</sub></sup></sub> | <sub><sup>20.0<sub>(1.3)</sub></sup></sub> | <sub><sup>48.9<sub>(2.0)</sub></sup></sub> | <sub><sup>21.3<sub>(1.1)</sub></sup></sub> | 17.25M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | <sub><sup>8.7<sub>(0.4)</sub></sup></sub> | <sub><sup>26.3<sub>(1.1)</sub></sup></sub> | <sub><sup>37.1<sub>(0.7)</sub></sup></sub> | <sub><sup>68.5<sub>(2.2)</sub></sup></sub> | <sub><sup>22.2<sub>(1.6)</sub></sup></sub> | <sub><sup>52.3<sub>(3.0)</sub></sup></sub> | <sub><sup>20.4<sub>(0.1)</sub></sup></sub> | 7.51M | config, model, log |
| Prev. + Audio | t2v | <sub><sup>7.6<sub>(2.7)</sub></sup></sub> | <sub><sup>27.4<sub>(1.4)</sub></sup></sub> | <sub><sup>40.4<sub>(0.9)</sub></sup></sub> | <sub><sup>69.1<sub>(0.9)</sub></sup></sub> | <sub><sup>17.0<sub>(1.7)</sub></sup></sub> | <sub><sup>49.0<sub>(1.9)</sub></sup></sub> | <sub><sup>20.2<sub>(2.3)</sub></sup></sub> | 17.25M | config, model, log |
| Prev. + Inst | t2v | <sub><sup>12.7<sub>(1.7)</sub></sup></sub> | <sub><sup>34.8<sub>(1.7)</sub></sup></sub> | <sub><sup>47.0<sub>(1.3)</sub></sup></sub> | <sub><sup>78.0<sub>(1.0)</sub></sup></sub> | <sub><sup>12.3<sub>(0.6)</sub></sup></sub> | <sub><sup>37.6<sub>(2.1)</sub></sup></sub> | <sub><sup>27.5<sub>(1.5)</sub></sup></sub> | 24.63M | config, model, log |
| Prev. + R2P1D | t2v | <sub><sup>14.3<sub>(0.3)</sub></sup></sub> | <sub><sup>37.5<sub>(1.3)</sub></sup></sub> | <sub><sup>48.6<sub>(0.8)</sub></sup></sub> | <sub><sup>78.8<sub>(0.3)</sub></sup></sub> | <sub><sup>11.3<sub>(0.6)</sub></sup></sub> | <sub><sup>35.2<sub>(1.8)</sub></sup></sub> | <sub><sup>29.7<sub>(0.3)</sub></sup></sub> | 30.82M | config, model, log |
| Scene | v2t | <sub><sup>9.1<sub>(0.8)</sub></sup></sub> | <sub><sup>25.4<sub>(0.9)</sub></sup></sub> | <sub><sup>35.3<sub>(1.5)</sub></sup></sub> | <sub><sup>68.2<sub>(2.2)</sub></sup></sub> | <sub><sup>23.2<sub>(0.3)</sub></sup></sub> | <sub><sup>52.6<sub>(2.6)</sub></sup></sub> | <sub><sup>20.1<sub>(0.5)</sub></sup></sub> | 7.51M | config, model, log |
| Prev. + Audio | v2t | <sub><sup>10.1<sub>(1.2)</sub></sup></sub> | <sub><sup>25.7<sub>(1.5)</sub></sup></sub> | <sub><sup>37.5<sub>(1.2)</sub></sup></sub> | <sub><sup>69.8<sub>(1.6)</sub></sup></sub> | <sub><sup>20.0<sub>(1.3)</sub></sup></sub> | <sub><sup>48.9<sub>(2.0)</sub></sup></sub> | <sub><sup>21.3<sub>(1.1)</sub></sup></sub> | 17.25M | config, model, log |
| Prev. + Inst. | v2t | <sub><sup>12.8<sub>(1.3)</sub></sup></sub> | <sub><sup>33.5<sub>(2.8)</sub></sup></sub> | <sub><sup>46.6<sub>(1.0)</sub></sup></sub> | <sub><sup>76.7<sub>(1.7)</sub></sup></sub> | <sub><sup>11.8<sub>(0.8)</sub></sup></sub> | <sub><sup>37.6<sub>(1.9)</sub></sup></sub> | <sub><sup>27.1<sub>(0.6)</sub></sup></sub> | 24.63M | config, model, log |
| Prev. + R2P1D | v2t | <sub><sup>14.0<sub>(0.3)</sub></sup></sub> | <sub><sup>35.4<sub>(2.9)</sub></sup></sub> | <sub><sup>47.2<sub>(2.8)</sub></sup></sub> | <sub><sup>78.7<sub>(2.4)</sub></sup></sub> | <sub><sup>12.3<sub>(1.5)</sub></sup></sub> | <sub><sup>35.8<sub>(2.4)</sub></sup></sub> | <sub><sup>28.6<sub>(1.2)</sub></sup></sub> | 30.82M | config, model, log |

QuerYDSegments

MODEL study on QUERYDSEGMENTS

Importance of the model:

| Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HowTo100m S3D | t2v | <sub><sup>6.7<sub>(0.0)</sub></sup></sub> | <sub><sup>14.7<sub>(0.0)</sub></sup></sub> | <sub><sup>20.4<sub>(0.0)</sub></sup></sub> | <sub><sup>36.6<sub>(0.0)</sub></sup></sub> | <sub><sup>133.0<sub>(0.0)</sub></sup></sub> | <sub><sup>342.0<sub>(0.0)</sub></sup></sub> | <sub><sup>12.6<sub>(0.0)</sub></sup></sub> | 1 | config, model, log |
| CE - P,CG | t2v | <sub><sup>19.0<sub>(0.8)</sub></sup></sub> | <sub><sup>38.9<sub>(1.0)</sub></sup></sub> | <sub><sup>47.9<sub>(0.7)</sub></sup></sub> | <sub><sup>68.0<sub>(0.4)</sub></sup></sub> | <sub><sup>12.0<sub>(1.0)</sub></sup></sub> | <sub><sup>127.4<sub>(5.9)</sub></sup></sub> | <sub><sup>32.8<sub>(0.6)</sub></sup></sub> | 57.75M | config, model, log |
| CE | t2v | <sub><sup>18.2<sub>(0.5)</sub></sup></sub> | <sub><sup>38.1<sub>(0.8)</sub></sup></sub> | <sub><sup>46.8<sub>(0.4)</sub></sup></sub> | <sub><sup>67.3<sub>(0.7)</sub></sup></sub> | <sub><sup>13.3<sub>(0.6)</sub></sup></sub> | <sub><sup>127.5<sub>(3.9)</sub></sup></sub> | <sub><sup>31.9<sub>(0.4)</sub></sup></sub> | 30.82M | config, model, log |
| HowTo100m S3D | v2t | <sub><sup>8.4<sub>(0.0)</sub></sup></sub> | <sub><sup>15.4<sub>(0.0)</sub></sup></sub> | <sub><sup>19.8<sub>(0.0)</sub></sup></sub> | <sub><sup>34.2<sub>(0.0)</sub></sup></sub> | <sub><sup>154.5<sub>(0.0)</sub></sup></sub> | <sub><sup>363.0<sub>(0.0)</sub></sup></sub> | <sub><sup>13.7<sub>(0.0)</sub></sup></sub> | 1 | config, model, log |
| CE - P,CG | v2t | <sub><sup>19.8<sub>(0.2)</sub></sup></sub> | <sub><sup>39.6<sub>(0.6)</sub></sup></sub> | <sub><sup>47.6<sub>(0.1)</sub></sup></sub> | <sub><sup>67.9<sub>(0.5)</sub></sup></sub> | <sub><sup>13.0<sub>(0.0)</sub></sup></sub> | <sub><sup>124.3<sub>(5.5)</sub></sup></sub> | <sub><sup>33.4<sub>(0.2)</sub></sup></sub> | 57.75M | config, model, log |
| CE | v2t | <sub><sup>18.1<sub>(0.6)</sub></sup></sub> | <sub><sup>37.3<sub>(0.5)</sub></sup></sub> | <sub><sup>45.9<sub>(0.6)</sub></sup></sub> | <sub><sup>67.2<sub>(0.2)</sub></sup></sub> | <sub><sup>14.0<sub>(1.0)</sub></sup></sub> | <sub><sup>123.9<sub>(3.3)</sub></sup></sub> | <sub><sup>31.4<sub>(0.4)</sub></sup></sub> | 30.82M | config, model, log |

Evaluating a pretrained model

Evaluating a pretrained model for a given dataset requires:

  1. The pretrained experts for the target dataset, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
  2. A config.json file.
  3. A trained_model.pth file.
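
Before launching an evaluation, it can be convenient to confirm that the three prerequisites listed above are in place. The following is a minimal sketch, assuming the directory layout described in item 1; the function name `missing_prerequisites` is hypothetical and not part of this repository.

```python
from pathlib import Path


def missing_prerequisites(root, dataset_name, config_path, checkpoint_path):
    """Return descriptions of any evaluation prerequisites that do not exist."""
    checks = {
        # Layout taken from the text: <root>/data/<dataset-name>/symlinked-feats
        "expert features directory": Path(root) / "data" / dataset_name / "symlinked-feats",
        "config file": Path(config_path),
        "trained model checkpoint": Path(checkpoint_path),
    }
    return [f"{label}: {path}" for label, path in checks.items() if not path.exists()]


# Usage (paths are placeholders):
# for problem in missing_prerequisites(".", "MSVD", "config.json", "trained_model.pth"):
#     print("missing", problem)
```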

Evaluation is then performed with the following command:

```bash
python3 test.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>
```

where <gpu-id> is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.
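
When evaluations are scripted, the same command can be assembled programmatically. The sketch below builds the argument list using only the flags shown above and leaves out `--device` when no GPU index is given; `build_eval_command` is a hypothetical helper, not a function provided by this repository.

```python
import shlex


def build_eval_command(config_path, checkpoint_path, device=None):
    """Build the argument list for test.py; omit --device to evaluate on the CPU."""
    cmd = [
        "python3", "test.py",
        "--config", str(config_path),
        "--resume", str(checkpoint_path),
    ]
    if device is not None:
        cmd += ["--device", str(device)]
    return cmd


# Print a shell-safe rendering of the command for GPU 0.
example = build_eval_command("configs/msvd/train-full-ce.json", "trained_model.pth", device=0)
print(" ".join(shlex.quote(arg) for arg in example))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.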

For example, to reproduce the MSVD results described above, run the following sequence of commands:

```bash
# fetch the pretrained experts for MSVD
python3 misc/sync_experts.py --dataset MSVD

# find the name of a pretrained model using the links in the tables above
export MODEL=data/models/msvd-train-full-ce/5bb8dda1/seed-0/2020-01-30_12-29-56/trained_model.pth

# create a local directory and download the model into it
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/${MODEL}"

# Evaluate the model
python3 test.py --config configs/msvd/train-full-ce.json --resume ${MODEL} --device 0 --eval_from_training_config
```
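
On systems without `wget`, the download step can be reproduced with the Python standard library. This sketch mirrors the `mkdir` and `wget` commands above: the base URL is copied from that example, while the helper names `model_url` and `download_model` are hypothetical and not part of this repository.

```python
import os
import urllib.request

# Base URL taken from the wget command in the example above.
BASE_URL = "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/"


def model_url(model_path):
    """Map a relative model path to its download URL."""
    return BASE_URL + model_path.lstrip("/")


def download_model(model_path):
    """Create the parent directory and download the checkpoint into it."""
    directory = os.path.dirname(model_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    urllib.request.urlretrieve(model_url(model_path), model_path)
    return model_path


# Usage (performs a network request, so it is left commented out):
# download_model("data/models/msvd-train-full-ce/5bb8dda1/seed-0/"
#                "2020-01-30_12-29-56/trained_model.pth")
```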

Training a new model

Training a new video-text embedding requires:

  1. The pretrained experts for the dataset used for training, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
  2. A config.json file. You can define your own, or use one of the provided configs in the configs directory.
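
To see which provided configs are available as starting points, the configs directory can be listed from Python. This is a minimal sketch that assumes the configs are JSON files grouped in per-dataset subdirectories (as in `configs/msvd/train-full-ce.json`); the helper name `list_configs` is hypothetical.

```python
from pathlib import Path


def list_configs(configs_dir="configs", dataset=None):
    """Return sorted paths of JSON config files, optionally for one dataset."""
    root = Path(configs_dir)
    if dataset is not None:
        # Assumption: each dataset has its own subdirectory, e.g. configs/msvd.
        root = root / dataset
    if not root.is_dir():
        return []
    return sorted(str(path) for path in root.rglob("*.json"))


# Usage, run from the repository root:
# for path in list_configs(dataset="lsmdc"):
#     print(path)
```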

Training is then performed with the following command:

```bash
python3 train.py --config <path-to-config.json> --device <gpu-id>
```

where <gpu-id> is the index of the GPU to train on. This option can be omitted to run the training on the CPU.

For example, to train a new embedding for the LSMDC dataset, run the following sequence of commands:

```bash
# fetch the pretrained experts for LSMDC
python3 misc/sync_experts.py --dataset LSMDC

# Train the model
python3 train.py --config configs/lsmdc/train-full-ce.json --device 0
```

Visualising the retrieval ranking

TensorBoard lacks video support via HTML5 tags (at the time of writing), so after each evaluation of a retrieval model a simple HTML file is generated to allow the predicted rankings of different videos to be visualised (this tool is inspired by the visualiser in the pix2pix codebase). To view the visualisation, navigate to the web directory (this is generated for each experiment, and its location is printed in the log during training), run python3 -m http.server 9999, and then open localhost:9999 in your web browser. You should see something like the following:

visualisation
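
The web directory can also be served without changing into it first. The sketch below uses the standard library's `http.server` module with the `directory` argument of `SimpleHTTPRequestHandler` (available from Python 3.7); `make_visualisation_server` is a hypothetical helper and the path in the usage comment is a placeholder.

```python
import functools
import http.server


def make_visualisation_server(web_dir, port=9999):
    """Create an HTTP server that serves the files under web_dir on localhost."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(web_dir)
    )
    return http.server.ThreadingHTTPServer(("localhost", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
# server = make_visualisation_server("<path-to-web-dir>")
# server.serve_forever()
```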

Note that visualising the results in this manner requires that you also download the source videos for each of the datasets to some directory <src-video-dir>, and then set the visualizer.args.src_video_dir attribute of the training config.json file to point to <src-video-dir>.
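
The attribute can be edited by hand, or set with a short script. The sketch below assumes that the dotted name `visualizer.args.src_video_dir` corresponds to nested keys in the JSON file; the helper names are hypothetical and not part of this repository.

```python
import json


def set_src_video_dir(config, src_video_dir):
    """Set visualizer.args.src_video_dir on a config dict (modified in place)."""
    # Assumption: the dotted attribute maps to nested JSON objects.
    args = config.setdefault("visualizer", {}).setdefault("args", {})
    args["src_video_dir"] = str(src_video_dir)
    return config


def update_config_file(config_path, src_video_dir):
    """Load a config.json, set the source video directory, and write it back."""
    with open(config_path) as f:
        config = json.load(f)
    set_src_video_dir(config, src_video_dir)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)


# Usage (both paths are placeholders):
# update_config_file("path/to/config.json", "<src-video-dir>")
```

Other keys in the config are left untouched, although the file is rewritten with four-space indentation.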

Dependencies

Dependencies can be installed via pip install -r requirements/pip-requirements.txt.

References

[1] If you find this code useful or use the extracted features, please consider citing:

@inproceedings{croitoru2021teachtext,
  title={Teachtext: Crossmodal generalized distillation for text-video retrieval},
  author={Croitoru, I. and Bogolin, S. and Leordeanu, M. and Jin, H. and Zisserman, A. and Albanie, S. and Liu, Y.},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={11583--11593},
  year={2021}
}

@inproceedings{Liu2019a,
  author    = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},
  booktitle = {arXiv preprint arxiv:1907.13487},
  title     = {Use What You Have: Video retrieval using representations from collaborative experts},
  date      = {2019},
}

[2] If you make use of the MSRVTT or LSMDC features provided by Miech et al. (details are given in their respective READMEs here and here), please cite:

@article{miech2018learning,
  title={Learning a text-video embedding from incomplete and heterogeneous data},
  author={Miech, Antoine and Laptev, Ivan and Sivic, Josef},
  journal={arXiv preprint arXiv:1804.02516},
  year={2018}
}

Acknowledgements

This work was inspired by a number of prior works for learning joint embeddings of text and video, but in particular the Mixture-of-Embedding-Experts method proposed by Antoine Miech, Ivan Laptev and Josef Sivic (paper, code). We would also like to thank Zak Stone and Susie Lim for their help with using Cloud TPUs. The code structure uses the pytorch-template by @victoresque.