:airplane: avion

AVION is short for A VIdeo model in ONe day. AVION (meaning plane in French and Spanish) is fast.

Training a Large Video Model on a Single Machine in a Day
Yue Zhao, Philipp Krähenbühl
UT Austin
arxiv | bibtex

Installation

See INSTALL.md to install this code.

Main results

  1. AVION enables video-language contrastive pre-training on Ego4D (original narratives) on a single node of 8× consumer-grade GPUs within a day.

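    | Method | Backbone | batch-size<br>per GPU | GPU memory | Hardware | GPU×hour^ | EK100 MIR<br>0-shot Avg. mAP |
    | --- | --- | --- | --- | --- | --- | --- |
    | EgoVLP | TSF-B | 16 | 22 | 32× A100 | 1536 | 22.1 |
    | Ours | ViT-B | 256 | 19 | 8× A5000 | 130 | 27.4 |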

    ^The reported GPU×hour is not normalized for GPU generations. The cost for EgoVLP is obtained from the original paper (Sec 6.1).

  2. AVION speeds up LLM-augmented video-language contrastive pre-training (LaViLa) on Ego4D.

    a. Pretraining cost and performance.

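    | Method | Backbone | batch-size<br>per GPU | GPU memory | Hardware | GPU×hour^ | EK100 MIR<br>0-shot Avg. mAP |
    | --- | --- | --- | --- | --- | --- | --- |
    | LaViLa | TSF-B | 32 | 25 | 32× V100 | 1824 | 30.9 |
    | Ours | ViT-B | 256 | 19 | 8× A5000 | 260 | 33.2 |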

    ^The reported GPU×hour is not normalized for GPU generations.

    b. Downstream performance.

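    | Method | Backbone | EK100 MIR<br>Avg. mAP | EK100 MIR<br>Avg. nDCG | EK100 CLS<br>Action Top-1 |
    | --- | --- | --- | --- | --- |
    | LaViLa | TSF-B | 50.5 | 65.0 | 46.9 |
    | Ours | ViT-B | 51.7 | 66.8 | 49.5 |
    | LaViLa | TSF-L | 50.9 | 66.5 | 51.0 |
    | Ours | ViT-L | 54.5 | 69.0 | 54.5 |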

    :trophy: LaViLa+AVION helped us win the CVPR 2023 EPIC-Kitchens Challenges in both the Action Recognition and Multi-Instance Retrieval tasks by a significant margin.

  3. AVION speeds up VideoMAE pre-training.

    | Method | Backbone | Epochs | GPU×hour^^ | top-1/top-5 (w/ FT) |
    | --- | --- | --- | --- | --- |
    | VideoMAE | ViT-B | 800 | 995 | 80.0/94.4 |
    | Ours | ViT-B | 800 | 583 | 80.1/94.5 |

    ^^Both GPU×hour figures were measured in the same hardware environment (4× A5000 GPUs).
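
To make the GPU×hour figures in the tables above concrete, the sketch below converts them into approximate wall-clock hours and cost ratios. It assumes that GPU×hour is simply the number of GPUs multiplied by wall-clock hours, with every GPU busy for the whole run; the source does not state this explicitly. The helper names are hypothetical and are not part of the AVION code base.

```python
# Illustrative arithmetic only; these helpers are hypothetical, not AVION APIs.

def wall_clock_hours(gpu_hours: float, num_gpus: int) -> float:
    """Approximate wall-clock time, assuming all GPUs run for the whole job."""
    return gpu_hours / num_gpus


def cost_ratio(baseline_gpu_hours: float, ours_gpu_hours: float) -> float:
    """How many times more GPU-hours the baseline used than our run."""
    return baseline_gpu_hours / ours_gpu_hours


# GPU-hours and GPU counts are taken from the result tables.
ego4d_hours = wall_clock_hours(130, 8)     # Ego4D pre-training on 8x A5000
lavila_hours = wall_clock_hours(260, 8)    # LaViLa pre-training on 8x A5000
videomae_ratio = cost_ratio(995, 583)      # same hardware, so directly comparable

print(f"Ego4D pre-training: {ego4d_hours} h")      # 16.25 h
print(f"LaViLa pre-training: {lavila_hours} h")    # 32.5 h
print(f"VideoMAE cost ratio: {videomae_ratio:.2f}x")  # 1.71x
```

Only the VideoMAE ratio compares runs on identical hardware. The EgoVLP and LaViLa baselines used different GPU generations, so ratios computed from those rows are not normalized and should be read loosely.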

For more details, please refer to MODEL_ZOO.

License

MIT License.

Acknowledgements

Citing AVION

@article{zhao2023training,
  title={Training a large video model on a single machine in a day},
  author={Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp},
  journal={arXiv preprint arXiv:2309.16669},
  year={2023}
}
@inproceedings{zhao2023lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  booktitle={CVPR},
  year={2023}
}