Learning Video Representations from Large Language Models
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
CVPR 2023 (Highlight, acceptance rate ≈ 2.5%)
arxiv | bibtex | colab | 🤗 demo | website

LaViLa (Language augmented Video Language Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs to be visually conditioned "Narrators", and use them to automatically generate video-language paired data. We then use this data to learn a video-language representation, outperforming prior work by large margins.

Sample Generations:

| Video | Generation 1 | Generation 2 |
| --- | --- | --- |
| <img src="assets/mixkit-pastry-chef-cutting-a-loaf-into-slices-43015-medium.gif" height=128> | so now we're going to slice the bread | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate |

Try out our Narrator to generate text descriptions for your own videos! You can also try the web demo on Hugging Face Spaces.

The resulting video-language model sets a new state-of-the-art on a number of popular video tasks! <img width="400" alt="image" src="https://user-images.githubusercontent.com/1893429/205997492-a6cbc7c1-1f8e-4fad-9d94-f9e22920272d.png">

Introduction and installation

<span style="font-variant:small-caps;">LaViLa</span> leverages Large Language Models (LLMs) as NARRATORs (and REPHRASERs) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.

<img src="assets/lavila_ego4d.gif" height=384>

See INSTALL.md to install this code.

NARRATOR

NARRATOR is a visually conditioned LLM that takes video frames as input and pseudo-labels the clip with narrations.

<img src="assets/narrator.gif" height=384>

NARRATOR Demo

We provide some generated samples by our NARRATOR:

<img src="assets/06919917-76bc-4adc-b944-2a722f165513.gif" height=128><img src="assets/cf7c12db-1a9e-46d3-96d6-38174bbe373c.gif" height=128><img src="assets/ab865129-78fa-47d4-8a50-ff8c5533246f.gif" height=128>
Human<br>narrationC separates the yarn.C lifts container.C opterates the camera.
NARRATOR generation (a)C stetches the thread with both hands.C wipes the countertop with a sponge.C takes a photo shot.
NARRATOR generation (b)C pulls out the yarn with her right hand.C moves the container.A man X looks at the camera.

Run the narrator demo using Colab (no GPU needed): Open In Colab
or on the web using 🤗 Spaces: Hugging Face Spaces (thanks to @nateraw!)

Since the free Colab tier offers very limited RAM, please run ./demo_narrator.py locally if you'd like to try the demo with a larger model. For more technical details, please refer to Sec. 4.1 of our paper.

```bash
# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]

# GPU mode
python demo_narrator.py --cuda
```

Our narrator also works on third-person videos! Below are several examples generated by our NARRATOR pre-trained on HowTo100M Auto-Aligned (HTM-AA) and applied to some stock footage video clips. Note that since the text corpus in HowTo100M consists of ASR transcriptions, the style of the generated narrations differs slightly from that of the ground-truth captions. However, the generated results are generally reasonable.

<img src="assets/mixkit-pastry-chef-cutting-a-loaf-into-slices-43015-medium.gif" height=128><img src="assets/mixkit-hands-of-a-baker-kneading-a-dough-42467-medium.gif" height=128><img src="assets/mixkit-chef-preparing-a-sauce-in-a-blender-43034-medium.gif" height=128>
GT captionPastry chef cutting bread into<br>slices during the preparation<br>of a dessert, inside a kitchen.Close-up shot of the hands<br>of an experienced baker<br>skillfully kneading bread dough.Chef preparing a sauce in<br>a blender, adding different<br>ingredients while blending.
NARRATOR (a)so now we're going to slice the breadi'm gonna make a little hole<br>in the middle of the dough hereall right let's blend this up
NARRATOR (b)now i'm going to do is just slice<br>this up into a nice chunk and<br>then we're going to place it<br>on the plateyou just keep kneading itthe last step to making this<br>is to blend the ingredients<br>in the food processor

Below is a demo for third-person videos.

```bash
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
```

Dual-Encoder

The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a contrastive loss, as in CLIP.
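For reference, here is a minimal PyTorch sketch of a symmetric CLIP-style contrastive (InfoNCE) objective over a batch of paired video/text embeddings. The function name, temperature, and random smoke test are illustrative; LaViLa's exact loss and hyperparameters are specified in the paper and training code.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over paired (video, text) embeddings (illustrative)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                    # (B, B) cosine similarities
    targets = torch.arange(len(v), device=v.device)   # matched pairs on the diagonal
    # Average the video-to-text and text-to-video cross-entropies.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Smoke test with random embeddings: batch of 8, dim 256.
print(clip_style_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```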

License

The majority of LaViLa is licensed under an MIT License; however, portions of the project are available under separate license terms.

Citing LaViLa

```bibtex
@inproceedings{zhao2023lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  booktitle={CVPR},
  year={2023}
}
```