VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners (NeurIPS 2022)

<img src="vidIL.gif" width="1200">

Download Datasets & Checkpoints

Set Up Environment


Generate Video Representation & GPT-3 Prompt

The following scripts run the entire pipeline, which (1) generates frame captions; (2) generates visual tokens; and (3) generates few-shot prompts ready for GPT-3. The output folder has the following structure:

    {dataset_split}
    ├── frame_caption
    │   ├── config.yaml  # config for frame captioning
    │   ├── video_text_Cap.json  # frame captions w/o filtering
    │   └── video_text_CapFilt.json  # frame captions w/ filtering
    ├── input_prompts
    │   ├── {output_name}.jsonl  # few-shot input prompts for GPT-3
    │   ├── {output_name}__idx_2_videoid.json  # line idx to video id
    │   ├── {output_name}__chosen_samples.json  # chosen examples in the support set
    │   ...
    └── visual_tokenization_{encoder_name}
        ├── config.yaml  # config for visual tokenization
        └── visual_tokens.json  # raw visual tokens of each frame
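
After a run finishes, you can sanity-check the generated artifacts directly from the shell. The snippet below is a minimal sketch using the shipped example output under output_example/msrvtt and assuming clip as the encoder name:

    # Quick look at a finished run, using the shipped example output for MSR-VTT
    # ("clip" is assumed as the visual tokenizer's encoder name).
    OUT=output_example/msrvtt
    ls "$OUT/frame_caption" "$OUT/visual_tokenization_clip"

    # Peek at the filtered frame captions and the raw per-frame visual tokens.
    head -c 300 "$OUT/frame_caption/video_text_CapFilt.json"; echo
    head -c 300 "$OUT/visual_tokenization_clip/visual_tokens.json"; echo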

All scripts should be run from the /src directory, i.e., the root directory after launching the Docker image. The following are examples of running the pipeline with in-context example selection for several datasets. For additional notes on running the pipeline scripts, please refer to the Pipeline Instruction.

Standalone Pipeline for Frame Captioning and Visual Tokenization

Since we need to sample the few-shot support set from the training set, the first time we run the pipeline on each dataset we need to perform frame captioning and visual tokenization on its training set.

For <dataset> in ["msrvtt","youcook2","vatex","msvd","vlep"]:

bash pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh <dataset> train <output_root>
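
To bootstrap all of the listed datasets in one go, the same script can be wrapped in a loop. A minimal sketch, where OUTPUT_ROOT stands in for your chosen <output_root>:

    # Run frame captioning + visual tokenization on the training split of every
    # supported dataset. OUTPUT_ROOT is a placeholder for your own <output_root>.
    OUTPUT_ROOT=/path/to/output_root
    for dataset in msrvtt youcook2 vatex msvd vlep; do
        bash pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh "$dataset" train "$OUTPUT_ROOT"
    done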

Example frame caption and visual tokenization directories can be found at output_example/msrvtt/frame_caption and output_example/msrvtt/visual_tokenization_clip.

Video Captioning

For <dataset> in ["msrvtt","youcook2","vatex"]:

Video Question Answering

For <dataset> in ["msrvtt","msvd"]:

Video-Language Event Prediction (VLEP)

Semi-Supervised Text-Video Retrieval

In the semi-supervised setting, we first generate pseudo labels on the training set, and then train BLIP on the pseudo-labeled dataset for retrieval.
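
As a rough outline of these two stages (the script names below are hypothetical placeholders rather than the repository's actual entry points; check the shipped scripts for the real commands):

    # Outline only: both script names are hypothetical placeholders,
    # not the repository's actual entry points.
    OUTPUT_ROOT=/path/to/output_root

    # Stage 1 (placeholder): caption the unlabeled training videos with GPT-3
    # to obtain pseudo labels.
    bash generate_pseudo_labels.sh msrvtt train "$OUTPUT_ROOT"

    # Stage 2 (placeholder): finetune BLIP for text-video retrieval on the
    # pseudo-labeled training set.
    bash train_blip_retrieval.sh "$OUTPUT_ROOT"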

Evaluation

Scripts for evaluating generation results from GPT-3:

Citation

@article{wang2022language,
  title={Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners},
  author={Wang, Zhenhailong and Li, Manling and Xu, Ruochen and Zhou, Luowei and Lei, Jie and Lin, Xudong and Wang, Shuohang and Yang, Ziyi and Zhu, Chenguang and Hoiem, Derek and others},
  journal={arXiv preprint arXiv:2205.10747},
  year={2022}
}

Acknowledgement

The implementation of VidIL relies on resources from BLIP, ALPRO, and transformers. We thank the original authors for open-sourcing their code and encourage users to cite their work when applicable.