Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

This is the official PyTorch implementation of ProcedureVRL (CVPR 2023).

Paper | Slides | Poster

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations <br> Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan<sup>#</sup>, and Yin Li<sup>#</sup> <br> (<sup>#</sup>co-corresponding authors)<br> CVPR 2023<br>

<p align="center"> <img src="docs/ProcedureVRL.jpg" width=75% height=75% class="center"> </p>

Overview

We propose a video-and-language pretraining framework for learning procedure-aware video representations from a large-scale dataset of instructional videos and their narrations, without any human annotations. Our method jointly learns a video encoder that captures the concepts of action steps and a diffusion model that reasons about the temporal dependencies among steps. The resulting representations set new state-of-the-art results on step classification and step forecasting across the major benchmarks. In addition, our model supports zero-shot inference and can generate diverse step predictions.

<p align="center"> <img src="docs/ProcedureVRL_Model.jpg" width=99% height=99% class="center"> </p>

Updates

Outline

  1. Installation
  2. Datasets
  3. Model Zoo
  4. Pretraining
  5. Zero-shot Inference
  6. Finetuning
  7. Finetuning Evaluation
  8. Visualization of Zero-shot Step Forecasting and Keyframe Generation
  9. License and Contributing
  10. Citation and Acknowledgement

Installation

Check INSTALL.md for installation instructions and environment information.

Datasets

Please use the dataset preparation instructions provided in DATASET.md.

Model Zoo

All trained models can be downloaded from this Google Drive. Please put the downloaded folders in the ./exps directory. Each downloaded folder contains a trained model checkpoint saved as a .pyth file (e.g., our model pretrained on HowTo100M is saved at ./exps/procedurevrl-stg2/checkpoints/checkpoint_epoch_00025.pyth).
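If you want to quickly sanity-check a downloaded checkpoint, a minimal sketch like the one below (assuming PyTorch is installed; the exact keys depend on how the checkpoint was saved) loads it on CPU and lists its top-level entries:

```python
# Minimal sketch for inspecting a downloaded checkpoint on CPU.
# The key names are not guaranteed; print them to see what the file contains
# (typically a model state dict plus training metadata). On newer PyTorch
# versions you may need to pass weights_only=False.
import torch

ckpt_path = "exps/procedurevrl-stg2/checkpoints/checkpoint_epoch_00025.pyth"
ckpt = torch.load(ckpt_path, map_location="cpu")
print(list(ckpt.keys()))
```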

We provide our model pretrained on the HowTo100M dataset with a TimeSformer architecture as the video encoder (download link). This model achieves new state-of-the-art results in both the zero-shot and finetuning settings on the following benchmarks, where NA indicates that the model does not support that setting:

Step forecasting on COIN dataset:

| Model | Pretraining Supervision | Pretraining Dataset | Zero-shot (%) | Finetuning (%) |
| --- | --- | --- | --- | --- |
| Previous SoTA | Unsupervised: ASR + wikiHow | HowTo100M | NA | 39.4 |
| Ours | Unsupervised: ASR | HowTo100M | 11.3 | 46.8 |

Step classification on COIN dataset:

| Model | Pretraining Supervision | Pretraining Dataset | Zero-shot (%) | Finetuning (%) |
| --- | --- | --- | --- | --- |
| Previous SoTA | Unsupervised: ASR + wikiHow | HowTo100M | NA | 54.1 |
| Ours | Unsupervised: ASR | HowTo100M | 16.6 | 56.9 |

Procedural activity classification on COIN dataset:

| Model | Pretraining Supervision | Pretraining Dataset | Finetuning (%) |
| --- | --- | --- | --- |
| Previous SoTA | Unsupervised: ASR + wikiHow | HowTo100M | 88.9 |
| Ours | Unsupervised: ASR | HowTo100M | 90.8 |

Egocentric action recognition on EPIC-Kitchens-100 dataset:

| Model | Pretraining Supervision | Pretraining Dataset | Action (%) | Verb (%) | Noun (%) |
| --- | --- | --- | --- | --- | --- |
| MoViNet | -- | -- | 47.7 | 72.2 | 57.3 |
| DistantSup | Unsupervised: ASR + wikiHow | HowTo100M | 44.4 | 67.1 | 58.1 |
| Ours | Unsupervised: ASR | HowTo100M | 47.7 | 69.5 | 60.3 |

Note that both DistantSup and Ours adopt TimeSformer as the encoder backbone, whereas MoViNet is itself a heavily optimized video backbone.

For future use, we also provide our model pretrained on the HowTo100M dataset with an MViT-v2 architecture as the video encoder (download link). We test both pretrained models on the COIN dataset in the zero-shot setting:

| Model | Pretraining Supervision | Pretraining Dataset | Classification (%) | Forecasting (%) |
| --- | --- | --- | --- | --- |
| Ours (TimeSformer) | Unsupervised: ASR | HowTo100M | 16.6 | 11.3 |
| Ours (MViT-v2-S) | Unsupervised: ASR | HowTo100M | 12.5 | 9.0 |

We conjecture that TimeSformer, which samples 8 frames from a window of 256 consecutive frames, is better suited for recognizing actions with long durations, such as COIN steps. Conversely, MViT-v2-S, which samples 16 frames from a window of 96 consecutive frames, may perform better for actions with short durations and fast motion.
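To make the difference concrete, the following standalone sketch (not the repo's dataloader) shows how uniformly sampling N frame indices from a clip window spreads coverage in the two settings:

```python
# Standalone illustration of uniform temporal sampling; not the repo's dataloader.
import numpy as np

def sample_frame_indices(window_len: int, num_frames: int) -> np.ndarray:
    """Return num_frames evenly spaced indices in [0, window_len)."""
    return np.linspace(0, window_len - 1, num_frames).round().astype(int)

print(sample_frame_indices(256, 8))   # TimeSformer-style: sparse coverage of a long span
print(sample_frame_indices(96, 16))   # MViT-v2-S-style: denser coverage of a short span
```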

Pretraining

We pretrain our model on the unannotated videos from the HowTo100M dataset. Please refer to DATASET.md to ensure that you have prepared the data and filled in the fields in the config yaml files. Our pretraining consists of two stages: stage one uses SGD for 5 epochs, and stage two uses AdamW for 25 epochs.

We provide pretraining scripts for single-node training and multi-node distributed training (recommended).

<details> <summary> Single-node training: If you have only one compute node (e.g., a local computer), then you can run the following command. </summary>
# pretraining: stage 1
python tools/run_net.py \
--cfg configs/HowTo100M/procedurevrl_sgd.yaml \
OUTPUT_DIR exps/procedurevrl-stg1/ \
NUM_SHARDS 1 \
NUM_GPUS 8 

# pretraining: stage 2
python tools/run_net.py \
--cfg configs/HowTo100M/procedurevrl_adamw.yaml \
OUTPUT_DIR exps/procedurevrl-stg2/ \
NUM_SHARDS 1 \
NUM_GPUS 8 \
TIMESFORMER.PRETRAINED_MODEL exps/procedurevrl-stg1/checkpoints/checkpoint_epoch_00005.pyth
</details> <details> <summary> Multi-node distributed training: If you have multiple compute nodes (e.g., a cluster with Slurm), then you can run the following command using submitit. </summary>
# pretraining: stage 1
JOB_NAME=procedurevrl-stg1
python tools/submit.py \
--cfg configs/HowTo100M/procedurevrl_sgd.yaml \
--name ${JOB_NAME} \
--job_dir ./exps/${JOB_NAME}/  \
--partition YOUR_PARTITION_NAME \
--num_shards 8 \
--num_gpus 8 \

# pretraining: stage 2
JOB_NAME=procedurevrl-stg2
python tools/submit.py \
--cfg configs/HowTo100M/procedurevrl_adamw.yaml \
--name ${JOB_NAME} \
--job_dir ./exps/${JOB_NAME}/  \
--partition YOUR_PARTITION_NAME \
--num_shards 8 \
--num_gpus 8 \
TIMESFORMER.PRETRAINED_MODEL exps/procedurevrl-stg1/checkpoints/checkpoint_epoch_00005.pyth \

--partition is the partition name in your cluster. --num_shards specifies how many nodes to run on. --num_gpus represents the number of GPUs on each node.

</details>

If you would like to run pretraining on the HowTo100M subset, set DATA.PATH_TO_DATA_DIR to ./data_csv/howto100m_subset/.

If you would like to use the MViT-v2 architecture for the video encoder (the default is TimeSformer), set --cfg to the corresponding MViT-v2 config files (procedurevrl_mvitv2_sgd.yaml, procedurevrl_mvitv2_adamw.yaml), as in the example below.
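For example, a stage-1 single-node pretraining run on the HowTo100M subset with the MViT-v2 encoder simply combines the two overrides above with the earlier single-node command (the output directory name here is only an example):

```bash
# Example: stage-1 single-node pretraining on the HowTo100M subset with MViT-v2.
# OUTPUT_DIR is an example name; adjust NUM_GPUS to your machine.
python tools/run_net.py \
  --cfg configs/HowTo100M/procedurevrl_mvitv2_sgd.yaml \
  OUTPUT_DIR exps/procedurevrl-mvitv2-stg1/ \
  NUM_SHARDS 1 \
  NUM_GPUS 8 \
  DATA.PATH_TO_DATA_DIR ./data_csv/howto100m_subset/
```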

Zero-shot Inference

After pretraining, our model is capable of zero-shot inference for both step classification and step forecasting. This is done by matching a video representation against the language embeddings of candidate steps: the features extracted by the video encoder are used for step classification, while the features reconstructed by the diffusion model are used for step forecasting.
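Conceptually, the matching is a nearest-neighbor lookup in a shared embedding space. The sketch below is illustrative only; the variable names and dimensions are assumptions, not the repo's API:

```python
# Conceptual sketch of zero-shot matching; names and sizes are illustrative only.
import torch
import torch.nn.functional as F

num_candidates, dim = 100, 512                    # one language embedding per candidate step
video_feat = torch.randn(1, dim)                  # from the video encoder (classification) or
                                                  # the diffusion model's reconstruction (forecasting)
step_text_embs = torch.randn(num_candidates, dim)

scores = F.cosine_similarity(video_feat, step_text_embs, dim=-1)
predicted_step = scores.argmax().item()           # index of the best-matching step description
```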

We evaluate our pretrained model on the COIN dataset. Please refer to DATASET.md to ensure that you have prepared the data and filled in the fields in the config yaml files.

<details> <summary> Zero-shot step classification </summary>
python tools/run_net.py \
--cfg configs/COIN/step_classification.yaml \
OUTPUT_DIR exps/zs_coin_cls \
NUM_SHARDS 1 \
NUM_GPUS 8 \
TRAIN.ENABLE False \
DEV.MATCH_LANG_EMB True \
TIMESFORMER.PRETRAINED_MODEL exps/procedurevrl-stg2/checkpoints/checkpoint_epoch_00025.pyth \
</details> <details> <summary> Zero-shot step forecasting </summary>
python tools/run_net.py \
--cfg configs/COIN/step_forecasting.yaml \
OUTPUT_DIR exps/zs_coin_fcst \
NUM_SHARDS 1 \
NUM_GPUS 8 \
TRAIN.ENABLE False \
DEV.MATCH_LANG_EMB True \
TEST.NUM_ENSEMBLE_VIEWS 1 \
TIMESFORMER.PRETRAINED_MODEL exps/procedurevrl-stg2/checkpoints/checkpoint_epoch_00025.pyth \
</details>

Note that TIMESFORMER.PRETRAINED_MODEL should point to the location of your pretrained model checkpoint.

Finetuning

After pretraining, our model can be further finetuned with additional human annotations. For more details on the finetuning settings, please refer to our paper.

We provide finetuning scripts for the COIN and EPIC-Kitchens-100 datasets. Please refer to DATASET.md to ensure that you have prepared the data and filled in the fields in the config yaml files.

<details> <summary> Step classification on COIN </summary>
JOB_NAME=procedurevrl-coin-cls
python tools/submit.py \
--cfg configs/COIN/step_classification.yaml \
--name ${JOB_NAME} \
--job_dir ./exps/${JOB_NAME}/  \
--partition YOUR_PARTITION_NAME \
--num_shards 4 \
--num_gpus 8 \
TIMESFORMER.PRETRAINED_MODEL exps/procedurevrl-stg2/checkpoints/checkpoint_epoch_00025.pyth \
</details> <details> <summary> Step forecasting on COIN </summary>
JOB_NAME=procedurevrl-coin-fcst
python tools/submit.py \
--cfg configs/COIN/step_forecasting.yaml \
--name ${JOB_NAME} \
--job_dir ./exps/${JOB_NAME}/  \
--partition YOUR_PARTITION_NAME \
--num_shards 4 \
--num_gpus 8 \
TIMESFORMER.PRETRAINED_MODEL exps/procedurevrl-stg2/checkpoints/checkpoint_epoch_00025.pyth \
</details> <details> <summary> Procedural activity classification on COIN </summary>
JOB_NAME=procedurevrl-coin-taskcls
python tools/submit.py \
--cfg configs/COIN/task_classification.yaml \
--name ${JOB_NAME} \
--job_dir ./exps/${JOB_NAME}/  \
--partition YOUR_PARTITION_NAME \
--num_shards 4 \
--num_gpus 8 \
TIMESFORMER.PRETRAINED_MODEL exps/procedurevrl-stg2/checkpoints/checkpoint_epoch_00025.pyth \
</details> <details> <summary> Action recognition on EPIC-Kitchens-100 </summary>
JOB_NAME=procedurevrl-ek-cls
python tools/submit.py \
--cfg configs/EK/egocentric_action_classification.yaml \
--name ${JOB_NAME} \
--job_dir ./exps/${JOB_NAME}/  \
--partition YOUR_PARTITION_NAME \
--num_shards 4 \
--num_gpus 8 \
TIMESFORMER.PRETRAINED_MODEL exps/procedurevrl-stg2/checkpoints/checkpoint_epoch_00025.pyth \
</details>

Note that TIMESFORMER.PRETRAINED_MODEL should point to the location of your pretrained model checkpoint.

Finetuning Evaluation

To evaluate the finetuned models, you can either run on multiple nodes (re-use the finetuning scripts and additionally set TRAIN.ENABLE to False) or run on a single node (re-use the zero-shot inference scripts and additionally set DEV.MATCH_LANG_EMB to False).
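For example, a single-node evaluation of a finetuned COIN step-classification model could look like the following; the checkpoint path is a placeholder for your own finetuned model, and the output directory is an example name:

```bash
# Example: single-node evaluation of a finetuned COIN step-classification model.
# Replace the checkpoint path with your own finetuned model; OUTPUT_DIR is an example.
python tools/run_net.py \
  --cfg configs/COIN/step_classification.yaml \
  OUTPUT_DIR exps/eval_coin_cls \
  NUM_SHARDS 1 \
  NUM_GPUS 8 \
  TRAIN.ENABLE False \
  DEV.MATCH_LANG_EMB False \
  TIMESFORMER.PRETRAINED_MODEL PATH_TO_YOUR_FINETUNED_CHECKPOINT.pyth
```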

Note that TIMESFORMER.PRETRAINED_MODEL should point to the location of your finetuned model checkpoint.

Visualization of Zero-shot Step Forecasting and Keyframe Generation

After pretraining, our model supports zero-shot step forecasting. As shown below, given a video recording of previous steps (left), our model can forecast multiple meaningful predictions by sampling from our diffusion model (right: step descriptions). Going further, we leverage a trained Stable Diffusion model to envision what the future looks like (right: keyframes). Specifically, to keep the generated images visually consistent with the input video, we let the Stable Diffusion model take one input video frame and the description of a predicted step as input and generate an image.

<p align="center"> <img src="docs/visualization.png" width=99% height=99% class="center"> </p>
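The keyframe generation step is independent of this codebase. A rough sketch with an off-the-shelf image-to-image Stable Diffusion pipeline is shown below; the model ID, strength, and prompt wording are assumptions for illustration, not the exact setup used to produce the figure above:

```python
# Rough sketch of keyframe generation with an off-the-shelf img2img pipeline.
# The model ID, strength, and prompt format are assumptions, not the exact
# configuration used for the figure above.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("input_video_frame.jpg").convert("RGB")   # one frame of the input video
predicted_step = "pour the batter into the pan"              # a sampled step description

result = pipe(prompt=predicted_step, image=frame, strength=0.6, guidance_scale=7.5)
result.images[0].save("predicted_keyframe.jpg")
```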

License and Contributing

The majority of this work is licensed under CC-BY-NC-SA 3.0. However, portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license.

We actively welcome your pull requests. Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

Citation and Acknowledgement

This repo is built on top of a series of codebases, including Distant Supervision, TimeSformer, PySlowFast, and pytorch-image-models. Our EPIC-Kitchens dataloader is partially borrowed from Motionformer. We parse verb phrases from ASR sentences using the scripts provided by VisualNarrationProceL. Our transformer architecture partially follows CLIP. We thank all authors for releasing their code.

If you find this repo useful, please consider citing our paper:

@inproceedings{zhong2023learning,
  title={Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations},
  author={Zhong, Yiwu and Yu, Licheng and Bai, Yang and Li, Shangwen and Yan, Xueting and Li, Yin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14825--14835},
  year={2023}
}