Paprika: Procedure-Aware Pretraining for Instructional Video Understanding

This is an official implementation of Procedure-Aware Pretraining for Instructional Video Understanding. In this repository, we provide PyTorch code for pre-processing, training and evaluation as described in the paper.

The name of the proposed model, Paprika, comes from Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask for downstream procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art, with accuracy gains of up to 11.23% across 12 evaluation settings.

If you find our repo useful in your research, please use the following BibTeX entry for citation.

@inproceedings{zhou2023paprika,
  title={Procedure-Aware Pretraining for Instructional Video Understanding},
  author={Zhou, Honglu and Martin-Martin, Roberto and Kapadia, Mubbasir and Savarese, Silvio and Niebles, Juan Carlos},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
<p align="left"> <img src="figures/teaser.png"/> </p>

Installation

Please run the following commands in your shell:

conda create -n paprika python=3.8
conda activate paprika
pip install -r requirements.txt
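
To quickly sanity check the environment (assuming PyTorch is installed via requirements.txt, since this is a PyTorch codebase), you can run:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"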

Dataset Preparation

wikiHow

wikiHow is a text-based procedural knowledge database. The version of the wikiHow dataset that we used for both Procedural Knowledge Graph construction and model pre-training has a total of 10,588 step headlines from 1,053 task articles. These wikiHow tasks have at least 100 video samples in the HowTo100M dataset. Please follow the instructions provided here to obtain the data (i.e., step_label_text.json).

The wikiHow dataset stored in step_label_text.json is a list with a length of 1053, i.e., 1053 sublists representing 1053 task articles. The length of each sublist corresponds to the number of steps of that task article; each element in the sublist is a dictionary with detailed step information. Note that we only used the step 'headline'.
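
For example, a minimal Python sketch of loading the file and collecting only the step headlines:

import json

# step_label_text.json: a list of 1053 task articles; each article is a list of step dictionaries.
with open('step_label_text.json', 'r') as f:
    wikihow_tasks = json.load(f)

# Keep only the 'headline' field of every step, which is the only field used by Paprika.
step_headlines = [step['headline'] for task in wikihow_tasks for step in task]
print(len(wikihow_tasks), len(step_headlines))  # expected: 1053 and 10588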

HowTo100M

HowTo100M is a large-scale unlabeled instructional video corpus. Please follow the instructions from the dataset provider to download the videos, metadata, and subtitles as needed.

In particular, for pre-training on the 85K-video subset of HowTo100M mentioned in our paper, we used the train.csv file provided by the authors of TimeSformer to obtain the list of video IDs.
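
As a rough sketch, the video ID list can be read as follows (the column name video_id is an assumption; please check the header of the actual train.csv provided by the TimeSformer authors):

import pandas as pd

# Read the TimeSformer train split and collect the unique HowTo100M video IDs.
df = pd.read_csv('train.csv')
video_ids = sorted(df['video_id'].unique())  # adjust the column name to match the file
print(len(video_ids))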

COIN

The COIN dataset is one of the datasets that we used for downstream tasks. COIN contains 11K instructional videos covering 778 individual steps from 180 tasks in various domains. The average number of steps per video is 3.9.

Please follow the instructions provided by the authors of the COIN dataset to download the raw videos from YouTube, and their annotations in JSON format.

After downloading the dataset, please populate the coin_video_dir and coin_annoataion_json fields accordingly in the config files of the downstream evaluation settings before starting a downstream evaluation run (described below).

CrossTask

The CrossTask dataset is the other dataset that we used for downstream tasks. CrossTask has 4.7K instructional videos, each annotated with a task name, spanning 83 tasks with 105 unique steps. 2.7K of these videos additionally have step class and temporal boundary annotations; these videos are used for the Step Recognition and Step Forecasting tasks. On average, there are 8 steps per video.

Please follow the instructions provided by the authors of the CrossTask dataset to download the raw videos from YouTube, and their annotations.

After downloading the dataset, please populate the cross_task_video_dir and cross_task_annoataion_dir fields accordingly in the config files of the downstream evaluation settings before starting a downstream evaluation run (described below).
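
As a quick sanity check before launching a downstream run, you can load a downstream config and verify that the dataset paths you filled in exist. The sketch below assumes the fields sit at the top level of the YAML file; adjust the keys if they are nested differently in the actual configs:

import os
import yaml

with open('configs/downstream/mlp_coin_sf.yml', 'r') as f:
    cfg = yaml.safe_load(f)

# Field names follow this README; use the cross_task_* fields for CrossTask configs.
for key in ('coin_video_dir', 'coin_annoataion_json'):
    assert os.path.exists(cfg[key]), f'{key} does not exist: {cfg[key]}'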

Feature Extraction

We used a practical framework that trains a lightweight procedure-aware model (i.e., Paprika) to refine the video segment features extracted by a frozen, general-purpose video foundation model. The frozen video foundation model we used is the S3D Text-Video model trained from scratch on HowTo100M using MIL-NCE.

To obtain the frozen S3D model: https://github.com/antoine77340/S3D_HowTo100M#getting-the-data
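
As a rough illustration, the frozen S3D model can be loaded and queried as follows, based on the usage example in the S3D_HowTo100M repository (the file names s3d_dict.npy and s3d_howto100m.pth and the output dictionary keys come from that repository's README; please verify against it):

import torch
from s3dg import S3D  # s3dg.py from the S3D_HowTo100M repository

# Instantiate the model with the 512-d joint video-text embedding space and load the MIL-NCE weights.
net = S3D('s3d_dict.npy', 512)
net.load_state_dict(torch.load('s3d_howto100m.pth'))
net = net.eval()

# Video branch: input is batch x 3 x T x H x W with frame values in [0, 1].
video = torch.rand(1, 3, 32, 224, 224)
with torch.no_grad():
    video_output = net(video)                                      # dict; 'video_embedding' is the segment feature
    text_output = net.text_module(['cut the onion into slices'])   # dict; 'text_embedding' is the step feature
print(video_output['video_embedding'].shape, text_output['text_embedding'].shape)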

We used this frozen S3D model to (1) extract visual features of video segments in HowTo100M, COIN and CrossTask using the video branch of S3D; and (2) extract text features of step headlines in wikiHow using the text branch of S3D. To perform the feature extraction using the frozen S3D model, please refer to the following repositories:

MPNet was also used to extract text features of the step headlines in wikiHow. Please refer to https://huggingface.co/sentence-transformers/all-mpnet-base-v2#usage-sentence-transformers.
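
A minimal sketch of the MPNet text feature extraction with the sentence-transformers usage pattern linked above:

from sentence_transformers import SentenceTransformer

# Encode wikiHow step headlines into 768-d MPNet sentence embeddings.
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
step_headlines = ['Cut the onion into thin slices.', 'Heat the oil in a pan.']  # example inputs
embeddings = model.encode(step_headlines)
print(embeddings.shape)  # (2, 768)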

Video segments are set to be 9.6 seconds long. There is no temporal overlap or gap between consecutive segments of a video.
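
For example, a minimal sketch of computing the non-overlapping 9.6-second segment boundaries for a video of a given duration (how the final partial segment is handled is an assumption; adapt as needed):

SEGMENT_LENGTH = 9.6  # seconds

def get_segment_boundaries(video_duration):
    """Return (start, end) times of consecutive, non-overlapping 9.6 s segments."""
    boundaries = []
    start = 0.0
    while start < video_duration:
        boundaries.append((start, min(start + SEGMENT_LENGTH, video_duration)))
        start += SEGMENT_LENGTH
    return boundaries

print(get_segment_boundaries(30.0))  # 4 segments; the last one is truncated at 30.0 s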

For flexible usage and ease of reproducibility, we provide the extracted text features of the step headlines in wikiHow, as well as the extracted visual features of video segments for a sample HowTo100M dataset containing 6 videos. Please use this link to download the extracted features in order to run the following pre-processing code.

Reproducibility

In the following, we describe how to pre-train and evaluate Paprika.

The entire pipeline contains 3 stages: (1) pre-processing, (2) pre-training and (3) downstream evaluation.

<p align="left"> <img src="figures/overview.png"/> </p>

Pre-processing

The pre-processing stage must be completed before running the model pre-training. The pre-processing stage mainly extracts the pseudo labels for each video segment in the pre-training video corpus beforehand. The pre-processing stage does not require GPUs.

The pseudo labels that Paprika requires are generated from an automatically constructed Procedural Knowledge Graph (PKG). The PKG is automatically built by combining information from wikiHow and HowTo100M.
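
As a rough, simplified illustration of the idea (the actual construction in engines/main_PKG_pseudo_labels.py is more involved), video segments can be matched to wikiHow step nodes by cosine similarity in the shared S3D embedding space; the top_k value below is illustrative:

import numpy as np

def match_segments_to_steps(segment_feats, step_feats, top_k=3):
    """For each video segment, return the indices of the top_k most similar wikiHow steps.

    segment_feats: (num_segments, d) S3D video embeddings.
    step_feats:    (num_steps, d)    S3D text embeddings of the step headlines.
    """
    seg = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    stp = step_feats / np.linalg.norm(step_feats, axis=1, keepdims=True)
    sim = seg @ stp.T                           # cosine similarity matrix
    return np.argsort(-sim, axis=1)[:, :top_k]  # candidate step nodes per segment

# Example with random features of the S3D joint embedding dimension (512).
print(match_segments_to_steps(np.random.rand(10, 512), np.random.rand(10588, 512)).shape)  # (10, 3)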

If you would like to try the pre-processing code on our provided HowTo100M sample dataset containing only 6 videos, please first verify and modify the paths in configs/PKG_sampledata_pseudo_labels.yml, and then run the following command in your shell:

python engines/main_PKG_pseudo_labels.py --cfg configs/PKG_sampledata_pseudo_labels.yml

To perform the pre-processing for the HowTo100M subset with 85K videos that we used in our paper for pre-training Paprika, please change the config file in the above command to configs/PKG_pseudo_labels.yml. For the full HowTo100M dataset, modify the config accordingly.

After the pre-processing code has completed, the generated pseudo labels are saved to the output paths specified in the config file.

NOTE!!! To prevent files from being overwritten (especially when dealing with large datasets such as the HowTo100M subset), please pay attention to the paths in the config file.

Regarding the pre-processing config file:

PKG_sampledata_pseudo_labels.yml and PKG_pseudo_labels.yml are the pre-processing config files for the HowTo100M sample dataset and the HowTo100M subset respectively.

The current configurations in the files are the ones we used in the paper. Please verify, and modify if needed, the following configurations in the config file:

Pretraining

We used 8 NVIDIA A100 GPUs for Paprika pre-training on the HowTo100M subset. The following shell command was the one we used to start the Paprika pre-training:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nnodes=1 --nproc_per_node=8 engines/main_Paprika_pretrain.py --cfg configs/Paprika_pretrain.yml --use_ddp 1 --use_wandb 0

The following is the command to use if you would like to use only 1 GPU:

CUDA_VISIBLE_DEVICES=0 python engines/main_Paprika_pretrain.py --cfg configs/Paprika_pretrain.yml --use_wandb 0

Regarding the pre-training config file (i.e., configs/Paprika_pretrain.yml):

The current configurations in the configs/Paprika_pretrain.yml file are the ones we used in the paper that obtained the best overall result. This means the Paprika model will be pre-trained using the Video-Node Matching (VNM), Video-Task Matching (VTM; wikiHow + HowTo100M), Task Context Learning (TCL; wikiHow), and Node Relation Learning (NRL; 2 hops) objectives.
Please verify, and modify if needed, the following configurations in the config file:

Downstream Evaluation

Our downstream evaluation involves 2 downstream datasets (COIN and CrossTask), 3 downstream tasks (SF: Step Forecasting, SR: Step Recognition, and TR: Task Recognition), and 2 types of downstream models (MLP or Transformer). In total, there are 2x3x2 = 12 evaluation settings. The config file for each evaluation setting is available in the subfolder configs/downstream/.

The shell commands across the 12 evaluation settings only differ in the config file name. For example, the following shell command is the one to use to start the downstream model training and testing in the evaluation setting of MLP x COIN x SF (i.e., use an MLP downstream task model; the downstream dataset is COIN; and the downstream task is Step Forecasting):

CUDA_VISIBLE_DEVICES=0 python engines/main_train_taskhead.py --cfg configs/downstream/mlp_coin_sf.yml --checkpoint /export/share/hongluzhou/data/checkpoints/Paprika_howto100m_fullset.pth

Please replace the checkpoint path in the above command with the correct path to your pre-trained Paprika model.
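
If you want to run all 12 evaluation settings, a simple loop over the downstream config files works, since the commands only differ in the config file name (the checkpoint path below is a placeholder):

for cfg in configs/downstream/*.yml; do
    CUDA_VISIBLE_DEVICES=0 python engines/main_train_taskhead.py --cfg $cfg --checkpoint /path/to/Paprika_checkpoint.pth
done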

Regarding the downstream evaluation config file (e.g., configs/downstream/mlp_coin_sf.yml):

Please verify, and modify if needed, the following configurations in the config file:

Acknowledgements

We thank the authors of the following repositories/sources for releasing their data or implementations: