Introduction
This repository contains the Adverbs in Recipes (AIR) dataset and the code for the CVPR 23 paper:
Learning Action Changes by Measuring Verb-Adverb Textual Relationships
Cite
If you find our work useful, please cite our CVPR 23 paper:
@inproceedings{moltisanti23learning,
  author = {Moltisanti, Davide and Keller, Frank and Bilen, Hakan and Sevilla-Lara, Laura},
  title = {{Learning Action Changes by Measuring Verb-Adverb Textual Relationships}},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2023}
}
You can read our paper on arXiv.
Authors
All authors are at the University of Edinburgh.
The AIR (Adverbs in Recipes) dataset
Adverbs in Recipes (AIR) is our new dataset specifically collected for adverb recognition.
AIR is a subset of HowTo100M where recipe videos show actions performed in ways that change according to an adverb (e.g. chop thinly/coarsely).
AIR was carefully reviewed to ensure reliable annotations. Below are some properties of AIR:
| Property | Value |
|---|---|
| Videos | 7003 |
| Verbs | 48 |
| Adverbs | 10 |
| Pairs | 186 |
| Avg. video duration | 8.4s |
AIR is organised in three CSV files, which you will find under the ./air folder of this repository:

- antonyms.csv: lists the 10 adverbs and their respective antonyms
- test_df.csv: test split of the dataset
- train_df.csv: train split of the dataset
The train and test CSV files contain the following columns:
| Column name | Explanation |
|---|---|
| seg_id | unique ID of the trimmed video segment |
| youtube_id | YouTube ID of the untrimmed video |
| sentence | parsed caption sentence aligned with the segment |
| verb_pre_mapping | the parsed verb before clustering |
| adverb_pre_mapping | the parsed adverb before clustering |
| clustered_verb | the parsed verb after clustering |
| clustered_adverb | the parsed adverb after clustering |
| start_time | start time in seconds, relative to the untrimmed video |
| end_time | end time in seconds, relative to the untrimmed video |
| duration | duration of the segment in seconds |
| rgb_frames | number of frames in the segment (extracted at the original FPS) |
| verb_label | numeric verb label |
| adverb_label | numeric adverb label |
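For example, you can inspect the splits with pandas (a minimal sketch, assuming you run it from the repository's root):

```python
import pandas as pd

# Load the AIR splits from the ./air folder of this repository
train_df = pd.read_csv("./air/train_df.csv")
test_df = pd.read_csv("./air/test_df.csv")

# Each row is a trimmed video segment annotated with a verb and an adverb
print(len(train_df), len(test_df))
print(train_df[["seg_id", "clustered_verb", "clustered_adverb", "start_time", "end_time"]].head())
```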
AIR video samples
You can watch our short video on YouTube to have a look at a few samples from AIR.
S3D features
We provide pre-extracted S3D features here.
These were obtained with stack size equal to 16 frames and stride equal to 1 second (as specified in the paper).
Features are stored as PyTorch dictionaries with the following structure:

- features:
  - ${seg_id}:
    - s3d_features: Tx1024 PyTorch tensor, where T is the number of stacks. These are the video features named mixed_5c in S3D
    - video_embedding_joint_space: Tx512 PyTorch tensor, where T is the same as above. These are the video-text joint embeddings named video_embedding in S3D
- metadata:
  - ${seg_id}:
    - start_time: start time of the segment (in seconds)
    - end_time: end time of the segment (in seconds)
    - clustered_verb: verb associated with the segment
    - clustered_adverb: adverb associated with the segment
    - frame_samples: frame indices (relative to the trimmed video) sampled to extract the features
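To give an idea of how to access these dictionaries, here is a minimal sketch (the file path below is a placeholder; use the path of the features file you downloaded):

```python
import torch

# Load the pre-extracted features (placeholder path)
features_dict = torch.load("/path/to/s3d_features.pth", map_location="cpu")

seg_id = next(iter(features_dict["features"]))  # any segment ID from the dataset
segment = features_dict["features"][seg_id]
meta = features_dict["metadata"][seg_id]

print(segment["s3d_features"].shape)                     # (T, 1024), mixed_5c features
print(segment["video_embedding_joint_space"].shape)      # (T, 512), joint video-text embeddings
print(meta["clustered_verb"], meta["clustered_adverb"])  # annotations for the segment
```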
Code
You can use the Python script run.py to try our method. Usage below:
usage: run.py [-h] [--train_batch TRAIN_BATCH] [--train_workers TRAIN_WORKERS] [--test_batch TEST_BATCH] [--test_workers TEST_WORKERS] [--lr LR] [--dropout DROPOUT] [--weight_decay WEIGHT_DECAY] [--epochs EPOCHS] [--run_tags RUN_TAGS] [--tag TAG] [--test_frequency TEST_FREQUENCY] [--hidden_units HIDDEN_UNITS]
[--s3d_init_folder S3D_INIT_FOLDER] [--no_antonyms] [--fixed_d] [--cls_variant]
train_df_path test_df_path antonyms_df features_path output_path
Code to run the model presented in the CVPR 23 paper "Learning Action Changes by Measuring Verb-Adverb Textual Relationships"
positional arguments:
train_df_path Path to the dataset's train set
test_df_path Path to the dataset's test set
antonyms_df Path to the dataset's antonyms csv
features_path Path to the pre-extracted S3D features
output_path Where you want to save logs, checkpoints and results
options:
-h, --help show this help message and exit
--train_batch TRAIN_BATCH
Training batch size
--train_workers TRAIN_WORKERS
Number of workers for the training data loader
--test_batch TEST_BATCH
Testing batch size
--test_workers TEST_WORKERS
Number of workers for the testing data loader
--lr LR Learning rate
--dropout DROPOUT Dropout for the model
--weight_decay WEIGHT_DECAY
Weight decay for the optimiser
--epochs EPOCHS Number of training epochs
--run_tags RUN_TAGS What arguments should be used to create a run id, i.e. an output folder name
--tag TAG Any additional tag to be added to the run id
--test_frequency TEST_FREQUENCY
How often the model should be evaluated on the test set
--hidden_units HIDDEN_UNITS
Number of layers and hidden units for the model's MLP, to be specified as a string of comma-separated integers. For example, 512,512,512 will create 3 hidden layers, each with 512 hidden units
--s3d_init_folder S3D_INIT_FOLDER
Path to the S3D checkpoint. This will be downloaded automatically if you run setup.bash. Otherwise, you can download them from https://github.com/antoine77340/S3D_HowTo100M
--no_antonyms Whether you want to discard antonyms. See paper for more details
--fixed_d Runs the regression variant with fixed delta equal to 1. See paper for more details
--cls_variant Runs the classification variant. See paper for more details
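For example, a minimal invocation with only the positional arguments might look like this (all paths are placeholders):

```bash
python run.py ./air/train_df.csv ./air/test_df.csv ./air/antonyms.csv \
    /path/to/s3d_features /path/to/output
```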
We provide the following bash training scripts to run the variants of the method:

- scripts/train_cls.bash: classification variant, trains the model with Cross Entropy
- scripts/train_reg.bash: regression variant, trains the model with MSE, where delta is calculated by measuring verb-adverb textual relationships
- scripts/train_reg_fixed_d.bash: regression variant, trains the model with MSE and a fixed delta=1
- scripts/train_reg_no_ant.bash: like REG, but discards antonyms
See the paper for more details.
Setup
You can set up your environment with bash setup.bash. This will use conda to install the necessary dependencies, which you can find below and in the environment.yaml file:
name: air_adverbs_cvpr23
channels:
- defaults
dependencies:
- pytorch::pytorch
- pytorch::torchvision
- nvidia::cudatoolkit=11
- numpy
- pip
- scikit-learn
- tqdm
- pandas
- tensorboard
- pip:
- opencv-python
- yt-dlp
The setup script will also download AIR's S3D features and S3D's weights.
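If you prefer to create only the conda environment manually (without the downloads handled by setup.bash), the standard conda commands should work:

```bash
conda env create -f environment.yaml
conda activate air_adverbs_cvpr23
```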
Output files
The code will save the following files at the specified output location:
├── logs
│ └── events.out.tfevents.1678189879.your_pc.1005191.0
├── model_output
│ ├── best_test_acc_a.pth
│ ├── best_test_map_m.pth
│ └── best_test_map_w.pth
├── model_state
│ ├── best_test_acc_a.pth
│ ├── best_test_map_m.pth
│ └── best_test_map_w.pth
└── summary.csv
Details about each file:

- logs: contains Tensorboard log files
- model_output: contains the output of the model (as a PyTorch dictionary), with one file for the best version of the model according to each of the 3 main metrics
- model_state: contains the model parameters (as a PyTorch dictionary), with one file for the best version of the model according to each of the 3 main metrics
- summary.csv: CSV file containing training and testing information for each epoch:
  - train_loss: average training loss
  - test_loss: average test loss
  - test_map_w: mAP W, i.e. mean average precision with weighted averaging
  - test_map_m: mAP M, i.e. mean average precision with macro averaging
  - test_acc_a: Acc-A, i.e. binary accuracy of the GT adverb versus its antonym
  - test_no_act_gt_map_w: mAP W calculated discarding the GT verb action label
  - test_no_act_gt_map_m: mAP M calculated discarding the GT verb action label
  - test_no_act_gt_acc_a: Acc-A calculated discarding the GT verb action label
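As an example, a quick way to find the best epoch according to Acc-A (a minimal sketch; the output path is a placeholder and we assume one row per epoch):

```python
import pandas as pd

# Load the per-epoch training/testing summary (placeholder path)
summary = pd.read_csv("/path/to/output/summary.csv")

# Row with the highest binary antonym accuracy (Acc-A)
best_row = summary.loc[summary["test_acc_a"].idxmax()]
print(best_row[["test_acc_a", "test_map_w", "test_map_m"]])
```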
License
We release our code, annotations and features under the MIT license.
Other repositories
Some small bits of code come from other repositories; these bits are clearly marked in the code.
Pre-processing AIR
You don't need to download videos, extract frames or S3D features to use our code.
Nevertheless, we provide our pre-processing scripts in this repository. You can learn how to use these scripts below.
Download videos
You can download and trim the videos from YouTube using the Python script download_clips.py. Usage below:
usage: download_clips.py [-h] [--n_proc N_PROC] [--use_youtube_dl] train_df_path test_df_path output_path yt_cookies_path
Downloads and trims videos from YouTube using either yt-dlp or youtube-dl (which you should install yourself). It uses
multiprocessing to speed up downloading and trimming
positional arguments:
train_df_path Path to the train set's csv file
test_df_path Path to the test set's csv file
output_path Where you want to save the videos
yt_cookies_path Path to a txt file storing your YT cookies. This is needed to download some age-restricted videos. You can use https://github.com/hrdl-github/cookies-txt to store cookies in a text file
options:
-h, --help show this help message and exit
--n_proc N_PROC How many processes you want to run in parallel to download videos
--use_youtube_dl Use youtube-dl instead of the default yt-dlp (not recommended, as youtube-dl is now deprecated in favour of yt-dlp)
You can also refer to the bash script located at scripts/download_air_clips.bash.
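As an example direct invocation (paths are placeholders; the number of processes is illustrative):

```bash
python download_clips.py ./air/train_df.csv ./air/test_df.csv \
    /path/to/videos /path/to/cookies.txt --n_proc 8
```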
Extract RGB frames
To extract RGB frames from the videos (e.g. if you want to then extract features yourself) you can use the Python script extract_rgb_frames.py. Usage below:
usage: extract_rgb_frames.py [-h] [--frame_height FRAME_HEIGHT] [--quality QUALITY] [--ext EXT] [--cuda] train_df_path test_df_path videos_path output_path
Extract RGB frames from videos using ffmpeg. If you have an ffmpeg compiled with cuda specify --cuda. You might need to deactivate the conda environment to call the cuda-enabled ffmpeg
positional arguments:
train_df_path Path to the train set's csv file
test_df_path Path to the test set's csv file
videos_path Path to the videos
output_path Where you want to save the frames
options:
-h, --help show this help message and exit
--frame_height FRAME_HEIGHT
Height of the frame for resizing, in pixels
--quality QUALITY ffmpeg quality of the extracted image
--ext EXT Video file extension
--cuda Use ffmpeg with cuda hardware acceleration (requires an ffmpeg compiled with cuda). You might need to deactivate the conda environment to call the cuda-enabled ffmpeg
You can also refer to the bash script located at scripts/extract_air_rgb_frames.bash.
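As an example direct invocation (paths are placeholders; the frame height is illustrative):

```bash
python extract_rgb_frames.py ./air/train_df.csv ./air/test_df.csv \
    /path/to/videos /path/to/frames --frame_height 256
```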
Extract S3D features
In case you want to extract features yourself you can use the Python script extract_s3d_features.py. Usage below:
usage: extract_s3d_features.py [-h] [--stack_size STACK_SIZE] [--stride STRIDE] [--batch_size BATCH_SIZE] [--workers WORKERS] [--jpg_digits JPG_DIGITS] [--jpg_prefix JPG_PREFIX] [--rgb_height RGB_HEIGHT] [--crop_size CROP_SIZE] train_df_path test_df_path frames_path output_path s3d_init_folder
Extract S3D features
positional arguments:
train_df_path Path to the train set's csv file
test_df_path Path to the test set's csv file
frames_path Path to the extracted frames
output_path Where you want to save the features
s3d_init_folder Path to the S3D checkpoint folder. You can find the model weights at https://github.com/antoine77340/S3D_HowTo100M
options:
-h, --help show this help message and exit
--stack_size STACK_SIZE
Number of frames to be stacked as input to S3D
--stride STRIDE Stride in seconds
--batch_size BATCH_SIZE
Batch size to extract features in parallel
--workers WORKERS Number of workers for the data loader
--jpg_digits JPG_DIGITS
Number of digits in the JPG frames file name
--jpg_prefix JPG_PREFIX
Prefix of the JPG frames file name
--rgb_height RGB_HEIGHT
Height of the frame, in pixels
--crop_size CROP_SIZE
Size of the central crop to be fed to S3D, in pixels
You will need to download the S3D weights from the original repository.
You can also refer to the bash script located at scripts/extract_air_s3d_features.bash, which will download S3D's weights automatically.
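As an example direct invocation matching the settings used for the features we provide (paths are placeholders; stack size 16 and stride 1s are the values specified in the paper):

```bash
python extract_s3d_features.py ./air/train_df.csv ./air/test_df.csv \
    /path/to/frames /path/to/features /path/to/s3d_init \
    --stack_size 16 --stride 1
```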
Bash scripts
All bash scripts should be run from the repository's root, e.g. bash ./scripts/train_reg.bash.
Help
If you need help don't hesitate to open a GitHub issue!
Errata 09/05/2023
We found an issue with our experiments on AIR, ActivityNet Adverbs and MSR-VTT Adverbs. These datasets contain videos of variable duration, so we used zero-padding to pack the pre-extracted S3D video features into batches. The problem was that we were not masking the padded elements during attention, so the model was allowed to attend to the zero-padding. This caused the model to learn spurious correlations, and it also made results depend on the batch size and order during testing.
We updated the code on 09/05/2023, so experiments run before this date will suffer from this issue. Please pull the repository to update the code and retrain your models.
You can find the new results with attention masking here. We will also update our paper on arXiv with the new results.