# Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition (accepted by ICCV 2023)
This repository holds the PyTorch implementation of "Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition" by Qitong Wang, Long Zhao, Liangzhe Yuan, Ting Liu, Xi Peng.
## Overview of our SUM-L
<div align="center"> <img src="overview.png" width="1000px"/> </div>Requirements
pip install -r requirements.txt
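Optionally, a minimal sketch for installing into an isolated environment (assuming Python 3 with the built-in `venv` module; the environment name is arbitrary):

```bash
# Create and activate a fresh virtual environment, then install the pinned dependencies.
python3 -m venv sum-l-env
source sum-l-env/bin/activate
pip install -r requirements.txt
```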
## Datasets

### Epic-Kitchen-55
- Download the RGB frames from EPIC Kitchens 55.
- Download the frame lists from the following links: train, val, trainval (note that the train/val split follows temporal-binding-network).
Please set `DATA.PATH_TO_DATA_DIR` to point to the folder containing the frame lists, and `DATA.PATH_PREFIX` to the folder containing the RGB frames. For example, we set the symlinks as follows (a command-line alternative is sketched right after them):
```
mkdir -p data/epic-55/split
ln -s /path/to/epic-kitchen-55/rgb_extracted/train data/epic-55/train_rgb_frames
ln -s /path/to/ego-exo/dataset_split_files/epic_55_split/ data/epic-55/split
```
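Since this codebase builds on PySlowFast-style configs, the same two keys can usually also be passed as command-line overrides instead of relying on symlinks. A minimal sketch, assuming your config accepts such overrides; the config path and data locations below are placeholders, not files shipped with this repo:

```bash
# KEY VALUE pairs placed after --cfg are merged into the config at launch time
# (standard PySlowFast behavior), so the data paths can be set without symlinks.
python tools/run_net.py \
  --cfg configs/<your_epic55_config>.yaml \
  DATA.PATH_TO_DATA_DIR data/epic-55/split \
  DATA.PATH_PREFIX data/epic-55/train_rgb_frames
```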
Note: in order to reproduce our results on Epic-Kitchen-100, you also have to download the Epic-Kitchen-55 dataset, since EPIC-Kitchens-100 extends EPIC-Kitchens-55 to 100 hours.
### Epic-Kitchen-100
- Download the RGB frames and annotations from EPIC Kitchens 100.
Please set `DATA.PATH_TO_DATA_DIR` to point to the folder containing the frame lists, and `DATA.PATH_PREFIX` to the folder containing the RGB frames. For example, we set the symlinks as follows:
```
mkdir -p data/epic-100/
ln -s /path/to/EPIC-KITCHENS-100 data/epic-100/dataset
ln -s /path/to/epic-kitchens-100-annotations data/epic-100/annotations
```
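A quick sanity check that the symlinks resolve to real data (purely illustrative; adjust the targets to wherever your copies actually live):

```bash
# Both listings should show dataset contents rather than "No such file or directory".
ls -L data/epic-100/dataset | head
ls -L data/epic-100/annotations | head
```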
## Testing our trained weights on Epic-Kitchen-100

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 23393 tools/run_net.py --cfg configs/epic-kitchen-100/Ego_Exo_SLOWFAST_8x8_R101.yaml TRAIN.CHECKPOINT_FILE_PATH PATH_TO_WEIGHTS/EE_EP100_SF101_checkpoint_epoch_00030.pyth
```
Note: you might notice some "extra" parameters when loading our model. These come from the third-person video backbone and the networks of our proposed method; during the testing phase, only the parameters of the first-person video backbone are needed.
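If you only have one GPU, the same evaluation can usually be run without the distributed launcher. A hedged sketch, assuming the standard PySlowFast `NUM_GPUS` and `TEST.BATCH_SIZE` keys can be overridden from the command line for this config:

```bash
# Single-GPU variant of the command above; lower TEST.BATCH_SIZE further if you run out of memory.
CUDA_VISIBLE_DEVICES=0 python tools/run_net.py \
  --cfg configs/epic-kitchen-100/Ego_Exo_SLOWFAST_8x8_R101.yaml \
  NUM_GPUS 1 TEST.BATCH_SIZE 8 \
  TRAIN.CHECKPOINT_FILE_PATH PATH_TO_WEIGHTS/EE_EP100_SF101_checkpoint_epoch_00030.pyth
```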
| Method | verb-top1 | verb-top5 | noun-top1 | noun-top5 | Weights |
|---|---|---|---|---|---|
| EE-SlowFast_R101 | 67.0 | 90.7 | 53.4 | 76.9 | Google Drive |
## Codebase References

- https://github.com/facebookresearch/Ego-Exo/tree/main
- https://github.com/facebookresearch/SlowFast
More code and pre-trained weights will be released soon. Please stay tuned. :)
If you find our code or paper useful in your research, please consider citing:
```
@InProceedings{Wang_2023_ICCV,
    author    = {Wang, Qitong and Zhao, Long and Yuan, Liangzhe and Liu, Ting and Peng, Xi},
    title     = {Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {3307-3317}
}
```