
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Chuhan Zhang, Ankush Gupta, Andrew Zisserman
(ICCV 2023 | arxiv | bibtex)

We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this).

(Figure: architecture overview)
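
For intuition, below is a minimal PyTorch-style sketch of how an object-aware auxiliary objective of this kind can be combined with a standard video-text contrastive loss. All module and variable names (video_encoder, object_decoder, the box and noun heads, the loss weights) are illustrative assumptions rather than this repository's API; the exact formulation, including how predictions are matched to the noisy boxes, is given in the paper and in train.py.

import torch
import torch.nn.functional as F

def training_step(video_encoder, text_encoder, object_decoder,
                  frames, captions, hand_boxes, object_boxes, noun_labels,
                  w_box=1.0, w_cls=1.0):
    # Hypothetical sketch only: shapes and module interfaces are assumptions.
    # Spatio-temporal video tokens plus a pooled video embedding; caption embedding.
    vid_tokens, vid_emb = video_encoder(frames)          # (B, N, D), (B, D)
    txt_emb = text_encoder(captions)                     # (B, D)

    # Standard InfoNCE-style video-text contrastive loss.
    logits = vid_emb @ txt_emb.t() / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_nce = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Object-aware decoder: learnable queries attend to the video tokens and
    # predict hand boxes, object boxes, and noun labels as auxiliary outputs.
    pred_hand, pred_obj, pred_noun_logits = object_decoder(vid_tokens)

    # Auxiliary supervision from (noisy) boxes and nouns parsed from the captions.
    loss_box = F.l1_loss(pred_hand, hand_boxes) + F.l1_loss(pred_obj, object_boxes)
    loss_cls = F.cross_entropy(pred_noun_logits.flatten(0, 1), noun_labels.flatten())

    return loss_nce + w_box * loss_box + w_cls * loss_cls

The auxiliary heads are only supervised during training; at inference the decoder still produces boxes from RGB frames alone, which is what the visualization section below relies on.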

💾 Data Preparation

🏋️ Pretraining

python train.py --backbone LaviLa --num_queries 12 --name_prefix /path/to/output/ --data_dir /path/to/ego4d/videos --meta_dir ./data/EgoClip

🔑 Zero-shot evaluation

Zero-shot on EgoMCQ

python test_EgoMCQ.py    --data_dir ./data/EgoClip --meta_dir /path/to/meta/dir --num_queries 12 --num_frames 16  --resume /path/to/model_weights --lavila_weights_path  /path/to/lavila_weights

Zero-shot on Epic-Kitchens MIR

python test_epic.py  --data_dir ./data/epic_kitchens --meta_dir /path/to/meta/dir --num_queries 12 --num_frames 16  --resume /path/to/model --lavila_weights_path  /path/to/lavila_weights

Zero-shot on EGTEA

python test_epic.py  --data_dir ./data/egtea --meta_dir /path/to/meta/dir --num_queries 12 --num_frames 16  --resume /path/to/model --lavila_weights_path  /path/to/lavila_weights

🛠 Finetuning on Episodic Memory

For finetuning on EgoNLQ/EgoMQ (episodic memory), we first extract features from the pretrained model using the code here, and then run finetuning with the code here.
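
As a rough illustration of the feature-extraction step, the sketch below slides a fixed-length window over a long video, encodes each clip with the frozen pretrained model, and saves the stacked features for the downstream episodic-memory head. The encode_video call, clip length, and output layout are assumptions for illustration, not the interface of the feature-extraction code linked above.

import torch

@torch.no_grad()
def extract_clip_features(model, video_frames, clip_len=16, stride=16):
    # Hypothetical sketch: video_frames is a (T, C, H, W) tensor of decoded frames.
    model.eval()
    feats = []
    for start in range(0, video_frames.shape[0] - clip_len + 1, stride):
        clip = video_frames[start:start + clip_len]           # (clip_len, C, H, W)
        feats.append(model.encode_video(clip.unsqueeze(0)))   # assumed API, returns (1, D)
    return torch.cat(feats, dim=0)                            # (num_clips, D)

# features = extract_clip_features(model, frames)
# torch.save(features, "video_uid.pt")   # consumed by the EgoNLQ/EgoMQ finetuning code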

👁 Visualization of Grounding Results

*Please note that the objective of the paper is vision-language understanding; grounding was only trained as an auxiliary task with noisy boxes as supervision, so the model may not predict perfect grounding results.

We extracted hand and object boxes (download) on EgoClip using our model with 4 object queries (model_weights). For each clip, the file stores the video information and its boxes in the following format:

Dict{
  video_uid: Str                                                             # the uid of video in egoclip
  start_sec: Float                                                           # the start timestamp of the clip
  end_sec: Float                                                             # the end timestamp of the clip
  samples_sec: List[Float]                                                   # a list of four timestamps where the frames are sampled from
  object_boxes: Dict{object1_name: [object1_box_t1, object1_box_t2, ...], ...}  # nouns in the narrations and their corresponding trajectories
  hand_boxes: Dict{hand1: [hand1_box_t1, hand1_box_t2, ...], hand2: ...}     # trajectories of the left and right hands
}
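
For reference, here is a small sketch of reading the released annotation file and iterating over the per-frame boxes. It assumes the download is a pickled list of dictionaries with the structure above and that each box is stored as [x1, y1, x2, y2]; adjust the loading call if the file uses a different serialization.

import pickle

# Assumption: the downloaded file is a pickled list of clip dictionaries.
with open("/path/to/boxes", "rb") as f:
    annotations = pickle.load(f)

clip = annotations[0]
print(clip["video_uid"], clip["start_sec"], clip["end_sec"])

# Each object / hand entry holds one box per sampled timestamp.
for t, sec in enumerate(clip["samples_sec"]):
    for name, traj in clip["object_boxes"].items():
        print(f"t={sec:.2f}s  {name}: {traj[t]}")   # assumed [x1, y1, x2, y2]
    for hand, traj in clip["hand_boxes"].items():
        print(f"t={sec:.2f}s  {hand}: {traj[t]}")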

To visualize the boxes on RGB frames, please download the boxes, place them at /path/to/boxes, and run the following command:

cd demo
python visualize_box.py --video_dir /path/to/egoclip/videos --anno_file /path/to/boxes
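
If you prefer to overlay the boxes yourself instead of using the demo script, a minimal OpenCV sketch is below. It assumes the boxes are normalized [x1, y1, x2, y2] coordinates and that you have already decoded the frame sampled at the corresponding timestamp; check demo/visualize_box.py for the repository's actual conventions.

import cv2

def draw_box(frame, box, label, color=(0, 255, 0)):
    # Assumption: box is a normalized [x1, y1, x2, y2] in the frame's coordinate system.
    h, w = frame.shape[:2]
    x1, y1 = int(box[0] * w), int(box[1] * h)
    x2, y2 = int(box[2] * w), int(box[3] * h)
    cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, label, (x1, max(y1 - 5, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
    return frame

# frame = ...  # BGR frame read (e.g. with cv2) at one of clip["samples_sec"]
# frame = draw_box(frame, clip["object_boxes"]["knife"][0], "knife")   # "knife" is an illustrative key
# cv2.imwrite("vis.jpg", frame)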

🙏 Acknowledgements

This code is based on EgoVLP and LaViLa. If you use this code, please consider citing them as well.

📰 Citation

@inproceedings{zhanghelpinghand,
  title={Helping Hands: An Object-Aware Ego-Centric Video Recognition Model},
  author={Chuhan Zhang and Ankush Gupta and Andrew Zisserman},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2023}
}

📩 Contact

If you have any questions, please contact czhang@robots.ox.ac.uk.