Affordance Grounding from Demonstration Video to Target Image

This repository is the official implementation of Affordance Grounding from Demonstration Video to Target Image (CVPR 2023). If you find this work useful, please cite:

@inproceedings{afformer,
  author  = {Joya Chen and Difei Gao and Kevin Qinghong Lin and Mike Zheng Shou},
  title   = {Affordance Grounding from Demonstration Video to Target Image},
  booktitle = {CVPR},
  year    = {2023},
}

Install

1. PyTorch

We now support PyTorch 2.0; other versions should also work.

conda install -y pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia

NOTE: PyTorch 2.0 requires CUDA >= 11.7. See https://pytorch.org/ for the install command matching your CUDA version.
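
After installing, a quick sanity check (a minimal sketch, not repo-specific) confirms that PyTorch sees your GPU and the expected CUDA build:

import torch

print(torch.__version__)          # e.g. 2.0.1
print(torch.version.cuda)         # e.g. 11.8
print(torch.cuda.is_available())  # should be True for GPU training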

2. PyTorch Lightning

We use PyTorch Lightning 2.0 as the training and inference engine.

pip install lightning jsonargparse[signatures] --upgrade

3. xFormers

We use memory-efficient attention from xFormers. PyTorch 2.0's built-in scaled dot-product attention does not yet support memory-efficient attention with relative positional encoding (see pytorch/issues/96099); we will update this repo once it does.

pip install triton --upgrade
pip install --pre xformers
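
For reference, here is a minimal sketch of how memory-efficient attention is typically called; the shapes are illustrative, not this repo's actual attention module:

import torch
from xformers.ops import memory_efficient_attention

# xFormers expects (batch, seq_len, heads, head_dim).
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# attn_bias is where a relative positional bias would go; this is the
# part PyTorch 2.0's built-in kernels do not support yet.
out = memory_efficient_attention(q, k, v, attn_bias=None)
print(out.shape)  # torch.Size([2, 1024, 8, 64])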

4. Timm, Detectron2, Others

We borrow some implementations from timm and detectron2.

pip install timm opencv-python av imageio --upgrade
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
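
As a quick check that both libraries imported correctly, and to show the kind of multi-scale features an FPN-style encoder consumes, here is a small sketch using the plain timm API (not this repo's model code):

import timm
import torch

# features_only returns a pyramid of feature maps, the usual FPN input.
# pretrained=False avoids a weight download for this quick check.
backbone = timm.create_model("resnet50", features_only=True, pretrained=False)
feats = backbone(torch.randn(1, 3, 224, 224))
for f in feats:
    print(f.shape)  # feature maps at strides 2, 4, 8, 16, 32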

Dataset

datasets
└── opra
    ├── annotations
    │   ├── test.json
    │   ├── train.json
    ├── clips
    │   ├── aocom
    │   ├── appliances
    │   ├── bestkitchenreview
    │   ├── cooking
    │   ├── eguru
    │   └── seattle
    └── images
        ├── aocom
        ├── appliances
        ├── bestkitchenreview
        ├── cooking
        ├── eguru
        └── seattle
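
After arranging the files as above, a minimal sanity check on the annotations (the exact JSON schema is whatever ships with OPRA; this sketch only inspects it):

import json

with open("datasets/opra/annotations/train.json") as f:
    train = json.load(f)

# Look at the top-level structure before writing any custom loaders.
print(type(train), len(train))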

Afformer Model

Hint: if you are using LightningCLI for the first time, we recommend reading its documentation; it explains the fit/validate subcommands and the --config and --trainer.* overrides used in the commands below.
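
For orientation, a minimal LightningCLI program looks like this sketch; BoringModel and BoringDataModule are Lightning's demo classes, standing in for this repo's actual model and datamodule:

from lightning.pytorch.cli import LightningCLI
from lightning.pytorch.demos.boring_classes import BoringDataModule, BoringModel

def main():
    # Exposes fit/validate/test subcommands plus --config/--trainer.* flags.
    LightningCLI(BoringModel, BoringDataModule)

if __name__ == "__main__":
    main()
    # e.g. python this_sketch.py fit --trainer.devices 8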

1. ResNet-50-FPN encoder

python main.py fit --config configs/opra/r50fpn.yaml --trainer.devices 8 --data.batch_size_per_gpu 2
tensorboard --logdir outputs/ --port 2333
# Then you can watch real-time losses and metrics at http://localhost:2333/
python main.py validate --config configs/opra/r50fpn.yaml --trainer.devices 8 --data.batch_size_per_gpu 2 --ckpt outputs/opra/r50fpn/lightning_logs/version_0/checkpoints/xxxx.ckpt

2. ViTDet-B encoder

python main.py fit --config configs/opra/vitdet_b.yaml --trainer.devices 8 --data.batch_size_per_gpu 2
tensorboard --logdir outputs/ --port 2333
# Then you can watch real-time losses and metrics at http://localhost:2333/
python main.py validate --config configs/opra/vitdet_b.yaml --trainer.devices 8 --data.batch_size_per_gpu 2 --ckpt outputs/opra/vitdet_b/lightning_logs/version_0/checkpoints/xxxx.ckpt

3. Visualization

python demo.py --config configs/opra/vitdet_b.yaml --weight weights/afformer_vitdet_b_v1.ckpt --video demo/video.mp4 --image demo/image.jpg --output demo/output.gif
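
demo.py overlays the predicted affordance heatmap on the target image and writes a GIF. Here is a hedged sketch of that overlay step with plain OpenCV and imageio (the real demo code may differ):

import cv2
import imageio.v2 as imageio
import numpy as np

def overlay(image_bgr, heatmap):
    # heatmap in [0, 1] with the same H x W as the image.
    colored = cv2.applyColorMap((heatmap * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(image_bgr, 0.5, colored, 0.5, 0)

img = cv2.imread("demo/image.jpg")
heat = np.random.rand(img.shape[0], img.shape[1])  # placeholder prediction
frames = [cv2.cvtColor(img, cv2.COLOR_BGR2RGB),
          cv2.cvtColor(overlay(img, heat), cv2.COLOR_BGR2RGB)]
imageio.mimsave("demo/output.gif", frames, duration=0.5)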

MaskAHand Pre-training

NOTE: a detailed tutorial for this part is coming soon.

1. Hand Interaction Detection

2. Hand Interaction Clip Mining

3. Target Image Synthesis and Transformation

These are applied on the fly during training. You can set the hyper-parameters in configs/opra/maskahand/pretrain.yaml; a sketch of the synthesis step follows the config below:

mask_ratio: 1.0
num_masks: 2
distortion_scale: 0.5
num_frames: 32
clip_interval: 16
contact_threshold: 0.99
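
To make these hyper-parameters concrete, here is a hedged sketch of what hand masking plus perspective transformation could look like with plain torchvision (synthesize_target and its box handling are illustrative; the actual synthesis lives in the training code):

import torch
import torchvision.transforms as T

def synthesize_target(frame, hand_boxes, mask_ratio=1.0, num_masks=2,
                      distortion_scale=0.5):
    # frame: (3, H, W) float tensor; hand_boxes: list of (x1, y1, x2, y2).
    out = frame.clone()
    for x1, y1, x2, y2 in hand_boxes[:num_masks]:
        # mask_ratio scales how much of each hand region is erased.
        h, w = int((y2 - y1) * mask_ratio), int((x2 - x1) * mask_ratio)
        out[:, y1:y1 + h, x1:x1 + w] = 0.0
    # A random perspective warp simulates the viewpoint gap between
    # the demonstration frame and the target image.
    return T.RandomPerspective(distortion_scale=distortion_scale, p=1.0)(out)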

4. MaskAHand Pre-training

python main.py fit --config configs/opra/maskahand/pretrain.yaml

5. Fine-tuning or Zero-shot Evaluation

# fine-tuning on OPRA after MaskAHand pre-training
python main.py fit --config configs/opra/maskahand/finetune.yaml
# zero-shot evaluation of the pre-trained model
python main.py validate --config configs/opra/maskahand/pretrain.yaml

6. Visualization

You can adapt demo.py to visualize your model's results.

Contact

This repository is developed by Joya Chen. Questions and discussions are welcome via joyachen@u.nus.edu.

Acknowledgement

Thanks to all co-authors of the paper: Difei Gao, Kevin Qinghong Lin, and Mike Zheng Shou (my supervisor). We also appreciate the assistance of Dongxing Mao and Jiawei Liu.